Token Factory: How We Built a 98% Cheaper OpenAI Alternative

OpenAI-compatible, $0.10/M real-time, $0.05/M batch, LoRA fine-tuning from $5. Here's exactly how Token Factory works under the hood.

packet.ai Team
January 29, 2025

Token Factory is packet.ai's managed LLM inference API: OpenAI-compatible, $0.10/M tokens real-time and $0.05/M batch, with LoRA fine-tuning and no infrastructure to manage.

Key takeaways

  • Real-time inference: $0.10/M tokens. Batch 1h SLA: $0.07/M. Batch 24h SLA: $0.05/M
  • Drop-in OpenAI replacement: change one line (base_url), keep existing SDK, streaming, and tool calling
  • Batch processing endpoint accepts JSONL files: 30–50% cheaper than real-time for async workloads, depending on the SLA
  • LoRA fine-tuning: 50–500 examples, 10–60 minutes, $5–50 per run, deploy same-day via extra_body
  • Cheaper because it combines open-source models, vLLM continuous batching, and packet.ai’s optimised Blackwell infrastructure

This post covers the technical implementation of Token Factory: the API endpoints, batch processing workflow, LoRA fine-tuning API, and why the pricing works. For the background on why we built it, see Why We Built Token Factory.

Drop-in replacement: using Token Factory with the OpenAI SDK

Token Factory implements the OpenAI Chat Completions API spec. The migration is a single line:

from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="your-packet-api-key"
)

# Streaming
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain KV cache"}],
    stream=True
)
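for chunk in stream:
    # Some chunks (for example a trailing usage chunk) can arrive with no choices
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)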

# With LangChain
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="meta-llama/Llama-3.3-70B-Instruct",
    openai_api_base="https://dash.packet.ai/api/v1",
    openai_api_key="your-packet-api-key"
)

Tool calling, JSON mode, and structured outputs are supported on all models. Response format is identical to OpenAI — your existing parsing code works unchanged.
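
Since tool calling follows the OpenAI format, the usual two-step loop should carry over. The following is a minimal sketch, not Token Factory's documented example: `word_count` is a hypothetical tool invented for illustration, and `client` is assumed to be the OpenAI client configured above.

```python
import json

# Tool schema in the OpenAI "tools" format. word_count is a hypothetical
# tool used only for this illustration.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "word_count",
        "description": "Count the words in a piece of text.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}]


def run_tool_call(name, arguments_json):
    """Execute a tool locally and return its result as a JSON string."""
    args = json.loads(arguments_json)
    if name == "word_count":
        return json.dumps({"count": len(args["text"].split())})
    raise ValueError(f"unknown tool: {name}")


def ask_with_tools(client, prompt, model="meta-llama/Llama-3.3-70B-Instruct"):
    """Send a prompt, run any requested tools, then ask for the final answer."""
    messages = [{"role": "user", "content": prompt}]
    first = client.chat.completions.create(
        model=model, messages=messages, tools=TOOLS
    )
    message = first.choices[0].message
    if not message.tool_calls:
        return message.content  # the model answered directly

    messages.append(message)  # keep the assistant's tool request in history
    for call in message.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool_call(call.function.name, call.function.arguments),
        })
    final = client.chat.completions.create(
        model=model, messages=messages, tools=TOOLS
    )
    return final.choices[0].message.content
```

Calling `ask_with_tools(client, "How many words are in 'the quick brown fox'?")` would exercise the full round trip. Whether a given model chooses to call the tool depends on the model and the prompt.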

Batch processing: up to 50% cheaper for async workloads

For offline workloads such as dataset labelling, document summarisation, evaluation pipelines, and nightly processing jobs, batch mode cuts the price by 30% (1h SLA) or 50% (24h SLA) versus real-time.

Mode        Price/M tokens   SLA        Best for
Real-time   $0.10            <2s        User-facing APIs, chat
Batch 1h    $0.07            1 hour     Near-real-time pipelines
Batch 24h   $0.05            24 hours   Offline processing, evals
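
To make the trade-off concrete, here is a small cost estimator built from the prices in the table. It is a sketch under assumptions: the mode keys (`realtime`, `batch_1h`, `batch_24h`) are labels chosen for this example, not API values, and it treats all tokens as billed at one flat rate.

```python
from decimal import Decimal

# Prices per million tokens, taken from the table above.
# The dictionary keys are illustrative labels, not API parameters.
PRICE_PER_MILLION_USD = {
    "realtime": Decimal("0.10"),
    "batch_1h": Decimal("0.07"),
    "batch_24h": Decimal("0.05"),
}


def estimate_cost_usd(tokens, mode):
    """Estimated cost in USD for a token count in the given mode."""
    return PRICE_PER_MILLION_USD[mode] * Decimal(tokens) / Decimal(1_000_000)


def savings_vs_realtime(mode):
    """Fractional saving of a mode relative to the real-time price."""
    realtime = PRICE_PER_MILLION_USD["realtime"]
    return (realtime - PRICE_PER_MILLION_USD[mode]) / realtime


# A 250M-token nightly job: $25.00 real-time versus $12.50 on the 24h SLA.
nightly_tokens = 250_000_000
print(estimate_cost_usd(nightly_tokens, "realtime"))
print(estimate_cost_usd(nightly_tokens, "batch_24h"))
```

`Decimal` is used instead of floats so that small per-token prices do not pick up rounding noise when summed over many jobs.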

Batch workflow: prepare a JSONL file, POST it, poll for completion, download results:

# 1. Prepare requests.jsonl
# {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "meta-llama/Llama-3.3-70B-Instruct",
#           "messages": [{"role": "user", "content": "Classify: {text}"}]}}

# 2. Submit batch
curl -X POST https://dash.packet.ai/api/v1/batch \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@requests.jsonl" \
  -F "sla=24h"

# 3. Poll status
curl https://dash.packet.ai/api/v1/batch/{batch_id} \
  -H "Authorization: Bearer YOUR_API_KEY"

# 4. Download results when complete
curl https://dash.packet.ai/api/v1/batch/{batch_id}/results \
  -H "Authorization: Bearer YOUR_API_KEY" -o results.jsonl
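
Steps 1 and 3 are the parts most often scripted, so here is a hedged Python sketch of both. The request layout mirrors the JSONL example above. The status strings (`completed`, `failed`) and the `fetch_status` callable are assumptions made for illustration, since the response schema of the status endpoint is not shown here.

```python
import json
import time

# Assumed terminal states; check the real status payload before relying on them.
TERMINAL_STATUSES = ("completed", "failed")


def build_batch_lines(texts, model="meta-llama/Llama-3.3-70B-Instruct"):
    """Return one JSONL line per input text, in the layout shown above."""
    lines = []
    for i, text in enumerate(texts, start=1):
        request = {
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": f"Classify: {text}"}],
            },
        }
        lines.append(json.dumps(request, ensure_ascii=False))
    return lines


def write_batch_file(texts, path="requests.jsonl"):
    """Write the requests to disk, ready for the curl upload in step 2."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(build_batch_lines(texts)) + "\n")


def wait_for_batch(fetch_status, poll_seconds=30.0,
                   timeout_seconds=24 * 3600.0, sleep=time.sleep):
    """Poll until the job reaches a terminal state or the timeout passes.

    fetch_status is any zero-argument callable that returns the current
    status string, for example a thin wrapper around the step 3 request.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"batch still '{status}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing the status lookup in as a callable keeps the polling loop independent of any HTTP library and makes it easy to test with canned statuses.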

LoRA fine-tuning: custom adapters in under an hour

Token Factory supports LoRA adapter training directly via the API. No GPU provisioning, no infrastructure setup. Supply 50–500 high-quality examples, pick a base model, and get a deployed adapter in 10–60 minutes.

Typical cost: $5–50 per run depending on dataset size and model. The adapter is stored on packet.ai and served at inference time via a single extra_body parameter.

# Train a LoRA adapter
curl -X POST https://dash.packet.ai/api/dashboard/token-factory/lora \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "support-classifier-v1",
    "base_model": "meta-llama/Llama-3.3-70B-Instruct",
    "training_data": "s3://your-bucket/train.jsonl",
    "epochs": 3,
    "rank": 16
  }'

# Use the adapter at inference time (Python, reusing the client from the first example)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Classify this ticket..."}],
    extra_body={"lora_adapter": "lora_support_classifier_v1_abc123"}
)
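
Before submitting a job, it can help to sanity-check the dataset and build the payload in code. The sketch below assumes the training file is JSONL with one example per line; the per-example schema is not specified here, so it only checks that each line parses as JSON. All function names are hypothetical helpers, and the payload fields are copied from the curl example above.

```python
import json


def count_examples(jsonl_text):
    """Count non-empty lines, failing fast on any line that is not valid JSON."""
    count = 0
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON: {exc}") from exc
        count += 1
    return count


def within_recommended_size(num_examples, low=50, high=500):
    """True if the dataset falls in the 50-500 example range quoted above."""
    return low <= num_examples <= high


def build_lora_job(name, base_model, training_data_url, epochs=3, rank=16):
    """Build the JSON body for the training request shown in the curl call."""
    return {
        "name": name,
        "base_model": base_model,
        "training_data": training_data_url,
        "epochs": epochs,
        "rank": rank,
    }
```

The resulting dictionary can be serialised with `json.dumps` and sent as the `-d` body of the training request.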

Why Token Factory costs 98% less than GPT-4o

Three compounding reasons:

1. Open-source models with no margin stack

Llama 3.3 70B, Qwen 2.5 72B, and DeepSeek R1 are free to deploy. OpenAI's pricing includes model development cost recovery, proprietary API infrastructure margins, and enterprise support. None of those apply here.

2. vLLM continuous batching

vLLM's PagedAttention and continuous batching keep GPU utilisation high by dynamically grouping requests. This means the cost per token is spread across many concurrent requests rather than charged at dedicated-GPU rates.

3. hosted·ai’s 5× utilisation advantage

packet.ai runs on hosted·ai’s GPU pooling infrastructure, achieving up to 5× better hardware utilisation than static allocation. That efficiency flows through directly to lower per-token costs.
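
The effect of continuous batching in the second point can be illustrated with a toy model. This is a deliberately simplified sketch, not vLLM's scheduler: it counts decode steps only, assumes one token per request per step, and ignores prefill and KV-cache memory limits.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request."""
    queue = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Refill any free slots before the next decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every running request emits one token; drop the finished ones.
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


# Two long generations mixed with six short ones, four slots on the GPU.
output_lengths = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_batching_steps(output_lengths, batch_size=4))      # 16
print(continuous_batching_steps(output_lengths, batch_size=4))  # 9
```

In this example the same 22 tokens take 16 steps with static batches but 9 with continuous refilling, because short requests no longer hold a slot idle while a long one finishes. Fewer idle slots per token is what lets the per-token price drop.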

Frequently asked questions

Is Token Factory compatible with the OpenAI API?
Yes. Token Factory supports streaming (server-sent events), tool calling, JSON mode, and structured outputs. The response format is identical to OpenAI's, so existing code that parses OpenAI responses works unchanged with Token Factory.

How does batch processing work?
Prepare a JSONL file of requests, POST it to the batch endpoint with your chosen SLA (1h at $0.07/M or 24h at $0.05/M), poll for completion, then download the results JSONL. Batch processing is 30–50% cheaper than real-time because jobs are queued and processed during off-peak capacity windows.

How does LoRA fine-tuning work?
POST a training job via the API with your base model, training data URL, and LoRA rank. Training takes 10–60 minutes depending on dataset size (50–500 examples) and costs $5–50 per run. The resulting adapter is deployed automatically and can be called at inference time via extra_body={"lora_adapter": "adapter_id"}.

What context window do the models support?
Context window varies by model. Llama 3.3 70B and Qwen 2.5 72B support up to 128K context tokens. DeepSeek R1 Distill 32B supports 32K. Mistral Large 123B supports 128K. All models can be used with the full context window at the standard per-token price.

Does Token Factory work with LangChain and LlamaIndex?
Yes. Use ChatOpenAI from langchain_openai with openai_api_base set to https://dash.packet.ai/api/v1. For LlamaIndex, use the OpenAI class with an api_base override. Both work out of the box because Token Factory implements the OpenAI API spec.

Last reviewed: 10 June 2026.
