Does Token Factory support streaming and tool calling?

Yes. Token Factory supports streaming, tool calling, JSON mode, and structured outputs. Response format is identical to OpenAI.

How does Token Factory batch processing work?

POST a JSONL file to the batch endpoint with a 1h ($0.07/M tokens) or 24h ($0.05/M tokens) SLA, poll for completion, then download the results as JSONL.

How does LoRA fine-tuning work on Token Factory?

POST a training job with a base model and training data. Training takes 10-60 minutes and costs $5-50 per run. The resulting adapter is selected at inference time via the lora_adapter parameter in extra_body.

Can I use Token Factory with LangChain?

Yes. Use ChatOpenAI from langchain_openai with openai_api_base set to https://dash.packet.ai/api/v1. Works out of the box.


Token Factory: How We Built a 98% Cheaper OpenAI Alternative

OpenAI-compatible, $0.10/M real-time, $0.05/M batch, LoRA fine-tuning from $5. Here's exactly how Token Factory works under the hood.

packet.ai Team

January 29, 2025

Token Factory is packet.ai's managed LLM inference API: OpenAI-compatible, $0.10/M tokens real-time and $0.05/M batch, with LoRA fine-tuning and no infrastructure to manage.

Key takeaways

Real-time inference: $0.10/M tokens. Batch 1h SLA: $0.07/M. Batch 24h SLA: $0.05/M
Drop-in OpenAI replacement: change one line (base_url), keep existing SDK, streaming, and tool calling
Batch processing endpoint accepts JSONL files: up to 50% cheaper for async workloads
LoRA fine-tuning: 50–500 examples, 10–60 minutes, $5–50 per run, deploy same-day via extra_body
Cheaper because of open-source models, vLLM continuous batching, and packet.ai’s optimised Blackwell infrastructure

This post covers the technical implementation of Token Factory: the API endpoints, batch processing workflow, LoRA fine-tuning API, and why the pricing works. For the background on why we built it, see Why We Built Token Factory.

Drop-in replacement: using Token Factory with the OpenAI SDK

Token Factory implements the OpenAI Chat Completions API spec. Migrating comes down to changing base_url, along with supplying a packet.ai API key and the name of a hosted model:

from openai import OpenAI

client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="your-packet-api-key"
)

# Streaming
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain KV cache"}],
    stream=True
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

# With LangChain
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="meta-llama/Llama-3.3-70B-Instruct",
    openai_api_base="https://dash.packet.ai/api/v1",
    openai_api_key="your-packet-api-key"
)

Tool calling, JSON mode, and structured outputs are supported on all models. Response format is identical to OpenAI — your existing parsing code works unchanged.
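As a rough illustration of the tool-calling support described above, the sketch below declares one tool in the OpenAI `tools` schema and routes a returned tool call to a local function. The `get_gpu_price` tool, its price table, and the `answer_with_tools` helper are hypothetical names made up for this example. `client` is the OpenAI client configured earlier, and whether a model actually requests the tool depends on the prompt.

```python
import json

# Hypothetical tool, described in the OpenAI "tools" JSON-schema format.
GPU_PRICE_TOOL = {
    "type": "function",
    "function": {
        "name": "get_gpu_price",
        "description": "Look up the hourly price of a GPU type in USD.",
        "parameters": {
            "type": "object",
            "properties": {"gpu": {"type": "string"}},
            "required": ["gpu"],
        },
    },
}

# Made-up numbers, for illustration only.
_PRICES = {"example-gpu-a": 1.0, "example-gpu-b": 2.5}


def get_gpu_price(gpu: str) -> dict:
    """Local implementation backing the hypothetical tool."""
    return {"gpu": gpu, "usd_per_hour": _PRICES.get(gpu)}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch one tool call; arguments arrive as a JSON string."""
    if name != "get_gpu_price":
        raise ValueError(f"unknown tool: {name}")
    return json.dumps(get_gpu_price(**json.loads(arguments_json)))


def answer_with_tools(client, question: str,
                      model: str = "meta-llama/Llama-3.3-70B-Instruct") -> str:
    """Ask a question, run any requested tool calls, return the final text."""
    messages = [{"role": "user", "content": question}]
    first = client.chat.completions.create(
        model=model, messages=messages, tools=[GPU_PRICE_TOOL]
    )
    reply = first.choices[0].message
    if not reply.tool_calls:
        return reply.content  # the model answered directly
    messages.append(reply)  # keep the assistant turn that requested the tools
    for call in reply.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool_call(call.function.name, call.function.arguments),
        })
    final = client.chat.completions.create(model=model, messages=messages)
    return final.choices[0].message.content
```

Calling `answer_with_tools(client, "How much is example-gpu-a per hour?")` would make one request, execute any tool calls locally, and make a second request carrying the tool results.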

Batch processing: 50% cheaper for async workloads

For offline workloads such as dataset labelling, document summarisation, evaluation pipelines, and nightly processing jobs, batch mode cuts the price versus real-time by 30% with the 1h SLA and by 50% with the 24h SLA.

Mode	Price/M tokens	SLA	Best for
Real-time	$0.10	<2s	User-facing APIs, chat
Batch 1h	$0.07	1 hour	Near-real-time pipelines
Batch 24h	$0.05	24 hours	Offline processing, evals
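To make the table concrete, here is a small cost calculator that uses only the per-million-token prices listed above. The 200M-token nightly job is a hypothetical workload chosen for the example: at these prices it works out to $20 in real-time mode, $14 with the 1h SLA, and $10 with the 24h SLA.

```python
# Prices in USD per million tokens, taken from the table above.
PRICE_PER_M = {"real-time": 0.10, "batch-1h": 0.07, "batch-24h": 0.05}


def cost_usd(tokens: int, mode: str) -> float:
    """Cost of processing `tokens` tokens in the given mode."""
    return tokens / 1_000_000 * PRICE_PER_M[mode]


def saving_vs_realtime(mode: str) -> float:
    """Fractional discount of a mode relative to real-time pricing."""
    return 1 - PRICE_PER_M[mode] / PRICE_PER_M["real-time"]


if __name__ == "__main__":
    tokens = 200_000_000  # hypothetical nightly job
    for mode in PRICE_PER_M:
        print(f"{mode:10s} ${cost_usd(tokens, mode):6.2f} "
              f"({saving_vs_realtime(mode):.0%} cheaper than real-time)")
```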

Batch workflow: prepare a JSONL file, POST it, poll for completion, download results:

# 1. Prepare requests.jsonl
# {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "meta-llama/Llama-3.3-70B-Instruct",
#           "messages": [{"role": "user", "content": "Classify: {text}"}]}}

# 2. Submit batch
curl -X POST https://dash.packet.ai/api/v1/batch \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@requests.jsonl" \
  -F "sla=24h"

# 3. Poll status (set BATCH_ID to your batch's ID; curl treats literal {...} in a URL as a glob)
curl "https://dash.packet.ai/api/v1/batch/$BATCH_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"

# 4. Download results when complete
curl "https://dash.packet.ai/api/v1/batch/$BATCH_ID/results" \
  -H "Authorization: Bearer YOUR_API_KEY" -o results.jsonl
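Steps 1 and 3 of the workflow above can be scripted. The sketch below generates request lines in the shape shown in step 1 and provides a generic polling loop. The helper names (`build_batch_lines`, `write_jsonl`, `poll_until`) are made up for this example. This post does not document the status response schema, so the sketch deliberately leaves both the HTTP call and the "is it finished?" check to caller-supplied functions rather than guessing field names.

```python
import json
import time
from typing import Callable, Iterable

MODEL = "meta-llama/Llama-3.3-70B-Instruct"


def build_batch_lines(texts: Iterable[str], model: str = MODEL) -> list[str]:
    """One JSON object per input text, in the request shape shown in step 1."""
    lines = []
    for i, text in enumerate(texts, start=1):
        request = {
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": f"Classify: {text}"}],
            },
        }
        lines.append(json.dumps(request, ensure_ascii=False))
    return lines


def write_jsonl(path: str, lines: Iterable[str]) -> None:
    """Write one JSON document per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")


def poll_until(fetch_status: Callable[[], dict],
               is_done: Callable[[dict], bool],
               interval_s: float = 30.0,
               timeout_s: float = 24 * 3600) -> dict:
    """Call fetch_status() until is_done(status) is true or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if is_done(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("batch did not finish before the timeout")
        time.sleep(interval_s)
```

In practice `fetch_status` would wrap the GET request from step 3, and `is_done` would inspect whichever field the response uses to signal completion.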

LoRA fine-tuning: custom adapters in under an hour

Token Factory supports LoRA adapter training directly via the API. No GPU provisioning, no infrastructure setup. Supply 50–500 high-quality examples, pick a base model, and get a deployed adapter in 10–60 minutes.

Typical cost: $5–50 per run depending on dataset size and model. The adapter is stored on packet.ai and served at inference time via a single extra_body parameter.

# Train a LoRA adapter
curl -X POST https://dash.packet.ai/api/dashboard/token-factory/lora \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "support-classifier-v1",
    "base_model": "meta-llama/Llama-3.3-70B-Instruct",
    "training_data": "s3://your-bucket/train.jsonl",
    "epochs": 3,
    "rank": 16
  }'

# Use the adapter at inference time (Python, reusing the client created earlier)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Classify this ticket..."}],
    extra_body={"lora_adapter": "lora_support_classifier_v1_abc123"}
)
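The same two steps can be prepared from Python. This is a minimal sketch whose helpers only assemble the payloads shown above: the field names come from the curl example, while the function names and the validation rules are assumptions made for illustration. The 50-500 check mirrors the dataset size suggested in this post, and nothing here implies the API enforces it.

```python
MIN_EXAMPLES, MAX_EXAMPLES = 50, 500  # dataset size range quoted in this post


def build_lora_job(name: str, base_model: str, training_data: str,
                   epochs: int = 3, rank: int = 16) -> dict:
    """Request body with the same fields as the curl example above."""
    if epochs < 1 or rank < 1:
        raise ValueError("epochs and rank must be positive")
    return {"name": name, "base_model": base_model,
            "training_data": training_data, "epochs": epochs, "rank": rank}


def check_example_count(n_examples: int) -> bool:
    """True when the dataset size falls in the suggested 50-500 range."""
    return MIN_EXAMPLES <= n_examples <= MAX_EXAMPLES


def chat_kwargs_with_adapter(base_model: str, prompt: str, adapter_id: str) -> dict:
    """Keyword arguments for client.chat.completions.create() using an adapter."""
    return {
        "model": base_model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"lora_adapter": adapter_id},
    }
```

With the earlier client, `client.chat.completions.create(**chat_kwargs_with_adapter(base_model, prompt, adapter_id))` issues the same request as the inference example above.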

Why Token Factory costs 98% less than GPT-4o

Three compounding reasons:

1. Open-source models with no margin stack

Llama 3.3 70B, Qwen 2.5 72B, and DeepSeek R1 are free to deploy. OpenAI's pricing includes model development cost recovery, proprietary API infrastructure margins, and enterprise support. None of those apply here.

2. vLLM continuous batching

vLLM's PagedAttention and continuous batching keep GPU utilisation high by dynamically grouping requests. This means the cost per token is spread across many concurrent requests rather than charged at dedicated-GPU rates.

3. hosted·ai’s 5× utilisation advantage

packet.ai runs on hosted·ai’s GPU pooling infrastructure, achieving up to 5× better hardware utilisation than static allocation. That efficiency flows through directly to lower per-token costs.
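To give some intuition for the continuous batching point (reason 2), here is a toy simulation that compares it with static batching. It is a deliberately simplified model, not vLLM's scheduler: every decoding step is assumed to cost the same, each request needs a fixed positive number of steps, and prefill and memory limits are ignored. All function names are made up for the example.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch of `slots` requests runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """A freed slot is refilled from the queue at the next decoding step."""
    queue = deque(lengths)
    active = []  # remaining decode steps for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished ones leave.
        active = [r - 1 for r in active if r > 1]
    return steps


def utilisation(lengths, slots, steps):
    """Fraction of slot-steps that actually produced a token."""
    return sum(lengths) / (steps * slots)


if __name__ == "__main__":
    lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # decode steps needed per request
    slots = 4
    for name, fn in [("static", static_batching_steps),
                     ("continuous", continuous_batching_steps)]:
        steps = fn(lengths, slots)
        print(f"{name:10s} steps={steps} "
              f"utilisation={utilisation(lengths, slots, steps):.0%}")
```

In this toy workload the short requests finish early. Static batching leaves their slots idle until the long request in the same batch completes, while continuous batching hands those slots to waiting requests, so the same work takes fewer steps.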

Frequently asked questions

How does LoRA fine-tuning work on Token Factory?

POST a training job via the API with your base model, training data URL, and LoRA rank. Training takes 10–60 minutes depending on dataset size (50–500 examples) and costs $5–50 per run. The resulting adapter is deployed automatically and can be called at inference time via extra_body: {lora_adapter: "adapter_id"}.

Last reviewed: 10 June 2026. Try Token Factory: first 10,000 tokens free.
