Announcement

Why We Built Token Factory

$0.10/M tokens for Llama, Qwen, DeepSeek — same OpenAI SDK, one line change. Here's why Token Factory exists and what it costs vs GPT-4o.

packet.ai Team
January 30, 2025

Token Factory is packet.ai's OpenAI-compatible managed inference API at $0.10/M tokens — the same models, same SDK, same streaming, but 98% cheaper than GPT-4o for the 80% of production use cases where Llama or Qwen is good enough.

Key takeaways

  • Token Factory real-time inference: $0.10/M tokens. Batch (24h SLA): $0.05/M. GPT-4o: $2.50–$10.00/M
  • A team processing 100M tokens/month pays ~$10 on Token Factory vs ~$625 on GPT-4o (assuming an even split between input and output tokens)
  • Migration is one line: change base_url in your existing OpenAI SDK setup
  • Available models include Llama 3.3 70B, Qwen 2.5 72B, DeepSeek R1 Distill 32B, and the rest of the 8 models served on the RTX PRO 6000 Blackwell server
  • Token Factory runs on packet.ai's RTX PRO 6000 Blackwell infrastructure — same GPUs available for rent at $0.66/hr

Developer after developer came to packet.ai with the same story: inference costs that started out manageable and grew until they threatened the business. Token Factory is the answer.

The inference cost problem at scale

The pattern is consistent across AI product teams. Early stage: OpenAI free tier, everything works. Growing: $500/month, still manageable. Scaling: $5,000/month, starting to hurt. Successful: $50,000/month, now it's a line item that threatens the business model.

At that point the math stops making sense. Some teams tried solving it by self-hosting inference — setting up vLLM, configuring load balancing, handling CUDA driver issues, managing autoscaling. It worked, but they'd traded a cost problem for a DevOps problem with a fully loaded engineering cost that often exceeded the inference bill.

The core insight

For most production use cases — RAG pipelines, classification, summarisation, code generation, customer support — open-source models match GPT-3.5 quality. You shouldn't pay enterprise margins for proprietary model infrastructure when the open-source alternative is good enough.

The numbers: Token Factory vs the market

Provider / Model              | Input   | Output   | 100M tokens/month (even input/output split)
Token Factory (real-time)     | $0.10/M | $0.10/M  | ~$10
Token Factory (batch 24h)     | $0.05/M | $0.05/M  | ~$5
OpenAI GPT-4o-mini            | $0.15/M | $0.60/M  | ~$37.50
OpenAI GPT-4o                 | $2.50/M | $10.00/M | ~$625
Anthropic Claude 3.5 Sonnet   | $3.00/M | $15.00/M | ~$900

The 100M token/month comparison is not theoretical. That's a mid-sized SaaS product running a RAG pipeline with 20 queries per active user per day at ~5,000 users. On GPT-4o, at the listed prices and an even input/output split, that's about $625/month, or $7,500/year, in inference alone. On Token Factory it's $120/year.
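The arithmetic behind the comparison can be reproduced with a short sketch. The per-million prices are the ones listed in the table; the 50/50 split between input and output tokens (`output_share=0.5`) is an assumption made for this example, and a real workload's mix will shift the results.

```python
# Prices in USD per million tokens (input, output), as listed in the table.
PRICES = {
    "Token Factory (real-time)": (0.10, 0.10),
    "Token Factory (batch 24h)": (0.05, 0.05),
    "OpenAI GPT-4o-mini": (0.15, 0.60),
    "OpenAI GPT-4o": (2.50, 10.00),
    "Anthropic Claude 3.5 Sonnet": (3.00, 15.00),
}


def monthly_cost(input_price, output_price, tokens_millions, output_share=0.5):
    """Monthly cost in USD for a token volume given in millions.

    output_share is the assumed fraction of tokens that are output tokens.
    """
    input_millions = tokens_millions * (1 - output_share)
    output_millions = tokens_millions * output_share
    return input_millions * input_price + output_millions * output_price


if __name__ == "__main__":
    for name, (inp, out) in PRICES.items():
        cost = monthly_cost(inp, out, tokens_millions=100)
        print(f"{name:30s} ${cost:>10,.2f}/month  ${cost * 12:>11,.2f}/year")
```

Changing `output_share` shows how sensitive the proprietary-model rows are to the input/output mix, while the flat-priced Token Factory rows do not move.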

Migration takes 30 seconds

Token Factory is OpenAI-compatible. There is no new SDK to install, no new response format to parse, no streaming implementation to rewrite.

from openai import OpenAI

# Before
client = OpenAI(api_key="sk-...")

# After — one line change
client = OpenAI(
    base_url="https://dash.packet.ai/api/v1",
    api_key="your-packet-api-key"
)

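response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True  # streaming works too
)

# With stream=True the call returns an iterator of chunks; print text as it arrives
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)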

LangChain, LlamaIndex, the JavaScript SDK, structured output — all supported. Same response shape as OpenAI.
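One way to keep the switch reversible is to choose the provider from configuration instead of editing code. The sketch below is illustrative only: the environment variable names (`LLM_PROVIDER`, `PACKET_API_KEY`) and the helper functions are hypothetical, while the base URL is the one given in this post.

```python
import os

# Endpoint as given in this post.
PACKET_BASE_URL = "https://dash.packet.ai/api/v1"


def client_kwargs(env=None):
    """Build keyword arguments for OpenAI(...) from environment-style settings.

    LLM_PROVIDER and PACKET_API_KEY are hypothetical variable names
    chosen for this example.
    """
    env = os.environ if env is None else env
    if env.get("LLM_PROVIDER", "openai") == "packet":
        return {"base_url": PACKET_BASE_URL, "api_key": env["PACKET_API_KEY"]}
    # Default: the SDK's standard OpenAI endpoint.
    return {"api_key": env["OPENAI_API_KEY"]}


def make_client(env=None):
    """Create the client; openai is imported lazily so the helper above stays dependency-free."""
    from openai import OpenAI

    return OpenAI(**client_kwargs(env))
```

With this in place, setting `LLM_PROVIDER=packet` in one environment and leaving it unset in another lets the same code be tested against both backends before committing to either.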

When Token Factory is not the right choice

Token Factory is not replacing GPT-4 for complex multi-step reasoning tasks. For agentic workflows requiring extended thinking, nuanced legal or medical analysis, or cutting-edge frontier capability, OpenAI and Anthropic's proprietary models remain the right call. But for the 80% of production volume (classification, summarisation, structured extraction, RAG retrieval responses, code generation at 7B–70B scale), the open-source models available through Token Factory are good enough, and paying GPT-4o rates, roughly 60 times the Token Factory price at an even input/output split, for that workload is hard to justify.
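In practice this split often becomes a small routing rule in application code. The following is a minimal sketch, not a recommended policy: the task labels and provider names are hypothetical, and `gpt-4o` stands in for whichever proprietary model a team keeps for harder tasks.

```python
# Hypothetical task labels; adapt to your own workload taxonomy.
OPEN_MODEL_TASKS = {
    "classification",
    "summarisation",
    "extraction",
    "rag",
    "codegen",
}

OPEN_MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # model name used earlier in this post
FRONTIER_MODEL = "gpt-4o"  # example proprietary model for harder tasks


def pick_route(task_type):
    """Return (provider, model) for a task label.

    Routine, high-volume tasks go to the open model; everything else
    falls back to the proprietary model.
    """
    if task_type in OPEN_MODEL_TASKS:
        return ("token-factory", OPEN_MODEL)
    return ("frontier", FRONTIER_MODEL)
```

Defaulting unknown task types to the proprietary model is a deliberately conservative choice: a mislabelled task costs more but does not silently lose quality.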

Frequently asked questions

How much does Token Factory cost?
Token Factory charges $0.10/M tokens for real-time inference (same input and output price) and $0.05/M for batch processing with a 24-hour SLA. The first 10,000 tokens are free with no credit card required.

Is Token Factory compatible with the OpenAI SDK?
Yes. Token Factory is fully OpenAI-compatible: change one line in your existing OpenAI client setup (set base_url to https://dash.packet.ai/api/v1) and your existing code works unchanged. Streaming, LangChain, LlamaIndex, and the JavaScript SDK are all supported.

Which models does Token Factory serve?
Token Factory serves all 8 models deployed on packet.ai's RTX PRO 6000 Blackwell 8-GPU server: Llama 3.3 70B Instruct, Qwen 2.5 72B Instruct, Qwen 3 30B MoE, Qwen 3 32B, Qwen 2.5 Coder 32B, Nemotron 70B, DeepSeek R1 Distill 32B, and Mistral Large 123B.

How does Token Factory compare with other inference providers?
Token Factory charges $0.10/M tokens for Llama 3.3 70B versus $0.88/M on Together.ai and $0.90/M on Fireworks.ai, which makes those providers 8.8–9× more expensive. AWS Bedrock charges $0.72/M for the same model. DeepInfra is the closest alternative at $0.23–$0.40/M.

Should I use Token Factory or self-host?
Token Factory wins unless your volume exceeds ~1 billion tokens/month or you need custom model fine-tuning. Below that threshold, managing your own vLLM cluster costs more in engineering time than the savings. At high volume, renting GPU clusters on packet.ai at $0.65–$0.66/hr and running vLLM directly is the more economical path.

Last reviewed: 10 June 2026.