Every quarter, someone publishes a "cheapest AI API" breakdown. It lists token prices, formats them in a table, and concludes that whichever model has the lowest number wins. Then teams use it to make purchasing decisions — and end up with surprise bills that are 4× the estimate.

Per-token pricing is not your total cost. It's not even most of your cost. This guide covers what the comparisons leave out: rate limits, context window economics, fine-tuning costs, embedding costs, and the billing lag that turns a spike into a disaster. Then it gives you an actual total-cost-of-ownership view that you can use to make a real decision.

Why Token Pricing Doesn't Tell You What You'll Pay

Token pricing is a useful unit for comparison. It's a poor predictor of your actual bill. Here's what it misses:

Input vs. output token asymmetry. Every provider charges differently for input versus output tokens. On GPT-4o, input tokens cost $2.50 per million and output tokens cost $10 per million — a 4× multiplier. On Claude 3.5 Sonnet, input is $3 per million and output is $15 per million — a 5× multiplier. If your application generates verbose outputs (long summaries, code generation, multi-step reasoning), your actual cost per call scales with output volume, not prompt length.
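To make the asymmetry concrete, here is a minimal sketch that prices a single call from the per-million rates quoted above. The `PRICES` table, the `call_cost` helper, and the example token counts are illustrative assumptions, not part of any provider SDK.

```python
# Prices in USD per million tokens, taken from the figures quoted above.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call at the listed per-million rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Hypothetical workload: the same 1,000-token prompt with a short
# answer (100 tokens) versus a verbose one (2,000 tokens).
short = call_cost("gpt-4o", 1_000, 100)
verbose = call_cost("gpt-4o", 1_000, 2_000)
print(f"short: ${short:.4f}  verbose: ${verbose:.4f}")
```

With an identical prompt, the verbose call costs several times more than the short one, and the gap comes entirely from output tokens.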

Context window utilization. A model with a 128K context window that you regularly fill with 80K tokens of history is expensive in ways that per-token pricing doesn't capture. The cost of maintaining long conversation context compounds across every turn — and agentic systems that do multi-step reasoning with full context retention can hit 20× the cost of a single-turn interaction.
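The compounding effect can be sketched with simple arithmetic. The example below assumes every turn resends the full conversation history and that each turn adds a fixed number of tokens to it; both are simplifications, and `conversation_input_tokens` is a hypothetical helper.

```python
def conversation_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when every turn resends the full history.

    Turn k sends k * tokens_per_turn tokens (the history so far plus the
    new message), so the total is tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


# Hypothetical conversation: 20 turns, each adding 500 tokens of history.
with_history = conversation_input_tokens(20, 500)
without_history = 20 * 500  # if each turn only sent its own 500 tokens
print(with_history, without_history, with_history / without_history)
```

Billed input grows roughly with the square of the number of turns, which is why long-running agent loops cost far more than the per-token rate suggests.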

Retry patterns and failure modes. Every production system retries on failure. If a provider has a 99.5% success rate and you make 10,000 calls per day, you're making about 50 retries per day, and you may also be paying for failed calls that consumed tokens before returning an error. Rate-limit errors from providers with tight token-per-minute limits add the same kind of retry overhead without delivering value.
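The retry arithmetic can be written down as a rough estimate. In this sketch, `cost_per_call` and `billed_fraction` (the share of a failed call's cost that still gets billed) are assumptions you would need to measure for your own provider and workload.

```python
def retry_overhead(calls_per_day: int, success_rate: float,
                   cost_per_call: float, billed_fraction: float = 1.0):
    """Estimate failed calls per day and the spend wasted on them.

    billed_fraction is an assumption: 1.0 means a failed call is billed
    in full, 0.0 means failures are never billed.
    """
    failed = calls_per_day * (1.0 - success_rate)
    wasted = failed * cost_per_call * billed_fraction
    return failed, wasted


# Figures from the paragraph above, plus a hypothetical $0.01 per call.
failed, wasted = retry_overhead(10_000, 0.995, cost_per_call=0.01)
print(f"{failed:.0f} failed calls/day, up to ${wasted:.2f}/day wasted")
```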

Billing lag. Provider billing data typically lags by 24–72 hours. At the upper end of that range, a spike on Thursday shows up in your dashboard on Saturday. By the time the billing alert fires, you've already spent the money. Real spend monitoring requires tracking at the API response level, not the billing portal level.
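One way to track spend at the API response level is to price each response as it arrives, using the token counts the provider reports in that response. The sketch below is a minimal illustration under stated assumptions: `SpendTracker` is a hypothetical class, and the name and shape of the usage field differ between providers, so the token counts are passed in as plain integers.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SpendTracker:
    # model name -> (input USD per 1M tokens, output USD per 1M tokens)
    prices: dict
    totals: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Price one response from its reported token counts and accumulate it."""
        in_rate, out_rate = self.prices[model]
        cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
        self.totals[model] += cost
        return cost


tracker = SpendTracker({"gpt-4o-mini": (0.15, 0.60)})
# Token counts would come from the usage data in each API response.
tracker.record("gpt-4o-mini", input_tokens=1_200, output_tokens=300)
tracker.record("gpt-4o-mini", input_tokens=800, output_tokens=450)
print(dict(tracker.totals))
```

Because the totals update on every response, an alert built on them can fire within minutes of a spike instead of waiting for the billing portal to catch up.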

Provider-by-Provider Breakdown

OpenAI (GPT-4o, GPT-4o-mini)

OpenAI offers the widest model range in production use, with list prices that differ by roughly 17× between GPT-4o mini and GPT-4o in the table below.

| Model | Input / 1M tokens | Output / 1M tokens | Context |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4o mini | $0.15 | $0.60 | 128K |
| o3-mini | $1.10 | $4.40 | 200K |

Hidden costs to watch: OpenAI's batch API offers 50% cost reduction for non-time-sensitive workloads — but adds up to 24 hours of latency and requires explicit batch job management. Fine-tuning runs on a separate per-token training cost (currently $25/M for GPT-4o fine-tuning) plus storage. Embeddings (text-embedding-3-large) are $0.13/M tokens — cheap individually, but significant at scale for retrieval-augmented generation pipelines.

Anthropic (Claude 3.5 Sonnet, Claude 3.5 Haiku)

Anthropic's Claude 3.5 Sonnet has become the default "production quality" choice for many teams due to its reasoning capability, at a price point slightly above GPT-4o ($3/$15 versus $2.50/$10 per million input/output tokens).

| Model | Input / 1M tokens | Output / 1M tokens | Context |
|---|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200K |
| Claude 3 Opus | $15.00 | $75.00 | 200K |

Hidden costs to watch: Anthropic uses a prepaid credit model for most plans. Credit exhaustion mid-month causes API errors in production, not a graceful fallback. The real billing risk isn't overspend; it's under-provisioned credits killing production traffic. Anthropic doesn't currently offer a self-serve fine-tuning API (available on enterprise contracts only), which is a meaningful gap if model customization is on your roadmap. Prompt caching (available on Claude 3.5 models) offers up to a 90% cost reduction on repeated context that is read from the cache, which is significant for document-heavy workflows but requires implementation work.

Google Cloud (Gemini 1.5 Pro, Gemini 1.5 Flash)

Google's pricing model changed significantly in 2025 with the introduction of tiered pricing based on prompt length — making cost calculations more complex than any other provider.

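| Model | Input / 1M tokens (prompt ≤128K) | Input / 1M tokens (prompt >128K) | Output / 1M tokens (prompt ≤128K / >128K) |
|---|---|---|---|
| Gemini 1.5 Pro | $1.25 | $2.50 | $5.00 / $10.00 |
| Gemini 1.5 Flash | $0.075 | $0.15 | $0.30 / $0.60 |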

Hidden costs to watch: Google charges differently for requests above versus below the 128K context threshold, so the same model costs 2× as much per token once your prompt crosses that line. This creates non-linear cost behavior that's hard to budget without tracking actual prompt lengths. Google also offers a free tier with rate limits that teams frequently exceed in production, causing silent failures rather than clear overage billing. Vertex AI (enterprise) has different pricing than the Gemini Developer API (direct): they're not the same product despite using the same models.
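The threshold effect is easy to see in a small calculation. This sketch uses the Gemini 1.5 Pro rates from the table and assumes the whole request is billed at the higher rate once the prompt exceeds 128K tokens; `tiered_cost` and the `GEMINI_15_PRO` structure are illustrative names, not part of any Google SDK.

```python
# Rates in USD per million tokens: (at or below threshold, above threshold).
GEMINI_15_PRO = {
    "threshold": 128_000,
    "input": (1.25, 2.50),
    "output": (5.00, 10.00),
}


def tiered_cost(prompt_tokens: int, output_tokens: int, tier=GEMINI_15_PRO) -> float:
    """USD cost of one request under prompt-length tiered pricing.

    Assumption: the tier is chosen by prompt length and applies to every
    token in the request, not only to tokens past the threshold.
    """
    idx = 1 if prompt_tokens > tier["threshold"] else 0
    return (prompt_tokens * tier["input"][idx]
            + output_tokens * tier["output"][idx]) / 1_000_000


below = tiered_cost(128_000, 1_000)
above = tiered_cost(128_001, 1_000)
print(f"at threshold: ${below:.4f}  one token over: ${above:.4f}")
```

Under this assumption, a single extra prompt token roughly doubles the cost of the request, which is why logging actual prompt lengths matters for budgeting.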

AWS Bedrock (Claude, Llama, Titan)

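AWS Bedrock is a marketplace for models, not a model provider. You're paying AWS's price, which is not guaranteed to match the model developer's direct price. For Claude 3.5 Sonnet, the on-demand rates listed here match Anthropic's direct rates, so any extra cost there comes from surrounding AWS charges rather than the token rate.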

| Model (via Bedrock) | Input / 1M tokens | Output / 1M tokens |
|---|---|---|
| Claude 3.5 Sonnet (Bedrock) | $3.00 | $15.00 |
| Llama 3.1 70B (on-demand) | $0.99 | $1.32 |
| Amazon Titan Text Premier | $0.50 | $1.50 |

Hidden costs to watch: Bedrock bills by region, so a team with workloads in us-east-1 and eu-west-1 gets two separate line items for the same model. The Provisioned Throughput model commits you to a flat reservation regardless of usage: right-sizing is hard, and underutilized reservations waste money silently. AWS also charges for inference profile creation, model evaluation jobs, and knowledge base storage (for RAG workflows). These add-ons are easy to provision and easy to forget about.

The Hidden Costs Nobody Puts in the Table

Beyond per-token rates, five cost categories routinely catch teams off guard:

Rate limits as an operational cost. Every provider imposes token-per-minute (TPM) and request-per-minute (RPM) limits. When you hit them, you either queue (adding latency), implement exponential backoff (adding complexity), or pay for a higher tier (adding cost). OpenAI's tier system requires spending thresholds before unlocking higher rate limits — a bootstrapped team might wait weeks to qualify for limits that their production traffic requires from day one.
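For reference, the exponential backoff option usually looks something like the sketch below. It is a generic pattern, not any provider's SDK: `RateLimitError` is a hypothetical stand-in for whatever exception your client raises on an HTTP 429, and the delay parameters are arbitrary defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with capped exponential backoff.

    Uses "full jitter": each wait is a random duration between zero and the
    capped exponential delay, which spreads out retries from many clients.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The `sleep` parameter exists so tests can swap in a no-op. Note that every retry adds latency, which is the operational cost this section describes.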

Embeddings at scale. Text embedding for RAG systems is cheap per call but runs on every document ingestion and every search query. A knowledge base of 500,000 documents re-embedded monthly at OpenAI's text-embedding-3-large rate ($0.13/M tokens, ~500 tokens per document) costs ~$33/month just for the embeddings — before any inference. That compounds if you use a more capable embedding model or have higher document churn.
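The embedding figure above can be reproduced with a short calculation, which also makes it easy to plug in your own corpus size. `monthly_embedding_cost` is a hypothetical helper; the inputs are the numbers from the paragraph.

```python
def monthly_embedding_cost(documents: int, tokens_per_doc: int,
                           price_per_million: float,
                           reembeds_per_month: float = 1.0) -> float:
    """USD per month to embed a corpus at a given per-million-token rate."""
    tokens = documents * tokens_per_doc * reembeds_per_month
    return tokens * price_per_million / 1_000_000


# 500,000 documents, ~500 tokens each, re-embedded once a month at $0.13/M.
cost = monthly_embedding_cost(500_000, 500, 0.13)
print(f"${cost:.2f}/month")
```

Query-time embeddings are not included here; they would add to this figure in proportion to search volume.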

Fine-tuning storage and hosting. OpenAI charges for training tokens plus monthly model storage. A fine-tuned GPT-4o model costs $25/M tokens to train, and inference on the resulting model is billed at a premium over the base model: at the time of writing, OpenAI lists it at $3.75/M input and $15/M output tokens, 1.5× the base GPT-4o rates. Teams fine-tune once to improve quality, then face ongoing per-token costs that are invisible until the invoice arrives.

Batch vs. real-time pricing gaps. OpenAI's batch API is 50% cheaper. Anthropic's Message Batches API is 50% cheaper. Google's batch prediction is significantly cheaper for Vertex AI. If your use case tolerates latency (document processing, classification, evaluation pipelines), not using batch APIs is burning money. Most teams don't implement batch because it adds engineering complexity — but the cost gap compounds at scale.

Cross-provider data transfer and integration overhead. If you're routing the same request to multiple providers for evaluation, redundancy, or fallback, you're multiplying your token costs. In a primary/fallback setup where 5% of traffic falls to the backup provider, that 5% is billed at the backup provider's rates, on top of anything the primary already charged for the failed attempts. At $100K/month in primary spend and similar backup rates, that's about $5K/month in fallback overhead, often untracked because it's coded as error handling, not as an API cost.

Total Cost of Ownership: What Actually Matters

Here's what a genuine TCO comparison looks like for a team spending $5,000/month on LLM inference, across four providers:

| Factor | OpenAI | Anthropic | Google Cloud | AWS Bedrock |
|---|---|---|---|---|
| Billing model | Post-pay invoice | Prepaid credits | GCP billing | AWS billing |
| Billing lag | 24–48 hrs | Near-real-time | Up to 24 hrs | Up to 24 hrs |
| Fine-tuning available | Yes (self-serve) | Enterprise only | Yes (Vertex AI) | Limited models |
| Batch discount | 50% off | 50% off | Varies | None (on-demand) |
| Rate limit flexibility | Tiered by spend | Tiered by usage | Quota requests | Provisioned throughput |
| Spend surprise risk | Medium | Low (credits) | Medium | High (multi-region) |

How to Actually Compare Costs for Your Workload

The right comparison isn't "which provider has the lowest rate." It's "which provider minimizes total cost for our specific workload characteristics." Three questions determine where the real savings are:

What's your input-to-output ratio? Applications that generate short structured outputs (classification, extraction, yes/no) have a very different cost profile from applications that generate long prose (summaries, emails, documentation). At a 10:1 input-to-output ratio, input tokens drive most of the bill: at GPT-4o's rates, output is about 29% of the cost. At 1:5, output is about 95% of it. Calculate your actual ratio before accepting any comparison table at face value.
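A quick way to check where your workload sits is to compute the share of spend that goes to output tokens. This sketch uses GPT-4o's rates from earlier in the article; `output_share` is a hypothetical helper, and the ratios are the two cases discussed above.

```python
def output_share(input_tokens: float, output_tokens: float,
                 input_price: float, output_price: float) -> float:
    """Fraction of a call's cost attributable to output tokens."""
    in_cost = input_tokens * input_price
    out_cost = output_tokens * output_price
    return out_cost / (in_cost + out_cost)


# GPT-4o rates: $2.50 input, $10.00 output per million tokens.
heavy_input = output_share(10, 1, 2.50, 10.00)   # 10:1 input-to-output
heavy_output = output_share(1, 5, 2.50, 10.00)   # 1:5 input-to-output
print(f"10:1 -> {heavy_input:.0%} output, 1:5 -> {heavy_output:.0%} output")
```

Running the same function over your logged token counts shows whether input or output pricing should carry more weight in a provider comparison.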

What's your context utilization? If your average prompt is 2,000 tokens, Google's context tier pricing is irrelevant. If you regularly send 150,000-token prompts (large codebase analysis, legal document review), Google's above-threshold rate doubles the cost while Anthropic's 200K window doesn't penalize you for size.

Can your workload tolerate latency? If any significant fraction of your workload is async (document processing, batch scoring, nightly reports), the 50% batch discount on OpenAI and Anthropic cuts the cost of that portion in half. That can be a bigger lever than switching providers entirely.
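The size of that lever depends on how much of the workload can move to batch. A minimal sketch, assuming a flat 50% discount applies to the async share and nothing else changes (`spend_with_batch` and the 40% figure are illustrative):

```python
def spend_with_batch(monthly_spend: float, async_fraction: float,
                     batch_discount: float = 0.5) -> float:
    """Monthly spend after moving the async share of traffic to a batch API."""
    return monthly_spend * (1 - async_fraction * batch_discount)


# Hypothetical team: $5,000/month, 40% of which tolerates batch latency.
after = spend_with_batch(5_000, 0.40)
print(f"${after:,.0f}/month, saving ${5_000 - after:,.0f}")
```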

How PayMesh Tracks Real Costs Across All Providers

The challenge with multi-provider cost management isn't the initial comparison; it's the ongoing tracking. Token prices change. Your workload changes. A new team spins up a project on a provider you didn't expect. Batch jobs migrate to real-time when they need lower latency. Each change shifts your effective cost per provider in ways that a spreadsheet comparison from three months ago can't capture.

PayMesh connects to all five providers and pulls live usage data hourly. The dashboard shows actual token consumption and cost by provider, model, and time window — not estimates, not billing-portal approximations, but the usage data from the API itself. When a model upgrade changes your per-token cost, you see it in the next hour's data, not on next month's invoice.

Budget alerts fire when your per-provider or aggregate spend crosses your thresholds — at 50%, 80%, or 100%, not just when the invoice lands. The burn-rate projection shows where you're headed, not just where you've been.
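For readers who want to see the arithmetic behind threshold alerts and burn-rate projection, here is a generic sketch. It is not PayMesh's implementation: the function names are hypothetical, and the projection simply assumes spend continues at the month-to-date daily average.

```python
def projected_month_end(spend_so_far: float, day_of_month: int,
                        days_in_month: int) -> float:
    """Linear projection: assume the average daily spend so far continues."""
    return spend_so_far / day_of_month * days_in_month


def crossed_thresholds(spend: float, budget: float,
                       thresholds=(0.5, 0.8, 1.0)) -> list:
    """Return the budget fractions that current spend has reached."""
    return [t for t in thresholds if spend >= budget * t]


# Hypothetical month: $4,200 spent by day 10 against a $5,000 budget.
print(crossed_thresholds(4_200, 5_000))
print(projected_month_end(4_200, 10, 30))
```

A linear projection is the simplest choice; it will underestimate spend when usage is accelerating, so treat it as a floor rather than a forecast.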

Not sure what your monthly budget should be? The AI API Cost Calculator estimates your spend across all five providers based on your actual request volume and model mix — it takes 30 seconds and doesn't require a signup.

Already using a spreadsheet to track costs? The first PayMesh guide on AI cost tracking walks through the exact migration path from manual tracking to automated monitoring.

The Bottom Line

For most production workloads in 2026:

The cheapest AI API is the one you're actually monitoring. A $0.15/M token model running uncontrolled can cost more than a $5/M model with proper alerting and budget controls. Price per token is the input. Actual spend is the output, and those are only equal if you're watching the meter.