You're paying $2,000/month for GPT-4o. That number sits in your billing dashboard. It's clean. Round. Wrong.

Your actual cost includes a prompt caching discount you didn't know existed, a batch API discount you didn't implement, a Bedrock markup AWS added silently, a fine-tuning run that cost 10× more than your inference, and a 24-hour billing lag that's been hiding last week's spike from you this entire time.

Per-token pricing is the price tag. The bill is something else entirely.

The Obvious Costs (What Everyone Tracks)

Input tokens and output tokens. Everyone knows these. You send a prompt, you get a response, you multiply tokens by rate. That's the starting point — not the destination.

What people miss is that input and output have different rates on every provider. GPT-4o input is $2.50/M tokens. Output is $10/M — a 4× multiplier. Claude 3.5 Sonnet is $3/M input and $15/M output — also 4×. If your application generates long outputs (summaries, code generation, multi-step reasoning), your cost scales with output volume, not prompt length. A comparison that only shows input pricing understates your bill by 3-5× depending on your use case.

Context window economics compound this. A model with a 128K context window that you regularly fill with 80K tokens of history is expensive in ways that per-token pricing doesn't surface. The cost of maintaining long conversation context multiplies across every turn — and agentic systems doing multi-step reasoning with full context retention can hit 20× the cost of a single-turn interaction with the same model.

The Hidden Costs Nobody Talks About

Prompt Caching / Context Caching

AWS Bedrock and Anthropic both offer discounts for repeated context — Anthropic's prompt caching on Claude 3.5 Sonnet gives 90% cost reduction on cached tokens. AWS Bedrock has similar context caching behavior. Most teams have no idea this feature exists, let alone that they're paying full price for context that could be cached.

The requirement: your application needs to reuse system prompts, base contexts, or large document sections across requests. If you do, and you're not implementing prompt caching, you're paying 10× more than you need to for every repeated-context call. This alone can cut Anthropic bills by 40-70% for document-heavy workflows.

Batch API vs. Real-Time Pricing

OpenAI's Batch API costs 50% less than the standard API. Anthropic's Message Batches API is also 50% off. Google's batch prediction is significantly cheaper for Vertex AI workloads.

The catch: up to 24 hours of latency. The use case: document processing, batch classification, evaluation pipelines, nightly reports — anything that doesn't require a real-time response. Teams that can tolerate 24-hour latency are leaving a 50% discount on the table simply because they haven't implemented batch job handling.

At $10,000/month in spend, a workload that migrates to batch pricing saves $5,000/month. That's not a rounding error.

Provisioned Throughput Units (PTU) vs. Pay-as-You-Go on Azure

Azure OpenAI's PTU model charges a flat monthly reservation fee regardless of actual usage. You commit to a throughput tier; Azure charges you for it whether your actual traffic uses 20% or 100% of that throughput.

Teams on PTU often don't track utilization rate. They're paying for capacity they never use. The "effective cost per token" is only meaningful when you divide your PTU reservation cost by actual token throughput — and most dashboards don't do that math by default.

The warning: PTU makes sense when you have consistent, high-volume traffic that regularly hits rate limits on pay-as-you-go. For variable or bursty workloads, pay-as-you-go is almost always cheaper.

Bedrock's Markup Over Direct API Pricing

AWS Bedrock is a marketplace — you're paying AWS's price, not the model developer's price, and AWS adds a margin. For the same Claude 3.5 Sonnet model, Bedrock's effective price is 3-8% higher than calling Anthropic's API directly.

The tradeoff: Bedrock offers unified API access, AWS IAM integration, and compliance certifications. If those things matter to you, the markup may be worth it. But it's a markup, not a feature — and it should be in your TCO calculation.

Fine-Tuning vs. Inference Cost Ratio

OpenAI's fine-tuning runs on a separate pricing schedule from inference. GPT-4o fine-tuning costs $25/M training tokens to train and $25/M tokens/month to host — plus inference per-token costs on top. A fine-tuned model with low utilization can cost 10× base model inference for the same workload.

Anthropic's fine-tuning is available on enterprise contracts only. If you're comparing fine-tuning paths, note that Anthropic doesn't offer a self-serve fine-tuning API — that's an enterprise conversation, not a pricing page data point.

Tokenization Overhead

Not all "1 million tokens" are equivalent. Different tokenizers produce different token counts for the same text — a 1,000-word document might tokenize to 800 tokens on one model and 1,200 tokens on another. Some providers charge differently above context thresholds: Google Vertex AI charges 2× more for prompts over 128K tokens. This non-linear behavior means "per-token" pricing doesn't linearly predict your bill — it assumes you're always below the threshold.

Real Math: The Visible Bill vs. The Real Bill

Here's what this looks like in practice. Team: 12 engineers, running a customer support AI that processes 50,000 tickets/month on GPT-4o.

Visible bill: 500M input tokens + 150M output tokens = (500 × $2.50) + (150 × $10) = $1,250 + $1,500 = $2,750/month. That's what the dashboard shows.

Real bill:

Adjusted total: $1,530/month. That's a 44% reduction from the visible bill — without changing providers, models, or team headcount. The math works. The gap is in the implementation.

How to Audit Your Own AI Costs

Pull your last three months of billing data. For each provider, answer these questions:

  1. What's your input-to-output token ratio? If it's heavily output-weighted, focus on output optimization — model downgrades, truncation, or structured output formats.
  2. Are you using batch APIs for any async workload? Any workload that tolerates 24-hour delay is a candidate. The 50% discount compounds at scale.
  3. Do you have repeated system prompts or large context blocks? If yes, implement prompt caching. Anthropic's cache hits are 90% cheaper.
  4. Are you on a PTU reservation on Azure? If your utilization is below 60%, you're overpaying. Switch to pay-as-you-go.
  5. Are you using Bedrock, or calling providers directly? The markup is real. Calculate whether the compliance/集成 benefits justify the 3-8% premium.
  6. What's your average prompt length vs. your context window? If you're consistently below 128K tokens on Google, the above-threshold pricing doesn't apply — but if you've crossed that line, your costs doubled.
  7. Are any fine-tuned models running in production? Audit utilization. Low-usage fine-tuned models are an expensive habit.

For teams running multiple providers, the audit also includes cross-provider attribution: knowing which cost center drives which portion of your bill. Without that, optimization is guesswork.

The Tools to Find the Hidden Costs

The billing portal shows you the number. The meter is running. The challenge is mapping actual usage to actual cost in a way that surfaces the gap between what's visible and what's real.

PayMesh connects to OpenAI, Anthropic, Google Cloud, AWS Bedrock, and Azure OpenAI, pulling usage data hourly from each provider's API. The dashboard shows per-model cost breakdowns, utilization rates for PTU commitments, and burn-rate projections — so you can see where you're headed, not just where you've been.

The AI API Cost Calculator estimates your monthly spend across all five providers based on your actual request volume and model mix. Run it before your next billing cycle — and run it again after you've implemented one of the changes above. The difference is the hidden cost you were leaving on the table.

Your bill isn't the number in the dashboard. It's everything that number is made of. Time to read the ingredient label.