What AI actually costs to run — API pricing, token math, and realistic monthly estimates.
Most AI APIs charge by token — the unit that models use to process text. A token is roughly 3-4 characters, or about 0.75 words. "The quick brown fox" is approximately 5 tokens.
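The rules of thumb above can be turned into a rough estimator. This is a minimal sketch, assuming roughly 4 characters and 0.75 words per token for English text; `estimate_tokens` is a hypothetical helper, and real counts depend on each model's tokenizer.

```python
def estimate_tokens(text: str) -> dict[str, int]:
    """Rough token estimates from the character and word heuristics.

    Assumes ~4 characters per token and ~0.75 words per token; actual
    counts depend on the model's tokenizer.
    """
    by_chars = round(len(text) / 4)
    by_words = round(len(text.split()) / 0.75)
    return {"by_chars": by_chars, "by_words": by_words}


print(estimate_tokens("The quick brown fox"))
```

Both heuristics are only suitable for budgeting; for billing-accurate counts, use the provider's own tokenizer or the usage figures returned with each API response.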
Pricing is expressed as cost per million tokens (MTok), split into two rates:

- Input tokens: the text you send to the model (your prompt, context, instructions)
- Output tokens: the text the model generates in response

Output tokens are priced higher than input tokens: in the tables below, output costs 4-5x as much as input. Generation is computationally heavier than processing input.
To make this concrete: a typical short message or question is around 50-200 tokens. A detailed prompt with context might be 1,000-5,000 tokens. A long document or code file could be 10,000-100,000 tokens.
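To see what those sizes mean in dollars, here is a minimal sketch of the per-request arithmetic. `request_cost` is a hypothetical helper; the rates are the GPT-4o figures quoted in this article ($2.50 input, $10.00 output per MTok), and the 500-token response length is an assumption chosen for illustration.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost in dollars of one request; rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Input sizes from the ranges above; the 500-token reply is an assumed value.
for label, input_tokens in [("short question", 200),
                            ("detailed prompt", 5_000),
                            ("long document", 100_000)]:
    cost = request_cost(input_tokens, 500, input_rate=2.50, output_rate=10.00)
    print(f"{label}: ${cost:.4f}")
```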
Prices as of early 2025. These change frequently — always check provider documentation for current rates.
| Model | Input (per MTok) | Output (per MTok) |
|---|---|---|
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.7 Sonnet | $3.00 | $15.00 |
Claude 3.5 Haiku is Anthropic's fast, cost-efficient model. The Sonnet models offer stronger reasoning and are preferred for complex tasks.
| Model | Input (per MTok) | Output (per MTok) |
|---|---|---|
| GPT-4o mini | $0.15 | $0.60 |
| GPT-4o | $2.50 | $10.00 |
| o1 | $15.00 | $60.00 |
| o3 | $10.00 | $40.00 |
GPT-4o mini is notable for its very low cost. The o1/o3 models are reasoning-optimized and significantly more expensive — they "think" before responding, generating internal reasoning tokens that add to the cost.
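The effect of reasoning tokens on a bill can be sketched as follows. This assumes reasoning tokens are billed at the output rate (check your provider's documentation); `reasoning_request_cost` and the token counts are hypothetical, and the rates are the o1 figures from the table above.

```python
def reasoning_request_cost(input_tokens: int, visible_output_tokens: int,
                           reasoning_tokens: int,
                           input_rate: float, output_rate: float) -> float:
    """Dollar cost of one request when hidden reasoning tokens are billed
    as output. Rates are dollars per million tokens."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * input_rate + billed_output * output_rate) / 1_000_000


# Hypothetical request: 1,000 input tokens, 500 visible output tokens.
without = reasoning_request_cost(1_000, 500, 0, 15.00, 60.00)
with_reasoning = reasoning_request_cost(1_000, 500, 2_000, 15.00, 60.00)
print(f"no reasoning: ${without:.3f}, "
      f"with 2,000 reasoning tokens: ${with_reasoning:.3f}")
```

Because the reasoning tokens never appear in the response text, the only reliable way to budget for them is to read the token usage reported with each API response.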
| Model | Input (per MTok) | Output (per MTok) |
|---|---|---|
| Gemini 1.5 Flash | $0.075 | $0.30 |
| Gemini 1.5 Pro | $1.25 | $5.00 |
Gemini 1.5 Flash is one of the cheapest capable models on the market. Gemini 1.5 Pro offers a very large context window (up to 2 million tokens), which matters for processing long documents.
Llama models are free to download and use under Meta's community license, which carries some usage restrictions. However, running them requires infrastructure:

- Self-hosted: you pay for compute (GPU cloud instances), not per token. Cost depends on hardware and utilization.
- Third-party APIs: providers like Together AI, Groq, and Fireworks offer Llama via API at rates typically $0.10-$0.80 per MTok, often cheaper than closed-model alternatives.
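For the self-hosted case, an effective per-MTok cost can be estimated from hardware price and throughput, which makes it comparable to API rates. This is a rough sketch: the hourly GPU price, throughput, and utilization values below are placeholder assumptions rather than measurements, and `self_hosted_cost_per_mtok` is a hypothetical helper.

```python
def self_hosted_cost_per_mtok(gpu_dollars_per_hour: float,
                              tokens_per_second: float,
                              utilization: float) -> float:
    """Effective dollars per million tokens for a self-hosted model.

    utilization is the fraction of paid time, in (0, 1], spent serving requests.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Placeholder numbers for illustration only.
for utilization in (1.0, 0.25):
    cost = self_hosted_cost_per_mtok(2.00, 1_000, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.2f} per MTok")
```

The sketch shows why utilization matters: idle hardware is still billed, so the same instance becomes four times more expensive per token at 25% utilization.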
Assumptions: 10,000 conversations per month, with an average of 800 input tokens and 400 output tokens per conversation. Monthly: 8M input tokens, 4M output tokens.
| Model | Monthly Cost |
|---|---|
| GPT-4o mini | $1.20 + $2.40 = $3.60 |
| Claude 3.5 Haiku | $6.40 + $16.00 = $22.40 |
| GPT-4o | $20.00 + $40.00 = $60.00 |
| Claude 3.5 Sonnet | $24.00 + $60.00 = $84.00 |
For high-volume, cost-sensitive use cases, model selection has a dramatic impact: GPT-4o mini costs $3.60/month versus $84/month for Claude 3.5 Sonnet on the same workload, a roughly 23x difference.
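The table and the 23x figure can be reproduced from the per-MTok rates listed earlier. This is a minimal sketch; the `RATES` dictionary and `monthly_cost` helper are illustrative names, not part of any provider SDK.

```python
# (input rate, output rate) in dollars per million tokens, from the tables above.
RATES = {
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}


def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Monthly dollars for a workload given in millions of tokens."""
    input_rate, output_rate = RATES[model]
    return input_mtok * input_rate + output_mtok * output_rate


# Workload from the scenario: 8M input tokens, 4M output tokens.
costs = {model: monthly_cost(model, 8, 4) for model in RATES}
cheapest = min(costs.values())
for model, cost in costs.items():
    print(f"{model}: ${cost:.2f} ({cost / cheapest:.1f}x the cheapest)")
```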
Assumptions: 3,000 documents per month, 50,000 input tokens per document, 2,000 output tokens per summary. Monthly: 150M input tokens, 6M output tokens.
| Model | Monthly Cost |
|---|---|
| Gemini 1.5 Flash | $11.25 + $1.80 = $13.05 |
| GPT-4o | $375 + $60 = $435 |
| Claude 3.5 Sonnet | $450 + $90 = $540 |
For large-context document processing, Gemini 1.5 Flash's pricing makes it extremely competitive.
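This workload is dominated by input tokens, so the input rate drives most of the bill. A small sketch of that breakdown, using the scenario's 150M input and 6M output tokens and the rates quoted earlier; `input_share` is a hypothetical helper.

```python
def input_share(input_mtok: float, output_mtok: float,
                input_rate: float, output_rate: float) -> float:
    """Fraction of total cost attributable to input tokens."""
    input_cost = input_mtok * input_rate
    output_cost = output_mtok * output_rate
    return input_cost / (input_cost + output_cost)


# (input rate, output rate) in dollars per million tokens.
for model, (in_rate, out_rate) in {
    "Gemini 1.5 Flash": (0.075, 0.30),
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}.items():
    share = input_share(150, 6, in_rate, out_rate)
    print(f"{model}: {share:.0%} of cost is input")
```

When most of the spend is input, comparing models on input rate alone gives a reasonable first approximation for this kind of workload.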
Embeddings are dense vector representations of text, used for semantic search, retrieval-augmented generation (RAG), and similarity matching. They're priced separately from generation.
Embedding costs are usually negligible compared to generation costs unless you're embedding very large document collections frequently.
AI API costs have dropped dramatically and consistently since GPT-3's release. GPT-4 launched at roughly $60 per MTok for output; GPT-4o runs at $10 and GPT-4o mini at $0.60. If that trajectory holds, models that cost $15/MTok for output today will likely cost $1-3/MTok within 18-24 months.
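The projection can be restated as an implied halving time, which is easier to compare against past price cuts. This sketch assumes a smooth exponential decline, which is a simplification since real price changes arrive in steps; `implied_halving_months` is a hypothetical helper.

```python
import math


def implied_halving_months(start_price: float, end_price: float,
                           months: float) -> float:
    """Months for the price to halve, assuming exponential decline from
    start_price to end_price over the given number of months."""
    if not 0 < end_price < start_price:
        raise ValueError("expected 0 < end_price < start_price")
    return months * math.log(2) / math.log(start_price / end_price)


# Bounds of the projection: $15/MTok falling to $3 in 24 months (slow case)
# or to $1 in 18 months (fast case).
slow = implied_halving_months(15.0, 3.0, 24)
fast = implied_halving_months(15.0, 1.0, 18)
print(f"implied halving time: {fast:.1f} to {slow:.1f} months")
```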
This has practical implications for budgeting: your current cost estimates are likely conservative for a 2-3 year horizon. Build your economics around today's prices, but expect significant cost improvement over time.
For any AI-powered feature, estimate:

1. Average tokens per interaction (input + output separately)
2. Volume (interactions per month)
3. Model (cheaper models for high-volume/simple tasks, better models where quality matters)
4. Total monthly cost = (input tokens × input rate) + (output tokens × output rate)
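The four steps can be sketched as a small estimator. `FeatureEstimate` and its field names are hypothetical; the example values are the chatbot assumptions from earlier in the article (800 input and 400 output tokens per conversation) at 10,000 conversations per month, which matches the 8M and 4M token totals, priced at the GPT-4o mini rates.

```python
from dataclasses import dataclass


@dataclass
class FeatureEstimate:
    input_tokens_per_interaction: int   # step 1
    output_tokens_per_interaction: int  # step 1
    interactions_per_month: int         # step 2
    input_rate: float                   # step 3: dollars per million input tokens
    output_rate: float                  # step 3: dollars per million output tokens

    def monthly_cost(self) -> float:
        """Step 4: (input tokens x input rate) + (output tokens x output rate)."""
        input_tokens = self.input_tokens_per_interaction * self.interactions_per_month
        output_tokens = self.output_tokens_per_interaction * self.interactions_per_month
        return (input_tokens * self.input_rate
                + output_tokens * self.output_rate) / 1_000_000


chatbot = FeatureEstimate(800, 400, 10_000, input_rate=0.15, output_rate=0.60)
print(f"${chatbot.monthly_cost():.2f} per month")  # prints "$3.60 per month"
```

Keeping input and output as separate fields matters because the two rates differ by a factor of four or more, so a single blended token count would misestimate the total.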
Start with a cheap model and upgrade only where quality falls short. For most business use cases, GPT-4o mini or Claude Haiku handles the majority of work adequately — reserve the expensive models for tasks where the quality difference is demonstrable.
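One way to reason about this strategy is a blended cost, where a fraction of traffic is escalated to the stronger model. This is a sketch under stated assumptions: `blended_monthly_cost` is hypothetical, the escalation fractions are illustrative, escalated requests are assumed to have the same token profile as the rest, and any cost of deciding which requests to escalate is ignored. The two monthly figures are the GPT-4o mini and Claude 3.5 Sonnet chatbot totals from earlier.

```python
def blended_monthly_cost(cheap_cost: float, premium_cost: float,
                         escalation_fraction: float) -> float:
    """Monthly cost when a fraction of interactions goes to the premium model.

    cheap_cost and premium_cost are the monthly costs of serving the whole
    workload on each model alone.
    """
    if not 0 <= escalation_fraction <= 1:
        raise ValueError("escalation_fraction must be between 0 and 1")
    return ((1 - escalation_fraction) * cheap_cost
            + escalation_fraction * premium_cost)


# Chatbot scenario: $3.60 (GPT-4o mini) vs. $84.00 (Claude 3.5 Sonnet).
for fraction in (0.0, 0.1, 0.5, 1.0):
    cost = blended_monthly_cost(3.60, 84.00, fraction)
    print(f"{fraction:.0%} escalated: ${cost:.2f}/month")
```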