What is AI Inference Cost?
AI inference is the process of running a trained AI model to generate predictions or outputs. Unlike training, which is largely an up-front cost, inference runs every time you send a prompt to a model, so its cost scales with usage. Inference costs include compute (GPU hours), memory, networking, and operational overhead.
Whether you're using a paid API or self-hosting models, understanding inference costs is critical for pricing your AI product correctly and for deciding whether to build or buy.
GPU Cost Reference 2026
| GPU | Cloud $/hr | Tokens/sec (7B) | Tokens/sec (70B) | Best For |
|---|---|---|---|---|
| NVIDIA H100 (80GB) | $2.00–$3.50 | ~50 tok/s | ~15 tok/s | Production LLM serving |
| NVIDIA A100 (80GB) | $1.00–$2.00 | ~30 tok/s | ~8 tok/s | Cost-effective production |
| NVIDIA A10G (24GB) | $0.50–$1.00 | ~20 tok/s | — | Smaller models, dev/staging |
| RTX 4090 (24GB) | $0.40–$0.80 | ~25 tok/s | — | Budget inference, 7B models |
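The hourly prices and throughput figures in the table can be combined into a rough cost per 1M tokens. The sketch below assumes the tok/s figures describe one GPU serving a single stream at full utilization; batched serving would change the result considerably. The function name `cost_per_million_tokens` is illustrative only.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollars to generate 1M tokens on one GPU (assumes 100% utilization, single stream)."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Low end of the H100 row: $2.00/hr at ~50 tok/s on a 7B model.
h100_7b = cost_per_million_tokens(2.00, 50)
print(f"${h100_7b:.2f} per 1M tokens")  # prints "$11.11 per 1M tokens"
```

Under these assumptions an idle or lightly loaded GPU makes the effective cost per token higher, since the hourly rate is paid whether or not tokens are generated.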
How to Use This Calculator
- Select inference mode: API-based (paid APIs), Self-hosted GPU, or Batch inference
- Enter request volume: Total requests you expect per month
- Set average tokens: Input + output tokens per typical request
- Configure pricing: API cost per 1M tokens, or GPU cost per hour
- Read results: See monthly costs, daily costs, and margin estimates
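The steps above boil down to a few lines of arithmetic. The following is a minimal sketch of what such a calculator might compute, not the calculator's actual code; the function names, the 30-day month, and the way the batch example is split into requests are all assumptions.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month
DAYS_PER_MONTH = 30


def api_monthly_cost(requests, tokens_per_request, price_per_million, discount=0.0):
    """API or batch mode: pay per token. discount=0.5 models a 50% batch discount."""
    total_tokens = requests * tokens_per_request
    return total_tokens / 1_000_000 * price_per_million * (1 - discount)


def gpu_monthly_cost(gpu_count, dollars_per_hour):
    """Self-hosted mode: always-on GPUs are billed regardless of traffic."""
    return gpu_count * dollars_per_hour * HOURS_PER_MONTH


def daily_cost(monthly_cost):
    return monthly_cost / DAYS_PER_MONTH


# API mode: 100K requests of 700 tokens at a flat $0.75 per 1M tokens.
api = api_monthly_cost(100_000, 700, 0.75)

# Batch mode: 10M tokens (hypothetically 10K requests of 1K tokens) at 50% off.
batch = api_monthly_cost(10_000, 1_000, 0.75, discount=0.5)

print(f"API: ${api:.2f}/month (${daily_cost(api):.2f}/day), batch: ${batch:.2f}/month")
```

Keeping the per-token and per-GPU-hour paths as separate functions mirrors the mode selection in the first step: the two pricing models share no inputs beyond request volume.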
Real-World Examples
Example 1: SaaS AI Writing Tool (100K requests/month)
Mode: API-based (GPT-4o mini)
Requests/month: 100,000
Avg tokens: 700 total (500 in + 200 out)
API cost: 700 tokens × 100K requests = 70M tokens; 70M × $0.75/1M (flat rate) = $52.50/month
At $0.01 per user request: $1,000 revenue, $52.50 cost = 94.75% gross margin
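The same margin can be checked per request rather than per month, which makes it easier to see how much room there is to lower the price. This is a small sketch using the example's figures; the helper names are hypothetical.

```python
def cost_per_request(tokens_per_request: int, price_per_million: float) -> float:
    """Inference cost of a single request at a flat per-token rate."""
    return tokens_per_request * price_per_million / 1_000_000


def gross_margin(price_charged: float, cost: float) -> float:
    """Fraction of revenue left after inference cost (ignores all other costs)."""
    return (price_charged - cost) / price_charged


unit_cost = cost_per_request(700, 0.75)   # 700 tokens at $0.75 per 1M
margin = gross_margin(0.01, unit_cost)    # user pays $0.01 per request
print(f"cost/request: ${unit_cost:.6f}, gross margin: {margin:.2%}")
```

Note that this margin covers inference cost only; hosting, support, and payment fees would reduce it further.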
Example 2: Self-Hosted DeepSeek V3 (500K requests/month)
Mode: Self-hosted GPU (A100 on Lambda)
Requests/month: 500,000 → ~694/hour
GPU throughput: 100 requests/hour per A100
GPUs needed: ~7 (always-on)
Monthly cost: 7 × $1.39/hr × 24 × 30 = $7,006/month
vs. API cost at $0.27 input / $1.10 output per 1M tokens: ~$2,625/month → self-hosting is ~2.7x more expensive at this volume
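The GPU sizing in this example can be reproduced with a short calculation. The sketch assumes traffic is spread evenly across a 30-day month, which is the simplification the example itself makes; real deployments usually need headroom for peaks. Function names are illustrative.

```python
import math

HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, as in the example


def gpus_needed(requests_per_month: int, requests_per_gpu_hour: float) -> int:
    """Always-on GPUs required if traffic is perfectly uniform."""
    requests_per_hour = requests_per_month / HOURS_PER_MONTH
    return math.ceil(requests_per_hour / requests_per_gpu_hour)


def self_hosted_monthly_cost(gpu_count: int, dollars_per_hour: float) -> float:
    return gpu_count * dollars_per_hour * HOURS_PER_MONTH


gpus = gpus_needed(500_000, 100)                 # ~694 req/hour at 100 req/hour per GPU
monthly = self_hosted_monthly_cost(gpus, 1.39)   # A100 at $1.39/hr
print(f"{gpus} GPUs, ${monthly:,.2f}/month")
```

Because the GPU count is rounded up, cost moves in steps: a small increase in volume can add a whole GPU, while a small decrease may save nothing.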
Example 3: Batch Processing (10M tokens/month)
Mode: Batch inference (50% discount)
Total tokens: 10,000,000
Batch cost: $0.375/1M (vs $0.75 standard)
Monthly cost: 10M × $0.375/1M = $3.75/month
Batch inference is ideal for non-real-time workloads such as report generation, batch classification, and data enrichment.
Build vs. Buy Decision
Use this decision framework:
| Factor | Use API | Self-Host |
|---|---|---|
| Volume | < 500M tokens/month | > 500M tokens/month |
| Data privacy | Acceptable with a BAA (business associate agreement) | Strict compliance needed |
| Latency SLA | ~200–500 ms | Custom optimization |
| Ops complexity | Zero | High (MLOps team needed) |
| Model control | Provider's models only | Any open-weight model |