What is a RAG Pipeline?
RAG (Retrieval-Augmented Generation) is a technique where an AI model retrieves relevant information from a knowledge base before generating a response. It combines three stages:
- Embedding: Your documents are split into chunks, converted to vector embeddings, and stored in a vector database (e.g., Pinecone, Weaviate, Chroma, or pgvector)
- Retrieval: When a user asks a question, the question is embedded and the most similar document chunks are retrieved
- Generation: The retrieved context and the user's question are sent to an LLM, which produces a response grounded in that context
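To make the three stages concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions: the `embed` function is a bag-of-words counter standing in for a real embedding model, the "vector database" is an in-memory list, and the generation stage stops at building the prompt rather than calling any LLM. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Stage 1 (Embedding): embed every chunk and store it in memory."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, top_k: int = 2) -> list[str]:
    """Stage 2 (Retrieval): embed the question and return the most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Stage 3 (Generation): assemble the prompt that would be sent to an LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days within the country.",
]
index = build_index(chunks)
question = "How many days do refunds take?"
top = retrieve(index, question, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real pipeline, `embed` would call an embedding model and `build_index` would write to a vector database. The cost implications are the same either way: one embedding call per query, plus one LLM call whose input grows with the retrieved context.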
RAG Cost Components Explained
| Component | Typical Cost (per query) | % of Total | How to Reduce |
|---|---|---|---|
| Embedding (query) | $0.000002–0.000013 | < 1% | Use small embedding models (text-embedding-3-small) |
| Vector Storage | $0.0000001–0.000001 | < 1% | Delete unused embeddings, use quantized storage |
| LLM Generation | $0.0003–0.015 | 95–99% | Use smaller LLMs, limit context tokens, cache responses |
How to Use This Calculator
- Choose embedding model: text-embedding-3-small is 6.5x cheaper than text-embedding-3-large ($0.02 vs $0.13 per 1M tokens)
- Set token counts: Query tokens (usually 50–200) + retrieved context tokens (chunk size × number of chunks retrieved)
- Select LLM: The generation model — this is where 95%+ of cost lives
- Set query volume: Monthly queries to project costs
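The arithmetic behind these inputs can be sketched in a few lines of Python. This is an illustration, not the calculator's actual code: the `RagCostInputs` class and `cost_per_query` function are hypothetical names, the LLM prices are illustrative placeholders rather than quotes for a specific model, and vector storage is left out because it is a negligible share of the total.

```python
from dataclasses import dataclass


@dataclass
class RagCostInputs:
    embed_price_per_m: float       # USD per 1M embedding tokens
    llm_input_price_per_m: float   # USD per 1M LLM input tokens (assumed price)
    llm_output_price_per_m: float  # USD per 1M LLM output tokens (assumed price)
    query_tokens: int              # tokens in the user's question
    context_tokens: int            # tokens of retrieved context
    output_tokens: int             # tokens the LLM generates
    monthly_queries: int


def cost_per_query(x: RagCostInputs) -> dict[str, float]:
    """Break one query's cost into embedding and LLM components (USD)."""
    embedding = x.query_tokens * x.embed_price_per_m / 1_000_000
    # The LLM reads both the question and the retrieved context.
    llm_input = (x.query_tokens + x.context_tokens) * x.llm_input_price_per_m / 1_000_000
    llm_output = x.output_tokens * x.llm_output_price_per_m / 1_000_000
    llm = llm_input + llm_output
    return {"embedding": embedding, "llm": llm, "total": embedding + llm}


inputs = RagCostInputs(
    embed_price_per_m=0.02,       # text-embedding-3-small
    llm_input_price_per_m=0.15,   # illustrative
    llm_output_price_per_m=0.60,  # illustrative
    query_tokens=100,
    context_tokens=2_000,
    output_tokens=300,
    monthly_queries=100_000,
)
costs = cost_per_query(inputs)
monthly = costs["total"] * inputs.monthly_queries
llm_share = costs["llm"] / costs["total"]
print(f"per query: ${costs['total']:.6f}  monthly: ${monthly:.2f}  LLM share: {llm_share:.1%}")
```

Swapping in a different model only means changing the three price fields. The LLM share stays above 95% for any realistic combination, which is why the LLM choice and context size matter most.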
Real-World RAG Cost Examples
Example 1: Customer Support Bot (100K queries/month)
Setup: text-embedding-3-small + GPT-4o mini, 2K context tokens, 300 output tokens
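Cost per query, assuming GPT-4o mini list prices of $0.15 per 1M input tokens and $0.60 per 1M output tokens: $0.000002 (embed, 100-token query) + $0.000495 (LLM: 2,100 input tokens + 300 output tokens) = $0.000497
Monthly cost: 100,000 × $0.000497 = $49.70
Annual cost: 12 × $49.70 = $596.40
That's about 5 cents per 100 queries, or roughly $0.0005 per query.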
Example 2: Legal Document Assistant (10K queries/month)
Setup: text-embedding-3-large + Claude 3.5 Sonnet, 8K context tokens, 500 output tokens
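Cost per query, assuming Claude 3.5 Sonnet list prices of $3 per 1M input tokens and $15 per 1M output tokens: $0.000013 (embed, 100-token query) + $0.0318 (LLM: 8,100 input tokens + 500 output tokens) = $0.031813
Monthly cost: 10,000 × $0.031813 = $318.13
Annual cost: 12 × $318.13 = $3,817.56
If Claude 3.5 Sonnet proves more accurate on your legal documents, that may justify the higher cost per query.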
Example 3: Research Assistant (1M queries/month)
Setup: text-embedding-3-small + Gemini Flash-Lite, 2K context tokens, 200 output tokens
Cost per query: $0.000002 (embed) + $0.000255 (LLM) = $0.000257
Monthly cost: 1,000,000 × $0.000257 = $257.00
Annual cost: 12 × $257.00 = $3,084.00
At 1M queries/month, even cheap per-query costs add up. Caching frequent queries can cut this by 30–60%, depending on how often queries repeat.
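To see how caching changes the bill at this volume, here is a small sketch. It assumes a cache hit costs nothing, which is a simplification since a real cache has its own storage and lookup costs. The function name `monthly_cost_with_cache` is hypothetical, and the per-query figure is the one from Example 3.

```python
def monthly_cost_with_cache(monthly_queries: int, cost_per_query: float, hit_rate: float) -> float:
    """Monthly cost in USD when a fraction `hit_rate` of queries is served from cache.

    Simplifying assumption: cache hits are free; only cache misses pay
    the full embedding + LLM cost.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    misses = monthly_queries * (1.0 - hit_rate)
    return misses * cost_per_query


# Example 3 figures: 1M queries/month at $0.000257 per query.
results = {
    hit_rate: monthly_cost_with_cache(1_000_000, 0.000257, hit_rate)
    for hit_rate in (0.0, 0.3, 0.6)
}
for hit_rate, cost in results.items():
    print(f"cache hit rate {hit_rate:.0%}: ${cost:,.2f}/month")
```

Under this assumption the saving equals the hit rate, so a 30–60% reduction requires 30–60% of queries to be repeats that the cache can match.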
How to Reduce RAG Costs
- Use cheap embedding models: text-embedding-3-small ($0.02 per 1M tokens) instead of text-embedding-3-large ($0.13 per 1M tokens) saves about 85% on embedding costs
- Limit retrieved context: Retrieving 2K tokens instead of 8K cuts the context portion of LLM input costs by 75%
- Use cheaper LLMs for simple queries: Route straightforward questions to Gemini Flash-Lite and reserve premium models for complex ones
- Cache responses: For repeated queries (common in support bots), serve cached responses and skip the embedding and LLM calls entirely
- HyDE or sparse retrieval: Hypothetical Document Embeddings (HyDE) or sparse keyword retrieval such as BM25 can improve retrieval accuracy, reducing the need for large contexts. Note that HyDE adds an extra LLM call per query, so check that the context savings outweigh it
- Delete stale embeddings: Prune your vector DB regularly to reduce storage costs
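Two of the tips above, response caching and model routing, can be combined in one small wrapper. The sketch below is illustrative only: `CachedRouter` is a hypothetical class, the two "models" are stub functions rather than real API clients, the cache matches exact questions after whitespace and case normalization, and the word-count routing rule is a deliberately crude placeholder for whatever complexity signal you actually use.

```python
import hashlib
from typing import Callable


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivially different phrasings share a cache key."""
    return " ".join(question.lower().split())


class CachedRouter:
    def __init__(
        self,
        cheap_model: Callable[[str], str],
        premium_model: Callable[[str], str],
        max_simple_words: int = 12,  # assumed threshold, tune for your traffic
    ) -> None:
        self.cheap_model = cheap_model
        self.premium_model = premium_model
        self.max_simple_words = max_simple_words
        self.cache: dict[str, str] = {}
        self.calls = {"cheap": 0, "premium": 0, "cache": 0}

    def answer(self, question: str) -> str:
        key = hashlib.sha256(normalize(question).encode("utf-8")).hexdigest()
        if key in self.cache:
            # Cache hit: no model call is made.
            self.calls["cache"] += 1
            return self.cache[key]
        # Crude routing rule: short questions go to the cheap model.
        if len(question.split()) <= self.max_simple_words:
            tier, model = "cheap", self.cheap_model
        else:
            tier, model = "premium", self.premium_model
        self.calls[tier] += 1
        self.cache[key] = model(question)
        return self.cache[key]


# Stub models stand in for real LLM clients.
router = CachedRouter(
    cheap_model=lambda q: f"[cheap] {q}",
    premium_model=lambda q: f"[premium] {q}",
)
first = router.answer("What is your refund policy?")
second = router.answer("what is your   refund policy?")  # same key after normalization
third = router.answer(
    "Compare the indemnification clauses in these two contracts and explain "
    "which party bears more risk under each"
)
print(router.calls)
```

Exact-match caching only helps when users repeat the same question. A semantic cache that matches on embedding similarity catches more repeats but needs an embedding call per lookup, so it trades a small embedding cost for a larger LLM saving.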