RAG Pipeline Cost Calculator 2026

Calculate the full cost of a RAG (Retrieval-Augmented Generation) pipeline: embedding, vector storage, retrieval, and LLM generation costs per query.

Last updated: May 2026 · Embedding pricing: OpenAI, Cohere, Google

Cost Breakdown per Query

Embedding Cost $0.000002
Vector Storage (pro-rata) $0.000000333
LLM Generation Cost $0.000375
Total per Query $0.000377
Embedding % of Total 0.5%
LLM % of Total 99.5%
Monthly Cost (100K queries) $37.70
Annual Cost $452.40
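The summary is a sum of per-query components followed by two projections. The sketch below reproduces that arithmetic using the component values from the summary. The `breakdown` helper is hypothetical. It rounds the per-query total to six decimals before projecting, which is an assumption about how the summary displays its figures, and it takes annual cost as 12 × monthly.

```python
def breakdown(components: dict[str, float], monthly_queries: int) -> dict[str, float]:
    """Return per-query total, each component's share, and monthly/annual projections.

    Hypothetical helper: the per-query total is rounded to 6 decimals before
    projecting, and annual cost is assumed to be 12 x monthly.
    """
    total = round(sum(components.values()), 6)
    result = {"total": total}
    for name, cost in components.items():
        result[f"{name}_share"] = cost / total
    result["monthly"] = total * monthly_queries
    result["annual"] = result["monthly"] * 12
    return result


# Per-query component costs in USD, taken from the summary.
components = {
    "embedding": 0.000002,
    "storage": 0.000000333,
    "llm": 0.000375,
}

summary = breakdown(components, monthly_queries=100_000)
print(f"total per query: ${summary['total']:.6f}")
print(f"embedding share: {summary['embedding_share']:.1%}")
print(f"LLM share:       {summary['llm_share']:.1%}")
print(f"monthly (100K):  ${summary['monthly']:,.2f}")
print(f"annual:          ${summary['annual']:,.2f}")
```

Because the LLM term is roughly 200 times larger than the embedding term, its share stays near 99.5% however the embedding model changes.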

What is a RAG Pipeline?

RAG (Retrieval-Augmented Generation) is a technique where an AI model retrieves relevant information from a knowledge base before generating a response. It combines three stages:

  1. Embedding: Your documents are split into chunks and converted to vector embeddings stored in a vector database (Pinecone, Weaviate, Chroma, pgvector)
  2. Retrieval: When a user asks a question, it's embedded and the most relevant document chunks are retrieved
  3. Generation: The retrieved context + the user's question are sent to an LLM, which writes a response grounded in the retrieved text
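To make the three stages concrete, here is a minimal in-memory sketch. It assumes toy three-dimensional vectors in place of real embedding-model output and a plain Python list in place of a vector database. The chunk texts and the `cosine`, `retrieve`, and `build_prompt` helpers are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Stage 2: rank every stored chunk by similarity and return the top k texts."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Stage 3: combine retrieved context and the question into one LLM prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Stage 1 (offline): chunks stored with toy "embeddings" standing in for model output.
index = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Shipping takes 3-5 business days.", [0.1, 0.9, 0.1]),
    ("Our office is closed on Sundays.", [0.0, 0.2, 0.9]),
]

# Stage 2: the question's toy embedding, then top-1 retrieval.
query_vec = [0.8, 0.2, 0.1]
chunks = retrieve(query_vec, index, k=1)

# Stage 3: the prompt that would be sent to the LLM.
prompt = build_prompt("How long do refunds take?", chunks)
print(prompt)
```

Sorting every chunk is fine for a toy. Real vector databases use approximate nearest-neighbor indexes so retrieval stays fast with millions of chunks.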

RAG Cost Components Explained

Component           Typical cost per query   Share of total   How to reduce
Embedding (query)   $0.000002–0.000013       < 1%             Use small embedding models (text-embedding-3-small)
Vector storage      $0.0000001–0.000001      < 1%             Delete unused embeddings, use quantized storage
LLM generation      $0.0003–0.015            95%+             Use smaller LLMs, limit context tokens, cache responses
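Storage costs little per query because it is a fixed monthly bill spread across every query, and quantization shrinks that bill by storing fewer bytes per dimension. The rough sketch below shows both effects. It assumes raw vector size only, with no index overhead or metadata, and uses the default output dimensions of text-embedding-3-small (1,536) and text-embedding-3-large (3,072). The monthly bill and the helpers `raw_vector_gb` and `storage_cost_per_query` are hypothetical.

```python
def raw_vector_gb(num_vectors: int, dims: int, bytes_per_value: int) -> float:
    """Raw size of stored vectors in GB (10^9 bytes); ignores index and metadata overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


def storage_cost_per_query(monthly_storage_bill: float, monthly_queries: int) -> float:
    """Pro-rata storage cost: the fixed monthly bill divided across all queries."""
    return monthly_storage_bill / monthly_queries


for model, dims in (("text-embedding-3-small", 1536), ("text-embedding-3-large", 3072)):
    f32 = raw_vector_gb(1_000_000, dims, 4)  # float32: 4 bytes per dimension
    i8 = raw_vector_gb(1_000_000, dims, 1)   # int8 quantization: 1 byte per dimension
    print(f"{model}: 1M chunks = {f32:.2f} GB as float32, {i8:.2f} GB as int8")

# Hypothetical $0.0333/month storage bill spread over 100K queries.
print(f"${storage_cost_per_query(0.0333, 100_000):.9f} per query")
```

int8 quantization cuts raw vector storage by 4x, though it can cost a little retrieval accuracy.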

How to Use This Calculator

  1. Choose an embedding model: text-embedding-3-small ($0.02/1M tokens) is 6.5x cheaper than text-embedding-3-large ($0.13/1M tokens)
  2. Set token counts: Query tokens (usually 50–200) plus retrieved context (chunk size × number of chunks retrieved)
  3. Select an LLM: The generation model, which accounts for 95%+ of the cost
  4. Set query volume: Expected monthly queries, used to project monthly and annual costs
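The first two steps reduce to simple arithmetic. This sketch assumes the LLM input is the query plus `top_k` retrieved chunks, with an optional system prompt, and uses the embedding prices listed on this page. The function names and the 4 × 500-token retrieval setup are hypothetical.

```python
# USD per 1M tokens, as listed on this page.
EMBED_PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def query_embedding_cost(query_tokens: int, model: str) -> float:
    """Per-query cost of embedding the user's question (documents are embedded at indexing time)."""
    return query_tokens * EMBED_PRICE_PER_1M[model] / 1_000_000


def llm_input_tokens(query_tokens: int, chunk_tokens: int, top_k: int, system_tokens: int = 0) -> int:
    """Tokens sent to the LLM: system prompt + query + top_k retrieved chunks."""
    return system_tokens + query_tokens + chunk_tokens * top_k


print(f"small: ${query_embedding_cost(100, 'text-embedding-3-small'):.6f} per query")
print(f"large: ${query_embedding_cost(100, 'text-embedding-3-large'):.6f} per query")
print(f"LLM input: {llm_input_tokens(query_tokens=100, chunk_tokens=500, top_k=4)} tokens")
```

Retrieved context usually dominates the input token count, so chunk size and `top_k` are the main levers on LLM input cost.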

Real-World RAG Cost Examples

Example 1: Customer Support Bot (100K queries/month)

Setup: text-embedding-3-small + GPT-4o mini, 2K context tokens, 300 output tokens

Cost per query: $0.000002 (embed) + $0.000480 (LLM: 2,000 input tokens × $0.15/1M + 300 output tokens × $0.60/1M) = $0.000482

Monthly cost: 100,000 × $0.000482 = $48.20

Annual cost: 12 × $48.20 = $578.40

That's 4.8 cents per 100 queries, or about $0.00048 per single-query conversation.
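The Example 1 calculation can be sketched as follows. It assumes GPT-4o mini list prices of $0.15 per 1M input tokens and $0.60 per 1M output tokens, a 100-token query, and that the 2K context is the full LLM input. Check the provider's current pricing before relying on these numbers. `llm_cost` is a hypothetical helper.

```python
def llm_cost(input_tokens: int, output_tokens: int, in_per_1m: float, out_per_1m: float) -> float:
    """LLM cost in USD for one call, given per-1M-token input and output prices."""
    return (input_tokens * in_per_1m + output_tokens * out_per_1m) / 1_000_000


embed = 100 * 0.02 / 1_000_000          # 100-token query on text-embedding-3-small
llm = llm_cost(2_000, 300, 0.15, 0.60)  # assumed GPT-4o mini prices (USD per 1M tokens)
per_query = embed + llm
monthly = per_query * 100_000
annual = monthly * 12

print(f"per query: ${per_query:.6f}")
print(f"monthly:   ${monthly:,.2f}")
print(f"annual:    ${annual:,.2f}")
```

Output tokens are priced 4x higher than input tokens here, so the 300-token answer costs more than half as much as the 2,000-token context.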

Example 2: Legal Document Assistant (10K queries/month)

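Setup: text-embedding-3-large + Claude 3.5 Sonnet, 8K context tokens, 500 output tokens

Cost per query: $0.000013 (embed) + $0.0315 (LLM: 8,000 input tokens × $3/1M + 500 output tokens × $15/1M) = $0.031513

Monthly cost: 10,000 × $0.031513 = $315.13

Annual cost: 12 × $315.13 = $3,781.56

At over $0.03 per query, a premium model is worth it only if its accuracy on your legal documents is measurably better than a cheaper model's.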
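The same arithmetic applies to Example 2, assuming Claude 3.5 Sonnet list prices of $3 per 1M input tokens and $15 per 1M output tokens and a 100-token query. The sketch also compares a hypothetical 2K-context variant to show how much of the bill the 8K context drives.

```python
def llm_cost(input_tokens: int, output_tokens: int, in_per_1m: float, out_per_1m: float) -> float:
    """LLM cost in USD for one call, given per-1M-token input and output prices."""
    return (input_tokens * in_per_1m + output_tokens * out_per_1m) / 1_000_000


SONNET_IN, SONNET_OUT = 3.00, 15.00  # assumed USD per 1M tokens

embed = 100 * 0.13 / 1_000_000  # 100-token query on text-embedding-3-large
full = embed + llm_cost(8_000, 500, SONNET_IN, SONNET_OUT)
trimmed = embed + llm_cost(2_000, 500, SONNET_IN, SONNET_OUT)  # hypothetical 2K-context variant

input_share = (8_000 * SONNET_IN / 1_000_000) / full
print(f"8K context: ${full:.6f}/query, ${full * 10_000:,.2f}/month ({input_share:.0%} from input tokens)")
print(f"2K context: ${trimmed:.6f}/query, ${trimmed * 10_000:,.2f}/month")
```

Trimming context cuts cost by more than half here. It only helps if the dropped chunks were not needed to answer the question.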

Example 3: Research Assistant (1M queries/month)

Setup: text-embedding-3-small + Gemini Flash-Lite, 2K context tokens, 200 output tokens

Cost per query: $0.000002 (embed) + $0.000255 (LLM) = $0.000257

Monthly cost: 1,000,000 × $0.000257 = $257.00

Annual cost: 12 × $257.00 = $3,084.00

At 1M queries/month, even cheap per-query costs add up. Caching frequent queries can reduce this by roughly 30–60%, depending on how often queries repeat.
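The caching estimate can be modeled roughly using Example 3's $0.000257 per query. This sketch assumes cache hits cost nothing and ignores lookup and cache-storage costs. `monthly_cost_with_cache` and the hit rates are hypothetical.

```python
def monthly_cost_with_cache(queries: int, cost_per_query: float, hit_rate: float) -> float:
    """Only cache misses reach the embedding model and LLM; hits are treated as free."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return queries * cost_per_query * (1.0 - hit_rate)


for hit_rate in (0.0, 0.3, 0.6):
    cost = monthly_cost_with_cache(1_000_000, 0.000257, hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${cost:,.2f}/month")
```

Under this model the saving equals the hit rate. A 30–60% reduction therefore implies that 30–60% of monthly queries are answered from the cache.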

How to Reduce RAG Costs

  • Use cheap embedding models: text-embedding-3-small ($0.02/1M tokens) instead of text-embedding-3-large ($0.13/1M tokens) saves about 85% on embedding costs
  • Limit retrieved context: Retrieving 2K tokens instead of 8K cuts the context portion of LLM input costs by 75%
  • Use cheaper LLMs for simple queries: Route to Gemini Flash-Lite for straightforward questions, premium models only for complex ones
  • Cache responses: For repeated queries (common in support bots), serve a cached response instead of calling the LLM again
  • HyDE or sparse retrieval: Hypothetical Document Embeddings (HyDE) or sparse and hybrid retrieval (e.g. BM25 keyword matching) can improve retrieval precision, so fewer chunks need to be sent to the LLM. HyDE itself adds an LLM call per query to write the hypothetical document, so weigh that cost against the context savings
  • Delete stale embeddings: Prune your vector DB regularly to reduce storage costs
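Here is a minimal sketch of response caching, assuming exact matching after lowercasing and collapsing whitespace. `ResponseCache` and `fake_llm` are hypothetical stand-ins, not a library API. Production caches typically also need an expiry policy so answers are refreshed when the underlying documents change.

```python
from typing import Callable


class ResponseCache:
    """Exact-match response cache keyed on a normalized query (hypothetical helper)."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate  # the expensive retrieval + LLM call
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so trivially different inputs share one entry.
        return " ".join(query.lower().split())

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._generate(query)
        self._store[key] = response
        return response


calls: list[str] = []


def fake_llm(query: str) -> str:
    """Stand-in for the real pipeline; records each call so they can be counted."""
    calls.append(query)
    return f"answer to: {query}"


cache = ResponseCache(fake_llm)
for q in ["How do I reset my password?", "how do I reset my   password?", "Where is my order?"]:
    cache.answer(q)
print(f"LLM calls: {len(calls)}, cache hits: {cache.hits}")
```

Exact matching misses paraphrases such as "I forgot my password". Semantic caches address this by comparing query embeddings, at the cost of one embedding call per lookup.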