RAG Pipeline Cost Calculator
Work out what a Retrieval-Augmented Generation pipeline actually costs to run, before you commit to one. Embedding the corpus, paying the vector database, and paying the LLM each time someone asks a question, all priced out and broken down month by month. Pricing is baked in and dated, so you know how fresh the figures are.
Explain like I'm 5 (what even is this calculator?)
RAG is the trick where you give an AI access to your own documents instead of relying on whatever it half-remembers from training. You pay three different bills: one to turn your documents into vectors, one to keep those vectors in a special database, and one to the LLM every time it answers a question using them. This calculator adds all three up, so you can stop guessing.
Calculate
Press Calculate to see the bill.
Cost breakdown
- One-off corpus embedding: –
- Monthly embedding (net new): –
- Monthly vector database: –
- Monthly LLM queries: –
- Per-query cost: –
- First-month total: –
- Ongoing monthly total: –
Month by month
First month carries the one-off corpus embedding cost. Every month after that is the ongoing run-rate.
| Month | Embedding | Vector DB | LLM | Total |
|---|---|---|---|---|
Same workload, swap the LLM
Pricing baked in, USD. Last verified: —.
| LLM | Per query | Ongoing month | First month |
|---|---|---|---|
Prove it
Embedding cost is (tokens ÷ 1,000,000) × the embedding model's price per million. LLM per-query cost adds the retrieved context and the user prompt as input tokens, the response as output tokens, and prices each side at the model's list rate. Monthly LLM cost multiplies the per-query figure by queries per day, then by 30.4375 (365.25 ÷ 12) so monthly and annual stay self-consistent. Vector DB is a flat monthly figure at the smallest viable tier per provider. Pricing is the standard non-cached, non-batch list price published by each provider.
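If you want the same arithmetic in code, here is a minimal sketch. Every price and workload number below is an illustrative placeholder, not one of the calculator's baked-in rates; substitute your own providers' list prices.

```python
# Minimal sketch of the cost model described above. All rates are placeholders.
EMBED_PRICE_PER_M = 0.02       # $ per million embedding tokens (placeholder)
LLM_INPUT_PER_M = 3.00         # $ per million input tokens (placeholder)
LLM_OUTPUT_PER_M = 15.00       # $ per million output tokens (placeholder)
VECTOR_DB_MONTHLY = 15.00      # flat monthly fee, smallest viable tier (placeholder)
DAYS_PER_MONTH = 365.25 / 12   # 30.4375, keeps monthly and annual self-consistent

def embedding_cost(tokens: float) -> float:
    return tokens / 1_000_000 * EMBED_PRICE_PER_M

def per_query_cost(context_tokens: int, prompt_tokens: int, response_tokens: int) -> float:
    # Retrieved context + user prompt are input tokens; the response is output.
    input_cost = (context_tokens + prompt_tokens) / 1_000_000 * LLM_INPUT_PER_M
    output_cost = response_tokens / 1_000_000 * LLM_OUTPUT_PER_M
    return input_cost + output_cost

def monthly_totals(corpus_tokens, new_tokens_per_month, queries_per_day,
                   context_tokens, prompt_tokens, response_tokens):
    llm = per_query_cost(context_tokens, prompt_tokens, response_tokens) \
          * queries_per_day * DAYS_PER_MONTH
    ongoing = embedding_cost(new_tokens_per_month) + VECTOR_DB_MONTHLY + llm
    first = ongoing + embedding_cost(corpus_tokens)  # first month carries the corpus build
    return first, ongoing

first, ongoing = monthly_totals(corpus_tokens=50_000_000, new_tokens_per_month=1_000_000,
                                queries_per_day=500, context_tokens=3_000,
                                prompt_tokens=200, response_tokens=400)
print(f"First month ${first:,.2f}, ongoing ${ongoing:,.2f}/month")
```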
What RAG actually is, in plain English
You have a stack of documents (a help centre, a product manual, the last five years of board minutes, whatever it is). You want an AI that can answer questions about that stack without making things up. The trick is Retrieval-Augmented Generation: you slice the documents into chunks, turn each chunk into a vector (a list of numbers that captures its meaning), and store the vectors in a database designed to find similar ones quickly. When someone asks a question, you turn the question into a vector, pull back the chunks closest to it, paste those chunks into the prompt, and let the LLM answer using them as reference material.
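If the flow is easier to follow as code, here is a toy sketch of the query-time half. The embed() function is a deliberate stand-in (a word-hashing trick), not a real embedding model, and the in-memory list stands in for the vector database; a real pipeline calls an embedding API and a proper store.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy embedding: hash each word into a bucket, then normalise.
    # Real pipelines call an embedding model's API here.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Index time: chunk the corpus and store (chunk, vector) pairs.
chunks = ["Refunds are processed within 14 days.",
          "Support is available Monday to Friday.",
          "Annual plans are billed in advance."]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query time: embed the question, pull the nearest chunks, build the prompt.
question = "How long do refunds take?"
q = embed(question)
top_k = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)[:2]
prompt = ("Answer using only this context:\n"
          + "\n".join(chunk for chunk, _ in top_k)
          + f"\n\nQuestion: {question}")
print(prompt)  # this is what the LLM sees, and what you pay input tokens on
```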
The cost shows up in three places. You pay an embedding model once per chunk, every time something gets indexed. You pay a vector database to keep those vectors hot and queryable. And you pay the LLM every single time someone asks a question, because every retrieved chunk lands in the prompt as input tokens. The third one is where most of the bill lives.
When self-hosting pgvector beats a managed vector DB
If you already run Postgres, and your vector count is in the low millions, pgvector is the cheapest sensible answer. A small managed Postgres on Supabase, Neon, or Render runs around fifteen quid a month with pgvector enabled, which is roughly a fifth of what a Pinecone Starter pod costs. You also get to keep retrieval data in the same database as your application data, which makes per-tenant access control and joins trivial instead of a faff.
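As a rough sketch of what that looks like in practice, assuming Postgres with the vector extension available and the psycopg driver (table and column names here are made up for illustration):

```python
import psycopg

# Your existing application database; autocommit keeps the DDL simple.
conn = psycopg.connect("postgresql://localhost/app", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        tenant_id bigint NOT NULL,   -- per-tenant access control is just a WHERE clause
        body      text NOT NULL,
        embedding vector(1536)       -- match your embedding model's dimensions
    )
""")
# HNSW index: the thing that makes pgvector competitive at low-millions scale.
conn.execute("CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
             "ON chunks USING hnsw (embedding vector_cosine_ops)")

def top_k(question_vector: list[float], tenant_id: int, k: int = 5) -> list[str]:
    # <=> is pgvector's cosine-distance operator.
    rows = conn.execute(
        "SELECT body FROM chunks WHERE tenant_id = %s "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (tenant_id, str(question_vector), k),
    )
    return [body for (body,) in rows]
```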
It stops winning at scale. Once you cross ten million vectors, or you need sub-50ms p99 latency under heavy concurrent load, the dedicated vector databases start pulling away. Pinecone, Qdrant, and Weaviate have spent years tuning ANN indexes for exactly that workload. pgvector has become very good with HNSW, but the ops burden of running it well at scale is real, and at some point paying someone else to do that is the right call.
The honest middle path: prototype on pgvector, switch when the metrics tell you to, not before.
Why caching matters more than which LLM you pick
Run any production RAG workload for a week and you will notice something: roughly 80% of queries cluster around a small set of chunks. People ask the same questions, look up the same policy pages, hit the same FAQ entries. Which means the same retrieved context lands in the prompt over and over.
Anthropic, OpenAI, and Google all offer prompt caching. Mark a prefix of the prompt as cacheable (typically the system prompt and any large reference document), and repeat hits cost a fraction of the list price. Anthropic's discount is around 90 percent on cache hits. OpenAI's is 50 percent. Google's varies by model but lands in a similar range.
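For concreteness, here is roughly what that looks like with Anthropic's Python SDK; the model name, prompt text, and question are placeholders, and the other providers have their own equivalents (OpenAI applies caching automatically to repeated prompt prefixes).

```python
import anthropic

SYSTEM_PROMPT = "You are a support assistant. Answer only from the provided context."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: use whichever model you actually run
    max_tokens=500,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,                   # the stable prefix; a large reference
        "cache_control": {"type": "ephemeral"},  # document can be marked the same way
    }],
    messages=[{"role": "user", "content": "How long do refunds take?"}],
)
print(response.content[0].text)
```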
The implication is concrete: a Claude Sonnet pipeline with caching turned on will routinely undercut a DeepSeek pipeline without it, despite Sonnet being twenty times more expensive on paper. Before you switch LLMs to save money, switch on caching first.
Trimming a RAG bill, in rough order of impact
If the number above made you wince, the high-leverage moves are usually the following (a rough worked sketch of the savings follows the list):
- Drop your top-k. Most pipelines retrieve more chunks than they need. Going from k=10 to k=5 halves the context tokens and most of the time the answer quality is identical or better (the model has fewer distractions).
- Turn on prompt caching. If your system prompt and instructions are stable, this is free money.
- Add a reranker. A cheap reranker (Cohere Rerank, Voyage Rerank) lets you retrieve more candidates, then keep only the best three or four. Lower context cost, better answers.
- Cap the response. Set max_tokens. Models love to ramble, and every rambled token is billed at the pricier output rate.
- Smaller LLM, bigger context. Most retrieval-grounded questions do not need the flagship. Sonnet, Haiku, GPT-5.4 mini, Gemini Flash, and DeepSeek will all happily summarise three relevant chunks for a fraction of Opus money.
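To put rough numbers on the first and fourth levers, here is a quick sketch; every figure in it is a placeholder rather than a rate from the comparison table above.

```python
# Per-query cost before and after trimming top-k and capping max_tokens.
# $3/M input, $15/M output, and 600-token chunks are all placeholders.
IN_RATE, OUT_RATE = 3.00 / 1_000_000, 15.00 / 1_000_000
CHUNK_TOKENS, PROMPT_TOKENS = 600, 200

def query_cost(k: int, response_tokens: int) -> float:
    return (k * CHUNK_TOKENS + PROMPT_TOKENS) * IN_RATE + response_tokens * OUT_RATE

before = query_cost(k=10, response_tokens=900)  # generous retrieval, rambling answer
after = query_cost(k=5, response_tokens=300)    # trimmed top-k, capped max_tokens
print(f"${before:.4f} -> ${after:.4f} per query ({1 - after / before:.0%} saved)")
```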
Honest caveats
The figures above are list price, non-cached, non-batch. Embedding token counts assume each chunk is sent once; if you re-embed on every change to a document, multiply accordingly. Vector DB cost is the smallest viable tier per provider, which works for the first few million vectors and falls over after that. The LLM column does not model fine-tuned models, image or audio inputs, or any of the volume discounts the bigger providers will quietly offer once your spend justifies a phone call.
Related calculators
RAG cost is the headline. These break it apart and stand alongside it.
Frequently asked questions
What is RAG?
Retrieval-Augmented Generation. You take a body of documents, split them into chunks, turn each chunk into a vector with an embedding model, and store the vectors in a database. When a user asks a question you embed the question, pull the most similar chunks back out, and stuff those chunks into the LLM prompt as context. The model answers using your data instead of guessing from training.
Why is the first month more expensive?
Because you only pay to embed the existing corpus once. After that, the running cost is just embedding new content as it arrives, the fixed monthly vector DB bill, and the per-query LLM cost. The calculator separates first-month and ongoing-month totals so the corpus build does not skew your sense of what the pipeline costs to run.
When does pgvector beat a managed vector DB?
When you already run Postgres, your vector count is in the low millions, and you do not need bleeding-edge ANN performance. pgvector on a small managed Postgres is roughly a fifth of the price of a Pinecone Starter pod, and it keeps your retrieval data in the same store as your application data. It stops winning when you cross ten million vectors or need sub-50ms p99 latency at scale.
Does this calculator include caching savings?
No. List prices only. In practice, prompt caching on the LLM side knocks 50 to 90 percent off repeated input tokens, which matters because most RAG workloads inject a near-identical system prompt every time. Your real bill on production traffic with caching turned on will usually come in 30 to 60 percent below the figure here.
Why does context size move the bill so much?
Because every query pays for every retrieved token as input. If you stuff 10,000 tokens of context into each call instead of 3,000, the input cost more than triples. Top-k tuning, chunk size, and reranking are the three biggest levers on RAG cost, well ahead of which LLM you pick.
Does this calculator send my numbers anywhere?
No. Everything runs in your browser. Nothing gets uploaded.