Retrieval-Augmented Generation looks deceptively simple in a notebook. You chunk a few PDFs, drop the embeddings into a vector database, retrieve the top-k results for each query, and let an LLM stitch together an answer. A demo built that way will impress a stakeholder for fifteen minutes.
The same architecture in production, with real users, real volumes, and real edge cases, breaks in week three. After delivering 50+ enterprise RAG systems across BFSI, healthcare, retail, and legal, we have learned which patterns survive the trip from notebook to production — and which look fine in demos but quietly destroy projects six months in.
This is what actually works.
Why most RAG demos do not survive production
The honest answer: production RAG is a retrieval problem first and an LLM problem second.
The single biggest cause of poor RAG quality is bad retrieval. If the right context never makes it into the LLM's prompt, no model — not GPT-5, not Claude 4.5, not Gemini Ultra — will generate the right answer. People obsess over which model to use. They should be obsessing over how their data is chunked, indexed, and ranked.
The second biggest cause is evaluation drift. The retrieval that worked on day one degrades silently as your corpus grows. Without continuous evaluation, you only learn there is a problem when a customer complains.
The third is cost surprise. Token usage scales with corpus size, query volume, and reranking aggressiveness. A pilot that cost $200/month at launch becomes a $14K/month line item by month nine if nobody planned for it.
The 4 architecture patterns that work
Pattern 1 — Naive RAG (start here, but do not stop here)
The textbook flow: query → embed → vector search → top-k → prompt → answer. Every team builds this first. It reaches roughly 60% accuracy on most enterprise corpora and is a perfectly reasonable v0.
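In code, the whole pattern fits in a dozen lines. The sketch below assumes a generic setup: `embed`, `vector_index.search`, and `llm` are placeholders for whichever embedding model, vector database, and LLM you actually run.

```python
# Minimal sketch of the naive flow; embed(), vector_index.search(), and llm()
# are placeholders, not a specific vendor's API.

def naive_rag(query: str, vector_index, embed, llm, k: int = 5) -> str:
    query_vec = embed(query)                     # 1. embed the query
    hits = vector_index.search(query_vec, k=k)   # 2. top-k nearest chunks
    context = "\n\n".join(hit.text for hit in hits)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)                           # 3. single generation pass
```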
When it stops being enough: the moment users ask multi-hop questions, comparative questions, or anything that requires synthesizing multiple sources. Naive RAG retrieves passages independently — it has no way to know they connect.
Pattern 2 — Hybrid retrieval (almost always worth the effort)
Combine dense retrieval (vector similarity, semantic) with sparse retrieval (BM25, keyword). They fail in opposite directions: dense misses literal matches like part numbers and codes; sparse misses paraphrasing.
Production setup: run both in parallel, then merge results with Reciprocal Rank Fusion (RRF) before reranking. On most enterprise corpora this lifts recall@10 by 8–14 points compared to dense-only — for the cost of one extra index.
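The merge step is simpler than it sounds. Below is a minimal RRF sketch; `bm25_ids` and `dense_ids` are assumed to be ranked lists of document IDs from the sparse and dense indexes.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k: int = 60, top_n: int = 10):
    """Merge ranked lists of doc IDs (one from BM25, one from dense search).

    Each document scores 1 / (k + rank) per list it appears in; k=60 is the
    commonly used constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# fused = reciprocal_rank_fusion([bm25_ids, dense_ids])  # then hand off to the reranker
```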
Pattern 3 — Agentic RAG (when queries are complex)
Instead of one retrieval pass, an LLM acts as a planner: it decomposes the query into sub-queries, retrieves for each, evaluates whether it has enough context, and decides whether to retrieve more. Think of it as ReAct applied to retrieval.
Use this when: queries span multiple documents, require comparison, or have conditional logic ("if X then look for Y"). Skip it when: queries are simple lookups — agentic RAG adds 2–4× latency and cost without benefit on simple queries.
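A hedged sketch of the planner loop, with `decompose`, `retrieve`, `is_sufficient`, and `answer` as injected placeholders for the underlying LLM calls and your existing retrieval pipeline:

```python
# Sketch only: the four callables are assumptions, not a specific framework's API.

def agentic_rag(query, decompose, retrieve, is_sufficient, answer, max_rounds: int = 3):
    sub_queries = decompose(query)               # LLM splits the question into sub-queries
    context = []
    for _ in range(max_rounds):
        for sq in sub_queries:
            context.extend(retrieve(sq))         # one retrieval pass per sub-query
        verdict = is_sufficient(query, context)  # LLM judges whether coverage is enough
        if verdict.enough:
            break
        sub_queries = verdict.follow_up_queries  # plan another retrieval round
    return answer(query, context)                # final grounded generation
```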
Pattern 4 — GraphRAG (for highly-connected knowledge)
If your data has explicit relationships — entity graphs, regulatory dependency chains, drug-interaction networks — vector similarity throws away the structure. GraphRAG builds a knowledge graph at index time and uses graph traversal alongside vector search at query time.
This is the right pattern for: legal precedent chains, supply chain dependency analysis, drug interaction systems, fraud network investigation. It is the wrong pattern for: customer support FAQs, marketing content libraries, document Q&A on flat content.
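At query time, one common shape is to use vector search for entry points and the graph for expansion. The sketch below assumes a `networkx` graph built at index time (nodes are entities or chunks, edges are extracted relationships) and a hypothetical `vector_search` helper whose hits are tagged with graph node IDs.

```python
import networkx as nx

def graph_rag_candidates(query: str, graph: nx.Graph, vector_search, hops: int = 1):
    seeds = [hit.node_id for hit in vector_search(query, k=5)]   # semantic entry points
    candidates = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):                     # follow explicit relationships outward
        frontier = {nbr for node in frontier for nbr in graph.neighbors(node)}
        candidates |= frontier
    return candidates                         # rerank these, then build the prompt
```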
For more on when AI infrastructure investment pays off, see our AI Development Cost in 2026 guide.
Choosing your vector database
The decision matrix we use with clients reduces to one question: which option requires the fewest new operational competencies for your team? A perfect vector DB nobody on your team knows how to debug at 3am is worse than a good-enough one your DBAs can already manage.
Chunking — the most under-appreciated decision
Chunking is where most RAG quality is won or lost, and where most teams spend the least time.
Fixed-size chunking
Simple. Splits text every N tokens. Works for uniform content (chat logs, transcripts). Destroys structure in formatted documents (PDFs with tables, contracts with clauses).
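A minimal sketch, assuming the text is already tokenized with the tokenizer that matches your embedding model; the overlap keeps sentences that straddle a boundary from being cut in half on both sides.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 64):
    """Split a token list into fixed windows with overlap."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```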
Semantic chunking
Splits at sentence/paragraph boundaries based on embedding similarity. Works for narrative content. Slow at index time and brittle on heavily-formatted documents.
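A hedged sketch of the idea: embed consecutive sentences and start a new chunk wherever similarity to the previous sentence drops. Both `embed` and the 0.75 threshold are assumptions to swap for your own model and tuning.

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[str]:
    if not sentences:
        return []
    vecs = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(vecs, vecs[1:], sentences[1:]):
        sim = prev @ cur / (np.linalg.norm(prev) * np.linalg.norm(cur) + 1e-9)
        if sim < threshold:                 # similarity dropped: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```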
Document-structure chunking
Respects native document structure — Markdown headings, contract sections, HTML semantic tags. This is what we default to for any structured content. Section headings become metadata; chunks inherit hierarchy.
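For Markdown-like content, a minimal version looks like the sketch below. It flattens the hierarchy to the nearest heading for brevity; a production version would carry the full heading path as metadata.

```python
import re

def markdown_section_chunks(markdown_text: str):
    """Split on Markdown headings and keep the heading as chunk metadata."""
    chunks, heading, buffer = [], "ROOT", []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):          # a new section starts here
            if buffer:
                chunks.append({"heading": heading, "text": "\n".join(buffer).strip()})
            heading, buffer = line.lstrip("# ").strip(), []
        else:
            buffer.append(line)
    if buffer:
        chunks.append({"heading": heading, "text": "\n".join(buffer).strip()})
    return chunks
```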
Late chunking (a 2025 trend worth adopting)
Embed the full document with a long-context embedding model first, then chunk afterwards. Each chunk's embedding carries context from the full document. Significantly better recall on multi-paragraph documents at the cost of higher embedding compute.
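A hedged sketch using a long-context embedding model from Hugging Face (jina-embeddings-v2-base-en is one example); `chunk_spans` is assumed to be the (start, end) token offsets you computed when splitting the document.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example model choice, not a requirement; any long-context embedding model
# that exposes per-token hidden states works the same way.
MODEL = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk_embeddings(document: str, chunk_spans):
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_states = model(**inputs).last_hidden_state[0]   # full-document token embeddings
    # Mean-pool each chunk's tokens; every chunk vector has "seen" the whole document.
    return [token_states[start:end].mean(dim=0) for start, end in chunk_spans]
```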
We rebuilt our RAG three times before we accepted that chunking was the bottleneck. Once we got chunking right, the same model that was scoring 64% accuracy started scoring 89%.
Reranking — the unsung hero
After retrieval returns the top-50 candidates, a reranker reorders them by true relevance to the query. This single step is the highest-ROI change most teams can make to an existing RAG.
What works in production:
- Cross-encoder rerankers (Cohere Rerank, BGE-reranker-large, Jina Reranker). Hosted ones are simpler; open-source ones are cheaper at scale.
- LLM-based rerankers for high-value queries only. Higher quality but 10–50× the latency. Use selectively.
- Two-stage retrieval: cast a wide net (top-50 from vector), then rerank to top-5. Keeps latency reasonable while improving accuracy.
Adding a reranker typically lifts top-3 precision by 12–25 points — the largest single jump available without changing the base model.
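A minimal sketch of that second stage using the open-source BGE reranker via sentence-transformers; `candidates` is assumed to be the top-50 passages already returned by hybrid retrieval.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Score every (query, passage) pair with the cross-encoder, keep the best few.
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]
```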
Evaluation — the part that gets skipped
Every team plans to "set up evals later." Most never do, and those teams quietly ship regressions for months.
The minimum viable evaluation harness:
- A held-out test set of 100–300 real queries from your domain, each with the correct answer + the correct source passages
- Retrieval metrics: Recall@10, MRR, NDCG@10
- Generation metrics: faithfulness (does the answer use only the retrieved context?), answer relevance (does it actually answer the question?)
- Automated runs on every code change, every model upgrade, every corpus refresh
We use Ragas or DeepEval for the harness. The choice of tool matters less than committing to the practice. Without continuous evaluation, you cannot tell improvement from luck.
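The retrieval half of that harness needs no framework at all. A minimal sketch, assuming a test set where each entry records the IDs your retriever returned and the ground-truth passage IDs:

```python
def recall_at_k(retrieved_ids, relevant_ids, k: int = 10) -> float:
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr(retrieved_ids, relevant_ids) -> float:
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def run_eval(test_set):
    # test_set: list of dicts with "retrieved" and "relevant" ID lists,
    # built from the 100-300 held-out queries described above.
    n = len(test_set)
    return {
        "recall@10": sum(recall_at_k(x["retrieved"], x["relevant"]) for x in test_set) / n,
        "mrr": sum(mrr(x["retrieved"], x["relevant"]) for x in test_set) / n,
    }
```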
Cost and latency at scale
The two surprises that hit RAG projects in production:
Cost surprise — embedding costs scale with corpus refresh frequency. If you re-embed 1M documents weekly using a paid embedding API, you are quietly running $4K/month just on embeddings. Mitigations: use open-source embedding models (BGE, E5, GTE) hosted on your own GPU; cache embeddings; only re-embed changed documents.
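The last mitigation is mostly bookkeeping. A minimal sketch, assuming you persist a map of document ID to content hash between index runs:

```python
import hashlib

def docs_needing_reembedding(documents: dict[str, str], stored_hashes: dict[str, str]):
    """Return only the documents whose content changed since the last index run."""
    changed = {}
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != digest:   # new or modified document
            changed[doc_id] = digest
    return changed                                # embed only these, then update the hashes
```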
Latency surprise — the LLM is rarely the bottleneck. Vector search at scale plus reranking plus LLM generation plus citation parsing easily reaches 4–6 seconds end-to-end. Mitigations: streaming responses (perceived latency drops dramatically); two-stage retrieval; pre-warming caches for popular queries; smaller reranker models.
Pre-production checklist
Before launching a RAG to real users, we make sure every box is ticked:
- Held-out evaluation set exists, with ground-truth answers and source passages
- Retrieval metrics (Recall@10, MRR) are above the agreed threshold
- Faithfulness metric is above 0.92 — answers stay inside provided context
- p95 latency is under the agreed SLA at projected query volume
- Cost per query is calculated, projected at 10× volume, and approved by finance
- Hallucination behaviour is documented (what happens when no relevant context exists)
- Feedback loop exists — users can flag bad answers, flagged queries enter eval set
- Monitoring is wired — retrieval latency, model latency, cost per query, faithfulness sampling
- Rollback plan exists for model swaps, index rebuilds, prompt changes
- Compliance review complete — PII handling, data residency, retention policy
If a project skips any of the first four items, it is not production-ready, no matter how impressive the demo is.
What is changing in 2026
A few trends worth tracking:
Long-context LLMs are reducing — but not eliminating — the need for RAG. Models with 1M+ token windows can process entire small corpora directly. For corpora over a few hundred pages, RAG is still cheaper, faster, and more controllable. Hybrid approaches (small RAG into a long-context model) are emerging.
Multimodal RAG is moving from research to production. Retrieving images, charts, and tables alongside text is becoming standard for technical documentation, medical records, and regulatory filings.
Native graph + vector hybrid databases (Neo4j with vector, ApertureDB) are becoming the default for highly-connected knowledge.
Eval-first development is replacing prompt-first development. Teams set up evaluation before writing any prompts, then iterate against measurable targets.
Final word
RAG is not a single technique. It is a stack of decisions — chunking, embedding, retrieval, reranking, generation, evaluation — that compound. Each one is small in isolation. Together, they are the difference between a demo that impresses and a system that ships.
The good news: you do not have to get all of them right at once. Start with naive RAG, measure with a real evaluation set, and improve the weakest link one at a time. After 50+ projects, that disciplined iteration pattern beats every "we rewrote everything from scratch" rebuild we have seen.
If you would like our team to audit an in-flight RAG project, or design one from the ground up, we offer a free 60-minute architecture review.

Written by
Pratik Kantesiya
AI Engineering Lead
Pratik leads AI engineering at Agile Infoways, where he architects production AI systems for enterprises across healthcare, BFSI, and logistics. He writes about practical AI delivery — what works, what does not, and what most teams miss between proof-of-concept and production.



