AI Engineering

Why Most RAG Implementations Fail in Production (And How to Fix Them)

We've watched RAG break in production across three client projects. Same four failure modes every time. Here's what actually goes wrong and the pipeline that fixes it.

Arjun Nayak · Founder, Zosma AI
10 min read
RAG · Vector Databases · AI Engineering · Production AI
A broken pipeline with documents falling through gaps between chunking, embedding, retrieval, and generation stages

A client called us last year about their RAG system. They had spent four months building it. Internal docs, policy PDFs, a support knowledge base — roughly 40,000 documents ingested into Pinecone with text-embedding-3-small, wired up to GPT-4o with a tidy prompt. It worked great in the demo. It worked great for the first two weeks in production.

Then they ran a customer satisfaction review. 38% of the answers their RAG chatbot produced were either wrong or "technically correct but useless." Support tickets didn't go down. They went up, because customers now had to re-ask the same questions to a human after the bot gave them a confident wrong answer first.

This is not unusual. This is the default outcome.

I've now watched the same pattern play out across three client engagements. Different industries, different data, different teams. The failures rhyme. There are roughly four places a RAG pipeline breaks, and every production system I've seen that "doesn't work" is broken in at least two of them.

This post is the post-mortem.

The RAG promise vs. the RAG reality

The pitch is seductive: upload your docs, get a chatbot that knows your business. Vector database vendors sell it this way. Framework tutorials reinforce it. A junior engineer can ship a working demo in a weekend with LangChain and a Jupyter notebook.

The demo works because demos are friendly. You test three or four questions you already know the answers to. The retriever surfaces the right chunk because you phrased the query the same way the document was written. The generator produces a clean, grounded answer. You ship it.

Then real users show up. They phrase questions in ways you didn't anticipate. They ask about edge cases that span three different documents. They ask follow-ups. They use internal jargon the embedding model has never seen. And the system that looked brilliant with your hand-picked test set starts returning chunks that are plausibly related but wrong, and the LLM — because it's designed to be helpful — smooths over the gap with invented details.

The gap between demo RAG and production RAG is not a skill issue. It's a pipeline issue. Let's walk through where it actually breaks.

Four stages of a RAG pipeline: chunking, embedding, retrieval, and generation, each marked with a common failure

Failure mode 1: Bad chunking

Chunking is the first thing every tutorial gets wrong and the first thing every production system regrets.

The default advice — "just use 512-token chunks with 50-token overlap" — exists because it's easy to implement, not because it's correct. Fixed-size chunking treats your documents like plain text. They aren't plain text. They're structured: headings organize sections, tables carry dense information, bulleted lists belong together, code blocks need to stay intact. A 512-token window slicing through your document will happily cut a table in half, split a procedure between steps 3 and 4, or separate a definition from the term it defines.

Here's a real example from one of our engagements. The client had an insurance policy document with a clause like:

Section 4.2 Exclusions. This policy does not cover claims arising from: (a) intentional damage by the policyholder; (b) commercial use of the vehicle; (c) driving under the influence of alcohol or controlled substances; (d) participation in motor racing or speed trials.

A fixed-size chunker split this mid-list. Chunk A ended after item (b). Chunk B started at (c). When a customer asked whether they were covered for damage during a commercial delivery, the retriever surfaced Chunk B — which lists drunk driving and motor racing — and the LLM confidently told the customer their commercial use was covered, because the exclusion for commercial use was literally not in the retrieved context.

The fix is semantic chunking. Respect document structure. Split on section boundaries, not token counts. Keep lists, tables, and code blocks whole even when they exceed your target size. Store the parent heading as metadata on every chunk so the retriever knows what section a chunk came from. For long documents, use hierarchical chunking: retrieve at the section level, then drill down.
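As a minimal sketch of the idea — markdown-only, and ignoring tables and code blocks, which a real structure-aware parser handles — heading-aware chunking looks like this:

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    heading: str  # parent heading, stored as metadata on every chunk

def chunk_by_sections(markdown: str) -> list[Chunk]:
    """Split on heading boundaries instead of fixed token windows."""
    chunks: list[Chunk] = []
    heading, buffer = "", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):  # a new section begins
            if buffer:
                chunks.append(Chunk("\n".join(buffer).strip(), heading))
            heading, buffer = line.lstrip("# ").strip(), []
        else:
            buffer.append(line)
    if buffer:
        chunks.append(Chunk("\n".join(buffer).strip(), heading))
    return chunks
```

The point is the invariant, not the twenty lines: a chunk never crosses a section boundary, and every chunk knows which heading it lives under.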

On that same engagement, switching from fixed-size to structure-aware chunking moved retrieval precision from 61% to 84% on our evaluation set. Same embedding model. Same retriever. Same LLM. Just better chunks.

Failure mode 2: Wrong embedding model

Generic embedding models are trained on generic internet text. They're extraordinary at generic tasks. They are bad at your domain, and they are bad in ways that are hard to see until they hurt you.

Here's the concrete failure. A medical client was using text-embedding-3-small to index clinical guidelines. The word discharge appears in medicine to mean "release from hospital" and also, separately, to mean "fluid from a wound." In a generic embedding space, these two meanings live close together because they share the token. In the client's context, they're unrelated concepts that should never be retrieved for the same query.

When a nurse asked, "what are the criteria for discharge after a knee replacement?", the retriever returned passages about wound discharge management. The LLM, given wound care content, produced a wound care answer. The nurse got a correct-sounding reply about monitoring fluid color and volume, instead of the actual discharge criteria (gait stability, pain scores, imaging review).

The same problem appears with financial terminology (position, security, execution), legal text (consideration, party, instrument), and basically any technical domain where common words carry domain-specific meaning.

Your options, in order of increasing effort:

  1. Use a stronger general-purpose model. text-embedding-3-large, voyage-3, bge-large-en-v1.5, or nomic-embed-text-v1.5 outperform text-embedding-3-small on most benchmarks and handle polysemy better.
  2. Use a domain-specific model if one exists. BioLORD and SapBERT for medical, FinBERT variants for finance, Legal-BERT for legal.
  3. Fine-tune an embedding model on your data. With 5,000–20,000 query/document pairs you can meaningfully improve domain performance. Sentence Transformers makes this approachable; the hard part is assembling the pair dataset.
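The hard part, as noted, is the pair dataset. One hedged sketch of how to mine it — assuming you have resolved support tickets where an agent linked the doc section that answered the question (the `question` and `resolving_section` field names here are hypothetical):

```python
def mine_training_pairs(tickets):
    """Turn resolved support tickets into (query, positive_passage) pairs.
    With an in-batch-negatives loss, only positives need labeling: every
    other passage in the batch serves as a negative for free."""
    pairs = []
    for t in tickets:
        q = t.get("question", "").strip()
        doc = t.get("resolving_section", "").strip()  # hypothetical field
        if q and doc:
            pairs.append((q, doc))
    # Deduplicate so repeated pairs don't become false in-batch negatives
    return list(dict.fromkeys(pairs))
```

From there, each pair can be wrapped as a Sentence Transformers `InputExample(texts=[query, passage])` and trained with `MultipleNegativesRankingLoss`, which is the standard recipe for this kind of data.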

You do not need to start at option 3. You do need to stop assuming option 1 is "good enough" without measuring it. Which brings us to the next failure.

Failure mode 3: Retrieval without re-ranking

Dense vector search finds documents whose embeddings are close to your query's embedding. It does not find the most relevant documents. Those are different things, and the difference is where RAG systems silently bleed quality.

Top-k retrieval gives you the k chunks closest to the query embedding in vector space. Closeness in embedding space is a proxy for relevance. It is not relevance. The top-1 chunk for "how do I cancel a subscription?" might be a chunk that happens to use the words cancel and subscription a lot — an FAQ about why you shouldn't cancel, or a billing section about failed cancellations — when the actual cancellation procedure is ranked seventh.

Re-ranking fixes this. A re-ranker is a cross-encoder model that takes the query and each candidate document together and produces a real relevance score. It's slower than embedding search, which is why you don't use it as your primary retriever. But if you retrieve the top 30 candidates with your vector search and then re-rank them with a cross-encoder, you get dramatically better top-5 results.
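The wiring is simple enough to sketch. Here `score_fn` stands in for a real cross-encoder call — for example, sentence-transformers' `CrossEncoder.predict` on (query, document) pairs:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score vector-search candidates with a cross-encoder score and
    keep only the best for the prompt. Retrieve wide (30+), rank narrow."""
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_n]
```

The design choice that matters: the expensive model only ever sees the 30-50 candidates the cheap retriever produced, so latency stays bounded regardless of corpus size.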

The numbers here are not subtle. On our internal evaluation set from one client, adding Cohere Rerank 3 on top of existing vector search moved answer accuracy from 67% to 89%. BGE-Reranker-v2-m3 (open source, self-hosted) got us to 85% at no per-query cost. Either one is a far bigger quality win than upgrading your LLM from GPT-4o to GPT-4.5.

While we're here: stop using pure vector search. Hybrid search — combining dense vector similarity with BM25 keyword search — handles two cases vector search is bad at: exact matches (product SKUs, error codes, proper nouns) and rare terms that embeddings flatten into nearby but wrong concepts. Most production-grade retrievers (Weaviate, Qdrant, Elastic, Vespa) support hybrid out of the box. Use it.
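Reciprocal rank fusion — the usual way to combine the dense and BM25 result lists — is small enough to sketch in full. The constant `k=60` is the conventional default from the original RRF formulation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists (e.g. dense results and BM25 results).
    Each doc's fused score is the sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.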

Failure mode 4: No evaluation

This is the failure that causes all the other failures to be invisible.

I have never, not once, walked into a struggling RAG project and found a proper evaluation framework. What I usually find is a folder of 10–20 test questions someone wrote by hand, run occasionally, eyeballed for vibes. That is not evaluation. That is hoping.

Without evaluation, every change is a guess. You switch embedding models — did quality go up or down? You tweak the prompt — did you fix the problem or create three new ones? A junior engineer improves chunk size for one type of query and quietly breaks retrieval for four others. Nobody notices until a customer complains.

A minimally adequate RAG evaluation has three layers:

  • Retrieval metrics. For a set of labeled queries, does the correct chunk appear in the top-k? Measure Recall@5, Recall@10, and Mean Reciprocal Rank. These tell you whether the retriever is finding the right content, independent of what the LLM does with it.
  • Answer metrics. Given the retrieved context, does the generated answer correctly respond to the query? Faithfulness (does the answer stay grounded in the retrieved context, no hallucinated additions?) and answer relevance (does it address what was asked?) are the two you cannot skip. Tools like Ragas and DeepEval automate both using LLM-as-judge with decent reliability if you validate the judge.
  • End-to-end regression tests. A pinned set of 50–200 real queries with known-good answers that runs on every change. If a PR drops faithfulness from 0.91 to 0.84, the PR does not merge. This is the single most valuable thing you can add to a RAG project and almost nobody has it.
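The retrieval layer is the easiest to implement yourself — no LLM judge needed, just labeled queries. A minimal sketch of Recall@k and MRR:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one relevant chunk in the top-k.
    retrieved: list of ranked doc-id lists, one per query.
    relevant:  list of relevant doc-id lists, one per query."""
    hits = sum(1 for r, rel in zip(retrieved, relevant) if set(r[:k]) & set(rel))
    return hits / len(retrieved)

def mean_reciprocal_rank(retrieved, relevant):
    """Average of 1/rank of the first relevant chunk (0 if never retrieved)."""
    total = 0.0
    for r, rel in zip(retrieved, relevant):
        for rank, doc_id in enumerate(r, start=1):
            if doc_id in set(rel):
                total += 1.0 / rank
                break
    return total / len(retrieved)
```

Run these on every labeled query before and after each change, and "did the retriever get better?" stops being a matter of opinion.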

Building an evaluation dataset feels like overhead. It is not overhead. It is the only way you ever actually improve. Without it you're doing vibes-driven AI engineering, and vibes do not survive production.

An evaluation feedback loop showing retrieval metrics, faithfulness scoring, and regression tests gating deployment

The fix: a proper pipeline

The fix is not one thing. Every "one weird trick" RAG post on the internet is selling you a component when your problem is architectural. A production RAG pipeline has the following stages, in this order, with this specificity:

  1. Ingestion with structure awareness. Parse PDFs, HTML, and Office docs with a parser that preserves structure — not pdfplumber dumping raw text. Tools like Unstructured, LlamaParse, or Azure Document Intelligence extract headings, tables, and lists as first-class objects.
  2. Semantic chunking. Split on section boundaries. Keep structured elements whole. Attach parent headings and source metadata to every chunk. For long documents, build a hierarchy.
  3. Domain-appropriate embeddings. Start with a strong general-purpose model (text-embedding-3-large, voyage-3-large, bge-large). Measure on your domain. If quality is poor on domain-specific terminology, move to a domain model or fine-tune.
  4. Hybrid retrieval. Dense vector search plus BM25, fused with reciprocal rank fusion. Retrieve 30–50 candidates, not 5.
  5. Re-ranking. A cross-encoder re-ranker (Cohere Rerank, BGE-Reranker, or a fine-tuned one) cuts the candidate set down to the top 5–8 that actually go into the prompt.
  6. Grounded generation with citations. The LLM prompt instructs the model to answer only from the provided context, cite chunk IDs for every factual claim, and say "I don't have enough information" when the context doesn't cover the question. Citations are not decoration — they're how you catch hallucination at inference time.
  7. Evaluation in CI. Retrieval metrics, faithfulness, answer relevance, plus a regression set. Every code change runs the suite. Every prompt change runs the suite. Every data update runs the suite.
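Stage 6 is mostly careful prompt assembly. A sketch of the shape — the exact wording here is illustrative, not a battle-tested prompt:

```python
def build_grounded_prompt(question, chunks):
    """Assemble a citation-grounded prompt: each chunk gets an ID the
    model must cite, and the instructions require explicit refusal
    when the context doesn't cover the question."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer ONLY from the context below. Cite chunk IDs in brackets "
        "for every factual claim. If the context does not contain the "
        "answer, reply exactly: I don't have enough information.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The chunk IDs are what make post-hoc verification possible: any sentence in the answer without a bracketed ID is a hallucination candidate you can flag automatically.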

A pipeline like this will not win you a demo contest. It will quietly work in production, which is the actual goal.

A seven-stage production RAG pipeline from structured ingestion through evaluation in CI

Three questions to evaluate your current RAG

If you have a RAG system running today, ask these three questions. The answers will tell you whether you have a product or a prototype in a costume.

1. Can you name your top-10 retrieval failures from the last 30 days?

Not your top complaints — your top retrieval failures, specifically. Queries where the right chunk existed in your index but was not in the retrieved top-k. If you cannot answer this, you are not logging retrieval traces, which means you cannot diagnose the system when it misbehaves. Turn on full retrieval logging before you do anything else.
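Trace logging doesn't need infrastructure to start — one JSON line per query is enough, as long as it captures the full candidate list (the field names below are illustrative):

```python
import json
import time

def log_retrieval_trace(logf, query, candidates, reranked):
    """Append one JSON line per query so retrieval failures can be
    replayed later: the raw candidate list is what lets you ask
    'was the right chunk even in the top-k?' after the fact."""
    logf.write(json.dumps({
        "ts": time.time(),
        "query": query,
        "candidate_ids": [c["id"] for c in candidates],
        "final_ids": [c["id"] for c in reranked],
    }) + "\n")
```

Grep those lines for queries whose known-good chunk is missing from `candidate_ids`, and you have your top-10 retrieval failures.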

2. If I swap your embedding model tomorrow, can you prove whether quality went up or down?

If the answer involves "we'd test it with some queries and see how it feels," you don't have evaluation. You have a hope-based deployment process. A real answer sounds like: "We'd run our 180-query evaluation set, check Recall@10 and faithfulness, and compare to our baseline. If any metric drops more than 2 points, we roll back."
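That answer can be a dozen lines of CI glue. A sketch, with baseline numbers that are illustrative rather than recommended:

```python
BASELINE = {"recall_at_10": 0.88, "faithfulness": 0.91}  # illustrative values
TOLERANCE = 0.02  # the "drops more than 2 points" rule

def gate(current):
    """Return the list of metrics that regressed past tolerance.
    A non-empty result means the change does not merge."""
    return [metric for metric, base in BASELINE.items()
            if current.get(metric, 0.0) < base - TOLERANCE]
```

Wire that into the build so a regression is a red check, not a Slack debate.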

3. What does your system do when it doesn't know the answer?

Ask it ten questions you know aren't covered in your corpus. Does it say "I don't have information on that"? Or does it generate something that sounds right? A RAG system without a proper refusal behavior is a liability, full stop. The failure mode isn't theoretical — it's Air Canada inventing a bereavement policy, which cost them in court. Ungrounded confidence is a production bug, not a personality.

Where OpenZosma fits

We've been building RAG systems for clients long enough to stop starting from scratch. The patterns above — structure-aware ingestion, hybrid retrieval with re-ranking, evaluation as a first-class pipeline stage — are packaged into OpenZosma's Knowledge Module. It handles the boring-but-critical parts: parsers that preserve structure, chunking strategies per document type, hybrid retrieval with re-rank, citation-grounded generation, and a built-in evaluation harness so you can measure changes before you deploy them.

It's not magic. It's the pipeline we wish every RAG project started with, instead of rebuilding it, badly, three months in.

Closing thought

The reason RAG is hard in production is not that any single component is hard. Vector search is easy. Chunking is easy. LLM prompts are easy. The hard part is that the errors compound. A chunker that's 90% good, an embedding model that's 90% good, a retriever that's 90% good, and a generator that's 90% good stack up to a system that's 66% good. Users notice that missing 34%. They notice it fast.

Every production RAG that actually works got there by treating the pipeline as a system, measuring each stage, and refusing to ship changes that can't be evaluated. There is no shortcut. The teams that pretend there is ship demos; the teams that accept it ship products.

If your RAG system isn't working, it's not because the technology doesn't work. It's because somebody skipped a stage. Find the stage. Fix it. Measure it. Move to the next one.

That's the whole playbook. It's not exciting. It ships.