Why Naive RAG Fails in Production: A 2024 Reality Check
February 2, 2024 · 7 min read · by Muhammad Amal ai

TL;DR — The 30-line LangChain RAG demo is misleading. Production RAG fails on retrieval recall, chunk boundaries, freshness, and access control — none of which the demo touches. The 2024 stack has answers; the work is integrating them.

I built my first production RAG system in mid-2023. The demo took an afternoon. Hardening it for actual users took four months. The gap between “loads a PDF and answers questions” and “supports 50 teams with security boundaries and sub-second latency” is wider than the LangChain README suggests.

This month I’m going to walk through that gap. Vector databases, embedding choices, chunking strategies, hybrid search, security, re-ranking, evaluation. But first, the framing piece: why naive RAG breaks and what the 2024 industry has actually learned.

The five-line RAG that lies

# The demo — looks great, ships nothing
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# A PDF is binary, so open("doc.pdf").read() would not return usable text; PyPDFLoader (needs pypdf) extracts it per page
docs = PyPDFLoader("doc.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(docs)
db = Chroma.from_documents(chunks, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model="gpt-4-turbo"), retriever=db.as_retriever())
print(qa.run("What does this document say about pricing?"))

This demo works on a fifteen-page PDF. It fails on a thousand-page corpus. It fails harder on a multi-tenant system. The reason is that every line of it papers over a real production problem.

Failure 1: chunk-level retrieval, document-level questions

The chunker splits on a character budget (1,000 characters here), with no awareness of sections or topics. The vector DB stores those chunks. A user asks “what does this document say about pricing?” The top few chunks come back. Two of them are there only because they mention “price” in passing. The actual pricing section, four paragraphs long, scores lower than a passing mention in the executive summary.

This is the most common naive-RAG failure: the retrieval is technically working, but the chunks it pulls don’t match what a human would consider relevant. In 2024 the workarounds are:

  • Hierarchical chunking — parent/child splits where the child is indexed for retrieval but the parent is returned to the LLM.
  • Semantic chunking — split on semantic boundaries (heading changes, topic shifts) rather than character counts.
  • Sentence-window retrieval — index single sentences, return the sentence plus N neighbors.

LlamaIndex 0.9 ships all three out of the box. LangChain 0.1 has partial coverage. Neither is automatic — you pick the strategy based on document type. PDFs of contracts want hierarchical. Markdown wikis want semantic. Email threads want sentence-window.
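To make sentence-window retrieval concrete, here is a minimal library-free sketch. It is not how LlamaIndex implements it; the regex splitter, the function names, and the sample text are all assumptions for illustration. The idea is only that you match on one sentence and hand the LLM that sentence plus its neighbors.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_window(sentences: list[str], hit: int, n: int = 1) -> str:
    """Return the sentence at index `hit` plus up to `n` neighbors on each side."""
    start = max(0, hit - n)
    end = min(len(sentences), hit + n + 1)
    return " ".join(sentences[start:end])


# Hypothetical document text, made up for this example.
doc = (
    "Our plans are billed monthly. The Pro plan costs $40 per seat. "
    "Annual billing gives a 15% discount. Support is available by email."
)
sentences = split_sentences(doc)

# Pretend the retriever matched sentence 1 (the one that states the price).
context = sentence_window(sentences, hit=1, n=1)
```

The single sentence is what gets embedded and matched, so the match stays precise; the window is what the LLM reads, so the answer still has the billing and discount context around it.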

I dig into this in detail in next week’s chunking post.

Failure 2: pure vector search has bad recall on lexical queries

Embedding similarity is fantastic for paraphrases. It’s bad at exact-match queries. A user asks “what’s the SLA for tier-3 support?” If “tier-3” appears in the corpus exactly once, the embedding might rank it below a chunk discussing “third-level service guarantees.” The LLM gets garbage context and confidently hallucinates the wrong number.

Pure vector search misses what BM25 (or any keyword index) would have nailed. The 2024 fix is hybrid search: run both and combine the rankings. Most production-grade vector DBs now ship native hybrid search:

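# Qdrant hybrid search via the Query API (query_points/prefetch/FusionQuery need Qdrant 1.10+;
# v1.7 added sparse vectors but not server-side fusion)
# embedding_vector, sparse_indices and sparse_values come from your dense and sparse encoders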
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(url="http://localhost:6333")

results = client.query_points(
    collection_name="docs",
    prefetch=[
        models.Prefetch(
            query=embedding_vector,
            using="dense",
            limit=20,
        ),
        models.Prefetch(
            query=models.SparseVector(indices=sparse_indices, values=sparse_values),
            using="sparse",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=5,
)

Reciprocal Rank Fusion (RRF) is the canonical merging strategy. I’ll dedicate a full post to hybrid search later this month.
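RRF itself is small enough to write by hand, which helps when your vector DB doesn't fuse for you. A minimal sketch, assuming each retriever returns doc ids ordered best first; the ids are made up, and k = 60 is the constant commonly used with RRF, not necessarily what any particular database uses internally.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["c7", "c2", "c9"]   # hypothetical ids from the vector search, best first
sparse = ["c2", "c4", "c7"]  # hypothetical ids from the keyword search, best first
fused = rrf([dense, sparse])
```

Here `c2` ends up first: it is not the top dense hit, but it ranks high in both lists. RRF only looks at ranks, so you never have to reconcile cosine similarities with BM25 scores.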

Failure 3: stale embeddings

Documents change. The source-of-truth corpus updates daily. The naive RAG pipeline re-indexes everything on a cron, takes hours, and serves stale data in between.

In 2024, three approaches dominate:

  • Document-level deltas. A pipeline watches the source store, embeds only new or modified documents, and upserts them into the vector DB. The vector DB needs reliable per-document delete plus upsert.
  • Source-time-aware queries. Each chunk carries a last_modified timestamp. The retriever filters or boosts based on freshness.
  • Cache the embedding model output by content hash. Re-indexing is cheap when most chunks haven’t changed. Hash the chunk text; skip the embedding API call on a cache hit.
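The content-hash cache from the last bullet can be sketched in a few lines. Everything here is an assumption for illustration: the `EmbeddingCache` class, the in-memory dict (you would want a persistent store), and `fake_embed`, which stands in for the real embedding API call.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoize embeddings by SHA-256 of the chunk text (in-memory for the sketch)."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the underlying embed call was made

    def embed(self, chunk: str) -> list[float]:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(chunk)
        return self._store[key]


def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding API call; not a real model.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
for chunk in ["Pricing starts at $40.", "Support is 24/7.", "Pricing starts at $40."]:
    cache.embed(chunk)
```

Three chunks go in, two embedding calls go out. In a real pipeline I would also fold the embedding model name into the hash key, so that switching models invalidates the cache instead of silently mixing vector spaces.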

OpenAI’s text-embedding-3-small, released January 25, 2024, costs $0.02 per million tokens. It’s cheap enough that the “embed everything nightly” approach is tempting. Don’t. Ingest latency is still your bottleneck if your corpus is large, and consistency suffers if you can’t atomically swap collections.

Failure 4: no per-user access control

The naive RAG demo assumes one user, one corpus. Real systems have multi-tenant data. User A should never see chunks from User B’s documents.

Bolting access control onto an already-indexed vector DB is painful. The two clean patterns:

  • Per-tenant indexes. Each tenant gets a separate collection / namespace. Strict isolation, easy to reason about, expensive at scale (every tenant is a fixed cost).
  • Metadata filtering on a shared index. Each chunk carries an acl metadata field. Queries filter by acl IN (user.allowed_groups). Cheap, but the filter must be enforced before retrieval, not after, and your DB needs efficient filtered ANN.

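Pinecone serverless (launched January 2024) and Qdrant v1.7 both have indexed metadata that makes the second approach perform. pgvector 0.5 added HNSW indexes and is the natural choice if you’re already on Postgres, but it applies the WHERE filter after the index scan, so a selective ACL filter can return fewer rows than you asked for; partial indexes or per-tenant partitioning are the usual workarounds. The full security post comes mid-month.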
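Why the filter has to run before retrieval is easiest to see on a toy index. This is a sketch under loud assumptions: the `Chunk` type, the precomputed scores, and the group names are invented, and a real vector DB does this inside the ANN search rather than with a list comprehension.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    acl: set[str]  # groups allowed to read this chunk
    score: float   # similarity to the query, precomputed for the sketch


def search_prefilter(chunks: list[Chunk], groups: set[str], k: int) -> list[Chunk]:
    """Drop chunks the caller cannot read, then rank what is left."""
    allowed = [c for c in chunks if c.acl & groups]
    return sorted(allowed, key=lambda c: c.score, reverse=True)[:k]


def search_postfilter(chunks: list[Chunk], groups: set[str], k: int) -> list[Chunk]:
    """Rank everything, take top-k, then filter. Shown only as the anti-pattern."""
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:k]
    return [c for c in top if c.acl & groups]


index = [
    Chunk("Team B salary bands", {"team-b"}, 0.95),
    Chunk("Team B roadmap", {"team-b"}, 0.90),
    Chunk("Team A pricing sheet", {"team-a"}, 0.80),
    Chunk("Public pricing FAQ", {"team-a", "team-b"}, 0.70),
]
pre = search_prefilter(index, {"team-a"}, k=2)
post = search_postfilter(index, {"team-a"}, k=2)
```

With the filter first, a team-a user gets their two readable chunks. With the filter last, the top-2 slots are taken by team-b chunks, the filter throws both away, and the user gets nothing even though relevant content exists. The worse variant of the same bug is forgetting the post-filter entirely.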

Failure 5: no evaluation, no feedback loop

“It seems to work” is the default mode for naive RAG, because LLM outputs look authoritative even when wrong. Without evaluation, you have no idea how often retrieval is failing.

The 2024 evaluation stack has matured. Ragas, TruLens, and DeepEval all let you compute metrics like:

  • Faithfulness — does the generated answer follow from the retrieved chunks?
  • Answer relevance — does the answer address the question?
  • Context precision — were the retrieved chunks actually relevant?
  • Context recall — were any relevant chunks missed?

Wire these into CI so a regression in retrieval quality fails the build, not the user. Last post of the month covers this in depth.
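As a rough illustration of the CI gate, here is a set-based version of context precision and recall over a labeled golden set. This is a simplification, not how Ragas computes its metrics (those are LLM-judged); the golden set, chunk ids, and the 0.7 floor are all made up.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunk ids that are labeled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of labeled-relevant chunk ids that were retrieved."""
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical golden set: query -> (retrieved ids, labeled relevant ids)
golden = {
    "tier-3 SLA": (["c1", "c2", "c3", "c4"], {"c1", "c3"}),
    "pricing": (["c5", "c6", "c7", "c8"], {"c5", "c9"}),
}
precisions = [context_precision(r, rel) for r, rel in golden.values()]
recalls = [context_recall(r, rel) for r, rel in golden.values()]
mean_precision = sum(precisions) / len(precisions)
mean_recall = sum(recalls) / len(recalls)

# In CI, a floor like this turns a retrieval regression into a failed build.
MIN_RECALL = 0.7  # assumed threshold; tune it on your own data
assert mean_recall >= MIN_RECALL, f"context recall dropped to {mean_recall:.2f}"
```

The labeled set is the expensive part. A few dozen real queries with hand-marked relevant chunks is enough to catch a chunker or embedding change that quietly tanks recall.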

The 2024 stack, at a glance

If I were starting a production RAG project this month, the defaults I’d reach for:

  • Embeddings: text-embedding-3-small for most cases; text-embedding-3-large for high-stakes accuracy; Cohere embed-multilingual-v3.0 for non-English. Open-source: bge-m3 for hybrid (dense, sparse, and ColBERT-style multi-vector output from one model).
  • Vector DB: Qdrant for self-hosted, Pinecone serverless for managed, pgvector if you’re already on Postgres and the corpus is under ~10M chunks.
  • Chunking: LlamaIndex semantic chunker for text-heavy docs, custom logic for structured docs.
  • Retrieval: Hybrid (dense + sparse) with RRF, top-50 → re-ranker → top-5 to LLM.
  • LLM: gpt-4-turbo for synthesis, gpt-3.5-turbo or Claude 2.1 for cheaper paths.
  • Evaluation: Ragas in CI, sampled user queries replayed weekly.
  • Orchestration: LangChain 0.1 if you need the ecosystem, plain Python if you don’t.

None of this is exotic. All of it is integration work.

Common Pitfalls

  • Trusting the demo numbers. 90% accuracy on your test set doesn’t mean 90% on user queries. Real queries drift away from whatever your test set captured.
  • Starting with a single chunk size for everything. Different document types want different chunkers. One-size-fits-all underperforms.
  • Skipping re-ranking. Top-5 from vector search alone is rarely the right top-5. A re-ranker on top-50 fixes most issues for cheap.
  • Letting the LLM “synthesize” from low-quality context. GPT-4 will confidently invent. Better to refuse than fabricate. Build a “no relevant context found” path.
  • Indexing without security from day one. Adding ACLs after the fact is a migration. Build it in.
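Two of these pitfalls, skipping re-ranking and letting the LLM synthesize from weak context, can share one small gate in front of the LLM call. A sketch, with assumptions stated: `overlap_score` is a toy stand-in for a real cross-encoder re-ranker, the 0.3 cutoff is arbitrary, and `answer` returns a placeholder string where the real LLM call would go.

```python
from typing import Callable, Optional

NO_CONTEXT = "I could not find relevant context for that question."


def build_context(
    query: str,
    candidates: list[str],
    rerank_score: Callable[[str, str], float],
    top_k: int = 5,
    min_score: float = 0.3,
) -> Optional[list[str]]:
    """Re-rank candidates, keep top_k above min_score, or None to signal refusal."""
    scored = sorted(
        ((rerank_score(query, c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    kept = [c for score, c in scored[:top_k] if score >= min_score]
    return kept or None


def overlap_score(query: str, chunk: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query words found in the chunk.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0


def answer(query: str, candidates: list[str]) -> str:
    context = build_context(query, candidates, overlap_score)
    if context is None:
        return NO_CONTEXT
    return f"[LLM call with {len(context)} chunk(s)]"  # placeholder for the real call
```

The point of returning `None` rather than an empty list of weak chunks is that the refusal is decided in code, before the model sees anything it could embellish.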

Wrapping Up

The 30-line demo isn’t the product. Production RAG is closer to a search engine with an LLM bolt-on than to a chat interface. This month I’ll walk through the layers — vector DB choice, embeddings, chunking, hybrid search, security, re-ranking, evaluation — with concrete code for each. By the end you’ll have a more honest mental model of what shipping a RAG system in 2024 actually costs.

For deeper reading on the broader retrieval research, the Anthropic Contextual Retrieval writeup is worth a careful pass.