Chunking Strategies for RAG That Survive Real Documents
TL;DR — Fixed-size chunking is the worst default. Use document-aware splitting: semantic for prose, structural for code and Markdown, hierarchical for long structured docs. Test on real queries, not chunk-count vibes.
Earlier this month I said embedding choice was more consequential than DB choice. Walking that back slightly — chunking is consequential too, and easier to get wrong. A 5-percentage-point bump in retrieval quality from the right chunking strategy is achievable. Most teams don’t get it.
The default RecursiveCharacterTextSplitter with chunk_size=1000, chunk_overlap=200 is fine for the README demo and bad for everything else. This post walks through what to use instead for the document types you actually have.
What chunking is doing
Retrieval at the chunk level means each chunk is its own atomic unit. The embedding model encodes a chunk; the vector DB indexes the embedding; queries retrieve the top-K chunks; those chunks are stuffed into the LLM context.
Two failure modes from bad chunking:
- Loss of coherence. A chunk ends mid-sentence or splits an example from its explanation. The retrieved chunk doesn’t carry enough context to answer.
- Loss of specificity. A chunk is too large and covers several topics. The embedding averages them, so no single topic ranks well on focused queries.
Good chunking maintains semantic coherence per chunk while keeping chunks small enough that each one is about one thing.
Document type drives strategy
There isn’t one chunker. Pick by document type:
- Long prose (PDF reports, articles, books) — semantic or hierarchical chunking.
- Code — split by AST or function boundaries.
- Markdown / docs — split by header sections.
- Tables — special handling; convert to row-level units or markdown.
- Email threads — split by reply boundaries.
- Chat transcripts — split by speaker turns or topic shifts.
Trying to handle all of these with one chunker is what produces mediocre retrieval. Detect the type, route to the right chunker.
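A minimal sketch of the detect-and-route idea, using only the standard library. Everything here is hypothetical scaffolding: the extension map, the "two header lines means Markdown" fallback, and the three placeholder chunkers are assumptions for illustration, and in practice each placeholder would be swapped for one of the real strategies covered below.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Chunker = Callable[[str], List[str]]


def _split_before(text: str, prefixes: Tuple[str, ...]) -> List[str]:
    """Start a new chunk whenever a line begins with one of the prefixes."""
    chunks: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith(prefixes) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return [c.strip() for c in chunks if c.strip()]


def chunk_prose(text: str) -> List[str]:
    # Placeholder: one chunk per paragraph (blank-line separated).
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk_markdown(text: str) -> List[str]:
    # Placeholder: one chunk per header section.
    return _split_before(text, ("#",))


def chunk_code(text: str) -> List[str]:
    # Placeholder: split before top-level Python definitions.
    return _split_before(text, ("def ", "class "))


# Hypothetical routing tables; extend for tables, email, chat, and so on.
ROUTES: Dict[str, Chunker] = {
    "markdown": chunk_markdown,
    "code": chunk_code,
    "prose": chunk_prose,
}
EXTENSIONS = {".md": "markdown", ".markdown": "markdown", ".py": "code"}


def detect_type(filename: str, text: str) -> str:
    ext = Path(filename).suffix.lower()
    if ext in EXTENSIONS:
        return EXTENSIONS[ext]
    # Content fallback (assumed heuristic): two or more header lines.
    header_lines = sum(1 for line in text.splitlines() if line.startswith("#"))
    return "markdown" if header_lines >= 2 else "prose"


def route_and_chunk(filename: str, text: str) -> List[str]:
    return ROUTES[detect_type(filename, text)](text)
```

The point of the dictionary is that adding a document type means adding one entry and one function, not touching the callers.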
Fixed-size, when it’s OK
Fixed-size character chunking is the baseline. It’s fast and predictable. It’s also where most teams stop, which is the problem.
# RecursiveCharacterTextSplitter — LangChain 0.1
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(text)
The recursive variant is better than naive — it tries to split on paragraph, then sentence, then word, before resorting to character. For uniform prose with no structure, this is fine.
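To make the "paragraph, then sentence, then word, then character" fallback concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation: it ignores overlap, drops the separator at chunk boundaries, and `recursive_split` is a name I made up for illustration.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split text into pieces of at most chunk_size characters,
    trying the coarsest separator first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Last resort: hard cut on character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur; fall through to a finer one.
        return recursive_split(text, chunk_size, finer)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into this chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

Calling `recursive_split("aaaa bbbb\n\ncccc dddd eeee", 10)` keeps the first paragraph whole and only falls back to word boundaries for the second, which is exactly the behavior that makes this better than a naive cut.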
Where it falls down: any document with structure. PDFs of legal contracts. Markdown wikis with nested headers. Anything where the document author put effort into the hierarchy.
Semantic chunking
The 2024 favorite for prose. Split where the embedding similarity between adjacent sentences drops, signaling a topic shift.
# Semantic chunking (LlamaIndex 0.10+; the llama_index.core import paths do not exist in 0.9)
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)
nodes = splitter.get_nodes_from_documents(documents)
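How it works: split the text into sentences, embed each one together with its buffer_size neighbors on either side, compute the cosine distance between consecutive embeddings, and mark a split wherever that distance exceeds the 95th percentile of all distances in the document. The result is variable-length chunks whose boundaries fall on semantic shifts.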
Cost is the catch. You’re paying for embeddings during chunking, on top of paying for chunk embeddings. For large corpora, run the chunking pass on a cheaper embedding model than the one you use for retrieval; the chunking step only needs to detect shifts, not retrieve.
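The breakpoint rule itself is small. Below is a minimal sketch of the percentile logic in plain Python. The big assumption: `toy_embed` is a bag-of-words stand-in so the example runs without an API key, and it is not how a real embedding model behaves. The sketch also skips the sentence buffer and embeds each sentence on its own.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    # Stand-in for a real embedding model: word counts as a sparse vector.
    return dict(Counter(re.findall(r"[a-z']+", sentence.lower())))


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values: List[float], pct: float) -> float:
    # Linearly interpolated percentile, pct in [0, 100].
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold_pct: float = 95.0,
) -> List[str]:
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    # Distance between each sentence and the one that follows it.
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    chunks: List[str] = []
    current = [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > cutoff:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Because `embed` is a parameter, the cheaper-model idea amounts to passing a different function here than the one used at index time.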
Hierarchical chunking
For long structured documents — manuals, books, multi-section reports — small chunks lose document-level context, large chunks lose specificity. Hierarchical chunking gives you both.
The idea: index small (sentence or paragraph) chunks for retrieval; return the parent (section or chapter) chunk to the LLM. The LLM gets coherent context; the retriever maintains specificity.
# Hierarchical (LlamaIndex 0.10+)
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
# Pass chunk_sizes alone: from_defaults raises a ValueError if
# node_parser_ids is given as well.
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128],  # roughly section, paragraph, sentence
)
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)
# At index time, only index the leaf nodes (the 128-token level).
# At query time, retrieve leaves, then walk up to the parent for context.
LlamaIndex’s AutoMergingRetriever automates the “walk up to parent” step at query time. Once enough sibling leaves are retrieved, the parent is returned instead.
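The walk-up step is easy to picture in isolation. This is a sketch of the merge rule only, not LlamaIndex's code: `merge_to_parents`, the two lookup dictionaries, and the 0.5 ratio threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def merge_to_parents(
    retrieved_leaf_ids: List[str],
    parent_of: Dict[str, str],
    children_of: Dict[str, List[str]],
    ratio_threshold: float = 0.5,  # assumed default, tune on your data
) -> List[str]:
    """Replace retrieved sibling leaves with their parent ID when more than
    ratio_threshold of that parent's children were retrieved."""
    hits: Dict[str, List[str]] = defaultdict(list)
    for leaf in retrieved_leaf_ids:
        hits[parent_of[leaf]].append(leaf)

    result: List[str] = []
    merged_parents = set()
    for leaf in retrieved_leaf_ids:  # preserve retrieval order
        parent = parent_of[leaf]
        ratio = len(hits[parent]) / len(children_of[parent])
        if ratio > ratio_threshold:
            if parent not in merged_parents:
                result.append(parent)
                merged_parents.add(parent)
        else:
            result.append(leaf)
    return result
```

With a section of three leaves where two were retrieved, the section replaces both; a lone hit in a four-leaf section stays a leaf.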
I use this for any document over ~10 pages. The quality gain is meaningful.
Markdown and structured docs
Markdown ships with structural cues — headers, lists, code blocks. Use them.
# Markdown header splitter — LangChain 0.1
from langchain.text_splitter import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ],
)
docs = splitter.split_text(markdown_text)
# Each doc carries metadata with the header hierarchy
The metadata attached to each chunk, for example {"h1": "API Reference", "h2": "Authentication"}, is gold. You can embed it as a prefix to the chunk content, use it for filtering, or surface it as the citation when displaying results.
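If you go the prefix route, render the header path as plain language rather than pasting the dict. A small sketch, assuming the h1/h2/h3 metadata keys from the splitter config above; `header_prefix` and `text_for_embedding` are hypothetical helper names.

```python
from typing import Dict, Sequence


def header_prefix(
    metadata: Dict[str, str],
    levels: Sequence[str] = ("h1", "h2", "h3"),
) -> str:
    """Render header metadata as a natural-language breadcrumb."""
    path = [metadata[level] for level in levels if metadata.get(level)]
    if not path:
        return ""
    return "Section: " + " > ".join(path) + "\n\n"


def text_for_embedding(content: str, metadata: Dict[str, str]) -> str:
    # Embed breadcrumb plus content; keep the raw content for display.
    return header_prefix(metadata) + content
```

For the metadata shown above this yields "Section: API Reference > Authentication" followed by a blank line and the chunk body.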
Code chunking
Code is its own thing. Splitting Python by character count will split inside a function or class. The retrieved chunk has half a def and is useless.
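The right approach: split on structural boundaries (function, class, top-level statement). The cheap approximation is language-aware separators; the full version parses a real AST.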
# Code-aware splitting — LangChain 0.1
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,
    chunk_overlap=0,
)
chunks = splitter.split_text(source_code)
LangChain ships language-aware separators for Python, JS, Go, Java, C, Rust, and others. The recursive splitter prefers to split on language-specific structural tokens (\nclass , \ndef , \nfunc ).
For richer code chunking, tree-sitter gives you a real AST. Walk it, emit one chunk per function or class, and include the file path and class context as metadata. Cursor and a few other code-aware tools reportedly do something similar under the hood.
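To show the shape of that walk without a tree-sitter dependency, here is a sketch that swaps in Python's built-in `ast` module. It only handles Python source, and `chunk_python_source` and its metadata keys are my own naming, not anything from a library. It emits one chunk per top-level function and per method, carrying file path and class context along.

```python
import ast
from typing import Dict, List

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def chunk_python_source(source: str, file_path: str) -> List[Dict[str, str]]:
    """One chunk per top-level function and per method, with metadata.

    Imports and other top-level statements are skipped here, and
    decorators are not included in the extracted text.
    """
    tree = ast.parse(source)
    chunks: List[Dict[str, str]] = []

    def emit(node: ast.AST, class_name: str = "") -> None:
        chunks.append(
            {
                "text": ast.get_source_segment(source, node) or "",
                "name": node.name,
                "class": class_name,
                "file": file_path,
            }
        )

    for node in tree.body:
        if isinstance(node, FUNCTION_NODES):
            emit(node)
        elif isinstance(node, ast.ClassDef):
            for child in node.body:
                if isinstance(child, FUNCTION_NODES):
                    emit(child, class_name=node.name)
    return chunks
```

The same structure carries over to tree-sitter: replace the `isinstance` checks with node-type checks for whichever grammar you load.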
Tables
Tables in PDFs are the worst case. Embedding a table’s text dump gives you garbage — the embedding sees comma-separated noise, not relations.
Two approaches that work:
- Row-level chunks. Each table row becomes its own chunk, with the column headers and table title as prefix metadata. Useful when queries are about specific rows.
- Markdown table dumps. Convert the table to markdown, treat the whole table as one chunk. Useful when queries are about the table as a whole.
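Both forms are a few lines once the table is parsed into headers and rows. A minimal sketch, assuming you already have the cells as lists of strings; the function names and the "Table: title" prefix format are illustrative choices, not a standard.

```python
from typing import List, Sequence


def table_to_row_chunks(
    title: str, headers: Sequence[str], rows: Sequence[Sequence[str]]
) -> List[str]:
    """One chunk per row, each carrying the title and column names."""
    chunks = []
    for row in rows:
        cells = "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        chunks.append(f"Table: {title}\n{cells}")
    return chunks


def table_to_markdown(
    title: str, headers: Sequence[str], rows: Sequence[Sequence[str]]
) -> str:
    """The whole table as a single markdown chunk."""
    lines = [
        f"Table: {title}",
        "",
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)
```

Nothing stops you from indexing both: row chunks for lookups, the markdown dump for questions about the table as a whole.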
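unstructured.io detects tables in PDFs and exposes each one as HTML, which you can convert to either form:
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",  # table structure inference needs the hi_res strategy
    extract_image_block_types=["Table"],  # optional: also saves cropped table images
    infer_table_structure=True,
)
for el in elements:
    if el.category == "Table":
        print(el.metadata.text_as_html)  # table as HTML; convert to markdown or rows downstream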
For LLM ingestion, markdown is usually the better format. GPT-4 reads markdown tables fluently.
Sentence-window retrieval
A trick for surgical-precision retrieval: index single sentences, but return a window of N neighbors as context.
# Sentence-window (LlamaIndex 0.10+)
from llama_index.core.node_parser import SentenceWindowNodeParser
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # sentences captured on each side of the match
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = parser.get_nodes_from_documents(documents)
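At query time the retriever matches on the embedded sentence; a postprocessor such as LlamaIndex’s MetadataReplacementPostProcessor (with target_metadata_key="window") then swaps in the stored window before the text reaches the LLM. Note that window_size=3 means three sentences on each side, so the window is up to seven sentences. Best for FAQ-style or fact-extraction queries where the exact fact lives in one sentence but needs surrounding context to be coherent.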
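The mechanics fit in a few lines of plain Python. This is a sketch of the idea, not the LlamaIndex parser: the regex sentence splitter is deliberately naive, and I'm treating `window_size` as "sentences on each side", which is my reading of the LlamaIndex parameter.

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_window_nodes(text: str, window_size: int = 3) -> List[Dict[str, str]]:
    """Each node embeds one sentence but stores its neighbors as context."""
    sentences = split_sentences(text)
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,  # what gets embedded and matched
                "window": " ".join(sentences[start:end]),  # what the LLM sees
            }
        )
    return nodes
```

Index `node["text"]`, and at query time hand `node["window"]` to the model instead.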
Common Pitfalls
- One chunker for all doc types. Detect the type and route.
- Chunk overlap as a fix for bad boundaries. Overlap is a partial mitigation, not a substitute for splitting at the right boundaries.
- Embedding the metadata badly. If you prepend {"h1": "..", "h2": ".."} literally to chunk text, you’re embedding “h1 h2” tokens. Prefer natural language: “Section: API Reference > Authentication\n\n”.
- Skipping the cleanup pass. Trailing whitespace, broken hyphens from PDF extraction, stray page numbers, table-of-contents fragments: all of this poisons embeddings. Clean before chunking.
- Treating chunking as one-and-done. It isn’t. Re-chunk when you change embedding models, when queries reveal mismatch, or when document types shift.
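For the cleanup pass, a handful of regexes covers the worst offenders. A rough sketch, with assumptions worth stating: it treats any line holding only a number (optionally preceded by "Page") as a page number, and it rejoins every word hyphenated across a line break, which will also merge legitimate compounds like "well-" / "known". Tune both rules against your own extraction output.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Light cleanup for PDF-extracted text before chunking."""
    # Rejoin words hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that are only a page number, e.g. "12" or "Page 12".
    text = re.sub(r"(?im)^[ \t]*(page[ \t]+)?\d+[ \t]*$\n?", "", text)
    # Strip trailing whitespace on every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse runs of three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Run it before any of the chunkers above, so boundaries are computed on the text you actually want embedded.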
The mistake I made: I chunked a thousand PDFs at character size 1000, indexed everything, and shipped. Six weeks in I realized 30% of the chunks ended mid-sentence on broken column boundaries. The fix was switching to unstructured.io for layout-aware extraction, then chunking semantically. Re-indexing took two days.
Wrapping Up
Chunking is where you spend ten times more thought than you expect to. Default chunkers work for default documents, meaning none of yours. Detect document types, route to the right strategy, evaluate on real queries, iterate. The 2024 stack (LlamaIndex 0.10, LangChain 0.1, unstructured.io) has all the primitives. Wire them.
Next post in the series brings BM25 back from the dead — hybrid search and why pure-vector retrieval is leaving recall on the table.