Shipping an Internal RAG Chatbot with LlamaIndex 0.8: What Actually Matters
TL;DR
- LlamaIndex 0.8 is the fastest path to a working internal RAG chatbot, but the defaults will lie to you in production.
- Spend your time on chunking, metadata, and retrieval evaluation, not on prompt engineering.
- Treat the vector index as a cache around your source-of-truth documents, never as the source itself.
I spent most of October building an internal chatbot for a mid-sized company that wanted their engineers to stop pinging the platform team every time someone forgot which Vault path holds the staging Postgres credentials. The brief was simple: ingest the internal wiki, the runbooks, and a few hundred ADRs, then answer questions in Slack. The brief is always simple. The execution is where you bleed.
This post is the unvarnished version of what I learned shipping the v1 with LlamaIndex 0.8.68 against gpt-3.5-turbo-16k. I’m writing it now because LlamaIndex 0.9 is about to land and the API surface is going to shift again. If you’re starting a project this month, you’ll want to know what’s actually load-bearing versus what’s a demo trick.
The short version: the retrieval quality determines everything. The LLM is the cheap part.
Why LlamaIndex Over Rolling Your Own
I’ve built RAG pipelines with raw embeddings calls and a Postgres + pgvector setup before. It works. It’s also a lot of code to maintain that doesn’t differentiate your product. LlamaIndex earns its keep by giving you reasonable defaults for the boring scaffolding: document loaders, node parsers, retriever abstractions, response synthesizers. You can swap any layer.
What you should not do is treat VectorStoreIndex.from_documents() as a finished product. It’s a quickstart. Production starts when you split that one line into five.
# llama-index==0.8.68, openai==1.3.5, psycopg2-binary==2.9.9
import os

from llama_index import (
    ServiceContext,
    StorageContext,
    VectorStoreIndex,
    SimpleDirectoryReader,
)
from llama_index.vector_stores import PGVectorStore
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI

embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
llm = OpenAI(model="gpt-3.5-turbo-16k", temperature=0.0)

# Embed one sentence per node; keep the neighbouring sentences in metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    node_parser=node_parser,
)

vector_store = PGVectorStore.from_params(
    database="rag",
    host="localhost",
    password=os.environ["PG_PASSWORD"],
    port=5432,
    user="rag",
    table_name="wiki_nodes",
    embed_dim=1536,  # output dimension of text-embedding-ada-002
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# "./wiki_export" is a placeholder path; point it at your own document dump.
documents = SimpleDirectoryReader("./wiki_export").load_data()

# Parse, embed, and write the nodes into Postgres.
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
)
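A few things to notice. I’m using SentenceWindowNodeParser instead of the default sentence splitter. The window approach embeds a single sentence and stores the surrounding sentences in node metadata under the window key, so the retriever matches on the sentence while the LLM can be handed the context around it. To get that context into the prompt you have to swap the window in at query time; MetadataReplacementPostProcessor with target_metadata_key="window" is the piece that does it. This dramatically improved answer quality on our runbook content, where the relevant signal is often one sentence but the LLM needs the paragraph around it to write a useful answer.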
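To make the mechanics concrete, here is a minimal pure-Python sketch of the sentence-window idea. It is not LlamaIndex’s implementation: the regex sentence splitter and the dict-based node shape are simplifying assumptions, and only the two metadata key names mirror the parser configuration above.

```python
import re


def sentence_window_nodes(text, window_size=3):
    """Return one node per sentence, with neighbouring sentences kept in metadata."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,  # the part that gets embedded
            "metadata": {
                "original_text": sentence,
                # the part the LLM should see at answer time
                "window": " ".join(sentences[start:end]),
            },
        })
    return nodes
```

The embedded text stays small and specific, while the window carries enough context to write an answer from.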
Temperature is zero. Always zero for internal Q&A. You are not writing poetry.
Chunking Is Where the Project Is Won or Lost
If you remember one thing from this post: chunking strategy matters more than embedding model choice, more than reranker choice, more than prompt template. I’ve watched teams spend two weeks tuning prompts when their problem was that they were embedding 2000-token chunks and the retriever could only ever return three of them at a time, half of which were the wiki sidebar boilerplate.
For our wiki corpus, I ended up with three different parsers for three different document types:
- Runbooks: split on H2 boundaries with a 200-token overlap. Runbooks have a stable structure and you want to keep the “Symptoms → Diagnosis → Fix” section intact.
- ADRs: one chunk per ADR. They’re short and the decision context matters as a whole.
- Wiki pages: sentence window with window size 3.
You cannot do this with one parser. Route documents to parsers by metadata at ingest time. This sounds obvious. Almost no one does it.
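As a rough illustration of routing by metadata, the sketch below dispatches on a doc_type field. Everything in it is an assumption made for the example: the doc_type values, the dict-shaped documents, and the three toy splitters, which stand in for real node parsers and leave out the 200-token overlap for brevity.

```python
import re


def split_on_h2(text):
    """Runbooks: cut at every line that starts a markdown-style H2 section."""
    parts = re.split(r"^(?=## )", text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]


def whole_document(text):
    """ADRs: keep the whole decision record as a single chunk."""
    return [text.strip()]


def split_sentences(text):
    """Wiki pages: naive sentence split, standing in for a sentence-window parser."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Hypothetical routing table keyed by a doc_type metadata field.
PARSERS = {
    "runbook": split_on_h2,
    "adr": whole_document,
    "wiki": split_sentences,
}


def chunk_document(doc):
    """Pick a parser from the document's metadata and copy metadata onto each chunk."""
    # Unknown types fall back to the wiki parser; raising instead is equally valid.
    parser = PARSERS.get(doc["metadata"].get("doc_type"), split_sentences)
    return [
        {"text": chunk, "metadata": dict(doc["metadata"])}
        for chunk in parser(doc["text"])
    ]
```

Copying the metadata onto every chunk is what later makes metadata filtering possible at query time.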
Strip your boilerplate before embedding. The header that says “Last edited by Bob on 2023-04-15” is going to embed similarly to every other header that says “Last edited by…” and your retriever will start returning headers when users ask vague questions. Stripping boilerplate removed about 18% of our corpus by volume, and recall went up.
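A minimal sketch of that stripping step, assuming the boilerplate can be matched line by line. The single pattern here covers only the “Last edited by” header from the example; a real corpus would need its own list of patterns.

```python
import re

# Illustrative pattern list (assumption); extend it with your wiki's own boilerplate.
BOILERPLATE_PATTERNS = [
    re.compile(r"^Last edited by .+ on \d{4}-\d{2}-\d{2}$"),
]


def strip_boilerplate(text):
    """Drop every line that fully matches a known boilerplate pattern."""
    kept = [
        line
        for line in text.splitlines()
        if not any(p.match(line.strip()) for p in BOILERPLATE_PATTERNS)
    ]
    return "\n".join(kept)
```

Run it before chunking, so that boilerplate never reaches the embedding model at all.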
Metadata Filtering Is Not Optional
Pure semantic search is a toy. Real systems combine vector similarity with metadata filters. If a user asks “what’s the on-call rotation for the payments team?”, you should be filtering to team=payments before you do the vector search, not hoping the embeddings sort it out.
from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter

filters = MetadataFilters(filters=[
    ExactMatchFilter(key="team", value="payments"),
    ExactMatchFilter(key="doc_type", value="runbook"),
])

retriever = index.as_retriever(
    similarity_top_k=8,
    filters=filters,
)
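Conceptually, filtering first and ranking second looks like the toy in-memory version below. This is only a sketch of the idea, not how any particular vector store executes it; the node dicts and the brute-force cosine ranking are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, nodes, filters, top_k=8):
    """Keep only nodes whose metadata matches every filter, then rank by similarity."""
    candidates = [
        n for n in nodes
        if all(n["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda n: cosine(query_vec, n["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]
```

The node from the wrong team never competes for a top-k slot, no matter how close its embedding is to the query.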
The hard part is extracting the filter values from the user’s question. For v1, I cheated: a Slack slash command takes /ask payments how does refund retry work and the first word is the team. Crude, fast, works. The next iteration uses a small function-calling LLM step to extract structured filters from natural language, but I’d ship the crude version first and only upgrade when users complain — and they usually don’t.
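The crude version is only a few lines. The sketch below assumes the handler receives the text that follows /ask, and that you keep a set of known team names; both the function name and the known_teams argument are hypothetical.

```python
def parse_ask_text(text, known_teams):
    """Split '<team> <question>' into (team, question).

    Returns (None, full_text) when the first word is not a known team,
    so the caller can fall back to an unfiltered search.
    """
    words = text.strip().split(maxsplit=1)
    if len(words) == 2 and words[0].lower() in known_teams:
        return words[0].lower(), words[1].strip()
    return None, text.strip()
```

For "payments how does refund retry work" this yields the team "payments" and the question "how does refund retry work"; the team then becomes the value of the team filter.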
If you’re choosing a vector store this month, check the LlamaIndex vector store docs for current metadata filter support. Pinecone, Weaviate 1.22, and pgvector all handle this well. Chroma 0.4 has gotten there too.
Retrieval Evaluation, Not Vibes
Here is the part everyone skips. You need an eval set. Not for the LLM output — for the retriever. Before you ever look at a generated answer, you need to know: for question X, did the retriever surface the right chunk in the top-k? If the answer is no, no amount of prompt tweaking will save you.
I built a 60-question eval set by sitting with the platform team for two hours and asking them what questions they actually get. For each question, they pointed me at the canonical document. I then computed hit rate and MRR at k=5 and k=10 across three retriever configurations: pure vector, vector + BM25 hybrid, and hybrid + Cohere reranker.
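Both metrics are small enough to compute by hand. A minimal sketch, assuming each question has exactly one canonical document and that results maps each question to the ranked list of document IDs the retriever returned (the data shapes are assumptions):

```python
def hit_rate_at_k(results, expected, k):
    """Fraction of questions whose canonical document appears in the top k."""
    hits = sum(1 for q, ranked in results.items() if expected[q] in ranked[:k])
    return hits / len(results)


def mrr_at_k(results, expected, k):
    """Mean reciprocal rank of the canonical document, counting 0 if it is outside the top k."""
    total = 0.0
    for q, ranked in results.items():
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id == expected[q]:
                total += 1.0 / rank
                break
    return total / len(results)
```

Hit rate tells you whether the right document showed up at all; MRR tells you how near the top it landed.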
The numbers that mattered:
| Retriever | Hit@5 | MRR@10 |
|---|---|---|
| Vector only | 0.62 | 0.48 |
| Hybrid (vector + BM25) | 0.78 | 0.61 |
| Hybrid + reranker | 0.85 | 0.71 |
The reranker is worth it. The hybrid retrieval is worth it more. If you’re only going to do one, do hybrid. BM25 catches the cases where the user uses an exact internal term — a service name, an error code — that doesn’t have great semantic neighbors.
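One common way to merge a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). Treat the sketch below as an illustration of that technique under assumed inputs (two ranked lists of document IDs), not as a description of how the hybrid retriever in this project was wired; k=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

RRF only looks at ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.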
Common Pitfalls
A few things I tripped over so you don’t have to.
Naive chunk stuffing backfires. If you retrieve eight chunks and concatenate them into a single prompt, the LLM will weight the first one heavily and ignore the last. Set the response mode explicitly instead of trusting whatever your query engine was built with: use ResponseMode.COMPACT (which, as far as I can tell, is what as_query_engine() already uses in 0.8 unless you override it) or ResponseMode.TREE_SUMMARIZE for anything beyond three chunks. The compact mode in particular is doing real work: it re-packs chunks to fit context efficiently.
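To show what re-packing means in practice, here is a simplified greedy packer. It is an illustration of the idea only, not LlamaIndex’s code; the count_tokens callable and the max_tokens limit are assumptions you would replace with a real tokenizer and your model’s budget.

```python
def pack_chunks(chunks, max_tokens, count_tokens):
    """Greedily merge retrieved chunks into as few prompt-sized batches as possible."""
    batches, current, used = [], [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        # Start a new batch when this chunk would overflow the current one.
        if current and used + cost > max_tokens:
            batches.append("\n\n".join(current))
            current, used = [], 0
        current.append(chunk)
        used += cost
    if current:
        batches.append("\n\n".join(current))
    return batches
```

Each batch then costs one LLM call, so fewer and fuller batches mean fewer calls for the same retrieved context. A chunk larger than max_tokens still gets a batch of its own here; a real implementation would have to split it.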
Streaming changes the UX more than anything else. A 4-second response feels slow. A 4-second response that starts streaming at 400ms feels fast. LlamaIndex supports this: build the query engine with streaming=True, for example index.as_query_engine(streaming=True), and forward tokens from response.response_gen as they arrive. Use it.
Source attribution is mandatory for internal tools. No one will trust the bot if it can’t cite. Surface the source document title and URL with every answer. I made this the top of the response, not the bottom — users want to know where it came from before they read the answer.
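A minimal sketch of a sources-first reply, assuming each retrieved source carries a title and url in its metadata (those field names and the plain-text layout are assumptions; adapt them to your own metadata and to Slack’s formatting):

```python
def format_answer(answer, sources):
    """Render the source list above the answer, de-duplicated by URL."""
    lines = ["Sources:"]
    seen = set()
    for source in sources:
        if source["url"] in seen:
            continue  # several chunks often come from the same page
        seen.add(source["url"])
        lines.append(f"- {source['title']}: {source['url']}")
    return "\n".join(lines) + "\n\n" + answer
```

De-duplicating matters because several of the top-k chunks frequently come from the same document.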
Token budgets are real even at 16K. Eight chunks at 1000 tokens each plus the system prompt plus the conversation history will eat your context. Budget explicitly. Log when you truncate.
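One way to budget explicitly is to reserve room for everything that is not retrieved context and then admit chunks until the remainder is spent. In this sketch the count_tokens callable, the reserved_tokens figure, and the logger name are all assumptions; plug in a real tokenizer for your model.

```python
import logging

logger = logging.getLogger("rag.budget")  # hypothetical logger name


def fit_chunks_to_budget(chunks, context_window, reserved_tokens, count_tokens):
    """Keep chunks in rank order until the context budget is used up.

    reserved_tokens covers the system prompt, chat history, and the answer.
    Returns (kept_chunks, number_dropped).
    """
    budget = context_window - reserved_tokens
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    dropped = len(chunks) - len(kept)
    if dropped:
        logger.warning("Dropped %d of %d chunks to fit the budget", dropped, len(chunks))
    return kept, dropped
```

With eight 1000-token chunks in a 16K window, the interesting number is reserved_tokens: history and the answer allowance are usually what push you over.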
Embedding cost adds up. Re-embedding the entire wiki on every change is wasteful. Hash documents at ingest and only re-embed changed nodes. With ada-002 at $0.0001/1K tokens a full re-embed isn’t a fortune, but the bill grows with corpus size and update frequency, and every full pass slows iteration.
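The hashing step can be as small as the sketch below. It assumes you persist a doc_id to hash mapping somewhere between ingest runs (the stored_hashes dict here is a stand-in for that store) and that documents arrive as a doc_id to text mapping.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reembedding(documents, stored_hashes):
    """Compare current documents against stored hashes.

    Returns (to_embed, to_delete): IDs that are new or changed, and IDs
    that no longer exist in the source and should be removed from the index.
    """
    to_embed = [
        doc_id
        for doc_id, text in documents.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in documents]
    return to_embed, to_delete
```

Handling deletions in the same pass keeps the index honest as a cache of the source documents rather than a slowly drifting copy.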
What’s Next
GPT-4 Turbo just dropped at OpenAI DevDay with 128K context and substantially lower pricing, and that changes the calculus. With 128K you can stop being so precious about retrieval and start passing more candidate chunks to the LLM with confidence. But you still need retrieval to be good: the “lost in the middle” effect does not go away just because the context window grew.
Next post I’ll walk through migrating this stack from gpt-3.5-turbo-16k to gpt-4-turbo and the surprises that came with it. The cost math is more interesting than you’d think.
Wrapping Up
LlamaIndex 0.8 gets you to a credible v1 fast. The work that makes the difference between a demo and a tool people actually use is in the chunking, the metadata, the hybrid retrieval, and the eval discipline. None of that is glamorous. All of it is mandatory.
If you’re starting a build this month, build the eval set first. You will thank yourself.