RAG Systems for Technical Support Teams in 2025
TL;DR — Production RAG for support isn’t about picking the fanciest embedding model. It’s about clean ingestion from ticket systems, chunking that respects how engineers actually read docs, hybrid retrieval with a reranker, and an evaluation harness that runs every deploy.
I’ve spent the last eight months helping two support orgs replace their first-generation chatbots with proper retrieval-augmented systems, and the pattern is always the same. Somebody on the leadership side saw a vendor demo, somebody else stitched together a notebook with text-embedding-3-small and FAISS, and within four weeks the bot was hallucinating product features that don’t exist and citing a 2019 knowledge base article that’s been deprecated for two years. The model wasn’t the problem. The pipeline was.
This tutorial is the version of the project I wish I’d had on day one. It assumes you’re a senior support engineer or a tech support manager who can read Python, has access to your ticketing system’s API, and wants to ship something that L1 agents will actually trust. We’ll build incrementally, starting with ingestion from Zendesk and Jira Service Management, moving through chunking and embedding, into hybrid retrieval and reranking, and finally an evaluation harness that you can plug into CI.
I’m pinning everything to November 2025 versions because the ecosystem still churns enough that a tutorial older than six months is usually broken. LangChain 0.3, Qdrant 1.13, pgvector 0.8 on Postgres 17, and OpenAI’s gpt-4o plus text-embedding-3-large for embeddings. If your org has a Claude contract, swap in claude-3.7-sonnet and Anthropic SDK 0.42; the structure doesn’t change.
What support RAG actually needs to do
The thing nobody tells you in vendor demos is that support RAG has a different shape than the generic “chat with your docs” demo. You have at least four distinct corpora that all need to be searchable together but weighted differently. You’ve got your public knowledge base, your internal runbooks, your closed ticket history, and your engineering wiki. Each one has different freshness, different authority, and different access control.
L1 agents need fast answers to common questions. L2 needs deep links into runbooks and similar past tickets. L3 wants to grep across the engineering wiki and pull source code references. A single retrieval pipeline that treats all four sources equally will fail all three audiences.
            +-----------------------+
  ticket    | query rewriter (LLM)  |
  text -->  | + intent classifier   |
            +-----------+-----------+
                        |
            +-----------v-----------+
            |   hybrid retriever    |
            | (BM25 + dense + RRF)  |
            +-----------+-----------+
                        |
            +-----------v-----------+
            |     cross-encoder     |
            |       reranker        |
            +-----------+-----------+
                        |
            +-----------v-----------+
            |   answer synthesis    |
            |   with inline cites   |
            +-----------------------+
That’s the topology we’re going to build. It’s boring on purpose. The win in 2025 isn’t novel architecture, it’s getting every box in that diagram tuned to your data.
Step 1, ingestion that doesn’t lie
Start with the source you trust most. For most orgs that’s a curated public knowledge base, not the ticket archive. Tickets contain workarounds that became obsolete, customer-specific configurations that don’t generalize, and angry engineers venting in internal notes. You’ll ingest tickets later, but they need different handling.
Here’s a minimal ingestion shape for Zendesk articles. I’m using the Zendesk Help Center API directly because the official Python wrappers lag behind feature releases, and the REST surface is small enough that direct calls are fine.
import os
from typing import Iterator

import httpx

ZD_SUBDOMAIN = os.environ["ZD_SUBDOMAIN"]
ZD_EMAIL = os.environ["ZD_EMAIL"]
ZD_TOKEN = os.environ["ZD_TOKEN"]


def iter_articles(locale: str = "en-us") -> Iterator[dict]:
    """Yield published, non-outdated Help Center articles, following pagination."""
    url = f"https://{ZD_SUBDOMAIN}.zendesk.com/api/v2/help_center/{locale}/articles.json"
    params = {"per_page": 100}
    auth = (f"{ZD_EMAIL}/token", ZD_TOKEN)
    with httpx.Client(timeout=30.0) as client:
        while url:
            r = client.get(url, auth=auth, params=params)
            r.raise_for_status()
            data = r.json()
            for art in data["articles"]:
                # Skip drafts and anything Zendesk has flagged as outdated.
                if art["draft"] or art["outdated"]:
                    continue
                yield {
                    "id": art["id"],
                    "title": art["title"],
                    "body_html": art["body"],
                    "url": art["html_url"],
                    "section_id": art["section_id"],
                    "labels": art["label_names"],
                    "updated_at": art["updated_at"],
                    "source": "zendesk_kb",
                }
            # next_page is a full URL that already carries its own paging query
            # string. Passing params again can overwrite it and refetch page 1,
            # so follow-up requests send none.
            url = data.get("next_page")
            params = None
Two things to notice. The outdated and draft filters are non-negotiable. I’ve seen teams ingest everything and then wonder why the bot recommends features that shipped to beta in 2022 and got pulled. The Zendesk API tells you when an article is stale; respect it.
The other thing is the source field. Tag every document with where it came from, with what permission model applies, and with when you ingested it. Future-you, debugging why a confidential runbook leaked into a customer-facing answer, will thank past-you.
For Jira Service Management, the equivalent is pulling resolved tickets with public-facing resolutions. The upcoming tutorial in this series on building a knowledge base from Zendesk and Jira covers the full ticket ingestion pipeline, including how to filter out PII and customer-specific config.
Step 2, chunking that respects engineer reading patterns
Default chunking is where most RAG pipelines die. The naive recipe of “split on 1000 characters with 200 overlap” works for blog posts and fails for technical docs, because it shreds code blocks and breaks the relationship between a heading and its body.
For support docs, I use a three-pass strategy. First, parse the HTML structure and split on <h2> and <h3> boundaries. Second, within each section, keep code blocks atomic and never split them. Third, if a section is still over the token budget, fall back to sentence splitting with a generous overlap.
import tiktoken
from bs4 import BeautifulSoup
from langchain_text_splitters import RecursiveCharacterTextSplitter

ENC = tiktoken.encoding_for_model("text-embedding-3-large")
MAX_TOKENS = 512


def token_len(text: str) -> int:
    return len(ENC.encode(text))


def chunk_article(article: dict) -> list[dict]:
    soup = BeautifulSoup(article["body_html"], "html.parser")
    chunks = []
    current_heading = article["title"]
    buffer = []

    def flush():
        # Emit whatever has accumulated under the current heading.
        text = "\n\n".join(buffer).strip()
        buffer.clear()
        if not text:
            return
        if token_len(text) <= MAX_TOKENS:
            chunks.append({"heading": current_heading, "text": text})
            return
        # Oversized section: fall back to recursive splitting, measured in
        # tokens so the limit agrees with MAX_TOKENS. Note that this fallback
        # can still cut through a long code block.
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=MAX_TOKENS,
            chunk_overlap=64,
            length_function=token_len,
            separators=["\n\n", "\n", ". ", " "],
        )
        for sub in splitter.split_text(text):
            chunks.append({"heading": current_heading, "text": sub})

    for el in soup.find_all(["h2", "h3", "p", "pre", "ul", "ol"]):
        # Elements nested inside a list or <pre> are already covered by the
        # parent's text; skip them so nothing is emitted twice.
        if el.find_parent(["ul", "ol", "pre"]):
            continue
        if el.name in ("h2", "h3"):
            flush()
            current_heading = el.get_text(strip=True)
        elif el.name == "pre":
            # Keep code blocks as a single fenced unit.
            buffer.append(f"```\n{el.get_text()}\n```")
        else:
            buffer.append(el.get_text("\n", strip=True))
    flush()

    # Prepend title and heading so each chunk carries its own context.
    for c in chunks:
        c["text"] = f"# {article['title']}\n## {c['heading']}\n\n{c['text']}"
        c["metadata"] = {
            "article_id": article["id"],
            "url": article["url"],
            "source": article["source"],
            "updated_at": article["updated_at"],
            "labels": article["labels"],
        }
    return chunks
The crucial detail is the last loop. Every chunk gets the article title and section heading prepended. This gives the embedding model context that the chunk alone wouldn’t have, and it gives the LLM at synthesis time a clear citation anchor. Without this, two chunks from different articles that happen to discuss “step 3” become indistinguishable in the vector space.
Step 3, hybrid retrieval with Qdrant and BM25
Pure dense retrieval misses keyword matches that domain experts care about. If a customer types “ERR_CERT_AUTHORITY_INVALID”, the embedding model might helpfully retrieve documents about TLS certificates in general, when what you needed was the single runbook that mentions that exact error code. Hybrid retrieval, dense plus sparse plus reciprocal rank fusion, fixes this.
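If reciprocal rank fusion is new to you, it is small enough to write out by hand. This is a plain-Python sketch of the idea, not Qdrant's implementation: each document scores the sum of 1 / (k + rank) across the lists it appears in. The constant k = 60 is the commonly used default and an assumption here, and the document IDs are made up.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for the query "ERR_CERT_AUTHORITY_INVALID".
dense = ["tls-overview", "cert-rotation", "err-cert-runbook"]
sparse = ["err-cert-runbook", "proxy-faq"]
fused = rrf_fuse([dense, sparse])
# fused[0] == "err-cert-runbook": third in dense, first in sparse, top overall.
```

The runbook wins because it shows up in both lists, even though dense retrieval alone ranked it last. That is the whole argument for hybrid in one example.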
Qdrant 1.13 supports sparse vectors natively, which means you can do hybrid in a single query without bolting on Elasticsearch. Here’s the indexing side.
import os
import uuid

from fastembed import SparseTextEmbedding
from openai import OpenAI
from qdrant_client import QdrantClient, models

oai = OpenAI()
qdrant = QdrantClient(url=os.environ["QDRANT_URL"], api_key=os.environ["QDRANT_KEY"])
bm25 = SparseTextEmbedding("Qdrant/bm25")
COLLECTION = "support_kb_v3"

# Create the collection once. recreate_collection would wipe it on every run.
if not qdrant.collection_exists(COLLECTION):
    qdrant.create_collection(
        collection_name=COLLECTION,
        vectors_config={
            "dense": models.VectorParams(size=3072, distance=models.Distance.COSINE),
        },
        sparse_vectors_config={
            "sparse": models.SparseVectorParams(
                index=models.SparseIndexParams(),
                # The Qdrant/bm25 model relies on Qdrant applying IDF server-side.
                modifier=models.Modifier.IDF,
            ),
        },
    )


def embed_dense(texts: list[str]) -> list[list[float]]:
    resp = oai.embeddings.create(model="text-embedding-3-large", input=texts)
    return [d.embedding for d in resp.data]


def index_chunks(chunks: list[dict]):
    texts = [c["text"] for c in chunks]
    dense_vecs = embed_dense(texts)
    sparse_vecs = list(bm25.embed(texts))
    points = []
    for i, (chunk, dv, sv) in enumerate(zip(chunks, dense_vecs, sparse_vecs)):
        # Qdrant point IDs must be unsigned integers or UUIDs, so derive a
        # stable UUID from the article id and chunk position. The position is
        # relative to this call: index one article's chunks per call.
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{chunk['metadata']['article_id']}-{i}"))
        sparse = models.SparseVector(indices=sv.indices.tolist(), values=sv.values.tolist())
        points.append(models.PointStruct(
            id=point_id,
            vector={"dense": dv, "sparse": sparse},
            payload={"text": chunk["text"], **chunk["metadata"]},
        ))
    qdrant.upsert(collection_name=COLLECTION, points=points)
Query time uses Qdrant’s query_points with the prefetch pattern to fuse results via RRF.
def hybrid_search(query: str, k: int = 20, source_filter: list[str] | None = None):
    dense_q = embed_dense([query])[0]
    sparse_emb = next(iter(bm25.embed([query])))
    # Wrap the sparse embedding explicitly so the query type is unambiguous.
    sparse_q = models.SparseVector(
        indices=sparse_emb.indices.tolist(),
        values=sparse_emb.values.tolist(),
    )
    flt = None
    # Check against None, not truthiness: an empty list should match nothing
    # rather than silently disabling the filter.
    if source_filter is not None:
        flt = models.Filter(must=[
            models.FieldCondition(key="source", match=models.MatchAny(any=source_filter)),
        ])
    results = qdrant.query_points(
        collection_name=COLLECTION,
        prefetch=[
            models.Prefetch(query=dense_q, using="dense", limit=k * 2, filter=flt),
            models.Prefetch(query=sparse_q, using="sparse", limit=k * 2, filter=flt),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=k,
        with_payload=True,
    )
    return results.points
The source_filter argument is what lets you route L1 queries to public KB only while letting L3 see everything. Don’t skip this; access control as an afterthought is how confidential data ends up in customer-facing answers.
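One way to keep that routing honest is a single lookup table that every caller goes through. This is a sketch under assumptions: only `zendesk_kb` appears in the ingestion code, and the other source tags and the tier-to-source mapping are placeholders for whatever your org decides.

```python
# Hypothetical tier-to-source mapping; adjust the tags to match what your
# ingestion jobs actually write into the "source" payload field.
SOURCES_BY_TIER = {
    "L1": ["zendesk_kb"],
    "L2": ["zendesk_kb", "internal_runbook", "closed_tickets"],
    "L3": ["zendesk_kb", "internal_runbook", "closed_tickets", "eng_wiki"],
}


def sources_for(tier: str) -> list[str]:
    """Return the sources a support tier may search."""
    # Fail closed: an unknown tier gets the public knowledge base only.
    return list(SOURCES_BY_TIER.get(tier, SOURCES_BY_TIER["L1"]))
```

A call site would then look like hybrid_search(query, source_filter=sources_for("L1")), so no caller ever builds the list by hand.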
Step 4, reranking and answer synthesis
Twenty hybrid hits is still too many for a synthesis prompt. A cross-encoder reranker cuts it to the four or five that actually answer the question. Cohere’s rerank-3.5 is the cheapest production-grade option in November 2025, but a self-hosted bge-reranker-v2-m3 runs fine on a single A10G if you want to keep data in-house.
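import os

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])


def rerank(query: str, hits, top_n: int = 5):
    # Nothing to rerank; skip the API call.
    if not hits:
        return []
    docs = [h.payload["text"] for h in hits]
    resp = co.rerank(model="rerank-3.5", query=query, documents=docs, top_n=top_n)
    # Each result carries the index of the document in the list we sent.
    return [hits[r.index] for r in resp.results]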
Synthesis is where you finally call the LLM. The prompt template matters more than the model choice. Put the citation rules in the system message, the retrieved chunks in the user message with explicit IDs, and demand inline citations in the output.
SYSTEM = """You are a senior support engineer. Answer using ONLY the provided
documents. Cite each claim with [doc_id]. If the documents don't contain the
answer, say so plainly and suggest which team to escalate to. Never invent
product features, version numbers, or configuration keys."""
def synthesize(query: str, ranked_hits) -> str:
    docs_block = "\n\n".join(
        f"[doc_{i}] source={h.payload['source']} url={h.payload['url']}\n{h.payload['text']}"
        for i, h in enumerate(ranked_hits)
    )
    resp = oai.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Question: {query}\n\nDocuments:\n{docs_block}"},
        ],
    )
    return resp.choices[0].message.content
Temperature 0.1, not zero. In my experience, pure zero on gpt-4o pushes the model toward parroting the docs verbatim, which sounds fine until a customer asks the same question two different ways and gets identical robotic responses.
Step 5, evaluation that runs in CI
If you can’t measure regression, you can’t ship. Build an eval set of fifty to two hundred real support questions with hand-graded gold answers. Run it on every PR that touches the pipeline.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision


def evaluate_pipeline(eval_set: list[dict]):
    samples = []
    for ex in eval_set:
        hits = hybrid_search(ex["question"], k=20)
        ranked = rerank(ex["question"], hits, top_n=5)
        answer = synthesize(ex["question"], ranked)
        samples.append({
            "question": ex["question"],
            "answer": answer,
            "contexts": [h.payload["text"] for h in ranked],
            "ground_truth": ex["gold_answer"],
        })
    # evaluate() expects a dataset object, not a plain list of dicts. These
    # column names follow the Ragas 0.1 schema; newer releases rename them,
    # so check the docs for the version you pin.
    dataset = Dataset.from_list(samples)
    return evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
Run this nightly against production data and on every PR. The day your faithfulness score drops below 0.85 is the day you stop the deploy. I’ve also got a longer piece on measuring support engineering effectiveness that covers the human-side metrics that should sit next to your RAG scores.
For more on the eval framework itself, the Ragas docs are the canonical reference and they update faster than any blog post can.
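To actually stop the deploy, the CI job needs a non-zero exit code. Here is a minimal sketch of that gate. It assumes you have already reduced the eval result to a plain dict of metric name to mean score; how you get that out of the Ragas result object depends on the version you pin. The function names are mine.

```python
import sys

# Minimum acceptable mean score per metric. 0.85 for faithfulness is the
# threshold from the text; add other metrics as you calibrate them.
THRESHOLDS = {"faithfulness": 0.85}


def failing_metrics(scores: dict[str, float],
                    thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Describe every metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}={value} (min {minimum})")
    return failures


def gate(scores: dict[str, float]) -> int:
    """Return a process exit code: 0 to allow the deploy, 1 to block it."""
    failures = failing_metrics(scores)
    for failure in failures:
        print(f"EVAL GATE FAILED: {failure}", file=sys.stderr)
    return 1 if failures else 0
```

In the CI script the last line would be sys.exit(gate(scores)). A missing metric counts as a failure on purpose, so a broken eval run cannot pass by accident.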
Common Pitfalls
Treating embeddings as one-size-fits-all. text-embedding-3-large is fine for English prose. It’s bad at code, mediocre at multilingual, and useless for short error codes. If your corpus has any of those, you need either a fine-tuned embedding or a sparse component pulling its weight. Test before you commit.
Ingesting every closed ticket. Tickets contain workarounds that became wrong, customer configs that don’t generalize, and outdated version references. Filter aggressively. I trust resolved tickets older than ninety days only if a human has flagged them as canonical.
Skipping access control at the chunk level. If your runbooks contain customer-specific notes, the chunk metadata has to carry that, and your retriever has to enforce it. Doing this at the LLM prompt layer is too late; by then the secret is already in the context window and one prompt injection away from leaking.
Choosing chunk size by intuition. Sweep it. I’ve seen the same corpus hit peak recall at 256 tokens for one query distribution and 768 for another. Run your eval set across at least three chunk sizes before you pick one.
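A sweep does not need much machinery. The sketch below assumes you can supply a `score_for_size` callable (hypothetical, not defined in this tutorial) that re-chunks the corpus at a given token budget, indexes it into a scratch collection, runs the eval set, and returns one number such as recall.

```python
from typing import Callable


def sweep_chunk_sizes(
    sizes: list[int],
    score_for_size: Callable[[int], float],
) -> tuple[int, dict[int, float]]:
    """Score each chunk size and return (best_size, all_scores)."""
    results = {size: score_for_size(size) for size in sizes}
    # Highest score wins; on a tie prefer the smaller chunk, which keeps
    # the synthesis prompt shorter.
    best = max(results, key=lambda s: (results[s], -s))
    return best, results
```

Keep the full results dict, not just the winner. A flat curve tells you chunk size is not your bottleneck, which is worth knowing before you spend a week on it.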
Logging only the final answer. Log the query, the rewritten query, all twenty hybrid hits with their scores, the reranker’s reordering, and the synthesis prompt. Without that, debugging a bad answer is guesswork.
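A structured trace record makes that list hard to forget. This is a sketch with field names of my own choosing, emitted as one JSON line per request so it can go to whatever log pipeline you already have.

```python
import json
import logging
from dataclasses import asdict, dataclass, field

logger = logging.getLogger("rag.trace")


@dataclass
class RetrievalTrace:
    """Everything needed to replay one answer, stage by stage."""
    query: str
    rewritten_query: str
    # One entry per hybrid hit, e.g. {"id": "...", "score": 0.42}.
    hybrid_hits: list[dict] = field(default_factory=list)
    # Doc ids in the order the reranker returned them.
    reranked_ids: list[str] = field(default_factory=list)
    synthesis_prompt: str = ""

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)


def log_trace(trace: RetrievalTrace) -> None:
    # One JSON object per line keeps the log greppable and easy to parse.
    logger.info(trace.to_json())
```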
Troubleshooting
Symptom, the bot cites a doc that doesn’t contain the claim. This is almost always a reranker problem, not a synthesis problem. The synthesis model picked the most plausible citation from a set where none actually answered the question. Lower the reranker top_n, raise the relevance threshold, and add a fallback path that explicitly returns “I don’t have a confident answer” when the top reranker score is below 0.6.
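The fallback path can be a few lines sitting between the reranker and synthesis. The sketch below works on (doc_id, score) pairs and assumes your reranker exposes a relevance score per result; the rerank helper earlier in this tutorial discards scores, so you would need to carry them through. The 0.6 cutoff is the one from the paragraph above and should be calibrated on your own eval set.

```python
FALLBACK_ANSWER = "I don't have a confident answer"


def select_confident(
    scored_hits: list[tuple[str, float]],
    min_score: float = 0.6,
    top_n: int = 3,
) -> list[str]:
    """Return doc ids worth sending to synthesis; empty means use the fallback."""
    ranked = sorted(scored_hits, key=lambda pair: pair[1], reverse=True)
    # If even the best hit is below the bar, send nothing at all.
    if not ranked or ranked[0][1] < min_score:
        return []
    # Otherwise keep at most top_n hits that individually clear the bar.
    return [doc_id for doc_id, score in ranked[:top_n] if score >= min_score]
```

When the function returns an empty list, skip the LLM call and return FALLBACK_ANSWER directly. The model cannot pick a plausible wrong citation from documents it never saw.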
Symptom, recall is great in eval, terrible in production. Your eval set is too clean. Real users misspell error codes, paste stack traces with PII, and ask three questions in one sentence. Add a query rewriting step with gpt-4o-mini that normalizes the input before retrieval, and regenerate eval samples from actual production traffic, not from KB articles.
Symptom, latency spikes to ten seconds on first query. Cold start on the embedding API or the reranker. Pre-warm both at process startup with a synthetic query, and put a 200ms timeout on the reranker with a fallback to dense-only results. Users will forgive a slightly worse answer; they won’t forgive a ten-second wait.
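Here is one way to sketch the timeout using only the standard library. Two simplifications of mine: it takes the rerank function as a parameter, and on timeout it falls back to the order the retriever already returned rather than issuing a separate dense-only query.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Sequence

# Shared pool so each request does not pay thread startup cost.
_POOL = ThreadPoolExecutor(max_workers=4)


def rerank_with_timeout(
    rerank_fn: Callable[[str, Sequence, int], Sequence],
    query: str,
    hits: Sequence,
    top_n: int = 5,
    timeout_s: float = 0.2,
) -> list:
    """Rerank hits, or return the first top_n unreranked if it takes too long."""
    future = _POOL.submit(rerank_fn, query, hits, top_n)
    try:
        return list(future.result(timeout=timeout_s))
    except FutureTimeout:
        # cancel() only helps if the call has not started yet; a call already
        # running finishes in the background and its result is discarded.
        future.cancel()
        return list(hits[:top_n])
```

Log every fallback. If the timeout fires on more than a small fraction of requests, the answer is a faster reranker, not a longer timeout.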
Wrapping Up
A working support RAG isn’t impressive. It’s boring. Ingestion that filters aggressively, chunking that preserves structure, hybrid retrieval with a reranker, and an eval harness that catches regressions before customers do. None of those steps require a research paper to implement. All of them require discipline and a willingness to delete code that isn’t earning its keep.
Next in this series I’ll go deeper on the ingestion side, specifically how to pull resolved tickets from Zendesk and Jira Service Management without leaking PII or pulling in stale workarounds. After that we’ll get into embedding strategy choices that you can’t easily reverse later. If you’re a tech support manager trying to scope this work, the rough order of operations is the order of the headings above; resist the urge to skip ahead to synthesis tuning before the ingestion is clean.