Local RAG with SLMs, Private Knowledge Without the Cloud
January 27, 2025 · 10 min read · by Muhammad Amal

TL;DR — Build local RAG with a small embedding model (BGE-small or nomic-embed-text), a local vector DB (ChromaDB or Qdrant), and a 3B SLM behind Ollama. Hybrid search + a reranker beats vanilla dense retrieval. The bottleneck is chunking, not the model.

Most “local RAG” tutorials end at “and now you have a vector database.” That’s the easy part. The hard parts are chunking sensibly, retrieving the right thing, getting the SLM to use the retrieved context instead of its priors, and knowing when retrieval failed. None of these require the cloud, but they do require thought.

I run a private RAG system over my own notes (Hugo blog posts, like the one you’re reading) and over internal docs for a few small teams. The whole thing runs on a single Linux box with a 12GB GPU. It’s not as fast as a frontier API, but it’s correct, private, and free at the margin. Plus, I can tune every layer instead of negotiating with someone else’s defaults.

This tutorial builds the system end to end. We’re using ChromaDB for storage, BGE-small-en for embeddings, and Llama-3.2-3B-Instruct through Ollama for generation. Versions: Python 3.12, ChromaDB 0.5.23, Ollama 0.5.4, sentence-transformers 3.3.1. The output is a working CLI you can point at a folder of documents.

What RAG Actually Costs

Three running costs you commit to when you build RAG:

Indexing. Every document has to be chunked, embedded, and stored. For a typical 10k document corpus, this takes 10-30 minutes on consumer hardware and around 200MB of disk.

Storage. Vectors are float arrays. A 384-dim embedding (BGE-small) is 1.5KB per chunk. A million chunks is 1.5GB plus index overhead. Comfortable on any machine.
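The storage arithmetic is easy to check for your own corpus. This is a minimal sketch that assumes vectors are stored as 4-byte float32 values and ignores index and metadata overhead; the function name is mine, not part of any library.

```python
def vector_storage_bytes(num_chunks: int, dims: int = 384,
                         bytes_per_float: int = 4) -> int:
    """Raw size of the embedding matrix, excluding index and metadata overhead."""
    return num_chunks * dims * bytes_per_float

per_chunk = vector_storage_bytes(1)             # one 384-dim float32 vector
per_million = vector_storage_bytes(1_000_000)   # one million such vectors
print(f"{per_chunk} B per chunk, {per_million / 1e9:.2f} GB per million chunks")
```

Swap in dims=1024 to see what a larger embedder such as BGE-m3 would cost for the same corpus.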

Retrieval latency. Each query embeds (5-20ms), searches (5-20ms), optionally reranks (50-200ms), and generates. The retrieval pieces are fast; generation dominates.
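To see where the time goes on your own box, a small per-stage timer is enough. This is a sketch using only the standard library; the stage names and the stand-in workloads are placeholders for the real embed, search, rerank, and generate calls.

```python
import time
from contextlib import contextmanager

timings_ms: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock milliseconds for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0

# Stand-in workloads; replace the bodies with the real pipeline calls.
with timed("embed"):
    sum(i * i for i in range(10_000))
with timed("search"):
    sorted(range(10_000), reverse=True)

for stage, ms in timings_ms.items():
    print(f"{stage:>8}: {ms:.2f} ms")
```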

Compared with calling an embedding API and a hosted LLM, you trade dollars for VRAM. The win is that your data never leaves your box.

Architecture

+--------------------+     +-----------------+     +-----------------+
|  Documents         |     |  Chunker        |     |  Embedder       |
|  (md, pdf, html)   |---->|  (recursive,    |---->|  (BGE-small)    |
+--------------------+     |   ~512 tokens)  |     +--------+--------+
                           +-----------------+              |
                                                            v
                                                   +-----------------+
                                                   |  Vector store   |
                                                   |  (ChromaDB)     |
                                                   +-----------------+
                                                            ^
                                                            |
+--------------------+     +-----------------+     +--------+--------+
|  User query        |     |  Embedder       |     |  Top-k search   |
|                    |---->|  (BGE-small)    |---->|                 |
+--------------------+     +-----------------+     +--------+--------+
                                                            |
                                                            v
                                                   +-----------------+
                                                   |  Reranker       |
                                                   |  (bge-reranker) |
                                                   +--------+--------+
                                                            |
                                                            v
                                                   +-----------------+
                                                   |  SLM (Llama 3.2)|
                                                   |  via Ollama     |
                                                   +-----------------+

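Read the top row left to right for indexing: documents are chunked, embedded, and written to the vector store. Read the lower row left to right for queries: the query is embedded, searched against the vector store above it, then passed down through the reranker to the SLM. Both paths use the same embedding model, which is non-negotiable.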

Step 1, Pick Your Models

Embedding model choice matters more than people admit. For English content I default to BGE-small-en-v1.5 — 384 dimensions, fast on CPU, excellent retrieval quality for its size. For multilingual I switch to BGE-m3 (1024 dims, slower, broader coverage).

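Pull the generation model through Ollama. The nomic-embed-text pull is only needed if you want to try Ollama-served embeddings as an alternative; the rest of this tutorial embeds with BGE through sentence-transformers:

ollama pull nomic-embed-text:v1.5
ollama pull llama3.2:3b-instruct-q4_K_M

Then install the Python dependencies: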

pip install \
  "sentence-transformers==3.3.1" \
  "chromadb==0.5.23" \
  "ollama==0.4.4" \
  "pypdf==5.1.0" \
  "markdown-it-py==3.0.0"

Step 2, Chunking That Doesn’t Suck

Chunking is where RAG quality is won or lost. Tiny chunks lose context; huge chunks dilute signal. The right size is 200-500 tokens with 10-20% overlap, broken on semantic boundaries (paragraphs, headings) rather than fixed character counts.

# chunker.py
from dataclasses import dataclass
from typing import Iterable
import re

@dataclass
class Chunk:
    doc_id: str
    chunk_id: int
    text: str
    metadata: dict

def chunk_markdown(text: str, doc_id: str, target_tokens: int = 400,
                   overlap_tokens: int = 60) -> Iterable[Chunk]:
    # naive token approximation: 1 token ~= 4 chars for English
    target_chars  = target_tokens * 4
    overlap_chars = overlap_tokens * 4

    # split on heading boundaries first
    sections = re.split(r"(?m)^(#{1,6}\s+.*$)", text)
    blocks = []
    current = ""
    for piece in sections:
        if re.match(r"^#{1,6}\s+", piece or ""):
            if current.strip():
                blocks.append(current.strip())
            current = piece + "\n"
        else:
            current += piece or ""
    if current.strip():
        blocks.append(current.strip())

    chunk_id = 0
    for block in blocks:
        if len(block) <= target_chars:
            yield Chunk(doc_id, chunk_id, block, {})
            chunk_id += 1
            continue
        start = 0
        while start < len(block):
            end = min(start + target_chars, len(block))
            # try to break at a paragraph boundary
            if end < len(block):
                nl = block.rfind("\n\n", start, end)
                if nl > start + target_chars // 2:
                    end = nl
            yield Chunk(doc_id, chunk_id, block[start:end].strip(), {})
            chunk_id += 1
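            if end >= len(block):
                break  # tail emitted; stepping back here would re-emit it forever
            # step back by the overlap, but always make forward progress
            start = max(end - overlap_chars, start + 1)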

This is intentionally simple. It breaks on headings, falls back to paragraph boundaries, and only uses character-based sizing as a last resort. For PDFs and HTML you preprocess to markdown first; the chunker stays the same.
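The overlap stepping is the part that is easiest to get subtly wrong, so it helps to look at it in isolation. Here is a minimal sketch of a character window with overlap; the function name and the toy sizes are mine, chosen so the result is easy to check by eye.

```python
def sliding_windows(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of at most `size` chars that share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    windows = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        windows.append(text[start:end])
        if end == len(text):
            break                  # last window emitted; do not step back again
        start = end - overlap      # neighbours share `overlap` characters
    return windows

windows = sliding_windows("abcdefghij", size=4, overlap=1)
print(windows)  # ['abcd', 'defg', 'ghij']
```

Each window starts on the last character of the previous one, which is the behaviour you want from overlap: a sentence cut at a boundary still appears whole in at least one chunk.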

Step 3, Indexing

# index.py
import glob, hashlib
from sentence_transformers import SentenceTransformer
import chromadb
from chunker import chunk_markdown

EMBED_MODEL = "BAAI/bge-small-en-v1.5"
CHROMA_PATH = "./chroma_db"
COLLECTION = "docs"

embedder = SentenceTransformer(EMBED_MODEL, device="cuda")
client = chromadb.PersistentClient(path=CHROMA_PATH)
collection = client.get_or_create_collection(
    name=COLLECTION,
    metadata={"hnsw:space": "cosine"},
)

def doc_id(path: str) -> str:
    return hashlib.sha1(path.encode()).hexdigest()[:12]

def index_folder(folder: str):
    paths = glob.glob(f"{folder}/**/*.md", recursive=True)
    for path in paths:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        chunks = list(chunk_markdown(text, doc_id(path)))
        if not chunks:
            continue
        texts = [c.text for c in chunks]
        # BGE v1.5 embeds passages as-is; no prefix is needed on indexed content
        vectors = embedder.encode(texts, batch_size=32, show_progress_bar=False)
        collection.upsert(
            ids=[f"{c.doc_id}:{c.chunk_id}" for c in chunks],
            documents=texts,
            embeddings=vectors.tolist(),
            metadatas=[{"path": path, "chunk_id": c.chunk_id} for c in chunks],
        )
        print(f"indexed {path}: {len(chunks)} chunks")

if __name__ == "__main__":
    index_folder("./docs")

Note what is not here: a passage: prefix. That convention belongs to E5 models. The BGE v1.5 model card specifies no instruction for passages and an optional query-side instruction (Represent this sentence for searching relevant passages:). Always read your embedder’s model card for required prefixes.

Step 4, Retrieval with Reranking

Pure dense retrieval finds documents that are semantically similar to the query, which is usually close to but not exactly what you want. A reranker — a small cross-encoder that scores (query, candidate) pairs — fixes the top-of-list quality issue.

# retrieve.py
from sentence_transformers import SentenceTransformer, CrossEncoder
import chromadb

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5", device="cuda")
reranker = CrossEncoder("BAAI/bge-reranker-base", device="cuda")

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("docs")

def retrieve(query: str, k_dense: int = 30, k_final: int = 5) -> list[dict]:
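    # BGE's model card specifies this instruction for queries (not "query: ")
    q_vec = embedder.encode(
        f"Represent this sentence for searching relevant passages: {query}"
    ).tolist()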
    res = collection.query(query_embeddings=[q_vec], n_results=k_dense)
    candidates = [
        {"text": doc, "meta": meta, "score": 1 - dist}
        for doc, meta, dist in zip(res["documents"][0],
                                    res["metadatas"][0],
                                    res["distances"][0])
    ]
    if not candidates:
        return []
    pairs = [(query, c["text"]) for c in candidates]
    rerank_scores = reranker.predict(pairs)
    for c, s in zip(candidates, rerank_scores):
        c["rerank"] = float(s)
    candidates.sort(key=lambda c: c["rerank"], reverse=True)
    return candidates[:k_final]

Pull 30 candidates with dense search, rerank, keep the top 5. The reranker call is the expensive part (50-200ms for 30 pairs on a small GPU). It’s worth it; my measured top-1 accuracy went from 67% to 89% on a hand-labeled eval set by adding the reranker.
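The reranker score is also a cheap signal for knowing when retrieval failed: if nothing clears a minimum score, it is probably better to return nothing than to hand the SLM five weak chunks. This is a sketch that works on dicts shaped like the ones retrieve() returns. The threshold of 0.3 is a made-up value, and the score scale depends on the reranker and how it is loaded, so tune it on your own eval set.

```python
def filter_by_rerank(candidates: list[dict], min_score: float) -> list[dict]:
    """Keep candidates whose reranker score clears the threshold, best first."""
    kept = [c for c in candidates if c["rerank"] >= min_score]
    return sorted(kept, key=lambda c: c["rerank"], reverse=True)

# Toy candidates; the texts and scores are invented for illustration.
candidates = [
    {"text": "Q3 2024 EBITDA summary", "rerank": 0.91},
    {"text": "Office seating plan", "rerank": 0.04},
    {"text": "Q3 planning notes", "rerank": 0.42},
]
confident = filter_by_rerank(candidates, min_score=0.3)
print([c["text"] for c in confident])
```

An empty result is then a clean trigger for the "no relevant documents" path instead of a generation call.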

Step 5, Grounded Generation

Now we feed the retrieved chunks to the SLM with a strict instruction to stay grounded.

# answer.py
from ollama import Client
from retrieve import retrieve

client = Client()

SYSTEM = """You answer questions using only the provided context. If the
answer is not in the context, say "I don't know based on the provided
documents." Quote document paths in square brackets like [path/to/doc.md].
Be terse."""

def answer(question: str) -> str:
    chunks = retrieve(question, k_final=5)
    if not chunks:
        return "No relevant documents found."

    context_parts = []
    for i, c in enumerate(chunks, 1):
        context_parts.append(f"[{c['meta']['path']}]\n{c['text']}")
    context = "\n\n---\n\n".join(context_parts)

    resp = client.chat(
        model="llama3.2:3b-instruct-q4_K_M",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user",
             "content": f"Context:\n\n{context}\n\nQuestion: {question}"},
        ],
        options={"temperature": 0.1, "num_ctx": 8192},
    )
    return resp["message"]["content"]

if __name__ == "__main__":
    import sys
    print(answer(" ".join(sys.argv[1:])))

A few choices to notice:

  • temperature=0.1 keeps generation conservative. Higher values invite the model to “fill in” from its priors, which defeats RAG.
  • num_ctx=8192 is sized to fit five 500-token chunks plus instructions plus answer.
  • The system prompt explicitly grants the “I don’t know” escape hatch. Without it, 3B models confabulate.
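The num_ctx sizing can be enforced rather than assumed. Here is a rough sketch that trims the ranked chunk list to a token budget using the same 4-characters-per-token heuristic as the chunker. The helper names and the 1024 tokens reserved for instructions and the answer are my assumptions, not measured values.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 characters per token heuristic for English."""
    return max(1, len(text) // 4)

def fit_to_budget(chunks: list[str], num_ctx: int = 8192,
                  reserved_tokens: int = 1024) -> list[str]:
    """Keep chunks in rank order until the estimated total would exceed the budget."""
    budget = num_ctx - reserved_tokens
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept

chunks = ["x" * 2000] * 5   # five chunks of about 500 estimated tokens each
fits_default = len(fit_to_budget(chunks))                # budget 8192 - 1024
fits_small = len(fit_to_budget(chunks, num_ctx=2048))    # budget 2048 - 1024
print(fits_default, fits_small)  # 5 2
```

Dropping from the bottom of the ranked list keeps the best-scored chunks, which matters more for a 3B model than squeezing in one extra passage.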

The output looks like:

$ python answer.py "What temperature should I use for extraction?"
For extraction tasks, use a temperature between 0.0 and 0.3 [docs/structured-output.md].
Higher temperatures are appropriate only for generative tasks.

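Hybrid Search

Dense retrieval misses keyword queries. “What did we say about Q3 2024 EBITDA?” might not land near the right chunks in embedding space, because exact terms like “EBITDA” carry the meaning. Add BM25 for keyword recall (this needs the rank_bm25 package, which is not in the install list above).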

from rank_bm25 import BM25Okapi
# build BM25 over the same chunks at index time
bm25 = BM25Okapi([chunk.text.split() for chunk in all_chunks])

def hybrid_retrieve(query: str, k: int = 30):
    dense = retrieve(query, k_dense=k)
    bm25_scores = bm25.get_scores(query.split())
    # fuse with reciprocal rank fusion
    # ...

For a corpus over 50k chunks, I’d use Qdrant for native hybrid search instead of building it on Chroma. Below that, the manual approach works.
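The fusion step itself is short. Below is a minimal sketch of reciprocal rank fusion over lists of chunk ids, where each list contributes 1 / (rrf_k + rank) to an id's score. The function name, the example ids, and rrf_k=60 (a commonly used constant) are my choices here, not something the pipeline above defines.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], rrf_k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids; each list adds 1 / (rrf_k + rank) per id."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(fused, key=lambda cid: fused[cid], reverse=True)

dense_ids = ["a:0", "b:2", "c:1"]   # best-first ids from dense search
bm25_ids = ["b:2", "d:4", "a:0"]    # best-first ids from BM25
fused_ids = reciprocal_rank_fusion([dense_ids, bm25_ids])
print(fused_ids)  # ['b:2', 'a:0', 'd:4', 'c:1']
```

Because it only looks at ranks, this sidesteps the problem that BM25 scores and cosine similarities live on different scales. Feed the fused list to the reranker as before.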

Common Pitfalls

  1. Embedding model mismatch between index and query. If you index with bge-small-en-v1.5 and query with bge-small-en (no version), embeddings live in different spaces. Cosine similarity is meaningless. Fix: pin the exact model string in one place and import it everywhere.

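  2. Skipping or mixing up prefixes. Each model family has its own convention: E5 uses query: and passage:, nomic-embed-text uses search_query: and search_document:, and BGE uses a query-side instruction (Represent this sentence for searching relevant passages:) with nothing on passages. Get it wrong and retrieval quality silently drops 5-15%. Fix: read the model card. There is no universal answer.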

  3. Chunking too small. 100-token chunks lose all context. The model can’t tell what document it’s looking at. Fix: 300-500 tokens, broken on semantic boundaries.

  4. Believing the model when it answers without retrieved support. SLMs will happily answer from priors when retrieval returns garbage. Without the explicit “I don’t know” escape, you get authoritative-sounding hallucinations. Fix: build the escape into the system prompt and audit refusal rate as a metric.
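Pitfalls 1 and 2 both go away if the model name and its prefix rules live in one module that indexing and querying import. This is a sketch of such a module. The module name, the helper names, and the substring-based family lookup are my inventions, and the prefix table reflects my reading of the model cards, so verify it against the card for the exact model you use.

```python
# rag_config.py (hypothetical module name)
EMBED_MODEL = "BAAI/bge-small-en-v1.5"

# (query prefix, passage prefix) per model family; check the model card.
PREFIXES = {
    "bge": ("Represent this sentence for searching relevant passages: ", ""),
    "e5": ("query: ", "passage: "),
    "nomic": ("search_query: ", "search_document: "),
}

def family_of(model_name: str) -> str:
    """Map a model name to a prefix family by substring match."""
    name = model_name.lower()
    for family in PREFIXES:
        if family in name:
            return family
    raise ValueError(f"no prefix rule recorded for {model_name!r}")

def format_query(text: str, model_name: str = EMBED_MODEL) -> str:
    return PREFIXES[family_of(model_name)][0] + text

def format_passage(text: str, model_name: str = EMBED_MODEL) -> str:
    return PREFIXES[family_of(model_name)][1] + text

def assert_same_model(indexed_with: str, querying_with: str) -> None:
    """Fail loudly instead of silently comparing vectors from different spaces."""
    if indexed_with != querying_with:
        raise ValueError(f"index built with {indexed_with!r}, "
                         f"query uses {querying_with!r}")

print(format_query("what is chunk overlap?"))
```

Storing the model name in the collection metadata at index time and passing it to assert_same_model at query time turns a silent quality drop into an immediate error.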

Troubleshooting

Symptom: Retrieval returns irrelevant chunks. Diagnose: Probably chunking. Inspect the actual chunks coming back. If they’re tiny fragments missing context, fix the chunker. If they’re huge mixed-topic dumps, fix the chunker the other way.

Symptom: Generation cites sources but the citation is hallucinated. Diagnose: The model invented a path. SLMs do this when the context is too long and they lose track. Fix: shorten context, or use constrained decoding to limit citations to a closed list of paths.
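Short of constrained decoding, a post-hoc check catches invented paths cheaply. This sketch pulls bracketed citations out of the answer and compares them with the paths that were actually retrieved. It assumes every bracketed span in the answer is meant as a citation, which holds for the system prompt above but may not for yours; the second path in the example is invented to show a miss.

```python
import re

def hallucinated_citations(answer: str, retrieved_paths: set[str]) -> set[str]:
    """Return bracketed citations in the answer that were not among the retrieved paths."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return cited - retrieved_paths

answer_text = ("Use a temperature between 0.0 and 0.3 [docs/structured-output.md]. "
               "See also [docs/sampling-guide.md].")
retrieved = {"docs/structured-output.md", "docs/chunking.md"}
bad = hallucinated_citations(answer_text, retrieved)
print(bad)  # {'docs/sampling-guide.md'}
```

A non-empty result is a reasonable trigger for a retry with shorter context, or for stripping the bad citation before showing the answer.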

Symptom: ChromaDB queries get slow after a few hundred thousand chunks. Diagnose: Chroma’s HNSW index isn’t tuned for huge corpora. Either move to Qdrant or pre-filter by metadata before vector search. Pass where={"namespace": "..."} if you can partition.

Evaluating Your RAG Pipeline

You cannot improve a RAG system you can’t measure. Three metrics I track on every pipeline change.

Retrieval recall@k. Given a hand-labeled set of (question, relevant_doc) pairs, what fraction of the time does the relevant doc appear in the top-k retrieved chunks? This isolates retrieval quality from generation quality.

from retrieve import retrieve

def recall_at_k(questions, k=5):
    hits = 0
    for q in questions:
        chunks = retrieve(q["question"], k_final=k)
        paths = {c["meta"]["path"] for c in chunks}
        if q["relevant_path"] in paths:
            hits += 1
    return hits / len(questions)

If recall@5 is below 80%, fix retrieval before touching anything else. The generator can’t summarize a doc it never sees.

Answer faithfulness. Does the generated answer actually use information from the retrieved chunks, or does it lean on the model’s pretrained knowledge? You can measure this crudely by checking whether the answer’s claims appear in the source chunks. Better: use a second SLM call with a faithfulness prompt.
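The crude version of that check can be a word-overlap test per sentence. This is a sketch, not a real faithfulness metric: it flags answer sentences whose words mostly do not appear in any retrieved chunk. The function name and the 0.6 threshold are arbitrary choices of mine, and paraphrased but faithful answers will produce false alarms.

```python
import re

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def unsupported_sentences(answer: str, chunks: list[str],
                          min_overlap: float = 0.6) -> list[str]:
    """Return answer sentences whose words are mostly absent from every chunk."""
    chunk_words = [_words(c) for c in chunks]
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _words(sentence)
        if not words:
            continue
        # best fraction of this sentence's words found in any single chunk
        best = max((len(words & cw) / len(words) for cw in chunk_words), default=0.0)
        if best < min_overlap:
            flagged.append(sentence)
    return flagged

chunks = ["For extraction tasks, use a temperature between 0.0 and 0.3."]
answer_text = ("Use a temperature between 0.0 and 0.3 for extraction. "
               "The model was trained on Mars.")
flagged = unsupported_sentences(answer_text, chunks)
print(flagged)  # ['The model was trained on Mars.']
```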

Refusal rate. Track how often the model correctly answers “I don’t know” on questions where the retrieved chunks genuinely don’t contain the answer. A refusal rate of zero is a red flag: it means the model is confabulating on hard cases.
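Refusal tracking needs a labeled set that includes questions the corpus cannot answer. Here is a sketch of the bookkeeping, assuming each result is a dict with the generated answer and a hand-labeled answerable flag. The field names, the marker string, and the example answers are mine.

```python
REFUSAL_MARKER = "i don't know"

def is_refusal(answer: str) -> bool:
    """True if the answer contains the refusal phrase (curly apostrophes normalised)."""
    return REFUSAL_MARKER in answer.lower().replace("\u2019", "'")

def refusal_metrics(results: list[dict]) -> dict[str, float]:
    """results: dicts with 'answer' (str) and 'answerable' (bool, hand-labeled)."""
    unanswerable = [r for r in results if not r["answerable"]]
    answerable = [r for r in results if r["answerable"]]
    return {
        # refusals on questions the corpus cannot answer (higher is better)
        "correct_refusal_rate":
            sum(is_refusal(r["answer"]) for r in unanswerable) / max(1, len(unanswerable)),
        # refusals on questions the corpus can answer (lower is better)
        "false_refusal_rate":
            sum(is_refusal(r["answer"]) for r in answerable) / max(1, len(answerable)),
    }

results = [
    {"answer": "I don't know based on the provided documents.", "answerable": False},
    {"answer": "The budget was 4 million.", "answerable": False},   # confabulated
    {"answer": "Use 0.1 [docs/a.md].", "answerable": True},
    {"answer": "I don't know based on the provided documents.", "answerable": True},
]
metrics = refusal_metrics(results)
print(metrics)
```

Tracking both rates matters: a prompt change that pushes correct refusals up often pushes false refusals up with it.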

A Word on Privacy

People build local RAG specifically because their data is sensitive. A few things to verify:

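  • Embedding model. Hugging Face libraries contact the Hub when loading a model and may send usage telemetry. Once the model files are cached, set HF_HUB_OFFLINE=1 (and HF_HUB_DISABLE_TELEMETRY=1) and run inference offline if your threat model requires it. Note that pip-audit only checks packages for known vulnerabilities; it does not detect telemetry.
  • Vector DB persistence. ChromaDB persists to unencrypted SQLite and index files on disk. Encrypt the disk if your data warrants it.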
  • Logs. Your application probably logs query strings. Treat those as sensitive — they often contain user PII.
  • Model context. When you pass retrieved chunks to a model, you’re committing them to GPU memory temporarily. If you’re on shared hardware (cloud GPU instance), this matters.
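Two of these checks are easy to put in code. The sketch below sets the Hugging Face offline and telemetry variables before any model library is imported, and logs a keyed hash of each query instead of the query text. It assumes the model files are already cached locally; LOG_KEY and the helper name are placeholders of mine, and the key should come from your own secret store.

```python
import hashlib
import hmac
import os

# Assumption: model files are already in the local cache. Set these before
# importing sentence_transformers or transformers so they take effect.
os.environ["HF_HUB_OFFLINE"] = "1"            # no network calls to the Hub
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"  # no usage telemetry

LOG_KEY = b"replace-with-a-secret-from-your-own-config"  # hypothetical placeholder

def loggable_query_id(query: str) -> str:
    """Keyed hash of a query, so logs can correlate requests without storing the text."""
    return hmac.new(LOG_KEY, query.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

query_id = loggable_query_id("What did we say about Q3 2024 EBITDA?")
print(query_id)
```

A keyed hash rather than a plain one means someone with the logs alone cannot confirm a guessed query by hashing it themselves.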

Nothing about local RAG is inherently more private than a managed service unless you treat it that way operationally. The architecture is private-capable; the deployment is what makes it private-actual.

Wrapping Up

Local RAG is no longer hard. The pieces — embedders, vector DBs, SLMs, rerankers — all run on a single machine and are robust enough for real use. What stays hard is making them all agree on what “relevant” means for your domain, and tuning chunking and prompts until the model stops confabulating. Spend your engineering effort there, not on swapping vector databases. The last post in this series covers benchmarking, because if you can’t measure your RAG system, you can’t improve it. See the ChromaDB docs and the BGE model card for further reading.