Building Semantic Search From Scratch: A Production Walkthrough
TL;DR
- A real semantic search system has four layers: ingestion, embedding, indexing, and serving. Most failures happen at the ingestion and chunking layer, not at the vector DB.
- Treat embedding as a batch job with retries and dead-lettering, not a fire-and-forget call.
- Serve queries through a thin layer that handles query embedding, retrieval, and reranking, with caching at every stage.
I’ve now built three semantic search systems in the last year, and they all looked roughly the same. The vector database changed, the embedding model changed, the framing of “what gets searched” changed, but the architecture didn’t. This post is that architecture, written down with code that runs.
I’ll use Pinecone for the vector store and OpenAI’s text-embedding-ada-002 for the embeddings, because that combination is the fastest path to a working system. The patterns apply to any vector DB and any embedding model. If you want to compare those choices first, the embedding models deep-dive earlier this week has the tradeoffs.
The example corpus is documentation pages, because that’s the cleanest case. The same shape applies to chat logs, knowledge base articles, code snippets, or product catalogs.
The architecture
              +-----------+    +------------+
documents --> | ingestion | -> |  chunker   |
              +-----------+    +------------+
                            |
                            v
              +-----------+    +-----------+
              | embedder  | -> | vector DB |
              +-----------+    +-----------+
                                     ^
                                     |
                               +-----------+
                  query -----> | retriever |
                               +-----------+
                                     |
                                     v
                               +-----------+
                               | reranker  |
                               +-----------+
                                     |
                                     v
                             [ranked results]
Four jobs, each independently scalable. Ingestion and chunking are coupled in practice but logically separable. The reranker is optional but almost always worth adding once basic retrieval works.
Ingestion and chunking
The hardest part of every system I’ve built has been getting documents into a clean canonical form. HTML, PDF, Markdown, Confluence exports, Notion exports - they all have their own quirks, and the worst bugs are silent quality degradations from bad parsing.
A pattern that’s worked for me: ingest into an intermediate JSON representation with a strict schema, then chunk from that.
from dataclasses import dataclass, asdict
from typing import Iterator
import hashlib
import json
import re

@dataclass
class Document:
    id: str
    source: str
    title: str
    text: str
    url: str
    updated_at: str

@dataclass
class Chunk:
    id: str
    doc_id: str
    text: str
    position: int
    metadata: dict

def normalize_html(html: str) -> str:
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    return re.sub(r"\n{3,}", "\n\n", text).strip()
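The "strict schema" step can be sketched as a JSON Lines round trip that rejects anything that doesn't match the Document fields exactly. The helper names document_to_json and document_from_json are hypothetical, and the all-strings check is an assumption based on the dataclass above, not a description of any particular pipeline.

```python
import json
from dataclasses import dataclass, asdict, fields


@dataclass
class Document:
    id: str
    source: str
    title: str
    text: str
    url: str
    updated_at: str


def document_to_json(doc: Document) -> str:
    """Serialize a Document to one JSON line with stable key order."""
    return json.dumps(asdict(doc), sort_keys=True)


def document_from_json(line: str) -> Document:
    """Parse one JSON line, rejecting missing, extra, or non-string fields."""
    raw = json.loads(line)
    expected = {f.name for f in fields(Document)}
    if not isinstance(raw, dict) or set(raw) != expected:
        raise ValueError(f"schema mismatch: expected exactly {sorted(expected)}")
    for name, value in raw.items():
        if not isinstance(value, str):
            raise ValueError(f"field {name!r} must be a string")
    return Document(**raw)
```

Failing loudly at this boundary is the point: a parser that silently drops a field shows up here as an exception instead of as degraded search quality later.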
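LangChain’s recursive splitter is the best general-purpose chunking strategy I’ve found. The version below is a simpler fixed-size sliding window over tokens, which is enough to show the two parameters that matter: the chunk size (max_tokens, counted in tokens, not characters - this matters) and the overlap between consecutive chunks: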
import tiktoken

ENC = tiktoken.encoding_for_model("text-embedding-ada-002")

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    # overlap must be smaller than max_tokens, or the window never advances
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    tokens = ENC.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk = ENC.decode(tokens[start:end])
        chunks.append(chunk)
        if end == len(tokens):
            break
        # Step back so consecutive chunks share `overlap` tokens
        start = end - overlap
    return chunks

def chunks_for_doc(doc: Document) -> Iterator[Chunk]:
    pieces = chunk_text(doc.text)
    for i, piece in enumerate(pieces):
        # Deterministic ID: same (doc_id, position) always hashes to the same value
        chunk_id = hashlib.sha256(f"{doc.id}:{i}".encode()).hexdigest()[:16]
        yield Chunk(
            id=chunk_id,
            doc_id=doc.id,
            text=piece,
            position=i,
            metadata={
                "source": doc.source,
                "title": doc.title,
                "url": doc.url,
                "updated_at": doc.updated_at,
            },
        )
The chunk_id is deterministic on (doc_id, position), which is critical for idempotent reingestion. If you re-run ingestion, you get the same IDs, and upserts overwrite cleanly. The non-deterministic version of this bug has cost me a week of debugging.
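To make the idempotency point concrete, here is a small simulation. A plain dict stands in for the vector index, and simulate_reingestion and random_chunk_id are hypothetical names used only for this illustration.

```python
import hashlib
import uuid


def deterministic_chunk_id(doc_id: str, position: int) -> str:
    # Same scheme as chunks_for_doc: hash of "doc_id:position"
    return hashlib.sha256(f"{doc_id}:{position}".encode()).hexdigest()[:16]


def random_chunk_id(doc_id: str, position: int) -> str:
    # The buggy alternative: a fresh ID on every ingestion run
    return uuid.uuid4().hex[:16]


def simulate_reingestion(make_id, runs: int = 2, chunks_per_doc: int = 3) -> int:
    """Upsert the same document `runs` times; return how many vectors the index holds."""
    fake_index = {}  # dict as a stand-in for the vector index (upsert = key assignment)
    for _ in range(runs):
        for position in range(chunks_per_doc):
            fake_index[make_id("doc-1", position)] = f"chunk {position}"
    return len(fake_index)
```

With deterministic IDs the second run overwrites the first, so the count stays at chunks_per_doc. With random IDs every run adds a full set of duplicates, which then surface as repeated search results.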
Embedding
Embedding looks simple but has three production concerns: batching, retry on rate limits, and dead-lettering for inputs that fail repeatedly.
import time
import openai
from openai.error import RateLimitError, APIError  # pre-1.0 openai SDK

def embed_batch(texts: list[str], max_retries: int = 5) -> list[list[float]]:
    for attempt in range(max_retries):
        try:
            response = openai.Embedding.create(
                input=texts,
                model="text-embedding-ada-002",
            )
            return [item["embedding"] for item in response["data"]]
        except (RateLimitError, APIError):
            # Re-raise on the final attempt instead of sleeping once more
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError("embedding failed after retries")

def embed_chunks(chunks: list[Chunk], batch_size: int = 100):
    # Yields (chunk, vector) pairs, one API call per batch
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c.text for c in batch]
        vectors = embed_batch(texts)
        for chunk, vector in zip(batch, vectors):
            yield chunk, vector
The exponential backoff is critical. As of April 2023, OpenAI’s rate limits are organization-wide and shared across calls, so a noisy neighbor in your org can push you into 429s.
For corpora above a few hundred thousand chunks, I run embedding as a queue-backed worker job with at-least-once delivery, where each message is a batch of chunks. Failed batches go to a dead-letter queue I inspect manually. The naive “loop through everything in one process” approach is fine up to maybe 50K chunks; past that, queue it.
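The worker loop can be sketched as follows. This is a minimal in-process version that uses queue.Queue as a stand-in for a real message broker. The function name run_embedding_worker and the embed and sink callables are assumptions for illustration; in the pipeline above they would correspond to embed_batch and an upsert step.

```python
import queue
from typing import Callable

Batch = list[str]  # one message = one batch of chunk texts (Python 3.9+)


def run_embedding_worker(
    work: "queue.Queue[Batch]",
    dead_letters: "queue.Queue[tuple[Batch, str]]",
    embed: Callable[[Batch], list[list[float]]],
    sink: Callable[[Batch, list[list[float]]], None],
) -> None:
    """Drain `work`; any batch whose processing raises goes to `dead_letters`."""
    while True:
        try:
            batch = work.get_nowait()
        except queue.Empty:
            return
        try:
            vectors = embed(batch)  # expected to retry internally, like embed_batch
            sink(batch, vectors)    # must be idempotent under at-least-once delivery
        except Exception as exc:
            # Keep the failure reason next to the batch for manual inspection
            dead_letters.put((batch, repr(exc)))
        finally:
            work.task_done()
```

At-least-once delivery means a batch can be processed twice, which is safe only because the chunk IDs are deterministic and the sink is an upsert.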
Indexing into Pinecone
This is the part covered in the Pinecone walkthrough, so I’ll keep it short:
import pinecone

pinecone.init(api_key="...", environment="us-west1-gcp")

INDEX_NAME = "docs"

if INDEX_NAME not in pinecone.list_indexes():
    pinecone.create_index(
        name=INDEX_NAME,
        dimension=1536,  # output size of text-embedding-ada-002
        metric="cosine",
        pod_type="p1.x1",
        # Only index the fields used for filtering; "text" stays unindexed
        metadata_config={"indexed": ["source", "doc_id"]},
    )

index = pinecone.Index(INDEX_NAME)

def upsert(chunks_with_vectors):
    batch = []
    for chunk, vector in chunks_with_vectors:
        batch.append((
            chunk.id,
            vector,
            {**chunk.metadata, "doc_id": chunk.doc_id, "text": chunk.text},
        ))
        if len(batch) >= 100:
            index.upsert(vectors=batch)
            batch = []
    if batch:
        index.upsert(vectors=batch)
Storing text in metadata is a tradeoff. It costs RAM in Pinecone, but it saves a round trip to a separate document store at query time. For corpora under a few million chunks, I keep it. Past that, store text in S3 or Postgres and fetch lazily.
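For the "fetch lazily" variant, a sketch of the hydration step might look like this. The text_store mapping is a stand-in for S3 or Postgres, and hydrate_matches is a hypothetical helper. Matches are assumed to be plain dicts with "id" and "metadata" keys.

```python
from typing import Mapping


def hydrate_matches(matches: list[dict], text_store: Mapping[str, str]) -> list[dict]:
    """Attach chunk text to matches whose metadata carries no 'text' key."""
    hydrated = []
    for match in matches:
        metadata = dict(match.get("metadata", {}))
        if "text" not in metadata:
            # Chunks missing from the store get an empty string rather than raising
            metadata["text"] = text_store.get(match["id"], "")
        hydrated.append({**match, "metadata": metadata})
    return hydrated
```

In a real deployment the per-match lookup would be replaced by a single batched fetch keyed on the chunk IDs, so the extra round trip is paid once per query rather than once per result.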
The serving layer
The serving layer is a thin FastAPI service. The endpoint takes a query, embeds it, retrieves from Pinecone, optionally reranks, and returns results.
from fastapi import FastAPI, Query
from pydantic import BaseModel

app = FastAPI()

class SearchResult(BaseModel):
    chunk_id: str
    doc_id: str
    text: str
    score: float
    metadata: dict

class SearchResponse(BaseModel):
    query: str
    results: list[SearchResult]

# Plain `def`, not `async def`: FastAPI runs it in a threadpool, so the blocking
# embedding, Pinecone, and reranker calls don't stall the event loop.
@app.get("/search", response_model=SearchResponse)
def search(q: str = Query(..., min_length=1), top_k: int = 10):
    q_vec = embed_batch([q])[0]
    response = index.query(
        vector=q_vec,
        top_k=top_k * 3,  # over-fetch for reranking
        include_metadata=True,
    )
    candidates = [
        SearchResult(
            chunk_id=m["id"],
            doc_id=m["metadata"]["doc_id"],
            text=m["metadata"]["text"],
            score=m["score"],
            metadata=m["metadata"],
        )
        for m in response["matches"]
    ]
    reranked = rerank(q, candidates)[:top_k]
    return SearchResponse(query=q, results=reranked)
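Two design choices worth calling out. First, over-fetching by 3x and reranking has been consistently better in my systems than fetching top_k and trusting the vector similarity ordering. Second, I dedupe by doc_id in the reranker, because returning three chunks from the same document is usually worse UX than one chunk each from three documents. The dedupe can leave you with fewer than top_k results when the candidates cluster in a few documents; raise the over-fetch factor if that happens often.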
Reranking
The cheapest useful reranker is a cross-encoder from sentence-transformers. It’s slower than the bi-encoder retrieval because it scores each (query, doc) pair jointly, but on the top 30 candidates it’s fast enough:
from sentence_transformers import CrossEncoder

CROSS = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[SearchResult]) -> list[SearchResult]:
    pairs = [(query, c.text) for c in candidates]
    scores = CROSS.predict(pairs)
    seen_docs = set()
    out = []
    for c, s in sorted(zip(candidates, scores), key=lambda x: -x[1]):
        if c.doc_id in seen_docs:
            continue
        seen_docs.add(c.doc_id)
        c.score = float(s)
        out.append(c)
    return out
The MS MARCO cross-encoder is trained on query-passage pairs and is the right model for retrieval reranking. Don’t use a general-purpose sentence similarity model here; the quality difference is large.
Caching
Three caches worth adding, in order of impact:
- Query embedding cache. Same query string, same model = same vector. Use Redis with a 24-hour TTL.
- Search result cache. Same query, same filters = same results until the index changes. Invalidate on index updates.
- Reranker cache. Same (query, candidate) pair = same score. This one matters less because the cross-encoder is fast, but at high QPS it helps.
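The first of these caches can be sketched as below. This is an in-memory stand-in, not a Redis client: the class name TTLEmbeddingCache, the injected embed callable, and the whitespace-stripping key normalization are all assumptions for illustration. The model name is part of the key so that switching models never serves stale vectors.

```python
import hashlib
import time
from typing import Callable


class TTLEmbeddingCache:
    """In-memory stand-in for a Redis cache keyed on (model, query string)."""

    def __init__(
        self,
        embed: Callable[[str], list[float]],
        model: str,
        ttl_seconds: float = 24 * 3600,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._embed = embed
        self._model = model
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, vector)

    def _key(self, query: str) -> str:
        # Assumption: only surrounding whitespace is normalized away
        return hashlib.sha256(f"{self._model}:{query.strip()}".encode()).hexdigest()

    def get(self, query: str) -> list[float]:
        key = self._key(query)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # cache hit: no embedding call
        vector = self._embed(query)
        self._entries[key] = (now, vector)
        return vector
```

Swapping the dict for Redis keeps the same shape: the hashed key becomes the Redis key, and the TTL moves from the timestamp check to the key's expiry.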
Common Pitfalls
- Not measuring latency end-to-end. A typical breakdown for a single query is 30-80ms for query embedding (OpenAI round trip), 20-50ms for Pinecone retrieval, and 50-150ms for cross-encoder reranking on the top 30. Cache the first one and you cut p50 by roughly a third on cache hits.
- Embedding queries without a timeout. OpenAI calls occasionally hang for 30+ seconds. Set a 5-second timeout and fall back to a degraded response.
- Confusing reindexing with reembedding. Reindexing means rebuilding the ANN index over the same vectors. Reembedding means running the embedding model again. Reembedding is the expensive one; design your pipeline so you can reindex without reembedding.
- Returning chunks instead of documents. Users want documents. Always group by doc_id in the response or you’ll get complaints about duplicates.
- Forgetting to handle empty results. A high-selectivity filter or an out-of-domain query can return zero matches. The frontend needs to know about this case, and the API should distinguish “no matches” from “error.”
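The timeout pitfall can be handled with a hard deadline around the embedding call. The sketch below uses a thread pool from the standard library. embed_with_timeout and the injected embed callable are hypothetical names, and the pool size of 8 is an arbitrary assumption. Where the HTTP client exposes its own request timeout, that is the simpler option.

```python
import concurrent.futures
from typing import Callable, Optional

_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def embed_with_timeout(
    embed: Callable[[str], list[float]],
    query: str,
    timeout_s: float = 5.0,
) -> Optional[list[float]]:
    """Return the query vector, or None if the call exceeds `timeout_s`.

    Exceptions raised by `embed` propagate, so callers can still tell
    "timed out" (None) apart from "failed" (exception).
    """
    future = _POOL.submit(embed, query)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() has no effect once the call is running; the worker thread
        # finishes in the background and its result is discarded
        future.cancel()
        return None
```

The caller maps None to whatever the degraded response is, for example an empty result list with a flag that tells the frontend the search was degraded rather than that nothing matched.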
Wrapping Up
The skeleton above is roughly 200 lines of real code, and it’s enough to run semantic search over a corpus of millions of chunks. The remaining 80% of the work is on quality - chunking, query reformulation, evaluation, and reranking - not on adding more services.
The next post in this series takes a sharp turn into self-hosted territory, walking through Milvus deployment for teams that have outgrown managed offerings.