April 13, 2023 · 7 min read · by Muhammad Amal ai

TL;DR:
A real semantic search system has four layers: ingestion, embedding, indexing, and serving. Most failures happen at the ingestion and chunking layer, not at the vector DB.
Treat embedding as a batch job with retries and dead-lettering, not a fire-and-forget call.
Serve queries through a thin layer that handles query embedding, retrieval, and reranking, with caching at every stage.

I’ve now built three semantic search systems in the last year, and they all looked roughly the same. The vector database changed, the embedding model changed, the framing of “what gets searched” changed, but the architecture didn’t. This post is that architecture, written down with code that runs.

I’ll use Pinecone for the vector store and OpenAI’s text-embedding-ada-002 for the embeddings, because that combination is the fastest path to a working system. The patterns apply to any vector DB and any embedding model. If you want to compare those choices first, the embedding models deep-dive earlier this week has the tradeoffs.

The example corpus is documentation pages, because that’s the cleanest case. The same shape applies to chat logs, knowledge base articles, code snippets, or product catalogs.

The architecture

              +-----------+    +------------+
documents --> | ingestion | -> | chunker    |
              +-----------+    +------------+
                                     |
                                     v
                              +-----------+    +-----------+
                              | embedder  | -> | vector DB |
                              +-----------+    +-----------+
                                                     ^
                                                     |
                                              +-----------+
                                       query  | retriever |
                                       ----> +-----------+
                                                     |
                                                     v
                                              +-----------+
                                              | reranker  |
                                              +-----------+
                                                     |
                                                     v
                                              [ranked results]

Four jobs, each independently scalable. The ingestion and chunker are coupled in practice but logically separable. The reranker is optional but almost always worth adding once basic retrieval works.

Ingestion and chunking

The hardest part of every system I’ve built has been getting documents into a clean canonical form. HTML, PDF, Markdown, Confluence exports, Notion exports - they all have their own quirks, and the worst bugs are silent quality degradations from bad parsing.

A pattern that’s worked for me: ingest into an intermediate JSON representation with a strict schema, then chunk from that.

from dataclasses import dataclass, asdict
from typing import Iterator
import hashlib
import json
import re

@dataclass
class Document:
    id: str
    source: str
    title: str
    text: str
    url: str
    updated_at: str

@dataclass
class Chunk:
    id: str
    doc_id: str
    text: str
    position: int
    metadata: dict

def normalize_html(html: str) -> str:
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    return re.sub(r"\n{3,}", "\n\n", text).strip()
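
To make the “strict schema” part concrete, here is a minimal sketch of writing documents to the intermediate JSON form and validating them on the way back in. The Document dataclass is repeated so the block runs on its own. The helper names to_json_line and from_json_line are hypothetical, and the validation rules (exact field set, string values, non-empty id and text) are my assumption about what “strict” should mean here.

```python
from dataclasses import dataclass, asdict, fields
import json

@dataclass
class Document:
    id: str
    source: str
    title: str
    text: str
    url: str
    updated_at: str

def to_json_line(doc: Document) -> str:
    # sort_keys keeps the output byte-stable across runs, so diffs stay readable.
    return json.dumps(asdict(doc), ensure_ascii=False, sort_keys=True)

def from_json_line(line: str) -> Document:
    data = json.loads(line)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {f.name for f in fields(Document)}
    if set(data) != expected:
        missing = sorted(expected - set(data))
        extra = sorted(set(data) - expected)
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    for name, value in data.items():
        if not isinstance(value, str):
            raise ValueError(f"field {name!r} must be a string")
    # Assumed rule: a document without an id or body is a parsing failure, not data.
    if not data["id"] or not data["text"].strip():
        raise ValueError("id and text must be non-empty")
    return Document(**data)
```

Rejecting bad records at this boundary turns a silent parsing problem into a loud one before anything is chunked or embedded.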

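The chunker below is a fixed-size token window with overlap. It is a simplified stand-in for the recursive strategy from LangChain, which is the best general-purpose approach I’ve found. The key parameters are the chunk size (max_tokens here) and the chunk overlap (overlap here). Both are measured in tokens, not characters, and this matters because the embedding model’s input limit is counted in tokens: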

import tiktoken

ENC = tiktoken.encoding_for_model("text-embedding-ada-002")

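def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    # overlap must be smaller than max_tokens, or the window never advances.
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    tokens = ENC.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(ENC.decode(tokens[start:end]))
        if end == len(tokens):
            break
        # Step back by `overlap` tokens so adjacent chunks share context.
        start = end - overlap
    return chunks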

def chunks_for_doc(doc: Document) -> Iterator[Chunk]:
    pieces = chunk_text(doc.text)
    for i, piece in enumerate(pieces):
        chunk_id = hashlib.sha256(f"{doc.id}:{i}".encode()).hexdigest()[:16]
        yield Chunk(
            id=chunk_id,
            doc_id=doc.id,
            text=piece,
            position=i,
            metadata={
                "source": doc.source,
                "title": doc.title,
                "url": doc.url,
                "updated_at": doc.updated_at,
            },
        )

The chunk_id is deterministic on (doc_id, position), which is critical for idempotent reingestion. If you re-run ingestion, you get the same IDs, and upserts overwrite cleanly. The non-deterministic version of this bug has cost me a week of debugging.
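
One consequence of position-based IDs is worth sketching: if a document gets shorter, the re-run produces fewer chunks, so upserts overwrite the first ones and the old trailing chunks stay in the index. The helper below lists the IDs to delete. It assumes you record how many chunks each document had on its previous run. That bookkeeping and the function names are hypothetical; the hashing matches chunks_for_doc above.

```python
import hashlib

def chunk_id(doc_id: str, position: int) -> str:
    # Same scheme as chunks_for_doc: hash of "doc_id:position", truncated to 16 hex chars.
    return hashlib.sha256(f"{doc_id}:{position}".encode()).hexdigest()[:16]

def stale_chunk_ids(doc_id: str, old_count: int, new_count: int) -> list[str]:
    """IDs written by the previous ingestion that the new one will not overwrite."""
    # If the document grew or stayed the same size, the range is empty.
    return [chunk_id(doc_id, i) for i in range(new_count, old_count)]
```

The returned IDs can then be passed to the vector store’s delete call, for example index.delete(ids=...) in the Pinecone client.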

Embedding

Embedding looks simple but has three production concerns: batching, retry on rate limits, and dead-lettering for inputs that fail repeatedly.

import time
import openai
from openai.error import RateLimitError, APIError

def embed_batch(texts: list[str], max_retries: int = 5) -> list[list[float]]:
    for attempt in range(max_retries):
        try:
            response = openai.Embedding.create(
                input=texts,
                model="text-embedding-ada-002",
            )
            return [item["embedding"] for item in response["data"]]
        except (RateLimitError, APIError):
            # Re-raise on the final attempt instead of sleeping for nothing.
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError("embedding failed after retries")

def embed_chunks(chunks: list[Chunk], batch_size: int = 100):
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c.text for c in batch]
        vectors = embed_batch(texts)
        for chunk, vector in zip(batch, vectors):
            yield chunk, vector

The exponential backoff is critical. OpenAI’s rate limits in April 2023 are organization-wide and shared across calls, so a noisy neighbor in your org can push you into 429s.

For corpora above a few hundred thousand chunks, I run embedding as a queue-backed worker job with at-least-once delivery, where each message is a batch of chunks. Failed batches go to a dead-letter queue I inspect manually. The naive “loop through everything in one process” approach is fine up to maybe 50K chunks; past that, queue it.
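
The dead-lettering logic can be sketched without any queue infrastructure. This is an in-process stand-in, not the worker I run: a real queue would handle redelivery and persistence, and process_batches, its parameters, and the injected embed callable are all hypothetical names for illustration.

```python
from typing import Callable, Iterable

Batch = list[str]
Vector = list[float]

def process_batches(
    batches: Iterable[Batch],
    embed: Callable[[Batch], list[Vector]],
    max_attempts: int = 3,
) -> tuple[list[tuple[Batch, list[Vector]]], list[tuple[Batch, str]]]:
    """Embed each batch; park batches that keep failing instead of halting the run."""
    done: list[tuple[Batch, list[Vector]]] = []
    dead_letter: list[tuple[Batch, str]] = []
    for batch in batches:
        last_error = ""
        for _ in range(max_attempts):
            try:
                done.append((batch, embed(batch)))
                break
            except Exception as exc:  # broad on purpose: any failure counts as an attempt
                last_error = repr(exc)
        else:
            # Loop finished without a break: every attempt failed.
            dead_letter.append((batch, last_error))
    return done, dead_letter
```

Keeping the last error next to the failed batch makes the manual inspection step much faster, since most dead-lettered batches fail for the same one or two reasons.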

Indexing into Pinecone

This is the part covered in the Pinecone walkthrough, so I’ll keep it short:

import pinecone

pinecone.init(api_key="...", environment="us-west1-gcp")

INDEX_NAME = "docs"
if INDEX_NAME not in pinecone.list_indexes():
    pinecone.create_index(
        name=INDEX_NAME,
        dimension=1536,
        metric="cosine",
        pod_type="p1.x1",
        metadata_config={"indexed": ["source", "doc_id"]},
    )

index = pinecone.Index(INDEX_NAME)

def upsert(chunks_with_vectors):
    batch = []
    for chunk, vector in chunks_with_vectors:
        batch.append((
            chunk.id,
            vector,
            {**chunk.metadata, "doc_id": chunk.doc_id, "text": chunk.text},
        ))
        if len(batch) >= 100:
            index.upsert(vectors=batch)
            batch = []
    if batch:
        index.upsert(vectors=batch)

Storing text in metadata is a tradeoff. It costs RAM in Pinecone, but it saves a round trip to a separate document store at query time. For corpora under a few million chunks, I keep it. Past that, store text in S3 or Postgres and fetch lazily.
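
For the “fetch lazily” side of that tradeoff, a minimal sketch: retrieve IDs and scores from the vector store, then attach text with one batched lookup. The hydrate function and the fetch_texts callable are hypothetical; fetch_texts stands in for whatever S3 or Postgres lookup you would write.

```python
from typing import Callable

Match = dict  # stand-in for a vector DB match, e.g. {"id": ..., "score": ...}

def hydrate(
    matches: list[Match],
    fetch_texts: Callable[[list[str]], dict[str, str]],
) -> list[Match]:
    """Attach chunk text to matches using a single batched lookup."""
    ids = [m["id"] for m in matches]
    texts = fetch_texts(ids) if ids else {}
    hydrated = []
    for m in matches:
        text = texts.get(m["id"])
        if text is None:
            # Index and document store can drift; skip rather than return a blank hit.
            continue
        hydrated.append({**m, "text": text})
    return hydrated
```

The batched lookup matters: one query for thirty IDs costs one round trip, while thirty single-ID fetches would erase the latency budget.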

The serving layer

The serving layer is a thin FastAPI service. The endpoint takes a query, embeds it, retrieves from Pinecone, optionally reranks, and returns results.

from fastapi import FastAPI, Query
from pydantic import BaseModel

app = FastAPI()

class SearchResult(BaseModel):
    chunk_id: str
    doc_id: str
    text: str
    score: float
    metadata: dict

class SearchResponse(BaseModel):
    query: str
    results: list[SearchResult]

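@app.get("/search", response_model=SearchResponse)
def search(q: str = Query(..., min_length=1), top_k: int = 10):
    # Plain `def`, not `async def`: embed_batch and index.query are blocking calls,
    # and FastAPI runs sync endpoints in a threadpool instead of stalling the event loop.
    q_vec = embed_batch([q])[0]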
    response = index.query(
        vector=q_vec,
        top_k=top_k * 3,  # over-fetch for reranking
        include_metadata=True,
    )

    candidates = [
        SearchResult(
            chunk_id=m["id"],
            doc_id=m["metadata"]["doc_id"],
            text=m["metadata"]["text"],
            score=m["score"],
            metadata=m["metadata"],
        )
        for m in response["matches"]
    ]

    reranked = rerank(q, candidates)[:top_k]
    return SearchResponse(query=q, results=reranked)

Two design choices worth calling out. First, over-fetching by 3x and reranking is consistently better than fetching top_k and trusting the vector similarity ordering. Second, I dedupe by doc_id in the reranker, because returning three chunks from the same document is usually worse UX than one chunk each from three documents.

Reranking

The cheapest useful reranker is a cross-encoder from sentence-transformers. It’s slower than bi-encoder retrieval because it runs the model once per (query, doc) pair, but on the top-30 candidates it’s fast enough:

from sentence_transformers import CrossEncoder

CROSS = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[SearchResult]) -> list[SearchResult]:
    pairs = [(query, c.text) for c in candidates]
    scores = CROSS.predict(pairs)
    seen_docs = set()
    out = []
    for c, s in sorted(zip(candidates, scores), key=lambda x: -x[1]):
        if c.doc_id in seen_docs:
            continue
        seen_docs.add(c.doc_id)
        c.score = float(s)
        out.append(c)
    return out

The MS MARCO cross-encoder is trained on query-passage pairs and is the right model for retrieval reranking. Don’t use a general-purpose sentence similarity model here; the quality difference is large.

Caching

Three caches worth adding, in order of impact:

Query embedding cache. Same query string, same model = same vector. Use Redis with a 24-hour TTL.
Search result cache. Same query, same filters = same results until the index changes. Invalidate on index updates.
Reranker cache. Same (query, candidate) pair = same score. This one matters less because the cross-encoder is fast, but at high QPS it helps.
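
The first cache can be sketched as follows. An in-memory dict stands in for Redis so the block runs on its own; with Redis you would replace the dict with a get and a set-with-expiry. QueryEmbeddingCache, its parameters, and the injected embed callable are hypothetical names.

```python
import hashlib
import time
from typing import Callable

class QueryEmbeddingCache:
    """TTL cache keyed on (model, exact query string)."""

    def __init__(
        self,
        embed: Callable[[str], list[float]],
        model: str,
        ttl_seconds: float = 24 * 3600,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._embed = embed
        self._model = model
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, list[float]]] = {}

    def _key(self, query: str) -> str:
        # The model name is part of the key so a model swap never serves stale vectors.
        raw = f"{self._model}\x00{query}".encode()
        return hashlib.sha256(raw).hexdigest()

    def get(self, query: str) -> list[float]:
        key = self._key(query)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        vector = self._embed(query)
        self._store[key] = (now + self._ttl, vector)
        return vector
```

The key uses the exact query string on purpose. Normalizing case or whitespace would raise the hit rate, but it also means serving a vector computed for slightly different text, so that is a choice to make deliberately.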

Common Pitfalls

Not measuring latency end-to-end. A typical breakdown for a single query is 30-80ms for query embedding (OpenAI round trip), 20-50ms for Pinecone retrieval, and 50-150ms for cross-encoder reranking on top-30. The embedding call is roughly a third of the total, so caching it cuts that much from every query that hits the cache.
Embedding queries without a timeout. OpenAI calls occasionally hang for 30+ seconds. Set a 5-second timeout and fall back to a degraded response.
Confusing reindexing with reembedding. Reindexing means rebuilding the ANN index over the same vectors. Reembedding means running the embedding model again. Reembedding is the expensive one; design your pipeline so you can reindex without reembedding.
Returning chunks instead of documents. Users want documents. Always group by doc_id in the response or you’ll get complaints about duplicates.
Forgetting to handle empty results. A high-selectivity filter or an out-of-domain query can return zero matches. The frontend needs to know about this case, and the API should distinguish “no matches” from “error.”
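
Two of these pitfalls, the missing timeout and the ambiguous empty response, can be handled in the same place. The sketch below wraps the embedding call in a thread-based timeout and returns an explicit status. The names call_with_timeout and search_status, the status strings, and the injected embed and retrieve callables are all hypothetical. Exceptions from either callable are left to propagate so real errors stay distinct from “no matches.”

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional

_POOL = ThreadPoolExecutor(max_workers=8)

def call_with_timeout(
    fn: Callable[[str], list[float]], arg: str, timeout_s: float
) -> Optional[list[float]]:
    """Return fn(arg), or None if it does not finish within timeout_s."""
    future = _POOL.submit(fn, arg)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: a call that is already running cannot be interrupted.
        future.cancel()
        return None

def search_status(
    q: str,
    embed: Callable[[str], list[float]],
    retrieve: Callable[[list[float]], list[dict]],
    timeout_s: float = 5.0,
) -> dict:
    q_vec = call_with_timeout(embed, q, timeout_s)
    if q_vec is None:
        return {"status": "degraded", "results": [], "reason": "embedding timeout"}
    matches = retrieve(q_vec)
    if not matches:
        return {"status": "no_matches", "results": []}
    return {"status": "ok", "results": matches}
```

What the degraded response contains is a product decision. An empty list with a clear status is the simplest option; a keyword-search fallback is a common alternative.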

Wrapping Up

The skeleton above is roughly 200 lines of real code, and it’s enough to run semantic search over a corpus of millions of chunks. The remaining 80% of the work is on quality - chunking, query reformulation, evaluation, and reranking - not on adding more services.

The next post in this series takes a sharp turn into self-hosted territory, walking through Milvus deployment for teams that have outgrown managed offerings.

Related posts

Pinecone in Production, Pod Sizing, Upserts, and the Cost Math That Surprises Teams

The Vector Database Landscape in 2023, Pinecone, Milvus, Weaviate, and Chroma Compared

LangChain 0.0.13x, The Framework, the Hype, and the Real Engineering Tradeoffs

Chroma 0.3, The Local-First Vector Database for Notebook-Scale Prototyping

Weaviate 1.18 and Hybrid Search, When Keyword and Vector Search Are Both Right

Milvus 2.2 in Production, Self-Hosting the Heavyweight Open-Source Vector Database

Embedding Models in 2023, ada-002, sentence-transformers, and What Actually Matters

Choosing a Vector Database, Pinecone vs Qdrant vs pgvector
