Embedding Strategies for Support Documentation in 2025
November 7, 2025 · 10 min read · by Muhammad Amal

TL;DR: Most support corpora don’t need a fine-tuned embedding; they need the right base model with hybrid retrieval and a reranker. The exception is a corpus with heavy domain jargon, error codes, or non-English content, in which case fine-tuning bge-m3 or nomic-embed-v2 on your own ticket pairs beats text-embedding-3-large.

Embeddings are the most over-discussed and under-evaluated part of the RAG stack. Every vendor will tell you their model is state of the art on MTEB, and that means almost nothing for your actual corpus. The benchmark suites are dominated by Wikipedia and academic text. Your support docs look nothing like Wikipedia.

This is the practical decision guide I use when scoping the embedding layer for a support RAG project. We’ll cover what changed in 2025 (a lot, actually), how to choose a base model, when fine-tuning is worth it, how to evaluate properly with your own data, and the operational gotchas around dimensionality, batch sizes, and re-indexing strategies.

If you’ve followed this series, you’ve already read RAG systems for technical support teams and built a clean knowledge base. Now we pick what represents documents in vector space.

What changed in 2025

Three concrete shifts. First, OpenAI’s text-embedding-3-large (3072 dim) got real competition from voyage-3-large (Voyage AI, January 2025), cohere-embed-english-v3.5, and the open-source bge-m3 and nomic-embed-v2 families. The gap that existed in early 2024 has closed.

Second, multilingual is finally usable. bge-m3 and voyage-multilingual-2 handle the top forty languages with no measurable quality drop. If your support corpus has tickets in German, Japanese, and Portuguese, you no longer need separate indexes per language.

Third, dimensionality is now a knob you can turn. Matryoshka representation learning means you can truncate text-embedding-3-large from 3072 down to 256 or 512 dimensions with a small recall cost and a huge storage and latency win. We’ll measure that explicitly.

            +---------------------+
            |  base embedding     |
            |  (3072 dim)         |
            +----------+----------+
                       | truncate
            +----------v----------+
            |  256-dim variant    |  --> fast L1 index
            +---------------------+
            |  1024-dim variant   |  --> high-recall reranking
            +---------------------+
            |  3072-dim variant   |  --> only for ambiguous queries
            +---------------------+

That’s a tiered embedding strategy. Most queries hit the 256-dim index, which is fast and cheap. Hard queries fall back to higher dimensions. We’ll build this.

Step 1, baseline your corpus

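Before picking a model, measure what you’ve got. The most useful numbers are the mean and 95th-percentile token counts of your chunks, plus the share of chunks that contain code blocks or error codes. The lexical-to-semantic ratio of your query distribution matters too, but that one has to be measured on your query logs rather than on the corpus.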

import re
import statistics

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

# Error-code shapes: ERR_*, E0*, hex literals, and HTTP 4xx/5xx statuses.
# A regex is used because "HTTP 404" spans two whitespace-separated tokens,
# so a per-token startswith("HTTP 4") check can never match.
ERROR_CODE_RE = re.compile(r"\b(?:ERR_\w+|E0\w+|0x[0-9A-Fa-f]+|HTTP [45]\d\d)\b")

def corpus_stats(chunks: list[str]) -> dict:
    lengths = [len(ENC.encode(c)) for c in chunks]
    code_blocks = sum(1 for c in chunks if "```" in c)
    error_codes = sum(1 for c in chunks if ERROR_CODE_RE.search(c))
    return {
        "n_chunks": len(chunks),
        "tokens_mean": statistics.mean(lengths),
        # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
        "tokens_p95": statistics.quantiles(lengths, n=20)[18],
        "tokens_max": max(lengths),
        "pct_with_code": code_blocks / len(chunks),
        "pct_with_error_codes": error_codes / len(chunks),
    }

If your corpus is more than 30% code blocks or error codes, you’ll see a real win from including a sparse retrieval signal (BM25) and possibly from picking an embedding model that was trained on code. If your corpus is mostly prose, general-purpose embeddings will do fine.
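
One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF). The sketch below is a minimal illustration of that technique, not something the benchmark in this article depends on. The document IDs and the two input rankings are made up, and k=60 is the conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for one query against two retrievers.
bm25_hits = ["kb-101", "kb-205", "kb-330"]   # exact error-code matches
dense_hits = ["kb-205", "kb-412", "kb-101"]  # semantic neighbours
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers agree on rise to the top, which is the behaviour you want when an error code matches lexically and the surrounding description matches semantically.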

Step 2, pick the base model with a real benchmark

Don’t trust MTEB. Build your own benchmark from a hundred to five hundred real queries with hand-graded relevant document IDs.

import cohere
import voyageai
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# Each hosted client picks up its API key from the environment.
oai = OpenAI()
co = cohere.Client()
vo = voyageai.Client()
bge = SentenceTransformer("BAAI/bge-m3")

def embed_openai(texts, model="text-embedding-3-large", dim=None):
    kwargs = {"model": model, "input": texts}
    if dim:
        # Ask the API for a shortened (Matryoshka-style) embedding.
        kwargs["dimensions"] = dim
    return [d.embedding for d in oai.embeddings.create(**kwargs).data]

def embed_voyage(texts, model="voyage-3-large", input_type="document"):
    # Pass input_type="query" when embedding eval queries.
    return vo.embed(texts, model=model, input_type=input_type).embeddings

def embed_cohere(texts, model="embed-english-v3.5", input_type="search_document"):
    # Pass input_type="search_query" when embedding eval queries.
    return co.embed(texts=texts, model=model, input_type=input_type).embeddings

def embed_bge(texts):
    return bge.encode(texts, normalize_embeddings=True).tolist()

Now the evaluation harness. For each candidate model, index the corpus, run each eval query, and measure recall@k.

import numpy as np
from numpy.linalg import norm

def cosine_topk(q_vec, doc_vecs, k=20):
    q = np.array(q_vec)
    D = np.array(doc_vecs)
    sims = D @ q / (norm(D, axis=1) * norm(q) + 1e-9)
    return np.argsort(-sims)[:k]

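def recall_at_k(eval_set, doc_ids, doc_vecs, embed_fn, k_values=(1, 5, 10, 20)):
    # A query counts as a hit at k if ANY of its relevant docs is in the top k.
    # Strictly speaking this is a hit rate; it does not require every relevant
    # doc to be retrieved.
    # embed_fn must be the query-side embedder for the model behind doc_vecs.
    out = {k: 0 for k in k_values}
    for ex in eval_set:
        q_vec = embed_fn([ex["query"]])[0]
        top = cosine_topk(q_vec, doc_vecs, k=max(k_values))
        retrieved = [doc_ids[i] for i in top]
        for k in k_values:
            if any(rel in retrieved[:k] for rel in ex["relevant_ids"]):
                out[k] += 1
    return {k: out[k] / len(eval_set) for k in k_values}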
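
The harness above scores a query as a success when any relevant document lands in the top k. If your graded queries have several relevant documents each, it can help to track a stricter recall and a rank-aware metric alongside it. The following is a small sketch of the three metrics side by side; the function names and the sample IDs are illustrative, not part of the harness.

```python
def hit_rate_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if relevant & set(retrieved[:k]) else 0.0

def true_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Made-up retrieval result for a single query.
retrieved = ["kb-7", "kb-2", "kb-9", "kb-4"]
relevant = {"kb-2", "kb-4", "kb-8"}

print(hit_rate_at_k(retrieved, relevant, 3))     # any relevant doc in top 3?
print(true_recall_at_k(retrieved, relevant, 3))  # share of relevant docs in top 3
print(reciprocal_rank(retrieved, relevant))      # how high the first hit ranks
```

Averaging each metric over the eval set gives you three numbers per model. When they disagree, the model is usually finding one good document but missing the rest.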

Run this against five to seven candidate models. The numbers will surprise you. On one client’s corpus (40k support docs, heavy on networking error codes), recall@10 looked like:

text-embedding-3-large (3072) ........... 0.81
voyage-3-large .......................... 0.84
cohere-embed-english-v3.5 ............... 0.79
bge-m3 (1024) ........................... 0.82
bge-m3 fine-tuned on 5k ticket pairs .... 0.91
text-embedding-3-large (1024, truncated). 0.79
text-embedding-3-large (256, truncated).. 0.71

Two things to notice. Fine-tuning beat the best off-the-shelf model by seven points (0.91 vs. 0.84). And truncating text-embedding-3-large to 256 dimensions cost ten points of recall@10 (0.81 down to 0.71) while cutting storage by 12x.

Step 3, when fine-tuning earns its keep

Fine-tuning a sentence embedding model is much cheaper than it was a year ago. With a single A100 and a few thousand training pairs, you can fine-tune bge-m3 in under an hour. The question is whether your corpus characteristics justify it.

The rule of thumb I use: fine-tune when your base recall@10 is below 0.80 and you can produce at least 2000 high-quality training pairs from your ticket data. A “training pair” is a (query, relevant document) tuple, ideally with one or more hard negatives.

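Generating pairs from resolved tickets is mechanical. The ticket title is the query, the resolution comment is the positive document, and a randomly selected document from a different product area serves as a negative. These random negatives are easy ones rather than truly hard ones; that limitation comes up again in the warnings after the code.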

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import random

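def _as_list(value):
    # product_area may be a single string or a list of strings; iterating a
    # bare string would yield characters, so normalize to a list first.
    return [value] if isinstance(value, str) else list(value)

def build_training_pairs(tickets: list[dict], docs: list[dict], n_negatives: int = 3):
    examples = []
    # Group docs by product area so negatives can be drawn from other areas.
    docs_by_area = {}
    for d in docs:
        for area in _as_list(d["product_area"]):
            docs_by_area.setdefault(area, []).append(d)
    for t in tickets:
        if not t.get("resolution_text") or not t.get("product_area"):
            continue
        query = t["title"]
        positive = t["resolution_text"]
        examples.append(InputExample(texts=[query, positive], label=1.0))
        ticket_areas = set(_as_list(t["product_area"]))
        other_areas = [a for a in docs_by_area if a not in ticket_areas]
        for _ in range(n_negatives):
            if not other_areas:
                break
            # Random doc from an unrelated area: an easy negative.
            neg_doc = random.choice(docs_by_area[random.choice(other_areas)])
            examples.append(InputExample(texts=[query, neg_doc["body"]], label=0.0))
    return examples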

def fine_tune(base_model: str, examples: list, out_dir: str, epochs: int = 2):
    model = SentenceTransformer(base_model)
    loader = DataLoader(examples, shuffle=True, batch_size=32)
    loss = losses.OnlineContrastiveLoss(model)
    model.fit(
        train_objectives=[(loader, loss)],
        epochs=epochs,
        warmup_steps=int(0.1 * len(loader)),
        output_path=out_dir,
        show_progress_bar=True,
    )
    return model

Two warnings. First, don’t fine-tune on your eval set. Split tickets into train and eval before anything else; the model will memorize otherwise and you’ll think you have 0.95 recall when you actually have 0.65. Second, contrastive loss is sensitive to negative sampling. The “different product area” heuristic above is okay but not great; hard negative mining with the base model (find documents that look semantically close but aren’t relevant) gives meaningfully better results.
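
One way to make the train/eval split hard to get wrong is to derive it from a hash of the ticket ID, so a ticket can never drift between splits when the dataset is regenerated. This is a minimal sketch under the assumption that each ticket dict carries a stable "id" field (the field name is hypothetical) and that a 20% eval share is acceptable.

```python
import hashlib

def split_bucket(ticket_id: str, eval_fraction: float = 0.2) -> str:
    """Assign a ticket to 'train' or 'eval' from a hash of its ID."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return "eval" if position < eval_fraction else "train"

def split_tickets(tickets: list[dict], eval_fraction: float = 0.2):
    """Return (train, held_out) lists; deterministic for a given set of IDs."""
    train, held_out = [], []
    for t in tickets:
        bucket = split_bucket(str(t["id"]), eval_fraction)
        (held_out if bucket == "eval" else train).append(t)
    return train, held_out

# Synthetic tickets, only to show the call shape.
tickets = [{"id": f"T-{n}", "title": f"ticket {n}"} for n in range(1000)]
train, held_out = split_tickets(tickets)
print(len(train), len(held_out))
```

A hash split only prevents the same ticket from appearing on both sides. Near-duplicate tickets about the same incident can still leak, so deduplicating by resolution text before splitting is worth considering.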

The official sentence-transformers training docs cover the loss function tradeoffs in detail.

Step 4, dimensionality and storage

Vector storage is not free. At 3072 floats per document with 40k documents, you’re sitting on about 500 MB of raw vector data, plus index overhead. Multiply by however many versions of the index you keep, and your Postgres VACUUM starts to suffer.
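
The storage figure is simple arithmetic, and it is worth redoing for your own corpus size. The helper below is a back-of-the-envelope sketch that counts raw vector bytes only (no index overhead, no row headers); the function name is made up for illustration.

```python
BYTES_PER_COMPONENT = {"float32": 4, "float16": 2}

def raw_vector_bytes(n_docs: int, dim: int, dtype: str = "float32") -> int:
    """Raw size of n_docs embeddings of `dim` components each."""
    return n_docs * dim * BYTES_PER_COMPONENT[dtype]

for dim in (3072, 1024, 256):
    megabytes = raw_vector_bytes(40_000, dim) / 1e6
    print(f"{dim} dims: {megabytes:.1f} MB")
```

For 40k documents this works out to roughly 491.5 MB at 3072 dimensions, 163.8 MB at 1024, and 41.0 MB at 256, which is where the 12x figure for the 256-dim variant comes from.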

Matryoshka truncation lets you store one full-dimension vector and serve queries at multiple resolutions.

def matryoshka_truncate(vec: list[float], target_dim: int) -> list[float]:
    truncated = vec[:target_dim]
    norm_val = sum(x*x for x in truncated) ** 0.5
    if norm_val == 0:
        return truncated
    return [x / norm_val for x in truncated]

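With Postgres 17 and the pgvector extension, you keep two vector columns and two indexes: the full-dimension column you already have, plus a truncated one. One caveat: pgvector’s HNSW and IVFFlat indexes on the vector type top out at 2,000 dimensions, so a 3072-dim column has to be indexed as halfvec (which allows up to 4,000) or scanned without an index. The truncated column and its index look like this: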

ALTER TABLE kb_documents ADD COLUMN embedding_256 VECTOR(256);
CREATE INDEX kb_docs_vec_256 ON kb_documents USING hnsw (embedding_256 vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

The 256-dim index is your hot path. Twelve times less memory, three to five times faster query, and you only lose about ten points of recall@10 (which the reranker mostly recovers). Use the full 3072-dim only for the queries that the 256-dim index can’t confidently resolve.
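
What “can’t confidently resolve” means is a design choice. One simple option, sketched below, is to look at the gap between the top two scores from the truncated vectors and rescore at full dimension only when that gap is small. This is an illustrative brute-force version with tiny made-up vectors; the margin of 0.05, the function names, and the use of a score gap as the confidence signal are all assumptions, and in production the first pass would be the HNSW index rather than a loop.

```python
import math

def truncate_and_normalize(vec: list[float], dim: int) -> list[float]:
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def tiered_search(query_vec, doc_vecs, low_dim=2, margin=0.05, rescore_n=10):
    """Return (best_doc_index, tier_used), where tier_used is 'low' or 'full'."""
    q_low = truncate_and_normalize(query_vec, low_dim)
    low_scores = [cosine(q_low, truncate_and_normalize(d, low_dim)) for d in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: -low_scores[i])
    # Confident result: the winner is clearly ahead of the runner-up.
    if len(ranked) < 2 or low_scores[ranked[0]] - low_scores[ranked[1]] >= margin:
        return ranked[0], "low"
    # Ambiguous result: rescore the top candidates with the full vectors.
    full_scores = {i: cosine(query_vec, doc_vecs[i]) for i in ranked[:rescore_n]}
    return max(full_scores, key=full_scores.get), "full"

docs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.9],  # close to doc 0 in the first two dims only
    [0.0, 1.0, 0.0, 0.0],
]
clear_query = [0.0, 1.0, 0.1, 0.0]
ambiguous_query = [1.0, 0.05, 0.0, 1.0]

print(tiered_search(clear_query, docs))
print(tiered_search(ambiguous_query, docs))
```

The second query is the interesting case: docs 0 and 1 are nearly tied in the truncated space, and only the full vectors separate them. Tune the margin against your eval set, not by intuition.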

For more on the pgvector index choices, my pgvector tuning writeup from 2024 is still mostly current as of November 2025.

Step 5, re-indexing without downtime

You will change your embedding model. When you do, you cannot have a window where the index is half-old, half-new. The retriever will return nonsense because cosine similarity between vectors from two different models is meaningless.

The pattern is a versioned shadow index, written in parallel, with an atomic switch.

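import os

import psycopg
from psycopg import sql

def reindex(new_model: str, embed_fn, dim: int = 1024, batch_size: int = 64):
    # embed_fn(texts) returns vectors from new_model; dim must match their size.
    new_col = f"embedding_{new_model.replace('-', '_').replace('.', '_')}"
    col = sql.Identifier(new_col)
    with psycopg.connect(os.environ["PG_DSN"]) as conn:
        with conn.cursor() as cur:
            cur.execute(sql.SQL(
                "ALTER TABLE kb_documents ADD COLUMN IF NOT EXISTS {} VECTOR({})"
            ).format(col, sql.SQL(str(int(dim)))))
            conn.commit()
            cur.execute(sql.SQL(
                "SELECT id, body_clean FROM kb_documents WHERE {} IS NULL"
            ).format(col))
            rows = cur.fetchall()
            conn.commit()  # end the read transaction
            update = sql.SQL(
                "UPDATE kb_documents SET {} = %s::vector WHERE id = %s"
            ).format(col)
            for i in range(0, len(rows), batch_size):
                batch = rows[i:i + batch_size]
                vecs = embed_fn([r[1] for r in batch])
                for (doc_id, _), vec in zip(batch, vecs):
                    # pgvector parses the text form "[x1,x2,...]".
                    literal = "[" + ",".join(str(x) for x in vec) + "]"
                    cur.execute(update, (literal, doc_id))
                conn.commit()  # commit per batch so an interrupted run can resume
            # CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
            conn.autocommit = True
            cur.execute(sql.SQL(
                "CREATE INDEX CONCURRENTLY IF NOT EXISTS {} "
                "ON kb_documents USING hnsw ({} vector_cosine_ops)"
            ).format(sql.Identifier(f"kb_docs_{new_col}_idx"), col))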

Then flip the application config to point at the new column. Old column stays around for a week as your rollback. Drop it once you’re confident.

Common Pitfalls

Picking a model by MTEB score. The leaderboards are gamed for general-purpose corpora. Your corpus is specific. Always run your own benchmark on your own data, with at least a hundred real queries.

Skipping the input_type parameter. Cohere and Voyage embeddings differ depending on whether you’re embedding a document or a query. If you embed everything with the same setting, you’ll see a 10-20 point recall drop and you’ll blame the model. With Cohere, set input_type="search_document" at index time and input_type="search_query" at query time; Voyage uses "document" and "query" for the same purpose.

Storing vectors at full precision when you don’t need to. float32 is overkill. pgvector has supported halfvec (float16) since version 0.7.0, at half the storage cost and with no measurable recall impact for cosine similarity in my benchmarks. Use it.
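
You can get a feel for how little half precision disturbs cosine scores without touching a database. The sketch below round-trips a few made-up vectors through IEEE 754 half precision using the standard library and compares the resulting ranking. It is a toy illustration of the rounding behaviour, not evidence about your corpus, so run the same comparison on your real benchmark before switching.

```python
import math
import struct

def to_float16(vec: list[float]) -> list[float]:
    """Round-trip each component through half precision (struct format 'e')."""
    fmt = f"<{len(vec)}e"
    return list(struct.unpack(fmt, struct.pack(fmt, *vec)))

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

query = [0.12, -0.45, 0.33, 0.81]
docs = {
    "a": [0.10, -0.50, 0.30, 0.80],
    "b": [0.90, 0.10, -0.20, 0.05],
    "c": [-0.30, 0.40, 0.10, -0.60],
}

def rank(doc_vectors: dict) -> list[str]:
    return sorted(doc_vectors, key=lambda d: -cosine(query, doc_vectors[d]))

half_docs = {d: to_float16(v) for d, v in docs.items()}
same_order = rank(docs) == rank(half_docs)
max_drift = max(abs(cosine(query, docs[d]) - cosine(query, half_docs[d])) for d in docs)
print(same_order, max_drift)
```

Rankings only change when two documents were already within rounding distance of each other, which is the case a reranker handles anyway.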

Re-embedding the same chunks every cron run. Embedding API calls are expensive at scale. Hash your chunk text and skip the API call if the hash hasn’t changed. This is a five-line change that cuts most teams’ embedding bill by 80%.
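
The hash check really is small. Here is a minimal sketch, assuming you keep a mapping from chunk ID to the hash of the text that was last embedded (in practice a column next to the vector). The function names and chunk IDs are illustrative.

```python
import hashlib

def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_embed(chunks: dict[str, str], stored_hashes: dict[str, str]) -> dict[str, str]:
    """Return only the chunks that are new or whose text has changed."""
    return {
        chunk_id: text
        for chunk_id, text in chunks.items()
        if stored_hashes.get(chunk_id) != chunk_hash(text)
    }

# Hashes recorded by the previous run.
stored = {
    "kb-1#0": chunk_hash("Restart the agent."),
    "kb-2#0": chunk_hash("Old wording."),
}
# Current state of the corpus.
current = {
    "kb-1#0": "Restart the agent.",     # unchanged, skipped
    "kb-2#0": "New wording.",           # edited, re-embedded
    "kb-3#0": "Brand new article.",     # new, embedded
}
pending = chunks_to_embed(current, stored)
print(sorted(pending))
```

If you change the embedding model or the chunk preprocessing, include a version string in the hashed text so that every chunk is correctly treated as stale.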

Treating fine-tuning as a one-time thing. Your corpus drifts. The product changes, customer questions change, error codes are renamed. Plan to retrain quarterly, with the eval set growing each quarter.

Troubleshooting

Symptom, recall is great on the eval set but customers complain. Your eval set isn’t representative of production queries. Sample 200 actual user queries from the last week of logs, hand-grade them, and add them to the eval. Repeat monthly. The gap between curated and production queries closes faster than you expect.

Symptom, fine-tuned model performs worse than base. Almost always a negative sampling problem. Your “hard negatives” were too easy, so the model collapsed everything into one dense cluster. Switch to hard negative mining: use the base model to retrieve top 50 for each query, manually remove the relevant ones, and use the remaining as hard negatives in the next training round.
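
The mining step can be sketched in a few lines. The version below assumes you already have base-model vectors for the query and the documents, and it uses brute-force cosine over tiny made-up vectors to stay self-contained. The function name, the document IDs, and the default of three negatives per query are illustrative choices.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mine_hard_negatives(query_vec, doc_vecs: dict, relevant_ids: set,
                        top_n: int = 50, n_negatives: int = 3) -> list[str]:
    """Pick the highest-ranked documents that are NOT labeled relevant."""
    ranked = sorted(doc_vecs, key=lambda d: -cosine(query_vec, doc_vecs[d]))
    candidates = [d for d in ranked[:top_n] if d not in relevant_ids]
    return candidates[:n_negatives]

# Toy base-model vectors for one query about a VPN reset error.
query_vec = [1.0, 0.0]
doc_vecs = {
    "vpn-reset": [0.95, 0.10],    # the labeled relevant doc
    "vpn-timeout": [0.90, 0.30],  # looks similar, wrong fix
    "vpn-cert": [0.70, 0.60],     # same product, different problem
    "billing": [0.00, 1.00],      # unrelated, an easy negative
}
negatives = mine_hard_negatives(query_vec, doc_vecs, {"vpn-reset"}, n_negatives=2)
print(negatives)
```

The manual review still matters. Documents that rank high and are not labeled relevant are sometimes relevant but unlabeled, and training on those as negatives makes things worse.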

Symptom, query latency spikes after re-indexing. HNSW index needs to warm up. The first hundred queries are slow because pages aren’t in cache. Add a warmup script that runs your top 100 queries by frequency at startup, and add a pg_prewarm call on the index after every redeploy.

Wrapping Up

Embedding selection is a measurement problem, not a research problem. Build a benchmark, run candidates, pick the one that wins on your data, and revisit quarterly. The fancy techniques (Matryoshka, fine-tuning, hybrid retrieval) all earn their keep, but only after you’ve established a baseline and a process for re-evaluating it.

Next in this series I shift away from the pipeline and into the human side. We’ll talk about how L3 engineers and enterprise customers actually communicate, and what tooling and process choices make that bridge load-bearing.