Embedding Models in 2023: ada-002, sentence-transformers, and What Actually Matters
TL;DR
- text-embedding-ada-002 is the default for a reason: 1536 dimensions, decent quality across domains, and $0.0001 per 1K tokens.
- sentence-transformers models like all-MiniLM-L6-v2 give you 384-dim vectors at zero marginal cost if you have a GPU.
- Your retrieval quality is dominated by chunking strategy and query formulation, not by which embedding model you pick, within reason.
I spent most of March benchmarking embedding models for an internal RAG project, and the takeaway surprised me. The gap between OpenAI’s text-embedding-ada-002 and a well-chosen open-source sentence-transformer is real but smaller than the gap between good chunking and bad chunking. The model is a knob; the pipeline around it is the rest of the engine.
This post is a survey of what’s available in April 2023, where each option fits, and how to actually evaluate them on your own data. If you’ve already picked a vector database and are looking for what to put in it, this is the post. If you haven’t, the Pinecone walkthrough from earlier this week covers the storage side.
I’ll focus on English text embeddings for retrieval. Multilingual and multimodal embeddings exist and are improving fast, but they’re a different conversation.
What an embedding actually is
An embedding model is a function that maps text to a fixed-dimensional vector in a space where semantic similarity correlates with vector distance. The mathematical guarantee is weak - cosine similarity in the embedding space is a noisy proxy for “these two pieces of text mean similar things” - but it’s good enough in practice that the whole semantic search ecosystem is built on it.
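To make "vector distance" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up toy values, not output from any model; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (illustrative values only, not produced by any model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, invoice)
print(sim_close > sim_far)  # True
```

When both vectors already have unit length, the denominator is 1 and the score reduces to a plain dot product.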
Two properties matter most for retrieval:
- Symmetric versus asymmetric. Some models are trained for symmetric similarity (sentence A vs sentence B, same distribution). Others are trained for asymmetric retrieval (short query vs long passage). Using the wrong one for your task will silently hurt recall.
- Domain coverage. A model trained on web text will underperform on legal documents, medical records, or code, even if its benchmark numbers look great.
The default: text-embedding-ada-002
OpenAI’s text-embedding-ada-002, released in December 2022, is the path of least resistance. It’s currently $0.0001 per 1K tokens, which is roughly 10x cheaper than the previous-generation embedding models from OpenAI. It returns 1536-dimensional normalized vectors and handles up to 8191 input tokens per request.
import openai

openai.api_key = "..."

def embed(texts: list[str]) -> list[list[float]]:
    # One API call for the whole batch; results come back in input order.
    response = openai.Embedding.create(
        input=texts,
        model="text-embedding-ada-002",
    )
    return [item["embedding"] for item in response["data"]]

vectors = embed(["the quick brown fox", "a lazy dog sleeps"])
print(len(vectors[0]))  # 1536
A few things worth knowing about ada-002:
- The vectors are L2-normalized, so cosine similarity equals dot product. Pick whichever your vector DB indexes faster.
- OpenAI hasn’t published the training details, but in deployment it’s a single model for both queries and documents, so you embed both sides the same way.
- You can batch up to ~2048 inputs per request, but the latency advantage saturates around 100. I run batches of 100 in production.
- The token limit is 8191, but quality degrades for very long inputs. I chunk to 500-800 tokens for retrieval work.
- Rate limits are organization-wide. The default tier is 350K tokens per minute on embeddings, which is a soft ceiling people hit fast on backfills.
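The batching advice can be sketched as a small helper that splits inputs into fixed-size batches and calls an embedding function once per batch. This is a minimal sketch: `embed_in_batches`, `batched`, and `fake_embed` are names introduced here for illustration, and the stand-in embedder replaces a real API call so the example runs offline. It does no retrying or rate-limit handling.

```python
from typing import Callable, Iterator

def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Call `embed_fn` once per batch and concatenate results in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in embedder (hypothetical): swap in a real embedding call in practice.
def fake_embed(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]

texts = [f"doc {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed, batch_size=100)
print(len(vectors))  # 250
```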
The biggest operational concern with ada-002 is the cost at scale for query embeddings, not document embeddings. Documents you embed once. Queries you embed on every search request, and at 5 requests per second per user, costs add up faster than you’d expect.
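A rough way to size that query cost is to multiply the sustained request rate by tokens per query and the per-token price. The sketch below uses the price quoted above; the 20 tokens per query figure is an assumption for illustration, so measure your own traffic before trusting the output.

```python
PRICE_PER_1K_TOKENS = 0.0001  # ada-002 price quoted above, in USD

def daily_query_cost(requests_per_second: float, tokens_per_query: int) -> float:
    """USD per day spent embedding queries at a sustained request rate."""
    queries_per_day = requests_per_second * 60 * 60 * 24
    tokens_per_day = queries_per_day * tokens_per_query
    return tokens_per_day / 1000 * PRICE_PER_1K_TOKENS

# Assumption: 20 tokens per query (hypothetical value).
cost = daily_query_cost(requests_per_second=5, tokens_per_query=20)
print(f"${cost:.2f} per day")  # $0.86 per day
```

The figure is for one sustained stream of requests. It scales linearly with the number of concurrent streams and with query length.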
The open-source path: sentence-transformers
The sentence-transformers library wraps a catalog of pre-trained models that you can run locally. The standout ones for retrieval in April 2023:
- all-MiniLM-L6-v2: 384 dimensions, 80MB on disk, ~14k QPS on a single GPU. Great quality for its size.
- all-mpnet-base-v2: 768 dimensions, 420MB, slower but consistently the best general-purpose retrieval model in open source right now.
- multi-qa-mpnet-base-dot-v1: asymmetric retrieval model trained on question-answer pairs. The right choice if your queries look like questions and your documents look like passages.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
vectors = model.encode(
    ["the quick brown fox", "a lazy dog sleeps"],
    normalize_embeddings=True,
    batch_size=64,
)
print(vectors.shape)  # (2, 768)
A few practical notes:
- Always pass normalize_embeddings=True if you’re storing in a vector DB indexed for cosine or dot product. The library doesn’t do it by default.
- On CPU, all-MiniLM-L6-v2 is genuinely fast (~200 sentences/sec on a modern laptop). The bigger MPNet model is unpleasantly slow on CPU.
- GPU memory is usually the constraint at high batch sizes. A T4 with 16GB handles batch sizes around 256 for the MPNet models.
The cost calculation versus OpenAI is non-obvious. An AWS g4dn.xlarge (T4 GPU) is about $0.50/hour. At that price you’d break even versus ada-002 at roughly 5M tokens per hour of throughput. If you’re below that, ada-002 is cheaper. If you’re above, self-hosting wins, but you also pay in operational complexity.
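The break-even figure is just the hourly GPU price divided by the API price per token. A minimal sketch using the two prices quoted above (the function name is introduced here for illustration):

```python
def breakeven_tokens_per_hour(gpu_usd_per_hour: float, api_usd_per_1k_tokens: float) -> float:
    """Tokens per hour at which API spend equals the hourly GPU price."""
    return gpu_usd_per_hour / api_usd_per_1k_tokens * 1000

tokens = breakeven_tokens_per_hour(gpu_usd_per_hour=0.50, api_usd_per_1k_tokens=0.0001)
print(f"{tokens / 1e6:.1f}M tokens per hour")  # 5.0M tokens per hour
```

This ignores idle GPU time, engineering effort, and storage, all of which push the real break-even point higher.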
The middle ground: hosted open-source
A few providers now offer embedding models as hosted APIs: Hugging Face Inference Endpoints and Replicate for open-source models, and Cohere’s embeddings tier for its own models. I haven’t gotten any of them to feel as polished as the OpenAI experience yet, but Cohere’s embed-multilingual-v2.0 is genuinely useful if you need anything beyond English. The economics shift workload-by-workload, so I treat them as escape hatches rather than defaults.
How to actually evaluate
The single biggest mistake I see in embedding model evaluation is using public benchmarks as the decision criterion. MTEB (the Massive Text Embedding Benchmark) is great context but a poor proxy for your specific corpus.
The minimum viable evaluation:
- Build a small held-out test set of (query, expected-document) pairs from your actual data. Even 50 pairs is enough to see signal.
- For each model, embed both queries and documents.
- For each query, compute the rank of the expected document among all documents.
- Report mean reciprocal rank (MRR) and recall at k=5 and k=10.
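import numpy as np

def evaluate(model_encode_fn, queries, documents, gold_doc_ids):
    """Return the mean reciprocal rank (MRR) of the gold document per query."""
    doc_ids = list(documents.keys())
    doc_texts = [documents[i] for i in doc_ids]
    # Accept lists or arrays from the encoder, then L2-normalize each row.
    doc_vecs = np.asarray(model_encode_fn(doc_texts), dtype=float)
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    reciprocal_ranks = []
    for query, gold in zip(queries, gold_doc_ids):
        q_vec = np.asarray(model_encode_fn([query])[0], dtype=float)
        q_vec = q_vec / np.linalg.norm(q_vec)
        # Dot product of unit vectors is cosine similarity.
        scores = doc_vecs @ q_vec
        ranking = sorted(zip(doc_ids, scores), key=lambda x: -x[1])
        # 1-based position of the gold document in the sorted results.
        rank = next(i for i, (did, _) in enumerate(ranking, 1) if did == gold)
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)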
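The function above returns only MRR. Recall at k can be computed from the same per-query ranks. A minimal sketch, assuming you have collected the 1-based rank of the gold document for each query; the `ranks` values below are hypothetical, not measured results.

```python
def recall_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose gold document appears in the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks: list[int]) -> float:
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical 1-based ranks of the gold document for eight queries.
ranks = [1, 3, 2, 12, 1, 7, 5, 40]
print(recall_at_k(ranks, 5))   # 0.625
print(recall_at_k(ranks, 10))  # 0.75
```

Reporting both metrics is useful because MRR rewards putting the right document first, while recall at k only asks whether it made the cut.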
What I’ve found, consistently:
- The spread between best and worst general-purpose model on a domain-specific corpus is typically 5-15% MRR.
- The spread between good and bad chunking on the same model is 20-40% MRR.
- The spread between literal-keyword queries and natural-language-paraphrase queries on the same chunking and model is often 30%+ MRR.
This is why I keep saying the model is a knob.
Chunking matters more than the model
The actual hard problem in retrieval is deciding what to embed. A few patterns that work:
- Fixed-size chunking with overlap. 500 tokens with 50 tokens of overlap is a reasonable default. Use the embedding model’s tokenizer, not whitespace.
- Recursive splitting. Split on paragraphs first, then sentences, then words, only if needed to fit the chunk size. LangChain’s RecursiveCharacterTextSplitter is a good implementation.
- Semantic chunking. Split on topic boundaries detected by a model. More expensive, but for long-form content it’s worth experimenting with.
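The fixed-size pattern can be sketched in a few lines. This is a minimal sketch that operates on a list of token IDs, assuming you have already run the embedding model’s tokenizer; the integer range below is a stand-in for real token IDs, and `chunk_tokens` is a name introduced here for illustration.

```python
def chunk_tokens(tokens: list[int], chunk_size: int = 500, overlap: int = 50) -> list[list[int]]:
    """Split a token sequence into windows of `chunk_size` sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks

# Stand-in token IDs; in practice these come from the embedding model's tokenizer.
tokens = list(range(1200))
chunks = chunk_tokens(tokens, chunk_size=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```

Each chunk starts 450 tokens after the previous one, so the last 50 tokens of one chunk are repeated at the start of the next.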
I’ll cover this in detail in the semantic search end-to-end post later this week.
Common Pitfalls
- Mixing models in the same index. Vectors from different models live in different spaces. They’re not comparable, even if the dimensions happen to match. I’ve seen this happen during migrations and it produces baffling retrieval failures.
- Forgetting to normalize. If your model returns unnormalized vectors and your vector DB is indexed for cosine, you’ll usually be fine because cosine divides out the vector length. If it’s indexed for dot product, rankings get skewed by vector length: longer vectors score higher regardless of how well they match.
- Embedding queries with a model trained for documents. Asymmetric models that expect queries and passages to be encoded differently will silently underperform if you embed both sides identically. Check the model card (for example, the one for multi-qa-mpnet-base-dot-v1) and use the prefix or explicit query encoding method it documents.
- Trusting cosine similarity scores in absolute terms. A score of 0.85 doesn’t mean “highly relevant.” It means “more relevant than something at 0.75.” Always calibrate thresholds on your data.
- Re-embedding on every query because of caching mistakes. Query embeddings are pure functions of the query text and the model. Cache them aggressively. I use a small Redis LRU keyed on (model_id, normalize, text_hash).
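As a sketch of the cache key idea, here is an in-memory LRU keyed on (model_id, normalize, text_hash). It swaps Redis for a plain OrderedDict so it runs standalone. The class name, the default capacity, and the `embed_fn` parameter are assumptions for illustration, not the production setup described above.

```python
import hashlib
from collections import OrderedDict
from typing import Callable

class EmbeddingCache:
    """Tiny in-memory LRU cache for query embeddings (stand-in for Redis)."""

    def __init__(self, embed_fn: Callable[[str], list[float]], model_id: str,
                 normalize: bool, capacity: int = 10_000):
        self.embed_fn = embed_fn
        self.model_id = model_id
        self.normalize = normalize
        self.capacity = capacity
        self._store = OrderedDict()

    def _key(self, text: str) -> tuple:
        # Hash the text so keys stay small no matter how long the query is.
        text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return (self.model_id, self.normalize, text_hash)

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        vector = self.embed_fn(text)
        self._store[key] = vector
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return vector

calls = []

def fake_embed(text: str) -> list[float]:
    # Hypothetical embedder that records each call so cache hits are visible.
    calls.append(text)
    return [float(len(text))]

cache = EmbeddingCache(fake_embed, model_id="text-embedding-ada-002", normalize=True)
cache.get("reset my password")
cache.get("reset my password")
print(len(calls))  # 1
```

Including the model ID and the normalize flag in the key means a model migration or a config change cannot serve stale vectors from the old setup.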
Wrapping Up
In April 2023, my default recipe is text-embedding-ada-002 for almost everything, with a sentence-transformers fallback for high-volume workloads or air-gapped deployments. The decision matters less than getting the chunking and the evaluation harness right. Build the eval first, embed second.
The next post pulls all of this together into an end-to-end semantic search build that you can actually deploy.