Embedding Models in 2024, OpenAI vs Cohere vs Open Source

February 7, 2024 · 6 min read · by Muhammad Amal ai

TL;DR: text-embedding-3-small is the new default. Cohere v3 wins for cross-language retrieval. bge-m3 is the open-source pick if you can run a GPU and want dense plus sparse plus ColBERT-style multi-vector output. Avoid text-embedding-ada-002 for new work.

OpenAI released text-embedding-3-small and text-embedding-3-large on January 25, 2024. It is the most consequential RAG release of the quarter, and it quietly resets the embedding default for most production work. Cohere shipped v3 last November. The BGE family kept moving. We have real choices now, and not all of them are obvious.

In the vector DB comparison I said embedding choice is more consequential than database choice. This post is the why and how.

The candidates

OpenAI

text-embedding-ada-002 — the old default. Don’t pick this in February 2024. The newer models are cheaper and better.
text-embedding-3-small — 1536-dim, $0.02 per 1M tokens. Beats ada-002 on every benchmark.
text-embedding-3-large — 3072-dim, $0.13 per 1M tokens. State-of-the-art commercial accuracy.

Cohere

embed-english-v3.0 — 1024-dim, $0.10 per 1M tokens. Strong on English.
embed-multilingual-v3.0 — 1024-dim, $0.10 per 1M tokens. Excellent cross-lingual retrieval.
embed-english-light-v3.0 — 384-dim, $0.10 per 1M tokens. Cheaper to store and query, slightly lower accuracy.

Open source (top of MTEB leaderboard, late January 2024)

bge-m3 — BAAI, January 2024. Multi-functional: dense (1024-dim), sparse, and colbert-style multi-vector in one model. Multi-lingual.
bge-large-en-v1.5 — 1024-dim, English only. Still excellent.
nomic-embed-text-v1 — 768-dim, 8192 context. Permissive license. Released February 2024.
e5-mistral-7b-instruct — 4096-dim, instruction-tuned. State of the art on benchmarks; expensive to run.

The honest comparison

There’s a lot of “BGE beats OpenAI” content out there. MTEB scores in isolation are a poor guide: production corpora rarely match the benchmark’s domains, and a leaderboard score says nothing about what a model costs to run, store, and query.

What I’ve actually found, running real queries on real corpora:

For most general-domain English text, text-embedding-3-small and bge-large-en-v1.5 are within a few percentage points of each other on the metrics that matter (top-10 precision/recall).
text-embedding-3-large produces noticeably better recall on technical documents (engineering wikis, API specs, codebases). The gap is meaningful when accuracy matters.
Cohere v3 multilingual is genuinely the best at cross-lingual retrieval. If your queries are in language A and your docs are in language B, this is what you want.
bge-m3 is the only model that gives you dense, sparse, and multi-vector outputs from a single inference. For hybrid search systems, this is a big deal.

# text-embedding-3-small, January 2024
from openai import OpenAI

client = OpenAI()

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["chunk text here", "another chunk"],
)
vectors = [item.embedding for item in resp.data]  # list of 1536-dim floats

# bge-m3 via the FlagEmbedding package (CPU/GPU)
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

out = model.encode(
    ["chunk text here", "another chunk"],
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=False,
)
dense = out["dense_vecs"]      # (N, 1024) numpy
sparse = out["lexical_weights"]  # list of dicts (token_id -> weight)
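
To make the hybrid-search point concrete, here is a minimal sketch of how the dense and sparse outputs could be blended into one relevance score. The function names and the 0.7/0.3 weighting are illustrative assumptions, not part of FlagEmbedding, and the toy values stand in for real dense_vecs and lexical_weights entries.

```python
def dense_score(q, d):
    """Dot product of two dense vectors (equals cosine if both are L2-normalized)."""
    return sum(a * b for a, b in zip(q, d))


def sparse_score(q_weights, d_weights):
    """Lexical overlap: sum of weight products over tokens present in both dicts."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items() if tok in d_weights)


def hybrid_score(q_dense, d_dense, q_sparse, d_sparse, alpha=0.7):
    """Weighted blend of the two scores; alpha is a hypothetical tuning knob."""
    dense = dense_score(q_dense, d_dense)
    sparse = sparse_score(q_sparse, d_sparse)
    return alpha * dense + (1 - alpha) * sparse


# Toy inputs standing in for one query and one document.
q_dense, d_dense = [0.6, 0.8], [0.8, 0.6]
q_sparse = {"101": 0.5, "2054": 0.2}
d_sparse = {"101": 0.4, "7592": 0.9}

score = hybrid_score(q_dense, d_dense, q_sparse, d_sparse)
```

In a real system the weighting would be tuned on your own eval set rather than fixed up front.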

The Matryoshka trick

text-embedding-3-large ships with Matryoshka representation learning. You can truncate the 3072-dim vector to a smaller dimension and still get coherent embeddings. Truncate to 1024 and you lose a few percentage points of accuracy but pay a third of the storage cost.

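from openai import OpenAI

client = OpenAI()
texts = ["chunk text here", "another chunk"]

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=1024,  # OpenAI does the Matryoshka truncation + L2 normalization
)
vectors = [item.embedding for item in resp.data]  # each vector has 1024 floats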

This is the move I make most often in 2024: use text-embedding-3-large with dimensions=1024 (or 512 for cost-sensitive cases). You get most of the accuracy upside of the large model at the storage cost of a small one. The dimensions parameter does the L2 normalization for you, so the truncated vectors are directly usable.

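text-embedding-3-small also supports dimensions; OpenAI’s announcement says both v3 models were trained to allow shortening. It starts from 1536 dimensions rather than 3072, though, so there is less to give up, and truncating it degrades faster.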
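
If you store full-length vectors and shorten them yourself later, the truncation has to be followed by re-normalization, because a prefix of a unit vector is no longer unit length. A minimal sketch in plain Python, assuming the input is a full-length embedding from a model that supports shortening; truncate_and_normalize is a hypothetical helper name, and the four-element vector is a toy stand-in.

```python
import math


def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components, then rescale to unit L2 norm."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length "embedding"
short = truncate_and_normalize(full, 2)
```

Doing this client-side lets you keep one stored copy of the full vector and experiment with several shorter sizes without paying for new API calls.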

When to use Cohere v3

Three scenarios push me to Cohere:

Cross-lingual retrieval. Spanish queries on English docs, English queries on Japanese docs, etc. The multilingual model handles this far better than the OpenAI line.
Re-ranking pipelines. Cohere’s separate Rerank API (covered in next month’s hybrid search post in detail) pairs naturally with their embeddings. Same vendor, consistent latency.
Tier mixing. embed-english-light-v3.0 at 384 dimensions is genuinely cheap to store. For huge corpora where you’re willing to lose a little accuracy, it’s an option text-embedding-3-small doesn’t quite match.

The catch with Cohere: rate limits are tighter than OpenAI’s by default, and the model surface area is smaller. You’ll occasionally hit quota issues that don’t show up with OpenAI.

When to run an open-source model

Three scenarios push me to open source:

Data can’t leave your network. Compliance, regulated industries, internal-only deployments. No API call is a hard constraint.
The corpus is very large. Hundreds of millions of chunks at $0.02/1M tokens is real money. Self-hosted embedding inference on rented GPUs can be cheaper at scale.
You need the dense plus sparse plus multi-vector pipeline. bge-m3 is the only model that does this in one inference call. Building a comparable pipeline with API embeddings means two API calls plus a separate sparse generator.
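
The cost argument is easy to check with arithmetic. A rough sketch follows, where the corpus size and average chunk length are hypothetical placeholders and the per-token prices are the ones listed earlier in this post.

```python
def embedding_cost_usd(num_chunks, tokens_per_chunk, usd_per_million_tokens):
    """One-off API cost of embedding a whole corpus."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical corpus: 300M chunks averaging 400 tokens each.
small = embedding_cost_usd(300_000_000, 400, 0.02)  # text-embedding-3-small
large = embedding_cost_usd(300_000_000, 400, 0.13)  # text-embedding-3-large
```

Under those assumptions a single pass comes to about $2,400 with the small model and $15,600 with the large one, and every re-embedding after a model swap pays that bill again.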

Operating an embedding model is non-trivial. You need GPU capacity, autoscaling, batching, and the discipline to track which model version produced which embeddings. The minimal service below gives you a sense of the starting point:

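# A minimal embedding service; a production version would batch + autoscale
from fastapi import FastAPI
from FlagEmbedding import BGEM3FlagModel
from pydantic import BaseModel

app = FastAPI()
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

class EmbedRequest(BaseModel):
    texts: list[str]  # validated request body instead of a raw dict

@app.post("/embed")
def embed(req: EmbedRequest):
    # Plain `def` on purpose: FastAPI runs it in a threadpool, so the
    # blocking model.encode call does not stall the event loop.
    out = model.encode(req.texts, return_dense=True, return_sparse=True)
    return {
        "dense": out["dense_vecs"].tolist(),
        # Cast numpy scalars to plain floats so the response is JSON-serializable
        "sparse": [{int(k): float(v) for k, v in d.items()} for d in out["lexical_weights"]],
    }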

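On an A10G GPU, this serves ~200 texts/second for typical chunk sizes. Cost depends on utilization: a rented GPU bills by the hour whether or not it is busy, while the API bills per token. On price alone, self-hosting wins only once the steady-state daily API bill exceeds the daily GPU bill. Assuming a GPU at roughly $1/hour, that is on the order of a billion tokens per day against text-embedding-3-small at $0.02 per 1M tokens; below that, the API wins unless compliance forces the decision.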
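
A break-even estimate makes the self-hosting decision less of a guess. This is a sketch under stated assumptions: the $1.00/hour GPU price is a hypothetical placeholder, the GPU is assumed to run around the clock, and engineering time and redundancy are ignored. Swap in your own numbers.

```python
def breakeven_tokens_per_day(gpu_usd_per_hour, api_usd_per_million_tokens):
    """Daily token volume at which a GPU running 24/7 costs the same as the API."""
    gpu_usd_per_day = gpu_usd_per_hour * 24
    return gpu_usd_per_day / api_usd_per_million_tokens * 1_000_000


# Hypothetical GPU price; API prices are the ones listed earlier in this post.
vs_small = breakeven_tokens_per_day(1.00, 0.02)  # vs text-embedding-3-small
vs_large = breakeven_tokens_per_day(1.00, 0.13)  # vs text-embedding-3-large
```

The threshold moves a lot with the API model you are comparing against, so the comparison is worth rerunning whenever prices change.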

The model swap problem

Every embedding is tied to the model that produced it. Switching embedding models means re-embedding the entire corpus. This is non-trivial:

API costs to re-embed millions of chunks.
Index downtime or dual-index complexity during the swap.
Different output dimensions may require schema changes in the vector DB.
Quality regressions are hard to measure. You can’t A/B test across embedding generations without both indexes live.

The pragmatic approach: pick the best embedding you can justify on day one and resist the urge to swap. The accuracy improvements from each new model generation are real but rarely 10x. The migration cost can be.
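
One cheap guard against a half-finished migration is to record which model built each index and refuse queries embedded with anything else. A minimal sketch: EmbeddingSpec and check_compatible are hypothetical names, not part of any vector DB client, and where you persist the spec is up to you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Identifies which model produced a vector (hypothetical metadata schema)."""
    model: str
    dimensions: int


class ModelMismatchError(ValueError):
    """Raised when query and index vectors come from different models."""


def check_compatible(index_spec, query_spec):
    """Raise if a query vector comes from a different model than the index."""
    if index_spec != query_spec:
        raise ModelMismatchError(
            f"index built with {index_spec}, query embedded with {query_spec}"
        )


index_spec = EmbeddingSpec("text-embedding-ada-002", 1536)
query_spec = EmbeddingSpec("text-embedding-3-small", 1536)

try:
    check_compatible(index_spec, query_spec)
    mismatch_caught = False
except ModelMismatchError:
    mismatch_caught = True
```

Comparing the model name matters here: both models in the example output 1536 dimensions, so a dimension check alone would pass.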

Common Pitfalls

Picking by MTEB leaderboard score. MTEB is averaged across diverse tasks. Your domain may rank models very differently. Build a small evaluation set and score the candidates on it.
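Ignoring max input length. text-embedding-3-small accepts up to 8191 tokens. bge-m3 accepts 8192. Older open-source models such as bge-large-en-v1.5 cap at 512. Match your chunk size to the model.
Not normalizing. OpenAI returns L2-normalized vectors. Many open-source models do not unless you ask for it. Dot-product search only matches cosine similarity when the vectors are normalized, so an index configured for inner product will silently rank by vector length as well. Failing to normalize after a model swap kills accuracy quietly.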
Mixing model versions in one index. Re-embed everything when you swap, or use two indexes. Don’t half-migrate.
Skipping the eval set. Pick the model on a representative eval set, not vibes.
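
The eval set does not need to be elaborate. A minimal sketch of recall@k over hand-labelled queries, in plain Python; the query and document ids are made up, and in practice the ranked lists would come from running each candidate model against your own index.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant document")
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall_at_k(results, labels, k=10):
    """Average recall@k over all labelled queries."""
    scores = [recall_at_k(results[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)


# Toy eval set: query id -> relevant doc ids (hand-labelled in practice).
labels = {"q1": ["d1", "d4"], "q2": ["d2"]}
# Ranked doc ids returned by one candidate model (made-up values).
results = {"q1": ["d1", "d3", "d4"], "q2": ["d5", "d6", "d2"]}

score_k2 = mean_recall_at_k(results, labels, k=2)
score_k3 = mean_recall_at_k(results, labels, k=3)
```

Run the same labels against every candidate model and compare the means; a few dozen representative queries is often enough to see a domain-specific ranking emerge.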

The mistake I personally made: I swapped from ada-002 to text-embedding-3-small mid-project without re-embedding the existing corpus. Both models output 1536-dimensional vectors, so nothing errored; the new query vectors were simply searching against old corpus vectors. Accuracy quietly tanked for a week before I caught it.

Wrapping Up

For most teams in February 2024, text-embedding-3-small is the default. Pay extra for text-embedding-3-large with Matryoshka truncation when accuracy matters. Reach for Cohere v3 for cross-lingual. Run bge-m3 open-source when compliance or scale demands it. None of these are dramatically better than the others on most workloads — the differences add up, but rarely flip a project.

Next post in the series digs into chunking — arguably more consequential than the embedding model choice. The OpenAI embeddings docs cover the API details if you need them.
