Load Testing a Vector Search Pipeline Before It Breaks
TL;DR:
- Model realistic embedding query shapes, not synthetic noise.
- Drive Qdrant with k6 ramping VUs and watch p99 plus recall together.
- Saturation shows up in segment merges and disk I/O long before CPU pegs.
The first time I shipped a semantic search feature, it sailed through staging. Twelve concurrent users, sub-50ms responses, recall that matched the offline eval. Three weeks later a marketing email put 4,000 people on the page in ten minutes and the search box started returning 504s. The vector database wasn’t down. It was alive, technically responding, just doing so at 9 seconds per query while background segment merges fought the query threads for the same disk.
That failure mode is specific to vector search and most generic load tests miss it. An HTTP benchmark that hammers /search with a fixed payload tells you almost nothing, because the cost of an approximate nearest neighbour (ANN) query depends on the index state, the filter cardinality, the ef search parameter, and whether a merge is running. You need a test that exercises the pipeline — embedding generation, the ANN lookup, payload filtering, and rehydration — under load that looks like real traffic.
This post walks through building that test against Qdrant 1.13 with k6 as the driver and Prometheus 3.x scraping the database. We’ll find the breaking point deliberately, in a controlled environment, so it never finds us. If you also want continuous monitoring after the test, pair this with vector database performance monitoring.
Why Vector Search Load Tests Are Different
A relational query has a fairly stable cost curve. A vector search query does not. The HNSW graph that Qdrant uses has a tunable ef parameter that trades recall for latency, and that trade shifts under concurrency. At low load, a high ef is cheap. Under contention, every extra graph hop is a cache miss competing with other threads.
Three properties make this hard to test naively:
- Index state matters. A freshly loaded collection with one giant segment behaves differently from one that has absorbed 50,000 upserts and is mid-merge. Your test must run against a realistically aged index.
- Query distribution matters. Real queries cluster. Some embeddings land in dense regions of the space and are cheap; some land in sparse regions and force long traversals. A test that replays one vector 100,000 times measures cache behaviour, not search.
- Filters change everything. A search with filter: {must: [{key: "tenant", match: {value: "acme"}}]} over a high-cardinality field can be dramatically slower than an unfiltered search, because Qdrant may fall back to a different search strategy when the filtered subset is small.
So the test fixture is half the work. Get the data shape wrong and you’ll get a green dashboard followed by a production incident.
Building a Realistic Test Collection
Start by loading a collection that mirrors production scale. Don’t test on 10,000 vectors if you’ll run 5 million. Here’s a loader that seeds Qdrant with realistic dimensionality, payload filters, and — importantly — applies the upserts in batches so the index goes through real segment creation.
# seed_collection.py: Qdrant 1.13, qdrant-client 1.13.x
import random
import uuid

import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, OptimizersConfigDiff,
    HnswConfigDiff, PayloadSchemaType,
)

DIM = 768
TOTAL = 5_000_000
BATCH = 2_000
TENANTS = [f"tenant-{i:03d}" for i in range(200)]

client = QdrantClient(url="http://localhost:6333", timeout=120)

# recreate_collection is deprecated in recent qdrant-client releases,
# so drop and create the collection explicitly.
if client.collection_exists("docs"):
    client.delete_collection("docs")
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=DIM, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=128),
    optimizers_config=OptimizersConfigDiff(
        default_segment_number=4,
        memmap_threshold=200_000,  # in KB
    ),
)

# Index the filter field, otherwise filtered queries scan payloads.
client.create_payload_index(
    collection_name="docs",
    field_name="tenant",
    field_schema=PayloadSchemaType.KEYWORD,
)

rng = np.random.default_rng(42)
# Draw the cluster centers once so every batch shares the same 20 clusters.
CENTERS = rng.normal(size=(20, DIM))


def gen_batch(n: int) -> list[PointStruct]:
    # Mix dense clusters with uniform noise so query cost varies.
    pts = []
    for _ in range(n):
        if random.random() < 0.7:
            base = CENTERS[random.randrange(20)]
            vec = base + rng.normal(scale=0.15, size=DIM)
        else:
            vec = rng.normal(size=DIM)
        vec = vec / np.linalg.norm(vec)
        pts.append(PointStruct(
            id=str(uuid.uuid4()),
            vector=vec.tolist(),
            payload={
                "tenant": random.choice(TENANTS),
                "lang": random.choice(["en", "id", "ja", "de"]),
                "published": random.random() < 0.85,
            },
        ))
    return pts


loaded = 0
while loaded < TOTAL:
    n = min(BATCH, TOTAL - loaded)
    client.upsert(collection_name="docs", points=gen_batch(n), wait=False)
    loaded += n
    if loaded % 100_000 == 0:
        print(f"loaded {loaded:,}/{TOTAL:,}")

# update_collection does not block. Setting indexing_threshold (in KB) only
# makes sure indexing is enabled; poll for green status separately.
client.update_collection(
    collection_name="docs",
    optimizers_config=OptimizersConfigDiff(indexing_threshold=20_000),
)
print("seed complete; wait for green status before load testing")
The wait=False on upsert is deliberate. It lets the optimizer batch segment work the way it does in production. Before you run the load test, poll the collection until its status is green — testing during initial indexing measures the wrong thing.
# Wait for the optimizer to finish before testing.
until curl -s localhost:6333/collections/docs | grep -q '"status":"green"'; do
  echo "indexing... $(curl -s localhost:6333/collections/docs | python3 -c 'import sys,json; print(json.load(sys.stdin)["result"]["status"])')"
  sleep 10
done
echo "collection green"
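The shell loop above waits forever if the collection never turns green, which is awkward in CI. Here is a minimal Python sketch of the same poll with a timeout. The function name `wait_for_green` and its parameters are my own, and the status source is injected as a callable, so nothing here depends on a particular client API.

```python
import time
from typing import Callable


def wait_for_green(
    get_status: Callable[[], str],
    timeout_s: float = 3600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Poll get_status() until it returns "green"; return seconds waited.

    Raises TimeoutError if the status is still not green after timeout_s.
    """
    start = clock()
    while True:
        status = get_status()
        if status == "green":
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"collection still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

With qdrant-client, `get_status` would probably be something like `lambda: client.get_collection("docs").status.value`; treat that attribute path as an assumption and check it against your client version.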
Generating Query Vectors That Look Like Traffic
The k6 script needs query vectors. Generating random 768-dim vectors inside k6 on every iteration is slow and produces uniform queries. Instead, precompute a pool of query vectors offline — sampled from the same distribution as the data, plus some genuinely out-of-distribution ones — and load them as a JSON array.
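# build_query_pool.py
import json

import numpy as np

DIM = 768
POOL = 3_000

# Reuse the data's cluster centers: seed_collection.py draws them first from
# seed 42, so the same seed and shape reproduce them here.
centers = np.random.default_rng(42).normal(size=(20, DIM))
rng = np.random.default_rng(7)

queries = []
for _ in range(POOL):
    r = rng.random()
    if r < 0.6:    # near a cluster: cheap
        v = centers[rng.integers(20)] + rng.normal(scale=0.2, size=DIM)
    elif r < 0.9:  # mid-distance
        v = rng.normal(size=DIM)
    else:          # sparse region: expensive
        # Note: the normalization below cancels the scale, so this draws the
        # same kind of random direction as the mid-distance case.
        v = rng.normal(scale=2.5, size=DIM)
    v = v / np.linalg.norm(v)
    queries.append(v.tolist())

with open("query_pool.json", "w") as f:
    json.dump(queries, f)
print(f"wrote {POOL} query vectors")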
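Before handing the pool to k6, it is worth a quick sanity check, since a pool with the wrong dimension or many repeated vectors would quietly turn the run back into a cache benchmark. The sketch below is one way to do it with the standard library only. `check_pool` and its tolerance are hypothetical names and values, not part of any tool used in this post.

```python
import math


def check_pool(queries: list[list[float]], dim: int, tol: float = 1e-6) -> dict:
    """Check that every query has `dim` components and unit L2 norm.

    Returns the pool size, the worst deviation from unit norm, and the
    number of exact duplicate vectors.
    """
    worst = 0.0
    seen = set()
    duplicates = 0
    for i, v in enumerate(queries):
        if len(v) != dim:
            raise ValueError(f"query {i} has {len(v)} dims, expected {dim}")
        norm = math.sqrt(sum(x * x for x in v))
        worst = max(worst, abs(norm - 1.0))
        key = tuple(v)
        if key in seen:
            duplicates += 1
        seen.add(key)
    if worst > tol:
        raise ValueError(f"pool is not normalized (max norm error {worst:.3g})")
    return {"count": len(queries), "max_norm_error": worst, "duplicates": duplicates}
```

A typical call would be `check_pool(json.load(open("query_pool.json")), dim=768)` right after the pool is written.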
The k6 Load Test
Now the driver. This k6 script (see the k6 docs) ramps virtual users through four stages, picks a random query vector per iteration, applies a tenant filter to a random 40% of requests, and records latency, HTTP errors, and an empty-result rate as custom metrics.
// search_load.js: k6 0.58.x
import http from 'k6/http';
import { check } from 'k6';
import { Trend, Rate, Counter } from 'k6/metrics';
import { SharedArray } from 'k6/data';

const pool = new SharedArray('queries', () =>
  JSON.parse(open('./query_pool.json'))
);

const searchLatency = new Trend('search_latency_ms', true);
const emptyResults = new Rate('empty_result_rate');
const httpErrors = new Counter('search_http_errors');

const TENANTS = Array.from({ length: 200 }, (_, i) => `tenant-${String(i).padStart(3, '0')}`);
const BASE = __ENV.QDRANT_URL || 'http://localhost:6333';

export const options = {
  scenarios: {
    ramp: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '2m', target: 50 },   // warm caches
        { duration: '3m', target: 200 },  // expected peak
        { duration: '3m', target: 600 },  // 3x peak: find the wall
        { duration: '2m', target: 0 },    // ramp down
      ],
      gracefulRampDown: '30s',
    },
  },
  thresholds: {
    'search_latency_ms': ['p(95)<250', 'p(99)<800'],
    'search_http_errors': ['count<1'],
    'empty_result_rate': ['rate<0.01'],
  },
};

export default function () {
  const vector = pool[Math.floor(Math.random() * pool.length)];
  const useFilter = Math.random() < 0.4;

  const body = {
    vector: vector,
    limit: 10,
    with_payload: true,
    params: { hnsw_ef: 128 },
  };
  if (useFilter) {
    body.filter = {
      must: [{ key: 'tenant', match: { value: TENANTS[Math.floor(Math.random() * TENANTS.length)] } }],
    };
  }

  const res = http.post(`${BASE}/collections/docs/points/search`, JSON.stringify(body), {
    headers: { 'Content-Type': 'application/json' },
    tags: { filtered: String(useFilter) },
  });

  searchLatency.add(res.timings.duration, { filtered: String(useFilter) });

  const ok = check(res, {
    'status 200': (r) => r.status === 200,
    'has body': (r) => r.body && r.body.length > 2,
  });
  if (!ok) {
    httpErrors.add(1);
    return;
  }

  // Count unparseable bodies as errors, then record whether the result was empty.
  let hits = [];
  try {
    hits = JSON.parse(res.body).result || [];
  } catch (e) {
    httpErrors.add(1);
    return;
  }
  emptyResults.add(hits.length === 0 ? 1 : 0);
}
Two design choices worth calling out. The latency Trend is tagged with filtered, so in the summary you can compare filtered versus unfiltered p99 directly — that gap is often where the surprise lives. And empty_result_rate is a cheap proxy for correctness regression: if a filtered query starts returning zero hits under load, something is timing out internally and Qdrant is returning a partial result rather than erroring.
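The empty-result rate only catches total misses. To watch recall itself alongside p99, one option is to sample a few queries during the run, issue each one twice (once with the normal HNSW parameters and once with Qdrant's `exact: true` search param, which bypasses the graph), and compare the returned IDs. The sketch below covers only the comparison step. `recall_at_k` and `mean_recall` are my own helper names, and how you collect the ID lists is left to your harness.

```python
from typing import Hashable, Iterable, Sequence


def recall_at_k(approx_ids: Iterable[Hashable], exact_ids: Iterable[Hashable]) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids)) / len(exact)


def mean_recall(pairs: Sequence[tuple[Iterable[Hashable], Iterable[Hashable]]]) -> float:
    """Average recall over (approx_ids, exact_ids) pairs from sampled queries."""
    scores = [recall_at_k(approx, exact) for approx, exact in pairs]
    return sum(scores) / len(scores)
```

Run the exact searches sparingly: they scan every candidate point, so a handful per minute is enough to track a trend without distorting the load you are trying to measure.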
Run it and ship the metrics to Prometheus with k6’s output extension:
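# k6's -e flag only passes variables to the script, not to outputs,
# so set the remote-write settings as real environment variables.
# Prometheus must run with --web.enable-remote-write-receiver.
K6_PROMETHEUS_RW_SERVER_URL=http://localhost:9090/api/v1/write \
K6_PROMETHEUS_RW_TREND_STATS="p(95),p(99),max" \
k6 run --out experimental-prometheus-rw search_load.js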
Watching the Database, Not Just the Client
k6 tells you what the client experienced. It does not tell you why. For that, scrape Qdrant’s own /metrics endpoint with Prometheus 3.x.
# prometheus.yml: Prometheus 3.x
global:
  scrape_interval: 5s
  evaluation_interval: 5s

scrape_configs:
  - job_name: qdrant
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:6333']
  # k6 pushes its metrics through remote write, so it needs no scrape job.
The metrics that actually predict the cliff:
| Metric | What it tells you |
|---|---|
| qdrant_collection_pending_operations | Optimizer is behind. Rising = merges starving queries. |
| rate(qdrant_rest_responses_total{status!~"2.."}[1m]) | Server-side error rate, independent of k6. |
| qdrant_collection_hardware_search_io_read | Disk reads per search. Climbs when the working set spills off the page cache. |
| process_resident_memory_bytes | RSS approaching memmap threshold means cold reads ahead. |
The single most useful PromQL expression during a ramp:
# Server-observed search p99, by collection.
histogram_quantile(
  0.99,
  sum by (le, collection) (
    rate(qdrant_collection_search_duration_seconds_bucket[30s])
  )
)
Overlay that on the k6 client p99. When the two diverge — client latency climbing while server latency stays flat — you have a connection-pool or queueing problem on the client side, not a database problem. When they climb together, the database is genuinely saturated.
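If you want that comparison as an automated check instead of an eyeballed graph, the rule can be sketched as a small function. Everything here is an assumption for illustration: the name `diagnose`, the idea of comparing against a baseline taken at the expected-peak stage, and the 2x factor that counts as "climbing".

```python
def diagnose(
    client_p99_ms: float,
    server_p99_ms: float,
    baseline_client_ms: float,
    baseline_server_ms: float,
    factor: float = 2.0,
) -> str:
    """Classify a latency reading by which side has climbed past its baseline."""
    client_up = client_p99_ms > factor * baseline_client_ms
    server_up = server_p99_ms > factor * baseline_server_ms
    if client_up and server_up:
        return "database saturated"
    if client_up:
        return "client-side queueing"
    if server_up:
        # Server slow but clients unaffected: not a case the rule covers.
        return "inconclusive"
    return "healthy"
```

Feed it the k6 p99 and the server-side histogram p99 for the same time window. Mismatched windows are the easiest way to get a wrong answer.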
Reading the Breaking Point
A clean run has three phases. During warm-up, latency drops as the page cache fills. During the expected-peak stage, latency is flat — boring is good. During the 3x stage, one of two things happens.
If qdrant_collection_pending_operations spikes and disk read I/O climbs while CPU sits at 60%, you are I/O bound. The fix is more RAM (so the HNSW graph stays in page cache), faster disks, or raising memmap_threshold so smaller segments stay in memory. If CPU pegs at 100% across all cores and pending operations stay near zero, you are compute bound — scale out with replication or shard the collection.
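To judge whether "more RAM" is realistic, a back-of-envelope estimate helps. The sketch below uses the common rule of thumb of four bytes per float32 component times an overhead factor for the index and metadata. The 1.5 factor is an assumption, and the result ignores payloads and payload indexes, so treat it as a floor, not a sizing guarantee.

```python
GIB = 1024 ** 3


def estimate_vector_ram_bytes(
    num_vectors: int,
    dim: int,
    bytes_per_component: int = 4,  # float32
    overhead: float = 1.5,         # assumed allowance for graph and metadata
) -> int:
    """Rough RAM needed to keep vectors and their index in memory."""
    return int(num_vectors * dim * bytes_per_component * overhead)


# The collection from this post: 5 million 768-dimensional vectors.
needed = estimate_vector_ram_bytes(5_000_000, 768)
print(f"{needed / GIB:.1f} GiB")  # prints "21.5 GiB"
```

If the box has less than that free after the OS and other processes, the I/O-bound outcome in the 3x stage should not be a surprise.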
The filtered-versus-unfiltered split usually tells its own story. If filtered p99 is 5x unfiltered, your filter field either isn’t indexed or its cardinality is high enough that Qdrant is doing per-point payload checks. Re-check that create_payload_index call ran.
Common Pitfalls
- Testing a cold collection. First run after a restart hits disk for everything. Always run a warm-up stage and discard it from analysis.
- One query vector, replayed. This benchmarks the page cache. Use the precomputed pool with varied distances.
- wait=true on the seed upserts. Every call blocks until its batch is applied, so the seed no longer resembles production’s asynchronous write path and the resulting index can come out tidier than a real one. Production indexes are messier.
- Setting gracefulRampDown too low. k6 defaults it to 30s; at 0s, k6 interrupts iterations still in flight when VUs ramp down, and you get phantom error spikes that aren’t real failures.
- Co-locating k6 and Qdrant. The load generator steals CPU from the database. Run k6 on a separate box, or at minimum pin it to different cores.
- Trusting client latency alone. k6 p99 includes network and connection setup. Always cross-check against the server-side histogram.
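Discarding the warm-up stage is easy to forget when you compute percentiles from raw samples. A minimal sketch, assuming you have exported per-request samples as (seconds since test start, latency in ms) pairs; the function names and the nearest-rank percentile method are my choices, not something k6 prescribes.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p99_after_warmup(samples: Sequence[tuple[float, float]], warmup_s: float) -> float:
    """p99 latency over samples taken at or after warmup_s seconds."""
    kept = [latency for t, latency in samples if t >= warmup_s]
    return percentile(kept, 99)
```

For the script in this post the warm-up stage lasts two minutes, so `warmup_s=120` is the natural cutoff.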
Troubleshooting
Symptom: k6 reports rising p99 but qdrant_collection_search_duration_seconds stays flat.
Cause: Client-side connection pool exhaustion — k6 VUs queueing for sockets.
Fix: Raise the file descriptor limit (ulimit -n 65535), enable HTTP keep-alive (k6 does by default, but a proxy may break it), and confirm with ss -s that connections aren’t piling up in TIME_WAIT.
Symptom: empty_result_rate climbs above zero only during the 3x stage.
Cause: Internal search timeout — Qdrant returns a partial result instead of erroring when a segment search exceeds its deadline.
Fix: Check qdrant_collection_pending_operations. If it’s high, the optimizer is starving query threads; lower the upsert rate during tests or raise indexing_threshold.
Symptom: Latency is fine until it suddenly steps up 4x and stays there.
Cause: A segment merge kicked in mid-test and the working set no longer fits in page cache.
Fix: Watch qdrant_collection_hardware_search_io_read. Provision RAM to at least 1.5x the on-disk index size, or set on_disk: false for the HNSW graph if you have the memory.
Symptom: search_http_errors spikes exactly at stage transitions.
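Cause: gracefulRampDown too short, so k6 interrupts iterations that are still in flight when VUs ramp down.
Fix: Raise gracefulRampDown above your slowest expected request (the script above sets 30s, so go higher if requests run longer) and re-run.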
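To confirm that an error spike really lines up with a stage boundary, convert the stage list into cumulative timestamps and compare. The sketch below mirrors the stages from the k6 script. The helper names and the 5-second window are assumptions, and the duration parser handles only single-unit values such as 2m or 30s, not compound ones like 1m30s.

```python
# Stages copied from search_load.js as (duration, target VUs).
STAGES = [("2m", 50), ("3m", 200), ("3m", 600), ("2m", 0)]


def parse_duration(text: str) -> int:
    """Convert a single-unit duration such as '2m' or '30s' to seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(text[:-1]) * units[text[-1]]


def stage_boundaries(stages: list[tuple[str, int]]) -> list[int]:
    """Seconds since test start at which each stage ends."""
    ends, elapsed = [], 0
    for duration, _target in stages:
        elapsed += parse_duration(duration)
        ends.append(elapsed)
    return ends


def near_boundary(t_s: float, stages: list[tuple[str, int]], window_s: float = 5.0) -> bool:
    """True if t_s falls within window_s seconds of any stage end."""
    return any(abs(t_s - end) <= window_s for end in stage_boundaries(stages))
```

Errors that cluster around 480 seconds, where the ramp-down begins, point at interrupted iterations. Errors spread across the 3x stage point at the database.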
Wrapping Up
A vector search load test that mirrors real index state, query distribution, and filter behaviour will surface the I/O cliff and the merge-contention failure mode before a traffic spike does. Run it as part of your release pipeline against a production-scale collection, gate merges on the k6 thresholds, and keep the server-side Prometheus histograms next to the client metrics so you always know whether the bottleneck is yours or the database’s. Once you trust the numbers, wire the same Prometheus scrape into continuous vector database performance monitoring so regressions never reach a marketing email.