Monitoring Vector Database Performance Under Heavy Load


March 13, 2026 · 8 min read · by Muhammad Amal programming

TL;DR: Scrape Qdrant’s /metrics endpoint and watch search latency percentiles per collection. HNSW recall degrades silently under load, so track it, don’t assume it. Unoptimized segments and high RAM pressure are the usual tail-latency culprits.

A retrieval-augmented feature I ran was fine in every load test and fell over the first real Monday morning. Search p99 went from 18ms to 600ms with no error, no alert, no crash. The vector database just got slow. Our load test had used a static dataset; production had a live ingestion stream writing into the same collection we were querying, and that changed everything.

Vector databases have a failure mode that traditional databases mostly don’t: they can return worse answers under load without returning wrong answers in any way a status code reveals. HNSW search is approximate. When the index is mid-rebuild, or segments are unoptimized, or vectors got paged to disk, you still get a 200 and a plausible-looking result set — it’s just less accurate, or much slower, or both. Vector database performance monitoring has to catch quality and latency regressions that never surface as errors.

This post walks through a monitoring setup for Qdrant 1.13 with Prometheus 3.x. We’ll scrape the right metrics, understand the HNSW and segment internals behind them, and build alerts that fire on the symptoms that actually predict an incident. If you instrument the LLM calls around retrieval (see Instrumenting LLM Calls with OpenTelemetry Traces), this is the layer that explains the retrieval span when it goes red.

What Qdrant Exposes

Qdrant serves Prometheus metrics at /metrics on the same port as the REST API, so no separate exporter is needed. Point Prometheus at it:

# prometheus.yml
scrape_configs:
  - job_name: qdrant
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["qdrant-0:6333", "qdrant-1:6333", "qdrant-2:6333"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

The metrics that matter, grouped by what they tell you:

app_info — build version. Useful to confirm a rolling upgrade completed.
collections_total, collections_vector_total — inventory and dataset size.
rest_responses_total, rest_responses_fail_total — request volume and failures by endpoint.
rest_responses_duration_seconds — a histogram of REST latency. This is your latency SLI source.
grpc_responses_* — the same for the gRPC interface, if your clients use it.

Qdrant’s metric set and the /metrics endpoint are documented in the Qdrant monitoring docs. Start there for the authoritative list, since it grows between releases.
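
Before writing rules against these series, it helps to dump the names and labels an instance actually exposes. What follows is a minimal sketch using only the standard library. The SAMPLE payload is illustrative rather than captured output, the function name is my own, and the parser ignores optional timestamps. In practice you would fetch the text from the /metrics URL of one of your nodes.

```python
# parse_metrics.py
from collections import defaultdict

# Illustrative sample only; real output carries more series and labels.
SAMPLE = '''\
# TYPE app_info gauge
app_info{name="qdrant",version="1.13.0"} 1
# TYPE collections_total gauge
collections_total 3
# TYPE rest_responses_total counter
rest_responses_total{method="POST",endpoint="/collections/{name}/points/search",status="200"} 1200
rest_responses_total{method="PUT",endpoint="/collections/{name}/points",status="200"} 300
'''


def parse_samples(text: str) -> dict[str, list[tuple[str, float]]]:
    """Group samples by metric name as (label_string, value) pairs."""
    out = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The value is the last space-separated field (no timestamps assumed).
        series, _, value = line.rpartition(" ")
        # Split at the first "{" so braces inside label values survive.
        name, _, labels = series.partition("{")
        out[name].append((labels.rstrip("}"), float(value)))
    return dict(out)


metrics = parse_samples(SAMPLE)
for name, samples in metrics.items():
    print(name, len(samples))
```

Checking the label names this way is worth a minute, because a selector on a label that does not exist matches nothing and the rule silently evaluates to an empty result.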

Latency Percentiles Per Collection

A single global latency number averages a slow collection into a fast one and hides the problem. Compute percentiles from the histogram, broken down by endpoint.

# qdrant-rules.yaml
groups:
  - name: qdrant_latency
    interval: 30s
    rules:
      # Qdrant labels REST series with `method` (the HTTP verb) and
      # `endpoint` (the route), so the route filter goes on `endpoint`.
      - record: qdrant:search_latency_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, instance) (
              rate(rest_responses_duration_seconds_bucket{
                endpoint=~".*points/search.*"}[5m])
            )
          )

      - record: qdrant:search_latency_seconds:p50
        expr: |
          histogram_quantile(0.50,
            sum by (le, instance) (
              rate(rest_responses_duration_seconds_bucket{
                endpoint=~".*points/search.*"}[5m])
            )
          )

      - record: qdrant:search_qps
        expr: |
          sum by (instance) (
            rate(rest_responses_total{
              endpoint=~".*points/search.*"}[5m])
          )

Watch the spread between p50 and p99, not p99 alone. A widening gap means tail latency is degrading while the median looks fine, and that is the early warning. By the time p50 moves, you’re already in trouble.
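
To make the spread concrete, here is a small sketch of how a quantile is estimated from cumulative histogram buckets, using the same linear-interpolation idea Prometheus documents for classic histograms. The bucket bounds and counts are made up for illustration, and this is a simplification rather than a copy of the Prometheus implementation.

```python
# quantile_sketch.py
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with +Inf. The value is
    interpolated linearly inside the bucket that contains the target rank.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank lands in the +Inf bucket: report the last finite bound.
                return lower_bound
            width = upper_bound - lower_bound
            in_bucket = count - lower_count
            return lower_bound + width * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical search latency buckets in seconds (cumulative counts).
BUCKETS = [
    (0.005, 400), (0.01, 900), (0.025, 960), (0.05, 980),
    (0.1, 992), (0.25, 998), (math.inf, 1000),
]

p50 = histogram_quantile(0.50, BUCKETS)
p99 = histogram_quantile(0.99, BUCKETS)
print(f"p50={p50 * 1000:.1f}ms p99={p99 * 1000:.1f}ms ratio={p99 / p50:.1f}")
```

With these numbers the p99 stays under 100ms, which looks healthy against a static threshold, while the ratio to the median is already well past 8x. Note also that the estimate can never be finer than the bucket boundaries, so a quantile that lands in a wide bucket is only roughly placed.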

The HNSW Internals Behind the Numbers

To monitor a vector database you have to understand what it’s doing when it’s slow. Qdrant stores each collection as a set of segments, each with its own HNSW graph. A search runs against every segment and merges results.

Two parameters dominate search behavior:

m — edges per node in the HNSW graph. Higher m means better recall and more RAM.
ef (search-time hnsw_ef) — the size of the candidate list during search. Higher ef means better recall and slower searches. This is the dial for trading latency against accuracy.
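
Recall is only a trade-off you control if you measure it. One common approach is to run a sample of queries twice, once through the normal HNSW path and once as an exact scan, and compare the returned IDs. The sketch below covers only the comparison step. The function names are my own, and how you obtain the two ID lists is an assumption: as I understand the Python client, passing SearchParams(exact=True) at search time requests the exact scan, but verify that against your client version.

```python
# recall_check.py
def recall_at_k(approx_ids: list, exact_ids: list, k: int) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 1.0  # nothing to find, so nothing was missed
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_recall(pairs: list[tuple[list, list]], k: int) -> float:
    """Average recall@k over (approx_ids, exact_ids) pairs from sample queries."""
    if not pairs:
        raise ValueError("need at least one query pair")
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(pairs)


# Hypothetical results for two sample queries.
pairs = [
    ([1, 2, 3, 9], [1, 2, 3, 4]),   # one of four neighbours missed
    ([5, 6, 7, 8], [5, 6, 7, 8]),   # perfect match
]
print(mean_recall(pairs, k=4))
```

Exact scans are expensive, so this kind of check fits a low-rate background job on a fixed query sample rather than the request path.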

Here’s the part the load test missed. When you write into a collection, Qdrant creates new small segments and indexes them in the background. While a segment’s HNSW index is still building, searches against it fall back to a brute-force scan. A few unindexed segments under steady write load is exactly the “slow but correct” failure. So configure the indexing threshold and optimizer deliberately:

# collection_config.py
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, HnswConfigDiff, OptimizersConfigDiff,
)

client = QdrantClient(url="http://qdrant-0:6333")

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(
        m=16,                  # graph connectivity; 16 is a sane default
        ef_construct=128,      # build-time candidate list
        # Threshold in kilobytes of vector data, not a vector count.
        # Below it, Qdrant prefers a full scan over HNSW traversal.
        full_scan_threshold=10_000,
    ),
    optimizers_config=OptimizersConfigDiff(
        # Trigger optimization once a segment is >20% deleted vectors.
        deleted_threshold=0.2,
        # Target segment count, so search doesn't fan out across too many.
        default_segment_number=4,
        # Also in kilobytes: segments larger than this get an HNSW index.
        indexing_threshold=20_000,
    ),
)

indexing_threshold is the lever for the Monday-morning problem. Too high, and large unindexed segments accumulate under write load and searches degrade to brute force. Too low, and you pay constant indexing overhead. Tune it against your actual write rate.
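
When tuning, it helps to translate the threshold into vectors and seconds. Qdrant documents indexing_threshold in kilobytes of vector data, with the note that 1 KB corresponds to one 256-dimensional vector. The arithmetic below is a rough sketch under that assumption, with float32 components and a hypothetical write rate. It ignores payload size and how writes spread across segments.

```python
# threshold_math.py
def vector_kb(dim: int, bytes_per_component: int = 4) -> float:
    """Raw size of one vector in kilobytes (float32 by default)."""
    return dim * bytes_per_component / 1024


def vectors_before_indexing(indexing_threshold_kb: int, dim: int) -> int:
    """Roughly how many vectors a segment holds before it crosses the threshold."""
    return int(indexing_threshold_kb / vector_kb(dim))


def seconds_to_threshold(indexing_threshold_kb: int, dim: int,
                         writes_per_second: float) -> float:
    """How long a single segment takes to reach the threshold at a steady write rate."""
    return vectors_before_indexing(indexing_threshold_kb, dim) / writes_per_second


dim = 1536
print(vector_kb(dim))                        # kilobytes per vector
print(vectors_before_indexing(20_000, dim))  # vectors per unindexed segment
print(seconds_to_threshold(20_000, dim, writes_per_second=200.0))
```

For 1536-dimensional float32 vectors, each vector is 6 KB, so a 20,000 KB threshold is only a few thousand vectors. At a few hundred writes per second, a fresh segment crosses it in well under a minute, which is why indexing churn shows up so quickly under a live ingestion stream.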

Watching Segments and Memory

Per-collection state isn’t all in /metrics — pull it from the collection info API and feed it to Prometheus through a small exporter, or poll it in a sidecar. Optimizer status is the single most useful field.

# segment_check.py
from qdrant_client import QdrantClient

client = QdrantClient(url="http://qdrant-0:6333")


def collection_health(name: str) -> dict:
    info = client.get_collection(collection_name=name)
    # status: "green" healthy, "yellow" optimizing, "red" error.
    return {
        "status": info.status,
        "optimizer_status": str(info.optimizer_status),
        "segments_count": info.segments_count,
        "indexed_vectors": info.indexed_vectors_count or 0,
        # points_count matches the vector count only when each point
        # carries a single vector; adjust for named-vector collections.
        "total_vectors": info.points_count or 0,
    }


def index_coverage(name: str) -> float:
    """Fraction of vectors actually in an HNSW index. Below ~0.95
    under load means searches are falling back to brute force."""
    h = collection_health(name)
    total = h["total_vectors"]
    return h["indexed_vectors"] / total if total else 1.0

index_coverage is the metric I wish I’d had on that first bad Monday. When it drops below ~0.95 during a write burst, you know searches are scanning unindexed data and latency is about to spike — before users feel it.
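
To get coverage into Prometheus, the exporter or sidecar only has to render a few gauges in text exposition format. Here is a minimal sketch of the rendering step, taking dictionaries shaped like the output of collection_health. The metric names are hypothetical and not part of Qdrant’s own /metrics, and serving the text over HTTP is left out.

```python
# coverage_exporter.py
def render_metrics(health_by_collection: dict[str, dict]) -> str:
    """Render per-collection health as Prometheus text exposition format.

    Metric names are illustrative. Collection names are assumed not to
    contain quotes or backslashes, which would need escaping in labels.
    """
    coverage_lines, segment_lines = [], []
    for name, h in sorted(health_by_collection.items()):
        total = h["total_vectors"]
        coverage = h["indexed_vectors"] / total if total else 1.0
        labels = f'{{collection="{name}"}}'
        coverage_lines.append(
            f"qdrant_collection_index_coverage{labels} {coverage:.4f}")
        segment_lines.append(
            f"qdrant_collection_segments{labels} {h['segments_count']}")
    return "\n".join([
        "# TYPE qdrant_collection_index_coverage gauge", *coverage_lines,
        "# TYPE qdrant_collection_segments gauge", *segment_lines,
    ]) + "\n"


# Hypothetical snapshot for one collection during a write burst.
snapshot = {
    "documents": {"indexed_vectors": 900, "total_vectors": 1000,
                  "segments_count": 6},
}
print(render_metrics(snapshot), end="")
```

Once the gauge is scraped, an alert on coverage below 0.95 for a few minutes gives you the early signal described above.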

On memory: HNSW graphs want to live in RAM. When the working set exceeds available memory, Qdrant pages vector data from disk and search latency jumps by an order of magnitude. Monitor the container’s RSS against its limit and alert well before the limit, because the latency cliff arrives before the OOM kill does.
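
For a first guess at whether the working set fits, Qdrant’s capacity planning guidance gives a rule of thumb of vectors × dimensions × 4 bytes × 1.5. The sketch below applies that formula and computes headroom against a container limit. The function names and the example sizes are my own, and the 1.5 overhead factor is an approximation, so treat the result as a starting estimate and confirm it against measured RSS.

```python
# memory_estimate.py
def estimated_vector_ram_bytes(num_vectors: int, dim: int,
                               bytes_per_component: int = 4,
                               overhead: float = 1.5) -> int:
    """Rule-of-thumb RAM for vectors plus index overhead."""
    return int(num_vectors * dim * bytes_per_component * overhead)


def memory_headroom(rss_bytes: int, limit_bytes: int) -> float:
    """Fraction of the memory limit still free (0.0 means at the limit)."""
    return 1.0 - rss_bytes / limit_bytes


GIB = 1024 ** 3

# Hypothetical collection: 10M vectors at 1536 dimensions.
full = estimated_vector_ram_bytes(10_000_000, 1536)
# Same collection with 1-byte components, as a rough stand-in for
# int8 scalar quantization of the in-memory copy.
quantized = estimated_vector_ram_bytes(10_000_000, 1536, bytes_per_component=1)

print(f"float32: {full / GIB:.1f} GiB, int8: {quantized / GIB:.1f} GiB")
print(f"headroom: {memory_headroom(12 * GIB, 16 * GIB):.2f}")
```

The headroom figure is the one to alert on. Picking the threshold is a judgment call, but it should leave enough room that a write burst does not push the working set out of RAM.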

Alerting on the Symptoms That Matter

# qdrant-alerts.yaml
groups:
  - name: qdrant_alerts
    rules:
      - alert: QdrantSearchLatencyHigh
        expr: qdrant:search_latency_seconds:p99 > 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant search p99 above 150ms on {{ $labels.instance }}"

      - alert: QdrantSearchLatencyTailSpread
        # p99 more than 8x p50 — tail degrading while median looks fine.
        expr: |
          qdrant:search_latency_seconds:p99
            > 8 * qdrant:search_latency_seconds:p50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant tail latency spreading from the median"

      - alert: QdrantRequestFailures
        expr: |
          rate(rest_responses_fail_total[5m]) > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Qdrant returning errors on {{ $labels.instance }}"

The QdrantSearchLatencyTailSpread alert is the one that earns its keep. It catches the silent degradation — index coverage dropping, segments unoptimized — long before absolute p99 crosses a static threshold.
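
A toy timeline shows why. The sketch below evaluates both conditions over a made-up series of (p50, p99) samples and reports the evaluation at which each would fire. The hold_steps argument is a simplified stand-in for the `for:` clause, counting consecutive true evaluations instead of elapsed time, and all numbers are hypothetical.

```python
# alert_sketch.py
# Hypothetical (p50, p99) samples in seconds, one per rule evaluation.
SAMPLES = [
    (0.006, 0.018), (0.006, 0.040), (0.007, 0.070), (0.007, 0.090),
    (0.008, 0.120), (0.010, 0.200), (0.020, 0.400), (0.050, 0.600),
]


def first_fire(conditions: list[bool], hold_steps: int):
    """Index of the evaluation at which the alert fires, or None.

    The condition must hold for `hold_steps` consecutive evaluations.
    """
    streak = 0
    for i, active in enumerate(conditions):
        streak = streak + 1 if active else 0
        if streak >= hold_steps:
            return i
    return None


static_fire = first_fire([p99 > 0.15 for _, p99 in SAMPLES], hold_steps=2)
spread_fire = first_fire([p99 > 8 * p50 for p50, p99 in SAMPLES], hold_steps=2)
print(f"static threshold fires at step {static_fire}, "
      f"tail spread fires at step {spread_fire}")
```

In this series the spread condition fires several evaluations before the static one, while the p99 is still under 100ms. Real traffic is noisier, so the 8x multiplier and the hold duration both need tuning against your own baseline.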

Common Pitfalls

Load testing against a static dataset. Real traffic writes while it reads. Test with a concurrent ingestion stream or you’ll miss the indexing-churn failure entirely.
One global latency number. It averages collections together and hides the slow one. Break latency down per collection and endpoint.
Ignoring index coverage. Searches against unindexed segments are slow but return 200s. Without coverage monitoring you’re blind to it.
Setting hnsw_ef once and forgetting it. It’s a live latency-versus-recall trade. Higher ef recovers recall at a latency cost; tune it per workload.
No headroom on memory. The latency cliff from disk paging hits before the OOM kill. Alert on RSS well under the limit.
Too many segments. Each search fans out across every segment. Unbounded segment count quietly inflates latency; keep it in check with default_segment_number, which sets the optimizer’s target.

Troubleshooting

Symptom: search p99 spikes during ingestion, no errors. Cause: new segments are unindexed and searches fall back to brute force. Fix: lower indexing_threshold so segments index sooner, and watch index_coverage during writes.

Symptom: latency degraded 10x with no traffic change. Cause: the working set outgrew RAM and Qdrant is paging vectors from disk. Fix: add memory, enable scalar quantization to shrink the footprint, or shard the collection across nodes.

Symptom: collection status stuck on yellow. Cause: the optimizer is mid-run, often after a large delete or bulk write. Fix: this is usually transient. If it persists for hours, check disk I/O and CPU headroom, since the optimizer may be starved of resources.

Symptom: recall dropped but latency is fine. Cause: hnsw_ef is too low for the current dataset size, or a segment was indexed with a low ef_construct. Fix: raise the search-time hnsw_ef, and rebuild the collection with a higher ef_construct if the regression persists.

Symptom: /metrics returns 404. Cause: scraping the wrong port, or the metrics endpoint disabled in config. Fix: scrape the REST API port (6333 by default) at path /metrics, and confirm it’s enabled in the Qdrant service config.

What’s Next

You can now see vector search latency per collection, catch the silent tail-latency spread, and tie a slowdown back to unindexed segments or memory pressure before users notice. The natural follow-on is making the load test honest — driving Qdrant with realistic concurrent read-write traffic so the next Monday morning holds no surprises.
