Milvus 2.2 in Production, Self-Hosting the Heavyweight Open-Source Vector Database

April 17, 2023 · 7 min read · by Muhammad Amal ai

TL;DR: Milvus is a real distributed system with eight microservices, etcd, Pulsar (or Kafka), and an object store. Don’t deploy it like you’d deploy Postgres. The HNSW vs IVF_PQ choice is the most important tuning decision, and it depends entirely on whether your dataset fits in RAM. Resource sizing is the operational pain point; expect to revisit it after the first month in production.

I moved a team off Pinecone onto self-hosted Milvus in March, mostly for cost reasons - we were running 50M vectors and the pod-based bill was looking unfortunate. The migration itself took two weeks. Stabilizing the deployment took another month. This post is the operational knowledge I wish I’d had on day one.

Milvus 2.2 is a genuinely capable system, but it’s also a complex one. If you’re considering it, make sure the cost or scale arguments justify the operational overhead, because the overhead is real. The vector database landscape post from earlier this month has the broader decision framework if you’re still weighing options.

I’ll cover the architecture, the install path that actually works, index selection, schema design, and the failure modes I’ve watched happen in production.

What you’re actually deploying

Milvus 2.2 has a cloud-native architecture with a clear separation of concerns. The components:

Proxy. The entry point for client requests. Stateless, horizontally scalable.
Root coordinator. Handles DDL operations (create/drop collection), transaction IDs, and time-tick allocation.
Query coordinator. Manages query nodes, balances load.
Data coordinator. Manages data segments, triggers compactions.
Index coordinator. Schedules index building.
Query nodes. Actually run vector searches. Memory-heavy.
Data nodes. Handle insertions, write segments to object storage.
Index nodes. Build vector indexes asynchronously. CPU-heavy.

Plus three external dependencies:

etcd for metadata.
Pulsar (default) or Kafka for the write-ahead log.
MinIO (default) or S3 for segment storage.

A standalone deployment runs all of this in a single container, which is fine for development. A cluster deployment runs each as a separate service, and that’s what production requires.

Installing on Kubernetes

The Milvus Helm chart is the supported install path. Skip everything else; the Docker Compose setup is for laptops, not servers.

helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm repo update

helm install milvus milvus/milvus \
    --version 4.0.20 \
    --namespace milvus \
    --create-namespace \
    -f values.yaml

The interesting bits live in values.yaml. A starter cluster config that’s worked for me, scoped to a workload of ~50M 768-dim vectors:

cluster:
  enabled: true

externalEtcd:
  enabled: false
etcd:
  replicaCount: 3

externalS3:
  enabled: true
  host: s3.us-west-2.amazonaws.com
  port: 443
  useSSL: true  # needed for an HTTPS endpoint such as AWS S3 on 443
  accessKey: "..."
  secretKey: "..."
  bucketName: "company-milvus"
minio:
  enabled: false

pulsar:
  enabled: true
  zookeeper:
    replicaCount: 3
  bookkeeper:
    replicaCount: 3
  broker:
    replicaCount: 2

queryNode:
  replicas: 3
  resources:
    requests:
      memory: 16Gi
      cpu: 4
    limits:
      memory: 24Gi
      cpu: 8

indexNode:
  replicas: 2
  resources:
    requests:
      memory: 8Gi
      cpu: 4

dataNode:
  replicas: 2
  resources:
    requests:
      memory: 4Gi
      cpu: 2

proxy:
  replicas: 2

The thing that will catch you off guard: Pulsar is itself a distributed system with its own ZooKeeper and BookKeeper clusters. Unless you already run Pulsar, you’re now operating two distributed systems, not one. Some teams swap Pulsar for Kafka, which works but isn’t the upstream default and gets less testing attention.

The Milvus deployment docs cover the install in more detail, but the section on resource sizing is thin. You’ll be tuning the values for weeks.
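Before tuning anything, it helps to know what the starter config asks the cluster for. A small sketch that totals the requests from the values.yaml above; the `REQUESTS` table and `total_requests` helper are illustrative names, and the figures are copied by hand from that file. It leaves out the proxies, the coordinators, etcd, and Pulsar, which set no explicit requests in that config.

```python
from typing import Dict, Tuple

# (replicas, memory request in GiB, CPU request in cores), copied from values.yaml above.
REQUESTS: Dict[str, Tuple[int, int, int]] = {
    "queryNode": (3, 16, 4),
    "indexNode": (2, 8, 4),
    "dataNode": (2, 4, 2),
}


def total_requests(requests: Dict[str, Tuple[int, int, int]]) -> Tuple[int, int]:
    """Return (total memory GiB, total CPU cores) requested across all replicas."""
    memory = sum(replicas * mem for replicas, mem, _ in requests.values())
    cpu = sum(replicas * cores for replicas, _, cores in requests.values())
    return memory, cpu


memory_gib, cpu_cores = total_requests(REQUESTS)
print(f"worker requests: {memory_gib} GiB memory, {cpu_cores} CPU cores")
```

That comes to 72 GiB and 24 cores for the worker nodes alone, which is the floor your node pool has to cover before the dependencies are scheduled.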

Schema design

Milvus collections have explicit schemas. Define them carefully because alterations are limited in 2.2.

from pymilvus import (
    connections, FieldSchema, CollectionSchema, DataType,
    Collection, utility,
)

connections.connect(alias="default", host="milvus-proxy", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=64),
    FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=64),
    FieldSchema(name="timestamp", dtype=DataType.INT64),
]
schema = CollectionSchema(
    fields=fields,
    description="documentation chunks",
    enable_dynamic_field=False,
)

if utility.has_collection("docs"):
    Collection("docs").drop()

collection = Collection(name="docs", schema=schema, shards_num=4)

shards_num is the per-collection sharding factor. It defaults to 2, but for write-heavy collections I bump it to 4 or 8. You can’t change it after creation, so err on the high side.

enable_dynamic_field is new in 2.2 and lets you add unstructured JSON fields. I keep it off in production because it makes filter performance harder to reason about.
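Because the schema is hard to change later, it can be worth rejecting bad rows on the client before they reach Milvus. A minimal sketch in plain Python that mirrors the fields above; `validate_row` is a hypothetical helper, not a pymilvus API, and it assumes VARCHAR max_length is counted in bytes.

```python
from typing import Any, Dict, List

# Limits copied by hand from the FieldSchema list above (hypothetical helper, not pymilvus).
EMBEDDING_DIM = 768
VARCHAR_LIMITS = {"source": 64, "doc_id": 64}
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def validate_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the row matches the schema."""
    problems = []
    for name in ("id", "timestamp"):
        value = row.get(name)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not INT64_MIN <= value <= INT64_MAX:
            problems.append(f"{name}: expected an INT64 value")
    embedding = row.get("embedding")
    if not isinstance(embedding, (list, tuple)) or len(embedding) != EMBEDDING_DIM:
        problems.append(f"embedding: expected {EMBEDDING_DIM} floats")
    for name, max_length in VARCHAR_LIMITS.items():
        value = row.get(name)
        # Assumption: the limit applies to the UTF-8 encoded length.
        if not isinstance(value, str) or len(value.encode("utf-8")) > max_length:
            problems.append(f"{name}: expected a string of at most {max_length} bytes")
    return problems
```

Running this in the ingestion pipeline turns a failed batch insert into a per-row error message you can log and skip.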

Choosing an index

This is the decision that will define your performance characteristics. The options:

HNSW - In-memory graph index. Fastest queries, highest memory cost. The right choice if your full dataset fits in query node RAM.
IVF_FLAT - Inverted file with full vectors. Lower memory than HNSW, somewhat slower queries.
IVF_SQ8 - IVF with 8-bit scalar quantization. ~4x memory savings, small recall loss.
IVF_PQ - IVF with product quantization. ~16x memory savings, larger recall loss. The right choice when your dataset doesn’t fit in RAM.
DISKANN - Disk-based index. Newest in 2.2, lets you keep most of the data on NVMe. Worth experimenting with for very large datasets.

For the 50M-vector workload I migrated, I went with HNSW after benchmarking. RAM cost was 200GB across the query nodes, which fit comfortably on 3x r5.4xlarge instances. Recall@10 was 0.98 versus exact search; latency p99 was 25ms.
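The "does it fit in RAM" question is mostly arithmetic. Here is a back-of-envelope sketch, assuming float32 vectors and approximating the HNSW graph as 2*M four-byte links per vector at the base layer. The function names are illustrative, and real usage will be higher once segment overhead, scalar fields, and query buffers are counted.

```python
import math

GIB = 1024**3


def raw_vector_bytes(num_vectors: int, dim: int) -> int:
    """float32 storage for the raw vectors: 4 bytes per dimension."""
    return num_vectors * dim * 4


def hnsw_bytes(num_vectors: int, dim: int, m: int = 16) -> int:
    """Raw vectors plus an approximate base-layer graph (2*M 4-byte links per vector)."""
    return raw_vector_bytes(num_vectors, dim) + num_vectors * 2 * m * 4


def query_nodes_needed(index_bytes: int, node_ram_bytes: int,
                       target_utilization: float = 0.5) -> int:
    """Nodes required to hold the index at the target steady-state utilization."""
    return math.ceil(index_bytes / (node_ram_bytes * target_utilization))


n, dim = 50_000_000, 768
raw = raw_vector_bytes(n, dim)
print(f"raw float32:   {raw / GIB:.0f} GiB")
print(f"HNSW (M=16):   {hnsw_bytes(n, dim) / GIB:.0f} GiB")
print(f"IVF_SQ8 (~4x): {raw / 4 / GIB:.0f} GiB")
print(f"IVF_PQ (~16x): {raw / 16 / GIB:.0f} GiB")
```

Plugging the measured 200GB into `query_nodes_needed` with 128 GiB per node (the r5.4xlarge size) and a 50% utilization target gives three nodes.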

collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        # Milvus 2.2 accepts L2 and IP for float vectors; COSINE arrived in 2.3.
        # With L2-normalized embeddings, IP produces the same ranking as cosine.
        "metric_type": "IP",
        "params": {"M": 16, "efConstruction": 200},
    },
)
collection.load()

The HNSW parameters are worth understanding:

M - Maximum number of neighbors per node in the graph. Higher M = better recall, more memory. 16 is the sweet spot for most workloads. 32 if you’re tuning hard.
efConstruction - Build-time search width. Higher = better graph quality, slower build. 200 is the default; I haven’t found a reason to go higher.

At query time, ef controls the search width. Higher = better recall, slower queries.

results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"ef": 64}},  # must match the index metric
    limit=10,
    expr='source == "blog"',
    output_fields=["doc_id", "source"],
)

ef should be at least limit and is typically set to 4-8x limit for good recall.
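That rule of thumb is easy to encode so callers can’t pass an ef below limit. A small sketch: `pick_ef` is a hypothetical helper, and the 32768 ceiling is an assumption based on the documented ef range, so check it against your version.

```python
def pick_ef(limit: int, multiplier: int = 4, ef_max: int = 32768) -> int:
    """Return limit * multiplier, capped at ef_max but never below limit."""
    if limit < 1:
        raise ValueError("limit must be positive")
    return max(limit, min(limit * multiplier, ef_max))


for limit in (10, 100, 1000):
    print(limit, pick_ef(limit), pick_ef(limit, multiplier=8))
```

Deriving ef from limit keeps recall roughly stable when different callers ask for different result counts, at the cost of slower queries for large limits.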

The load/release dance

Milvus has an explicit “load” step that pulls a collection’s index from object storage into query node memory. Until you call collection.load(), queries fail. After collection.release(), queries fail again.

This matters because collection load isn’t instant on large datasets. A 50M-vector HNSW index takes ~5 minutes to load across 3 query nodes. If you restart query nodes (rolling upgrade, autoscale event), there’s a window where queries fail.

My standard pattern: keep important collections loaded at all times, and use a sidecar process that polls collection.has_index() and utility.loading_progress() and re-loads on startup.

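from pymilvus import Collection, MilvusException, utility

def ensure_loaded(name: str, timeout: int = 600):
    """Load the collection unless it is already fully loaded."""
    collection = Collection(name)
    try:
        progress = utility.loading_progress(name)
        if progress["loading_progress"] == "100%":
            return
    except MilvusException:
        # Some versions raise here when the collection was never loaded; fall through.
        pass
    collection.load(timeout=timeout)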
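
For a readiness gate, the same progress check can be wrapped in a polling loop with a deadline. This is a sketch under assumptions: `get_progress` is any callable returning the `{"loading_progress": "87%"}` shape used above (for example a lambda around utility.loading_progress), and both `parse_progress` and `wait_until_loaded` are hypothetical helpers.

```python
import time
from typing import Callable, Dict


def parse_progress(progress: Dict[str, str]) -> int:
    """Turn {"loading_progress": "87%"} into 87."""
    return int(str(progress["loading_progress"]).rstrip("%"))


def wait_until_loaded(get_progress: Callable[[], Dict[str, str]],
                      timeout: float = 600.0,
                      poll_interval: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until loading reaches 100%; return False if the deadline passes first."""
    deadline = time.monotonic() + timeout
    while True:
        if parse_progress(get_progress()) >= 100:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval)
```

Passing the progress source in as a callable keeps the loop testable without a running cluster; in a sidecar it would be called as `wait_until_loaded(lambda: utility.loading_progress("docs"))`.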

Compactions

Milvus segments small writes into separate files, then compacts them in the background. If your write rate is high relative to your compaction throughput, you’ll accumulate small segments and query performance degrades.

Symptoms: query latency creeping up over weeks, segment count visible via utility.get_query_segment_info() climbing into the thousands.

Mitigations: provision more data nodes (compaction tasks run there, and index nodes then rebuild indexes for the compacted segments), tune dataCoord.segment.sealProportion, or trigger manual compactions during low-traffic windows.

Collection("docs").compact()

I do a manual compaction nightly via a CronJob in low-traffic environments. The Milvus team is improving the auto-compaction heuristics, but in 2.2 they’re conservative.
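
To decide whether a manual compaction is worth triggering, a rough summary of segment sizes is usually enough. A sketch, assuming you have already pulled a row count per segment out of utility.get_query_segment_info() (assumed here to be exposed as a num_rows attribute on each entry). `summarize_segments` and the 100,000-row threshold are illustrative, not Milvus defaults.

```python
from typing import Dict, List, Union


def summarize_segments(row_counts: List[int],
                       small_threshold: int = 100_000) -> Dict[str, Union[int, float]]:
    """Count segments and the share that fall below the small-segment threshold."""
    small = [rows for rows in row_counts if rows < small_threshold]
    return {
        "segments": len(row_counts),
        "small_segments": len(small),
        "small_fraction": len(small) / len(row_counts) if row_counts else 0.0,
    }


# Hypothetical row counts for four segments.
print(summarize_segments([10, 200_000, 50_000, 1_000_000]))
```

A high small-segment fraction on a collection with a steady write rate suggests compaction is falling behind.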

Common Pitfalls

Underprovisioned query nodes. Query node OOMs are the most common production failure. If your dataset is 100GB in HNSW format and you have 2 query nodes with 32GB RAM each, you’re going to crash. Add headroom; aim for 50% memory utilization at steady state.
Forgetting consistency_level. Milvus offers four consistency levels (Strong, Bounded, Session, Eventually). The default is Bounded. If your writes need to be queryable immediately, set consistency_level="Strong" on the search call, but it costs latency.
Using auto_id=True with deduplication. Milvus generates IDs if you set auto_id=True, which means re-ingestion creates duplicates. For idempotent pipelines, use deterministic IDs (hash of source ID).
Treating delete as immediate. Deletes in Milvus 2.2 are logical; the data is marked tombstoned and only physically removed during compaction. Searches stop returning deleted entities once the delete is visible at your consistency level, but storage doesn’t shrink until compaction, and collection.num_entities may still count the deleted rows until then.
Skipping the standalone tier for development. A standalone Milvus container is ~2GB RAM and runs fine for local dev. Don’t make every developer share a cluster instance.
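
For the deterministic-ID pitfall above, one option is to hash the source identifier into the INT64 primary key. A minimal sketch: `deterministic_id` is a hypothetical helper, SHA-256 is just one reasonable hash, and truncating to 63 bits leaves a small but nonzero collision chance at large scale.

```python
import hashlib


def deterministic_id(source_id: str) -> int:
    """Map a source identifier to a stable, non-negative INT64 primary key."""
    digest = hashlib.sha256(source_id.encode("utf-8")).digest()
    # Keep 63 bits so the value fits a signed 64-bit integer.
    return int.from_bytes(digest[:8], "big") & (2**63 - 1)


# The same input always yields the same key, so re-ingestion targets the same ID.
print(deterministic_id("docs/intro.md#3") == deterministic_id("docs/intro.md#3"))
```

Unlike Python’s built-in hash(), this is stable across processes and machines, which is what makes the ID usable as a dedup key.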

Observability

The Milvus team ships a Prometheus exporter that’s solid. The metrics I watch:

milvus_querynode_search_latency_bucket - p50/p95/p99 query latency.
milvus_querynode_load_segment_latency_bucket - load times during recovery.
milvus_datanode_save_latency_bucket - write path latency.
milvus_datacoord_segment_num - number of segments per collection. If this climbs into the hundreds for a stable workload, compactions are falling behind.
process_resident_memory_bytes on query nodes - the OOM canary.

Alerting on query node memory >85% and segment count >500 catches most operational issues before they become incidents.
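
In practice these thresholds would live in Prometheus alerting rules, but the logic is simple enough to sketch in a few lines. `evaluate_alerts` is a hypothetical helper. It assumes you already have the resident memory, the container memory limit, and the segment count as plain numbers.

```python
from typing import List


def evaluate_alerts(resident_bytes: int, memory_limit_bytes: int, segment_count: int,
                    memory_threshold: float = 0.85,
                    segment_threshold: int = 500) -> List[str]:
    """Return alert messages for the two thresholds described above."""
    alerts = []
    utilization = resident_bytes / memory_limit_bytes
    if utilization > memory_threshold:
        alerts.append(f"query node memory at {utilization:.0%} of limit")
    if segment_count > segment_threshold:
        alerts.append(f"segment count {segment_count} exceeds {segment_threshold}")
    return alerts


GIB = 1024**3
print(evaluate_alerts(22 * GIB, 24 * GIB, segment_count=120))
```

Comparing against the container limit rather than node RAM matters here, because the kernel OOM-kills the pod at its limit, not at the node’s capacity.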

Wrapping Up

Milvus is the right tool when you’re past 50M vectors and the Pinecone bill stops making sense, or when data residency or operational independence is a hard requirement. It’s not the right tool for “we don’t want to depend on a vendor” if you’re under 10M vectors; the operational cost outweighs the licensing savings at that scale.

The next post in this series goes back to easier territory with Weaviate, which sits in the sweet spot between Pinecone’s simplicity and Milvus’s flexibility for many workloads.
