Serving SLMs at Scale with vLLM, A Production Guide


January 15, 2025 · 9 min read · by Muhammad Amal

TL;DR — vLLM 0.6.4 is what you use when one llama.cpp instance isn’t enough. PagedAttention plus continuous batching gives you 5-10x the throughput of naive serving on the same GPU. Costs you VRAM, complexity, and a hard requirement on CUDA.

I run llama.cpp for low-concurrency local work and vLLM for anything that needs to scale past about 10 requests per second on a single GPU. The crossover point is real and hardware-dependent, but the question to ask is: am I CPU-bound on prompt processing, or am I leaving GPU cycles on the floor between requests? If the answer is the second, you want vLLM.

vLLM is also where you go when you need genuine multi-tenant serving — many users hitting the same endpoint with different prompts of different lengths. Its scheduler is built for that case. llama.cpp can be made to do it with --parallel, but vLLM was designed for it from day one.

This post walks through a vLLM 0.6.4 deployment from install through hardening. I’m targeting NVIDIA hardware: as of early 2025 vLLM is primarily Linux + CUDA, and ROCm support is improving but still has caveats. For Apple Silicon, see the MLX route in a later post.

Why vLLM and Not Just llama.cpp

Two features make vLLM faster under concurrency:

PagedAttention. The KV cache is allocated in fixed-size “pages” (like an OS virtual memory system) instead of one big contiguous block per request. This means you don’t have to reserve worst-case memory for each slot. A request that needs 100 tokens holds about 100 tokens of cache (rounded up to a whole page); a request that needs 8000 holds 8000. Net effect: 2-4x more concurrent requests on the same VRAM.

Continuous batching. Instead of waiting for all requests in a batch to finish, vLLM evicts completed requests and admits new ones every step. This keeps GPU utilization near 100% even with heterogeneous request lengths.
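Both ideas are easy to see in a toy model. The sketch below is not vLLM's allocator or scheduler; it assumes a 16-token page size and made-up request lengths, and simply counts cache tokens reserved and decode steps taken under each strategy.

```python
import math
from collections import deque

PAGE_TOKENS = 16  # assumed page size in tokens


def contiguous_tokens(lengths, max_len):
    """Worst-case reservation: every request gets max_len tokens of cache."""
    return len(lengths) * max_len


def paged_tokens(lengths, page_tokens=PAGE_TOKENS):
    """Paged allocation: each request holds only the pages it has touched."""
    return sum(math.ceil(n / page_tokens) * page_tokens for n in lengths)


def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a finished request frees its slot at every step."""
    queue, running, steps = deque(lengths), [], 0
    while queue or running:
        # Admit queued requests into any free slots.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # One decode step for every running request; drop the finished ones.
        running = [r - 1 for r in running if r > 1]
    return steps


prompt_lengths = [100, 8000, 350, 1200]              # hypothetical workload
output_lengths = [10, 200, 15, 180, 20, 25, 190, 12]  # hypothetical workload

print(contiguous_tokens(prompt_lengths, 8192), paged_tokens(prompt_lengths))
print(static_batch_steps(output_lengths, 4), continuous_batch_steps(output_lengths, 4))
```

On these made-up numbers, paging holds 9,664 cache tokens instead of 32,768, and continuous batching finishes in 220 decode steps instead of 390. The exact ratios depend entirely on how skewed your request lengths are.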

The cost is complexity and Python-shaped overhead. Cold start is slower. The minimum VRAM footprint is higher. And vLLM is firmly Python-first, while llama.cpp is a small C++ binary.

Installation

Step 1, set up a CUDA-capable Python environment

python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

Check your CUDA:

nvidia-smi
# CUDA Version: 12.4 (or 12.1, both fine)

Step 2, install vLLM

pip install "vllm==0.6.4"

This is a large install (multiple GB of CUDA wheels). For air-gapped environments, build a wheelhouse first.

Step 3, sanity check

python -c "import vllm; print(vllm.__version__)"
# 0.6.4

Your First Server

vLLM ships a built-in OpenAI-compatible server. Use it.

Step 1, start the server

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 \
  --port 8000

Important flags:

--dtype bfloat16 for Ampere or newer; use float16 for Turing.
--max-model-len 8192 is the maximum context per request. Set this to your real maximum, not the model’s theoretical max: a request’s KV cache grows linearly with its length, and vLLM has to be able to hold at least one request of this size.
--gpu-memory-utilization 0.85 reserves 85% of VRAM for vLLM. Default is 0.9. Lower it if you share the GPU with other workloads.

The first start downloads weights to ~/.cache/huggingface/hub/. Subsequent starts hit the cache.

Step 2, call it

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-empty")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "Reply tersely."},
        {"role": "user", "content": "Why is PagedAttention faster?"},
    ],
    temperature=0.1,
    max_tokens=256,
)
print(resp.choices[0].message.content)

Note that the model parameter must match what vLLM was started with. Unlike llama.cpp’s server, vLLM enforces this.

Quantization on vLLM

Out of the box vLLM runs unquantized weights. For 3B at bf16 that’s around 6.4 GB. For larger models you’ll want quantization. vLLM supports AWQ, GPTQ, and FP8 in 0.6.4.

Step 1, pull an AWQ-quantized variant

pip install "autoawq==0.2.7.post3"

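Serving a pre-quantized checkpoint uses vLLM’s own AWQ kernels; autoawq is only needed if you quantize a model yourself. Many models already have AWQ versions on Hugging Face: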

huggingface-cli download Qwen/Qwen2.5-7B-Instruct-AWQ

Step 2, serve it

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct-AWQ \
  --quantization awq \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85

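AWQ on a 7B drops you from around 14 GB to around 4.5 GB of weights. Throughput depends on batch size: 4-bit weights help most when decoding is memory-bound (small batches), while at large batches the dequantization overhead can leave AWQ at or below bf16. Benchmark on your own traffic before committing.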
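The weight numbers are simple arithmetic, sketched here under the assumption of a nominal 7.0B parameter count (real checkpoints vary). The 4-bit figure is a floor: actual AWQ checkpoints are larger because they also store per-group scales and zero-points and typically keep the embeddings unquantized.

```python
def weight_gb(num_params, bits_per_weight):
    """Approximate weight memory in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_7B = 7.0e9  # nominal parameter count, an assumption

bf16_gb = weight_gb(PARAMS_7B, 16)   # 2 bytes per weight
int4_gb = weight_gb(PARAMS_7B, 4)    # 4-bit weights only, no metadata

print(round(bf16_gb, 1), round(int4_gb, 1))
```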

Step 3, consider FP8 for newer hardware

If you’re on H100 or H200, FP8 is faster than AWQ. Pass --quantization fp8 with an FP8-converted model.

Architecture

How vLLM fits in a real deployment:

                    +-----------+
                    |  Clients  |
                    +-----+-----+
                          | HTTPS
                          v
                    +-----------+
                    |  Nginx /  |
                    |  Envoy    |
                    +-----+-----+
                          | HTTP
        +-----------------+-----------------+
        v                                   v
+----------------+                 +----------------+
| vLLM server #1 |                 | vLLM server #2 |
|  GPU 0         |                 |  GPU 1         |
|  +----------+  |                 |  +----------+  |
|  |Scheduler |  |                 |  |Scheduler |  |
|  +----+-----+  |                 |  +----+-----+  |
|       |        |                 |       |        |
|  +----v-----+  |                 |  +----v-----+  |
|  |PagedAttn |  |                 |  |PagedAttn |  |
|  | KV cache |  |                 |  | KV cache |  |
|  +----------+  |                 |  +----------+  |
+----------------+                 +----------------+
        |                                   |
        +-----------------+-----------------+
                          |
                          v
                +-------------------+
                |  Prometheus +     |
                |  Grafana          |
                +-------------------+

Each vLLM instance owns one (or more) GPUs. Scale horizontally by adding instances behind a load balancer; vLLM doesn’t have built-in multi-instance coordination, so the load balancer simply uses round-robin or least-connections routing.
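In practice the proxy does this for you, but the policy itself is tiny. Here is a minimal sketch of least-connections selection, which tends to suit LLM traffic because request durations vary so widely. The backend URLs are hypothetical.

```python
class LeastConnections:
    """Pick the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self.in_flight = {b: 0 for b in backends}

    def acquire(self):
        # Ties go to the backend listed first.
        backend = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[backend] += 1
        return backend

    def release(self, backend):
        self.in_flight[backend] -= 1


lb = LeastConnections(["http://gpu0:8000", "http://gpu1:8000"])  # hypothetical hosts
first = lb.acquire()    # both idle, so the first backend wins the tie
second = lb.acquire()   # the other backend is now less loaded
lb.release(first)       # the first request finishes
third = lb.acquire()    # goes back to the now-idle backend
print(first, second, third)
```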

Production Hardening

Step 1, run under a process manager

A systemd unit for vLLM:

# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM server
After=network.target

[Service]
User=vllm
Group=vllm
WorkingDirectory=/opt/vllm
Environment="HF_HOME=/var/cache/huggingface"
Environment="VLLM_LOGGING_LEVEL=INFO"
Environment="CUDA_VISIBLE_DEVICES=0"
ExecStart=/opt/vllm/.venv/bin/python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --host 127.0.0.1 \
  --port 8000 \
  --disable-log-requests
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

--disable-log-requests is important. The default log spam is huge under load.

Step 2, scrape Prometheus metrics

vLLM exposes /metrics in Prometheus format. The metrics that matter:

vllm:num_requests_running — current concurrency
vllm:num_requests_waiting — queue depth (alarming if non-zero for long)
vllm:gpu_cache_usage_perc — KV cache pressure
vllm:time_to_first_token_seconds — TTFT histogram
vllm:time_per_output_token_seconds — inter-token latency

A minimal Prometheus scrape config:

scrape_configs:
  - job_name: vllm
    scrape_interval: 5s
    static_configs:
      - targets: ['127.0.0.1:8000']
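If you want a quick health check without standing up Prometheus, the text format is easy to parse by hand. This is a simplified sketch: the sample payload is made up (shaped like a /metrics response), the parser ignores timestamps and histograms, and the alert thresholds are placeholders to tune.

```python
def parse_gauges(text):
    """Parse 'name{labels} value' lines from Prometheus text format."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values


def should_alert(metrics, max_waiting=0, max_cache=0.9):
    """Flag a non-empty queue or a nearly full KV cache (thresholds are placeholders)."""
    return (metrics.get("vllm:num_requests_waiting", 0) > max_waiting
            or metrics.get("vllm:gpu_cache_usage_perc", 0) > max_cache)


# Made-up sample payload for illustration only.
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_running{model_name="demo"} 16.0
vllm:num_requests_waiting{model_name="demo"} 7.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.93
"""

metrics = parse_gauges(SAMPLE)
print(metrics, should_alert(metrics))
```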

Step 3, set rate limits at the proxy

vLLM does not have built-in rate limiting. Do it in nginx:

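# limit_req_zone belongs in the http {} context, outside any server block.
limit_req_zone $binary_remote_addr zone=vllm:10m rate=20r/s;

server {
    listen 443 ssl;
    server_name slm.example.com;
    # ssl_certificate and ssl_certificate_key directives omitted here.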

    location /v1/ {
        limit_req zone=vllm burst=40 nodelay;
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}

20 req/s per IP with bursts to 40 is reasonable for an internal endpoint. Adjust for your traffic.
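To build intuition for how rate and burst interact, here is a simplified leaky-bucket model in the spirit of limit_req with nodelay. It is my own approximation for reasoning about the numbers, not nginx's implementation, and the class name is hypothetical.

```python
class LeakyBucket:
    """Simplified leaky-bucket limiter for a single client."""

    def __init__(self, rate, burst):
        self.rate = rate      # sustained requests per second
        self.burst = burst    # requests tolerated above the sustained rate
        self.excess = 0.0     # how far ahead of the rate this client is
        self.last = None      # timestamp of the last accepted request

    def allow(self, now):
        if self.last is None:
            self.last = now
            return True
        # Drain for the elapsed time, then count this request.
        excess = max(self.excess - self.rate * (now - self.last) + 1.0, 0.0)
        if excess > self.burst:
            return False      # the proxy would reject this one (503 by default)
        self.excess, self.last = excess, now
        return True


bucket = LeakyBucket(rate=20, burst=40)
accepted = sum(bucket.allow(now=0.0) for _ in range(100))
after_one_second = bucket.allow(now=1.0)
print(accepted, after_one_second)
```

In this model, 100 simultaneous requests from one IP get 41 through (the first plus 40 of burst), and the client is accepted again once the bucket has drained.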

Streaming and Function Calling

vLLM’s OpenAI API supports streaming and tool calls out of the box.

Streaming

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about KV cache."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
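Streaming also gives you a client-side view of time-to-first-token, which is worth comparing against the server's own histogram. A minimal sketch: measure_stream is a hypothetical helper that takes any iterable of text deltas, and fake_stream stands in for the real response so the example runs without a server.

```python
import time


def measure_stream(deltas, clock=time.perf_counter):
    """Return (time_to_first_token, total_time, text) for a stream of text deltas."""
    start = clock()
    first = None
    parts = []
    for delta in deltas:
        if delta and first is None:
            first = clock() - start  # first non-empty piece of output
        parts.append(delta)
    return first, clock() - start, "".join(parts)


def fake_stream():
    # Stand-in for: (c.choices[0].delta.content or "" for c in stream)
    for piece in ["", "Pages ", "of ", "keys"]:
        time.sleep(0.01)
        yield piece


ttft, total, text = measure_stream(fake_stream())
print(f"TTFT {ttft:.3f}s, total {total:.3f}s, text {text!r}")
```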

Tool calling

Pass --enable-auto-tool-choice --tool-call-parser llama3_json (or hermes, depending on model) when starting vLLM. Then:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Jakarta?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)

Function calling on 3B models is hit-and-miss. Llama 3.2 3B is the best of the SLMs for this, but I still validate every output.
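Validation here can be as simple as checking the call against the tool list you sent. The sketch below uses a hypothetical validate_tool_call helper that only checks the tool name, JSON validity, and required or unknown keys; a full JSON Schema validator would be stricter. With the OpenAI client, the name and arguments come from tool_call.function.name and tool_call.function.arguments.

```python
import json


def validate_tool_call(name, arguments_json, tools):
    """Return parsed arguments if the call matches a declared tool, else raise ValueError."""
    specs = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    if name not in specs:
        raise ValueError(f"unknown tool: {name}")
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    spec = specs[name]
    missing = [k for k in spec.get("required", []) if k not in args]
    unknown = [k for k in args if k not in spec.get("properties", {})]
    if missing or unknown:
        raise ValueError(f"missing={missing} unknown={unknown}")
    return args


TOOLS = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]},
}}]

print(validate_tool_call("get_weather", '{"city": "Jakarta"}', TOOLS))
```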

Common Pitfalls

Setting --max-model-len to the model max. Llama 3.2 supports 128k. With --max-model-len 131072, vLLM has to be able to hold at least one 128k-token request in the KV cache, and a handful of very long requests can then occupy the whole cache and crush concurrency. Fix: set to your 99th-percentile real prompt length plus generation.
Forgetting --gpu-memory-utilization. Default 0.9 will OOM if anything else is on the GPU. Fix: 0.85 in shared environments.
Mixing model versions with HF cache. If you change the model ID slightly, vLLM happily downloads a new copy and fills your cache disk. Fix: pin model IDs to a specific revision (--revision accepts a commit hash), prune the cache regularly.
Expecting bit-identical outputs. vLLM does not produce the same outputs as transformers or llama.cpp even at temperature 0. Different attention implementations, different kernel order. Fix: don’t pin assertions to exact strings; pin to JSON schema validity or semantic equality.
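For the last pitfall, a cross-engine test can compare parsed values instead of raw strings. A minimal sketch, assuming your prompts ask for JSON output; same_json is a hypothetical helper name.

```python
import json


def same_json(a, b):
    """True if two model outputs parse to the same JSON value."""
    try:
        return json.loads(a) == json.loads(b)
    except json.JSONDecodeError:
        return False


# Key order and whitespace differ, but the meaning is the same.
assert same_json('{"city": "Jakarta", "unit": "C"}',
                 '{ "unit":"C", "city":"Jakarta" }')
# Free-form text is not valid JSON, so it never matches.
assert not same_json('{"city": "Jakarta"}', "The city is Jakarta.")
```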

Troubleshooting

Symptom: Cold start takes 90+ seconds. Diagnose: Model is being downloaded from HF on first run. Pre-warm by running huggingface-cli download <model> during the container build. CUDA graph capture also takes 10-20s on first start; this is normal.

Symptom: RuntimeError: CUDA out of memory mid-traffic, not at startup. Diagnose: Either a long-prompt request is filling KV cache, or you have memory leaks from a buggy custom decoder. Check vllm:gpu_cache_usage_perc. If it spikes to 100% before OOM, you need to lower --max-num-seqs or --max-model-len.

Symptom: Throughput drops by half after a few hours. Diagnose: Almost always a host-side bottleneck — Python GC pause, log file growth filling disk, or your reverse proxy choking. Check vllm:gpu_cache_usage_perc and nvidia-smi. If GPU util is still high but TTFT is up, the problem is in front of vLLM.

Capacity Planning the Quick Way

People ask me for sizing rules. There aren’t any rigorous ones, but here’s the rule of thumb that has gotten me close enough.

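Take your model’s bf16 weight size (Llama-3.2-3B = 6.4 GB). Add KV cache for one slot at your max context: roughly 2 * num_layers * num_kv_heads * head_dim * max_seq_len * 2 bytes (keys and values, 2 bytes each in bf16). Use the KV head count, not hidden_dim: with grouped-query attention the cached width is much smaller than the hidden size. For Llama-3.2-3B (28 layers, 8 KV heads, head_dim 128) at 8k context, that’s around 1 GB per slot. Multiply by max_num_seqs. Add 1-2 GB headroom for activations and CUDA workspace.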

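For Llama-3.2-3B at 8k context, 16 concurrent slots: 6.4 + 16*1.0 + 2 = roughly 24 GB. That is the worst case, with every slot full at 8k, and it sits right at the limit of an RTX 4090’s 24 GB even with --gpu-memory-utilization at 0.95. It works in practice because PagedAttention only allocates pages for tokens actually in use, so real traffic rarely fills every slot at once.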
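The same estimate as a script, so you can plug in your own model. This is a rough sketch: the per-token width uses KV heads times head dimension, which is what a grouped-query attention model actually caches, and the config values (28 layers, 8 KV heads, head_dim 128) are my reading of the Llama-3.2-3B config, so check them against your checkpoint.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Keys plus values for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def worst_case_gb(weights_gb, slots, max_len, headroom_gb=2.0, **model):
    """Weights + every slot full at max_len + fixed headroom, in GB (10^9 bytes)."""
    kv_gb = kv_bytes_per_token(**model) * max_len * slots / 1e9
    return weights_gb + kv_gb + headroom_gb


# Assumed Llama-3.2-3B config values.
llama_3b = dict(num_layers=28, num_kv_heads=8, head_dim=128)

per_slot_gb = kv_bytes_per_token(**llama_3b) * 8192 / 1e9
total_gb = worst_case_gb(6.4, slots=16, max_len=8192, **llama_3b)
print(round(per_slot_gb, 2), round(total_gb, 1))
```

With these inputs the script gives about 0.94 GB per slot and 23.4 GB in total, in line with the rounded figures above.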

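vLLM exposes --max-num-seqs to cap concurrency. The default (256 in 0.6.4) is far more than a 24 GB card can hold at long contexts; when the KV cache runs out, vLLM preempts running requests and recomputes them later, which shows up as latency spikes under load. Set it explicitly to the number you sized for.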

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.85

Multi-LoRA Serving

A feature I use heavily in 0.6.4: serving multiple LoRA adapters off a single base model. If you’ve fine-tuned three task-specific adapters (extractor, router, summarizer), serve them all from one vLLM with shared base weights.

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --enable-lora \
  --lora-modules \
    extractor=/path/to/extractor-lora \
    router=/path/to/router-lora \
    summarizer=/path/to/summarizer-lora \
  --max-lora-rank 16

Clients select the adapter by model name in the request:

client.chat.completions.create(
    model="extractor",  # not the base name
    messages=[...],
)

Shared base, swapped adapters, one GPU. This is the killer feature if you’re shipping more than one fine-tuned task.

Prefix Caching

A free win in vLLM 0.6.4: automatic prefix caching. If you have a long shared system prompt across many requests, vLLM detects the shared prefix and reuses the KV cache for it. Enable with --enable-prefix-caching.

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --enable-prefix-caching \
  --max-model-len 8192 \
  --max-num-seqs 16

For a workload where every request starts with a 1500-token system prompt, prefix caching cuts time-to-first-token by 60-80% on cache hits. Watch the vllm:gpu_prefix_cache_hit_rate metric to confirm it’s working. The cache is automatic, but any edit to the prefix invalidates it from the changed token onward, so be deliberate about prompt versioning: small “improvements” near the top of your system prompt can silently halve your throughput by blowing the cache.
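Why a small edit hurts so much is easier to see with a toy model of block-level prefix matching. This is a simplified sketch, not vLLM's code: it assumes 16-token blocks, hashes each full block together with everything before it, and uses integer lists as stand-ins for token IDs.

```python
import hashlib

BLOCK = 16  # assumed tokens per cache block


def block_hashes(tokens, block=BLOCK):
    """Hash each full block chained with the hash of all blocks before it."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % block, block):
        chunk = ",".join(map(str, tokens[i:i + block])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


def cached_blocks(cache, tokens):
    """Count the leading blocks of `tokens` that are already in `cache`."""
    hits = 0
    for h in block_hashes(tokens):
        if h not in cache:
            break
        hits += 1
    return hits


system_prompt = list(range(1504))          # stand-in for a ~1500-token prompt
cache = set(block_hashes(system_prompt))   # the first request fills the cache

same = system_prompt + [9001, 9002]        # same prefix, new user turn
edited = [7] + system_prompt[1:] + [9001]  # first token of the prompt changed

print(cached_blocks(cache, same), cached_blocks(cache, edited))
```

Because each block's hash depends on everything before it, the unchanged prompt reuses all 94 blocks while the edited one reuses none. An edit near the end of the prompt would only lose the blocks from that point on, which is a good argument for keeping the volatile parts of a prompt at the bottom.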

What’s Next

vLLM is the heaviest of the three serving options in this series, but on a single GPU under real concurrency it’s also the fastest. The mental model to keep is “PagedAttention plus continuous batching equals near-100% GPU utilization under mixed loads.” Once that clicks, the rest is operational. The next post leaves serving for a while and digs into fine-tuning, because no amount of clever serving fixes a model that doesn’t know your task. See the official vLLM docs for the long tail of flags I skipped.