llama.cpp Deep Dive: Quantization, GGUF, and Inference Speed
January 13, 2025 · 10 min read · by Muhammad Amal · programming

TL;DR — llama.cpp gives you knobs Ollama hides. Build with the right backend flags, pick the right quantization (Q4_K_M for most, Q5_K_M when you have the VRAM, IQ4_XS when you don’t), and tune --n-gpu-layers plus --ctx-size to fit. Done well, a 3B model serves 60+ tok/s on a single RTX 4070.

Ollama is great until you need to debug why your tokens-per-second number is half of what someone else reports on the same hardware. At that point you’re staring at llama.cpp anyway, because Ollama is llama.cpp underneath. Better to learn the layer where the actual work happens.

This post is a guided tour. It covers what GGUF actually contains, what the alphabet soup of quantization names means, how to build llama.cpp with the flags that matter, and what to tune for speed. I’ll assume you’ve used Ollama or transformers before and want to understand the layer below.

A note on versions: llama.cpp moves fast. I’m using master as of January 2025, roughly release tag b3500 (release tags have the form bNNNN and are what you pin to). The CLI flag names have been stable for months, but build flags shift every few weeks. Check the README if something doesn’t match.

What GGUF Actually Is

GGUF (GPT-Generated Unified Format) is the file format that replaced GGML in 2023. It’s a single-file container with:

  • A header with a magic number, the format version, and tensor and metadata counts
  • Key-value metadata (tokenizer config, chat template, architecture)
  • Tensor descriptors (name, shape, quantization type, offset)
  • Tensor data, aligned for mmap

Three properties matter in practice. First, it’s mmap-friendly: the runtime memory-maps the file and lets the OS page in tensors as needed. Second, it’s self-describing: the tokenizer, the chat template, even the prompt format are inside the file. Third, it’s quantization-aware: each tensor can be stored in a different precision.

You can inspect a GGUF with the gguf-dump tool from the gguf Python package (pip install gguf), which is developed in the llama.cpp repository under gguf-py:

gguf-dump models/Llama-3.2-3B-Instruct-Q4_K_M.gguf | head -40

You’ll see entries like general.architecture = llama, llama.context_length = 131072, and per-tensor quantization types. If you ever wonder why a model “just works” when you load it, this is why.
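
If you’d rather peek at the file yourself, the fixed-size start of the header is easy to read. This is a minimal sketch, assuming a little-endian GGUF of version 2 or later (where both counts are 64-bit). read_gguf_header is my own helper name, not something llama.cpp ships:

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes little-endian GGUF v2+: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key-value count.
    """
    with open(path, "rb") as f:
        raw = f.read(24)  # 4 + 4 + 8 + 8 bytes
    if len(raw) < 24 or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:])
    return version, tensor_count, kv_count
```

Calling read_gguf_header("models/Llama-3.2-3B-Instruct-Q4_K_M.gguf") returns the three integers. The key-value metadata and tensor descriptors follow immediately after these 24 bytes.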

The Quantization Alphabet Soup

Quantization is lossy compression for weights. llama.cpp ships about a dozen schemes. You don’t need all of them. Here are the ones that matter in January 2025.

Name      Bits   Notes
--------  -----  ------------------------------------------------
F16       16     Reference. Big and slow but correct.
Q8_0      8.5    Near-lossless. Use when VRAM allows.
Q6_K      6.6    Quietly excellent. Underrated.
Q5_K_M    5.7    Strong default if you have the VRAM.
Q4_K_M    4.8    The pragmatic default. Best speed/quality knee.
Q4_0      4.5    Older, worse than Q4_K_M. Avoid.
IQ4_XS    4.3    I-quant (non-linear). Smaller than Q4_K_M,
                 nearly identical quality. Slightly slower.
IQ3_M     3.7    For phones and 4GB GPUs. Quality drop is real.
IQ2_XXS   2.1    Last resort. Visible degradation.

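The _K family (k-quants) groups weights into 256-weight super-blocks, split into smaller blocks whose scales and minimums are themselves quantized. Despite the name, no k-means is involved. The IQ family (i-quants) uses non-linear, codebook-style encodings and is usually paired with an “importance matrix” computed from calibration data, so the weights that matter most for real activations keep the most precision. The _S (small), _M (medium), and _L (large) suffixes pick how many tensors are stored at higher precision: _M bumps selected attention and feed-forward tensors up a level, which is almost always worth it.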
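
The fractional bit counts in the table come from per-block overhead: every block of weights carries its own scale. Here is a toy sketch of the simplest scheme, Q8_0-style quantization with 32 weights per block and one 16-bit scale. It is an illustration, not llama.cpp’s code: the function names are mine, and the scale stays a Python float instead of being rounded to fp16.

```python
BLOCK = 32  # weights per block, as in Q8_0 and Q4_0


def quantize_block(weights):
    """Return (scale, int8-range values) for one block of floats."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax else 1.0
    return scale, [round(w / scale) for w in weights]


def dequantize_block(scale, q):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * v for v in q]


def bits_per_weight(block=BLOCK, value_bits=8, scale_bits=16):
    """Average storage cost per weight, including the per-block scale."""
    return (block * value_bits + scale_bits) / block


# Deterministic sample block in the range [-0.32, 0.31].
weights = [((i * 37) % 64 - 32) / 100 for i in range(BLOCK)]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(bits_per_weight())              # Q8_0-style: 8.5
print(bits_per_weight(value_bits=4))  # Q4_0-style: 4.5
```

The round-trip error per weight is at most half a scale step, which is why 8-bit blocks are near-lossless and why fewer bits need the smarter schemes described above.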

My rule of thumb:

  • 24 GB VRAM, want best quality: Q8_0 or Q6_K.
  • 12-16 GB VRAM, want best balance: Q5_K_M or Q4_K_M.
  • 8 GB VRAM or less, must fit: IQ4_XS or IQ3_M.

For a 3B parameter model, Q4_K_M lands around 2.0 GB. Q5_K_M around 2.3 GB. Q8_0 around 3.4 GB.
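
You can sanity-check those sizes from the table’s bits-per-weight figures. A back-of-the-envelope sketch, assuming the 3.21B parameter count that llama-bench reports for this model. estimate_gguf_size_gb is an illustrative helper, and real files come out slightly larger because some tensors are kept at higher precision:

```python
def estimate_gguf_size_gb(n_params, bits_per_weight):
    """Rough file size in GB (10^9 bytes): parameters x bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 3.21e9  # Llama 3.2 3B, as reported by llama-bench

for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"{name}: ~{estimate_gguf_size_gb(N_PARAMS, bpw):.1f} GB")
```

The same function tells you quickly whether a larger model has any chance of fitting before you download it.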

Building llama.cpp Correctly

Pre-built binaries are convenient but rarely optimal. Build from source.

Step 1, clone and check out a known commit

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b3500  # or whatever the January 2025 release tag is

I pin to a tag. Master is fast-moving and occasionally regresses.

Step 2, build with the right backend

On NVIDIA Linux:

cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

On Apple Silicon (Metal, MLX is separate):

cmake -B build \
  -DGGML_METAL=ON \
  -DGGML_METAL_EMBED_LIBRARY=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(sysctl -n hw.ncpu)

On CPU only (AVX2/AVX512):

cmake -B build \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS \
  -DGGML_NATIVE=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

GGML_NATIVE=ON tells the compiler to use whatever SIMD your CPU supports. Don’t use it in a binary you ship to other people — it bakes in your CPU’s instruction set.

Step 3, sanity check the build

./build/bin/llama-cli --version

You should see the build number and commit hash, plus the compiler it was built with. GPU backends such as CUDA also log the devices they find at startup.

Quantizing a Model Yourself

Pre-quantized models on Hugging Face are fine, but sometimes you want to do it yourself — for instance, when you’ve fine-tuned and need to ship the result.

Step 1, convert from HF to GGUF F16

python convert_hf_to_gguf.py \
  /path/to/Llama-3.2-3B-Instruct \
  --outfile Llama-3.2-3B-Instruct-F16.gguf \
  --outtype f16

This produces the reference fp16 GGUF. Big (around 6.4 GB for 3B) but correct.

Step 2, quantize to your target

./build/bin/llama-quantize \
  Llama-3.2-3B-Instruct-F16.gguf \
  Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  Q4_K_M

For i-quants (IQ4_XS, IQ3_M, etc.), compute an importance matrix first. llama-quantize refuses to produce the lowest-bit types without one, and the others are noticeably better with it:

./build/bin/llama-imatrix \
  -m Llama-3.2-3B-Instruct-F16.gguf \
  -f calibration_data.txt \
  -o imatrix.dat \
  --chunks 100

./build/bin/llama-quantize \
  --imatrix imatrix.dat \
  Llama-3.2-3B-Instruct-F16.gguf \
  Llama-3.2-3B-Instruct-IQ4_XS.gguf \
  IQ4_XS

calibration_data.txt should be representative of your real inputs. A few hundred kilobytes of plain text is plenty. Don’t use random Wikipedia — use your task domain.

Running the Server

llama.cpp ships an HTTP server that speaks an OpenAI-compatible API. This is what I use in production when I want more control than Ollama gives.

Step 1, start the server

./build/bin/llama-server \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 999 \
  --ctx-size 8192 \
  --parallel 4 \
  --cont-batching \
  --flash-attn \
  --threads 8

Key flags:

  • --n-gpu-layers 999 offloads everything possible to GPU. If you don’t have enough VRAM, lower it until the model fits.
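  • --ctx-size 8192 is the total context shared across all parallel slots, not a per-request limit. With --parallel 4, each slot gets 8192 / 4 = 2048 tokens unless you size up.
  • --cont-batching enables continuous batching: new requests join an in-flight batch instead of waiting for it to drain. Recent builds turn it on by default, so the flag is mostly documentation. Free throughput win under concurrent load.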
  • --flash-attn enables flash attention for supported architectures.
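
The --ctx-size and --parallel interaction trips people up, so here is the arithmetic as a tiny sketch. Both helper names are mine, and the even split across slots is the behavior described above for the build I’m using:

```python
def per_slot_context(ctx_size, parallel):
    """Tokens each request can use when ctx_size is split evenly across slots."""
    return ctx_size // parallel


def required_ctx_size(max_prompt_tokens, max_new_tokens, parallel):
    """Total --ctx-size needed so every slot fits a full prompt plus its reply."""
    return (max_prompt_tokens + max_new_tokens) * parallel


print(per_slot_context(8192, 4))          # what the command above gives each slot
print(required_ctx_size(6000, 2192, 4))   # what you need for 8k per request
```

Work backwards from the longest request you intend to serve, then check that the resulting context still fits in VRAM.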

Step 2, call the OpenAI-compatible endpoint

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

resp = client.chat.completions.create(
    model="local",
    messages=[
        {"role": "system", "content": "Reply tersely."},
        {"role": "user", "content": "What's 2+2?"},
    ],
    temperature=0.1,
)
print(resp.choices[0].message.content)

Yes, the openai SDK works directly. The model parameter is ignored — whatever you loaded is what you get.
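
If you want a quick end-to-end throughput number from the client side without installing anything, the standard library is enough. A rough sketch, assuming the server from Step 1 is listening on localhost:8080 and that the response carries the usual OpenAI-style usage block. The function names are mine:

```python
import json
import time
import urllib.request


def build_chat_request(prompt, max_tokens=128):
    """OpenAI-style chat payload; the model name is a placeholder."""
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.1,
    }


def tokens_per_second(response, elapsed_s):
    """Generated tokens divided by wall-clock time for the whole request."""
    return response["usage"]["completion_tokens"] / elapsed_s


def measure(base_url="http://localhost:8080/v1",
            prompt="Explain mmap in one paragraph."):
    """Send one request and return end-to-end tokens per second."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return tokens_per_second(payload, time.perf_counter() - start)
```

Call measure() with the server running. The number includes prompt processing and HTTP overhead, so expect it to land below what a pure generation benchmark reports.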

Benchmarking Tokens Per Second

The right way to measure performance is the llama-bench tool. It runs prompt-processing and token-generation passes in isolation.

./build/bin/llama-bench \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -p 512 \
  -n 128 \
  -ngl 999 \
  -t 8 \
  -r 5

Output looks like:

| model            |   size |   params | backend | ngl |  test |    t/s |
|------------------|--------|----------|---------|-----|-------|--------|
| llama 3B Q4_K_M  | 1.87GB |  3.21B   | CUDA    | 999 | pp512 | 4521.3 |
| llama 3B Q4_K_M  | 1.87GB |  3.21B   | CUDA    | 999 | tg128 |   62.4 |

pp512 is prompt-processing throughput (parallel, fast). tg128 is token-generation throughput (sequential, slow). For chat, tg128 is what users feel. 62 tok/s on a 3B Q4_K_M on a 4070 is a reasonable target.
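
When you compare quantizations or flag combinations, it helps to pull the numbers out of that table programmatically. A small sketch that parses the markdown output shown above. The helper names are mine, and it takes only the first number in the t/s column in case your build appends a spread after it:

```python
SAMPLE = """
| model            |   size |   params | backend | ngl |  test |    t/s |
|------------------|--------|----------|---------|-----|-------|--------|
| llama 3B Q4_K_M  | 1.87GB |  3.21B   | CUDA    | 999 | pp512 | 4521.3 |
| llama 3B Q4_K_M  | 1.87GB |  3.21B   | CUDA    | 999 | tg128 |   62.4 |
"""


def parse_bench_table(text):
    """Turn a markdown table into a list of {column: cell} dicts."""
    rows, header = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells
        elif all(set(c) <= set("-: ") for c in cells):
            continue  # the |---|---| separator row
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def throughput(rows, test):
    """Tokens per second for a given test name such as 'tg128'."""
    for row in rows:
        if row["test"] == test:
            return float(row["t/s"].split()[0])
    raise KeyError(test)


rows = parse_bench_table(SAMPLE)
print(throughput(rows, "pp512"), throughput(rows, "tg128"))
```

Pipe the llama-bench output into a file per configuration and diff the tg128 values. That is the number to optimize for chat.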

Architecture

How llama.cpp sits in a typical local deployment:

+--------------+   HTTP    +-------------------+
|  Your app    | --------> |  llama-server     |
|              |  /v1/chat |                   |
+--------------+           |  +-------------+  |
                           |  | scheduler   |  |
                           |  |  + batching |  |
                           |  +-------------+  |
                           |  +-------------+  |
                           |  | KV cache    |  |
                           |  | (unified)   |  |
                           |  +-------------+  |
                           |  +-------------+  |
                           |  | CUDA / Metal|  |
                           |  | kernels     |  |
                           |  +-------------+  |
                           +-------------------+
                                    |
                                    | mmap
                                    v
                           +-------------------+
                           |   GGUF file       |
                           +-------------------+

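The KV cache is where most of your VRAM goes after the weights themselves, and it scales linearly with total context. For Llama 3.2 3B in fp16, 8k of total context costs a little under 1 GB. Give each of 4 parallel slots its own 8k (--ctx-size 32768) and the cache grows to about 3.5 GB, larger than the roughly 2 GB of Q4_K_M weights. This is why context size dominates VRAM math.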
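
Here is the arithmetic behind the KV cache sizing, as a rough sketch. The layer and head counts are my assumption, taken from the Llama 3.2 3B model config (28 layers, 8 KV heads, head size 128), and kv_cache_bytes is an illustrative helper rather than anything llama.cpp exposes:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_value=16):
    """Approximate KV cache size: keys + values for every layer and position."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys and values
    return int(values * bits_per_value / 8)


# Assumed architecture for Llama 3.2 3B (from the model config).
LLAMA_32_3B = dict(n_layers=28, n_kv_heads=8, head_dim=128)

for n_ctx in (8192, 32768):
    mib = kv_cache_bytes(n_ctx, **LLAMA_32_3B) / 2**20
    print(f"ctx {n_ctx}: {mib:.0f} MiB in fp16")

# Same 8k context with a 4.5-bit (Q4_0-style) cache.
q4 = kv_cache_bytes(8192, bits_per_value=4.5, **LLAMA_32_3B) / 2**20
print(f"ctx 8192: {q4:.0f} MiB at 4.5 bits")
```

Plug in your own model’s numbers from the GGUF metadata. Models without grouped-query attention have as many KV heads as attention heads, and their caches are several times larger.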

Common Pitfalls

  1. Mismatched chat template. llama.cpp reads the chat template from GGUF metadata, but old GGUFs from before mid-2024 don’t have one and fall back to a generic format that produces garbage. Fix: re-convert with current convert_hf_to_gguf.py, or pass --chat-template llama3 explicitly.

  2. Setting --ctx-size without thinking about --parallel. --ctx-size is the total KV cache, divided evenly among parallel slots. With --ctx-size 8192 --parallel 4 you get 2048 tokens per request, not 8192. Fix: set --ctx-size to (max prompt tokens + max generated tokens per request) * parallel.

  3. Forgetting --flash-attn. It’s not on by default and the speedup at long contexts is large (1.5-2x at 8k context). Fix: always pass --flash-attn unless your hardware doesn’t support it.

  4. Quantizing without an imatrix for low bits. Below 4 bits, a vanilla quantize is noticeably worse than an imatrix quantize. The imatrix step takes 10 minutes; do it. Fix: run llama-imatrix before quantizing to IQ3 or lower.

Troubleshooting

Symptom: Server starts but throughput is half of expected. Diagnose: Either you didn’t offload all layers (--n-gpu-layers too low) or you forgot --flash-attn. Check nvidia-smi — if your GPU is below 80% util during generation, the bottleneck is host-side.

Symptom: OOM during long generations only. Diagnose: KV cache growth. You sized --ctx-size larger than VRAM allows once the cache fills. Reduce --ctx-size, or pass --cache-type-k q4_0 --cache-type-v q4_0 along with --flash-attn to quantize the KV cache itself (small quality cost, big VRAM win).

Symptom: First request takes 20+ seconds, subsequent fast. Diagnose: Cold mmap. The OS hasn’t paged in the weights yet. On Linux, prefault with --mlock (requires root or ulimit), or accept the first-request cost.

KV Cache Quantization

If you’ve sized everything correctly and still hit OOM at long contexts, quantize the KV cache itself. The KV cache stores keys and values from previous tokens; at 8k context with a 3B model it’s around 900 MB in fp16. Quantizing to Q4_0 (4.5 bits per value) cuts that to roughly 250 MB.

./build/bin/llama-server \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --ctx-size 16384 \
  --parallel 4 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --flash-attn

Note: quantizing the V cache requires --flash-attn, so pass it whenever you set --cache-type-v. There’s a small quality cost (negligible for most tasks, measurable on reasoning-heavy benchmarks). Q8_0 cache is a safer middle ground if you’re worried.

Speculative Decoding

A neat trick for boosting tokens/sec on a single request: speculative decoding. You run a tiny “draft” model that proposes several tokens at once; the main model verifies them in parallel. If the draft was right, you get those tokens for free.

./build/bin/llama-server \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -md models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  --draft 16 \
  --port 8080

Here the 1B model drafts for the 3B main model. The two need compatible vocabularies, which these two Llama 3.2 models have. On easy prompts (continuations, formatted output) the speedup is 1.5-2x. On hard prompts where the draft is usually wrong, it’s neutral or slightly slower. Worth trying on your workload.
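
To make the accept-or-reject loop concrete, here is a toy version with greedy decoding, where the “models” are plain functions over lists of integers. It is a simplified sketch, not llama.cpp’s implementation: real engines verify the whole draft in one batched pass and also gain a bonus token when every draft token is accepted. All names are illustrative.

```python
def speculative_decode(target_next, draft_next, prompt, n_new, n_draft=4):
    """Greedy speculative decoding; returns (tokens, target verification passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The cheap draft model proposes n_draft tokens one by one.
        proposal = []
        for _ in range(n_draft):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks every position (one batched pass in a real engine).
        target_passes += 1
        accepted = []
        for i in range(n_draft):
            expected = target_next(tokens + proposal[:i])
            accepted.append(expected)
            if expected != proposal[i]:
                break  # first mismatch: keep the target's token, drop the rest
        tokens.extend(accepted)
    return tokens[: len(prompt) + n_new], target_passes


def target_next(seq):
    """Toy 'large' model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy 'small' model: agrees with the target except after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


out, passes = speculative_decode(target_next, draft_next, [0], n_new=12)
print(out, passes)
```

In this greedy sketch the output is identical to what the target would produce on its own. The draft only changes how many target passes you need, which is where the speedup comes from when the draft is usually right.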

CPU-Only Performance

llama.cpp is the best option if you must run on CPU. A few notes from running production workloads on Xeon hardware without GPUs.

Use the BLAS backend with OpenBLAS or Intel MKL. Set --threads to the number of physical (not logical) cores. Hyperthreading hurts on llama.cpp because it causes cache contention between sibling threads doing the same matmul.

./build/bin/llama-server \
  -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --threads 8 \
  --threads-batch 16 \
  --ctx-size 4096 \
  --port 8080

--threads-batch can be higher than --threads because prompt processing is more parallelizable than token generation. On a 16-core machine, threads=8 / threads-batch=16 is a reasonable starting point.
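
Finding the physical core count is the fiddly part, because most tools report logical cores. A sketch for x86 Linux that counts unique (physical id, core id) pairs in /proc/cpuinfo. The helper name is mine, and on platforms where those fields are missing (many ARM boards, for example) it returns 0, so treat it as a starting point:

```python
def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = core_id = None
    # The trailing "" flushes the last processor entry.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
    return len(cores)
```

On a real machine, pass it the contents of /proc/cpuinfo, use the result for --threads, and try the logical core count for --threads-batch.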

Expected throughput on a modern Xeon (Sapphire Rapids, AVX-512): 6-10 tok/s on Llama-3.2-3B Q4_K_M. That’s barely interactive but workable for batch jobs. Below AVX-512, expect 3-5 tok/s.

Wrapping Up

Once you understand GGUF and the quantization knobs, llama.cpp is not scary — it’s just very explicit. The flags that matter are --n-gpu-layers, --ctx-size, --parallel, --cont-batching, and --flash-attn. The quantization that matters is Q4_K_M for most cases and IQ4_XS when you’re tight on VRAM. Get those right and you’re already at the performance ceiling for your hardware. See the llama.cpp README for the flags I didn’t cover, and the next post in this series for when you outgrow even llama.cpp and need vLLM.