Quantizing SLMs to 4-Bit with GGUF Without Wrecking Accuracy
TL;DR: GGUF 4-bit quantization shrinks a model roughly 3x and barely moves accuracy when you do it right. Q4_K_M is the default; Q5_K_M buys back precision for about 15% more size. Measure perplexity before and after, or you’re flying blind.
The first time I quantized a model I did it the lazy way: grabbed a random Q4 GGUF off Hugging Face, dropped it into production, and called it done. It worked. Then a teammate asked the obvious question — “how much accuracy did we lose?” — and I had no answer. I’d never measured the FP16 baseline. That’s the mistake I want to help you avoid.
Quantization is the single highest-leverage step in shipping a small language model to constrained hardware. It takes a model stored as 16-bit floats and re-encodes the weights at lower precision — 4 or 5 bits per weight in the cases I care about. The payoff is a 3-4x reduction in size and RAM, and a meaningful speedup on CPU because you’re moving far less data through the memory bus. The risk is accuracy degradation that you won’t notice until a user does.
This post is the careful version. We’ll convert a model to GGUF, quantize it with llama-quantize, and (this is the part most tutorials skip) measure the accuracy cost with perplexity so you can make an informed call. If you haven’t yet, the case for running small language models at the edge sets the stage for why this matters.
What GGUF and 4-Bit Quantization Actually Do
GGUF is the file format used by llama.cpp. It’s a single-file container holding weights, tokenizer, and metadata, designed for memory-mapped loading. The format is decoupled from the quantization scheme: a .gguf file can hold FP16 weights or any of llama.cpp’s quantized types.
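If you want to see the container for yourself, here is a minimal sketch that reads the fixed-size header at the start of a GGUF file. It assumes the version 2 or later layout (little-endian, 4-byte magic, uint32 version, then uint64 tensor and metadata counts), and read_gguf_header is a name I made up for illustration, not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes the GGUF v2+ header layout; older v1 files used 32-bit counts.
    """
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ" = little-endian uint32, uint64, uint64 (20 bytes after the magic)
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```

Pointing it at any .gguf you produce later in this post is a cheap sanity check that a file is a complete GGUF and not a truncated download.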
Quantization maps a block of high-precision weights onto a small set of discrete levels. Naive 4-bit quantization gives you 16 levels per weight and tends to butcher accuracy. llama.cpp’s K-quants are smarter: they group weights into blocks, store a per-block scale (and sometimes a min), and allocate bits non-uniformly so that more sensitive tensors keep more precision. The result is dramatically better quality per bit than the old uniform schemes.
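To make the per-block idea concrete, here is a toy sketch in plain Python. It is not llama.cpp’s actual K-quant layout (that packs things far more cleverly); it only illustrates why storing a scale and min per small block beats one scale for a whole tensor when a single outlier is present. The function names, the block size of 32, and the synthetic weights are all my own assumptions.

```python
import random


def quantize_block(weights, bits=4):
    """Map one block onto 2**bits levels using a per-block scale and min."""
    levels = (1 << bits) - 1            # 15 steps between 16 levels at 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for constant blocks
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Reconstruct approximate weights from integer codes."""
    return [lo + c * scale for c in codes]


def mean_abs_error(weights, block_size, bits=4):
    """Average reconstruction error when quantizing in blocks of block_size."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        codes, scale, lo = quantize_block(block, bits)
        restored = dequantize_block(codes, scale, lo)
        total += sum(abs(a - b) for a, b in zip(block, restored))
    return total / len(weights)


# Synthetic data (an assumption for the demo): small weights plus one outlier.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]
weights[5] = 1.0

print("one scale for the whole tensor:", mean_abs_error(weights, block_size=256))
print("one scale per 32-weight block: ", mean_abs_error(weights, block_size=32))
```

With one scale for everything, the outlier stretches the range and most weights collapse onto a couple of levels. With per-block scales, only the outlier’s own block pays that price.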
The two types worth knowing for SLMs:
- Q4_K_M: 4-bit K-quant, medium variant. The de facto default. Roughly 4.8 bits per weight effective (per-block scales and a few tensors kept at higher precision add overhead), ~2 GB for a 3B model, and accuracy loss that’s usually in the noise.
- Q5_K_M: 5-bit K-quant, medium variant. About 15% larger than Q4_K_M, noticeably closer to the FP16 baseline. Reach for it when Q4_K_M’s perplexity gap is too wide for your task, or when the model is small enough that the extra few hundred MB don’t matter.
There’s also Q6_K and Q8_0 if you want near-lossless at larger sizes, and Q3/Q2 variants if you’re truly desperate for space — but below 4 bits the accuracy cliff gets steep fast. For SLMs I almost never go below Q4_K_M.
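The size figures above are just arithmetic on bits per weight, and a few lines of Python make that easy to rerun for other models. This is a back-of-the-envelope sketch: the 3.2 billion parameter count is my approximation for a "3B" model, and real files land slightly off these numbers because not every tensor uses the same type.

```python
def estimated_size_gb(n_params, bits_per_weight):
    """Approximate file size in decimal GB for a given effective bit width."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 3.2e9  # assumption: approximate parameter count of a "3B" model

for name, bpw in [("F16", 16.0), ("Q4_K_M", 4.8)]:
    print(f"{name:7s} ~{estimated_size_gb(N_PARAMS, bpw):.1f} GB")
# F16     ~6.4 GB
# Q4_K_M  ~1.9 GB
```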
Set Up the Toolchain
You need llama.cpp built with its conversion and quantization tools, plus the Python conversion script’s dependencies.
# llama.cpp release b4585 (January 2025)
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git checkout b4585
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON
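# llama-imatrix is needed in Step 3, llama-server for the task-level smoke test.
cmake --build build --config Release -j "$(nproc)" \
  --target llama-quantize llama-imatrix llama-perplexity llama-cli llama-server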
# Python deps for the HF -> GGUF converter
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements/requirements-convert_hf_to_gguf.txt
That requirements-convert_hf_to_gguf.txt constrains the transformers, torch, and numpy versions the converter expects. Don’t substitute your own: version drift in the conversion script’s dependencies is a common source of silent corruption.
Step 1: Get the Full-Precision Model
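Pull the original model from Hugging Face. We want the FP16/BF16 weights, not someone else’s pre-quantized GGUF, because the whole point is to control the pipeline. The meta-llama repositories are gated, so accept the license on the model page and authenticate (for example with huggingface-cli login) before downloading.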
pip install huggingface_hub==0.27.1
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct \
--local-dir ./models/Llama-3.2-3B-Instruct \
--exclude "original/*"
The --exclude "original/*" skips the PyTorch consolidated checkpoint; the converter reads the safetensors shards instead, which is what you want.
Step 2: Convert to a GGUF Baseline
Convert the HF model to a GGUF file at full precision. This is your reference — every quantized variant gets measured against it.
python3 convert_hf_to_gguf.py \
./models/Llama-3.2-3B-Instruct \
--outfile ./models/llama-3.2-3b-f16.gguf \
--outtype f16
--outtype f16 keeps weights at 16-bit. The output is around 6.4 GB for a 3B model. That is large, but it’s a transient artifact: keep it until you have built the imatrix and measured the baseline perplexity, then delete it.
Step 3: Build an Importance Matrix (the step people skip)
The single biggest accuracy win in modern llama.cpp quantization is the importance matrix (imatrix). It’s a profile of which weights actually matter, computed by running representative text through the FP16 model. llama-quantize uses it to spend precision where it counts. Quantizing without an imatrix is leaving accuracy on the table.
You need a calibration text file. A few hundred KB of text resembling your production traffic works well — domain documents, sample prompts, representative content.
# Use a real calibration corpus. A generic one (e.g. wikitext) is a fallback;
# domain-specific text is strictly better if your workload is narrow.
./build/bin/llama-imatrix \
-m ./models/llama-3.2-3b-f16.gguf \
-f ./calibration/domain-corpus.txt \
-o ./models/llama-3.2-3b.imatrix \
--chunks 200
--chunks 200 caps how many text chunks get processed — enough for a stable matrix without burning an hour of CPU. Save the .imatrix file; it’s reusable across every quantization of this model.
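Since the calibration text must stay separate from whatever you later evaluate on, it helps to split your documents once, up front. Here is a minimal sketch of one way to do that, assuming your source material is a directory of .txt files. The function name, the 20% hold-out fraction, and the 400 KB cap are hypothetical choices of mine, not anything llama.cpp prescribes.

```python
import hashlib
from pathlib import Path


def split_corpus(doc_dir, calib_path, holdout_path,
                 holdout_fraction=0.2, max_bytes=400_000):
    """Write disjoint calibration and hold-out files from a folder of .txt docs.

    Returns (number of calibration docs, number of hold-out docs).
    """
    calib, holdout = [], []
    for doc in sorted(Path(doc_dir).glob("*.txt")):
        text = doc.read_text(encoding="utf-8")
        # Hash the file name so each document lands in the same split every run.
        bucket = int(hashlib.sha256(doc.name.encode("utf-8")).hexdigest(), 16) % 100
        (holdout if bucket < holdout_fraction * 100 else calib).append(text)

    for path, docs in ((calib_path, calib), (holdout_path, holdout)):
        data = "\n\n".join(docs).encode("utf-8")[:max_bytes]
        # Re-decode so a multi-byte character cut by the cap is dropped cleanly.
        clean = data.decode("utf-8", errors="ignore")
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        Path(path).write_text(clean, encoding="utf-8")
    return len(calib), len(holdout)
```

Calling split_corpus("./docs", "./calibration/domain-corpus.txt", "./eval/holdout.txt") would produce the two files used in this post. Splitting by whole document rather than by line keeps near-duplicate passages from leaking across the boundary.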
Step 4: Quantize
Now produce the two variants worth comparing.
# Q4_K_M — the default. Pass the imatrix for better weight allocation.
./build/bin/llama-quantize \
--imatrix ./models/llama-3.2-3b.imatrix \
./models/llama-3.2-3b-f16.gguf \
./models/llama-3.2-3b-Q4_K_M.gguf \
Q4_K_M
# Q5_K_M — higher precision, ~25% larger.
./build/bin/llama-quantize \
--imatrix ./models/llama-3.2-3b.imatrix \
./models/llama-3.2-3b-f16.gguf \
./models/llama-3.2-3b-Q5_K_M.gguf \
Q5_K_M
Check the sizes:
ls -lh ./models/*.gguf
# llama-3.2-3b-f16.gguf ~6.4G
# llama-3.2-3b-Q4_K_M.gguf ~2.0G
# llama-3.2-3b-Q5_K_M.gguf ~2.3G
Step 5: Measure the Accuracy Cost
This is the step that turns guesswork into engineering. Perplexity measures how surprised the model is by held-out text — lower is better. Run it on the same corpus for the FP16 baseline and each quantized variant. The gap is your quantization tax.
Use held-out text — not the calibration corpus, or you’ll measure overfitting to the imatrix rather than true quality.
for model in f16 Q4_K_M Q5_K_M; do
  echo "=== $model ==="
  ./build/bin/llama-perplexity \
    -m "./models/llama-3.2-3b-${model}.gguf" \
    -f ./eval/holdout.txt \
    --ctx-size 2048 \
    2>&1 | grep -E "^Final estimate"
done
You’ll get output like:
=== f16 ===
Final estimate: PPL = 8.142 +/- 0.061
=== Q4_K_M ===
Final estimate: PPL = 8.301 +/- 0.063
=== Q5_K_M ===
Final estimate: PPL = 8.193 +/- 0.062
Interpret it as a relative delta. Here Q4_K_M costs about 2% perplexity over FP16 and Q5_K_M costs under 1%. For most bounded tasks a 2% perplexity increase is invisible in real output — ship Q4_K_M. If your task is precision-sensitive (structured extraction with strict schemas, code), the smaller Q5_K_M gap may be worth the extra 300 MB.
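The relative delta is simple enough to compute by hand, but a small parser saves you from arithmetic slips once you are comparing more than two variants. This sketch assumes the "Final estimate" line format shown in the sample output above; parse_ppl and relative_delta are illustrative names of my own.

```python
import re

FINAL_LINE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")


def parse_ppl(line):
    """Extract (perplexity, uncertainty) from a 'Final estimate' line."""
    match = FINAL_LINE.search(line)
    if match is None:
        raise ValueError(f"no perplexity estimate in: {line!r}")
    return float(match.group(1)), float(match.group(2))


def relative_delta(baseline_ppl, variant_ppl):
    """Perplexity increase over the baseline, in percent."""
    return (variant_ppl - baseline_ppl) / baseline_ppl * 100


# The sample numbers from the output above.
results = {
    "f16": "Final estimate: PPL = 8.142 +/- 0.061",
    "Q4_K_M": "Final estimate: PPL = 8.301 +/- 0.063",
    "Q5_K_M": "Final estimate: PPL = 8.193 +/- 0.062",
}

base_ppl, _ = parse_ppl(results["f16"])
for name in ("Q4_K_M", "Q5_K_M"):
    ppl, err = parse_ppl(results[name])
    print(f"{name}: +{relative_delta(base_ppl, ppl):.2f}% over f16 (+/- {err})")
# Q4_K_M: +1.95% over f16 (+/- 0.063)
# Q5_K_M: +0.63% over f16 (+/- 0.062)
```

Note that the Q5_K_M gap here is smaller than the reported uncertainty on either number, so treat sub-1% differences as approximate rather than exact.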
Perplexity is a proxy, not the whole story. Back it with a task-level eval: run a fixed set of representative prompts through each variant and diff the outputs. A quantized model can hold steady on perplexity while drifting on a specific behavior you care about.
# Quick task-level smoke test against a running llama-server.
# pip install openai==1.59.0
from openai import OpenAI

# llama-server speaks the OpenAI chat API; the key is a placeholder the client requires.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="x")

PROMPTS = [
    "Extract the total amount from: 'Invoice total: $4,210.55 due Feb 1'.",
    "Classify sentiment (positive/negative/neutral): 'It works, barely.'",
    "Rewrite formally: 'hey can u send the report asap thx'.",
]

for p in PROMPTS:
    out = client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": p}],
        temperature=0.0,  # greedy decoding so runs are comparable
        max_tokens=64,
    )
    print(f"PROMPT: {p}\nOUTPUT: {out.choices[0].message.content.strip()}\n")
Run that against llama-server loaded with the FP16 GGUF, then again with each quantized variant (restart the server with a different -m each time), and eyeball the diffs. Cheap insurance.
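Eyeballing works for three prompts, less so for fifty. If you collect each run’s outputs into a dict keyed by prompt, the standard library’s difflib can surface only the prompts whose answers changed. This is a sketch under that assumption; diff_outputs is a hypothetical helper, and how you store the outputs is up to you.

```python
import difflib


def diff_outputs(baseline, variant, label="variant"):
    """Return (prompt, unified diff) pairs for prompts whose output changed.

    baseline and variant map each prompt string to the model's output string.
    """
    changed = []
    for prompt, base_out in baseline.items():
        var_out = variant.get(prompt, "")
        if base_out.strip() == var_out.strip():
            continue  # identical after trimming whitespace, nothing to report
        diff = "\n".join(difflib.unified_diff(
            base_out.splitlines(),
            var_out.splitlines(),
            fromfile="f16",
            tofile=label,
            lineterm="",
        ))
        changed.append((prompt, diff))
    return changed
```

An empty list means the variant matched the baseline on every prompt. Anything else is a short list of exactly the cases worth a human look.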
Common Pitfalls
- Quantizing without an imatrix. You lose a measurable chunk of accuracy for no reason. Build the importance matrix once and reuse it.
- Calibrating on the wrong data. A generic corpus produces a generic imatrix. If your workload is narrow, calibrate on domain text — the quantizer will protect the weights your traffic actually exercises.
- Measuring perplexity on the calibration set. That measures memorization, not generalization. Always use held-out text for evaluation.
- Chasing the smallest type. Q3 and Q2 look tempting on a disk-space spreadsheet, but the accuracy cliff below 4-bit is steep. For SLMs, Q4_K_M is the floor I’ll defend.
- No FP16 baseline. Without the unquantized reference number, “PPL = 8.3” means nothing. Always measure the baseline first.
Troubleshooting
Symptom: convert_hf_to_gguf.py fails with an unknown architecture error.
Cause: the model uses an architecture newer than your llama.cpp checkout. Fix: update to a recent release tag; architecture support lands continuously.
Symptom: quantized model produces garbled or repetitive output.
Cause: a broken conversion, often from mismatched converter dependencies, or a corrupted download. Fix: re-download the HF model, verify shard checksums, reinstall the pinned requirements-convert_hf_to_gguf.txt, and reconvert.
Symptom: llama-quantize ignores the imatrix and prints a warning.
Cause: the imatrix file was built from a different model or a different GGUF conversion. Fix: rebuild the imatrix from the exact f16.gguf you’re quantizing — they must come from the same conversion.
Symptom: perplexity is dramatically worse than expected (double-digit jump).
Cause: usually a tokenizer mismatch in conversion, or an eval file with encoding issues. Fix: confirm the GGUF embeds the correct tokenizer (llama-cli will log it), and ensure the eval file is plain UTF-8.
Symptom: perplexity run is unbearably slow.
Cause: large context size or a huge eval file on CPU. Fix: drop --ctx-size to 2048 or lower, keeping it identical across variants so the numbers stay comparable, and trim the eval file to a few hundred KB. That’s plenty for a stable estimate.
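Two of the fixes above (plain UTF-8, a few hundred KB) are easy to check before you start a long perplexity run. Here is a minimal pre-flight sketch; check_eval_file and the 400 KB threshold are my own illustrative choices.

```python
from pathlib import Path


def check_eval_file(path, max_bytes=400_000):
    """Return (is_utf8, size_bytes, needs_trim) for an eval text file."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False
    return is_utf8, len(data), len(data) > max_bytes
```

If is_utf8 comes back False, re-export the file as UTF-8 before trusting any perplexity number computed from it.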
What’s Next
Quantization gets you a model that fits, but fitting and running well aren’t the same thing. The next step is to take a Q4_K_M build and put it on genuinely constrained hardware — measuring real token throughput, thermals, and memory pressure, not just disk size. Once you’ve got a quantized model in hand, deploying it to a single-board computer is where the rubber meets the road.