Benchmarking SLM Latency and Memory on Constrained Hardware
TL;DR: A credible SLM latency benchmark separates prefill from decode, reports TTFT and tokens-per-second with percentiles, and tracks peak RSS. Use llama-bench for GGUF, ONNX Runtime’s profiler for ONNX, and perf for the system view. I cover methodology, the exact commands, and the measurement traps that produce numbers that lie.
Most “benchmark” numbers I see for small language models on edge hardware are useless, and not because the people running them are careless. They’re useless because they measure one prompt, once, on a thermally-throttled board, and report a single tokens-per-second figure. That number tells you nothing about p99 latency, nothing about whether the board can sustain it, and nothing about memory headroom.
A real SLM latency benchmark answers concrete questions. How long until the first token appears (TTFT)? How fast do tokens stream after that? What’s the spread between the median and the worst case? And critically: how much RAM does the whole thing actually occupy, because on a 4 GB board that number decides whether the model runs at all.
This post is a benchmarking methodology you can defend, with the exact tooling: llama-bench for GGUF models, ONNX Runtime 1.20’s profiler for ONNX models, and Linux perf for the system-level picture. If you’re benchmarking a specific deployment, my writeup on Phi-4-mini on a Raspberry Pi is a concrete target to apply this to.
What to measure, and why each one matters
Four metrics, and you need all four. A model that wins on one and loses on another is common.
Time to first token (TTFT). The latency from submitting the prompt to the first output token. This is dominated by prefill — the model processing the whole prompt in parallel. For an interactive agent, TTFT is what the user feels as “responsiveness.”
Decode throughput (tokens/sec). After the first token, how fast the rest stream. This is the decode phase, which is memory-bandwidth bound on edge CPUs: each new token requires reading all of the model’s weights from RAM.
Latency percentiles. Never report a mean alone. Edge CPUs throttle, schedulers preempt, and DRAM refresh stalls happen. Report p50, p90, p99. The gap between p50 and p99 is your real quality-of-service story.
Peak resident memory (RSS). Model weights plus the KV cache plus the runtime’s own overhead. The KV cache grows with context length, so peak RSS depends on how long a conversation runs. Measure it at your real maximum context, not at 128 tokens.
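To make the percentile point concrete, here is a minimal sketch of a nearest-rank percentile summary. The function names and the sample latencies are mine, made up for illustration; the point is how one throttled run barely moves the median but owns the p99.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples):
    """Return the percentiles this post recommends reporting, plus the mean."""
    return {
        "p50": percentile(samples, 50),
        "p90": percentile(samples, 90),
        "p99": percentile(samples, 99),
        "mean": sum(samples) / len(samples),
    }


# Hypothetical latencies in ms: 19 steady runs and one throttled outlier.
latencies_ms = [100.0] * 19 + [900.0]
print(summarize(latencies_ms))
```

With only 20 samples the p99 is simply the worst run, which is one more reason to collect more runs than feels necessary.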
Controlling the environment
A benchmark on an uncontrolled board measures the board’s mood, not the model. Pin the variables first.
# Pin CPU frequency — kill dynamic scaling that adds variance.
sudo cpupower frequency-set --governor performance
# Confirm it took
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Watch thermals during the run; throttling silently halves throughput.
watch -n1 'cat /sys/class/thermal/thermal_zone0/temp'
# Drop page cache so a cold-start measurement is genuinely cold.
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# Pin the benchmark to specific cores, away from housekeeping on core 0.
taskset -c 1-3 <your-benchmark-command>
The thermal point is not paranoia. On a passively cooled Pi 5 I have watched decode throughput drop 40% over a three-minute sustained run as the SoC heated up. If you benchmark for ten seconds you’ll never see it — and your production agent runs for hours.
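If you would rather log thermals than eyeball them, a small sampler along these lines may help. It is a sketch under assumptions: the sysfs paths are the common Linux locations (the thermal zone index is board-specific), and the helper names and the 5% tolerance are my own choices.

```python
import time
from pathlib import Path

# Common sysfs locations; the zone index and cpu number vary by board.
THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")
CPU_FREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")


def read_int(path):
    """Read a single integer from a sysfs-style file."""
    return int(Path(path).read_text().strip())


def sample_thermals(duration_s, interval_s=1.0,
                    temp_path=THERMAL_ZONE, freq_path=CPU_FREQ):
    """Return a list of (elapsed_s, temp_celsius, freq_mhz) tuples."""
    rows = []
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if elapsed > duration_s:
            break
        temp_c = read_int(temp_path) / 1000.0    # sysfs reports millidegrees Celsius
        freq_mhz = read_int(freq_path) / 1000.0  # sysfs reports kHz
        rows.append((round(elapsed, 2), temp_c, freq_mhz))
        time.sleep(interval_s)
    return rows


def throttled(rows, tolerance=0.05):
    """True if the clock ever fell more than `tolerance` below the first sample."""
    if not rows:
        return False
    baseline = rows[0][2]
    return any(freq < baseline * (1 - tolerance) for _, _, freq in rows)


# Hypothetical samples: the clock drops from 2400 MHz to 1500 MHz as the SoC heats up.
demo_rows = [(0.0, 52.0, 2400.0), (60.0, 78.0, 2400.0), (120.0, 85.0, 1500.0)]
print("throttled:", throttled(demo_rows))
```

Run `sample_thermals(180)` in a second terminal or a background thread while the benchmark runs, then store the rows next to the benchmark output so a throttled run is visible after the fact.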
Benchmarking GGUF models with llama-bench
llama-bench ships with llama.cpp and is the right tool for GGUF-quantized models. It already separates prefill (pp) from decode (tg, “text generation”) and repeats runs for you.
# Build llama.cpp with optimizations for the target. On ARM:
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON
cmake --build build --config Release -j
# Benchmark: 512-token prefill, 128-token decode, 5 repetitions.
./build/bin/llama-bench \
-m models/phi-4-mini-q4_k_m.gguf \
-p 512 \
-n 128 \
-r 5 \
-t 4 \
-o json > bench_q4.json
The flags that matter: -p is prefill token count, -n is decode token count, -r is repetitions (set it to at least 5 so you get a spread), -t is thread count. JSON output is non-negotiable if you want to script comparisons.
To compare quantization levels — the most common reason to benchmark an SLM — sweep them and diff:
for q in q4_k_m q5_k_m q8_0; do
  # One file per quantization: JSON arrays appended to a single file would not parse.
  ./build/bin/llama-bench -m "models/phi-4-mini-$q.gguf" \
    -p 512 -n 128 -r 5 -t 4 -o json > "bench_$q.json"
done
llama-bench reports pp512 (prefill tokens/sec) and tg128 (decode tokens/sec) with a standard deviation per row. The standard deviation is the part people ignore; a high one means your environment isn’t pinned.
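Since the standard deviation is the part that gets ignored, here is a sketch that turns it into a pass/fail flag. I am assuming the JSON rows carry `n_prompt`, `n_gen`, `avg_ts`, and `stddev_ts` fields, which is what I have seen llama-bench emit; check the keys on your build. The 5% threshold and the sample rows are illustrative.

```python
import json


def summarize_bench(rows, max_cv=0.05):
    """Label each llama-bench row and flag noisy ones by coefficient of variation."""
    out = []
    for row in rows:
        # Field names are assumed from llama-bench's JSON output; verify on your build.
        n_prompt, n_gen = row["n_prompt"], row["n_gen"]
        label = f"pp{n_prompt}" if n_gen == 0 else f"tg{n_gen}"
        avg, std = row["avg_ts"], row["stddev_ts"]
        cv = std / avg if avg else float("inf")
        out.append({"test": label, "tokens_per_s": avg,
                    "cv": cv, "noisy": cv > max_cv})
    return out


# Hypothetical rows shaped like `llama-bench -o json` output (numbers made up).
sample = json.loads("""
[
  {"n_prompt": 512, "n_gen": 0,   "avg_ts": 45.0, "stddev_ts": 0.9},
  {"n_prompt": 0,   "n_gen": 128, "avg_ts": 8.0,  "stddev_ts": 1.2}
]
""")

for result in summarize_bench(sample):
    print(result)
```

In a real run you would replace `sample` with `json.load(open("bench_q4.json"))`. A row flagged as noisy is a cue to go back and pin the environment before trusting the comparison.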
Measuring TTFT and memory directly
llama-bench gives throughput but not TTFT or RSS. For those, drive llama-cli and wrap it.
#!/usr/bin/env bash
# ttft.sh: measure time-to-first-token over N runs
set -euo pipefail
MODEL="models/phi-4-mini-q4_k_m.gguf"
PROMPT="Summarize the following log line in one sentence:"
RUNS=20
for i in $(seq 1 "$RUNS"); do
  start=$(date +%s.%N)
  # --n-predict 1 stops after the first token. The wall time also covers
  # process start and model load, so this is cold-process TTFT.
  # Three threads to match the three pinned cores; a fourth would oversubscribe them.
  taskset -c 1-3 ./build/bin/llama-cli \
    -m "$MODEL" -p "$PROMPT" --n-predict 1 -t 3 \
    --no-display-prompt >/dev/null 2>&1
  end=$(date +%s.%N)
  echo "$end - $start" | bc
done | sort -n | awk '
  # Nearest-rank percentile: round NR*q up so the index is never 0.
  function pct(q,  i) { i = int(NR*q); if (i < NR*q) i++; return v[i] }
  { v[NR]=$1; sum+=$1 }
  END {
    printf "TTFT p50=%.3fs p90=%.3fs p99=%.3fs mean=%.3fs\n",
      pct(0.50), pct(0.90), pct(0.99), sum/NR
  }'
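The wrapper above times the whole process. An alternative sketch is to time the first byte that arrives on stdout, which gives TTFT and total time from a single run. The function name is mine, and the child process below is a stand-in that sleeps and prints; I am assuming the real CLI flushes each token as it is generated, which you should confirm before trusting the numbers.

```python
import subprocess
import sys
import time


def time_to_first_byte(cmd):
    """Run cmd and return (seconds until the first stdout byte, total seconds)."""
    start = time.perf_counter()
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, bufsize=0)
    first = proc.stdout.read(1)   # blocks until the child writes or exits
    ttfb = time.perf_counter() - start
    proc.stdout.read()            # drain the rest so the child can finish
    proc.wait()
    total = time.perf_counter() - start
    if not first:
        raise RuntimeError("command produced no output")
    return ttfb, total


# Stand-in for an inference CLI: "loads" for 0.2 s, emits a token, then keeps working.
fake_cli = [sys.executable, "-c",
            "import time; time.sleep(0.2); print('tok', flush=True); time.sleep(0.1)"]

ttfb, total = time_to_first_byte(fake_cli)
print(f"first byte after {ttfb:.3f}s, finished after {total:.3f}s")
```

Swapping `fake_cli` for the llama-cli command line from the script gives TTFT that still includes model load. Subtracting TTFT from the total and dividing by the generated token count gives a rough decode rate from the same run.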
For peak RSS, GNU /usr/bin/time -v reports “Maximum resident set size” directly, in kilobytes:
/usr/bin/time -v ./build/bin/llama-cli \
-m models/phi-4-mini-q4_k_m.gguf \
-p "$(head -c 2000 sample_prompt.txt)" --n-predict 256 -t 4 \
2>&1 | grep "Maximum resident set size"
Feed it a long prompt and a long --n-predict so the KV cache is at its realistic peak. Measuring RSS on a toy prompt understates it badly.
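A quick estimate shows why the toy prompt misleads. The standard KV-cache size for a transformer is two tensors (K and V) per layer, each holding one vector per KV head per token. The dimensions below are illustrative and not taken from any specific model's config; plug in your own.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """K and V tensors, one pair per layer, each n_ctx x n_kv_heads x head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


MIB = 1024 * 1024

# Illustrative dimensions only (assumed, not a real model config); f16 cache entries.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (128, 4096):
    size_mib = kv_cache_bytes(n_ctx=n_ctx, **dims) / MIB
    print(f"context {n_ctx:>5}: {size_mib:.0f} MiB of KV cache")
```

For these assumed dimensions the cache grows from 16 MiB at 128 tokens to 512 MiB at 4096, which on a 4 GB board is the difference between comfortable and tight. The measured RSS can differ from the estimate depending on how the runtime allocates the cache, so treat this as a sanity check on the measurement, not a replacement for it.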
Benchmarking ONNX models with ONNX Runtime 1.20
If your model is ONNX rather than GGUF, ONNX Runtime 1.20 has a built-in profiler that breaks time down per operator — invaluable for spotting which layer is the bottleneck.
# ort_bench.py (ONNX Runtime 1.20)
import time
import numpy as np
import onnxruntime as ort

def bench_onnx(model_path: str, runs: int = 50, warmup: int = 5):
    opts = ort.SessionOptions()
    opts.enable_profiling = True  # writes a chrome-trace JSON
    opts.intra_op_num_threads = 4
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess = ort.InferenceSession(
        model_path, opts, providers=["CPUExecutionProvider"]
    )
    # Assumes a single int64 token-id input. Decoder exports that also need
    # attention_mask or past key/values must have those added to the feed.
    inp = sess.get_inputs()[0]
    # Build a realistic input: batch 1, sequence length 512.
    shape = [1 if isinstance(d, str) or d is None else d for d in inp.shape]
    if len(shape) >= 2:
        shape[1] = 512
    feed = {inp.name: np.random.randint(0, 32000, size=shape).astype(np.int64)}
    for _ in range(warmup):  # warmup excludes one-time allocation cost
        sess.run(None, feed)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        sess.run(None, feed)
        samples.append((time.perf_counter() - t0) * 1000.0)  # ms
    samples.sort()

    def p(q):  # index clamped so q=1.0 cannot run past the end
        return samples[min(int(len(samples) * q), len(samples) - 1)]

    print(f"latency p50={p(0.50):.2f}ms p90={p(0.90):.2f}ms "
          f"p99={p(0.99):.2f}ms mean={sum(samples)/len(samples):.2f}ms")
    prof_file = sess.end_profiling()
    print(f"per-operator trace written to {prof_file}")

if __name__ == "__main__":
    bench_onnx("models/slm-int8.onnx")
The warmup loop is mandatory. The first run triggers memory-arena allocation and, on some providers, kernel JIT; including it in your samples inflates the mean and destroys p99. Open the profiling JSON in chrome://tracing to see exactly which operators dominate; on quantized SLMs it’s almost always the MatMul / MatMulInteger nodes, which tells you the time is going into the weight matmuls, where compute (for a long prefill) or memory bandwidth (for single-token decode) is your wall.
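If you want the operator breakdown as numbers and not a flame chart, the trace can be aggregated with a few lines. This is a sketch that assumes the trace layout I have seen from ONNX Runtime: a JSON list of events where node events have `cat` set to `"Node"`, a `name` ending in `_kernel_time`, a `dur` in microseconds, and the operator type under `args.op_name`. Verify those keys against your own trace; the sample events are made up.

```python
from collections import defaultdict


def op_time_breakdown(events):
    """Sum kernel time per operator type; returns [(op_type, total_ms)], largest first."""
    totals = defaultdict(float)
    for ev in events:
        # Assumed layout: see the lead-in. Non-node events (session setup, etc.) are skipped.
        if ev.get("cat") != "Node":
            continue
        if not ev.get("name", "").endswith("_kernel_time"):
            continue
        op_type = ev.get("args", {}).get("op_name", "unknown")
        totals[op_type] += ev.get("dur", 0) / 1000.0  # microseconds -> ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical events shaped like an ONNX Runtime profile (durations made up).
sample_events = [
    {"cat": "Session", "name": "model_run", "dur": 15000},
    {"cat": "Node", "name": "mm0_kernel_time", "dur": 7000,
     "args": {"op_name": "MatMulInteger"}},
    {"cat": "Node", "name": "mm1_kernel_time", "dur": 5000,
     "args": {"op_name": "MatMulInteger"}},
    {"cat": "Node", "name": "sm0_kernel_time", "dur": 1000,
     "args": {"op_name": "Softmax"}},
]

for op_type, total_ms in op_time_breakdown(sample_events):
    print(f"{op_type:<16}{total_ms:8.2f} ms")
```

For a real trace, load the file path returned by `end_profiling()` with `json.load` and pass the resulting list in place of `sample_events`.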
The system view with perf
llama-bench and the ORT profiler tell you what the model does. perf tells you why the hardware is slow — cache misses, branch mispredictions, stalled cycles.
# Cycle/instruction/cache summary for one inference run.
# Three threads to match the three pinned cores.
perf stat -e cycles,instructions,cache-references,cache-misses,branch-misses,stalled-cycles-frontend \
  taskset -c 1-3 ./build/bin/llama-cli \
  -m models/phi-4-mini-q4_k_m.gguf \
  -p "$(cat sample_prompt.txt)" --n-predict 128 -t 3
# Sample the hot path to find the bottleneck function.
perf record -g -- taskset -c 1-3 ./build/bin/llama-cli \
  -m models/phi-4-mini-q4_k_m.gguf -p "test" --n-predict 128 -t 3
perf report --stdio | head -40
Two numbers from perf stat matter most for SLM decode. A high cache-miss rate is consistent with being memory-bandwidth bound, which is expected for decode and is the reason lower-bit quantization helps: fewer bytes per weight cross the memory bus. A high stalled-cycles-frontend count means the CPU is waiting on instruction fetch, which can be a sign the build wasn’t compiled with the right -march for the chip. If you see the latter, rebuild with -DGGML_NATIVE=ON and confirm it picked up NEON / dotprod on ARM. Not every core exposes stalled-cycles-frontend; perf prints <not supported> when a counter is missing.
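To turn the raw counters into the two ratios discussed above, here is a sketch that parses the machine-readable output of `perf stat -x,`. I am assuming the usual CSV layout, with the counter value in the first field and the event name in the third; event names can carry suffixes such as `:u` on some setups, which this sketch does not normalize. The counter values are made up.

```python
def parse_perf_csv(text):
    """Parse `perf stat -x,` output into {event_name: count or None}."""
    counters = {}
    for line in text.splitlines():
        fields = line.split(",")
        if line.startswith("#") or len(fields) < 3:
            continue
        value, event = fields[0], fields[2]
        try:
            counters[event] = int(value)
        except ValueError:
            counters[event] = None  # "<not supported>" or "<not counted>"
    return counters


def derived_metrics(counters):
    """Ratios that matter for decode; None when a counter is unavailable."""
    def ratio(numerator, denominator):
        a, b = counters.get(numerator), counters.get(denominator)
        return a / b if a is not None and b else None

    return {
        "ipc": ratio("instructions", "cycles"),
        "cache_miss_rate": ratio("cache-misses", "cache-references"),
        "frontend_stall_fraction": ratio("stalled-cycles-frontend", "cycles"),
    }


# Hypothetical `perf stat -x,` output (values made up for illustration).
sample_output = """\
4000000000,,cycles,1000000000,100.00,,
2000000000,,instructions,1000000000,100.00,,
100000000,,cache-references,1000000000,100.00,,
40000000,,cache-misses,1000000000,100.00,,
<not supported>,,stalled-cycles-frontend,0,100.00,,
"""

print(derived_metrics(parse_perf_csv(sample_output)))
```

Note that perf stat writes its report to stderr, so capture that stream (or use perf's `-o` option to write to a file) when feeding real output into the parser.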
Common Pitfalls
- One run, one number. A single measurement has no error bar. Run 20+ and report percentiles, always.
- No warmup. The first inference pays allocation and cache-warming costs. Discard warmup runs or your mean is skewed and your p99 is fiction.
- Ignoring thermals. A ten-second benchmark on a passively cooled board never sees throttling that a real workload hits in minutes. Run long enough to reach thermal steady state.
- Toy-sized prompts. TTFT scales with prompt length and KV-cache RSS scales with total context. Benchmark at your real context size, not 32 tokens.
- Mixing prefill and decode into one number. They have completely different performance characteristics. Report pp and tg separately or the number is meaningless.
- Comparing across unpinned governors. A run under ondemand versus performance can differ 2x. Pin the governor before every comparison.
Troubleshooting
Symptom: throughput is high for ten seconds then drops sharply.
Cause: thermal throttling — the SoC hit its temperature limit and the firmware cut clocks.
Fix: watch thermal_zone0/temp during the run. Add a heatsink or fan, or accept and report the throttled steady-state number, since that’s what production sees.
Symptom: p99 latency is many times p50 with no obvious pattern.
Cause: scheduler preemption from other processes, or no CPU pinning.
Fix: taskset the benchmark to dedicated cores and leave core 0 for the OS. Re-check after pinning; the p99/p50 gap should tighten substantially.
Symptom: llama-bench reports a large standard deviation across repetitions.
Cause: an unpinned CPU governor, background load, or thermal drift.
Fix: set the performance governor, kill background services, drop caches between runs, and ensure the board is at a stable temperature before starting.
Symptom: ONNX Runtime latency is far worse than expected.
Cause: graph optimizations disabled, or the session is spawning more threads than physical cores and thrashing.
Fix: set graph_optimization_level to ORT_ENABLE_ALL and intra_op_num_threads to the physical core count, not the hyperthread count. See the ONNX Runtime performance docs for the full tuning checklist.
Symptom: measured peak RSS is far below what production OOMs at.
Cause: the benchmark used a short prompt and few generated tokens, so the KV cache never grew.
Fix: re-measure with a prompt and --n-predict at your real maximum context length.
Wrapping Up
A defensible SLM latency benchmark pins the environment, separates prefill from decode, reports percentiles instead of a lone mean, and measures peak RSS at realistic context length. With llama-bench, the ONNX Runtime profiler, and perf you can see not just how fast a model runs on constrained hardware but exactly which resource is the wall. Wire these into CI and you’ll catch a regression the day a quantization or build-flag change ships, instead of the day a user complains.
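As a starting point for the CI wiring, a regression gate can be as small as the sketch below. The function name, the 10% tolerance, and the throughput figures are all my own placeholders; the inputs are plain dicts of test name to tokens per second, however you choose to extract them.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Compare {test_name: tokens_per_s} dicts; return tests slower by more than tolerance."""
    regressions = {}
    for test, base in baseline.items():
        now = current.get(test)
        if now is None:
            regressions[test] = "missing from current run"
        elif now < base * (1 - tolerance):
            drop = (base - now) / base
            regressions[test] = f"{base:.1f} -> {now:.1f} tok/s ({drop:.0%} slower)"
    return regressions


# Hypothetical numbers: prefill is within tolerance, decode lost a quarter of its speed.
baseline = {"pp512": 45.0, "tg128": 8.0}
current = {"pp512": 44.0, "tg128": 6.0}

regressions = find_regressions(baseline, current)
for test, detail in regressions.items():
    print(f"REGRESSION {test}: {detail}")
```

In a pipeline, exit non-zero when `regressions` is non-empty so the job fails. Set the tolerance wider than the run-to-run noise you measured on the pinned board, or the gate will flap.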