January 13, 2026 · 9 min read · by Muhammad Amal programming

TL;DR: A Raspberry Pi 5 runs Phi-4-mini at a usable 5-9 tokens/sec on CPU with a Q4_K_M GGUF. Thermals and thread count are the two knobs that decide whether it’s good or terrible. Treat it as a real service with systemd, memory limits, and a health check.

I keep a Raspberry Pi 5 on my desk running a language model. Not as a toy — it tags incoming documents for a side project, fully offline, and it’s been up for months. Every time I mention it, someone assumes it must be painfully slow or hopelessly fiddly. It’s neither, but getting there means understanding what an 8GB ARM single-board computer can and can’t do.

The Pi 5 is a genuine step up from the Pi 4 for this work. The Cortex-A76 cores are roughly 2-3x faster, memory bandwidth improved, and the 8GB model gives you real headroom for a 3B-class model plus its KV cache. Phi-4-mini — Microsoft’s ~3.8B model, heavy on synthetic reasoning data — is a sweet spot: small enough to fit comfortably, capable enough to be genuinely useful for structured tasks.

This is the full walkthrough: building llama.cpp on the Pi, getting a properly quantized model on there, tuning for the hardware, and — the part that separates a demo from a deployment — managing heat and shipping it as a supervised service. If you want the background on picking and shrinking the model, the post on GGUF 4-bit quantization covers that ground.

Hardware and OS Baseline

What I’m running, and what I’d recommend:

  • Raspberry Pi 5, 8GB. The 4GB model technically works for a 3B model at Q4 but leaves no room for context or anything else. Get the 8GB.
  • Active cooling. Non-negotiable. The official Active Cooler or a fan HAT. Sustained inference pins all four cores; without a fan the Pi throttles within minutes and your token rate collapses.
  • A decent NVMe SSD via the PCIe HAT, or at minimum a fast A2 microSD. Model files are gigabytes; memory-mapped loading is much happier on fast storage.
  • Raspberry Pi OS (64-bit), Bookworm or later. 64-bit is mandatory — the 32-bit OS can’t address enough memory and lacks the ARM features llama.cpp wants.

Confirm you’re on 64-bit:

uname -m
# aarch64  <- correct. armv7l means reinstall the 64-bit OS.
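
If you provision more than one board, the same baseline check is easy to script. This is a minimal sketch: the helper names arch_ok and mem_total_mib and the 7000 MiB threshold in the usage comment are my own illustrative choices, not part of any Raspberry Pi tooling.

```shell
# Preflight sketch: check the architecture and total RAM before building.
# Helper names and the 7000 MiB threshold below are illustrative assumptions.

# arch_ok ARCH -> exit 0 only for a 64-bit ARM userland.
arch_ok() {
  [ "$1" = "aarch64" ]
}

# mem_total_mib MEMINFO_FILE -> total RAM in MiB from a /proc/meminfo-style file.
mem_total_mib() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "$1"
}

# Usage on the Pi:
#   arch_ok "$(uname -m)" || echo "reinstall the 64-bit OS"
#   [ "$(mem_total_mib /proc/meminfo)" -ge 7000 ] || echo "this is not an 8GB board"
```

Both functions take their input as arguments, so they can be exercised on any machine with a saved copy of /proc/meminfo.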

Step 1: Build llama.cpp on the Pi

You can cross-compile, but building natively on the Pi is simpler and the Pi 5 is fast enough that it only takes a few minutes.

sudo apt update
sudo apt install -y build-essential cmake git libcurl4-openssl-dev

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git checkout b4585  # pinned build tag; confirm it is recent enough to include Phi-4-mini support

# GGML_NATIVE lets the compiler target the Pi 5's Cortex-A76 directly.
# Safe here because we build and run on the same machine.
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_NATIVE=ON \
  -DLLAMA_CURL=ON

cmake --build build --config Release -j4

-j4 uses all four cores. The Pi 5 builds llama.cpp in about three to five minutes. Native compilation matters here — the Cortex-A76 supports ARM dot-product and FP16 instructions that give a real inference speedup, and GGML_NATIVE=ON makes sure they’re used.
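
To confirm the kernel actually exposes those instructions, you can look at the Features line in /proc/cpuinfo. A small sketch, assuming the usual Linux arm64 flag names (asimddp for dot-product, fphp and asimdhp for half-precision); has_cpu_feature is a hypothetical helper, not something llama.cpp ships.

```shell
# has_cpu_feature FEATURE [CPUINFO_FILE]
# Exit 0 if FEATURE appears as a whole word on a "Features" line.
# Flag names such as asimddp/fphp/asimdhp are the Linux arm64 names assumed here.
has_cpu_feature() {
  awk -v want="$1" '
    /^Features/ { for (i = 1; i <= NF; i++) if ($i == want) found = 1 }
    END { exit found ? 0 : 1 }
  ' "${2:-/proc/cpuinfo}"
}

# Usage on the Pi:
#   for f in asimddp fphp asimdhp; do
#     if has_cpu_feature "$f"; then echo "$f: present"; else echo "$f: missing"; fi
#   done
```

If a flag is missing, GGML_NATIVE=ON has nothing to target and the speedup described above will not materialize.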

Step 2: Get a Quantized Phi-4-mini

A 3.8B model at FP16 is ~7.6 GB — it won’t fit in 8GB alongside the OS. You need a 4-bit quantized GGUF. Q4_K_M lands around 2.4 GB, which leaves comfortable headroom for the KV cache and the rest of the system.

mkdir -p ~/models && cd ~/models

# Download a pre-quantized GGUF directly.
curl -L -o phi-4-mini-Q4_K_M.gguf \
  "https://huggingface.co/bartowski/Phi-4-mini-instruct-GGUF/resolve/main/Phi-4-mini-instruct-Q4_K_M.gguf?download=true"

ls -lh phi-4-mini-Q4_K_M.gguf
# ~2.4G

If you want full control over the quantization — and you should for anything serious — do the convert-and-quantize pipeline on a faster machine and copy the resulting GGUF to the Pi. Quantizing on the Pi itself is slow and pointless.
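
The sizing argument in this step is just parameters times bits per weight. A rough sketch of that arithmetic follows; model_size_gb is a hypothetical helper, and the 5 bits per weight used for Q4_K_M in the usage comment is my approximation, since K-quants mix block types and carry scale metadata.

```shell
# model_size_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Prints the weight size in GB (10^9 bytes) with two decimals.
# Ignores tokenizer and metadata overhead, so treat it as a lower bound.
model_size_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p * b / 8 }'
}

# Usage:
#   model_size_gb 3.8 16   # FP16 weights
#   model_size_gb 3.8 5    # assumed ~5 bits/weight as a stand-in for Q4_K_M
```

The FP16 call prints 7.60, which matches the ~7.6 GB figure above; the second call lands near the 2.4 GB file size only to the extent that the assumed bits-per-weight figure holds.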

Step 3: First Run and a Reality Check

Sanity-check the model before tuning anything.

cd ~/llama.cpp

./build/bin/llama-cli \
  -m ~/models/phi-4-mini-Q4_K_M.gguf \
  -p "List three uses for an edge-deployed language model." \
  -n 128 \
  --threads 4 \
  --no-display-prompt

llama.cpp prints timing at the end. On a well-cooled Pi 5 with Q4_K_M, expect prompt processing around 25-45 tokens/sec and generation around 5-9 tokens/sec. That’s slower than a workstation, obviously, but for bounded async tasks — tag this document, classify this message, extract these fields — it’s perfectly workable.

If your numbers are well below that range, you have a thermal or threading problem. We fix both next.

Step 4: Tune for the Pi 5

Thread count

The Pi 5 has exactly four cores. The intuition is “use all four,” and for generation that’s right. But the optimal count can differ between prompt processing and generation, so measure.

for t in 2 3 4; do
  echo "=== threads=$t ==="
  ./build/bin/llama-cli \
    -m ~/models/phi-4-mini-Q4_K_M.gguf \
    -p "Summarize: edge AI runs models locally." \
    -n 64 --threads "$t" --no-display-prompt 2>&1 \
    | grep -E "eval time"
done

In practice --threads 4 wins for generation on the Pi 5. The reason to test is that pinning all cores also leaves nothing for the OS, which can hurt latency if the Pi is doing other work. If the box is dedicated to inference, four is the answer.
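
The sweep above prints raw timing lines. If you would rather compare plain numbers, a filter like the one below pulls out the tokens-per-second figure. It is a sketch that assumes the timing lines end in "... N tokens per second)", which is what recent llama.cpp builds print; the format has changed between versions, so check it against your own output. tokens_per_second is a hypothetical helper name.

```shell
# tokens_per_second: read llama.cpp timing lines on stdin and print
# "prompt <rate>" or "generation <rate>" for each eval-time line.
# Assumes the rate appears as "<number> tokens per second" on that line.
tokens_per_second() {
  awk '/eval time/ {
    label = ($0 ~ /prompt eval time/) ? "prompt" : "generation"
    for (i = 1; i < NF; i++)
      if ($(i + 1) == "tokens" && $(i + 2) == "per") print label, $i
  }'
}

# Usage inside the thread sweep, replacing the grep:
#   ./build/bin/llama-cli ... --threads "$t" --no-display-prompt 2>&1 \
#     | tokens_per_second
```

Separating the prompt and generation rates makes it easier to see the case mentioned above, where the best thread count differs between the two phases.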

Context size

KV cache memory grows with context length. On an 8GB Pi, be deliberate.

./build/bin/llama-server \
  -m ~/models/phi-4-mini-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 2048 \
  --threads 4 \
  --batch-size 256 \
  --mlock

  • --ctx-size 2048: sized to real prompts. Don’t reserve 16K of context “just in case”; you’re spending RAM the Pi doesn’t have to spare.
  • --mlock: pins the model in RAM so the kernel can’t page it out under memory pressure. On a memory-tight device this prevents the latency disaster of model weights being evicted and re-read from storage.
  • --batch-size 256: a smaller batch than the default, which fits the Pi’s cache hierarchy better.
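
To put a number on the context-size advice, the KV cache is two tensors (keys and values) per layer, each sized by context length, KV heads, and head dimension. The sketch below does that arithmetic. The model shape in the usage comment (32 layers, 8 KV heads, head dimension 128, 2-byte f16 cache entries) is my assumption for Phi-4-mini, so read the real values from the model's metadata before relying on it. kv_cache_mib is a hypothetical helper.

```shell
# kv_cache_mib CTX LAYERS KV_HEADS HEAD_DIM BYTES_PER_ELEM
# Prints the KV cache size in MiB: keys + values for every layer and position.
kv_cache_mib() {
  echo $(( 2 * $2 * $3 * $4 * $5 * $1 / 1024 / 1024 ))
}

# Usage (model dimensions are assumed, not taken from the article):
#   kv_cache_mib 2048  32 8 128 2
#   kv_cache_mib 16384 32 8 128 2
```

Under those assumed dimensions a 2048-token context costs 256 MiB and a 16K context costs 2048 MiB, which is the difference between comfortable headroom and a tight fit on an 8GB board.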

Step 5: Manage Thermals

This is where most Pi inference setups quietly fail. Sustained inference loads all four cores; the SoC heats up; once it crosses ~80-85°C the firmware throttles the clock to protect itself. Your token rate doesn’t crash all at once — it sags over a few minutes, which is exactly the kind of degradation that’s hard to spot.

Watch the temperature live during a sustained run:

watch -n 2 'vcgencmd measure_temp; vcgencmd get_throttled'

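get_throttled returns a hex bitmask. 0x0 is healthy. The low bits report the current state and the bits from 0x10000 up record that the same event has happened since boot: 0x4 is “currently throttled,” 0x40000 is “throttling has occurred since boot,” and 0x8 / 0x80000 flag the soft temperature limit. If you see any of those, your cooling is inadequate for the workload. Bits 0x1 and 0x10000 mean under-voltage, which points at the power supply rather than the fan.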

With the official Active Cooler, a Pi 5 under continuous inference settles around 60-70°C and never throttles. Without active cooling it’ll cross 80°C within a couple of minutes and the throttled flag will light up. There is no software fix for this — buy the fan.
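
Reading the bitmask by eye gets old quickly. Here is a small decoder, a sketch based on the bit meanings in the Raspberry Pi documentation for vcgencmd get_throttled; decode_throttled is a hypothetical helper name and the label wording is mine.

```shell
# decode_throttled VALUE
# VALUE is "0x50005" or the raw "throttled=0x50005" line from vcgencmd.
# Prints one line per set flag, or "ok" when no flag is set.
decode_throttled() {
  local value found entry
  value=$(( ${1#throttled=} ))
  found=0
  for entry in \
    "0x1:under-voltage now" \
    "0x2:ARM frequency capped now" \
    "0x4:currently throttled" \
    "0x8:soft temperature limit active" \
    "0x10000:under-voltage has occurred" \
    "0x20000:ARM frequency capping has occurred" \
    "0x40000:throttling has occurred" \
    "0x80000:soft temperature limit has occurred"; do
    if [ $(( value & ${entry%%:*} )) -ne 0 ]; then
      echo "${entry#*:}"
      found=1
    fi
  done
  if [ "$found" -eq 0 ]; then echo "ok"; fi
}

# Usage on the Pi:
#   decode_throttled "$(vcgencmd get_throttled)"
```

Logging the decoded flags alongside the token rate during a long run makes the slow sag described above much easier to attribute.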

Step 6: Ship It as a Service

Don’t leave inference running in a terminal. Run it under systemd, with the device’s constraints encoded as limits.

# /etc/systemd/system/phi-server.service
[Unit]
Description=Phi-4-mini inference server (llama.cpp)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=pi
WorkingDirectory=/home/pi/llama.cpp
ExecStart=/home/pi/llama.cpp/build/bin/llama-server \
  -m /home/pi/models/phi-4-mini-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 2048 --threads 4 \
  --batch-size 256 --mlock
Restart=on-failure
RestartSec=5
# 8GB box: leave headroom for the OS.
MemoryMax=6500M
# mlock needs the capability.
AmbientCapabilities=CAP_IPC_LOCK

[Install]
WantedBy=multi-user.target

Reload systemd, enable the unit, and follow its logs:

sudo systemctl daemon-reload
sudo systemctl enable --now phi-server
journalctl -u phi-server -f

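For --mlock to work under a non-root user, the service needs either AmbientCapabilities=CAP_IPC_LOCK or a raised memlock limit (LimitMEMLOCK= in the unit). Miss both and the server logs a warning, then carries on without pinned memory. MemoryMax=6500M caps the service’s own memory, so a runaway context gets the server killed and restarted instead of OOM-killing the whole Pi.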
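
Because a failed lock only produces a log line, it is worth checking directly. The kernel reports locked memory per process as VmLck in /proc/<pid>/status. A minimal sketch, with locked_kb as a hypothetical helper name:

```shell
# locked_kb STATUS_FILE
# Prints the VmLck value (in kB) from a /proc/<pid>/status-style file.
locked_kb() {
  awk '/^VmLck:/ { print $2 }' "$1"
}

# Usage on the Pi, once the service is up:
#   pid=$(systemctl show -p MainPID --value phi-server)
#   locked_kb "/proc/$pid/status"
```

A value in the neighborhood of the model size suggests the weights are pinned; 0 means the lock did not take effect and the unit's capability or memlock settings need another look.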

Add a health check so dependents know when the server is actually ready:

#!/usr/bin/env bash
# health-check.sh — exit 0 only when the model is loaded and serving.
set -euo pipefail

# curl already prints 000 via -w when it cannot connect, so only swallow its exit status.
response=$(curl -s -o /dev/null -w "%{http_code}" \
  --max-time 5 http://127.0.0.1:8080/health || true)

if [[ "$response" == "200" ]]; then
  echo "phi-server: ready"
  exit 0
else
  echo "phi-server: not ready (http $response)"
  exit 1
fi

Loading a 2.4 GB model from storage and allocating the KV cache takes a few seconds on the Pi. Don’t route traffic until /health returns 200.
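
Anything that depends on the server can wrap the health check in a bounded retry loop rather than sleeping for a guessed interval. A sketch follows; wait_until_ready is a hypothetical helper, and the retry count and delay in the usage comment are arbitrary values you should tune.

```shell
# wait_until_ready RETRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or RETRIES attempts are used up.
wait_until_ready() {
  local retries="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$retries" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "not ready after $retries attempt(s)" >&2
  return 1
}

# Usage, with the health-check script from above:
#   wait_until_ready 30 2 ./health-check.sh && start_the_tagging_job
```

The command is passed as arguments, so the same loop works with the script, a bare curl call, or a stub while testing. start_the_tagging_job is a placeholder for whatever consumes the server.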

Common Pitfalls

  • Running without active cooling. The most common mistake. The Pi will throttle and your throughput silently degrades. Buy the fan.
  • Using the 4GB Pi 5. A 3B-class model at Q4 leaves almost no room for context or the OS. The 8GB model is the real minimum.
  • A 32-bit OS. It can’t address the memory and skips ARM features llama.cpp relies on. Reinstall 64-bit Raspberry Pi OS.
  • Oversized context window. Reserving 16K of context on an 8GB Pi burns RAM you need. Size --ctx-size to actual prompts.
  • Skipping --mlock (or CAP_IPC_LOCK). Under memory pressure the kernel pages out model weights and inference latency falls off a cliff. Pin the model.
  • Quantizing on the Pi. It works but it’s painfully slow. Quantize on a fast machine and copy the GGUF over.

Troubleshooting

Symptom: token rate starts fine then degrades over a few minutes. Cause: thermal throttling. Fix: check vcgencmd get_throttled; the 0x4 (throttled now) or 0x40000 (throttled since boot) bit confirms it. Add active cooling; there is no software workaround.

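Symptom: illegal instruction when running the binary. Cause: the binary was built for a different CPU, typically copied from another machine or cross-compiled with instruction-set extensions this one lacks, or the OS is 32-bit. Fix: confirm uname -m reports aarch64 and rebuild natively on the Pi with GGML_NATIVE=ON.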

Symptom: the process is killed during model load or first request. Cause: out of memory — the model plus KV cache plus OS exceeded 8GB. Fix: reduce --ctx-size, confirm you’re on the Q4_K_M (not a larger) GGUF, and set MemoryMax so you get a clean failure instead of an OOM-killed host.

Symptom: --mlock logs a warning and the model isn’t pinned. Cause: the service lacks CAP_IPC_LOCK. Fix: add AmbientCapabilities=CAP_IPC_LOCK to the unit file, or raise the memlock rlimit.

Symptom: model loads slowly, every single time. Cause: a slow microSD card and cold page cache. Fix: move the model to NVMe via the PCIe HAT, or use --mlock so the model stays resident after the first load for as long as the server keeps running.

What’s Next

A Raspberry Pi 5 with a quantized Phi-4-mini is a genuinely capable, fully offline inference node — and once you’ve measured the throughput and thermal behavior, you’ve got the baseline you need to decide whether one Pi is enough or whether your workload needs a fleet. The natural follow-up is adapting the model to your specific domain so a small model performs like a much larger one on the task that actually matters to you.
