Small Language Models in January 2025, A Practical Survey
January 6, 2025 · 10 min read · by Muhammad Amal

TL;DR — Four families dominate practical SLM work in January 2025 (Llama 3.2, Phi-3.5-mini, Qwen2.5, Gemma 2). Pick by license, context window, and tokenizer behavior, not by benchmark vanity. A 3B model with the right quantization usually beats a 7B you can’t afford to run.

I’ve spent the last six months replacing GPT-4 calls in production with locally-hosted SLMs wherever the task allows it. The economics flipped sometime in late 2024. A quantized 3B-parameter model running on a single consumer GPU now handles classification, extraction, summarization, and structured generation well enough that paying per-token for those tasks feels like burning money.

The cost of “well enough” is taste, though. You can’t just ollama pull the highest-ranked model on a leaderboard and call it done. Each of the four major SLM families released or refreshed in late 2024 has a distinct personality. They tokenize differently, refuse differently, hallucinate differently, and react to instructions differently. This post is the survey I wish I’d had when I started.

If you want to actually run these models after reading, the next post in this series walks through Ollama end to end. Here, I want to focus on what the models are, where they came from, and how to choose between them.

The Four Families Worth Knowing

For shipping work right now, I only seriously consider four model families. Anything else either lacks a permissive license, lacks tooling, or simply hasn’t kept pace. Here’s the lineup as of the first week of January 2025.

Llama 3.2 (Meta)

Released September 2024 in 1B and 3B text variants (plus 11B and 90B vision models we’ll ignore here). The 3B is the workhorse. It’s the first Llama small enough to run on a laptop without quantization heroics and still the most familiar to anyone who has used Llama 2 or 3.1. Llama 3.2 inherits the 128k context window from Llama 3.1, which is wild for a 3B model.

License is the Llama 3.2 Community License, which is “open weights with strings.” You’re fine for most commercial uses below 700M MAU, but legal teams sometimes balk. Tokenizer is BPE with 128k vocab, same as Llama 3.1.

Phi-3.5-mini (Microsoft)

3.8B parameters, MIT license, 128k context. Phi-3.5-mini is the strongest “follow instructions even when the input is garbage” model in this size class. Microsoft trained heavily on synthetic data, which shows up as crisp instruction-following but also as a tendency to refuse benign requests and to sound vaguely like a corporate chatbot. The 128k context is real and works without major degradation up to about 64k in my testing.

Qwen2.5 (Alibaba)

Qwen2.5 ships in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes, all Apache 2.0 except the 3B (Qwen Research License, so read the terms before shipping it commercially) and the 72B (Qwen License). For SLM work I focus on 1.5B and 3B. Qwen2.5 has the best multilingual support of the four: Chinese obviously, but also strong Indonesian, Japanese, and Arabic. Tokenizer is BPE with 151k vocab. Context is 32k for the 3B variant.

Gemma 2 (Google DeepMind)

The 9B and 27B were released in June 2024, with the 2B following at the end of July. The 2B is the smallest credible model in this survey. Gemma 2 uses a custom license that is more permissive than Llama 3.2’s but less permissive than Apache 2.0. Context window is 8k, which feels small in 2025 but is genuinely enough for many tasks. Tokenizer uses a 256k vocab, twice Llama 3.2’s and the largest of the four, which lets Gemma 2 encode many non-English languages in surprisingly few tokens.

How They Stack Up at a Glance

Model           Params   License        Context   Vocab / tokenizer  Best at
--------------  -------  -------------  --------  -----------------  -----------------------
Llama-3.2-3B    3.2B     Llama 3.2 CL   128k      128k BPE           General + long context
Phi-3.5-mini    3.8B     MIT            128k      32k BPE            Instruction following
Qwen2.5-3B      3.1B     Qwen License   32k       151k BPE           Multilingual + code
Gemma-2-2B      2.6B     Gemma TOU      8k        256k SP            Compact for non-English

That table is what fits on my whiteboard. License column is the first filter — if your lawyer can’t accept a custom license, you’re down to Phi-3.5-mini and the Apache-2.0 Qwen variants.
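The license-first filter is simple enough to write down. What follows is a minimal sketch that treats the table above as data; `Candidate` and `shortlist` are hypothetical names made up for illustration, and the context sizes are rounded to thousands of tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    license: str
    context_tokens: int  # rounded, e.g. "128k" is stored as 128_000


# Values copied from the comparison table above.
TABLE = [
    Candidate("Llama-3.2-3B", "Llama 3.2 CL", 128_000),
    Candidate("Phi-3.5-mini", "MIT", 128_000),
    Candidate("Qwen2.5-3B", "Qwen License", 32_000),
    Candidate("Gemma-2-2B", "Gemma TOU", 8_000),
]


def shortlist(candidates, accepted_licenses, min_context=0):
    """Keep models whose license is accepted and whose context is large enough."""
    return [
        c.name
        for c in candidates
        if c.license in accepted_licenses and c.context_tokens >= min_context
    ]


# A legal team that only accepts standard open-source licenses:
print(shortlist(TABLE, {"MIT", "Apache-2.0"}))
```

Adding the Apache-2.0 Qwen sizes to `TABLE` would widen that first shortlist; they are left out here only because the table lists the 3B.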

Setting Up to Try Them Yourself

If you want to follow along, here’s the minimum environment. I’m pinning to versions available in January 2025.

Step 1, create a clean Python environment

python3.12 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

Step 2, install transformers and friends

pip install \
  "transformers==4.47.1" \
  "accelerate==1.2.1" \
  "torch==2.5.1" \
  "bitsandbytes==0.44.1" \
  "sentencepiece==0.2.0"

Step 3, pull a model and run a quick sanity check

# sanity_check.py
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a terse senior engineer."},
    {"role": "user", "content": "Explain why context windows aren't free in 3 sentences."},
]

inputs = tok.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

out = model.generate(inputs, max_new_tokens=200, do_sample=False)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

You’ll need a Hugging Face token with access granted on the model page. Llama 3.2 and Gemma 2 are gated; Phi-3.5 and Qwen2.5 are not. See the Llama 3.2 model card for the gate.
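A small preflight check can catch a missing token before a multi-gigabyte download starts. This is a sketch under two assumptions: the gating status is hard-coded from the paragraph above, and only the HF_TOKEN environment variable is inspected, so a token saved by huggingface-cli login would not be detected. `needs_token` and `preflight` are hypothetical helper names.

```python
import os

# Gating status as stated in the text above (January 2025); adjust if it changes.
GATED_PREFIXES = ("meta-llama/Llama-3.2", "google/gemma-2")


def needs_token(model_id):
    """True if the model ID belongs to one of the gated families."""
    return model_id.startswith(GATED_PREFIXES)


def preflight(model_id, env=os.environ):
    """Return a warning string if a gated model is requested without HF_TOKEN, else None."""
    if needs_token(model_id) and not env.get("HF_TOKEN"):
        return (
            f"{model_id} is gated: accept the license on its model page "
            "and set HF_TOKEN before loading"
        )
    return None


for model_id in ("meta-llama/Llama-3.2-3B-Instruct", "Qwen/Qwen2.5-3B-Instruct"):
    print(model_id, "->", preflight(model_id) or "ok")
```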

How I Actually Choose

Benchmarks lie. Or rather, they tell you something true but irrelevant. MMLU score does not predict whether a model will follow a JSON schema you wrote at 2am for a niche extraction task. Here’s the rubric I use instead, in order of importance.

1. License compatibility with your shipping target

If you’re shipping in a regulated industry, anything with “Acceptable Use Policy” attached needs sign-off. MIT-licensed Phi-3.5-mini is the lowest-friction option here. Apache-2.0 Qwen2.5-1.5B and 7B are the next. The Llama and Gemma licenses are workable but require reading.

2. Tokenizer behavior on your inputs

Tokenize a representative batch of your real inputs against each candidate model’s tokenizer. The number of tokens per request directly affects both latency and the effective context window. Gemma 2’s 256k vocab is brutally efficient on multilingual text. Llama 3.2’s 128k vocab is balanced. Phi-3.5 inherits a smaller 32k vocab that bloats non-English inputs.

from transformers import AutoTokenizer

CANDIDATES = [
    "meta-llama/Llama-3.2-3B-Instruct",
    "microsoft/Phi-3.5-mini-instruct",
    "Qwen/Qwen2.5-3B-Instruct",
    "google/gemma-2-2b-it",
]

with open("real_inputs.txt", encoding="utf-8") as f:  # your actual data
    sample = f.read()

for m in CANDIDATES:
    tok = AutoTokenizer.from_pretrained(m, trust_remote_code=True)
    n = len(tok.encode(sample))
    print(f"{m:48s} {n:>8d} tokens")

I’ve seen this single measurement flip a model decision more than any benchmark. If your inputs are Bahasa Indonesia or Japanese, Gemma 2 might use 40% fewer tokens than Phi-3.5 for the same content.
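To turn raw token counts into a decision, two numbers help: the relative saving between tokenizers and how many inputs fit in a context window. The sketch below uses made-up counts purely for illustration (they are not measurements), and both helper names are hypothetical.

```python
def relative_saving(baseline_tokens, candidate_tokens):
    """Fraction of tokens the candidate saves relative to the baseline."""
    return 1 - candidate_tokens / baseline_tokens


def inputs_per_window(context_tokens, tokens_per_input, reserved_for_output=0):
    """How many inputs of this size fit in one window after reserving output space."""
    return (context_tokens - reserved_for_output) // tokens_per_input


# Hypothetical counts for the same document under two tokenizers.
counts = {"tokenizer_a": 1000, "tokenizer_b": 600}

saving = relative_saving(counts["tokenizer_a"], counts["tokenizer_b"])
print(f"tokenizer_b saves {saving:.0%} of the tokens")

# With an 8k window and 128 tokens reserved for the answer:
for name, n in counts.items():
    print(name, inputs_per_window(8_000, n, reserved_for_output=128), "inputs per window")
```

The second number matters most for the small-window models: a tokenizer that is efficient on your data partly offsets a short context.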

3. Refusal behavior

Run your actual prompts through each model and count the refusals. Phi-3.5-mini is the most “safety-tuned” of the four and will sometimes refuse extraction tasks on news articles about violence or politics. Qwen2.5 refuses less. Llama 3.2 sits in the middle. Gemma 2 is closer to Phi-3.5.
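Counting refusals by hand gets old fast, so a crude detector is worth having. This is a heuristic sketch only: the marker phrases are assumptions, not a list taken from any of these models, and should be tuned against refusals you actually observe.

```python
import re

# Hypothetical marker phrases; extend this from real refusals in your own logs.
REFUSAL_PATTERN = re.compile(
    r"\b(?:i can'?t|i cannot|i'm sorry, but|i am unable to|i won't)\b",
    re.IGNORECASE,
)


def is_refusal(text):
    """Heuristic: a refusal marker appears in the first 200 characters."""
    head = text[:200].replace("\u2019", "'")  # normalize curly apostrophes
    return bool(REFUSAL_PATTERN.search(head))


def refusal_rate(outputs):
    """Share of outputs flagged as refusals (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(is_refusal(o) for o in outputs) / len(outputs)


outputs = [
    "I can't help with that request.",
    '{"company": "Acme Corp", "role": "backend lead", "years": "2019-2023"}',
]
print(f"refusal rate: {refusal_rate(outputs):.0%}")
```

Only the head of each output is checked because refusals tend to open the response; a marker phrase buried in extracted text should not count.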

4. Structured output compliance

Test JSON schema compliance with the model in its default state, no constrained decoding. Qwen2.5 and Llama 3.2 produce clean JSON most reliably in my testing. Phi-3.5 is fine but more verbose. Gemma 2-2B is the weakest here — the smallest model in the survey shows it on structured tasks.
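A strict compliance check needs only the standard library. In this sketch, "compliant" is taken to mean that the whole response parses as a JSON object with exactly the expected keys; that definition is an assumption, and you may want to loosen it. The keys are the ones from the resume prompt used in the A/B harness in this post.

```python
import json

REQUIRED_KEYS = {"company", "role", "years"}


def is_compliant(raw, required_keys=REQUIRED_KEYS):
    """True if raw is a bare JSON object with exactly the required keys."""
    try:
        obj = json.loads(raw.strip())
    except json.JSONDecodeError:
        return False  # prose preambles and markdown fences land here
    return isinstance(obj, dict) and set(obj) == set(required_keys)


def compliance_rate(outputs, required_keys=REQUIRED_KEYS):
    """Share of outputs that pass is_compliant (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(is_compliant(o, required_keys) for o in outputs) / len(outputs)


good = '{"company": "Acme Corp", "role": "backend lead", "years": "2019-2023"}'
chatty = "Here is the JSON you asked for: " + good
print(is_compliant(good), is_compliant(chatty))
```

Rejecting the chatty variant is deliberate: with no constrained decoding, a model that wraps its JSON in prose still breaks a downstream parser.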

A Reproducible Comparison

Here’s a script I use to compare candidates on a fixed task. It’s not a benchmark, it’s an A/B harness.

# ab.py
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = """Extract the company name, role, and years from this resume bullet.
Return strict JSON with keys company, role, years.

Bullet: Led the backend team at Acme Corp from 2019 to 2023, scaling Postgres to 4TB."""

CANDIDATES = [
    "meta-llama/Llama-3.2-3B-Instruct",
    "microsoft/Phi-3.5-mini-instruct",
    "Qwen/Qwen2.5-3B-Instruct",
    "google/gemma-2-2b-it",
]

for mid in CANDIDATES:
    tok = AutoTokenizer.from_pretrained(mid, trust_remote_code=True)
    m = AutoModelForCausalLM.from_pretrained(
        mid, torch_dtype=torch.bfloat16, device_map="auto",
        trust_remote_code=True,
    )
    msgs = [{"role": "user", "content": PROMPT}]
    inp = tok.apply_chat_template(msgs, return_tensors="pt",
                                  add_generation_prompt=True).to(m.device)
    t0 = time.time()
    out = m.generate(inp, max_new_tokens=128, do_sample=False)
    dt = time.time() - t0
    text = tok.decode(out[0][inp.shape[1]:], skip_special_tokens=True)
    print(f"--- {mid} ({dt:.2f}s) ---")
    print(text.strip())
    del m
    torch.cuda.empty_cache()

Run it three times. The first run is dominated by warm-up. The numbers you care about are the second and third.
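The discard-the-first-run rule can be built into the timing itself. A minimal sketch follows, with `timed_runs` and `summarize` as hypothetical helpers; the callable you pass in would wrap the generate call from ab.py.

```python
import statistics
import time


def timed_runs(fn, runs=3, warmup=1):
    """Call fn `runs` times and return the durations after dropping the first `warmup`."""
    durations = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - t0)
    return durations[warmup:]


def summarize(durations):
    """Median and best-case latency in seconds."""
    return {"median_s": statistics.median(durations), "min_s": min(durations)}


# Stand-in workload; in ab.py this would be something like
#   lambda: m.generate(inp, max_new_tokens=128, do_sample=False)
kept = timed_runs(lambda: sum(range(100_000)), runs=3, warmup=1)
print(summarize(kept))
```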

Common Pitfalls

A few traps I’ve hit personally that you can skip.

  1. Mixing up chat templates. Every model in this list ships a chat_template in its tokenizer config, and they are not interchangeable. If you copy Llama’s prompt format into a Phi call, generation quality drops noticeably and sometimes catastrophically. Fix: always use tok.apply_chat_template with the tokenizer loaded from the same model ID. Never hand-roll the format.

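  2. Treating trust_remote_code=True as a cure-all. Qwen2.5 is supported natively in transformers 4.37 and later, so with the versions pinned above it loads without the flag. On older versions the failure is an opaque KeyError: 'qwen2'. The flag also executes Python code shipped in the model repo. Fix: upgrade transformers first, and pass trust_remote_code=True only for a repo that actually ships custom code you have reviewed.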

  3. Loading the base model when you want Instruct. meta-llama/Llama-3.2-3B is the base completion model. meta-llama/Llama-3.2-3B-Instruct is the fine-tuned chat model. The base will happily continue your system prompt as if it were a story. Fix: always check your model ID for an -Instruct suffix (lowercase -instruct for Phi) or -it (Gemma) when chatting.

  4. Comparing on a single sample. SLMs have higher variance than frontier models. A single bad sample doesn’t condemn a model and a single good sample doesn’t bless one. Fix: run at least 50 samples per candidate before deciding.
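The last pitfall has a quick numerical illustration. The sketch below computes a Wilson score interval for a pass rate. That choice of interval is mine, not something the original workflow prescribes, and `wilson_interval` is a hypothetical helper. A single passing sample leaves the plausible pass rate almost unconstrained, while 50 samples narrow it to something you can act on.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


for passed, total in [(1, 1), (9, 10), (45, 50)]:
    lo, hi = wilson_interval(passed, total)
    print(f"{passed}/{total} passed: plausible pass rate {lo:.2f} to {hi:.2f}")
```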

Troubleshooting

Three real failure modes you’ll hit early.

Symptom: OSError: meta-llama/Llama-3.2-3B-Instruct is not a local folder and is not a valid model identifier. Diagnose: You haven’t accepted the model’s gated license on Hugging Face, or your HF_TOKEN isn’t set. Run huggingface-cli login and visit the model page in a browser to accept terms.

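Symptom: Garbled output that looks like raw byte fragments, or stray markers such as <|eot_id|> in the text. Diagnose: Byte-level junk usually means a tokenizer mismatch, where you’re decoding with a different tokenizer than the one that produced the input IDs. Re-load the tokenizer from the same model ID as the model itself. If only special-token markers leak through, you probably passed skip_special_tokens=False to decode.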

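Symptom: CUDA OOM on a 3B model with a 24GB GPU. Diagnose: You loaded the model in float32, the default when torch_dtype is not set. At 4 bytes per parameter the weights alone take roughly 13 GB, which leaves limited headroom for activations and the KV cache at long context. Pass torch_dtype=torch.bfloat16 for Ampere or newer, or torch.float16 for Turing. If you still OOM, switch to 4-bit via BitsAndBytesConfig(load_in_4bit=True).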
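The dtype advice comes down to bytes per parameter. The estimate below is a back-of-the-envelope sketch covering weights only: it ignores the KV cache, activations, and the metadata that quantized formats carry, so treat the results as lower bounds. `weight_gb` is a hypothetical helper.

```python
# Bytes needed to store one parameter in each format (weights only).
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "float16": 2.0, "int4": 0.5}


def weight_gb(params_billions, dtype):
    """Approximate weight memory in GB (10^9 bytes): parameters times bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[dtype]


# Parameter counts from the comparison table earlier in this post.
for name, params in [("Llama-3.2-3B", 3.2), ("Phi-3.5-mini", 3.8)]:
    sizes = ", ".join(f"{d}: {weight_gb(params, d):.1f} GB" for d in BYTES_PER_PARAM)
    print(f"{name}: {sizes}")
```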

Hardware Implications

A quick sketch of what each model needs to run comfortably at Q4_K_M quantization, batch size 1:

Model           VRAM (Q4)  Tokens/sec on RTX 4070  Tokens/sec on M2 Pro
--------------  ---------  ----------------------  --------------------
Llama-3.2-3B    ~2.0 GB    62 tok/s                28 tok/s
Phi-3.5-mini    ~2.4 GB    54 tok/s                25 tok/s
Qwen2.5-3B      ~2.0 GB    61 tok/s                27 tok/s
Gemma-2-2B      ~1.7 GB    74 tok/s                34 tok/s

Numbers are from my own bench runs at batch 1, 512 prompt tokens, 128 generation tokens. Yours will differ; treat these as relative anchors. The takeaway: any of these models runs interactively on a five-year-old GPU. A 16 GB MacBook handles them all simultaneously if you wanted, which is wild for what was frontier capability eighteen months earlier.
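Two quick calculations make those figures concrete: the combined footprint if all four models are resident at once, and how long a 128-token answer takes at each speed. This sketch assumes the tok/s figures describe generation (decode) throughput and ignores prompt processing time; the helper names are hypothetical.

```python
# Values copied from the table above:
# (VRAM at Q4 in GB, tok/s on RTX 4070, tok/s on M2 Pro)
BENCH = {
    "Llama-3.2-3B": (2.0, 62, 28),
    "Phi-3.5-mini": (2.4, 54, 25),
    "Qwen2.5-3B": (2.0, 61, 27),
    "Gemma-2-2B": (1.7, 74, 34),
}


def total_vram_gb(bench):
    """Combined Q4 footprint if every model is loaded at the same time."""
    return sum(vram for vram, _, _ in bench.values())


def generation_seconds(tokens, tok_per_sec):
    """Time to emit `tokens` at a steady decode rate (prompt processing excluded)."""
    return tokens / tok_per_sec


print(f"all four loaded: ~{total_vram_gb(BENCH):.1f} GB")
for name, (_, gpu, mac) in BENCH.items():
    print(
        f"{name}: 128 tokens in {generation_seconds(128, gpu):.1f}s (RTX 4070), "
        f"{generation_seconds(128, mac):.1f}s (M2 Pro)"
    )
```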

For CPU-only servers, Gemma-2-2B is the only one I’d consider for interactive use. It can sustain 10-12 tok/s on a modern Xeon with AVX-512, which is barely usable. The others land below 5 tok/s. Pay for a GPU if you can; the cost-effectiveness is overwhelming.

A Note on Multimodal Variants

Both Llama-3.2 (11B and 90B) and Phi-3.5 (Phi-3.5-vision-instruct) ship vision-capable variants. I’m deliberately excluding them from this survey because they’re not SLMs by my definition — the 11B Llama with vision needs 24 GB of VRAM at Q4, which is well outside the “single consumer card” envelope.

If you specifically need vision, Phi-3.5-vision (4.2B) is the only true small multimodal model in this list and it’s surprisingly good for OCR and document layout tasks. It’s not in my main rotation only because text tasks vastly outnumber vision tasks in the work I ship.

Wrapping Up

The SLM landscape in January 2025 is finally healthy. Four solid families, real licenses, real ecosystems, real tooling. The decision is no longer “which model is best” but “which model fits my constraints” — and constraints are mostly about license, tokenizer efficiency, and how grumpy your model is allowed to be about refusals. Start with Phi-3.5-mini if MIT matters, Llama 3.2 3B if context matters, Qwen2.5 if multilingual matters, and Gemma 2 2B if footprint matters. Then measure on your actual task, because the surveys lie just enough to be dangerous.