Benchmarking SLMs for Your Use Case, From Lmeval to Custom Suites
TL;DR — Run lm-evaluation-harness for a calibration baseline, then build a custom suite around your real task. A 200-example custom eval set you own beats 10 public benchmarks. Pin everything (seed, sampler, prompt format, model digest) or the numbers are gossip.
I’ve watched teams pick models off the Open LLM Leaderboard, deploy them, and discover their MMLU score did not predict whether the model could extract a phone number from a customer email. The leaderboard isn’t lying. It’s just measuring something different from your task. If you don’t have your own benchmark, you don’t know whether your model is good. You only know whether someone else’s model is good at someone else’s problem.
This post is the evaluation workflow I use after building a local RAG system or fine-tuning an SLM. Two layers: a baseline pass with lm-evaluation-harness to confirm the model isn’t broken, then a custom task-specific suite that’s the actual decision-maker. Both layers matter. Skipping the first hides regressions; skipping the second hides task-specific failure.
Versions: Python 3.12, lm-evaluation-harness 0.4.5 (commit b281b09 from Dec 2024), Ollama 0.5.4, and the usual SLMs from earlier in this series.
What Public Benchmarks Actually Measure
A short and incomplete decoder ring:
- MMLU: multiple-choice general knowledge across 57 subjects. Measures knowledge breadth, not reasoning depth or instruction following.
- HumanEval / MBPP: Python coding. Useful only if you ship code generation.
- GSM8K: grade school math word problems. Mostly measures whether the model can carry arithmetic through reasoning steps.
- TruthfulQA: tendency to repeat common misconceptions. Adversarial by design; SLMs do poorly and that’s mostly fine.
- IFEval: instruction following with verifiable constraints. Closest to “follows my prompts” of any public benchmark.
- MT-Bench / Arena-Hard: open-ended chat quality. Judged by GPT-4. Useful as a vibe check, terrible as a primary metric.
For SLM-shaped work, IFEval is the most predictive of “will this model do what I tell it to.” MMLU is the least.
Step 1, Run the Harness Baseline
The harness is the standard tool for reproducing public benchmark scores. Install it:
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
git checkout b281b09
pip install -e ".[vllm]"
Run a small task suite against a local model via vLLM:
lm_eval \
  --model vllm \
  --model_args "pretrained=meta-llama/Llama-3.2-3B-Instruct,dtype=bfloat16,gpu_memory_utilization=0.85" \
  --tasks ifeval,gsm8k,arc_easy \
  --batch_size auto \
  --num_fewshot 0 \
  --output_path results/llama32-3b
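A few hours later (depending on hardware) you have a JSON file with scores. For Llama-3.2-3B-Instruct, ballpark numbers in my runs are below. One caveat: --num_fewshot 0 applies to every task in the run, so to reproduce a 5-shot GSM8K number you need to run gsm8k separately with --num_fewshot 5.
ifeval   : 75.4% (prompt-level strict)
gsm8k    : 44.3% (5-shot)
arc_easy : 71.2%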
Use these to confirm the model is performing where the literature says it should. If your number is wildly off, you have a setup bug (wrong dtype, wrong template, broken tokenizer). Fix the setup before going further.
Why this matters even though it doesn’t measure your task
Two reasons. First, a baseline lets you spot environmental issues. If you score 30% on ARC-easy when the public number is 71%, something is broken regardless of your downstream task. Second, when you fine-tune or quantize later, you want to make sure you haven’t regressed on general capability. Run the same baseline after the change.
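A before/after check on the harness output can be sketched roughly like this. It assumes the results JSON keeps scores under results -> task -> metric with keys such as "acc,none", and that every compared metric is higher-is-better; the numbers in the demo are placeholders, not measurements.

```python
def find_regressions(before: dict, after: dict, tolerance: float = 0.02) -> list:
    """List (task, metric, old, new) where a score dropped by more than `tolerance`.

    Assumes every compared metric is higher-is-better.
    """
    regressions = []
    for task, metrics in before.get("results", {}).items():
        after_metrics = after.get("results", {}).get(task, {})
        for name, old in metrics.items():
            new = after_metrics.get(name)
            # Skip standard-error fields and non-numeric entries such as aliases.
            if "stderr" in name:
                continue
            if not isinstance(old, (int, float)) or not isinstance(new, (int, float)):
                continue
            if old - new > tolerance:
                regressions.append((task, name, old, new))
    return regressions


# Placeholder dicts standing in for two json.load()-ed results files.
before = {"results": {
    "arc_easy": {"alias": "arc_easy", "acc,none": 0.712, "acc_stderr,none": 0.009},
    "gsm8k": {"exact_match,strict-match": 0.443},
}}
after = {"results": {
    "arc_easy": {"alias": "arc_easy", "acc,none": 0.655, "acc_stderr,none": 0.010},
    "gsm8k": {"exact_match,strict-match": 0.440},
}}

regressions = find_regressions(before, after)
for task, metric, old, new in regressions:
    print(f"{task} {metric}: {old:.3f} -> {new:.3f}")
```

The 2-point tolerance is arbitrary. Pick one that is wider than the standard error the harness reports for the task.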
Step 2, Define Your Real Task
Now the work that matters. A custom benchmark needs four pieces:
- A representative input set. 200-500 examples that look like real user inputs.
- Ground-truth labels. What you’d consider a correct answer for each.
- A scoring function. Programmatic, deterministic, fast.
- A reporting format. Per-example, not just aggregate.
The hardest piece is the labels. Skimping here is the most common mistake. Spend an afternoon hand-labeling 200 real examples and you have a benchmark that beats anything off-the-shelf.
Step 2a, define the schema
For our running extraction example:
# eval/schema.py
from pydantic import BaseModel


class ExtractionLabel(BaseModel):
    title: str
    company: str
    location: str
    remote: bool


class ExtractionExample(BaseModel):
    id: str
    input: str
    label: ExtractionLabel
    difficulty: str  # easy | medium | hard
    notes: str = ""
difficulty and notes are gold. They let you slice scores by hardness and inspect why hard ones fail.
Step 2b, store as JSONL
{"id":"job-0001","input":"We're hiring a Senior Backend Engineer at Acme Corp, remote-friendly...","label":{"title":"Senior Backend Engineer","company":"Acme Corp","location":"Remote","remote":true},"difficulty":"easy"}
{"id":"job-0002","input":"Acme is looking for someone who can lead our backend team. Based in Singapore, hybrid OK.","label":{"title":"Backend Lead","company":"Acme","location":"Singapore","remote":false},"difficulty":"medium"}
Hand-label these. Yes, all of them. Yes, it’s tedious. No, GPT-4 can’t do it for you reliably enough to skip the step — it’ll get 90% right and the 10% it gets wrong are exactly the cases your benchmark needs to measure.
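Hand-labeled files pick up typos, so it helps to lint the JSONL before trusting it. Here is a minimal standard-library sketch; validate_lines and its rules (unique ids, a known difficulty, correctly typed label fields) are my assumptions about what is worth checking, not part of any library.

```python
import json

REQUIRED_LABEL_FIELDS = {"title": str, "company": str, "location": str, "remote": bool}
DIFFICULTIES = {"easy", "medium", "hard"}


def validate_lines(lines) -> list:
    """Return human-readable problems; an empty list means the file looks sane."""
    problems = []
    seen_ids = set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            ex = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append(f"line {lineno}: invalid JSON ({e.msg})")
            continue
        if not isinstance(ex, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        ex_id = ex.get("id")
        if not ex_id:
            problems.append(f"line {lineno}: missing id")
        elif ex_id in seen_ids:
            problems.append(f"line {lineno}: duplicate id {ex_id!r}")
        else:
            seen_ids.add(ex_id)
        if ex.get("difficulty") not in DIFFICULTIES:
            problems.append(f"line {lineno}: difficulty must be one of {sorted(DIFFICULTIES)}")
        label = ex.get("label")
        if not isinstance(label, dict):
            label = {}
        for field, typ in REQUIRED_LABEL_FIELDS.items():
            if not isinstance(label.get(field), typ):
                problems.append(f"line {lineno}: label.{field} should be {typ.__name__}")
    return problems


# Two made-up rows: the second reuses an id, has an unknown difficulty,
# and labels `remote` as a string instead of a bool.
sample = [
    '{"id":"job-0001","input":"Hiring a Senior Backend Engineer at Acme Corp.",'
    '"label":{"title":"Senior Backend Engineer","company":"Acme Corp",'
    '"location":"Remote","remote":true},"difficulty":"easy"}',
    '{"id":"job-0001","input":"Backend lead wanted, Singapore.",'
    '"label":{"title":"Backend Lead","company":"Acme","location":"Singapore",'
    '"remote":"no"},"difficulty":"tricky"}',
]
problems = validate_lines(sample)
for p in problems:
    print(p)
```

In practice you would pass an open file handle instead of the in-memory list.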
Step 2c, write a scorer
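# eval/score.py
import json
from dataclasses import dataclass


@dataclass
class Score:
    valid_json: bool
    field_correct: dict  # field name -> bool
    exact_match: bool


def score(prediction_text: str, label: dict) -> Score:
    try:
        pred = json.loads(prediction_text)
    except json.JSONDecodeError:
        return Score(False, {k: False for k in label}, False)
    # Valid JSON that is not an object (a list, a bare string) has no fields to check.
    if not isinstance(pred, dict):
        return Score(True, {k: False for k in label}, False)
    field_correct = {}
    for k, expected in label.items():
        got = pred.get(k)
        if isinstance(expected, str):
            # Case- and whitespace-insensitive match; a non-string prediction is wrong.
            field_correct[k] = (
                isinstance(got, str)
                and got.strip().lower() == expected.strip().lower()
            )
        else:
            # Booleans and other non-string fields must match exactly.
            field_correct[k] = got == expected
    exact = all(field_correct.values())
    return Score(True, field_correct, exact)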
Choose your strictness consciously. For extraction, I usually accept case-insensitive string match for company but exact match for remote (bool). For RAG-grounded QA, I’d switch to a semantic-equivalence check (an embedding cosine similarity over the answer and the label).
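For the semantic-equivalence variant, the shape of the check is roughly the following. This is a sketch under loud assumptions: toy_embed is a bag-of-words stand-in so the example runs without a model, and the 0.8 threshold is arbitrary. In a real suite you would pass your own embedding function as embed and tune the threshold on your dev set.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(texts):
    """Stand-in embedder (an assumption): word counts over a shared vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    return [[float(toks.count(tok)) for tok in vocab] for toks in tokenized]


def semantically_equal(answer: str, label: str, embed=toy_embed, threshold: float = 0.8) -> bool:
    """True when the answer and label embeddings are at least `threshold` similar."""
    vec_answer, vec_label = embed([answer, label])
    return cosine(vec_answer, vec_label) >= threshold


same = semantically_equal("The office is in Singapore", "the office is in singapore")
different = semantically_equal("Singapore", "Remote")
print(same, different)
```

Unlike the exact-match scorer, this check has a tunable threshold, so record it alongside the scorer version or scores from different runs will not be comparable.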
Step 3, Run Your Suite
# eval/run.py
import json
import sys
import time

from ollama import Client

from eval.score import score

client = Client()


def evaluate(model: str, dataset_path: str, out_path: str):
    # One JSON object per line; skip blank lines so a trailing newline is harmless.
    with open(dataset_path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    if not rows:
        raise SystemExit(f"no examples found in {dataset_path}")
    results = []
    t_total = time.monotonic()
    for ex in rows:
        msgs = [
            {"role": "system",
             "content": "Extract title, company, location, remote (bool) as JSON."},
            {"role": "user", "content": ex["input"]},
        ]
        t0 = time.monotonic()
        resp = client.chat(
            model=model,
            messages=msgs,
            # Pinned sampling settings so repeated runs are comparable.
            options={"temperature": 0.0, "seed": 42},
            # JSON schema that constrains the output to the four expected fields.
            format={
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "company": {"type": "string"},
                    "location": {"type": "string"},
                    "remote": {"type": "boolean"},
                },
                "required": ["title", "company", "location", "remote"],
            },
        )
        dt = time.monotonic() - t0
        out = resp["message"]["content"]
        s = score(out, ex["label"])
        results.append({
            "id": ex["id"], "difficulty": ex["difficulty"],
            "valid_json": s.valid_json, "exact_match": s.exact_match,
            "field_correct": s.field_correct, "latency_s": round(dt, 3),
            "prediction": out,
        })
    # Per-example results, one JSON object per line.
    with open(out_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")
    n = len(results)
    em = sum(r["exact_match"] for r in results) / n
    vj = sum(r["valid_json"] for r in results) / n
    lat = sum(r["latency_s"] for r in results) / n
    print(f"model={model} n={n} exact={em:.1%} valid_json={vj:.1%} "
          f"avg_lat={lat:.2f}s total={time.monotonic() - t_total:.0f}s")


if __name__ == "__main__":
    evaluate(sys.argv[1], sys.argv[2], sys.argv[3])
temperature=0.0 and seed=42 are non-negotiable. Without them, your scores wobble run-to-run and you can’t reason about anything.
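One way to confirm the pinning actually worked is to run the suite twice and diff the predictions. Below is a minimal sketch, assuming two lists of result rows in the format eval/run.py writes; unstable_ids is a hypothetical helper and the rows are made up.

```python
def unstable_ids(run_a: list, run_b: list) -> list:
    """Ids whose prediction text differs between two runs of the same model."""
    preds_b = {r["id"]: r["prediction"] for r in run_b}
    return sorted(r["id"] for r in run_a if preds_b.get(r["id"]) != r["prediction"])


# Made-up rows standing in for two results files loaded from JSONL.
run_a = [
    {"id": "job-0001", "prediction": '{"remote": true}'},
    {"id": "job-0002", "prediction": '{"remote": false}'},
]
run_b = [
    {"id": "job-0001", "prediction": '{"remote": true}'},
    {"id": "job-0002", "prediction": '{"remote": true}'},
]

changed = unstable_ids(run_a, run_b)
print(f"{len(changed)} of {len(run_a)} predictions changed: {changed}")
```

If this list is not empty with temperature 0 and a fixed seed, look at the model tag and the runtime before blaming the model.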
Step 4, Slice and Inspect
Aggregate score is the headline. The signal is in the slices.
# eval/report.py
import json
import sys
from collections import defaultdict


def report(path: str):
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    by_diff = defaultdict(list)
    by_field = defaultdict(list)
    for r in rows:
        by_diff[r["difficulty"]].append(r["exact_match"])
        for k, v in r["field_correct"].items():
            by_field[k].append(v)
    # Exact-match rate per difficulty bucket.
    for diff, vals in sorted(by_diff.items()):
        print(f" difficulty={diff:6s} n={len(vals):3d} em={sum(vals)/len(vals):.1%}")
    # Accuracy per extracted field.
    for field, vals in sorted(by_field.items()):
        print(f" field={field:12s} n={len(vals):3d} acc={sum(vals)/len(vals):.1%}")


if __name__ == "__main__":
    report(sys.argv[1])
You’ll learn things like “the model nails title and company but bombs on remote.” That tells you where to focus next: better prompts? Few-shot examples? Fine-tuning? You can’t know without the slice.
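Once a slice looks bad, the next move is reading the failing examples themselves. Here is a small sketch that filters result rows by field and difficulty; failures is a hypothetical helper, and the rows are invented to match the shape eval/run.py writes.

```python
def failures(rows: list, field: str = None, difficulty: str = None) -> list:
    """Rows that failed: on `field` if given, otherwise on exact match."""
    out = []
    for r in rows:
        if difficulty is not None and r["difficulty"] != difficulty:
            continue
        if field is None:
            failed = not r["exact_match"]
        else:
            failed = not r["field_correct"].get(field, False)
        if failed:
            out.append(r)
    return out


# Invented result rows for illustration.
rows = [
    {"id": "job-0001", "difficulty": "easy", "exact_match": True,
     "field_correct": {"title": True, "remote": True}, "prediction": "{...}"},
    {"id": "job-0002", "difficulty": "hard", "exact_match": False,
     "field_correct": {"title": True, "remote": False}, "prediction": "{...}"},
    {"id": "job-0003", "difficulty": "hard", "exact_match": False,
     "field_correct": {"title": False, "remote": True}, "prediction": "{...}"},
]

hard_remote_misses = failures(rows, field="remote", difficulty="hard")
for r in hard_remote_misses:
    print(r["id"], r["prediction"])
```

Reading twenty of these side by side with the inputs usually tells you whether the fix is a prompt change or a labeling problem.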
Step 5, Compare Models
This is the payoff. Run the same suite against three or four candidates:
python -m eval.run llama3.2:3b-instruct-q4_K_M data/eval.jsonl out/llama32.jsonl
python -m eval.run phi3.5:3.8b-mini-instruct-q4_K_M data/eval.jsonl out/phi35.jsonl
python -m eval.run qwen2.5:3b-instruct-q4_K_M data/eval.jsonl out/qwen25.jsonl
python -m eval.run gemma2:2b-instruct-q4_K_M data/eval.jsonl out/gemma2.jsonl
Then a comparison table:
model      n     em      valid_json   avg_lat
llama32    200   82.0%    99.5%       0.41s
phi35      200   79.5%   100.0%       0.52s
qwen25     200   85.5%    99.0%       0.38s
gemma2     200   71.0%    97.5%       0.29s
Now you can make a decision based on something other than vibes. Notice that on this task, Qwen2.5 wins despite being slightly smaller than Phi-3.5. Public benchmarks would not have predicted this.
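A table like the one above can be assembled from the per-model results files. The following is a rough sketch: summarize and format_table are hypothetical helpers, and the in-memory rows stand in for files loaded from out/*.jsonl.

```python
def summarize(name: str, rows: list) -> dict:
    """Aggregate one model's result rows into headline numbers."""
    n = len(rows)
    return {
        "model": name,
        "n": n,
        "em": sum(r["exact_match"] for r in rows) / n,
        "valid_json": sum(r["valid_json"] for r in rows) / n,
        "avg_lat": sum(r["latency_s"] for r in rows) / n,
    }


def format_table(summaries: list) -> str:
    """Plain-text table, best exact-match rate first."""
    lines = [f"{'model':<10}{'n':>5}{'em':>8}{'valid_json':>12}{'avg_lat':>9}"]
    for s in sorted(summaries, key=lambda s: s["em"], reverse=True):
        lines.append(
            f"{s['model']:<10}{s['n']:>5}{s['em']:>8.1%}"
            f"{s['valid_json']:>12.1%}{s['avg_lat']:>8.2f}s"
        )
    return "\n".join(lines)


# Made-up rows for two models, two examples each.
runs = {
    "model_a": [
        {"exact_match": True, "valid_json": True, "latency_s": 0.4},
        {"exact_match": False, "valid_json": True, "latency_s": 0.6},
    ],
    "model_b": [
        {"exact_match": True, "valid_json": True, "latency_s": 0.2},
        {"exact_match": True, "valid_json": True, "latency_s": 0.4},
    ],
}

table = format_table([summarize(name, rows) for name, rows in runs.items()])
print(table)
```

Sorting by exact match is a choice. If latency is your binding constraint, sort by that and treat accuracy as the filter.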
Architecture
+----------------+     +----------------+     +----------------+
|    Dataset     |---->|     Runner     |---->|  Predictions   |
|    (JSONL)     |     |  (per-model)   |     |    (JSONL)     |
+----------------+     +-------+--------+     +-------+--------+
                               |                      |
                               |                      v
                               |              +----------------+
                               |              |     Scorer     |
                               |              | (deterministic)|
                               |              +-------+--------+
                               |                      |
                               v                      v
                       +----------------+     +----------------+
                       |  Run metadata  |     | Sliced report  |
                       |  (model id,    |     | (by difficulty,|
                       |  seed, dtype)  |     |  by field)     |
                       +----------------+     +----------------+
Every dimension that affects scores is pinned in the run metadata: model digest, quantization, seed, system prompt hash, scorer version. Without these, comparisons across time are meaningless.
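The runner shown earlier does not write this metadata, so here is one possible shape for it. This is a sketch with assumed names: build_run_metadata and its field names are mine, and the digest value is a placeholder you would fill in from your model runtime.

```python
import hashlib
import json


def short_hash(text: str) -> str:
    """First 12 hex characters of the SHA-256 of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def build_run_metadata(model: str, model_digest: str, system_prompt: str,
                       options: dict, scorer_version: str) -> dict:
    """Collect everything that can move a score into one JSON-serializable record."""
    return {
        "model": model,
        "model_digest": model_digest,
        "system_prompt_hash": short_hash(system_prompt),
        "options": options,
        "scorer_version": scorer_version,
    }


meta = build_run_metadata(
    model="llama3.2:3b-instruct-q4_K_M",
    model_digest="<digest-from-your-runtime>",  # placeholder, not a real digest
    system_prompt="Extract title, company, location, remote (bool) as JSON.",
    options={"temperature": 0.0, "seed": 42},
    scorer_version="1",  # hypothetical: bump whenever eval/score.py changes
)
print(json.dumps(meta, indent=2))
```

Writing this record next to each results file means a score from six months ago can still be traced to the exact prompt and settings that produced it.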
Common Pitfalls
- Letting the eval set leak into prompts. You few-shot the model with examples from the eval set, then test on those same examples. The score is meaningless. Fix: split your data into train, dev, and held_out. Touch held_out only at the very end.
- Reusing the eval set as a training set. Even worse than leaking: you tune your prompt against the eval set until it scores well, then declare victory. You’ve overfit to the eval. Fix: keep a separate dev set for iteration; lock the eval set away and run it once per release.
- Letting an LLM judge an LLM. “GPT-4 says my output is correct” is a measurement, but it’s measuring what GPT-4 thinks, not what is true. Fine for vibe checks; not fine for production decisions. Fix: use programmatic scorers for anything quantitative.
- Comparing models trained with different chat templates. If your runner uses the same prompt string for all four candidates, three of them are getting the wrong template applied. Fix: route through each model’s apply_chat_template, never hand-roll.
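The train/dev/held_out split from the first pitfall can be made stable by hashing each example id, so an example never changes split when you add more data. A minimal sketch follows; the 50/20/30 proportions are an assumption, not a recommendation.

```python
import hashlib


def assign_split(example_id: str, dev_frac: float = 0.2, held_out_frac: float = 0.3) -> str:
    """Deterministically map an example id to 'held_out', 'dev', or 'train'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a fraction in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < held_out_frac:
        return "held_out"
    if bucket < held_out_frac + dev_frac:
        return "dev"
    return "train"


ids = [f"job-{i:04d}" for i in range(1, 201)]
splits = {example_id: assign_split(example_id) for example_id in ids}
counts = {name: list(splits.values()).count(name) for name in ("train", "dev", "held_out")}
print(counts)
```

Because the assignment depends only on the id, few-shot examples can be drawn from train without any chance of them also landing in held_out.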
Troubleshooting
Symptom: Scores fluctuate wildly between runs.
Diagnose: Either you forgot temperature=0 and seed, or you’re hitting a different model than you think (Ollama auto-pulled a newer tag). Pin the model digest with ollama show --modelfile <name> and check the FROM hash.
Symptom: Public benchmark scores way off from published numbers.
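Diagnose: Almost always a tokenization or template bug. Re-run the harness with --write_out to print the rendered prompts for the first few documents, and with --log_samples to save the generated text. For instruct models, also check whether you passed --apply_chat_template. If the prompt or the output contains stray or duplicated template tokens (<|begin_of_text|>), the template is wrong.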
Symptom: Custom benchmark says model A wins, production says model B wins.
Diagnose: Your eval set doesn’t match production traffic. Sample 100 real production inputs, hand-label them, add to your eval, re-run. The drift is the lesson.
Continuous Evaluation in CI
Once your benchmark is stable, wire it into CI. Every model change (new fine-tune, new quantization, new prompt) runs the eval and posts a diff.
A minimal GitHub Actions step:
- name: Run SLM benchmark
  run: |
    python -m eval.run "${{ inputs.model_id }}" data/eval.jsonl out/results.jsonl
    python -m eval.report out/results.jsonl > report.txt
    cat report.txt >> "$GITHUB_STEP_SUMMARY"
- name: Compare with baseline
  run: |
    python -m eval.compare out/results.jsonl baselines/last_release.jsonl
eval.compare checks whether per-difficulty scores regressed beyond a threshold (say, 2 percentage points). Regressions block the deploy. This catches the “I changed the system prompt and didn’t realize it broke the ‘hard’ slice” failures that otherwise ship.
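The core of such a comparison might look like the sketch below. It assumes both files hold rows in the format eval/run.py writes and uses the 2-point threshold from above. The function names are illustrative and the rows are made up.

```python
from collections import defaultdict


def em_by_difficulty(rows: list) -> dict:
    """Exact-match rate per difficulty bucket."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[r["difficulty"]].append(r["exact_match"])
    return {d: sum(v) / len(v) for d, v in buckets.items()}


def regressions(current: list, baseline: list, threshold: float = 0.02) -> dict:
    """Map difficulty -> (baseline_em, current_em) for drops larger than `threshold`."""
    cur = em_by_difficulty(current)
    base = em_by_difficulty(baseline)
    out = {}
    for difficulty, base_em in base.items():
        # A slice missing from the current run counts as a score of 0.
        cur_em = cur.get(difficulty, 0.0)
        if base_em - cur_em > threshold:
            out[difficulty] = (base_em, cur_em)
    return out


def make_rows(difficulty: str, outcomes: list) -> list:
    """Build made-up result rows for the demo."""
    return [{"difficulty": difficulty, "exact_match": ok} for ok in outcomes]


baseline = make_rows("easy", [True, True]) + make_rows("hard", [True, True, False, True])
current = make_rows("easy", [True, True]) + make_rows("hard", [True, False, False, True])

failed = regressions(current, baseline)
for difficulty, (old, new) in failed.items():
    print(f"REGRESSION difficulty={difficulty} {old:.1%} -> {new:.1%}")
```

In CI, the script would exit non-zero when the result is not empty so the step fails and the deploy stops.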
Pin the eval set in git, hash it, and refuse to run against a hash you don’t recognize. If someone updates the eval set, that’s a separate commit reviewed separately. Otherwise you can silently grade-inflate by removing hard examples.
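The hash gate can be as small as this. It is a sketch: check_dataset and the idea of keeping an allow-list of known hashes in the repo are my assumptions about one way to do it, and the demo bytes stand in for the contents of data/eval.jsonl.

```python
import hashlib


def dataset_sha256(data: bytes) -> str:
    """Hex SHA-256 of the raw eval-set bytes."""
    return hashlib.sha256(data).hexdigest()


def check_dataset(data: bytes, known_hashes: set) -> str:
    """Return the digest if it is on the allow-list, otherwise refuse to run."""
    digest = dataset_sha256(data)
    if digest not in known_hashes:
        raise ValueError(f"unrecognized eval set hash {digest[:12]}; refusing to run")
    return digest


# Stand-in bytes; in practice read the file with open(path, "rb").read().
original = b'{"id":"job-0001","difficulty":"hard"}\n{"id":"job-0002","difficulty":"easy"}\n'
tampered = b'{"id":"job-0002","difficulty":"easy"}\n'  # the hard example was removed

known = {dataset_sha256(original)}
accepted = check_dataset(original, known)
print("accepted", accepted[:12])
```

Hashing raw bytes means even a whitespace change produces a new hash, which is the point: any edit to the eval set has to show up as a reviewed change to the allow-list.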
Wrapping Up
A benchmark you wrote and own is worth more than a benchmark someone else wrote, every single time. Public benchmarks have a place — they catch broken installs and give you a north star for general capability — but they don’t decide what ships. Build your custom suite early, run it on every change, and slice the results until you understand which examples fail and why. Once you have that loop, every other piece of this series (model selection, fine-tuning, RAG) becomes a controlled experiment instead of a guess. See the lm-evaluation-harness repo for the long list of tasks I didn’t run, and the rest of this series for what to do with the answers your benchmark gives you.