Benchmarking SLMs for Your Use Case, From Lmeval to Custom Suites
January 29, 2025 · 10 min read · by Muhammad Amal

TL;DR — Run lm-evaluation-harness for a calibration baseline, then build a custom suite around your real task. A 200-example custom eval set you own beats 10 public benchmarks. Pin everything (seed, sampler, prompt format, model digest) or the numbers are gossip.

I’ve watched teams pick models off the Open LLM Leaderboard, deploy them, and discover their MMLU score did not predict whether the model could extract a phone number from a customer email. The leaderboard isn’t lying. It’s just measuring something different from your task. If you don’t have your own benchmark, you don’t know whether your model is good. You only know whether someone else’s model is good at someone else’s problem.

This post is the evaluation workflow I use after building a local RAG system or fine-tuning an SLM. Two layers: a baseline pass with lm-evaluation-harness to confirm the model isn’t broken, then a custom task-specific suite that’s the actual decision-maker. Both layers matter. Skipping the first hides regressions; skipping the second hides task-specific failure.

Versions: Python 3.12, lm-evaluation-harness 0.4.5 (commit b281b09 from Dec 2024), Ollama 0.5.4, and the usual SLMs from earlier in this series.

What Public Benchmarks Actually Measure

A short and incomplete decoder ring:

  • MMLU: multiple-choice general knowledge across 57 subjects. Measures knowledge breadth, not reasoning depth or instruction following.
  • HumanEval / MBPP: Python coding. Useful only if you ship code generation.
  • GSM8K: grade school math word problems. Mostly measures whether the model can carry arithmetic through reasoning steps.
  • TruthfulQA: tendency to repeat common misconceptions. Adversarial by design; SLMs do poorly and that’s mostly fine.
  • IFEval: instruction following with verifiable constraints. Closest to “follows my prompts” of any public benchmark.
  • MT-Bench / Arena-Hard: open-ended chat quality. Judged by GPT-4. Useful as a vibe check, terrible as a primary metric.

For SLM-shaped work, IFEval is the most predictive of “will this model do what I tell it to.” MMLU is the least.

Step 1, Run the Harness Baseline

The harness is the standard tool for reproducing public benchmark scores. Install it:

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
git checkout b281b09
pip install -e ".[vllm]"

Run a small task suite against a local model via vLLM:

lm_eval \
  --model vllm \
  --model_args "pretrained=meta-llama/Llama-3.2-3B-Instruct,dtype=bfloat16,gpu_memory_utilization=0.85" \
  --tasks ifeval,gsm8k,arc_easy \
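  --batch_size auto \
  --output_path results/llama32-3b

No --num_fewshot flag is passed, so each task runs with its own default shot count (gsm8k defaults to 5-shot); forcing --num_fewshot 0 would override that for every task. A few hours later (depending on hardware) you have a JSON file with scores. For Llama-3.2-3B-Instruct, ballpark numbers in my runs: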

ifeval        : 75.4% (prompt-level strict)
gsm8k         : 44.3% (5-shot)
arc_easy      : 71.2%

Use these to confirm the model is performing where the literature says it should. If your number is wildly off, you have a setup bug (wrong dtype, wrong template, broken tokenizer). Fix the setup before going further.

Why this matters even though it doesn’t measure your task

Two reasons. First, a baseline lets you spot environmental issues. If you score 30% on ARC-easy when the public number is 71%, something is broken regardless of your downstream task. Second, when you fine-tune or quantize later, you want to make sure you haven’t regressed on general capability. Run the same baseline after the change.
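
The before/after check can be scripted rather than eyeballed. The following is a minimal sketch, assuming the harness results file has a top-level "results" object keyed by task, with metric names such as "acc,none" (that is what I see in 0.4.x output; verify against your own file). The function names and the 0.02 tolerance are illustrative, and the comparison assumes higher is better for every metric it sees.

```python
import json

def load_metrics(path: str) -> dict[tuple[str, str], float]:
    """Flatten a harness results file into {(task, metric): value}."""
    with open(path) as f:
        results = json.load(f)["results"]  # assumed layout, see lead-in
    flat = {}
    for task, metrics in results.items():
        for name, value in metrics.items():
            # Keep numeric scores only; skip stderr entries, aliases, and bools.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if is_number and "stderr" not in name:
                flat[(task, name)] = float(value)
    return flat

def regressions(before: dict, after: dict, tolerance: float = 0.02) -> list[str]:
    """List metrics present in both runs that dropped by more than `tolerance`."""
    lines = []
    for key in sorted(before.keys() & after.keys()):
        if before[key] - after[key] > tolerance:
            task, metric = key
            lines.append(f"{task} {metric}: {before[key]:.3f} -> {after[key]:.3f}")
    return lines
```

Call load_metrics on the results file from before the change and on the one from after, then pass both to regressions; an empty list means nothing dropped by more than the tolerance.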

Step 2, Define Your Real Task

Now the work that matters. A custom benchmark needs four pieces:

  1. A representative input set. 200-500 examples that look like real user inputs.
  2. Ground-truth labels. What you’d consider a correct answer for each.
  3. A scoring function. Programmatic, deterministic, fast.
  4. A reporting format. Per-example, not just aggregate.

The hardest piece is the labels. Skimping here is the most common mistake. Spend an afternoon hand-labeling 200 real examples and you have a benchmark that beats anything off-the-shelf.

Step 2a, define the schema

For our running extraction example:

# eval/schema.py
from typing import Literal

from pydantic import BaseModel

class ExtractionLabel(BaseModel):
    title: str
    company: str
    location: str
    remote: bool

class ExtractionExample(BaseModel):
    id: str
    input: str
    label: ExtractionLabel
    # Literal makes validation reject typos such as "hrad".
    difficulty: Literal["easy", "medium", "hard"]
    notes: str = ""

difficulty and notes are gold. They let you slice scores by hardness and inspect why hard ones fail.

Step 2b, store as JSONL

{"id":"job-0001","input":"We're hiring a Senior Backend Engineer at Acme Corp, remote-friendly...","label":{"title":"Senior Backend Engineer","company":"Acme Corp","location":"Remote","remote":true},"difficulty":"easy"}
{"id":"job-0002","input":"Acme is looking for someone who can lead our backend team. Based in Singapore, hybrid OK.","label":{"title":"Backend Lead","company":"Acme","location":"Singapore","remote":false},"difficulty":"medium"}

Hand-label these. Yes, all of them. Yes, it’s tedious. No, GPT-4 can’t do it for you reliably enough to skip the step — it’ll get 90% right and the 10% it gets wrong are exactly the cases your benchmark needs to measure.
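
Hand-labeled files pick up typos, so it is worth linting the JSONL before trusting it. Here is a minimal stdlib-only sketch that checks the fields used in this post; the function name validate_dataset and the exact rules are my own illustration, not part of any library.

```python
import json
from collections import Counter

REQUIRED_LABEL_KEYS = {"title", "company", "location", "remote"}
DIFFICULTIES = {"easy", "medium", "hard"}

def validate_dataset(lines: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    seen_ids = Counter()
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # tolerate blank lines
        try:
            ex = json.loads(raw)
        except json.JSONDecodeError as e:
            problems.append(f"line {lineno}: invalid JSON ({e.msg})")
            continue
        if not isinstance(ex, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        ex_id = ex.get("id", f"<line {lineno}>")
        seen_ids[ex_id] += 1
        label = ex.get("label")
        if not isinstance(label, dict):
            label = {}
        missing = REQUIRED_LABEL_KEYS - label.keys()
        if missing:
            problems.append(f"{ex_id}: label missing {sorted(missing)}")
        elif not isinstance(label["remote"], bool):
            problems.append(f"{ex_id}: remote must be true/false")
        if ex.get("difficulty") not in DIFFICULTIES:
            problems.append(f"{ex_id}: unknown difficulty {ex.get('difficulty')!r}")
    problems += [f"{i}: duplicate id" for i, n in seen_ids.items() if n > 1]
    return problems
```

Feed it the file's lines (for example, open("data/eval.jsonl").read().splitlines()) and fix everything it reports before running a single model.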

Step 2c, write a scorer

# eval/score.py
from dataclasses import dataclass
import json

@dataclass
class Score:
    valid_json: bool
    field_correct: dict  # field name -> bool
    exact_match: bool

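def score(prediction_text: str, label: dict) -> Score:
    # Anything that does not parse to a JSON object counts as invalid output.
    try:
        pred = json.loads(prediction_text)
    except json.JSONDecodeError:
        pred = None
    if not isinstance(pred, dict):
        return Score(False, {k: False for k in label}, False)

    field_correct = {}
    for k, expected in label.items():
        got = pred.get(k)
        if isinstance(expected, str):
            # Strings: case-insensitive, whitespace-trimmed; a non-string is a miss.
            field_correct[k] = (isinstance(got, str)
                                and got.strip().lower() == expected.strip().lower())
        else:
            # Non-strings (the remote bool): same type and same value, so 1 != True.
            field_correct[k] = type(got) is type(expected) and got == expected
    exact = all(field_correct.values())
    return Score(True, field_correct, exact)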

Choose your strictness consciously. For extraction, I usually accept case-insensitive string match for company but exact match for remote (bool). For RAG-grounded QA, I’d switch to a semantic-equivalence check (an embedding cosine similarity over the answer and the label).
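
For the semantic-equivalence variant, the scoring logic itself is small; the embedding model is the part you have to choose. This is a rough sketch in which embed is a hypothetical callable you supply (any function mapping text to a vector), and the 0.85 threshold is a placeholder to tune against hand-checked pairs, not a recommended value.

```python
import math
from typing import Callable, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_match(
    answer: str,
    label: str,
    embed: Callable[[str], Sequence[float]],  # hypothetical: your embedding function
    threshold: float = 0.85,                  # placeholder, tune on labeled pairs
) -> bool:
    """True when the answer and the label embed close enough to each other."""
    return cosine(embed(answer), embed(label)) >= threshold
```

Keep the embedding model and the threshold fixed across runs, for the same reason the seed is fixed: changing either one changes the scores.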

Step 3, Run Your Suite

# eval/run.py
import json, time
from ollama import Client
from eval.score import score

client = Client()

def evaluate(model: str, dataset_path: str, out_path: str):
    # Close the file promptly and ignore blank lines in the JSONL.
    with open(dataset_path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    results = []
    t_total = time.monotonic()
    for ex in rows:
        msgs = [
            {"role": "system",
             "content": "Extract title, company, location, remote (bool) as JSON."},
            {"role": "user", "content": ex["input"]},
        ]
        t0 = time.monotonic()
        resp = client.chat(model=model, messages=msgs,
                           options={"temperature": 0.0, "seed": 42},
                           format={
                               "type": "object",
                               "properties": {
                                   "title":    {"type": "string"},
                                   "company":  {"type": "string"},
                                   "location": {"type": "string"},
                                   "remote":   {"type": "boolean"},
                               },
                               "required": ["title","company","location","remote"],
                           })
        dt = time.monotonic() - t0
        out = resp["message"]["content"]
        s = score(out, ex["label"])
        results.append({
            "id": ex["id"], "difficulty": ex["difficulty"],
            "valid_json": s.valid_json, "exact_match": s.exact_match,
            "field_correct": s.field_correct, "latency_s": round(dt, 3),
            "prediction": out,
        })
    with open(out_path, "w") as f:
        for r in results: f.write(json.dumps(r) + "\n")

    n = len(results)
    em  = sum(r["exact_match"] for r in results) / n
    vj  = sum(r["valid_json"]  for r in results) / n
    lat = sum(r["latency_s"]   for r in results) / n
    print(f"model={model}  n={n}  exact={em:.1%}  valid_json={vj:.1%}  avg_lat={lat:.2f}s  total={time.monotonic()-t_total:.0f}s")

if __name__ == "__main__":
    import sys
    evaluate(sys.argv[1], sys.argv[2], sys.argv[3])

temperature=0.0 and seed=42 are non-negotiable. Without them, your scores wobble run-to-run and you can’t reason about anything.
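
Pinning both settings does not by itself prove the runs are repeatable on your machine, so it is cheap insurance to run the suite twice and diff the outputs. A minimal sketch, assuming the per-example JSONL written by the runner above (with its id and prediction fields); diff_runs is a name I made up for illustration.

```python
import json

def load_predictions(path: str) -> dict[str, str]:
    """Map example id -> raw prediction text from a results JSONL file."""
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return {r["id"]: r["prediction"] for r in rows}

def diff_runs(path_a: str, path_b: str) -> list[str]:
    """Return the sorted ids whose predictions differ (or exist in only one run)."""
    a, b = load_predictions(path_a), load_predictions(path_b)
    return sorted(i for i in a.keys() | b.keys() if a.get(i) != b.get(i))
```

An empty list from two back-to-back runs of the same model is the evidence that the setup is deterministic; anything else is worth chasing before comparing models.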

Step 4, Slice and Inspect

Aggregate score is the headline. The signal is in the slices.

# eval/report.py
import json
from collections import defaultdict

def report(path: str):
    rows = [json.loads(l) for l in open(path)]
    by_diff = defaultdict(list)
    by_field = defaultdict(list)
    for r in rows:
        by_diff[r["difficulty"]].append(r["exact_match"])
        for k, v in r["field_correct"].items():
            by_field[k].append(v)
    for diff, vals in sorted(by_diff.items()):
        print(f"  difficulty={diff:6s}  n={len(vals):3d}  em={sum(vals)/len(vals):.1%}")
    for field, vals in sorted(by_field.items()):
        print(f"  field={field:12s}  n={len(vals):3d}  acc={sum(vals)/len(vals):.1%}")

if __name__ == "__main__":
    import sys
    report(sys.argv[1])

You’ll learn things like “the model nails title and company but bombs on remote.” That tells you where to focus next: better prompts? Few-shot examples? Fine-tuning? You can’t know without the slice.

Step 5, Compare Models

This is the payoff. Run the same suite against three or four candidates:

python -m eval.run llama3.2:3b-instruct-q4_K_M data/eval.jsonl out/llama32.jsonl
python -m eval.run phi3.5:3.8b-mini-instruct-q4_K_M data/eval.jsonl out/phi35.jsonl
python -m eval.run qwen2.5:3b-instruct-q4_K_M data/eval.jsonl out/qwen25.jsonl
python -m eval.run gemma2:2b-instruct-q4_K_M data/eval.jsonl out/gemma2.jsonl

Then a comparison table:

model      n    em        valid_json  avg_lat
llama32    200  82.0%     99.5%       0.41s
phi35      200  79.5%     100.0%      0.52s
qwen25     200  85.5%     99.0%       0.38s
gemma2     200  71.0%     97.5%       0.29s

Now you can make a decision based on something other than vibes. Notice that on this task, Qwen2.5 wins despite being slightly smaller than Phi-3.5. Public benchmarks would not have predicted this.
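
One caveat worth quantifying before crowning a winner: at n=200, each exact-match score carries sampling noise of roughly plus or minus five points, so a gap of a few points between two models may not be a real difference. The sketch below computes a Wilson score interval for a single proportion. It is a standard formula, not something from the harness, and it treats each model independently; a paired comparison on the same examples would be tighter. The counts 164 and 171 are simply 82.0% and 85.5% of 200 from the table.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Exact-match counts derived from the comparison table (n=200 each).
for name, correct in [("llama32", 164), ("qwen25", 171)]:
    lo, hi = wilson_interval(correct, 200)
    print(f"{name}: {correct / 200:.1%}  95% CI [{lo:.1%}, {hi:.1%}]")
```

If the intervals overlap heavily, look at the per-slice results or add more labeled examples before treating the ranking as settled.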

Architecture

+----------------+     +----------------+     +----------------+
|  Dataset       |---->|  Runner        |---->|  Predictions   |
|  (JSONL)       |     |  (per-model)   |     |  (JSONL)       |
+----------------+     +-------+--------+     +-------+--------+
                               |                      |
                               |                      v
                               |              +----------------+
                               |              |  Scorer        |
                               |              | (deterministic)|
                               |              +-------+--------+
                               |                      |
                               v                      v
                       +----------------+     +----------------+
                       |  Run metadata  |     |  Sliced report |
                       |  (model id,    |     |  (by difficulty|
                       |   seed, dtype) |     |   by field)    |
                       +----------------+     +----------------+

Every dimension that affects scores is pinned in the run metadata: model digest, quantization, seed, system prompt hash, scorer version. Without these, comparisons across time are meaningless.
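
The runner shown earlier does not write this metadata, so here is one way it might look. This is a sketch under assumptions: build_run_metadata and its field names are my own, and model_digest is whatever identifier you obtained for the exact weights (for Ollama, the digest of the pulled model).

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def sha256_text(text: str) -> str:
    """Hex SHA-256 of a string, used to fingerprint prompts."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def build_run_metadata(
    model: str,
    model_digest: str,       # assumed: supplied by the caller
    system_prompt: str,
    dataset_bytes: bytes,
    seed: int = 42,
    temperature: float = 0.0,
    scorer_version: str = "1",  # hypothetical: bump whenever score() changes
) -> dict:
    """Collect everything that can move a score into one JSON-serializable dict."""
    return {
        "model": model,
        "model_digest": model_digest,
        "system_prompt_sha256": sha256_text(system_prompt),
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "seed": seed,
        "temperature": temperature,
        "scorer_version": scorer_version,
        "python": platform.python_version(),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```

Write the dict next to the predictions file (for example with json.dump) so every results file can be traced back to the exact configuration that produced it.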

Common Pitfalls

  1. Letting the eval set leak into prompts. You few-shot the model with examples from the eval set, then test on those same examples. Score is meaningless. Fix: split your data into train, dev, and held_out. Touch held_out only at the very end.

  2. Reusing the eval set as a training set. Even worse than leaking: you tune your prompt against the eval set until it scores well, then declare victory. You’ve overfit to the eval. Fix: keep a separate dev set for iteration; lock the eval set away and run it once per release.

  3. Letting an LLM judge an LLM. “GPT-4 says my output is correct” is a measurement, but it’s measuring what GPT-4 thinks, not what is true. Fine for vibe checks; not fine for production decisions. Fix: use programmatic scorers for anything quantitative.

  4. Comparing models trained with different chat templates. If your runner sends the same hand-formatted prompt string to all four candidates, at most one of them sees the template it was trained on. Fix: route through each model’s apply_chat_template (or a chat API that applies the model’s own template, as Ollama’s chat endpoint does), never hand-roll.
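
For pitfalls 1 and 2, the split is easiest to keep honest when it is derived from the example id rather than from a shuffle, because an example then never migrates between splits as the dataset grows. A minimal sketch; the function name and the 20/20/60 proportions are illustrative assumptions.

```python
import hashlib

def assign_split(example_id: str, dev_pct: int = 20, held_out_pct: int = 20) -> str:
    """Deterministically map an example id to train, dev, or held_out."""
    # Hash the id into one of 100 buckets; the same id always lands in the same bucket.
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < held_out_pct:
        return "held_out"
    if bucket < held_out_pct + dev_pct:
        return "dev"
    return "train"
```

Draw few-shot examples only from train, iterate on prompts against dev, and run held_out once per release.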

Troubleshooting

Symptom: Scores fluctuate wildly between runs. Diagnose: Either you forgot temperature=0 and seed, or you’re hitting a different model than you think (Ollama auto-pulled a newer tag). Pin the model digest with ollama show --modelfile <name> and check the FROM hash.

Symptom: Public benchmark scores way off from published numbers. Diagnose: Almost always a tokenization or template bug. Re-run with --write_out to print the first few prompts exactly as the model receives them, and add --log_samples (alongside --output_path) to save the generated text for each example. If the prompts show stray or doubled template tokens (<|begin_of_text|>), the template is wrong.

Symptom: Custom benchmark says model A wins, production says model B wins. Diagnose: Your eval set doesn’t match production traffic. Sample 100 real production inputs, hand-label them, add to your eval, re-run. The drift is the lesson.

Continuous Evaluation in CI

Once your benchmark is stable, wire it into CI. Every model change (new fine-tune, new quantization, new prompt) runs the eval and posts a diff.

A minimal GitHub Actions step:

- name: Run SLM benchmark
  run: |
    python -m eval.run "${{ inputs.model_id }}" data/eval.jsonl out/results.jsonl
    python -m eval.report out/results.jsonl > report.txt
    cat report.txt >> $GITHUB_STEP_SUMMARY
- name: Compare with baseline
  run: |
    python -m eval.compare out/results.jsonl baselines/last_release.jsonl

eval.compare checks whether per-difficulty scores regressed beyond a threshold (say, 2 percentage points). Regressions block the deploy. This catches the “I changed the system prompt and didn’t realize it broke the ‘hard’ slice” failures that otherwise ship.
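
The post does not show eval.compare itself, so here is a minimal sketch of what it might contain, assuming both files are per-example JSONL as written by the runner. The function names and the 0.02 default are illustrative.

```python
import json
from collections import defaultdict

def em_by_difficulty(path: str) -> dict[str, float]:
    """Exact-match rate per difficulty slice from a results JSONL file."""
    buckets = defaultdict(list)
    with open(path) as f:
        for line in f:
            if line.strip():
                r = json.loads(line)
                buckets[r["difficulty"]].append(bool(r["exact_match"]))
    return {d: sum(v) / len(v) for d, v in buckets.items()}

def find_regressions(current: dict, baseline: dict, threshold: float = 0.02) -> list[str]:
    """Describe every slice that dropped by more than `threshold` or vanished."""
    out = []
    for diff, base in sorted(baseline.items()):
        cur = current.get(diff)
        if cur is None:
            out.append(f"{diff}: slice missing from current run")
        elif base - cur > threshold:
            out.append(f"{diff}: {base:.1%} -> {cur:.1%}")
    return out

def main(argv: list[str]) -> int:
    """Return 0 when nothing regressed, 1 on regressions, 2 on bad usage."""
    if len(argv) != 3:
        print("usage: python -m eval.compare CURRENT.jsonl BASELINE.jsonl")
        return 2
    problems = find_regressions(em_by_difficulty(argv[1]), em_by_difficulty(argv[2]))
    for p in problems:
        print("REGRESSION", p)
    return 1 if problems else 0
```

Saved as eval/compare.py with sys.exit(main(sys.argv)) under the usual __main__ guard, a non-zero exit code fails the CI step, which is what blocks the deploy.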

Pin the eval set in git, hash it, and refuse to run against a hash you don’t recognize. If someone updates the eval set, that’s a separate commit reviewed separately. Otherwise you can silently grade-inflate by removing hard examples.
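
The hash check can be a few lines at the top of the runner. A sketch, assuming you keep the allowed hashes in a reviewed file or constant of your own; require_known_dataset is a hypothetical helper name.

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large datasets are fine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def require_known_dataset(path: str, allowed_hashes: set[str]) -> str:
    """Return the dataset hash, or raise if it is not on the allow-list."""
    digest = file_sha256(path)
    if digest not in allowed_hashes:
        raise RuntimeError(f"unrecognized eval set {path}: sha256={digest}")
    return digest
```

Updating the allow-list then becomes the reviewed commit: a changed eval file with no matching hash change simply refuses to run.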

Wrapping Up

A benchmark you wrote and own is worth more than a benchmark someone else wrote, every single time. Public benchmarks have a place — they catch broken installs and give you a north star for general capability — but they don’t decide what ships. Build your custom suite early, run it on every change, and slice the results until you understand which examples fail and why. Once you have that loop, every other piece of this series (model selection, fine-tuning, RAG) becomes a controlled experiment instead of a guess. See the lm-evaluation-harness repo for the long list of tasks I didn’t run, and the rest of this series for what to do with the answers your benchmark gives you.