Fine Tuning SLMs with LoRA and QLoRA, A Hands On Tutorial
TL;DR: Fine-tune SLMs with QLoRA when VRAM is tight, plain LoRA when it isn’t. 1,000-5,000 high-quality examples beat 100k mediocre ones. Merge the adapter into the base weights, then serve the merged checkpoint through vLLM or convert it to GGUF for llama.cpp. Skip full fine-tuning unless you really know why.
Most teams reach for fine-tuning too early. If your task is constrained extraction, classification, or routing, the right answer is usually better prompts plus structured output. Fine-tuning is what you do after you’ve exhausted prompting — when the model has the underlying capability but consistently misses your specific style, schema, or domain vocabulary.
When you do reach that point, LoRA and QLoRA are the right tools for SLMs. Full fine-tuning a 3B model with AdamW in mixed precision needs roughly 16 bytes per parameter for weights, gradients, and optimizer state, which is about 48GB before activations, so it does not fit on a 24GB GPU without sharding or offloading. LoRA gets it done in an afternoon, on a single 16GB card if you’re careful. QLoRA gets it down to 8GB. The cost is a small quality gap relative to full fine-tuning, which for most tasks is invisible.
This post is a worked example. We’re going to fine-tune Llama-3.2-3B-Instruct on a structured extraction task, evaluate it honestly, and ship the result. The code runs on a 16GB GPU. Versions: Python 3.12, transformers 4.47.1, peft 0.14.0, bitsandbytes 0.44.1, trl 0.13.0.
What LoRA Actually Is
LoRA (Low-Rank Adaptation) decomposes the weight update into two small matrices. For a weight matrix W of shape (d, k), instead of learning the full ΔW, you learn B of shape (d, r) and A of shape (r, k), where r (the rank) is tiny, typically 8, 16, 32, or 64. The product BA has the same shape as W, and the effective weight becomes W + (α/r)·BA, where α is a scalar.
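The base weights are frozen. Only A and B are trained. For a 3B model with r=16, that’s roughly 9 million trainable parameters if you adapt only the attention projections and about 24 million if you adapt every linear layer, instead of 3 billion. Gradient and optimizer-state memory drops by orders of magnitude. Quality drops a few percent on most tasks.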
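To make the shapes concrete, here is a minimal pure-Python sketch of a single LoRA-adapted layer. It assumes the convention B: (d, r) and A: (r, k), so that BA has the same shape as W. The toy sizes, the helper names `matvec` and `lora_forward`, and the random values are illustrative only; real implementations such as PEFT operate on GPU tensors.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x) without forming the d x k update."""
    base = matvec(W, x)                   # frozen path
    low_rank = matvec(B, matvec(A, x))    # trainable path: k -> r -> d
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

d, k, r, alpha = 8, 8, 2, 4               # toy sizes, chosen only for illustration
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, shape (d, k)
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, shape (r, k)
B = [[0.0] * r for _ in range(d)]                               # trainable, shape (d, r), starts at zero
x = [rng.gauss(0, 1) for _ in range(k)]

# With B initialized to zero the adapter contributes nothing,
# so training starts from exactly the base model's behavior.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d * k          # parameters in a full update of W
lora_params = r * (d + k)    # parameters in A and B combined
print(f"full update: {full_params} params, LoRA update: {lora_params} params")
```

The saving grows with layer size: the full update scales with d·k, while the LoRA update scales with r·(d + k).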
QLoRA adds one twist: the frozen base is loaded in 4-bit (NF4 quantization), and only the LoRA adapters live in higher precision. This squeezes a 3B model + LoRA into about 5GB of VRAM during training.
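A rough back-of-the-envelope calculation shows where the saving on the base weights comes from. This sketch assumes about 3.2 billion parameters, a quantization block size of 64, and double quantization of the per-block scales in groups of 256, as described in the QLoRA paper. It treats every parameter as quantized, which is an optimistic simplification (embeddings and norms usually stay in higher precision), and it ignores activations, adapters, and optimizer state.

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Memory for n_params weights stored at bits_per_param, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 3.2e9     # approximate size of a 3B-class model (assumption)
BLOCK = 64           # weights per NF4 quantization block
SUPER_BLOCK = 256    # scales per second-level block when double quantization is on

# Plain NF4: one fp32 scale per block of 64 weights.
overhead_single = 32 / BLOCK
# Double quantization: 8-bit scales, plus one fp32 constant per 256 scales.
overhead_double = 8 / BLOCK + 32 / (BLOCK * SUPER_BLOCK)

for label, bits in [
    ("bf16", 16),
    ("nf4", 4 + overhead_single),
    ("nf4 + double quant", 4 + overhead_double),
]:
    print(f"{label:>20}: {bits:6.3f} bits/param -> {weight_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions the 4-bit base weights come out at roughly a quarter of their bf16 size, and the rest of the training footprint is activations, adapters, and optimizer state.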
Setting Up
Step 1, install dependencies
python3.12 -m venv .venv
source .venv/bin/activate
pip install \
"torch==2.5.1" \
"transformers==4.47.1" \
"peft==0.14.0" \
"bitsandbytes==0.44.1" \
"trl==0.13.0" \
"datasets==3.2.0" \
"accelerate==1.2.1"
Step 2, prepare a dataset
For this tutorial, our task is “given a job posting paragraph, extract title, company, location, and remote-friendliness as JSON.” A real dataset would be a few thousand examples. Here’s the schema:
# Each row:
{
  "input": "We're hiring a Senior Backend Engineer at Acme Corp...",
  "output": {"title": "Senior Backend Engineer", "company": "Acme Corp",
             "location": "Remote", "remote": true}
}
I keep my data as JSONL on disk. The dataset object:
from datasets import load_dataset

ds = load_dataset("json", data_files={
    "train": "data/train.jsonl",
    "valid": "data/valid.jsonl",
})
print(ds["train"][0])
print(f"train: {len(ds['train'])}, valid: {len(ds['valid'])}")
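Before training on these rows, it can help to check that each one matches the schema. The sketch below assumes the four fields from the example row, with string types for `title`, `company`, and `location` and a boolean for `remote`. The `validate_row` helper and the `EXPECTED_TYPES` table are hypothetical names introduced for illustration; adjust them if your schema allows nulls or other types.

```python
EXPECTED_TYPES = {"title": str, "company": str, "location": str, "remote": bool}

def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one dataset row (empty list means OK)."""
    problems = []
    if not isinstance(row.get("input"), str) or not row["input"].strip():
        problems.append("input must be a non-empty string")
    output = row.get("output")
    if not isinstance(output, dict):
        return problems + ["output must be an object"]
    for field, expected in EXPECTED_TYPES.items():
        if field not in output:
            problems.append(f"missing field: {field}")
        elif not isinstance(output[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    extra = set(output) - set(EXPECTED_TYPES)
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    return problems

good = {
    "input": "We're hiring a Senior Backend Engineer at Acme Corp...",
    "output": {"title": "Senior Backend Engineer", "company": "Acme Corp",
               "location": "Remote", "remote": True},
}
bad = {"input": " ", "output": {"title": "X", "remote": "yes"}}

print(validate_row(good))
print(validate_row(bad))
```

Running this over every row and refusing to train when any list is non-empty keeps schema errors out of the adapter.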
Step 3, format as chat templates
The model expects messages, not raw input-output pairs. We use the model’s own chat template so the format matches inference exactly.
from transformers import AutoTokenizer
import json

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL_ID)

SYSTEM = """You are an extractor. Reply with strict JSON only, no commentary.
Schema: {title, company, location, remote (bool)}."""

def format_example(ex):
    # Build the same system/user/assistant turns the model will see at inference.
    msgs = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": ex["input"]},
        {"role": "assistant", "content": json.dumps(ex["output"])},
    ]
    text = tok.apply_chat_template(msgs, tokenize=False)
    return {"text": text}

ds = ds.map(format_example, remove_columns=ds["train"].column_names)
QLoRA Training Loop
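We’ll use trl’s SFTTrainer because it handles the boilerplate (tokenization, padding, gradient accumulation) correctly. One caveat: with a plain text field and the default collator, the loss is computed over the whole sequence, prompt included. If you want loss on the assistant turn only, pass trl’s DataCollatorForCompletionOnlyLM.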
Step 1, load the model in 4-bit
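import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_cfg,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # "sdpa" ships with PyTorch. Use "flash_attention_2" only if the separate
    # flash-attn package is installed; it is not in the pip list above.
    attn_implementation="sdpa",
)
model = prepare_model_for_kbit_training(model)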
prepare_model_for_kbit_training freezes the base weights, casts the remaining non-quantized parameters (layer norms, embeddings) to fp32 to keep numerics stable, and enables gradient checkpointing by default.
Step 2, attach the LoRA adapter
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# trainable params: ~24M || all params: ~3.2B || trainable: 0.75%
target_modules matters. Including all linear layers (attention + MLP) consistently outperforms attention-only LoRA in my experiments, at the cost of roughly 2.5x more trainable params (about 24M versus about 9M at r=16 on this model). For 3B models the extra is cheap.
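The trainable-parameter count can be checked by hand, since each adapted projection with `in_features` inputs and `out_features` outputs adds r × (in + out) parameters. The sketch below assumes the Llama-3.2-3B dimensions as I understand them (hidden size 3072, MLP size 8192, 28 layers, and a 1024-wide K/V projection from grouped-query attention); confirm them against the model's `config.json` before relying on the numbers. The `lora_params` helper is a hypothetical name.

```python
def lora_params(shapes, targets, r, n_layers):
    """Each adapted (in_features, out_features) projection adds r * (in + out) parameters."""
    return n_layers * sum(r * (shapes[t][0] + shapes[t][1]) for t in targets)

# Assumed model dimensions; check config.json for the checkpoint you actually use.
HIDDEN, KV, MLP, LAYERS = 3072, 1024, 8192, 28
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}
ATTENTION = ["q_proj", "k_proj", "v_proj", "o_proj"]
ALL_LINEAR = ATTENTION + ["gate_proj", "up_proj", "down_proj"]

attn_only = lora_params(SHAPES, ATTENTION, r=16, n_layers=LAYERS)
all_linear = lora_params(SHAPES, ALL_LINEAR, r=16, n_layers=LAYERS)
print(f"attention only: {attn_only:,}")    # 9,175,040
print(f"all linear:     {all_linear:,}")   # 24,313,856
print(f"ratio:          {all_linear / attn_only:.2f}x")
```

Under these assumptions the all-linear figure agrees with the ~24M reported by `print_trainable_parameters()`.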
Step 3, configure training
from trl import SFTConfig, SFTTrainer

cfg = SFTConfig(
    output_dir="out/extractor-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.0,
    optim="paged_adamw_8bit",
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    max_seq_length=1024,
    dataset_text_field="text",
    packing=False,
    # The chat template already inserts the BOS token, so don't add it a second time.
    dataset_kwargs={"add_special_tokens": False},
    report_to="none",
)
trainer = SFTTrainer(
    model=model,
    args=cfg,
    train_dataset=ds["train"],
    eval_dataset=ds["valid"],
    processing_class=tok,  # replaces the deprecated tokenizer= argument in recent trl
)
trainer.train()
trainer.save_model("out/extractor-lora/final")
Hyperparameters that matter most:
- learning_rate=2e-4 is a common LoRA starting point. Don’t use the full-fine-tuning value of 2e-5.
- lora_alpha=32 with r=16 gives a scaling factor of 2.0. The ratio alpha/r is what matters, not the absolute values.
- optim="paged_adamw_8bit" uses 8-bit AdamW with paged memory, which keeps the optimizer state much smaller.
- packing=False for short examples. For long-text fine-tuning, packing=True is a big throughput win.
Training time on a 16GB GPU with 5000 examples, 3 epochs, ~512 token sequences: roughly 30-45 minutes.
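The config implies a handful of derived numbers that are worth sanity-checking before a run: the effective batch size, the number of optimizer steps, and the warmup length. The sketch below is an approximation using plain ceiling arithmetic. The `schedule_summary` helper is a hypothetical name, and the Trainer's own rounding may differ by a few steps.

```python
import math

def schedule_summary(n_examples, per_device_batch, grad_accum, epochs, warmup_ratio, n_gpus=1):
    """Approximate optimizer-step counts implied by a training config."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# Values from the config above, with the 5000-example dataset size from the timing estimate.
summary = schedule_summary(n_examples=5000, per_device_batch=4, grad_accum=4,
                           epochs=3, warmup_ratio=0.03)
print(summary)

# LoRA scaling factor applied to the BA update.
lora_alpha, r = 32, 16
print(f"lora scale: {lora_alpha / r}")
```

With under a thousand optimizer steps in total, a 3% warmup is only a few dozen steps, which is why a short run is sensitive to the peak learning rate.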
Architecture of the Training Setup
                  Frozen 4-bit base weights
                    +-------------------+
                    |  W (NF4, 4 bits)  |
                    +---------+---------+
                              |
                              | forward
                              v
+-------------+   in   +---------------------+   out   +------------+
|    Input    |------->| y = Wx + (α/r)·BAx  |-------->|   Output   |
|   (bf16)    |        +---------------------+         |   (bf16)   |
+-------------+           ^             ^              +------------+
                          |             |
                    +-----------+ +-----------+
                    | A (bf16)  | | B (bf16)  |
                    |  (r x k)  | |  (d x r)  |
                    +-----------+ +-----------+
                          ^             ^
                          |             |
                          +------+------+
                                 |
                          gradient updates
                         (paged AdamW 8-bit)
Only A and B, at the bottom of the diagram, receive gradient updates. The base weights stay in 4-bit NF4 on the GPU and never change.
Evaluating Honestly
Loss curves are not evaluation. They tell you whether training is progressing, not whether your model is useful. For an extraction task, evaluate on schema validity and field accuracy.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# MODEL_ID and SYSTEM are the same constants used when formatting the training data.
# Read the raw validation rows from disk: ds["valid"] now holds only the "text" column.
with open("data/valid.jsonl") as f:
    eval_data = [json.loads(line) for line in f if line.strip()]

base = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto",
)
model = PeftModel.from_pretrained(base, "out/extractor-lora/final")
model.eval()
tok = AutoTokenizer.from_pretrained(MODEL_ID)

correct, valid_json, total = 0, 0, 0
for ex in eval_data:
    msgs = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": ex["input"]},
    ]
    inp = tok.apply_chat_template(msgs, return_tensors="pt",
                                  add_generation_prompt=True).to(model.device)
    with torch.no_grad():
        out = model.generate(inp, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    text = tok.decode(out[0][inp.shape[1]:], skip_special_tokens=True)
    total += 1
    try:
        pred = json.loads(text)
        valid_json += 1
        if pred == ex["output"]:
            correct += 1
    except json.JSONDecodeError:
        pass

print(f"valid_json: {valid_json/total:.2%} exact_match: {correct/total:.2%}")
Track both metrics. Valid JSON tells you whether the model learned the format. Exact match tells you whether it learned the content. Before fine-tuning, baseline Llama-3.2-3B-Instruct on my actual extraction task scored 67% valid JSON, 41% exact match. After 3 epochs of LoRA on 4000 examples: 99% valid JSON, 88% exact match.
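Exact match hides which field is failing. A per-field breakdown is a small addition; the sketch below assumes you have collected `(raw_model_text, gold_dict)` pairs from the evaluation loop. The `field_accuracy` helper and the sample pairs are hypothetical and are not results from the run described above.

```python
import json
from collections import Counter

FIELDS = ("title", "company", "location", "remote")

def field_accuracy(pairs):
    """pairs: iterable of (raw_model_text, gold_dict). Returns accuracy per field."""
    hits, total = Counter(), 0
    for text, gold in pairs:
        total += 1
        try:
            pred = json.loads(text)
        except json.JSONDecodeError:
            continue  # invalid JSON scores zero on every field
        if not isinstance(pred, dict):
            continue  # valid JSON but not an object, e.g. a bare string
        for field in FIELDS:
            if field in pred and pred[field] == gold.get(field):
                hits[field] += 1
    return {field: hits[field] / total for field in FIELDS} if total else {}

gold = {"title": "Senior Backend Engineer", "company": "Acme Corp",
        "location": "Remote", "remote": True}
pairs = [
    (json.dumps(gold), gold),                                # fully correct
    (json.dumps({**gold, "title": "Backend Engineer"}), gold),  # wrong title only
    ("Sure! Here is the JSON you asked for", gold),          # not JSON at all
]
print(field_accuracy(pairs))
```

A field that lags the others usually points at inconsistent labels for that field rather than at a training problem.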
Shipping the Adapter
You have three options for deployment.
Option A, keep the adapter separate (development)
base = AutoModelForCausalLM.from_pretrained(MODEL_ID, ...)
model = PeftModel.from_pretrained(base, "out/extractor-lora/final")
This is fine for dev. Inference is slightly slower because every adapted projection runs two extra small matmuls (A, then B) alongside the frozen weight.
Option B, merge into the base weights (HF deployment)
merged = model.merge_and_unload()
merged.save_pretrained("out/extractor-merged")
tok.save_pretrained("out/extractor-merged")
The merged model is a normal HF checkpoint. Serve it through vLLM or transformers normally.
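Merging works because the adapter path and the folded weights are algebraically the same thing: W x + (α/r)·B(A x) equals (W + (α/r)·BA) x. The pure-Python sketch below checks that on a toy layer, assuming B has shape (d, r) and A has shape (r, k). The helper names and sizes are illustrative, and this is not how PEFT implements `merge_and_unload` internally.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A as a new matrix with W's shape."""
    scale = alpha / r
    return [
        [W[i][j] + scale * sum(B[i][t] * A[t][j] for t in range(r))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]

d, k, r, alpha = 5, 4, 2, 4   # toy sizes for illustration
rng = random.Random(1)
W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(r)]
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]
x = [rng.gauss(0, 1) for _ in range(k)]

# Adapter kept separate: two extra small matmuls per call.
scale = alpha / r
adapter_out = [w + scale * l for w, l in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Adapter merged: one matmul against the folded weights.
merged_out = matvec(merge(W, A, B, alpha, r), x)

same = all(math.isclose(a, m, rel_tol=1e-9, abs_tol=1e-9)
           for a, m in zip(adapter_out, merged_out))
print("outputs match:", same)
```

The two paths agree up to floating-point rounding, which is why the merged checkpoint needs no PEFT dependency at serving time.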
Option C, convert to GGUF (llama.cpp / Ollama deployment)
# inside llama.cpp checkout
python convert_hf_to_gguf.py /path/to/out/extractor-merged \
--outfile extractor-3b-f16.gguf --outtype f16
./build/bin/llama-quantize extractor-3b-f16.gguf extractor-3b-Q4_K_M.gguf Q4_K_M
Now you can write a Modelfile whose FROM line points at this GGUF, run ollama create with that Modelfile, and treat the result like any other Ollama model.
Common Pitfalls
- Overfitting on tiny datasets. With 200 examples and 5 epochs, your model memorizes. The valid loss looks fine because the valid set is from the same distribution. The model dies on anything slightly out-of-distribution. Fix: collect more data, or aggressively reduce epochs and add dropout.
- Wrong target_modules. Targeting only q_proj and v_proj (the original LoRA paper recommendation) leaves capacity on the table. Modern practice is all linear layers. Fix: include MLP projections (gate_proj, up_proj, down_proj).
- Chat template drift between training and inference. If you train with one system prompt and infer with another, you’ve created a distribution shift the LoRA wasn’t trained for. Fix: pin the system prompt with the dataset version. Treat them as one artifact.
- Forgetting to evaluate the base model. You can’t claim improvement without a baseline. Sometimes you spend a weekend training and the base model was already at 90%, so the LoRA gives you 92% and isn’t worth the operational cost. Fix: always benchmark the base model on your eval set first.
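One way to treat the system prompt and the dataset as a single artifact is to hash them together and store the result next to the adapter. This is a minimal sketch of that idea; `artifact_fingerprint` is a hypothetical helper, and the 12-character truncation is an arbitrary choice for readability.

```python
import hashlib
import json

def artifact_fingerprint(system_prompt: str, rows: list[dict]) -> str:
    """Hash the system prompt together with the dataset rows into one short ID."""
    h = hashlib.sha256()
    h.update(system_prompt.encode("utf-8"))
    for row in rows:
        h.update(b"\n")
        # sort_keys makes the hash independent of key order within a row.
        h.update(json.dumps(row, sort_keys=True, ensure_ascii=False).encode("utf-8"))
    return h.hexdigest()[:12]

SYSTEM = "You are an extractor. Reply with strict JSON only, no commentary."
rows = [
    {"input": "We're hiring a Senior Backend Engineer at Acme Corp...",
     "output": {"title": "Senior Backend Engineer", "company": "Acme Corp",
                "location": "Remote", "remote": True}},
]

trained_with = artifact_fingerprint(SYSTEM, rows)
serving_with = artifact_fingerprint(SYSTEM + " Be brief.", rows)
print(trained_with, serving_with, trained_with == serving_with)
```

Comparing the stored fingerprint against the one computed at serving time turns silent prompt drift into a loud startup check.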
Troubleshooting
Symptom: Loss is NaN after a few steps.
Diagnose: Probably learning rate too high or fp16 numerical issues. Lower LR to 1e-4. Switch to bf16 if hardware supports it. Make sure prepare_model_for_kbit_training was called.
Symptom: Training works but the merged model is broken.
Diagnose: You probably merged into the 4-bit quantized base. Merging there pushes the combined weights back through quantization, which distorts them. Re-load the base in bf16, attach the adapter, then call merge_and_unload.
Symptom: Out of memory at batch size 1.
Diagnose: Sequence length is too long, or target_modules is too broad. Reduce max_seq_length, confirm gradient checkpointing is on (prepare_model_for_kbit_training enables it by default; otherwise call model.gradient_checkpointing_enable()), or lower r to shrink the adapter.
Building the Dataset Itself
I glossed over this but it’s the highest-leverage activity in fine-tuning. A few practical notes.
Get real inputs, not synthetic ones. If your task is extracting fields from customer emails, your training inputs should be customer emails. Synthetic ones from GPT-4 work for prototyping but introduce subtle distribution differences that hurt at deployment.
Label with the schema you’ll actually use. Don’t fine-tune on a JSON schema you might change next month. Lock the schema first, then label.
Audit difficulty distribution. A dataset that’s 90% easy examples teaches the model nothing useful. Aim for roughly 30% easy, 50% medium, 20% hard. The hard cases are where fine-tuning pays back its cost.
Hold out a real eval split. Five hundred labeled examples? Use 400 for training, 50 for eval-during-training, 50 you don’t touch until the final model. The held-out 50 is your ground truth.
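The three-way split can be made reproducible with a fixed seed, so the untouched test set stays the same across dataset revisions. The sketch below is one way to do it; `three_way_split` and the seed value are hypothetical choices. Deduplicating before splitting is advisable so the same input cannot land on both sides.

```python
import random

def three_way_split(rows, n_valid, n_test, seed=13):
    """Shuffle once with a fixed seed, then carve off test and validation sets."""
    if n_valid + n_test >= len(rows):
        raise ValueError("not enough rows to leave a training set")
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test

# Stand-in rows: in practice these are the parsed JSONL records.
rows = [{"id": i} for i in range(500)]
train, valid, test = three_way_split(rows, n_valid=50, n_test=50)
print(len(train), len(valid), len(test))
```

Write the test split to its own file once and never regenerate it, so later relabeling of the training data cannot leak into it.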
A minimal dataset audit script:
import json
from collections import Counter

# Load one JSON object per line, skipping blank lines.
with open("data/train.jsonl") as f:
    rows = [json.loads(line) for line in f if line.strip()]
print(f"total: {len(rows)}")

# Input length in characters (not tokens): median and 95th percentile.
lengths = sorted(len(r["input"]) for r in rows)
print(f"input length p50/p95: "
      f"{lengths[len(lengths) // 2]} / {lengths[int(len(lengths) * 0.95)]}")

# Check for duplicates
inputs = [r["input"] for r in rows]
dups = [k for k, v in Counter(inputs).items() if v > 1]
print(f"duplicates: {len(dups)}")

# Check label field coverage
labels = [r["output"] for r in rows]
for k in labels[0].keys():
    vals = [l[k] for l in labels if k in l]
    print(f"  {k}: {len(vals)} present, {len(set(map(str, vals)))} unique")
Run this on every dataset update. It catches the boring problems (duplicates, missing fields, length outliers) that ruin training runs before they start.
Continuing from a Checkpoint
Long training runs sometimes get interrupted. Resuming is one line.
trainer.train(resume_from_checkpoint=True)
This picks up the most recent checkpoint in output_dir. If you want a specific one, pass the path. The optimizer state and scheduler state are restored, so the resumed run continues as if uninterrupted.
For very long runs, set save_strategy="steps" and save_steps to checkpoint every 100-500 steps (with save_strategy="epoch", save_steps is ignored). Disk is cheap; redoing 4 hours of training because the cluster preempted is expensive.
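If you want to see which checkpoint a resume would pick up, the lookup is easy to reproduce. The Trainer writes checkpoints as `checkpoint-<step>` directories, and transformers ships its own helper for this (`get_last_checkpoint` in `transformers.trainer_utils`). The version below is a dependency-free sketch of the same idea, with `latest_checkpoint` as a hypothetical name.

```python
import re
import tempfile
from pathlib import Path

_CHECKPOINT = re.compile(r"^checkpoint-(\d+)$")

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare by step number, not by name: "checkpoint-1000" sorts before
    # "checkpoint-500" alphabetically.
    return max(candidates, key=lambda c: c[0])[1]

# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-500", "checkpoint-1000", "final"):
        (Path(tmp) / name).mkdir()
    print(latest_checkpoint(tmp).name)
```

Passing the returned path as `resume_from_checkpoint` makes the choice of checkpoint explicit in logs.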
Wrapping Up
LoRA and QLoRA make SLM fine-tuning practical on hardware you can afford. The recipe is small: pick the right base, write a clean chat-template dataset, target all linear layers at rank 16-32, run for 2-3 epochs, evaluate honestly, and ship the merged model. The mistakes that cost time are dataset-shaped, not training-loop-shaped. Spend your effort there. See the PEFT docs and TRL docs for advanced variants like DoRA, rsLoRA, and ORPO, once the basics are working.