January 16, 2026 · 9 min read · by Muhammad Amal programming

TL;DR — LoRA fine-tuning teaches a small model one domain without retraining the whole thing / QLoRA with Unsloth makes it fit on a single consumer GPU in under an hour / merge the adapter and export to GGUF and you’ve got an edge-ready specialist.

A general-purpose 3B model is a decent generalist and a mediocre specialist. That’s the gap I run into constantly: the model understands English fine, but it doesn’t know my company’s product taxonomy, my client’s ticket categories, or the exact JSON shape a downstream service expects. Prompt engineering papers over some of it. Past a point, you’re fighting the model instead of using it.

Fine-tuning closes that gap, and LoRA makes fine-tuning cheap enough that it’s a routine engineering task rather than a research project. Instead of updating all 3 billion weights, LoRA freezes the base model and trains a small pair of low-rank matrices alongside each target layer. You end up modifying well under 1% of the parameters. The adapter is a few tens of megabytes, training runs on a single consumer GPU, and the base model stays untouched and reusable.


This post walks the full pipeline: building a domain dataset, training a LoRA adapter on Llama 3.2 3B with Unsloth and QLoRA, merging the result, and exporting to GGUF so it runs on edge hardware. It pairs naturally with the case for small language models at the edge — fine-tuning is how you make a small model genuinely good at the one thing you need.

LoRA and QLoRA in One Section

LoRA (Low-Rank Adaptation) rests on a simple observation: the weight update a model needs to learn a narrow task is approximately low-rank. So instead of learning a full update matrix for a layer, you learn two small matrices A and B whose product, scaled by alpha / r, stands in for it. The base weights stay frozen; only A and B train. Two hyperparameters matter:

  • rank (r) — the inner dimension of the A/B pair. Higher rank means more capacity and a larger adapter. For single-domain tuning, r=16 is a sensible default; r=8 for very narrow tasks, r=32+ if the domain is broad.
  • alpha — a scaling factor on the adapter’s contribution. The common convention is alpha = 2 * r.
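
To make the adapter size concrete, here is a minimal sketch that counts LoRA parameters when all seven projection modules are targeted, as in the configuration used later in this post. The layer shapes are assumptions taken from the published Llama 3.2 3B config (hidden size 3072, 1024-wide key/value projections, MLP size 8192, 28 layers); check them against the config.json of the checkpoint you actually load.

```python
# Rough LoRA parameter count. Shapes below are assumed from the public
# Llama 3.2 3B config; verify against your checkpoint's config.json.
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 3072, 1024, 8192, 28

# (in_features, out_features) for each targeted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A is (r x in) and B is (out x r), so the pair holds r * (in + out) weights."""
    return r * (in_features + out_features)


def adapter_params(r: int) -> int:
    """Total trainable LoRA weights across every targeted projection and layer."""
    per_layer = sum(lora_params(i, o, r) for i, o in PROJECTIONS.values())
    return per_layer * LAYERS


if __name__ == "__main__":
    for r in (8, 16, 32):
        n = adapter_params(r)
        print(f"r={r:>2}: {n:,} trainable parameters, ~{n * 2 / 1e6:.0f} MB at 16-bit")
```

Under these assumed shapes, r=16 comes to roughly 24 million trainable parameters, which is under 1% of a model with about 3.2 billion weights. The count scales linearly with r, so doubling the rank doubles the adapter.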

QLoRA goes one step further: it loads the frozen base model in 4-bit precision while keeping the trainable LoRA matrices in higher precision. Relative to 16-bit weights, the base model’s memory footprint drops roughly 4x, which is what lets a 3B model fine-tune comfortably on a 12-16GB consumer GPU. The accuracy cost is usually small: the quantized weights are frozen, and the adapter trains on top of them, so it can compensate for some of the quantization error.
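
The 4x figure is simple arithmetic on bits per weight. The sketch below is a back-of-envelope estimate only: the 3.2e9 parameter count is an assumed approximation for a "3B" model, and the function ignores activations, optimizer state, the adapter itself, and the small overhead 4-bit quantization adds for its scaling constants.

```python
# Back-of-envelope memory for the frozen base weights alone.
# Not a VRAM predictor: activations, optimizer state, and the LoRA
# adapter all come on top of this number.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed to hold the weights, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 3.2e9  # assumed approximate parameter count
    for label, bits in (("16-bit", 16), ("4-bit", 4)):
        print(f"{label}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

The gap between the two lines is the headroom QLoRA frees up for activations and gradients on a 12GB card.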

Unsloth is a training library that re-implements the hot paths with custom kernels. In practice it roughly doubles training throughput and cuts memory use versus a stock Hugging Face transformers + peft setup, with no change to the resulting model quality.

Environment Setup

You need an NVIDIA GPU with at least 12GB of VRAM. Anything from an RTX 3060 12GB upward handles a 3B model under QLoRA.

# Python 3.11, CUDA 12.x toolchain assumed.
python3 -m venv .venv
source .venv/bin/activate

# Versions current as of January 2026.
pip install "unsloth==2026.1.4"
pip install "trl==0.13.0" "datasets==3.2.0" "transformers==4.48.0"

Confirm the GPU is visible:

python3 -c "import torch; ok = torch.cuda.is_available(); print(ok, torch.cuda.get_device_name(0) if ok else 'no CUDA device visible')"

Step 1: Build the Dataset

This is where fine-tuning succeeds or fails. The model learns whatever your data shows it — quality and consistency beat volume. For a single bounded domain, a few hundred to a couple thousand clean, varied examples is plenty. A thousand consistent examples will outperform ten thousand sloppy ones.

Use a chat-style JSONL format. Each line is one training example as a list of messages.

{"messages": [{"role": "system", "content": "You are a support assistant for Acme Cloud. Classify the ticket and reply with a JSON object."}, {"role": "user", "content": "My instance won't boot after the latest update."}, {"role": "assistant", "content": "{\"category\": \"technical\", \"priority\": \"high\", \"team\": \"infra\"}"}]}
{"messages": [{"role": "system", "content": "You are a support assistant for Acme Cloud. Classify the ticket and reply with a JSON object."}, {"role": "user", "content": "Can I get an invoice for last month?"}, {"role": "assistant", "content": "{\"category\": \"billing\", \"priority\": \"low\", \"team\": \"finance\"}"}]}
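
Writing these lines by hand invites escaping mistakes, because the assistant turn is a JSON string nested inside a JSON document. One way to avoid that is to generate each line from structured labels. This is a sketch, not part of the original pipeline: make_example is a hypothetical helper, and the system prompt is copied from the examples above.

```python
import json

# System prompt copied from the examples above. Keeping it in one constant
# means every training example gets a byte-identical copy.
SYSTEM_PROMPT = (
    "You are a support assistant for Acme Cloud. "
    "Classify the ticket and reply with a JSON object."
)


def make_example(ticket: str, category: str, priority: str, team: str) -> str:
    """Return one JSONL line in the chat format shown above (hypothetical helper)."""
    label = {"category": category, "priority": priority, "team": team}
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ticket},
        # The assistant turn is itself a JSON string, so serialize it first.
        {"role": "assistant", "content": json.dumps(label)},
    ]
    return json.dumps({"messages": messages}, ensure_ascii=False)


if __name__ == "__main__":
    print(make_example("Can I get an invoice for last month?", "billing", "low", "finance"))
```

Because both layers go through json.dumps, the inner quotes are escaped correctly and the assistant content is always strict JSON.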

A few rules I hold to:

  • Identical system prompt across every example — if it varies, the model learns the variation as signal.
  • Cover the edges — include ambiguous and adversarial inputs, not just the easy middle.
  • Exact output format — if production needs strict JSON, every assistant message must be strict JSON, no prose around it.
  • Hold out 10-15% for evaluation. Never train on what you’ll evaluate with.

# split_dataset.py: deterministic train/eval split.
import json
import random

random.seed(42)  # reproducible splits

with open("domain_data.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

random.shuffle(rows)
cut = int(len(rows) * 0.88)
train, eval_ = rows[:cut], rows[cut:]

for name, data in (("train.jsonl", train), ("eval.jsonl", eval_)):
    with open(name, "w", encoding="utf-8") as f:
        for row in data:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

print(f"train={len(train)}  eval={len(eval_)}")
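
The consistency rules listed earlier can be checked mechanically before any GPU time is spent. The following is a minimal sketch under one assumption that the post does not state outright: every example is a single-turn conversation with exactly three messages (system, user, assistant). validate_rows is a hypothetical helper name.

```python
import json


def validate_rows(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the data looks clean."""
    problems = []
    system_prompts = set()
    for i, row in enumerate(rows, start=1):
        messages = row.get("messages", [])
        roles = [m.get("role") for m in messages]
        if roles != ["system", "user", "assistant"]:
            problems.append(f"row {i}: unexpected roles {roles}")
            continue
        system_prompts.add(messages[0].get("content"))
        try:
            parsed = json.loads(messages[2].get("content"))
        except (json.JSONDecodeError, TypeError):
            problems.append(f"row {i}: assistant content is not valid JSON")
            continue
        if not isinstance(parsed, dict):
            problems.append(f"row {i}: assistant content is not a JSON object")
    # A varying system prompt is something the model will learn as signal.
    if len(system_prompts) > 1:
        problems.append(f"{len(system_prompts)} distinct system prompts found")
    return problems


if __name__ == "__main__":
    # Tiny in-memory sample; in practice, load the rows from domain_data.jsonl.
    sample = [
        {"messages": [
            {"role": "system", "content": "You classify tickets."},
            {"role": "user", "content": "Where is my invoice?"},
            {"role": "assistant", "content": "Sure! It is billing."},
        ]},
    ]
    for problem in validate_rows(sample):
        print(problem)
```

Running a check like this on the full file before splitting keeps a single malformed row from leaking into both the training and evaluation sets.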

Step 2: Load the Model with Unsloth

Unsloth wraps model loading and applies the LoRA configuration.

# train.py
from unsloth import FastLanguageModel
import torch

MAX_SEQ_LEN = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=MAX_SEQ_LEN,
    dtype=None,          # auto-detect bf16 where supported
    load_in_4bit=True,   # this is the QLoRA part
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # rank
    lora_alpha=32,       # alpha = 2 * r
    lora_dropout=0.0,    # 0.0 is fine and slightly faster
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # cuts VRAM further
    random_state=42,
)

Targeting all seven projection modules — attention plus MLP — gives the adapter enough reach to actually learn the domain. Targeting only q_proj/v_proj trains faster but tends to underfit anything beyond trivial tasks.

Step 3: Train

Use TRL’s SFTTrainer. The hyperparameters below are a solid starting point for single-domain tuning on a few hundred to a couple thousand examples.

from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

train_ds = load_dataset("json", data_files="train.jsonl", split="train")

def format_chat(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False, add_generation_prompt=False
        )
    }

train_ds = train_ds.map(format_chat)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    dataset_text_field="text",
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size = 8
        warmup_steps=10,
        num_train_epochs=3,              # 2-3 is right; more risks overfit
        learning_rate=2e-4,              # LoRA tolerates higher LR than full FT
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
        max_seq_length=MAX_SEQ_LEN,
        report_to="none",
    ),
)

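stats = trainer.train()
print(f"Training loss: {stats.training_loss:.4f}")

# Save the adapter and tokenizer to the path evaluate.py and export.py load from.
model.save_pretrained("outputs/checkpoint-final")
tokenizer.save_pretrained("outputs/checkpoint-final")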

Watch the loss curve. A healthy LoRA run on a clean dataset drops steadily then flattens. If loss spikes or plateaus high, the dataset is inconsistent or the learning rate is wrong. If loss drives toward zero, you’re memorizing — pull back the epochs.

For a 3B model on ~1,500 examples, three epochs on an RTX 4090 finishes in roughly 15-25 minutes; on a 3060 12GB, closer to 45-60 minutes.
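
It helps to know how many optimizer steps a run will take, since warmup_steps and logging_steps are both expressed in steps. The sketch below is an approximation for a single-GPU run; the trainer's exact count can differ slightly depending on how it rounds a partial final batch. The 1,320-example figure is hypothetical (88% of 1,500).

```python
import math


def training_steps(n_examples: int, per_device_batch: int, grad_accum: int, epochs: int) -> dict:
    """Approximate optimizer-step counts for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


if __name__ == "__main__":
    # Hypothetical train split: 88% of 1,500 examples.
    plan = training_steps(1320, per_device_batch=2, grad_accum=4, epochs=3)
    print(plan)
```

With those numbers the run is about 495 steps, so warmup_steps=10 covers roughly the first 2% of training and logging_steps=10 yields around 49 points on the loss curve.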

Step 4: Evaluate Before You Trust It

Training loss is not a quality metric. Run the held-out set through the tuned model and check actual behavior.

# evaluate.py
import json
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/checkpoint-final",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enables the fast inference path

correct, total, malformed = 0, 0, 0
with open("eval.jsonl", encoding="utf-8") as f:
    for line in f:
        ex = json.loads(line)
        prompt_msgs = ex["messages"][:-1]
        expected = ex["messages"][-1]["content"]

        inputs = tokenizer.apply_chat_template(
            prompt_msgs, add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")
        out = model.generate(input_ids=inputs, max_new_tokens=64, do_sample=False)
        got = tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True).strip()

        total += 1
        try:
            if json.loads(got) == json.loads(expected):
                correct += 1
        except json.JSONDecodeError:
            malformed += 1  # the model didn't even produce valid JSON

print(f"exact match: {correct}/{total}  malformed JSON: {malformed}")

Two numbers matter here: exact-match accuracy and malformed-output count. A model that’s accurate but occasionally emits broken JSON is still a production hazard. If malformed is non-zero, you need more format-consistent training data — or constrained decoding at inference time.
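
Short of full constrained decoding, a cheap safeguard is to validate every response before it reaches the downstream service. This is a sketch with a hypothetical schema: the keys match the examples in this post, but the allowed values are placeholders to replace with your real taxonomy.

```python
import json

# Hypothetical schema. The keys come from the examples in this post;
# the allowed values are placeholders for your own taxonomy.
ALLOWED = {
    "category": {"technical", "billing"},
    "priority": {"low", "medium", "high"},
    "team": {"infra", "finance"},
}


def check_output(raw: str) -> tuple[bool, str]:
    """Return (ok, reason) for one model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(data, dict):
        return False, "not a JSON object"
    if set(data) != set(ALLOWED):
        return False, f"unexpected keys: {sorted(data)}"
    for key, allowed in ALLOWED.items():
        value = data[key]
        # Reject non-strings first so unhashable values never reach the set lookup.
        if not isinstance(value, str) or value not in allowed:
            return False, f"invalid value for {key}: {value!r}"
    return True, "ok"


if __name__ == "__main__":
    print(check_output('{"category": "billing", "priority": "low", "team": "finance"}'))
    print(check_output('Sure! Here is the JSON you asked for.'))
```

The same function can run inside the evaluation loop to separate "valid JSON but wrong schema" from "not JSON at all", which point to different fixes in the training data.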

Step 5: Merge and Export to GGUF

A PEFT adapter on its own is not something llama.cpp can load as a model. Merge it into the base model, then export to GGUF for llama.cpp.

# export.py
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/checkpoint-final",
    max_seq_length=2048,
    load_in_4bit=False,   # load in 16-bit rather than 4-bit for a clean merge
)

# Merge LoRA weights into the base and write a quantized GGUF in one call.
model.save_pretrained_gguf(
    "domain-llama-3.2-3b",
    tokenizer,
    quantization_method="q4_k_m",
)

save_pretrained_gguf merges the adapter, converts to GGUF, and quantizes — here straight to Q4_K_M, the edge-deployment default. The output is a single ~2 GB .gguf file you can drop onto any llama.cpp host.

Verify it loads and behaves:

./llama.cpp/build/bin/llama-cli \
  -m domain-llama-3.2-3b/unsloth.Q4_K_M.gguf \
  -p "My instance won't boot after the latest update." \
  -n 64 --temp 0 --no-display-prompt

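If it returns the domain-specific JSON your training data taught it, the full pipeline (train, merge, quantize, deploy) worked. If it returns prose instead, first check the prompt: this command does not include the system prompt used in training, and the model was tuned to respond to that exact framing.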

Common Pitfalls

  • Inconsistent training data. The fastest way to a bad model. Identical system prompts, identical output schema, every example. The model can’t separate signal from your sloppiness.
  • Too many epochs. LoRA overfits a small dataset fast. Two to three epochs is the band; loss approaching zero means you’ve memorized the training set and lost generalization.
  • Training only q_proj/v_proj. Fine for toy tasks, underfits real domains. Target the MLP projections too.
  • Rank cargo-culting. r=64 doesn’t automatically beat r=16; for a narrow domain it just inflates the adapter and the overfitting risk. Start at 16 and only raise it if eval says to.
  • Trusting training loss as quality. Low loss can mean memorization. Always evaluate on held-out data and on real behavior.
  • Forgetting to merge. A raw PEFT adapter is not a GGUF model, and llama.cpp can’t load it as one. Merge into the base before exporting to GGUF.

Troubleshooting

Symptom: CUDA out-of-memory during training. Cause: batch size or sequence length too large for the GPU. Fix: drop per_device_train_batch_size to 1 and raise gradient_accumulation_steps to keep the effective batch size, confirm use_gradient_checkpointing="unsloth", and lower max_seq_length to your real example length.
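
For the last part of that fix, one way to pick a tighter max_seq_length is to measure the longest formatted example and round up. This is a sketch: suggest_max_seq_length is a hypothetical helper, and the rounding multiple and headroom are arbitrary choices, not Unsloth requirements.

```python
def suggest_max_seq_length(token_counts: list[int], multiple: int = 64, headroom: int = 16) -> int:
    """Smallest multiple of `multiple` that fits the longest example plus headroom."""
    if not token_counts:
        raise ValueError("token_counts is empty")
    needed = max(token_counts) + headroom
    return -(-needed // multiple) * multiple  # ceiling division, then scale back up


if __name__ == "__main__":
    # Hypothetical lengths. In practice, compute len(tokenizer(text)["input_ids"])
    # for each formatted training example.
    print(suggest_max_seq_length([212, 187, 305, 240]))
```

For the hypothetical lengths above, the suggestion is 384, well below the 2048 used in the training script.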

Symptom: loss won’t drop, stays flat and high. Cause: learning rate too low, malformed data, or a chat template mismatch. Fix: verify apply_chat_template output looks correct, confirm the JSONL parses cleanly, and if you lowered the learning rate from the value in the training script, return to learning_rate=2e-4.

Symptom: the model overfits — perfect on train, poor on eval. Cause: too many epochs or too small a dataset. Fix: cut to two epochs, add more varied examples (especially edge cases), and consider a small lora_dropout like 0.05.

Symptom: tuned model ignores the new domain and behaves like the base model. Cause: the adapter wasn’t merged, or the dataset was too small to shift behavior. Fix: confirm save_pretrained_gguf ran on the checkpoint (not the bare base), and add more training examples if the domain shift is large.

Symptom: GGUF export fails or produces a corrupt file. Cause: a llama.cpp version mismatch in Unsloth’s bundled converter. Fix: update Unsloth to the latest release, or convert manually with a current convert_hf_to_gguf.py after merging the adapter to safetensors.

What’s Next

A LoRA-tuned, GGUF-exported small model is a specialist that runs almost anywhere (a laptop, a cloud VM, a single-board computer) and can outperform a far larger generalist on the one task you trained it for. From here the work is operational: version your adapters, build a real evaluation harness so each retrain is a measured improvement, and treat the dataset as the living asset it is.
