Training a Defect Detection Model for the Factory Floor


April 3, 2026 · 8 min read · by Muhammad Amal programming

TL;DR: Defect detection is an imbalance problem before it is a modeling problem. Spend your effort on data quality and honest evaluation, not architecture. Optimize the metric the factory cares about, not mAP.

Most defect detection projects fail in the dataset, not the network. I have watched teams burn weeks tuning learning rates and augmentation schedules on a dataset where the “scratch” class had eleven examples and three of them were mislabeled. The model converged beautifully and was useless. A factory floor produces overwhelmingly good parts; defects are rare, varied, and often subtle. That single fact shapes everything about how you collect data, train, and decide whether the model is ready.

Defect detection model training is also a domain where the standard computer vision metrics quietly lie to you. A model with 0.92 mAP can still let through the one defect type that triggers a customer recall, because mAP averages across classes and your worst class is buried in the mean. The job is not to maximize a number on a dashboard. The job is to catch the defects that cost money, with a false reject rate the line can tolerate.

This article walks the full pipeline: capturing and labeling data, dealing with class imbalance, training YOLOv11 in PyTorch 2.6, and evaluating in a way that maps to factory economics. The deployment side, getting this model onto the line with deterministic latency, is covered in the edge architecture article.

Data Collection That Reflects Reality

You cannot train a useful defect detector on a few hundred staged photos. You need frames from the actual rig, captured under the actual lighting, across shifts and across the natural drift of a real process. The single highest-leverage decision is to instrument data capture into the production rig from day one, even before you have a model.

# capture_dataset.py — log frames with metadata for later labeling
import cv2, json, time, pathlib

class DatasetLogger:
    def __init__(self, out_dir: str, sample_every_n: int = 1):
        self.dir = pathlib.Path(out_dir)
        (self.dir / "images").mkdir(parents=True, exist_ok=True)
        self.n = sample_every_n
        self.count = 0
        self.manifest = (self.dir / "manifest.jsonl").open("a")

    def log(self, frame, line_id: str, shift: str):
        self.count += 1
        if self.count % self.n != 0:
            return
        ts = time.strftime("%Y%m%dT%H%M%S")
        name = f"{line_id}_{ts}_{self.count}.png"
        cv2.imwrite(str(self.dir / "images" / name), frame)
        self.manifest.write(json.dumps({
            "file": name, "line": line_id, "shift": shift,
            "epoch": time.time(),
        }) + "\n")
        self.manifest.flush()

Capture the metadata — line, shift, timestamp — because you will need it later to check whether your train/test split leaked. If the same physical part appears in both splits, your evaluation is fiction.
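
One way to act on that metadata is to split by capture session instead of by frame. The sketch below assumes a session is one line, one shift, and one UTC calendar day, and the 15 percent validation and test fractions are placeholders; the field names match the manifest written by capture_dataset.py, everything else is illustrative.

```python
import hashlib
import time
from collections import defaultdict


def session_key(record: dict) -> str:
    """Assumed session definition: same line, same shift, same UTC day."""
    day = time.strftime("%Y%m%d", time.gmtime(record["epoch"]))
    return f'{record["line"]}|{record["shift"]}|{day}'


def assign_split(key: str, val_frac: float = 0.15, test_frac: float = 0.15) -> str:
    """Hash the session key so a session always lands in the same split."""
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 10_000 / 10_000
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


def split_manifest(records):
    """Map split name -> list of file names, keeping each session together."""
    splits = defaultdict(list)
    for rec in records:
        splits[assign_split(session_key(rec))].append(rec["file"])
    return splits
```

Because the assignment is a pure function of the session key, re-running the split after new captures never moves an old session from train to test. If one physical part can be photographed in more than one session, the key would also need a part identifier, which the manifest above does not record.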

Labeling and Class Strategy

Upload to Roboflow or your annotation tool of choice and label tightly. For defect detection, tight bounding boxes matter more than for general object detection because defects are small and a loose box dilutes the positive signal with background.

Decide your class taxonomy deliberately. Resist the urge to create twenty defect classes. A model with a few well-populated classes beats one with twenty starved classes. Group visually similar defects unless the factory genuinely treats them differently downstream.
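
Before committing to a taxonomy it helps to count what you actually have. This is a minimal sketch that works on YOLO-format label lines (class id followed by four box values); the 200-instance floor is an assumed threshold, not a rule, and the helper names are made up for illustration.

```python
from collections import Counter


def count_instances(label_lines, names):
    """Count labeled boxes per class name; classes with no boxes report 0."""
    counts = Counter({name: 0 for name in names.values()})
    for line in label_lines:
        parts = line.split()
        if parts:                       # skip blank lines
            counts[names[int(parts[0])]] += 1
    return counts


def starved_classes(counts, min_instances=200):
    """Classes below the (assumed) minimum number of labeled boxes."""
    return sorted(name for name, n in counts.items() if n < min_instances)


def merge_classes(label_lines, id_map):
    """Rewrite class ids (old id -> new id) to merge visually similar classes."""
    merged = []
    for line in label_lines:
        parts = line.split()
        if not parts:
            continue
        parts[0] = str(id_map[int(parts[0])])
        merged.append(" ".join(parts))
    return merged
```

If a class shows up as starved and the factory handles it the same way as a neighbor, merging the two ids and updating the names in data.yaml is usually cheaper than waiting for more examples.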

# data.yaml — YOLOv11 dataset config
path: /data/defects
train: images/train
val: images/val
test: images/test
names:
  0: scratch
  1: dent
  2: contamination
  3: missing_component

When you export from Roboflow, generate the version programmatically so the dataset is reproducible and versioned alongside the training code.

# pull_dataset.py — reproducible Roboflow export
from roboflow import Roboflow
import os

rf = Roboflow(api_key=os.environ["ROBOFLOW_API_KEY"])
project = rf.workspace("factory-qc").project("surface-defects")
version = project.version(7)                    # pin the version
dataset = version.download("yolov11", location="/data/defects")
print(f"Dataset v7 downloaded to {dataset.location}")

Confronting Class Imbalance

This is where most projects either succeed or quietly fail. A typical floor dataset might be 95 percent good parts and a long tail of defects. Three tools, in order of preference:

First, fix the data. Targeted collection of rare defects beats every algorithmic trick. Ask the line operators to set aside defective parts for a labeling session.

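Second, weight the loss. The Ultralytics trainer exposes a global classification loss gain (the cls argument), which raises the cost of classification errors relative to box regression for every class at once. True per-class weights, where rare classes contribute more gradient than common ones, are not a stock training argument as far as I know and require customizing the loss.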

Third, augment carefully. Copy-paste augmentation — pasting cropped defects onto good backgrounds — is genuinely effective here, but only if the pasted defects look physically plausible. A floating scratch with a hard edge teaches the model to find compositing artifacts, not defects.
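
To make the hard-edge problem concrete, here is a minimal NumPy sketch of a feathered paste. It is an offline compositing helper, not the augmentation Ultralytics runs internally; the linear ramp and the 4-pixel margin are assumptions, and the patch is assumed to fit entirely inside the background.

```python
import numpy as np


def feather_mask(h: int, w: int, margin: int) -> np.ndarray:
    """Alpha mask: 0.0 on the patch border, ramping linearly to 1.0 inside."""
    ys = np.minimum(np.arange(h), np.arange(h)[::-1])   # distance to top/bottom edge
    xs = np.minimum(np.arange(w), np.arange(w)[::-1])   # distance to left/right edge
    dist = np.minimum(ys[:, None], xs[None, :]).astype(np.float32)
    return np.clip(dist / max(margin, 1), 0.0, 1.0)


def paste_defect(background, patch, top, left, margin=4):
    """Blend a cropped defect patch onto a good-part image with soft edges."""
    out = background.astype(np.float32)
    h, w = patch.shape[:2]
    alpha = feather_mask(h, w, margin)
    if patch.ndim == 3:                  # broadcast over color channels
        alpha = alpha[..., None]
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * patch.astype(np.float32) + (1.0 - alpha) * region
    return out.round().clip(0, 255).astype(np.uint8)
```

A linear feather removes the visible seam but does not match lighting or texture, so pasted samples still deserve a visual spot check before they go into training.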

# train.py — YOLOv11 training with PyTorch 2.6 backend
from ultralytics import YOLO
import torch

assert torch.__version__.startswith("2.6")

model = YOLO("yolo11s.pt")          # small variant: edge-friendly

results = model.train(
    data="data.yaml",
    epochs=200,
    imgsz=640,
    batch=32,
    device=0,
    optimizer="AdamW",
    lr0=1e-3,
    cos_lr=True,
    patience=40,                    # early stop on val plateau
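    # imbalance + small-object handling
    cls=1.0,                        # classification loss gain, all classes (default 0.5)
    box=7.5,                        # box loss gain (default)
    mosaic=1.0,
    # copy-paste: Ultralytics documents this for segmentation labels,
    # so verify it is not a no-op on box-only data
    copy_paste=0.3,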
    close_mosaic=20,                # disable mosaic for last 20 epochs
    hsv_v=0.4,                      # value jitter: lighting robustness
    degrees=10.0,
    project="runs/defect",
    name="yolo11s_v7",
)

The close_mosaic parameter matters more than people expect. Mosaic augmentation helps early but produces unrealistic compositions; turning it off for the final epochs lets the model settle on real-world image statistics. The hsv_v jitter is your cheapest insurance against lighting drift between shifts.

Evaluation That Maps to Money

Default YOLO metrics give you mAP. The factory does not care about mAP. It cares about two numbers: the escape rate (defects that pass) and the false reject rate (good parts kicked out). Build an evaluation that reports those directly.

# evaluate.py — confusion-based factory metrics
from ultralytics import YOLO

def factory_metrics(model_path: str, data_yaml: str, conf: float):
    model = YOLO(model_path)
    metrics = model.val(data=data_yaml, conf=conf, iou=0.5, split="test")
    cm = metrics.confusion_matrix.matrix      # (nc+1, nc+1), last = background

    nc = cm.shape[0] - 1
    report = {}
    for c in range(nc):
        tp = cm[c, c]
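        fn = cm[nc, c]                        # true class c predicted as background
        fp = cm[c, nc]                        # background predicted as class c
        escape = fn / (tp + fn + 1e-9)        # share of class-c defects missed outright
        # box-level proxy: share of class-c detections that landed on background
        false_reject = fp / (tp + fp + 1e-9)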
        report[c] = {
            "escape_rate": round(float(escape), 4),
            "false_reject_rate": round(float(false_reject), 4),
        }
    return report

for conf in (0.25, 0.40, 0.55, 0.70):
    print(conf, factory_metrics("runs/defect/yolo11s_v7/weights/best.pt",
                                "data.yaml", conf))

Run this sweep across confidence thresholds and you get the real operating curve. A pharmaceutical line will accept a high false reject rate to drive escape rate toward zero. A commodity line will pick a threshold that balances scrap cost against complaint cost. That threshold is a business decision; your job is to surface the trade-off honestly.
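
The trade-off can be made explicit with a small cost model. This sketch counts escapes and false rejects per part instead of per box, then picks the threshold with the lowest expected cost per part. It assumes one image per part and that you have already reduced each image to its highest detection confidence (0.0 when nothing was detected); the cost figures and defect rate are inputs you would get from the factory, not values from this article.

```python
def part_level_rates(is_defective, max_conf, threshold):
    """Escape and false reject rates counted per part, not per box."""
    defects = [c for d, c in zip(is_defective, max_conf) if d]
    goods = [c for d, c in zip(is_defective, max_conf) if not d]
    escapes = sum(c < threshold for c in defects)          # defective parts passed
    false_rejects = sum(c >= threshold for c in goods)     # good parts kicked out
    escape_rate = escapes / len(defects) if defects else 0.0
    false_reject_rate = false_rejects / len(goods) if goods else 0.0
    return escape_rate, false_reject_rate


def pick_threshold(is_defective, max_conf, thresholds,
                   defect_rate, escape_cost, reject_cost):
    """Return (threshold, expected cost per part) with the lowest cost."""
    best = None
    for t in thresholds:
        esc, frr = part_level_rates(is_defective, max_conf, t)
        cost = defect_rate * esc * escape_cost + (1 - defect_rate) * frr * reject_cost
        if best is None or cost < best[1]:
            best = (t, cost)
    return best
```

Passing the true production defect rate matters: a test set enriched with defects would otherwise overstate how often an escape can happen and push the threshold lower than the line needs.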

Per-Class Inspection

Always look at the worst class, never the average. Pull the failure cases and look at the actual images.

# inspect_failures.py — dump misclassified test images for review
from ultralytics import YOLO
import cv2, pathlib

model = YOLO("runs/defect/yolo11s_v7/weights/best.pt")
out = pathlib.Path("failures"); out.mkdir(exist_ok=True)

for img_path in pathlib.Path("/data/defects/images/test").glob("*.png"):
    res = model(str(img_path), conf=0.40, verbose=False)[0]
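    # labels mirror the images tree: /data/defects/labels/test/<stem>.txt
    label_path = pathlib.Path("/data/defects/labels/test") / f"{img_path.stem}.txt"
    gt_classes = set()
    try:
        with open(label_path) as f:
            # skip blank lines so a trailing newline does not raise IndexError
            gt_classes = {int(line.split()[0]) for line in f if line.strip()}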
    except FileNotFoundError:
        pass
    pred_classes = {int(b.cls) for b in res.boxes}
    if gt_classes != pred_classes:                # mismatch worth reviewing
        annotated = res.plot()
        cv2.imwrite(str(out / img_path.name), annotated)

Half the time the “model error” is a labeling error. Fixing those and retraining is usually the single biggest accuracy gain available, and it costs nothing but attention.
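
To keep the worst class in view, a tiny summary over the per-class report is enough. The sketch assumes the report has the shape returned by factory_metrics above (class id mapped to a dict with an escape_rate entry); the function name is made up for illustration.

```python
def summarize_escape(report, names):
    """Report the worst class next to the mean so the mean cannot hide it."""
    rates = {names[c]: m["escape_rate"] for c, m in report.items()}
    worst = max(rates, key=rates.get)
    return {
        "worst_class": worst,
        "worst_escape": rates[worst],
        "mean_escape": sum(rates.values()) / len(rates),
    }
```

A mean escape rate that looks acceptable next to a worst-class figure several times higher tells you which failure folder to open first.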

Common Pitfalls

Split leakage. The same physical part photographed twice ends up in train and test, inflating your numbers. Split by part or by capture session, never randomly by frame.
Twenty starved classes. Each defect class needs hundreds of examples. Merge classes until each one is genuinely learnable.
Implausible copy-paste augmentation. Pasted defects with hard edges teach the model to detect compositing, not defects. Blend them properly.
Trusting mAP. A high mean hides a catastrophic worst class. Report per-class escape and false reject rates.
Training once and shipping. A factory process drifts. Plan for periodic re-labeling and retraining from the start.

Troubleshooting

Symptom: Validation mAP is high but the model misses defects on the line. Cause: Train/test split leaked, or training data lacked production lighting variety. Fix: Re-split by capture session and add frames from every shift.
Symptom: One defect class never gets detected. Cause: Too few examples and no loss weighting. Fix: Targeted data collection for that class plus copy-paste augmentation; raise the classification loss gain.
Symptom: Training loss diverges in the first few epochs. Cause: Learning rate too high for the batch size, or corrupted labels. Fix: Lower lr0, validate label files, and run the dataset checker.
Symptom: Model flags many good parts as defective. Cause: Confidence threshold too low, or background augmentation insufficient. Fix: Raise the operating threshold using the metric sweep and add more good-part variety.
Symptom: Accuracy degrades weeks after deployment. Cause: Process drift — new supplier material, lighting change, camera aging. Fix: Resume dataset capture, re-label recent frames, and retrain on a rolling window.
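
For the rolling-window fix in the last item, the capture manifest already holds what you need. A minimal sketch, assuming the manifest.jsonl format written by capture_dataset.py and an arbitrary 30-day window:

```python
import json
import time


def recent_frames(manifest_lines, window_days=30, now=None):
    """Return file names captured within the last window_days (assumed window)."""
    now = time.time() if now is None else now
    cutoff = now - window_days * 86400
    recent = []
    for line in manifest_lines:
        if not line.strip():            # tolerate blank lines in the manifest
            continue
        rec = json.loads(line)
        if rec["epoch"] >= cutoff:
            recent.append(rec["file"])
    return recent
```

The returned list is the queue for re-labeling; whether older frames are dropped or merely down-weighted in the next training run is a separate decision.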

What’s Next

A defect detection model is only as good as the dataset behind it and the metric you hold it to. Get the data honest, confront imbalance directly, and evaluate in escape and false-reject terms the factory understands. With a trained checkpoint in hand, the next step is making it run fast and predictably on the line; see the Jetson and TensorRT deployment guide for the full conversion and serving path. The canonical training reference is at docs.ultralytics.com.

