Architecting Computer Vision Quality Control at the Industrial Edge


April 1, 2026 · 17 min read · by Muhammad Amal

TL;DR
Run inspection inference on the line, not in the cloud.
Budget every millisecond from shutter to PLC signal.
Treat the model as one replaceable component in a hard-real-time system.

The first time I shipped a vision inspection rig to a real factory, the demo that worked flawlessly on my desk fell apart in two hours. The problem was not the model. It was a network hiccup that stalled a cloud round-trip long enough for three defective parts to sail past the reject gate. That afternoon taught me something every cloud-first engineer eventually learns. A quality control loop has a deadline, and a deadline you cannot guarantee is not a deadline at all.

Computer vision quality control at the industrial edge is a different discipline from building a clever classifier. The accuracy of the model matters, but it is table stakes. What separates a system that survives a year on the floor from one that gets ripped out in a month is architecture: deterministic latency, graceful degradation, and a clean contact with the physical machinery that actually moves parts.

This article is the step-by-step build I wish I had on day one. We will go from an empty NVIDIA Jetson Orin to a running rig that traces a single part from the moment it triggers the camera to the moment a PLC either passes it or kicks it into a reject bin, with YOLOv11 served through TensorRT 10. Each step has a goal, the commands to run, what you should expect to see, and how to check it before you move on.

Before You Start

Get the parts on the bench before you write any code. Swapping hardware halfway through a build invalidates half your latency measurements.

Hardware

Compute: Jetson Orin NX 16GB for a single-camera line, or AGX Orin for multi-camera cells. Leave GPU headroom rather than sizing for 100 percent utilization.
Camera: a global-shutter industrial camera over GigE Vision or MIPI CSI-2. Global shutter is not optional for moving parts. A rolling shutter exposes the sensor row by row, so a part moving across the frame is captured at slightly different times top to bottom, which smears edges and quietly destroys the defect signal you are trying to detect.
Trigger: a photoelectric sensor wired to the camera trigger line, so every frame is tied to a physical part instead of a free-running clock.
Actuator path: a PLC with a reject gate or air blast, reachable over Modbus TCP, EtherNet/IP, or PROFINET.
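
To put a rough number on the rolling-shutter point, here is a back-of-envelope sketch. The 200 mm field of view and the 15 microsecond row readout time are hypothetical values picked for illustration, not figures from any particular camera, so substitute the numbers from your own datasheet.

```python
def rolling_shutter_skew_px(belt_speed_m_s, row_readout_s, rows,
                            fov_width_m, image_width_px):
    """Horizontal offset, in pixels, between the first and last sensor row.

    A rolling shutter reads rows one after another, so the part keeps moving
    for (rows * row_readout_s) seconds between the top and bottom of the frame.
    """
    px_per_m = image_width_px / fov_width_m
    readout_time_s = rows * row_readout_s
    return belt_speed_m_s * readout_time_s * px_per_m


# Belt speed and resolution from this build; FOV and readout time are assumed.
skew = rolling_shutter_skew_px(
    belt_speed_m_s=0.5, row_readout_s=15e-6, rows=1024,
    fov_width_m=0.2, image_width_px=1280,
)
print(f"top-to-bottom skew: {skew:.1f} px")
```

Under those assumptions the skew comes out at several tens of pixels, which is larger than many of the defects a detector like this is asked to find.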

Software

JetPack 6.x (Ubuntu 22.04 base, CUDA 12, cuDNN, TensorRT 10).
Ultralytics for YOLOv11 export, aravis 0.8 for GigE capture, pymodbus 3.x for the PLC handshake, cuda-python for CUDA graph capture, and pycuda for the calibrator in step 5.

sudo apt install -y gir1.2-aravis-0.8 python3-gi
pip install ultralytics pymodbus cuda-python pycuda opencv-python numpy

Assumption: you already have a trained defect_yolo11s.pt checkpoint. Training the detector is its own article. Here the model is just one swappable component in a real-time system.

1. Write the Latency Budget

Goal: decide whether the line is even feasible before writing inference code, and fix the number every later step is measured against.

On a conveyor moving 0.5 m/s with a reject actuator 400 mm downstream of the camera, you have 800 ms from capture to actuation. That sounds generous until you account for everything that has to happen.

Trigger debounce        : 5 ms
Sensor-to-shutter delay : 12 ms
Frame exposure          : 8 ms
DMA transfer to GPU     : 6 ms
Preprocess (resize/norm): 9 ms
TensorRT inference      : 22 ms
Postprocess + NMS       : 4 ms
Decision logic          : 1 ms
Fieldbus write to PLC   : 15 ms
-------------------------------
Total committed         : 82 ms
Slack                   : 718 ms

The slack is your safety margin, and you spend it on jitter, not on the average case. A pipeline that runs in 82 ms on average but spikes to 600 ms once a minute will still drop parts. The architectural goal is a tight latency distribution, not a low mean. Everything below is in service of that.

Verify: divide the actuator distance by belt speed to get your hard deadline, then confirm your committed budget leaves at least 5x slack. If it does not, slow the belt or move the camera upstream before going further.
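
The check above is simple enough to keep as a script next to the rig. This is a minimal sketch using the budget numbers from this step. Reading "5x slack" as "slack is at least five times the committed budget" is my interpretation, and the function names are illustrative.

```python
# Committed budget from step 1, in milliseconds
BUDGET_MS = {
    "trigger_debounce": 5,
    "sensor_to_shutter": 12,
    "frame_exposure": 8,
    "dma_to_gpu": 6,
    "preprocess": 9,
    "tensorrt_inference": 22,
    "postprocess_nms": 4,
    "decision_logic": 1,
    "fieldbus_write": 15,
}


def hard_deadline_ms(actuator_distance_mm, belt_speed_m_s):
    """Time for a part to travel from camera to actuator.

    1 m/s equals 1 mm/ms, so millimetres divided by m/s gives milliseconds.
    """
    return actuator_distance_mm / belt_speed_m_s


def check_budget(budget, deadline_ms, min_slack_ratio=5.0):
    """Return (committed_ms, slack_ms, passes) for a budget and a deadline."""
    committed = sum(budget.values())
    slack = deadline_ms - committed
    return committed, slack, slack >= min_slack_ratio * committed


deadline = hard_deadline_ms(actuator_distance_mm=400, belt_speed_m_s=0.5)
committed, slack, ok = check_budget(BUDGET_MS, deadline)
print(f"deadline {deadline:.0f} ms, committed {committed} ms, "
      f"slack {slack:.0f} ms, feasible: {ok}")
```

Rerun it whenever the belt speed, the camera position, or any line of the budget changes.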

2. Provision the Jetson Orin

Goal: make the board behave the same way on minute one and minute six hundred.

Flash JetPack 6.x, then lock the power mode and clocks so the scheduler stops surprising you.

# Pin maximum, deterministic clocks on Jetson Orin
sudo nvpmodel -m 0          # MAXN power profile
sudo jetson_clocks          # lock GPU/CPU/EMC clocks to max
sudo jetson_clocks --show   # print current vs. max clocks to confirm the pin took

nvpmodel -m 0 selects the MAXN profile, which unlocks the full CPU, GPU, DLA, and EMC clock ceilings. jetson_clocks then pins them so they do not scale down under the OS governor. --show prints the current and max frequency for each engine, and you want current equal to max.

One caveat from NVIDIA’s own guidance: MAXN is meant for benchmarking, not indefinite heavy duty. Under sustained load in a warm enclosure the module will still throttle when total power exceeds the TDP budget. For a rig that runs three shifts a day, consider one of the capped power modes defined in /etc/nvpmodel.conf that your enclosure can hold indefinitely, re-measure the latency budget in that mode, and add active cooling. If you are on a newer Orin Nano or NX, JetPack 6.2 “Super Mode” raises the clock ceilings further, which is worth enabling only if your thermal solution can keep up.

Verify: run sudo tegrastats for a minute under a dummy GPU load. Watch the GR3D_FREQ and temperature columns. If the clock holds steady and temperature plateaus below throttle, the board is stable. If the clock sags, fix cooling before you trust any latency number.
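
Watching columns by eye gets old quickly, so here is a small sketch that pulls the GPU clock and temperatures out of a tegrastats line and flags a sag. The sample line is a hypothetical one in the general tegrastats style. The exact field layout varies between JetPack releases, so treat the regular expressions as a starting point to check against your own output.

```python
import re

# Matches "GR3D_FREQ 97%@[918]" as well as the older "GR3D_FREQ 97%@918"
GR3D_RE = re.compile(r"GR3D_FREQ\s+(\d+)%(?:@\[?(\d+))?")
# Matches temperature fields such as "gpu@50.1C"
TEMP_RE = re.compile(r"([A-Za-z0-9_]+)@(-?\d+(?:\.\d+)?)C\b")


def parse_tegrastats(line):
    """Extract GPU utilisation, GPU clock (MHz) and temperatures from one line."""
    m = GR3D_RE.search(line)
    return {
        "gpu_util": int(m.group(1)) if m else None,
        "gpu_mhz": int(m.group(2)) if m and m.group(2) else None,
        "temps_c": {name: float(val) for name, val in TEMP_RE.findall(line)},
    }


def clock_sagged(gpu_mhz_samples, tolerance=0.02):
    """True if any sample fell more than `tolerance` below the observed peak."""
    peak = max(gpu_mhz_samples)
    return min(gpu_mhz_samples) < peak * (1.0 - tolerance)


# Hypothetical sample line, for illustration only
sample = ("RAM 3021/15656MB SWAP 0/7828MB CPU [12%@1984,8%@1984] "
          "EMC_FREQ 4%@3199 GR3D_FREQ 97%@[918] cpu@52.5C gpu@50.1C tj@52.5C")
print(parse_tegrastats(sample))
```

Feed it the lines from a sustained run and call clock_sagged() on the collected gpu_mhz values before trusting any latency number.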

3. Set Up Hardware-Triggered Capture

Goal: turn one physical part into exactly one frame, with no software timing in the loop.

The camera should be hardware-triggered by the photoelectric sensor, not free-running. Hardware triggering ties each acquisition to an external edge instead of host-side software timing, which is what removes the variable latency a software trigger would inject.

# capture.py - hardware-triggered acquisition with Aravis (GigE Vision)
import gi
gi.require_version("Aravis", "0.8")
from gi.repository import Aravis
import numpy as np

class TriggeredCamera:
    def __init__(self, device_id: str):
        self.cam = Aravis.Camera.new(device_id)
        self.cam.set_pixel_format(Aravis.PIXEL_FORMAT_BAYER_RG_8)
        self.cam.set_region(0, 0, 1280, 1024)
        self.cam.set_exposure_time(8000)            # 8 ms
        self.cam.set_frame_rate(0)                  # disable free-run
        self.cam.set_trigger("Line1")               # external hardware trigger
        self.stream = self.cam.create_stream(None, None)
        payload = self.cam.get_payload()
        for _ in range(8):                          # pre-allocate buffer pool
            self.stream.push_buffer(Aravis.Buffer.new_allocate(payload))
        self.cam.start_acquisition()

    def grab(self, timeout_us: int = 200_000) -> np.ndarray | None:
        buf = self.stream.timeout_pop_buffer(timeout_us)
        if buf is None or buf.get_status() != Aravis.BufferStatus.SUCCESS:
            if buf is not None:
                self.stream.push_buffer(buf)
            return None
        w, h = buf.get_image_width(), buf.get_image_height()
        raw = np.frombuffer(buf.get_data(), dtype=np.uint8).reshape(h, w).copy()
        self.stream.push_buffer(buf)                # return buffer to pool
        return raw

Order matters here. Apply every camera setting, including set_trigger("Line1"), before start_acquisition(), and set the pixel format and region before get_payload() so the pre-allocated buffers match the frame size. The set_trigger call wires the camera’s FrameStart trigger to physical input Line1 on a rising edge.

The buffer pool matters just as much. Allocating frame memory inside the hot loop is one of the most common sources of latency spikes I have profiled. Pre-allocate a handful of buffers, reuse them, and never malloc on the critical path.

Verify: wave a hand through the sensor beam a few times and confirm grab() returns a frame for each break and None while idle. If frames arrive with no part present, the camera is still free-running and the trigger did not take.

4. Export YOLOv11 to ONNX

Goal: get the trained checkpoint out of PyTorch and into a portable graph TensorRT can consume.

YOLOv11 in raw PyTorch is fine for training, but it has no business running on the line. Export it to ONNX with a fixed batch size and the opset TensorRT 10 expects.

# Export YOLOv11 to ONNX with a fixed batch size
yolo export model=defect_yolo11s.pt format=onnx \
    opset=17 imgsz=640 batch=1 simplify=True

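Pin opset=17, which TensorRT 10's ONNX parser supports, rather than relying on the Ultralytics default, which is chosen from the installed PyTorch and ONNX versions. An opset the parser does not support surfaces as operator errors at engine build. The exported graph has a single input named images of shape 1x3x640x640 and a single output named output0.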

Verify: open the ONNX in Netron or check the output shape in code. For a detector with N classes, output0 is [1, 4+N, 8400]. A standard 80-class model is [1, 84, 8400]: four box coordinates plus the per-class scores across 8400 anchors. There is no separate objectness channel in the v8/v11 head, which is why the postprocess in step 6 reads class scores directly.
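
If you want to know the expected shape before opening the file, the 8400 falls out of the input size and the three detection strides. A minimal sketch, assuming the standard 8, 16 and 32 strides of the v8/v11 detection head and a square input:

```python
def yolo_output_shape(num_classes, imgsz=640, strides=(8, 16, 32), batch=1):
    """Expected output0 shape for a v8/v11-style detection head.

    Each stride contributes one prediction per cell of an
    (imgsz / stride) x (imgsz / stride) grid.
    """
    anchors = sum((imgsz // s) ** 2 for s in strides)
    return (batch, 4 + num_classes, anchors)


print(yolo_output_shape(80))   # the 80-class case described above
print(yolo_output_shape(3))    # a hypothetical 3-class defect model
```

For 640x640 that is 80x80 + 40x40 + 20x20 = 6400 + 1600 + 400 = 8400 predictions. Compare the tuple against what Netron reports to catch a wrong imgsz or class count early.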

5. Build the INT8 TensorRT Engine

Goal: compile an engine tuned for your exact input shape and precision, calibrated on real factory frames.

# Build a TensorRT 10 engine with INT8 calibration
# The ONNX was exported with a static 1x3x640x640 input, so no --shapes flag is needed
trtexec --onnx=defect_yolo11s.onnx \
    --saveEngine=defect_yolo11s.int8.engine \
    --int8 --fp16 \
    --calib=calib_cache.bin \
    --builderOptimizationLevel=4 \
    --useCudaGraph   # affects trtexec's own timing run only, not the saved engine

INT8 cuts inference time roughly in half on Orin versus FP16, but it needs a calibration cache built from a few hundred representative frames. Do not skip this. An INT8 engine calibrated on internet images and deployed on factory lighting will quietly lose 8 to 10 points of mAP. Calibrate on frames captured from the actual rig, under the actual lights.

A forward-compatibility note: the --int8 and --calib flags are the TensorRT 10 way. TensorRT 11 removes them and moves quantization offline into the model graph via NVIDIA’s ModelOpt toolkit. If you upgrade later, expect to quantize before the engine build, not during it.

# build_engine.py - INT8 calibration with line-captured frames
import glob

import cv2
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates the CUDA context that mem_alloc needs)
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

class FactoryCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, frame_dir: str, cache_path: str):
        super().__init__()
        self.cache_path = cache_path
        self.files = sorted(glob.glob(f"{frame_dir}/*.png"))
        self.idx = 0
        self.batch = np.zeros((1, 3, 640, 640), dtype=np.float32)
        self.d_input = cuda.mem_alloc(self.batch.nbytes)

    def get_batch_size(self): return 1

    def get_batch(self, names):
        if self.idx >= len(self.files):
            return None
        # Preprocessing must match the runtime path in step 8: RGB, 640x640, 0..1
        img = cv2.cvtColor(cv2.imread(self.files[self.idx]), cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (640, 640)).astype(np.float32) / 255.0
        self.batch[0] = img.transpose(2, 0, 1)
        cuda.memcpy_htod(self.d_input, np.ascontiguousarray(self.batch))
        self.idx += 1
        return [int(self.d_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_path, "wb") as f:
            f.write(cache)

Capture the calibration set first: run the rig in trigger mode, save 300 to 500 PNG frames spanning your real defect and pass cases, and point frame_dir at them.
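
"Spanning your real defect and pass cases" is easy to get wrong if you just take the first few hundred files, because pass frames usually dominate. One way to balance the set is sketched below. The per-label dictionary layout and the equal-share rule are assumptions of mine, not a TensorRT requirement.

```python
import random


def pick_calibration_frames(frames_by_label, total=400, seed=0):
    """Pick up to `total` frames, giving each label an equal share.

    frames_by_label maps a label such as "pass" or "scratch" to a list of
    frame paths. Labels with fewer frames than their share contribute all
    of them, so the result can be smaller than `total`.
    """
    rng = random.Random(seed)          # fixed seed keeps the set reproducible
    labels = sorted(frames_by_label)
    per_label = total // len(labels)
    picked = []
    for label in labels:
        pool = sorted(frames_by_label[label])
        picked.extend(rng.sample(pool, min(per_label, len(pool))))
    return sorted(picked)


# Hypothetical capture run: many pass frames, fewer defects
frames = {
    "pass": [f"pass_{i:04d}.png" for i in range(1000)],
    "scratch": [f"scratch_{i:04d}.png" for i in range(50)],
    "dent": [f"dent_{i:04d}.png" for i in range(300)],
}
subset = pick_calibration_frames(frames, total=300, seed=1)
print(len(subset), "frames selected")
```

Copy the selected files into frame_dir and keep the seed in version control, so a rebuilt cache can be traced back to the exact frames that produced it.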

Verify: run the engine on a held-out validation set and compare mAP against the FP16 engine. A drop of one or two points is normal for INT8. A drop of eight or more means your calibration frames did not match production conditions, so recapture and rebuild the cache.

6. Write the Inference Service

Goal: turn frames into decisions with predictable latency, using a CUDA graph to erase per-launch kernel overhead.

This is the step where the obvious library choice is a trap. You will see most TensorRT examples use pycuda for memory and streams, and that part is fine. But pycuda does not expose CUDA graph capture at all. There is no Stream.begin_capture, no Graph, no instantiate. NVIDIA’s documented path for CUDA graphs in Python uses the cuda-python bindings (cuda.bindings.runtime, imported as cudart). So we use cudart for allocation, stream, and graph capture, and let TensorRT bind tensors by address.

# inference.py - TensorRT 10 runtime with CUDA graph capture (cuda-python)
import ctypes

import numpy as np
import tensorrt as trt
from cuda.bindings import runtime as cudart

def _chk(err):
    """Unpack a cuda-python return value, raising on any non-success status."""
    if isinstance(err, tuple):
        status, *rest = err
        if status != cudart.cudaError_t.cudaSuccess:
            raise RuntimeError(f"CUDA error: {status}")
        return rest[0] if len(rest) == 1 else rest
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"CUDA error: {err}")

def _pinned_view(ptr, nbytes: int, shape) -> np.ndarray:
    """Wrap a pinned host allocation as a float32 NumPy array without copying."""
    buf = (ctypes.c_float * (nbytes // 4)).from_address(int(ptr))
    return np.frombuffer(buf, dtype=np.float32).reshape(shape)

class DefectDetector:
    def __init__(self, engine_path: str, conf_thresh: float = 0.45):
        self.conf = conf_thresh
        logger = trt.Logger(trt.Logger.ERROR)
        with open(engine_path, "rb") as f, trt.Runtime(logger) as rt:
            self.engine = rt.deserialize_cuda_engine(f.read())
        self.ctx = self.engine.create_execution_context()
        self.stream = _chk(cudart.cudaStreamCreate())

        # Allocate pinned host + device memory once
        self.in_shape = (1, 3, 640, 640)
        self.out_shape = tuple(self.ctx.get_tensor_shape("output0"))
        self.in_nbytes = int(np.prod(self.in_shape)) * 4
        self.out_nbytes = int(np.prod(self.out_shape)) * 4
        self.h_in = _chk(cudart.cudaHostAlloc(self.in_nbytes, 0))
        self.h_out = _chk(cudart.cudaHostAlloc(self.out_nbytes, 0))
        self.d_in = _chk(cudart.cudaMalloc(self.in_nbytes))
        self.d_out = _chk(cudart.cudaMalloc(self.out_nbytes))
        # Build the NumPy views once so infer() creates no wrappers per call
        self.h_in_view = _pinned_view(self.h_in, self.in_nbytes, self.in_shape)
        self.h_out_view = _pinned_view(self.h_out, self.out_nbytes, self.out_shape)
        self.ctx.set_tensor_address("images", int(self.d_in))
        self.ctx.set_tensor_address("output0", int(self.d_out))
        self._capture_graph()

    def _memcpy(self, dst, src, n, kind):
        # Check the status so a failed copy cannot go unnoticed
        _chk(cudart.cudaMemcpyAsync(dst, src, n, kind, self.stream))

    def _capture_graph(self):
        H2D = cudart.cudaMemcpyKind.cudaMemcpyHostToDevice
        D2H = cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost
        # Warm up, then capture the launch sequence as a CUDA graph
        for _ in range(3):
            self.ctx.execute_async_v3(self.stream)
        _chk(cudart.cudaStreamSynchronize(self.stream))
        mode = cudart.cudaStreamCaptureMode.cudaStreamCaptureModeThreadLocal
        _chk(cudart.cudaStreamBeginCapture(self.stream, mode))
        self._memcpy(self.d_in, self.h_in, self.in_nbytes, H2D)
        self.ctx.execute_async_v3(self.stream)
        self._memcpy(self.h_out, self.d_out, self.out_nbytes, D2H)
        graph = _chk(cudart.cudaStreamEndCapture(self.stream))
        self.graph_exec = _chk(cudart.cudaGraphInstantiate(graph, 0))

    def infer(self, chw_frame: np.ndarray) -> np.ndarray:
        np.copyto(self.h_in_view, chw_frame)      # stage the frame in pinned memory
        _chk(cudart.cudaGraphLaunch(self.graph_exec, self.stream))
        _chk(cudart.cudaStreamSynchronize(self.stream))
        # Copy out of pinned memory so the next launch cannot overwrite the result
        return self._postprocess(self.h_out_view.copy())

    def _postprocess(self, raw: np.ndarray) -> np.ndarray:
        # raw: (1, 4+num_classes, num_anchors) for YOLOv11
        # Returns rows of (cx, cy, w, h, score, class); no NMS is applied here
        pred = raw[0].T
        scores = pred[:, 4:].max(axis=1)
        keep = scores > self.conf
        boxes = pred[keep, :4]
        cls = pred[keep, 4:].argmax(axis=1)
        return np.column_stack([boxes, scores[keep], cls]) if keep.any() \
            else np.empty((0, 6), dtype=np.float32)

The shape of the win: capturing the memcpy plus execute_async_v3 plus memcpy sequence once and replaying it with cudaGraphLaunch removes the per-call kernel launch overhead, which is exactly the kind of small, variable cost that widens your latency tail. Warm up with a few plain executions before capture, or the graph records cold-start work.

Verify: time 1000 infer() calls on a static frame and look at the spread, not the mean. A correctly captured graph gives you a tight band from the first call. If the first few calls are slow and the rest are fast, the warm-up before capture was too short, so raise the warm-up count in _capture_graph().
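
One thing the _postprocess method in this step does not do is non-maximum suppression. It returns every anchor above the confidence threshold as a centre-format (cx, cy, w, h) box. That is enough for a pass/reject decision based on "any detection at all", but if you count or localise defects you need to merge overlapping boxes. Below is a plain-Python sketch of greedy per-class NMS. It is my own illustration rather than the Ultralytics implementation, and a NumPy or GPU version would be the better choice on the hot path.

```python
def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(dets, iou_thresh=0.5):
    """Greedy per-class NMS over rows of (cx, cy, w, h, score, cls)."""
    kept = []
    for det in sorted(dets, key=lambda d: d[4], reverse=True):
        box = cxcywh_to_xyxy(det[:4])
        # Keep the box unless a higher-scoring box of the same class overlaps it
        if all(det[5] != k[5] or iou(box, cxcywh_to_xyxy(k[:4])) <= iou_thresh
               for k in kept):
            kept.append(det)
    return kept


detections = [
    (100, 100, 50, 50, 0.90, 0),   # strongest box
    (102, 101, 50, 50, 0.80, 0),   # near-duplicate of the first, same class
    (300, 300, 40, 40, 0.70, 0),   # separate defect
    (100, 100, 50, 50, 0.60, 1),   # same place, different class
]
print(len(nms(detections)), "boxes kept")
```

The rows have the same layout that _postprocess returns, so the output of infer() can be passed in directly.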

7. Close the Loop to the PLC

Goal: get the decision out of Python and into the machine that moves parts.

A vision system that prints “DEFECT” to a console is a science project. Most factory floors speak Modbus TCP, EtherNet/IP, or PROFINET. Modbus is the lowest common denominator and the easiest to get right.

# actuator.py - write pass/reject decision to PLC over Modbus TCP
from pymodbus.client import ModbusTcpClient
import time, logging

log = logging.getLogger("actuator")

class RejectGate:
    def __init__(self, plc_host: str, coil_addr: int = 16, device_id: int = 1):
        self.client = ModbusTcpClient(plc_host, port=502, timeout=0.05)
        self.coil = coil_addr
        self.device_id = device_id
        if not self.client.connect():
            raise ConnectionError(f"PLC unreachable at {plc_host}")

    def signal(self, is_defect: bool, part_id: int) -> bool:
        try:
            rr = self.client.write_coil(self.coil, is_defect, device_id=self.device_id)
            if rr.isError():
                log.error("Modbus write failed for part %d", part_id)
                return False
            return True
        except Exception as exc:
            log.error("PLC write exception part %d: %s", part_id, exc)
            return False

Two pymodbus details worth knowing. The keyword for the device (slave) address has churned across versions: recent 3.x releases use device_id, earlier 3.x releases used slave, and the 2.x line used unit, so check the signature of write_coil in the version you have installed. The positional write_coil(addr, value) form works regardless if you only ever talk to one device. Also, after the initial connect(), the synchronous client checks the connection on each call and reconnects automatically if it dropped, so you do not need to babysit it.
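
If the same code has to run against more than one pymodbus version, one option is to ask the client which keyword it accepts instead of hard-coding it. This is a sketch built on inspect.signature. The helper names are hypothetical. When a version hides the keyword behind **kwargs, the helper falls back to the positional form and therefore to the client's default device address.

```python
import inspect


def unit_keyword(write_fn):
    """Return the device-address keyword that write_fn declares, or None."""
    params = inspect.signature(write_fn).parameters
    for name in ("device_id", "slave", "unit"):
        if name in params:
            return name
    return None


def write_coil_compat(client, address, value, unit_id=1):
    """Call client.write_coil with whichever device-address keyword it accepts."""
    keyword = unit_keyword(client.write_coil)
    kwargs = {keyword: unit_id} if keyword else {}
    return client.write_coil(address, value, **kwargs)
```

In RejectGate.signal() this would replace the direct write_coil call. Resolve the keyword once in __init__ rather than on every part if you adopt it, since signature inspection has no place on the hot path.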

Note the 50 ms timeout. If the PLC does not acknowledge in time, you do not get to retry forever, because the part has moved. The safe default for a missed acknowledgement depends on your line. A pharmaceutical line fails closed and rejects on uncertainty, a low-cost-part line may fail open. Make that policy explicit in code, never implicit.

Verify: trigger a known-defect part and watch the reject actuator fire. Then pull the network cable mid-run and confirm your fail policy does what you intended, not whatever the exception handler happens to do.
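
"Explicit in code" can be as small as an enum and one function. The sketch below is one way to express it, with hypothetical names throughout. The signal argument is any callable with the same shape as RejectGate.signal, which keeps the policy testable without a PLC on the bench.

```python
from enum import Enum


class FailPolicy(Enum):
    FAIL_CLOSED = "reject on uncertainty"
    FAIL_OPEN = "pass on uncertainty"


def signal_with_policy(signal, is_defect, part_id, policy):
    """Send the decision; if the write fails, send the policy's fallback once.

    Returns "sent", "fallback_sent", or "unconfirmed" so the caller can log
    exactly what happened to each part.
    """
    if signal(is_defect, part_id):
        return "sent"
    reject_on_failure = policy is FailPolicy.FAIL_CLOSED
    if signal(reject_on_failure, part_id):
        return "fallback_sent"
    return "unconfirmed"
```

If the link itself is down, the fallback write fails too and you get "unconfirmed". A retry from the Jetson cannot cover that case, which is an argument for mirroring the same policy in the PLC program, for example by having it reject any part for which no decision arrived in time.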

8. Orchestrate the Hot Loop

Goal: wire capture, inference, and actuation into one tight loop with a latency guard.

# pipeline.py - the orchestrated hot loop
import logging
import time

import cv2

log = logging.getLogger("pipeline")

def run(cam, detector, gate, conf=0.45):
    part_id = 0
    while True:
        raw = cam.grab()
        if raw is None:
            continue                       # no trigger, idle
        part_id += 1
        t0 = time.perf_counter_ns()

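        # OpenCV names Bayer codes after the second row of the mosaic, so a
        # GenICam BayerRG8 (RGGB) frame decodes with the "BG" constant
        rgb = cv2.cvtColor(raw, cv2.COLOR_BAYER_BG2RGB)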
        chw = (cv2.resize(rgb, (640, 640)).astype("float32") / 255.0
               ).transpose(2, 0, 1)[None]
        dets = detector.infer(chw)

        is_defect = len(dets) > 0
        ok = gate.signal(is_defect, part_id)
        latency_ms = (time.perf_counter_ns() - t0) / 1e6

        if not ok:                          # fail-closed policy
            gate.signal(True, part_id)
        if latency_ms > 120:
            log.warning("part %d latency %.1f ms over budget",
                        part_id, latency_ms)

The loop idles cheaply when no part is present, stamps a timestamp the moment a frame arrives, and logs any part that blows the 120 ms guard. That guard is your early warning that thermal drift or network contention is creeping in.

Verify: run a known sequence of good and defective parts past the camera and confirm the reject count matches. Tail the log for over-budget warnings during a sustained run.
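
A single warning line is easy to miss in a long log. As a complement, here is a sketch of a rolling monitor that tells you when over-budget parts start to cluster. The class name, window size and violation limit are hypothetical choices. The running counter keeps each update constant-time, so it is safe to call inside the loop.

```python
from collections import deque


class BudgetMonitor:
    """Track how many of the last `window` parts exceeded the latency guard."""

    def __init__(self, budget_ms=120.0, window=1000, max_violations=3):
        self.budget_ms = budget_ms
        self.max_violations = max_violations
        self.recent = deque(maxlen=window)
        self.violations = 0

    def record(self, latency_ms):
        """Record one part; return True when the window holds too many misses."""
        if len(self.recent) == self.recent.maxlen:
            # The oldest entry is about to fall out of the window
            self.violations -= self.recent[0]
        over = latency_ms > self.budget_ms
        self.recent.append(over)
        self.violations += over
        return self.violations > self.max_violations
```

Create it once before the loop, call monitor.record(latency_ms) right after the latency is computed, and escalate when it returns True, for example by raising an alarm bit on the PLC instead of only logging.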

9. Validate the Latency Distribution

Goal: prove the rig meets its deadline at the tail, not just on average.

The line does not care about your mean latency. It cares about your p99.9, because that rare slow frame is the one that drops a part. Measure the distribution explicitly.

# latency_probe.py - measure the tail, not the mean
import numpy as np, time

def profile(detector, frame, n=2000):
    chw = frame  # already preprocessed (1,3,640,640) float32
    samples = np.empty(n)
    for i in range(n):
        t0 = time.perf_counter_ns()
        detector.infer(chw)
        samples[i] = (time.perf_counter_ns() - t0) / 1e6
    for p in (50, 90, 99, 99.9):
        print(f"p{p:>4}: {np.percentile(samples, p):6.2f} ms")
    print(f"max : {samples.max():6.2f} ms")

Run this on the actual board, with clocks pinned, while tegrastats streams in another terminal. Watch the GPU clock and temperature columns during the run. If p99.9 is close to p50 you have a healthy, tight distribution. If the max is several times p50, hunt the spike: it is almost always an allocation on the hot path, a thermal downclock, or a garbage-collection pause.

Verify: confirm p99.9 plus the rest of your committed budget from step 1 still fits inside the hard deadline. If it does not, you found the problem in the lab instead of on the floor.
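
That check can be written down in a few lines. In this sketch I assume infer() covers the TensorRT inference and postprocess lines of the step 1 budget (22 + 4 ms), which leaves 56 ms for everything else. Adjust that split to whatever your probe actually measures. The percentile uses the simple nearest-rank method, so it can differ slightly from np.percentile, which interpolates.

```python
import math


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def fits_deadline(infer_samples_ms, other_committed_ms, deadline_ms, p=99.9):
    """Return (worst_case_ms, ok) for tail inference latency plus the fixed budget."""
    tail = percentile_nearest_rank(infer_samples_ms, p)
    worst_case = tail + other_committed_ms
    return worst_case, worst_case <= deadline_ms


# Synthetic samples for illustration: mostly 20 ms with a handful of 600 ms spikes
samples = [20.0] * 1990 + [600.0] * 10
worst, ok = fits_deadline(samples, other_committed_ms=56.0, deadline_ms=800.0)
print(f"worst case {worst:.0f} ms, fits: {ok}")
```

With these synthetic numbers the rig still fits the 800 ms deadline, but the spikes alone consume most of the slack, which is exactly what a mean would have hidden.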

10. Run It as a Service

Goal: make the rig survive a reboot and restart itself if it crashes.

A pipeline you have to start by hand over SSH is not deployed. Wrap it in a systemd unit so it comes up with the board and restarts on failure.

# /etc/systemd/system/vision-qc.service
[Unit]
Description=Edge Vision Quality Control
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=vision
WorkingDirectory=/opt/vision-qc
# "+" runs this one command as root even though the service runs as User=vision
ExecStartPre=+/usr/bin/jetson_clocks
ExecStart=/usr/bin/python3 /opt/vision-qc/main.py
Restart=always
RestartSec=2
# Bounds the ExecStartPre step. With Type=simple it does not cover main.py's own startup.
TimeoutStartSec=30

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now vision-qc.service
sudo systemctl status vision-qc.service
journalctl -u vision-qc.service -f

The ExecStartPre line re-pins the clocks on every boot, because jetson_clocks does not persist across reboots by default. Restart=always brings the loop back after a crash. With Type=simple, systemd counts the service as started as soon as main.py is launched, so TimeoutStartSec only bounds the ExecStartPre step. The loud failure for a missing camera or PLC has to come from main.py itself: RejectGate already raises ConnectionError when the PLC is unreachable, and the camera setup should exit just as noisily.

Verify: reboot the board and confirm the service comes up on its own and starts processing parts. Then kill the process and watch systemd restart it within a couple of seconds.
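
If you want systemd itself to know when the rig is actually ready, one option is to switch the unit to Type=notify and have main.py report readiness once the camera, engine and PLC are all up. That unit change is an assumption on top of the file above. The sketch below speaks the notification protocol directly over the NOTIFY_SOCKET datagram socket, using only the standard library.

```python
import os
import socket


def sd_notify(message="READY=1"):
    """Send a state message to systemd; return False when not run under systemd."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True
```

Call sd_notify() after the warm-up inferences and the first successful PLC connect. With Type=notify, TimeoutStartSec then covers the whole startup, so a rig that never reaches READY=1 is marked failed instead of appearing healthy.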

Common Pitfalls

Free-running cameras. A camera that streams at a fixed FPS hands you frames that do not line up with parts. Hardware-trigger every acquisition.
Calibrating INT8 on the wrong data. Factory lighting is not COCO. Calibrate on rig-captured frames or accept silent accuracy loss.
Reaching for pycuda CUDA graphs. pycuda has no graph capture API. Use cuda-python (cudart) for graph capture, and keep pycuda only for plain memory and stream work if you prefer it elsewhere.
Allocating on the hot path. Every np.zeros and every buffer allocation inside the loop is a latency spike waiting for the worst possible moment.
Ignoring thermal throttling. Orin quietly downclocks under sustained load in a hot enclosure, and MAXN makes this worse over long runs. Pin clocks, monitor tegrastats, and for three-shift duty run a capped power mode with active cooling.
Treating mean latency as the SLA. The line cares about your p99.9. A rare 500 ms spike still drops parts.

Troubleshooting

Symptom: Inference latency creeps up over hours. Cause: Thermal throttling in the enclosure. Fix: Add active cooling, monitor tegrastats, alert on clock drops below the pinned frequency, and move from raw MAXN to a capped power mode the enclosure can sustain.
Symptom: mAP drops sharply after deploying the INT8 engine. Cause: Calibration cache built from non-representative images. Fix: Rebuild the cache from a few hundred frames captured on the actual rig under production lighting, then re-validate against a held-out set.
Symptom: AttributeError on stream.begin_capture or cuda.Graph. Cause: You are calling CUDA graph methods that pycuda does not implement. Fix: Capture the graph with cuda-python (cuda.bindings.runtime) as in step 6.
Symptom: Random missed parts with no logged error. Cause: Camera buffer pool exhausted, frames silently dropped. Fix: Increase the pre-allocated buffer count and return buffers to the pool immediately after copy.
Symptom: PLC occasionally ignores reject commands. Cause: Modbus write timing out under network load, or a stale slave-address keyword. Fix: Put the vision system and PLC on an isolated VLAN, shorten cabling, confirm the device_id keyword matches your pymodbus version, and verify the fail-closed fallback fires.
Symptom: First few frames after startup are very slow. Cause: TensorRT context and CUDA graph not yet warmed. Fix: Run several warm-up inferences before signalling the line that the system is ready.

Wrapping Up

A vision quality control system lives or dies on its latency distribution and its honest contact with the machinery, not on a leaderboard mAP score. Walk the ten steps in order: budget the deadline, pin the board, trigger the camera, export and calibrate the model, capture the graph correctly, close the loop to the PLC, then validate the tail and deploy it as a service. Get those right and the model becomes a swappable part. Next we will go one level deeper into the component everyone wants to talk about first: training the defect detection model that this pipeline serves.
