Deploying YOLO Models on NVIDIA Jetson with TensorRT


April 7, 2026 · 7 min read · by Muhammad Amal

TL;DR — Build the TensorRT engine on the exact Jetson that will run it / INT8 plus a calibration cache is the latency win that matters / DeepStream handles multi-stream plumbing you should not write yourself

The gap between a YOLO model that works in a notebook and one that runs reliably on a Jetson at the edge is wider than most people expect. The model is the easy part. The hard part is the build toolchain, the precision tuning, the version pinning, and the stream plumbing — the unglamorous infrastructure that decides whether the deployment survives contact with a 24/7 production line.

A YOLO Jetson TensorRT deployment has one rule that overrides all others: the engine is not portable. A .engine file is compiled against a specific TensorRT version, a specific GPU architecture, and a specific CUDA stack. Build it on your laptop and copy it to an Orin and it will refuse to load, or worse, load and produce garbage. The engine must be built on the same hardware and software stack it will run on.

This guide takes a trained YOLOv11 checkpoint — the kind produced in the defect detection training walkthrough — and turns it into a hardened, multi-stream inference service on Jetson Orin using TensorRT 10 and DeepStream 7.

The Toolchain, Pinned

JetPack ships a matched set of CUDA, cuDNN, and TensorRT. Do not mix and match. Note the exact versions and treat them as a contract.

# Verify the stack on the target Orin
dpkg -l | grep -E 'tensorrt|cuda-toolkit|deepstream' | awk '{print $2, $3}'
# Expected (JetPack 6.1):
#   nvidia-tensorrt        10.3.x
#   cuda-toolkit-12-6      12.6.x
#   deepstream-7.1         7.1.x

# Pin them so unattended-upgrades cannot break the deployment
sudo apt-mark hold nvidia-tensorrt cuda-toolkit-12-6 deepstream-7.1

The apt-mark hold is not optional. I have lost a deployment to an overnight package upgrade that bumped TensorRT a minor version and invalidated every engine on the device. Hold the packages and upgrade deliberately.
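
The same contract can also be checked when the service starts, so a drifted stack fails loudly before any engine is loaded. What follows is a minimal sketch under assumptions: `EXPECTED`, `matches_pin`, and `find_drift` are hypothetical names, the pinned values mirror the JetPack 6.1 listing above, and the installed versions would come from something like `tensorrt.__version__` or the dpkg output.

```python
# Hypothetical startup check: compare installed versions against the pinned contract.
EXPECTED = {"tensorrt": "10.3", "cuda": "12.6", "deepstream": "7.1"}  # assumed pins


def matches_pin(installed: str, pinned: str) -> bool:
    """True when the installed version begins with the pinned components."""
    want = pinned.split(".")
    have = installed.split(".")
    # Compare component-wise so "10.30" is not mistaken for "10.3".
    return have[:len(want)] == want


def find_drift(installed: dict, expected: dict = EXPECTED) -> dict:
    """Return {component: (pinned, installed)} for every component that drifted."""
    drift = {}
    for name, pinned in expected.items():
        current = installed.get(name, "missing")
        if not matches_pin(current, pinned):
            drift[name] = (pinned, current)
    return drift


if __name__ == "__main__":
    # Example values only; on a device these would be read from the installed stack.
    installed = {"tensorrt": "10.4.0", "cuda": "12.6.68", "deepstream": "7.1.0"}
    for name, (pinned, current) in find_drift(installed).items():
        print(f"{name}: pinned {pinned}, found {current}")
```

Refusing to start when `find_drift` returns anything is a design choice: a service that exits with a clear message is easier to diagnose than one that fails later with a deserialization error.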

Step 1: Export to ONNX

Export on any machine, but keep the parameters identical to what you will build with. Fix the batch size and image size; dynamic shapes cost you optimization opportunities you do not need.

# Export YOLOv11 to ONNX, opset 17, static shapes
yolo export model=defect_yolo11s.pt format=onnx \
    opset=17 imgsz=640 batch=1 simplify=True dynamic=False

# Sanity-check the graph
python -c "import onnx; m=onnx.load('defect_yolo11s.onnx'); \
    onnx.checker.check_model(m); print('ONNX OK', m.opset_import)"

Step 2: Build the TensorRT Engine on the Orin

Copy the ONNX file to the Orin and build there. INT8 is the precision you want for production — it roughly halves inference time versus FP16 on Orin — but it needs a calibration cache built from representative frames.

# On the Orin: build the INT8 engine
# The ONNX export is static, so no --shapes flag is passed:
# trtexec rejects explicit shapes for a static model.
/usr/src/tensorrt/bin/trtexec \
    --onnx=defect_yolo11s.onnx \
    --saveEngine=defect_yolo11s.int8.engine \
    --int8 --fp16 \
    --calib=calib_cache.bin \
    --builderOptimizationLevel=5 \
    --useCudaGraph \
    --memPoolSize=workspace:4096 \
    --timingCacheFile=timing.cache

The --fp16 flag alongside --int8 lets the builder fall back to FP16 for layers where INT8 hurts accuracy too much — TensorRT picks per-layer. The --timingCacheFile makes subsequent rebuilds dramatically faster by caching kernel benchmark results.

If you do not yet have a calibration cache, generate one with a Python calibrator over a few hundred line-captured frames before running the build. Calibrate on production frames, not stock images, or you trade away real accuracy.
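
One way to keep the calibration set representative is to sample evenly across the whole capture instead of taking the first few hundred frames. This is a sketch, assuming frames are stored as image files whose sorted names follow capture order; `evenly_spaced`, `pick_calibration_frames`, and the default count of 300 are hypothetical choices, not part of any TensorRT API.

```python
from pathlib import Path


def evenly_spaced(items: list, count: int) -> list:
    """Pick up to `count` items spread evenly across `items`, keeping their order."""
    if count <= 0 or not items:
        return []
    if count >= len(items):
        return list(items)
    step = len(items) / count
    # int(i * step) stays below len(items) because i < count.
    return [items[int(i * step)] for i in range(count)]


def pick_calibration_frames(frame_dir: str, count: int = 300,
                            pattern: str = "*.jpg") -> list:
    """Sample calibration frames across the full capture, sorted by filename."""
    frames = sorted(Path(frame_dir).glob(pattern))
    return evenly_spaced(frames, count)
```

Sampling across the full capture makes it more likely that lighting changes and product variants from the whole shift end up in the calibration set.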

# verify_engine.py — confirm the engine loads and matches expected shapes
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("defect_yolo11s.int8.engine", "rb") as f, trt.Runtime(logger) as rt:
    engine = rt.deserialize_cuda_engine(f.read())

assert engine is not None, "engine failed to deserialize"
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    mode = engine.get_tensor_mode(name)
    print(name, mode, engine.get_tensor_shape(name),
          engine.get_tensor_dtype(name))

Step 3: A Standalone Inference Service

For single-stream deployments, a thin TensorRT runtime is enough. Pin host memory, reuse a single CUDA stream, and never allocate in the hot loop. CUDA graph capture is a further optimization once this baseline is stable.

# trt_runner.py — minimal hardened TensorRT 10 runner
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

class TRTRunner:
    def __init__(self, engine_path: str):
        logger = trt.Logger(trt.Logger.ERROR)
        with open(engine_path, "rb") as f, trt.Runtime(logger) as rt:
            self.engine = rt.deserialize_cuda_engine(f.read())
        if self.engine is None:
            raise RuntimeError(f"failed to load engine {engine_path}")
        self.ctx = self.engine.create_execution_context()
        self.stream = cuda.Stream()
        self.bindings = {}
        for i in range(self.engine.num_io_tensors):
            name = self.engine.get_tensor_name(i)
            shape = tuple(self.engine.get_tensor_shape(name))
            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
            host = cuda.pagelocked_empty(shape, dtype)
            dev = cuda.mem_alloc(host.nbytes)
            self.bindings[name] = (host, dev, shape)
            self.ctx.set_tensor_address(name, int(dev))

    def infer(self, name_in: str, data: np.ndarray, name_out: str):
        h_in, d_in, _ = self.bindings[name_in]
        h_out, d_out, _ = self.bindings[name_out]
        np.copyto(h_in, data)
        cuda.memcpy_htod_async(d_in, h_in, self.stream)
        ok = self.ctx.execute_async_v3(self.stream.handle)
        if not ok:
            raise RuntimeError("execute_async_v3 returned false")
        cuda.memcpy_dtoh_async(h_out, d_out, self.stream)
        self.stream.synchronize()
        return h_out.copy()
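
Whether INT8 actually delivers the latency win on a given model is worth measuring on the device itself. The harness below is a rough sketch, assuming any zero-argument callable; `percentile` and `measure_latency` are hypothetical helpers, and with the runner above the callable would wrap a call to `infer` with a preallocated input.

```python
import math
import time


def percentile(sorted_values: list, pct: float) -> float:
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def measure_latency(fn, warmup: int = 20, runs: int = 200) -> dict:
    """Time `fn` and report p50, p99, and max latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "p50_ms": percentile(samples, 50),
        "p99_ms": percentile(samples, 99),
        "max_ms": samples[-1],
    }
```

Reporting p99 alongside the median matters on a production line, because a single slow frame can miss a part even when the average looks healthy.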

Step 4: Multi-Stream with DeepStream 7

The moment you have more than one camera, stop writing your own pipeline. DeepStream 7 handles decode, batching, inference, and tracking across streams on the hardware video engines, leaving the GPU free for inference. You configure it; you do not code it.

# config_infer_yolo11.txt — nvinfer config for DeepStream 7
[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
model-engine-file=defect_yolo11s.int8.engine
labelfile-path=labels.txt
batch-size=4
network-mode=1
num-detected-classes=4
interval=0
gie-unique-id=1
process-mode=1
network-type=0
cluster-mode=2
maintain-aspect-ratio=1
symmetric-padding=1
parse-bbox-func-name=NvDsInferParseYolo
custom-lib-path=/opt/nvidia/deepstream/deepstream/lib/libnvdsinfer_custom_impl_Yolo.so

[class-attrs-all]
nms-iou-threshold=0.45
pre-cluster-threshold=0.40
topk=100
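
The maintain-aspect-ratio=1 and symmetric-padding=1 settings describe letterboxing: the frame is scaled to fit the network input and padded equally on both sides. nvinfer handles this internally, but the standalone runner from Step 3 has to reproduce the same geometry in its own pre- and post-processing. Here is a sketch of that arithmetic, assuming a 640x640 network input and boxes given as (x1, y1, x2, y2) in network pixels; the function names are hypothetical.

```python
def letterbox_params(src_w: int, src_h: int,
                     net_w: int = 640, net_h: int = 640) -> tuple:
    """Return (scale, pad_x, pad_y) for an aspect-preserving, centered resize."""
    scale = min(net_w / src_w, net_h / src_h)
    pad_x = (net_w - src_w * scale) / 2
    pad_y = (net_h - src_h * scale) / 2
    return scale, pad_x, pad_y


def box_to_source(box: tuple, src_w: int, src_h: int,
                  net_w: int = 640, net_h: int = 640) -> tuple:
    """Map a box from letterboxed network coordinates back to the source frame."""
    scale, pad_x, pad_y = letterbox_params(src_w, src_h, net_w, net_h)
    x1, y1, x2, y2 = box

    def clamp(value: float, upper: int) -> float:
        # Boxes that reach into the padding are clipped to the frame.
        return max(0.0, min(float(upper), value))

    return (
        clamp((x1 - pad_x) / scale, src_w),
        clamp((y1 - pad_y) / scale, src_h),
        clamp((x2 - pad_x) / scale, src_w),
        clamp((y2 - pad_y) / scale, src_h),
    )
```

For a 1280x720 frame this gives a scale of 0.5 with 140 pixels of padding above and below, so a box must be shifted up by 140 before it is scaled back.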

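The pipeline itself is assembled from GStreamer elements. The key elements are nvstreammux to batch streams, nvinfer to run the engine, and nvtracker to keep object identities stable across frames. One caveat on the config: batch-size must match the batch dimension the engine was built with. The batch=1 export from Step 1 fits the standalone runner, so for four cameras re-export with batch=4 and rebuild the engine on the Orin.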

# deepstream_pipeline.py — 4-camera YOLO inference pipeline
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib
import sys

Gst.init(None)

def build_pipeline(rtsp_uris: list[str]) -> Gst.Pipeline:
    pipeline = Gst.Pipeline.new("qc-pipeline")
    streammux = Gst.ElementFactory.make("nvstreammux", "mux")
    streammux.set_property("batch-size", len(rtsp_uris))
    streammux.set_property("width", 1280)
    streammux.set_property("height", 720)
    streammux.set_property("batched-push-timeout", 40000)   # 40 ms
    pipeline.add(streammux)

    for i, uri in enumerate(rtsp_uris):
        src = Gst.ElementFactory.make("uridecodebin", f"src-{i}")
        src.set_property("uri", uri)
        pipeline.add(src)
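        def on_pad(_, pad, idx=i):
            # uridecodebin also exposes audio pads; only video feeds the muxer
            caps = pad.get_current_caps() or pad.query_caps(None)
            if not caps.get_structure(0).get_name().startswith("video"):
                return
            # request_pad_simple replaces the deprecated get_request_pad (GStreamer 1.20+)
            sink = streammux.request_pad_simple(f"sink_{idx}")
            if pad.link(sink) != Gst.PadLinkReturn.OK:
                raise RuntimeError(f"failed to link stream {idx}")
        src.connect("pad-added", on_pad)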

    pgie = Gst.ElementFactory.make("nvinfer", "primary-gie")
    pgie.set_property("config-file-path", "config_infer_yolo11.txt")
    tracker = Gst.ElementFactory.make("nvtracker", "tracker")
    tracker.set_property("ll-lib-file",
        "/opt/nvidia/deepstream/deepstream/lib/libnvds_nvmultiobjecttracker.so")
    sink = Gst.ElementFactory.make("fakesink", "sink")

    for el in (pgie, tracker, sink):
        pipeline.add(el)
    streammux.link(pgie)
    pgie.link(tracker)
    tracker.link(sink)
    return pipeline

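def on_bus_message(_bus, msg, loop, status):
    # Quit on error or end-of-stream so the process exits non-zero
    # and a supervisor such as systemd can restart it.
    if msg.type == Gst.MessageType.ERROR:
        err, debug = msg.parse_error()
        print(f"pipeline error: {err} ({debug})", file=sys.stderr)
        status["code"] = 1
        loop.quit()
    elif msg.type == Gst.MessageType.EOS:
        print("unexpected end of stream", file=sys.stderr)
        status["code"] = 1
        loop.quit()

if __name__ == "__main__":
    pipe = build_pipeline(sys.argv[1:])
    loop = GLib.MainLoop()
    status = {"code": 0}
    bus = pipe.get_bus()
    bus.add_signal_watch()
    bus.connect("message", on_bus_message, loop, status)
    pipe.set_state(Gst.State.PLAYING)
    try:
        loop.run()
    except KeyboardInterrupt:
        pass
    finally:
        pipe.set_state(Gst.State.NULL)
    sys.exit(status["code"])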

Production Hardening

A deployment is not done when it runs. Wrap it in a systemd service that restarts on crash, runs after the network is up, and logs to the journal.

# /etc/systemd/system/qc-vision.service
[Unit]
Description=YOLO QC Vision Pipeline
After=network-online.target
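Wants=network-online.target
# Start rate limiting is configured in [Unit] on current systemd
StartLimitIntervalSec=120
StartLimitBurst=5

[Service]
Type=simple
User=qc
# Assumes config_infer_yolo11.txt sits next to the script; relative paths resolve here
WorkingDirectory=/opt/qc
# The "+" prefix runs jetson_clocks with root privileges even though User=qc
ExecStartPre=+/usr/bin/jetson_clocks
ExecStart=/usr/bin/python3 /opt/qc/deepstream_pipeline.py \
    rtsp://cam1/stream rtsp://cam2/stream
Restart=on-failure
RestartSec=5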

[Install]
WantedBy=multi-user.target

Monitor the device with tegrastats piped into your metrics stack. The two numbers to alert on are GPU utilization sustained near 100 percent — you have no headroom for spikes — and any clock frequency dropping below the pinned value, which means thermal throttling.
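
A small parser is usually enough to turn tegrastats lines into metrics. The sketch below assumes the GR3D_FREQ field (GPU load) and name@temperature fields appear in the output; the exact line layout varies between JetPack releases, `SAMPLE` is an illustrative line and not a capture from a device, and the threshold and window values are placeholders. It covers GPU load and temperatures only, not clock frequencies.

```python
import re

GPU_RE = re.compile(r"GR3D_FREQ (\d+)%")
TEMP_RE = re.compile(r"(\w+)@(-?\d+(?:\.\d+)?)C")

# Illustrative line only; real output differs between JetPack releases.
SAMPLE = ("RAM 3021/7620MB (lfb 4x4MB) SWAP 0/3810MB (cached 0MB) "
          "CPU [12%@1510,8%@1510] GR3D_FREQ 97% "
          "cpu@48.5C gpu@47.1C tj@49.2C VDD_IN 9012mW/8800mW")


def parse_tegrastats(line: str) -> dict:
    """Extract GPU utilization and temperatures from one tegrastats line."""
    gpu = GPU_RE.search(line)
    temps = {name: float(value) for name, value in TEMP_RE.findall(line)}
    return {
        "gpu_util_pct": int(gpu.group(1)) if gpu else None,
        "temps_c": temps,
    }


def gpu_saturated(samples: list, threshold: int = 95, window: int = 30) -> bool:
    """True when the last `window` utilization samples are all at or above threshold."""
    recent = samples[-window:]
    return len(recent) == window and all(s >= threshold for s in recent)
```

Alerting on a sustained window instead of a single sample avoids paging on the short bursts that are normal when a batch of frames arrives together.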

Common Pitfalls

Building the engine off-device. A .engine is tied to the exact TensorRT version and GPU. Build on the Orin that runs it.
Unpinned packages. An overnight upgrade can bump TensorRT and invalidate every engine. Use apt-mark hold.
Skipping calibration. INT8 without a representative calibration cache silently loses accuracy. Calibrate on production frames.
Hand-rolling multi-stream pipelines. DeepStream uses the hardware video engines for decode and batching. Reimplementing that in Python wastes the GPU.
No restart policy. A vision service that dies at 3 AM and stays dead until morning lets a full shift of defects through.

Troubleshooting

Symptom: Engine fails to deserialize with a version error. Cause: Built on a different TensorRT version or GPU architecture. Fix: Rebuild the engine on the target device with its installed toolchain.
Symptom: DeepStream pipeline stalls on startup. Cause: An RTSP source pad never linked, so nvstreammux waits forever. Fix: Check the pad-added callback links and confirm every camera URI is reachable.
Symptom: Detections are wildly wrong after switching to INT8. Cause: Calibration cache mismatched or corrupt. Fix: Delete the cache, regenerate it from production frames, and rebuild.
Symptom: FPS drops under sustained load. Cause: Thermal throttling, visible as reduced clocks in tegrastats. Fix: Improve enclosure cooling and confirm jetson_clocks ran via the ExecStartPre hook.
Symptom: Pipeline runs but GPU sits near idle. Cause: Video decode bottlenecked on CPU because uridecodebin chose a software decoder. Fix: Force the hardware decoder and confirm in tegrastats that NVDEC shows as active while the pipeline runs.

What’s Next

A solid YOLO Jetson TensorRT deployment is mostly disciplined toolchain management: pin versions, build on-device, calibrate INT8 honestly, and let DeepStream own the stream plumbing. With inference running predictably on the edge, the next concern is the data coming off it: telemetry and detection events that need to reach dashboards and databases at scale. The canonical TensorRT reference lives at developer.nvidia.com.
