ONNX Runtime on Edge Devices, A Comprehensive Tutorial
April 16, 2025 · 9 min read · by Muhammad Amal · programming

TL;DR: ONNX Runtime 1.20 (October 2024) is the most pragmatic inference runtime for edge in April 2025. Pick the right execution provider (TensorRT or CUDA for Jetson, XNNPACK for Pi, DirectML for Windows IoT), quantize to INT8, set threads explicitly, and share a single session across requests. Skip the defaults and you’ll leave 3-5x performance on the table.

There are roughly three serious inference runtimes you’d consider for the edge in 2025: ONNX Runtime, TensorRT, and TFLite. TensorRT is the right answer if you’re locked into NVIDIA. TFLite is the right answer if you’re on Coral. ONNX Runtime is the right answer when you want one toolchain across all your hardware, and you’re willing to pay a small performance tax for that portability.

This is a hands-on guide to running ONNX Runtime 1.20 on edge devices, with the configuration knobs that actually matter. We’ll cover provider selection, quantization, session config, and a benchmark harness you can copy.

Building on the hardware tour from earlier this month, we’re now picking the runtime that sits on those boards. The Jetson examples assume JetPack 6.1, the Pi 5 examples assume Raspberry Pi OS Bookworm.

1. Execution providers, the only choice that matters

An ONNX Runtime “execution provider” is the backend that actually runs the math. Picking the right one is more important than any other knob.

+----------------------+-----------------------------+-----------------------------+
| Hardware             | Best provider               | Fallback                    |
+----------------------+-----------------------------+-----------------------------+
| Jetson Orin Nano     | TensorrtExecutionProvider   | CUDAExecutionProvider, then |
|                      |  (or CUDAExecutionProvider) |  CPUExecutionProvider       |
| Raspberry Pi 5       | XnnpackExecutionProvider    | CPUExecutionProvider        |
| x86_64 with GPU      | CUDAExecutionProvider       | CPUExecutionProvider        |
| x86_64 CPU only      | CPUExecutionProvider        | (none)                      |
| Windows IoT          | DmlExecutionProvider        | CPUExecutionProvider        |
| ARM CPU only         | XnnpackExecutionProvider    | CPUExecutionProvider        |
+----------------------+-----------------------------+-----------------------------+

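You list providers in priority order. ONNX Runtime assigns each node to the first provider in the list that supports it, and anything left over falls through to the next provider, ending at the CPU. If the accelerated provider is missing from your build or sits after CPUExecutionProvider in the list, your model silently runs on CPU.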
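Here is a minimal sketch of guarding against that silent CPU fallback. It assumes you pass in the list returned by ort.get_available_providers(); the helper name pick_providers is hypothetical and not part of ONNX Runtime.

```python
import warnings


def pick_providers(preferred, available):
    """Keep the preferred providers that this build offers, in priority order.

    `available` is expected to be the list returned by
    onnxruntime.get_available_providers(). pick_providers is a
    hypothetical helper, not an ONNX Runtime API.
    """
    chosen = [p for p in preferred if p in available]
    # Always end with the CPU provider so unsupported ops have somewhere to run.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    if chosen == ["CPUExecutionProvider"] and list(preferred) != ["CPUExecutionProvider"]:
        warnings.warn(f"none of {list(preferred)} is available; running on CPU only")
    return chosen


jetson_order = ["TensorrtExecutionProvider", "CUDAExecutionProvider"]
# Simulates a build that has CUDA but not TensorRT.
print(pick_providers(jetson_order, ["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

Pass the result as providers= when you create the session, then check sess.get_providers() once at startup to see what the session actually ended up with.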

1.1 Installing the right wheel

This is where most people trip. pip install onnxruntime installs the CPU-only wheel. That’s not what you want on a Jetson.

# Jetson Orin Nano with JetPack 6.1 (Python 3.10)
# Pre-built aarch64 GPU wheels are served from the Jetson AI Lab index
pip install --extra-index-url https://pypi.jetson-ai-lab.dev/jp6/cu126 \
  onnxruntime-gpu==1.20.0

# Raspberry Pi 5 (aarch64, Python 3.11)
# CPU wheel; confirm below that XnnpackExecutionProvider is listed.
# If it is missing, you need a build made with --use_xnnpack.
pip install onnxruntime==1.20.0

# x86_64 with NVIDIA GPU
pip install onnxruntime-gpu==1.20.0

# Confirm providers
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Jetson: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
# Pi 5:   ['XnnpackExecutionProvider', 'CPUExecutionProvider']

If you see ['CPUExecutionProvider'] on a Jetson, you installed the wrong wheel.

2. Quantizing your model to INT8

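Most edge gains come from quantization, not from a faster provider. Going from FP32 to INT8 cuts weight size and memory traffic by 4x (one byte per value instead of four) and typically speeds up individual ops by 2-4x on hardware with INT8 kernels.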
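To make the 4x figure concrete, here is a small pure-Python sketch of affine INT8 quantization on a handful of made-up weights. It illustrates the arithmetic only; it is not how ONNX Runtime’s quantizer picks its parameters, and the function names are hypothetical.

```python
import struct


def quantize_int8(values):
    """Asymmetric (affine) quantization of a list of floats to the int8 range."""
    lo = min(min(values), 0.0)            # the range must include 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 if all values are 0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.5, -0.2, 0.0, 0.4, 2.1]     # illustrative values, not from a real model
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

fp32_bytes = len(weights) * struct.calcsize("f")
int8_bytes = len(weights) * struct.calcsize("b")
print(f"codes={q} scale={scale:.5f} zero_point={zp}")
print(f"max round-trip error={max_err:.5f} (half a step is {scale / 2:.5f})")
print(f"storage: {fp32_bytes} bytes as FP32, {int8_bytes} bytes as INT8")
```

The round-trip error stays within half a quantization step here, which is why a calibration range that matches your real activations matters so much: a wider range means a bigger step.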

2.1 Static quantization with a calibration dataset

# quantize_static.py
import glob

import numpy as np
from PIL import Image
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_static,
)
from onnxruntime.quantization.shape_inference import quant_pre_process

class ImageCalibReader(CalibrationDataReader):
    """Feeds preprocessed image batches to the calibrator, one batch per call."""

    def __init__(self, image_paths, input_name, batch_size=1):
        self.image_paths = image_paths
        self.input_name = input_name
        self.batch_size = batch_size
        self.idx = 0

    def get_next(self):
        # Returning None tells the calibrator the data is exhausted
        if self.idx >= len(self.image_paths):
            return None
        batch = self.image_paths[self.idx:self.idx + self.batch_size]
        self.idx += self.batch_size
        # Preprocess to NCHW float32 in [0,1]
        imgs = []
        for p in batch:
            im = Image.open(p).convert("RGB").resize((640, 640))
            arr = np.asarray(im, dtype=np.float32) / 255.0
            arr = arr.transpose(2, 0, 1)  # HWC -> CHW
            imgs.append(arr)
        return {self.input_name: np.stack(imgs, axis=0)}

if __name__ == "__main__":
    # Step 1: pre-process model (fold constants, infer shapes)
    quant_pre_process("yolov8n.onnx", "yolov8n.preproc.onnx")

    # Step 2: collect calibration images (sorted so runs are reproducible)
    calib_paths = sorted(glob.glob("calib_images/*.jpg"))[:200]
    reader = ImageCalibReader(calib_paths, input_name="images")

    # Step 3: quantize
    quantize_static(
        model_input="yolov8n.preproc.onnx",
        model_output="yolov8n.int8.onnx",
        calibration_data_reader=reader,
        quant_format=QuantFormat.QDQ,  # enum, not a string; QDQ is the portable format
        activation_type=QuantType.QInt8,
        weight_type=QuantType.QInt8,
        per_channel=True,
        reduce_range=False,
    )

A few notes from production. per_channel=True gives meaningfully better accuracy at no cost. The QDQ format is portable across providers; QOperator can be faster on some backends but is provider-specific, so benchmark it on your target before switching. Use 100-500 calibration images that match your production distribution; 10 is too few, 5000 is overkill.
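Taking the first 200 files from a directory listing can skew the calibration set toward whatever happens to sort first. A minimal sketch of drawing a reproducible random subset instead; the helper name and the placeholder file names are hypothetical.

```python
import random


def pick_calibration_paths(paths, k=200, seed=0):
    """Return a reproducible random sample of up to k calibration image paths."""
    ordered = sorted(paths)               # directory listing order is not stable
    if len(ordered) <= k:
        return ordered
    return random.Random(seed).sample(ordered, k)


# Placeholder names standing in for a real calibration directory.
all_paths = [f"calib_images/img_{i:05d}.jpg" for i in range(5000)]
subset = pick_calibration_paths(all_paths, k=200)
print(len(subset), subset[:2])
```

A fixed seed keeps the quantized model reproducible between builds. If your production data has distinct conditions (day and night, several camera types), sampling per condition is probably a better fit than one flat draw.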

2.2 Quantization-aware training, when static isn’t enough

If post-training quantization drops your accuracy by more than 2-3 points, you need QAT. That’s a separate (longer) topic. Train in PyTorch with torch.ao.quantization, then export. ONNX Runtime accepts the result like any other model.
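The "2-3 points" rule is easy to automate once you have predictions from both models on the same labeled set. The sketch below uses top-1 classification accuracy as a simplification; for a detector like YOLOv8 you would compare mAP instead. The function names and toy data are hypothetical.

```python
def accuracy_percent(labels, predictions):
    """Top-1 accuracy in percent for two equal-length sequences of class ids."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return 100.0 * correct / len(labels)


def needs_qat(labels, fp32_preds, int8_preds, max_drop_points=3.0):
    """Return (flag, drop): flag is True when INT8 loses more than max_drop_points."""
    drop = accuracy_percent(labels, fp32_preds) - accuracy_percent(labels, int8_preds)
    return drop > max_drop_points, drop


# Toy data: ten samples, three classes.
labels     = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
fp32_preds = [0, 1, 2, 1, 0, 2, 1, 0, 2, 0]
int8_preds = [0, 1, 2, 1, 0, 2, 0, 0, 1, 0]
flag, drop = needs_qat(labels, fp32_preds, int8_preds)
print(f"drop = {drop:.1f} points, consider QAT: {flag}")
```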

3. Session configuration that’s not the default

The default InferenceSession() constructor gives you bad defaults for edge. Here’s a config that’s been tested on Pi 5 and Jetson:

# infer_session.py
import onnxruntime as ort

def make_session(model_path: str, is_jetson: bool) -> ort.InferenceSession:
    so = ort.SessionOptions()

    # Threading: explicit, not whatever ORT guesses.
    # GPU providers need few CPU threads; the Pi 5 uses all four cores.
    so.intra_op_num_threads = 2 if is_jetson else 4   # parallelism within an op
    so.inter_op_num_threads = 1   # parallelism between ops; usually 1 is best

    # Graph optimization: enable all
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

    # Memory: keep the CPU arena and memory-pattern planning on (both defaults);
    # the GPU arena is capped separately via gpu_mem_limit below
    so.enable_mem_pattern = True
    so.enable_cpu_mem_arena = True

    # Execution mode: sequential is faster for most models on edge
    so.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

    # Logging: warnings only in production
    so.log_severity_level = 2

    if is_jetson:
        providers = [
            ("TensorrtExecutionProvider", {
                "trt_engine_cache_enable": True,
                "trt_engine_cache_path": "/var/cache/trt_engines",
                "trt_fp16_enable": True,
                "trt_int8_enable": True,
                "trt_int8_calibration_table_name": "calib.flatbuffers",
                "trt_max_workspace_size": 1 << 30,  # 1 GB
            }),
            ("CUDAExecutionProvider", {
                "device_id": 0,
                "arena_extend_strategy": "kNextPowerOfTwo",
                "gpu_mem_limit": 2 * 1024 * 1024 * 1024,  # 2 GB
                "cudnn_conv_algo_search": "EXHAUSTIVE",
                "do_copy_in_default_stream": True,
            }),
            "CPUExecutionProvider",
        ]
    else:
        providers = [
            ("XnnpackExecutionProvider", {"intra_op_num_threads": 4}),
            "CPUExecutionProvider",
        ]

    return ort.InferenceSession(model_path, sess_options=so, providers=providers)

The TensorRT provider’s first call is slow. It compiles the model into a TRT engine, which takes 30-300 seconds depending on model size. trt_engine_cache_enable=True caches the compiled engine to disk so subsequent runs load it instead of rebuilding. Make sure the process can write to trt_engine_cache_path.

4. The inference loop, doing it right

Here’s a Python loop that’s been profiled and is what I run in production:

# bench_infer.py
import numpy as np
import time
from infer_session import make_session

sess = make_session("yolov8n.int8.onnx", is_jetson=False)

input_name = sess.get_inputs()[0].name
input_shape = sess.get_inputs()[0].shape  # ['N', 3, 640, 640]

# Pre-allocate input buffer; this avoids per-call allocation
batch = np.zeros((1, 3, 640, 640), dtype=np.float32)

# Bind input and outputs once so every run reuses the same buffers
io_binding = sess.io_binding()
io_binding.bind_cpu_input(input_name, batch)
for out in sess.get_outputs():
    io_binding.bind_output(out.name)

# Warmup
for _ in range(20):
    sess.run_with_iobinding(io_binding)

# Benchmark
N = 1000
start = time.perf_counter()
for _ in range(N):
    sess.run_with_iobinding(io_binding)
elapsed = time.perf_counter() - start
print(f"{N/elapsed:.1f} FPS, {elapsed/N*1000:.2f} ms/iter")

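run_with_iobinding is the API you want on edge. Binding inputs and outputs once lets every call reuse the same buffers instead of setting them up again as run() does. On Jetson with the CUDA provider, this is a 20-30% speedup on small models because the copies dominate. Note that bind_cpu_input still copies the array from host to device on each run; to skip that copy as well, bind an OrtValue that already lives on the GPU with bind_ortvalue_input.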
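The loop above reports a mean, which hides tail latency. A minimal sketch of a harness that also reports p50 and p99 for any zero-argument callable; the benchmark function is a hypothetical helper, and the workload here is a stand-in so the snippet runs without a model.

```python
import statistics
import time


def benchmark(step, warmup=20, iters=1000):
    """Time a zero-argument callable and return latency statistics in ms."""
    for _ in range(warmup):
        step()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        step()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    mean_ms = statistics.fmean(samples)
    return {
        "mean_ms": mean_ms,
        "p50_ms": samples[len(samples) // 2],
        "p99_ms": samples[min(len(samples) - 1, int(len(samples) * 0.99))],
        "fps": 1000.0 / mean_ms,
    }


# Stand-in workload. With a real session you would pass
# lambda: sess.run_with_iobinding(io_binding)
stats = benchmark(lambda: sum(range(1000)), warmup=5, iters=200)
print({k: round(v, 4) for k, v in stats.items()})
```

On a thermally constrained board, p99 is often the number that decides whether you hit your frame deadline, so it is worth logging alongside FPS.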

5. Calling ONNX Runtime from Go

ONNX Runtime ships a C API, and there are CGo bindings. The maintained one is github.com/yalue/onnxruntime_go (v1.16.0 in April 2025).

// main.go
package main

import (
    "fmt"
    "log"
    "time"

    ort "github.com/yalue/onnxruntime_go"
)

func main() {
    ort.SetSharedLibraryPath("/usr/local/lib/libonnxruntime.so.1.20.0")
    if err := ort.InitializeEnvironment(); err != nil {
        log.Fatal(err)
    }
    defer ort.DestroyEnvironment()

    inputShape := ort.NewShape(1, 3, 640, 640)
    inputData := make([]float32, 1*3*640*640)
    inputTensor, err := ort.NewTensor(inputShape, inputData)
    if err != nil { log.Fatal(err) }
    defer inputTensor.Destroy()

    // Output shape depends on model; YOLOv8n produces 1x84x8400
    outputShape := ort.NewShape(1, 84, 8400)
    outputData := make([]float32, 1*84*8400)
    outputTensor, err := ort.NewTensor(outputShape, outputData)
    if err != nil { log.Fatal(err) }
    defer outputTensor.Destroy()

    sess, err := ort.NewAdvancedSession(
        "yolov8n.int8.onnx",
        []string{"images"},
        []string{"output0"},
        []ort.Value{inputTensor},
        []ort.Value{outputTensor},
        nil, // default session options
    )
    if err != nil { log.Fatal(err) }
    defer sess.Destroy()

    // Warmup
    for i := 0; i < 20; i++ {
        if err := sess.Run(); err != nil { log.Fatal(err) }
    }

    // Bench
    const N = 1000
    start := time.Now()
    for i := 0; i < N; i++ {
        if err := sess.Run(); err != nil { log.Fatal(err) }
    }
    elapsed := time.Since(start)
    fmt.Printf("%.1f FPS, %.2f ms/iter\n", float64(N)/elapsed.Seconds(), float64(elapsed.Microseconds())/float64(N)/1000.0)
}

The Go wrapper is thin over the C API, which means the same provider selection and quantization apply. The Go side is for when you want to colocate inference with your network or aggregation logic without paying the Python startup cost.

6. Throughput numbers from the field

Real numbers from April 2025 benchmarks on YOLOv8n at 640x640, INT8 quantized, batch=1.

+----------------------+-----------+-----------+-----------+
| Hardware             | Provider  | FPS       | Latency   |
+----------------------+-----------+-----------+-----------+
| Jetson Orin Nano Sup | TensorRT  | 162       | 6.2 ms    |
| Jetson Orin Nano Sup | CUDA      | 88        | 11.4 ms   |
| Raspberry Pi 5       | XNNPACK   | 23        | 43 ms     |
| Raspberry Pi 5       | CPU       | 11        | 91 ms     |
| x86 Ryzen 7700X      | CPU       | 95        | 10.5 ms   |
| x86 + RTX 4070       | CUDA      | 720       | 1.4 ms    |
+----------------------+-----------+-----------+-----------+

Headline takeaways. TensorRT on Jetson is 2x CUDA, so always use it if your model converts cleanly. XNNPACK on Pi 5 is 2x naive CPU, so don’t skip it. INT8 vs FP32 is roughly 2x on most providers; do the quantization work.
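As a quick sanity check, the ratios behind those takeaways can be recomputed from the table. The sketch below copies the rows as printed and assumes batch=1, so FPS should be close to 1000 divided by the latency in milliseconds.

```python
# (hardware, provider, fps, latency_ms) copied from the table above
rows = [
    ("Jetson Orin Nano Sup", "TensorRT", 162, 6.2),
    ("Jetson Orin Nano Sup", "CUDA", 88, 11.4),
    ("Raspberry Pi 5", "XNNPACK", 23, 43.0),
    ("Raspberry Pi 5", "CPU", 11, 91.0),
    ("x86 Ryzen 7700X", "CPU", 95, 10.5),
    ("x86 + RTX 4070", "CUDA", 720, 1.4),
]
fps = {(hw, prov): f for hw, prov, f, _ in rows}


def speedup(hardware, fast, slow):
    """FPS ratio between two providers on the same hardware."""
    return fps[(hardware, fast)] / fps[(hardware, slow)]


for hw, prov, f, latency_ms in rows:
    implied = 1000.0 / latency_ms   # at batch=1, FPS is 1000 / latency in ms
    print(f"{hw:22s} {prov:9s} reported {f:4d} FPS, implied {implied:6.1f}")

trt_vs_cuda = speedup("Jetson Orin Nano Sup", "TensorRT", "CUDA")
xnn_vs_cpu = speedup("Raspberry Pi 5", "XNNPACK", "CPU")
print(f"TensorRT vs CUDA on Jetson: {trt_vs_cuda:.2f}x")
print(f"XNNPACK vs CPU on Pi 5:     {xnn_vs_cpu:.2f}x")
```

The "2x" figures are rounded: the table works out to a bit under 2x for TensorRT over CUDA and a bit over 2x for XNNPACK over the plain CPU provider.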

7. Common Pitfalls

Pitfall 1, leaving threading at the default

The default intra_op_num_threads is “number of cores,” which on a Jetson is 6. That’s almost always wrong. The accelerator provider doesn’t use CPU threads much, and oversubscribing causes context-switching overhead. Set intra_op_num_threads = 2 for CUDA/TRT, intra_op_num_threads = 4 for XNNPACK on Pi 5.
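One way to keep that advice from drifting is to derive the thread count from the primary provider. A minimal sketch; the mapping and the function name are hypothetical helpers built from the numbers above, not an ONNX Runtime API.

```python
import os

# Thread counts taken from the advice above.
_THREADS_BY_PROVIDER = {
    "TensorrtExecutionProvider": 2,
    "CUDAExecutionProvider": 2,
    "XnnpackExecutionProvider": 4,
}


def intra_op_threads_for(primary_provider, cpu_count=None):
    """Pick intra_op_num_threads for the provider that does most of the work."""
    cores = cpu_count or os.cpu_count() or 1
    wanted = _THREADS_BY_PROVIDER.get(primary_provider, cores)
    return max(1, min(wanted, cores))   # never ask for more threads than cores


print(intra_op_threads_for("CUDAExecutionProvider", cpu_count=6))
print(intra_op_threads_for("XnnpackExecutionProvider", cpu_count=4))
```

You would assign the result to so.intra_op_num_threads before creating the session.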

Pitfall 2, creating a session per request

InferenceSession() is heavy. It loads the model, optimizes the graph, and (on TRT) compiles. Per-request session creation will kill your throughput. Create one session at startup, share it across requests. The session is thread-safe for run() calls.
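A minimal sketch of the create-once, share-everywhere pattern, using a lock so that concurrent first requests do not each build their own session. SharedSession is a hypothetical helper; the factory here returns a plain object so the snippet runs without a model.

```python
import threading


class SharedSession:
    """Create an expensive object once and hand the same instance to every caller."""

    def __init__(self, factory):
        self._factory = factory      # e.g. lambda: make_session(path, is_jetson=False)
        self._lock = threading.Lock()
        self._session = None

    def get(self):
        if self._session is None:            # fast path once initialised
            with self._lock:
                if self._session is None:    # re-check: another thread may have won
                    self._session = self._factory()
        return self._session


calls = []
# Stand-in factory that records how often it runs.
holder = SharedSession(lambda: calls.append(1) or object())
threads = [threading.Thread(target=holder.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"factory ran {len(calls)} time(s)")
```

Sharing the session is safe for run(), but a single io_binding with pre-allocated buffers is shared mutable state, so give each worker thread its own binding.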

Pitfall 3, forgetting to disable mem_pattern on dynamic shapes

If your input shape varies (e.g., variable batch size), enable_mem_pattern=True will re-plan memory on every shape change. Set enable_mem_pattern=False for dynamic-shape models. This catches a lot of people who deploy LLMs to the edge.
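A small sketch of detecting dynamic inputs from the shape list that sess.get_inputs()[i].shape returns, where symbolic dimensions show up as strings such as 'N' (or as None). The helper name is hypothetical.

```python
def has_dynamic_dims(shape):
    """True if any dimension is symbolic (a string) or unknown (None)."""
    return any(not isinstance(d, int) for d in shape)


print(has_dynamic_dims(["N", 3, 640, 640]))   # symbolic batch dimension
print(has_dynamic_dims([1, 3, 640, 640]))     # fully static
```

Because SessionOptions must be set before the session exists, this check fits best as a one-off inspection at build or deploy time, with the result stored in your config as the value for enable_mem_pattern.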

Pitfall 4, TensorRT engine cache invalidation

The TRT engine cache key depends on model hash, GPU model, TRT version, and a few other things. If you update any of those, the cache silently rebuilds, which looks like a 5-minute startup hang. Either pre-build the engine and ship it with your app, or warn loudly on rebuild.
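A minimal sketch of the "warn loudly" option. It assumes cached engines are written into the cache directory with an .engine suffix, which you should verify on your own device; both function names are hypothetical, and the factory is any callable that builds your session.

```python
import logging
import tempfile
import time
from pathlib import Path

log = logging.getLogger("trt_cache")


def engine_cache_is_warm(cache_dir):
    """True if the cache directory already holds at least one engine file.

    Assumption: cached engines use an .engine suffix.
    """
    path = Path(cache_dir)
    return path.is_dir() and any(path.glob("*.engine"))


def create_with_rebuild_warning(factory, cache_dir, slow_after_s=10.0):
    """Run a session factory, logging loudly when a TensorRT build is likely."""
    if not engine_cache_is_warm(cache_dir):
        log.warning("no cached engine in %s; expect a long first start", cache_dir)
    t0 = time.perf_counter()
    session = factory()
    elapsed = time.perf_counter() - t0
    if elapsed > slow_after_s:
        log.warning("session creation took %.1f s; the engine was probably rebuilt", elapsed)
    return session


# Demonstration against a temporary directory instead of a real cache.
with tempfile.TemporaryDirectory() as d:
    cold = engine_cache_is_warm(d)
    (Path(d) / "model.engine").write_bytes(b"")
    warm = engine_cache_is_warm(d)
print(cold, warm)
```

A warm directory does not guarantee a cache hit, since a stale engine for an older TRT version still counts as a file. That is why the sketch also times session creation.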

8. Troubleshooting

“No CUDA provider available” on Jetson

You installed onnxruntime instead of onnxruntime-gpu, or you installed the x86_64 GPU wheel which doesn’t include the right CUDA bindings for Tegra. Reinstall from the Jetson AI Lab index.

Inference works but accuracy dropped 10 points after quantization

Almost always a calibration issue. Either your calibration images don’t match production distribution, or you forgot per_channel=True. Re-run with more diverse calibration data. If that doesn’t help, you need QAT.

Memory grows over time

You’re probably creating tensors per request without releasing them. The Go bindings make this explicit (defer tensor.Destroy()). The Python bindings rely on garbage collection, so memory still grows if you keep references around. Use io_binding and reuse buffers.

9. Wrapping Up

ONNX Runtime 1.20 on edge is a mature, well-supported toolchain in April 2025. The pattern is: pick the right provider for the hardware, quantize to INT8 with real calibration data, configure threads and memory explicitly, and reuse sessions. Do that and you’ll get within 20% of the hardware-native runtime for a fraction of the porting effort.

Next post moves to streaming inference pipelines, where ONNX Runtime is one node in a Kafka-fed system. The single-board benchmarks here are the building block; next we wire them together.

The official ONNX Runtime docs are at onnxruntime.ai/docs and the model zoo at github.com/onnx/models is genuinely useful for getting started.