ONNX Runtime on Edge Devices, A Comprehensive Tutorial
TL;DR: ONNX Runtime 1.20 (October 2024) is the most pragmatic inference runtime for edge in April 2025. Pick the right execution provider (CUDA or TensorRT for Jetson, XNNPACK for Pi, DirectML for Windows IoT), quantize to INT8, set threads explicitly, and share a single session across requests. Stick with the defaults and you’ll leave 3-5x performance on the table.
There are roughly three serious inference runtimes you’d consider for the edge in 2025: ONNX Runtime, TensorRT, and TFLite. TensorRT is the right answer if you’re locked into NVIDIA. TFLite is the right answer if you’re on Coral. ONNX Runtime is the right answer when you want one toolchain across all your hardware, and you’re willing to pay a small performance tax for that portability.
This is a hands-on guide to running ONNX Runtime 1.20 on edge devices, with the configuration knobs that actually matter. We’ll cover provider selection, quantization, session config, and a benchmark harness you can copy.
Building on the hardware tour from earlier this month, we’re now picking the runtime that sits on those boards. The Jetson examples assume JetPack 6.1, the Pi 5 examples assume Raspberry Pi OS Bookworm.
1. Execution providers, the only choice that matters
An ONNX Runtime “execution provider” is the backend that actually runs the math. Picking the right one is more important than any other knob.
+----------------------+---------------------------+---------------------------+
| Hardware             | Best provider             | Fallback                  |
+----------------------+---------------------------+---------------------------+
| Jetson Orin Nano     | TensorrtExecutionProvider | CUDAExecutionProvider     |
| Raspberry Pi 5       | XnnpackExecutionProvider  | CPUExecutionProvider      |
| x86_64 with GPU      | CUDAExecutionProvider     | CPUExecutionProvider      |
| x86_64 CPU only      | CPUExecutionProvider      | (none)                    |
| Windows IoT          | DmlExecutionProvider      | CPUExecutionProvider      |
| ARM CPU only         | XnnpackExecutionProvider  | CPUExecutionProvider      |
+----------------------+---------------------------+---------------------------+
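You list providers in priority order. ONNX Runtime assigns each node in the graph to the first provider in the list that supports it and hands whatever is left to the next one. Get the order wrong, or list a provider your build doesn’t include, and some or all of the model quietly runs on CPU.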
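One way to guard against that quiet CPU fallback is to filter your preference list against what the installed build actually reports, and refuse to start if only the CPU provider is left. This is a minimal sketch: order_providers and require_accelerator are hypothetical helper names, and the available list is hard-coded here where real code would pass in ort.get_available_providers().

```python
from typing import Iterable, List


def order_providers(preferred: Iterable[str], available: Iterable[str]) -> List[str]:
    """Keep the preferred priority order, dropping providers this build lacks."""
    available_set = set(available)
    chosen = [p for p in preferred if p in available_set]
    # CPUExecutionProvider is the universal last resort; append it if missing.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def require_accelerator(chosen: List[str]) -> None:
    """Raise instead of silently running everything on CPU."""
    if chosen == ["CPUExecutionProvider"]:
        raise RuntimeError("no accelerated provider available; check the installed wheel")


jetson_pref = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
# Stand-in for onnxruntime.get_available_providers() on a build without TensorRT.
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]

chosen = order_providers(jetson_pref, available)
require_accelerator(chosen)
print(chosen)
```

After the session is built, sess.get_providers() reports what was actually registered, which is worth logging once at startup.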
1.1 Installing the right wheel
This is where most people trip. pip install onnxruntime installs the CPU-only wheel. That’s not what you want on a Jetson.
# Jetson Orin Nano with JetPack 6.1 (Python 3.10)
# Pre-built Jetson wheels come from the Jetson AI Lab package index
pip install --extra-index-url https://pypi.jetson-ai-lab.dev/jp6/cu126 \
    onnxruntime-gpu==1.20.0
# Raspberry Pi 5 (aarch64, Python 3.11)
# CPU wheel; verify XnnpackExecutionProvider appears in the check below
# (if it doesn't, you need a build with XNNPACK enabled)
pip install onnxruntime==1.20.0
# x86_64 with NVIDIA GPU
pip install onnxruntime-gpu==1.20.0
# Confirm providers
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Jetson: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
# Pi 5: ['XnnpackExecutionProvider', 'CPUExecutionProvider']
If you see ['CPUExecutionProvider'] on a Jetson, you installed the wrong wheel.
2. Quantizing your model to INT8
Most edge gains come from quantization, not from a faster provider. Going from FP32 to INT8 cuts weight size and memory traffic by 4x, and individual ops typically run 2-4x faster; end to end, expect roughly 2x.
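To make the 4x figure concrete, here is a small pure-Python sketch of symmetric INT8 quantization on a handful of weights. It is a simplification for illustration (one scale, zero-point fixed at 0) and is not meant to mirror what quantize_static does internally; the function names and sample values are made up.

```python
from typing import List, Tuple


def quantize_symmetric_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to int8 with a single scale (symmetric, zero-point = 0)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the int8 codes."""
    return [x * scale for x in q]


weights = [0.02, -0.75, 0.31, 1.27, -1.10]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print(f"scale={scale:.4f}, max round-trip error={max_err:.5f}")
# Storage: 4 bytes per float32 weight vs 1 byte per int8 weight.
size_ratio = (4 * len(weights)) / (1 * len(weights))
print("size ratio:", size_ratio)
```

The round-trip error is bounded by half a quantization step (scale / 2), which is why the calibration range matters so much: a few outliers inflate the scale and cost precision everywhere else.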
2.1 Static quantization with a calibration dataset
# quantize_static.py
import glob
import numpy as np
from PIL import Image
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_static,
)
from onnxruntime.quantization.shape_inference import quant_pre_process
class ImageCalibReader(CalibrationDataReader):
    """Feeds preprocessed image batches to the calibrator, one batch per call."""
    def __init__(self, image_paths, input_name, batch_size=1):
        self.image_paths = image_paths
        self.input_name = input_name
        self.batch_size = batch_size
        self.idx = 0
    def get_next(self):
        # Returning None tells the calibrator the data is exhausted
        if self.idx >= len(self.image_paths):
            return None
        batch = self.image_paths[self.idx:self.idx + self.batch_size]
        self.idx += self.batch_size
        # Preprocess to NCHW float32 in [0,1]; must match production preprocessing
        imgs = []
        for p in batch:
            im = Image.open(p).convert("RGB").resize((640, 640))
            arr = np.asarray(im, dtype=np.float32) / 255.0
            arr = arr.transpose(2, 0, 1)  # HWC -> CHW
            imgs.append(arr)
        return {self.input_name: np.stack(imgs, axis=0)}
if __name__ == "__main__":
    # Step 1: pre-process model (fold constants, infer shapes)
    quant_pre_process("yolov8n.onnx", "yolov8n.preproc.onnx")
    # Step 2: collect calibration images
    calib_paths = glob.glob("calib_images/*.jpg")[:200]
    reader = ImageCalibReader(calib_paths, input_name="images")
    # Step 3: quantize
    quantize_static(
        model_input="yolov8n.preproc.onnx",
        model_output="yolov8n.int8.onnx",
        calibration_data_reader=reader,
        quant_format=QuantFormat.QDQ,  # enum, not a string; QDQ is the portable format
        activation_type=QuantType.QInt8,
        weight_type=QuantType.QInt8,
        per_channel=True,
        reduce_range=False,
    )
A few notes from production. per_channel=True gives meaningfully better accuracy at essentially no runtime cost. The QDQ format is the portable choice across providers; QOperator can be faster on some backends but is provider-specific, so check operator coverage on your target before committing to it. Use 100-500 calibration images that match your production distribution; 10 is too few, 5000 is overkill.
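On the calibration set itself: taking the first 200 results of a glob gives you whatever the filesystem lists first, which can easily be one camera or one afternoon. A seeded random sample is a cheap improvement. The helper below is a sketch; sample_calibration is a hypothetical name and the file paths are invented for the demo.

```python
import random
from typing import List, Sequence


def sample_calibration(paths: Sequence[str], k: int = 200, seed: int = 0) -> List[str]:
    """Pick up to k paths uniformly at random, reproducibly."""
    if k <= 0:
        raise ValueError("k must be positive")
    ordered = sorted(paths)      # glob order is filesystem-dependent; sort first
    if len(ordered) <= k:
        return ordered
    rng = random.Random(seed)    # local RNG so the global random state is untouched
    return rng.sample(ordered, k)


# Invented example: three cameras, 500 frames each.
all_paths = [f"calib_images/cam{c}_{i:04d}.jpg" for c in range(3) for i in range(500)]
subset = sample_calibration(all_paths, k=200, seed=42)
print(len(subset), "images sampled")
```

Fixing the seed means a re-run of the quantization script produces the same calibration set, which makes accuracy regressions easier to bisect.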
2.2 Quantization-aware training, when static isn’t enough
If post-training quantization drops your accuracy by more than 2-3 points, you need QAT. That’s a separate (longer) topic. Train in PyTorch with torch.ao.quantization, then export. ONNX Runtime accepts the result like any other model.
3. Session configuration that’s not the default
The default InferenceSession() constructor gives you bad defaults for edge. Here’s a config that’s been tested on Pi 5 and Jetson:
# infer_session.py
import onnxruntime as ort
def make_session(model_path: str, is_jetson: bool) -> ort.InferenceSession:
    so = ort.SessionOptions()
    # Threading: explicit, not whatever ORT guesses.
    # 2 is plenty when the GPU does the work; 4 matches the Pi 5's four cores.
    so.intra_op_num_threads = 2 if is_jetson else 4  # parallelism within an op
    so.inter_op_num_threads = 1  # parallelism between ops; usually 1 is best
    # Graph optimization: enable all
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    # Memory: keep the CPU arena and memory-pattern planning on (both defaults);
    # the GPU arena is capped below via gpu_mem_limit
    so.enable_mem_pattern = True
    so.enable_cpu_mem_arena = True
    # Execution mode: sequential is faster for most models on edge
    so.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
    # Logging: warnings only in production
    so.log_severity_level = 2
    if is_jetson:
        providers = [
            ("TensorrtExecutionProvider", {
                "trt_engine_cache_enable": True,
                "trt_engine_cache_path": "/var/cache/trt_engines",
                "trt_fp16_enable": True,
                "trt_int8_enable": True,
                # A QDQ model carries its own scales. Only set
                # "trt_int8_calibration_table_name" for non-QDQ models.
                "trt_max_workspace_size": 1 << 30,  # 1 GB
            }),
            ("CUDAExecutionProvider", {
                "device_id": 0,
                "arena_extend_strategy": "kNextPowerOfTwo",
                "gpu_mem_limit": 2 * 1024 * 1024 * 1024,  # 2 GB
                "cudnn_conv_algo_search": "EXHAUSTIVE",
                "do_copy_in_default_stream": True,
            }),
            "CPUExecutionProvider",
        ]
    else:
        providers = [
            ("XnnpackExecutionProvider", {"intra_op_num_threads": 4}),
            "CPUExecutionProvider",
        ]
    return ort.InferenceSession(model_path, sess_options=so, providers=providers)
The TensorRT provider is slow the first time it sees a model. It compiles the model into a TRT engine, which takes 30-300 seconds depending on model size. trt_engine_cache_enable=True writes the compiled engine to disk so later runs load it instead of recompiling.
4. The inference loop, doing it right
Here’s a Python loop that’s been profiled and is what I run in production:
# bench_infer.py
import numpy as np
import time
from infer_session import make_session
sess = make_session("yolov8n.int8.onnx", is_jetson=False)
input_name = sess.get_inputs()[0].name
input_shape = sess.get_inputs()[0].shape # ['N', 3, 640, 640]
# Pre-allocate input buffer; this avoids per-call allocation
batch = np.zeros((1, 3, 640, 640), dtype=np.float32)
# Bind input and outputs once so every run reuses the same bindings
io_binding = sess.io_binding()
io_binding.bind_cpu_input(input_name, batch)
for out in sess.get_outputs():
    io_binding.bind_output(out.name)
# Warmup
for _ in range(20):
    sess.run_with_iobinding(io_binding)
# Benchmark
N = 1000
start = time.perf_counter()
for _ in range(N):
    sess.run_with_iobinding(io_binding)
elapsed = time.perf_counter() - start
print(f"{N/elapsed:.1f} FPS, {elapsed/N*1000:.2f} ms/iter")
run_with_iobinding is the API you want on edge. It avoids the implicit input copy that run() does. On Jetson with CUDA provider, this is a 20-30% speedup on small models because the copies dominate.
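Mean FPS hides tail latency, and on edge boards the tail (thermal throttling, a background job waking up) is often what breaks a real-time budget. Here is a sketch of a harness that times each call and reports percentiles. The bench function and its dictionary keys are my own naming, and the lambda is a stand-in workload so the snippet runs without a model.

```python
import math
import statistics
import time
from typing import Callable, Dict


def bench(fn: Callable[[], object], warmup: int = 20, iters: int = 1000) -> Dict[str, float]:
    """Time fn() per call and report mean and tail latency in milliseconds."""
    if iters <= 0:
        raise ValueError("iters must be positive")
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()

    def pct(p: float) -> float:
        # Nearest-rank percentile on the sorted samples.
        rank = math.ceil(p / 100.0 * len(samples))
        return samples[max(0, rank - 1)]

    mean_ms = statistics.fmean(samples)
    return {
        "mean_ms": mean_ms,
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
        "fps": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }


# Stand-in workload; with a real session this would be
# bench(lambda: sess.run_with_iobinding(io_binding))
stats = bench(lambda: sum(range(10_000)), warmup=5, iters=200)
print({k: round(v, 3) for k, v in stats.items()})
```

If p99 is several times p50, look at thermals and CPU frequency scaling before touching the model.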
5. Calling ONNX Runtime from Go
ONNX Runtime ships a C API, and there are CGo bindings. The maintained one is github.com/yalue/onnxruntime_go (v1.16.0 in April 2025).
// main.go
package main
import (
	"fmt"
	"log"
	"time"
	ort "github.com/yalue/onnxruntime_go"
)
func main() {
	ort.SetSharedLibraryPath("/usr/local/lib/libonnxruntime.so.1.20.0")
	if err := ort.InitializeEnvironment(); err != nil {
		log.Fatal(err)
	}
	defer ort.DestroyEnvironment()
	inputShape := ort.NewShape(1, 3, 640, 640)
	inputData := make([]float32, 1*3*640*640)
	inputTensor, err := ort.NewTensor(inputShape, inputData)
	if err != nil {
		log.Fatal(err)
	}
	defer inputTensor.Destroy()
	// Output shape depends on model; YOLOv8n produces 1x84x8400
	outputShape := ort.NewShape(1, 84, 8400)
	outputData := make([]float32, 1*84*8400)
	outputTensor, err := ort.NewTensor(outputShape, outputData)
	if err != nil {
		log.Fatal(err)
	}
	defer outputTensor.Destroy()
	sess, err := ort.NewAdvancedSession(
		"yolov8n.int8.onnx",
		[]string{"images"},
		[]string{"output0"},
		[]ort.Value{inputTensor},
		[]ort.Value{outputTensor},
		nil, // default session options
	)
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Destroy()
	// Warmup
	for i := 0; i < 20; i++ {
		if err := sess.Run(); err != nil {
			log.Fatal(err)
		}
	}
	// Bench
	const N = 1000
	start := time.Now()
	for i := 0; i < N; i++ {
		if err := sess.Run(); err != nil {
			log.Fatal(err)
		}
	}
	elapsed := time.Since(start)
	fps := float64(N) / elapsed.Seconds()
	msPerIter := float64(elapsed.Microseconds()) / float64(N) / 1000.0
	fmt.Printf("%.1f FPS, %.2f ms/iter\n", fps, msPerIter)
}
The Go wrapper is thin over the C API, which means the same provider selection and quantization apply. The Go side is for when you want to colocate inference with your network or aggregation logic without paying the Python startup cost.
6. Throughput numbers from the field
Real numbers from April 2025 benchmarks on YOLOv8n at 640x640, INT8 quantized, batch=1.
+------------------------+-----------+-----------+-----------+
| Hardware               | Provider  | FPS       | Latency   |
+------------------------+-----------+-----------+-----------+
| Jetson Orin Nano Super | TensorRT  | 162       | 6.2 ms    |
| Jetson Orin Nano Super | CUDA      | 88        | 11.4 ms   |
| Raspberry Pi 5         | XNNPACK   | 23        | 43 ms     |
| Raspberry Pi 5         | CPU       | 11        | 91 ms     |
| x86 Ryzen 7700X        | CPU       | 95        | 10.5 ms   |
| x86 + RTX 4070         | CUDA      | 720       | 1.4 ms    |
+------------------------+-----------+-----------+-----------+
Headline takeaways. TensorRT on Jetson is about 1.8x CUDA (162 vs 88 FPS), so use it whenever your model converts cleanly. XNNPACK on Pi 5 is about 2x the default CPU provider (23 vs 11 FPS), so don’t skip it. INT8 vs FP32 is roughly 2x on most providers; do the quantization work.
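The ratios and latencies follow directly from the FPS column at batch=1, and it is worth recomputing them whenever you add a row of your own. A tiny sketch with the table's numbers typed in by hand (the dictionary keys are just labels I picked):

```python
# (hardware, provider) -> FPS, copied from the table above
fps = {
    ("Jetson", "TensorRT"): 162,
    ("Jetson", "CUDA"): 88,
    ("Pi 5", "XNNPACK"): 23,
    ("Pi 5", "CPU"): 11,
    ("Ryzen 7700X", "CPU"): 95,
    ("RTX 4070", "CUDA"): 720,
}


def latency_ms(frames_per_second: float) -> float:
    """At batch=1 with no pipelining, latency is the reciprocal of throughput."""
    return 1000.0 / frames_per_second


def speedup(fast: tuple, slow: tuple) -> float:
    """Throughput ratio between two table rows."""
    return fps[fast] / fps[slow]


for key, value in fps.items():
    print(f"{key[0]:<12} {key[1]:<9} {value:>4} FPS  {latency_ms(value):6.1f} ms")

print(f"TensorRT vs CUDA on Jetson: {speedup(('Jetson', 'TensorRT'), ('Jetson', 'CUDA')):.2f}x")  # 1.84x
print(f"XNNPACK vs CPU on Pi 5:     {speedup(('Pi 5', 'XNNPACK'), ('Pi 5', 'CPU')):.2f}x")        # 2.09x
```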
7. Common Pitfalls
Pitfall 1, leaving threading at the default
The default intra_op_num_threads is “number of cores,” which on a Jetson is 6. That’s almost always wrong. The accelerator provider doesn’t use CPU threads much, and oversubscribing causes context-switching overhead. Set intra_op_num_threads = 2 for CUDA/TRT, intra_op_num_threads = 4 for XNNPACK on Pi 5.
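If one codebase targets several boards, it may be easier to encode that rule once than to remember it per deployment. A sketch, assuming the thread counts above; pick_intra_op_threads is a hypothetical helper, and capping plain-CPU providers at 4 is my own extrapolation from the Pi 5 number.

```python
import os
from typing import Optional

# Providers where the accelerator does the heavy lifting.
ACCELERATOR_PROVIDERS = {"CUDAExecutionProvider", "TensorrtExecutionProvider"}


def pick_intra_op_threads(primary_provider: str, cpu_count: Optional[int] = None) -> int:
    """Return a thread count for SessionOptions.intra_op_num_threads."""
    cores = cpu_count or os.cpu_count() or 1
    if primary_provider in ACCELERATOR_PROVIDERS:
        return min(2, cores)   # GPU-bound: a couple of threads for CPU-side ops
    return min(4, cores)       # CPU-bound: up to 4, never more than the core count


print(pick_intra_op_threads("TensorrtExecutionProvider", cpu_count=6))  # 2
print(pick_intra_op_threads("XnnpackExecutionProvider", cpu_count=4))   # 4
```

The result would be assigned to so.intra_op_num_threads before the session is created.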
Pitfall 2, creating a session per request
InferenceSession() is heavy. It loads the model, optimizes the graph, and (on TRT) compiles. Per-request session creation will kill your throughput. Create one session at startup, share it across requests. The session is thread-safe for run() calls.
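A minimal way to enforce "one session per process" is a lazily initialized holder guarded by a lock. This is a sketch: SharedSession is a hypothetical class, and the factory here returns a dummy object so the snippet runs without a model. In real code the factory would be something like lambda: make_session(path, is_jetson).

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class SharedSession(Generic[T]):
    """Create the session once, on first use, and hand the same object to every caller."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._session: Optional[T] = None

    def get(self) -> T:
        # Double-checked locking: once the session exists, callers skip the lock.
        if self._session is None:
            with self._lock:
                if self._session is None:
                    self._session = self._factory()
        return self._session


# Stand-in factory that records how often it was invoked.
calls = []


def fake_factory() -> object:
    calls.append(1)
    return object()


shared = SharedSession(fake_factory)
threads = [threading.Thread(target=shared.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("factory calls:", len(calls))  # 1
```

Creating the session eagerly at startup works just as well and surfaces model-loading errors before the first request arrives; the lazy variant is mainly useful when several models share a process and not all are needed.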
Pitfall 3, forgetting to disable mem_pattern on dynamic shapes
If your input shape varies (e.g., variable batch size), enable_mem_pattern=True will re-plan memory on every shape change. Set enable_mem_pattern=False for dynamic-shape models. This catches a lot of people who deploy LLMs to edge.
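You can detect this from the model's declared input shapes: the Python API reports fixed dimensions as integers and symbolic or unknown ones as strings or None (the ['N', 3, 640, 640] in the benchmark script is exactly that case). The helpers below are a sketch with hypothetical names. Since enable_mem_pattern has to be set before the session is built, the shapes would come from an offline inspection or a throwaway session.

```python
from typing import Iterable, Optional, Sequence, Union

Dim = Union[int, str, None]


def has_dynamic_dims(shape: Sequence[Dim]) -> bool:
    """True if any dimension is symbolic (a string) or unknown (None)."""
    return any(not isinstance(d, int) for d in shape)


def should_enable_mem_pattern(input_shapes: Iterable[Sequence[Dim]]) -> bool:
    """Memory-pattern planning only pays off when every input shape is fixed."""
    return not any(has_dynamic_dims(s) for s in input_shapes)


# Shapes as sess.get_inputs()[i].shape would report them.
print(should_enable_mem_pattern([["N", 3, 640, 640]]))  # False: batch dim is symbolic
print(should_enable_mem_pattern([[1, 3, 640, 640]]))    # True: fully static
```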
Pitfall 4, TensorRT engine cache invalidation
The TRT engine cache key depends on model hash, GPU model, TRT version, and a few other things. If you update any of those, the cache silently rebuilds, which looks like a 5-minute startup hang. Either pre-build the engine and ship it with your app, or warn loudly on rebuild.
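"Warn loudly" can be as simple as checking the cache directory before creating the session. The sketch below assumes the TensorRT provider writes engine files with a .engine suffix into trt_engine_cache_path; that suffix and the prepare_engine_cache helper are assumptions on my part, so check what your version actually writes. The demo uses a temporary directory and a dummy file.

```python
import logging
import tempfile
from pathlib import Path

log = logging.getLogger("trt_cache")


def prepare_engine_cache(cache_dir: str) -> bool:
    """Create the cache directory if needed; return True if a cached engine is present."""
    path = Path(cache_dir)
    path.mkdir(parents=True, exist_ok=True)
    # Assumption: cached engines are stored as *.engine files in this directory.
    warm = any(path.glob("*.engine"))
    if not warm:
        log.warning(
            "TensorRT engine cache at %s is empty; the first run will compile "
            "the engine and may take minutes",
            path,
        )
    return warm


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "trt_engines"
    print("first start, cache warm?", prepare_engine_cache(str(cache)))   # False
    (cache / "dummy_model.engine").write_bytes(b"")                       # pretend an engine was built
    print("second start, cache warm?", prepare_engine_cache(str(cache)))  # True
```

A file-presence check cannot tell whether an existing engine is stale, so it complements rather than replaces shipping a pre-built engine.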
8. Troubleshooting
“No CUDA provider available” on Jetson
You installed onnxruntime instead of onnxruntime-gpu, or you installed the x86_64 GPU wheel which doesn’t include the right CUDA bindings for Tegra. Reinstall from the Jetson AI Lab index.
Inference works but accuracy dropped 10 points after quantization
Almost always a calibration issue. Either your calibration images don’t match production distribution, or you forgot per_channel=True. Re-run with more diverse calibration data. If that doesn’t help, you need QAT.
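Before re-running a full accuracy evaluation, it can help to compare raw outputs of the FP32 and INT8 models on a few inputs. Here is a sketch of two simple metrics over flattened outputs. The function names and sample numbers are illustrative, and any pass/fail threshold you apply to them is something to tune for your own model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened output tensors."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero outputs match; zero vs non-zero does not.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise deviation between the two outputs."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(x - y) for x, y in zip(a, b))


# Made-up outputs; with NumPy arrays you would pass out.ravel().tolist().
fp32_out = [0.91, 0.05, 0.02, 0.02]
int8_out = [0.89, 0.06, 0.03, 0.02]
print(f"cosine={cosine_similarity(fp32_out, int8_out):.4f}, "
      f"max diff={max_abs_diff(fp32_out, int8_out):.3f}")
```

If similarity is high on calibration-like inputs but collapses on production samples, that points at the calibration distribution rather than at the quantization settings.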
Memory grows over time
You’re probably creating tensors per request without releasing them. The Go bindings make the cleanup explicit (defer tensor.Destroy()). The Python bindings rely on garbage collection, so memory still grows if you keep references around. Use io_binding and reuse buffers.
9. Wrapping Up
ONNX Runtime 1.20 on edge is a mature, well-supported toolchain in April 2025. The pattern is: pick the right provider for the hardware, quantize to INT8 with real calibration data, configure threads and memory explicitly, and reuse sessions. Do that and you’ll get within 20% of the hardware-native runtime for a fraction of the porting effort.
Next post moves to streaming inference pipelines, where ONNX Runtime is one node in a Kafka-fed system. The single-board benchmarks here are the building block; next we wire them together.
The official ONNX Runtime docs are at onnxruntime.ai/docs and the model zoo at github.com/onnx/models is genuinely useful for getting started.