January 20, 2026 · 10 min read · by Muhammad Amal programming

TL;DR: A WebAssembly inference runtime gives you a sandboxed, portable host for edge agents. Wasmtime 28 plus the WASI-NN proposal lets guest modules call into a native backend without bundling the model. I cover host setup, the WIT interface, ONNX loading, resource limits, and the failure modes you’ll actually hit.

I spent most of last quarter trying to get a fleet of small agents to run the same model binary on three different boards: an x86 gateway, an ARM64 industrial PC, and a Raspberry Pi 5. Every approach that involved shipping a native .so per target turned into a cross-compilation swamp. The breakthrough was inverting the problem. Instead of shipping the inference engine to every device, ship a thin WebAssembly module that calls an inference engine the host already provides.

That’s what a WebAssembly inference runtime is: a host process that embeds a Wasm engine, exposes a neural-network capability to guest modules through a stable interface, and runs untrusted agent logic inside a sandbox. The guest stays tiny and architecture-neutral. The heavy tensor math lives in a native backend the host picks at startup.

This post builds that runtime in Rust with Wasmtime 28 and the WASI-NN proposal. By the end you’ll have a host that loads an ONNX model, hands it to a sandboxed agent, runs inference, and enforces memory and fuel limits so a misbehaving agent can’t take the box down. If you want the model side of the story first, my walkthrough on running Phi-4-mini on a Raspberry Pi pairs well with this.

Why WebAssembly for edge inference

The pitch is not “Wasm is fast.” Native code is faster. The pitch is isolation plus portability with a real capability model.

An edge agent is often third-party logic: a customer’s anomaly-detection rule, a plugin, a model-routing policy. You do not want that running with full process privileges next to your network stack. A Wasm sandbox gives you a deny-by-default boundary. The guest can’t open a socket, read the filesystem, or call inference unless the host explicitly grants it.

Portability matters because edge fleets are heterogeneous. One agent.wasm runs unmodified on x86 and ARM. The host absorbs the architecture difference. And WASI-NN keeps the multi-hundred-MB model and the ONNX Runtime dependency out of the guest entirely — the guest references a model by handle, not by bytes.

The cost is real: a WASI-NN call crosses the sandbox boundary, and you pay marshalling overhead on every tensor. For a model that runs in 40 ms, a 0.3 ms boundary crossing is noise. For a model that runs in 80 microseconds, it is not. Know which regime you’re in before you commit.
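To make the two regimes concrete, here is a small sketch of the arithmetic using the figures above (a 0.3 ms crossing against a 40 ms model and an 80 µs model). The function names are illustrative only; the crossing cost on your hardware is something to measure, not assume.

```rust
/// Boundary cost relative to the model's own run time.
/// 1.0 means the crossing costs as much as the inference itself.
/// Both arguments are in microseconds.
fn relative_overhead(model_us: f64, crossing_us: f64) -> f64 {
    crossing_us / model_us
}

/// Share of the total call time that is spent crossing the boundary.
fn boundary_share(model_us: f64, crossing_us: f64) -> f64 {
    crossing_us / (model_us + crossing_us)
}
```

With these numbers, `relative_overhead(40_000.0, 300.0)` is 0.0075 (under 1%), while `relative_overhead(80.0, 300.0)` is 3.75: the crossing costs several times more than the model it wraps.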

Project setup

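Pin everything. Wasm tooling moved fast through 2025, and version drift between wasmtime and wasmtime-wasi will bite you. Note that a requirement like "28.0" is a caret range that accepts any 28.x release, so commit Cargo.lock (or use an exact `=` requirement) if you want a true pin.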

# Cargo.toml
[package]
name = "wasm-infer-runtime"
version = "0.1.0"
edition = "2021"
rust-version = "1.84"

[dependencies]
wasmtime = { version = "28.0", features = ["component-model", "async"] }
wasmtime-wasi = "28.0"
wasmtime-wasi-nn = "28.0"
anyhow = "1.0"
tokio = { version = "1.43", features = ["rt-multi-thread", "macros"] }
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }

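The WASI-NN crate selects its backends through Cargo feature flags, and the dependency line above does not enable one explicitly. For ONNX you want the ONNX Runtime backend, so check the wasmtime-wasi-nn feature list for your version and enable the ONNX feature there. Then confirm the native library is discoverable at runtime: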

# Debian/Ubuntu on the gateway
apt-get install -y libonnxruntime1.20 libonnxruntime-dev
# Verify the loader can see it
ldconfig -p | grep onnxruntime

If ldconfig shows nothing, set ORT_DYLIB_PATH to the absolute path of libonnxruntime.so before launching the host. A missing backend library produces a confusing “graph encoding not supported” error rather than an honest “library not found,” so check this first.
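Because the real failure is so misleading, it can help to validate ORT_DYLIB_PATH yourself before the host starts. The following is a minimal sketch, not part of Wasmtime or ONNX Runtime: `check_ort_path` is a hypothetical helper, and it takes the variable's value as an argument so it can be tested without touching the process environment.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical preflight check for ORT_DYLIB_PATH.
/// - `None` (variable unset) is fine: the system loader is used instead.
/// - A set value must be an absolute path to an existing file.
fn check_ort_path(value: Option<&str>) -> Result<Option<PathBuf>, String> {
    let Some(raw) = value else {
        return Ok(None);
    };
    let path = Path::new(raw);
    if !path.is_absolute() {
        return Err(format!("ORT_DYLIB_PATH must be absolute, got {raw}"));
    }
    if !path.is_file() {
        return Err(format!("ORT_DYLIB_PATH points at {raw}, but no file exists there"));
    }
    Ok(Some(path.to_path_buf()))
}
```

At startup you would call it as `check_ort_path(std::env::var("ORT_DYLIB_PATH").ok().as_deref())` and abort with the returned message on `Err`.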

The guest interface

WASI-NN is defined as a WIT (WebAssembly Interface Types) world. The guest imports it; the host provides it. You don’t write this file from scratch — it comes from the WASI-NN proposal repository — but you should understand its shape:

// wit/wasi-nn.wit (abridged; the real file is larger)
package wasi:nn@0.2.0-rc-2024-10-28;

interface tensor {
  enum tensor-type { fp16, fp32, fp64, u8, i32, i64 }
  resource tensor {
    constructor(dimensions: list<u32>, ty: tensor-type, data: list<u8>);
    dimensions: func() -> list<u32>;
    ty: func() -> tensor-type;
    data: func() -> list<u8>;
  }
}

interface graph {
  use tensor.{tensor};
  enum graph-encoding { onnx, tensorflow, pytorch, openvino, ggml }
  enum execution-target { cpu, gpu, tpu }
  resource graph {
    init-execution-context: func() -> result<graph-execution-context, error>;
  }
}

The contract: the host owns graph resources. The guest receives an opaque handle, never the model bytes. That single design choice is what keeps the agent module small and the model swappable without recompiling agents.
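The handle contract can be illustrated with a toy table in plain Rust. This is a sketch of the idea only: `GraphTable` and its methods are made up for illustration and are not Wasmtime's `ResourceTable` or the WASI-NN host implementation.

```rust
use std::collections::HashMap;

/// Toy model of host-owned graphs: the host keeps the bytes,
/// the guest only ever sees a `u32` handle.
struct GraphTable {
    by_name: HashMap<String, u32>,
    graphs: Vec<Vec<u8>>, // index == handle
}

impl GraphTable {
    fn new() -> Self {
        Self { by_name: HashMap::new(), graphs: Vec::new() }
    }

    /// Host side: store model bytes under a name and return the new handle.
    fn register(&mut self, name: &str, bytes: Vec<u8>) -> u32 {
        let handle = self.graphs.len() as u32;
        self.graphs.push(bytes);
        self.by_name.insert(name.to_string(), handle);
        handle
    }

    /// Guest-facing: resolve a name to a handle. No bytes cross the boundary.
    fn load_by_name(&self, name: &str) -> Option<u32> {
        self.by_name.get(name).copied()
    }

    /// Host side: swap the model behind an existing name; the handle is unchanged.
    fn replace(&mut self, name: &str, bytes: Vec<u8>) -> bool {
        match self.by_name.get(name) {
            Some(&h) => {
                self.graphs[h as usize] = bytes;
                true
            }
            None => false,
        }
    }

    /// Host side: size of the model behind a handle, if the handle is valid.
    fn size_of(&self, handle: u32) -> Option<usize> {
        self.graphs.get(handle as usize).map(|g| g.len())
    }
}
```

The guest-visible value stays the same across `replace`, which is the property that lets you swap models without recompiling agents.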

Building the host

The host is the interesting part. It does five things: configure the engine, register WASI and WASI-NN, preload the model, instantiate the agent, and enforce limits.

// src/main.rs
mod registry; // declares src/registry.rs so crate::registry resolves in main()

use anyhow::{Context, Result};
use std::path::PathBuf;
use std::sync::Arc;
use wasmtime::component::{Component, Linker, ResourceTable};
use wasmtime::{Config, Engine, Store};
use wasmtime_wasi::{WasiCtx, WasiCtxBuilder, WasiView};
use wasmtime_wasi_nn::wit::WasiNnView;
use wasmtime_wasi_nn::{backend, Backend, Registry};

/// Everything a single agent instance can touch lives here.
struct AgentState {
    wasi: WasiCtx,
    table: ResourceTable,
    nn: Arc<wasmtime_wasi_nn::WasiNnCtx>,
}

impl WasiView for AgentState {
    fn ctx(&mut self) -> &mut WasiCtx {
        &mut self.wasi
    }
    fn table(&mut self) -> &mut ResourceTable {
        &mut self.table
    }
}

fn build_engine() -> Result<Engine> {
    let mut config = Config::new();
    config.wasm_component_model(true);
    config.async_support(true);
    // Fuel lets us bound CPU per call deterministically.
    config.consume_fuel(true);
    // Epoch interruption adds a cheap wall-clock deadline for guest code.
    config.epoch_interruption(true);
    Engine::new(&config).context("failed to construct Wasmtime engine")
}

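Note that both consume_fuel and epoch_interruption are enabled. Fuel meters Wasm instructions deterministically, but it does not count time spent inside a host call. Epochs add a wall-clock deadline that guest code checks at function entries and loop headers. Neither can pre-empt native code that is already running, so an agent blocked in a multi-second inference call keeps going until the backend returns and only then trips the expired epoch deadline at its next check. You want both, and a hard bound on the backend call itself has to come from somewhere else.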

Loading the model into a registry

WASI-NN’s load_by_name resolves a model from a host-side registry. We populate that registry at startup so agents only ever pass a string.

// src/registry.rs
use anyhow::{Context, Result};
use std::path::Path;
use wasmtime_wasi_nn::backend::onnx::OnnxBackend;
use wasmtime_wasi_nn::{Backend, InMemoryRegistry, Registry};

/// Load every `.onnx` file in `dir` under its file stem as the model name.
pub fn build_registry(dir: &Path) -> Result<impl Registry> {
    let mut registry = InMemoryRegistry::new();
    let mut backend: Backend = OnnxBackend::default().into();

    for entry in std::fs::read_dir(dir)
        .with_context(|| format!("reading model dir {}", dir.display()))?
    {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) != Some("onnx") {
            continue;
        }
        let name = path
            .file_stem()
            .and_then(|s| s.to_str())
            .context("model file name is not valid UTF-8")?
            .to_string();

        let bytes = std::fs::read(&path)
            .with_context(|| format!("reading model {}", path.display()))?;
        let graph = backend
            .load(&[&bytes], wasmtime_wasi_nn::ExecutionTarget::Cpu)
            .with_context(|| format!("backend rejected model {name}"))?;

        registry.put(&name, graph);
        tracing::info!(model = %name, size_kb = bytes.len() / 1024, "registered model");
    }
    Ok(registry)
}

Preloading at startup matters. A cold ONNX graph load can take 200 ms to several seconds. Do it once when the host boots, not on the first agent request, or your p99 latency will be dominated by warmup.
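The filter-and-name step inside build_registry is easy to pull out into a pure function so it can be unit-tested without a backend. This is a sketch: `model_name` is a hypothetical helper, and unlike the loop above it returns `None` for a non-UTF-8 stem instead of raising an error.

```rust
use std::path::Path;

/// Return the registry name for a model file, or `None` if the path
/// is not a `.onnx` file (or its stem is not valid UTF-8).
/// The match on the extension is case-sensitive, as in `build_registry`.
fn model_name(path: &Path) -> Option<String> {
    if path.extension().and_then(|e| e.to_str()) != Some("onnx") {
        return None;
    }
    path.file_stem()
        .and_then(|s| s.to_str())
        .map(str::to_string)
}
```

One detail worth knowing: `file_stem` strips only the last extension, so `classifier.v2.onnx` registers as `classifier.v2`, and agents must pass that exact string to load_by_name.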

Wiring it together with limits

This is the part most tutorials skip. An unbounded Wasm store will happily grow its linear memory until the OOM killer arrives.

// src/main.rs (continued)
use wasmtime::StoreLimitsBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();

    let engine = build_engine()?;
    let registry = crate::registry::build_registry(
        &PathBuf::from(std::env::var("MODEL_DIR").unwrap_or_else(|_| "./models".into())),
    )?;

    // Component + linker are shared; instantiate per request.
    let component = Component::from_file(&engine, "agent.wasm")
        .context("loading agent component")?;
    let mut linker: Linker<AgentState> = Linker::new(&engine);
    wasmtime_wasi::add_to_linker_async(&mut linker)?;
    wasmtime_wasi_nn::wit::add_to_linker(&mut linker, |s: &mut AgentState| {
        WasiNnView::new(&mut s.table, &s.nn)
    })?;

    let nn = Arc::new(wasmtime_wasi_nn::WasiNnCtx::new(
        vec![wasmtime_wasi_nn::backend::onnx::OnnxBackend::default().into()],
        Box::new(registry),
    ));

    let state = AgentState {
        wasi: WasiCtxBuilder::new().inherit_stderr().build(),
        table: ResourceTable::new(),
        nn,
    };

    let mut store = Store::new(&engine, state);

    // Hard caps: 128 MiB linear memory, one memory, no growth past the cap.
    store.limiter(|_s| {
        Box::leak(Box::new(
            StoreLimitsBuilder::new()
                .memory_size(128 << 20)
                .memories(1)
                .instances(1)
                .build(),
        ))
    });
    // 5 billion fuel units per agent run — tune against a real workload.
    store.set_fuel(5_000_000_000)?;
    // Tick the epoch from a watchdog thread; deadline = 2 ticks.
    store.set_epoch_deadline(2);

    let watchdog = engine.clone();
    std::thread::spawn(move || loop {
        std::thread::sleep(std::time::Duration::from_millis(500));
        watchdog.increment_epoch();
    });

    let instance = linker
        .instantiate_async(&mut store, &component)
        .await
        .context("instantiating agent")?;

    let run = instance
        .get_typed_func::<(String,), (Result<String, String>,)>(&mut store, "run")?;
    let (result,) = run.call_async(&mut store, ("infer".into(),)).await?;
    match result {
        Ok(out) => tracing::info!(output = %out, "agent finished"),
        Err(e) => tracing::error!(error = %e, "agent reported failure"),
    }
    Ok(())
}

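The StoreLimitsBuilder boxing-and-leaking pattern looks ugly, and it has a cost: the closure passed to store.limiter runs each time the store consults its limiter, so every call leaks another small StoreLimits. That is tolerable in a short-lived, single-run host like this one, but it adds up in a long-running process. The pattern shown in Wasmtime's Store::limiter documentation is to keep a StoreLimits value as a field of the store's state (AgentState here) and return a mutable borrow of that field from the closure, which leaks nothing.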
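One alternative that avoids leaking is to keep the limits inside the store's own state and have the closure hand back a borrow of that field. The sketch below models that shape in plain Rust so it compiles without Wasmtime: `Limiter`, `Limits`, and `ToyStore` are stand-ins invented for illustration, and this `AgentState` is a cut-down toy, not the struct from the host above.

```rust
/// Toy stand-in for a resource-limiter trait (not Wasmtime's).
trait Limiter {
    /// Return true if linear memory may grow to `desired` bytes.
    fn memory_growing(&mut self, desired: usize) -> bool;
}

struct Limits {
    max_bytes: usize,
    denied: u32, // how many growth requests were refused
}

impl Limiter for Limits {
    fn memory_growing(&mut self, desired: usize) -> bool {
        let ok = desired <= self.max_bytes;
        if !ok {
            self.denied += 1;
        }
        ok
    }
}

/// Per-agent state: the limits live here, next to everything else.
struct AgentState {
    limits: Limits,
}

type LimiterFn<T> = Box<dyn FnMut(&mut T) -> &mut dyn Limiter>;

/// Toy store: owns the state plus a closure that finds the limiter in it.
struct ToyStore<T> {
    data: T,
    limiter: Option<LimiterFn<T>>,
}

impl<T> ToyStore<T> {
    fn new(data: T) -> Self {
        Self { data, limiter: None }
    }

    /// Install the accessor. It borrows from `data`, so nothing is leaked.
    fn limiter<F>(&mut self, f: F)
    where
        F: 'static + FnMut(&mut T) -> &mut dyn Limiter,
    {
        self.limiter = Some(Box::new(f));
    }

    /// Called on every growth request; the accessor runs each time.
    fn try_grow(&mut self, desired: usize) -> bool {
        match self.limiter.as_mut() {
            Some(f) => f(&mut self.data).memory_growing(desired),
            None => true,
        }
    }
}
```

Usage mirrors the real call shape: `store.limiter(|s| &mut s.limits)`. Because the accessor runs on every growth request, returning a borrow of an existing field costs nothing, whereas allocating inside the closure would allocate each time.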

The agent guest

The agent is deliberately boring. It builds an input tensor, calls compute, and reads the output. Compiled with cargo component build --release, it lands around 80 KB.

// agent/src/lib.rs
wit_bindgen::generate!({ world: "agent" });

use crate::wasi::nn::graph::load_by_name;
use crate::wasi::nn::tensor::{Tensor, TensorType};

struct Agent;

impl Guest for Agent {
    fn run(_cmd: String) -> Result<String, String> {
        // Resolve the model by the name the host registered.
        let graph = load_by_name("classifier")
            .map_err(|e| format!("load_by_name failed: {e:?}"))?;
        let ctx = graph
            .init_execution_context()
            .map_err(|e| format!("init context failed: {e:?}"))?;

        // A 1x3 fp32 feature vector, little-endian.
        let features: [f32; 3] = [0.81, 0.12, 0.44];
        let mut bytes = Vec::with_capacity(12);
        for f in features {
            bytes.extend_from_slice(&f.to_le_bytes());
        }
        let input = Tensor::new(&[1, 3], TensorType::Fp32, &bytes);

        ctx.set_input("input", input)
            .map_err(|e| format!("set_input failed: {e:?}"))?;
        ctx.compute()
            .map_err(|e| format!("compute failed: {e:?}"))?;

        let output = ctx
            .get_output("output")
            .map_err(|e| format!("get_output failed: {e:?}"))?;
        let raw = output.data();
        let scores: Vec<f32> = raw
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes(c.try_into().unwrap()))
            .collect();
        let best = scores
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i)
            .ok_or("empty output tensor")?;
        Ok(format!("class={best} scores={scores:?}"))
    }
}

export!(Agent);

The tensor name strings ("input", "output") must match your ONNX graph’s I/O node names exactly. Inspect them with python -c "import onnx; m=onnx.load('classifier.onnx'); print([i.name for i in m.graph.input], [o.name for o in m.graph.output])".
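The byte packing and unpacking in the agent can be factored into small helpers that are testable off-device. This is a sketch with hypothetical function names. It differs from the agent code in one way: `chunks_exact` silently drops trailing bytes, so the decoder here reports a length that is not a multiple of four instead of ignoring it.

```rust
/// Pack f32 values into little-endian bytes, as the agent does for its input.
fn encode_f32_le(values: &[f32]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(values.len() * 4);
    for v in values {
        bytes.extend_from_slice(&v.to_le_bytes());
    }
    bytes
}

/// Unpack little-endian bytes into f32 values, rejecting a ragged buffer.
fn decode_f32_le(raw: &[u8]) -> Result<Vec<f32>, String> {
    if raw.len() % 4 != 0 {
        return Err(format!(
            "tensor byte length {} is not a multiple of 4",
            raw.len()
        ));
    }
    Ok(raw
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}

/// Index of the highest score, or `None` for an empty slice.
fn argmax(scores: &[f32]) -> Option<usize> {
    scores
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

A ragged output buffer usually means the output tensor is not fp32 at all, which is worth surfacing as an error rather than decoding into plausible-looking numbers.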

Common pitfalls

  • Tensor layout mismatch. WASI-NN tensors are raw bytes. If the host backend expects NCHW and you build NHWC, you get silent garbage, not an error. Validate output ranges against a known-good native run.
  • Endianness assumptions. Wasm is little-endian; ONNX Runtime on your hardware almost certainly is too, so to_le_bytes is correct. Don’t copy to_ne_bytes from a non-Wasm example.
  • Fuel that never refills. set_fuel is a one-time grant, not a per-call budget. If a long-running host ever reuses a store, it must call set_fuel again before each request, or a later call traps with “all fuel consumed” once the original grant runs out.
  • Reusing one store across agents. Resource tables and linear memory are per-store. Sharing a store between tenants leaks state and breaks the isolation you adopted Wasm for. One store per agent run.
  • Forgetting the epoch watchdog. Without increment_epoch ticking, set_epoch_deadline is inert and a stuck guest hangs forever.
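The layout pitfall above is easiest to see in the offset arithmetic. The sketch below shows where the same logical element lands in an NCHW buffer versus an NHWC one, plus a straightforward conversion. The function names are illustrative, and `dims` is always given as `[N, C, H, W]` regardless of the memory layout.

```rust
/// Flat offset of element (n, c, h, w) in an NCHW buffer.
fn nchw_offset(dims: [usize; 4], idx: [usize; 4]) -> usize {
    let [_, c_dim, h_dim, w_dim] = dims;
    let [n, c, h, w] = idx;
    ((n * c_dim + c) * h_dim + h) * w_dim + w
}

/// Flat offset of the same element in an NHWC buffer.
fn nhwc_offset(dims: [usize; 4], idx: [usize; 4]) -> usize {
    let [_, c_dim, h_dim, w_dim] = dims;
    let [n, c, h, w] = idx;
    ((n * h_dim + h) * w_dim + w) * c_dim + c
}

/// Copy an NHWC buffer into NCHW order. `data.len()` must equal N*C*H*W.
fn nhwc_to_nchw(data: &[f32], dims: [usize; 4]) -> Vec<f32> {
    let [n_dim, c_dim, h_dim, w_dim] = dims;
    assert_eq!(data.len(), n_dim * c_dim * h_dim * w_dim);
    let mut out = vec![0.0; data.len()];
    for n in 0..n_dim {
        for c in 0..c_dim {
            for h in 0..h_dim {
                for w in 0..w_dim {
                    let idx = [n, c, h, w];
                    out[nchw_offset(dims, idx)] = data[nhwc_offset(dims, idx)];
                }
            }
        }
    }
    out
}
```

Both layouts produce a buffer of the same length, which is why a mismatch passes every shape check and only shows up as wrong scores.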

Troubleshooting

Symptom: Error: unknown import: wasi:nn/graph at instantiation. Cause: the WASI-NN interface was never added to the linker, or a WIT version mismatch between the guest’s generated bindings and wasmtime-wasi-nn. Fix: confirm wasmtime_wasi_nn::wit::add_to_linker runs before instantiate_async, and pin the guest’s wit directory to the same wasi-nn package version the host crate expects.

Symptom: compute returns runtime-error with no detail. Cause: input tensor shape disagrees with the model’s expected shape, or the ONNX backend library is the wrong ABI. Fix: log the model’s expected input shape at registry-load time and compare. Verify libonnxruntime major version matches the wasmtime-wasi-nn feature you built against.

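Symptom: agent traps with “epoch deadline reached” during normal inference. Cause: the deadline is too tight for the model’s real compute time, or the watchdog tick interval is shorter than one inference. Fix: measure a warm inference, then set the deadline so the guaranteed window comfortably exceeds it. The deadline can be set at any point in the watchdog’s tick cycle, so a deadline of n ticks guarantees only (n - 1) * tick_interval and allows at most n * tick_interval. A 40 ms model under a 500 ms tick with deadline 2 has between 0.5 s and 1 s of headroom. In the host above the deadline is set before instantiation, so that window covers instantiation as well as the run.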
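The headroom arithmetic can be written down as a small helper. This is a sketch under one assumption about Wasmtime's semantics: the deadline is expressed in ticks relative to the epoch at the moment it is set, and that moment falls at an arbitrary point in the watchdog's tick cycle. The function names are hypothetical.

```rust
use std::time::Duration;

/// Earliest and latest wall-clock time after which a deadline of
/// `deadline` ticks can fire, for a watchdog ticking every `tick`.
fn deadline_window(tick: Duration, deadline: u32) -> (Duration, Duration) {
    let min = tick * deadline.saturating_sub(1);
    let max = tick * deadline;
    (min, max)
}

/// Smallest deadline whose guaranteed (minimum) window covers `budget`.
/// Panics if `tick` is zero.
fn min_deadline_for(tick: Duration, budget: Duration) -> u32 {
    assert!(!tick.is_zero(), "tick interval must be non-zero");
    let t = tick.as_nanos();
    let b = budget.as_nanos();
    // ceil(budget / tick) whole ticks, plus one for the partial first tick.
    ((b + t - 1) / t) as u32 + 1
}
```

For the numbers in this post, a 500 ms tick with deadline 2 gives a window of 500 ms to 1 s, and a 40 ms budget needs a deadline of at least 2.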

Symptom: first request after boot is 50x slower than the rest. Cause: model graph loaded lazily on first load_by_name. Fix: preload in build_registry at startup, as shown above, so the cost is paid before traffic arrives.

Wrapping up

You now have a WebAssembly inference runtime that runs untrusted edge agents in a sandbox, hands them a shared native ONNX backend through WASI-NN, and enforces hard memory and CPU bounds. The next step is a store pool so you can serve concurrent agents without per-request instantiation cost, and swapping the ONNX backend for an OpenVINO or GGML one when your hardware justifies it.
