Deploying Models with TFLite Micro on Constrained Devices


April 28, 2025 · 10 min read · by Muhammad Amal programming

TL;DR: TFLite Micro is what you reach for when you need inference on a chip with 256 KB of RAM and no operating system. Convert to a fully INT8 quantized .tflite, size the tensor arena empirically, link against CMSIS-NN for ARM acceleration, and accept that you’ll write some C++. The keyword spotter built in this post takes about 55 KB of flash and 11 KB of RAM, and runs in about 18 ms on a Cortex-M4 at 168 MHz with CMSIS-NN.

There’s a tier of edge that the rest of this series hasn’t touched. Below the Raspberry Pi 5, below the Coral, below the Jetson Nano, lives the world of microcontrollers: STM32, nRF52, ESP32, RP2040. These are devices with kilobytes of RAM, no Linux, and often no floating-point hardware. You still want to run a model on them, because the alternative is shipping raw audio or sensor data over a 9600 baud LoRa link, which is far worse.

This post is a practical guide to TensorFlow Lite Micro, the inference runtime that targets these devices. We’ll cover the full deployment path: train a small model, convert it, quantize it, embed it in a C++ project, and run it on a real Cortex-M4 board. Memory and latency are the two things you’ll fight; we’ll measure both.

This is the smallest scale tier in the series. We’ve already covered edge AI hardware in April 2025 for the SBC tier; this is the one below.

1. What “constrained” actually means in 2025

The numbers have shifted in the last few years. A “constrained device” used to mean 32 KB of RAM. The Cortex-M33 chips shipping in 2024-2025 often have 512 KB to 1 MB. Still constrained relative to a Pi, but enough for genuinely interesting models.

+--------------------+--------+----------+--------+----------------+
| MCU                | RAM    | Flash    | FPU    | Typical clock  |
+--------------------+--------+----------+--------+----------------+
| STM32F4 (M4)       | 192 KB | 1 MB     | single | 168 MHz        |
| STM32H7 (M7)       | 1 MB   | 2 MB     | double | 480 MHz        |
| nRF52840 (M4)      | 256 KB | 1 MB     | single | 64 MHz         |
| ESP32-S3 (Xtensa)  | 512 KB | 8 MB     | single | 240 MHz        |
| RP2040 (M0+)       | 264 KB | external | none   | 133 MHz        |
| STM32U5 (M33)      | 786 KB | 2 MB     | single | 160 MHz        |
+--------------------+--------+----------+--------+----------------+

Models that fit in this tier are small CNNs for image classification (CIFAR-scale), keyword spotters, anomaly detectors, and simple regression models for sensor fusion. Anything LLM-shaped is out. Anything bigger than ~500 KB of weights is out.

2. Training a tiny model

The example we’ll deploy is a keyword spotter that recognizes “yes” and “no” from microphone input. The training is a Python pipeline; the deployment is C++.

# train_kws.py — simplified for clarity
import tensorflow as tf
import numpy as np

def build_model():
    inputs = tf.keras.Input(shape=(49, 10, 1))  # 49 time x 10 MFCC x 1 channel
    x = tf.keras.layers.Conv2D(8, (4, 10), strides=(2, 4), activation='relu')(inputs)
    x = tf.keras.layers.DepthwiseConv2D((3, 3), padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(8, (1, 1), activation='relu')(x)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(3, activation='softmax')(x)  # yes, no, unknown
    return tf.keras.Model(inputs, outputs)

model = build_model()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# ... train with Speech Commands dataset
model.save('kws_float.keras')

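As written, this simplified model has about 1,000 parameters (328 in the first convolution, 80 in the depthwise layer, 72 in the pointwise layer, and 555 in the dense layer). Trained on the Speech Commands dataset, it hits ~93% accuracy on “yes/no” with the rest as “unknown.”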

3. Converting and quantizing

The model needs to be a TFLite flatbuffer with INT8 quantization. Both the weights and the activations.

# convert.py
import tensorflow as tf
import numpy as np

model = tf.keras.models.load_model('kws_float.keras')

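def representative_dataset():
    # Placeholder: random data only exercises the converter. For real
    # calibration, yield ~100 actual MFCC samples from the training set;
    # random inputs produce poor activation ranges and hurt accuracy.
    for _ in range(100):
        sample = np.random.rand(1, 49, 10, 1).astype(np.float32)
        yield [sample]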

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()

with open('kws.tflite', 'wb') as f:
    f.write(tflite_model)

print(f"Model size: {len(tflite_model)} bytes")
# Typical: 18,000-25,000 bytes

The key flags: TFLITE_BUILTINS_INT8 says “use only int8 ops, fail if any can’t be quantized.” inference_input_type=tf.int8 says “no float conversion at the boundary, the caller will pass int8.” Both matter on MCUs without float hardware.
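
To make the int8 mapping concrete, here is a minimal sketch of how a scale and zero point can be derived from an observed float range, using the affine relation real = scale * (q - zero_point). It illustrates the arithmetic only; the converter's actual calibration is more involved, and QuantParams, params_from_range, and dequantize are hypothetical names for this sketch, not TFLite APIs.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical helper type; TFLite stores the same two values per tensor.
struct QuantParams {
  float scale;
  int32_t zero_point;
};

// Map the float range [min_val, max_val] onto int8 [-128, 127] so that
// real_value ~= scale * (q - zero_point).
inline QuantParams params_from_range(float min_val, float max_val) {
  // Widen the range to include 0.0 so that zero is exactly representable.
  min_val = std::min(min_val, 0.0f);
  max_val = std::max(max_val, 0.0f);
  float scale = (max_val - min_val) / 255.0f;
  if (scale == 0.0f) scale = 1.0f;  // degenerate all-zero range
  int32_t zp = static_cast<int32_t>(std::lround(-128.0f - min_val / scale));
  zp = std::clamp(zp, static_cast<int32_t>(-128), static_cast<int32_t>(127));
  return {scale, zp};
}

// Inverse mapping: recover the approximate float value from an int8 code.
inline float dequantize(int8_t q, QuantParams p) {
  return p.scale * static_cast<float>(static_cast<int32_t>(q) - p.zero_point);
}
```

For MFCC features that were normalized to [0, 1], this gives a zero point of -128 and a scale of 1/255, so the int8 code -128 means 0.0 and 127 means roughly 1.0.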

3.1 Embedding the model

TFLite Micro expects the model as a C array. The standard tool is xxd:

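xxd -i kws.tflite > kws_model.cc

# Or with a cleaner name. The extern declarations matter: a const array has
# internal linkage in C++, so without them main.cc could not link against it.
{
  echo 'extern "C" {'
  echo 'extern const unsigned char kws_model_data[];'
  echo 'extern const unsigned int kws_model_data_len;'
  echo 'alignas(8) const unsigned char kws_model_data[] = {'
  xxd -i < kws.tflite
  echo '};'
  echo 'const unsigned int kws_model_data_len = sizeof(kws_model_data);'
  echo '}'
} > kws_model.cc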

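With the const qualifier, the model bytes stay in flash (read-only); the plain xxd -i output is a non-const array, which most toolchains copy into RAM at startup. The alignas(8) is important; some MCU tooling complains about unaligned access to flatbuffer offsets.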
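
If you want to catch a misaligned model array at startup rather than as a hard fault later, a small check like the following can help. This is a sketch; is_aligned is a hypothetical helper, not part of TFLite Micro.

```cpp
#include <cstddef>
#include <cstdint>

// True if ptr sits on an `alignment`-byte boundary.
// `alignment` must be a power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
  return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}
```

Calling is_aligned(kws_model_data, 8) before tflite::GetModel() and halting on failure turns a silent layout problem into an obvious one.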

4. The C++ inference harness

TFLite Micro is a C++ library you vendor into your firmware, either as source files or as a prebuilt static library. We’re not going to set up a full PlatformIO project here, but the inference code itself is portable.

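// main.cc
#include <cstdint>   // int8_t, uint8_t
#include <cstring>   // memcpy

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Provided elsewhere in the firmware (audio frontend and actions).
void capture_mfcc(int8_t* features);
void trigger_yes();
void trigger_no();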

extern "C" {
  extern const unsigned char kws_model_data[];
  extern const unsigned int kws_model_data_len;
}

// Tensor arena: scratch memory for activations and intermediate tensors.
// Size this empirically; see section 5.
constexpr int kTensorArenaSize = 32 * 1024;
alignas(16) uint8_t tensor_arena[kTensorArenaSize];

namespace {
  const tflite::Model* model = nullptr;
  tflite::MicroInterpreter* interpreter = nullptr;
  TfLiteTensor* input = nullptr;
  TfLiteTensor* output = nullptr;
}

void setup_model() {
  model = tflite::GetModel(kws_model_data);
  if (model->version() != TFLITE_SCHEMA_VERSION) {
    // version mismatch; halt
    while (1) {}
  }

  // Register only the ops we need. Keep this list small to save flash.
  static tflite::MicroMutableOpResolver<6> resolver;
  resolver.AddConv2D();
  resolver.AddDepthwiseConv2D();
  resolver.AddFullyConnected();
  resolver.AddSoftmax();
  resolver.AddReshape();
  resolver.AddQuantize();

  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;

  TfLiteStatus status = interpreter->AllocateTensors();
  if (status != kTfLiteOk) {
    while (1) {}
  }

  input = interpreter->input(0);
  output = interpreter->output(0);
}

int classify(const int8_t* mfcc_features) {
  // Copy features into input tensor
  memcpy(input->data.int8, mfcc_features, input->bytes);

  if (interpreter->Invoke() != kTfLiteOk) {
    return -1;
  }

  // Find argmax
  int8_t* out = output->data.int8;
  int best = 0;
  int8_t best_val = out[0];
  for (int i = 1; i < static_cast<int>(output->bytes); i++) {  // int8: one byte per class
    if (out[i] > best_val) {
      best_val = out[i];
      best = i;
    }
  }
  return best;
}

int main() {
  setup_model();

  // Main loop: read MFCC, classify, react
  int8_t mfcc[49 * 10];
  while (1) {
    capture_mfcc(mfcc);             // your audio frontend
    int label = classify(mfcc);
    switch (label) {
      case 0: trigger_yes(); break;
      case 1: trigger_no(); break;
      default: break;                // unknown, do nothing
    }
  }
}

That’s the entire inference path. About 80 lines of C++. The bulk of the firmware project is the audio capture and MFCC computation, not the inference.
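
The argmax above always returns some class, even when the model is unsure. One possible refinement is to turn the winning int8 score back into a probability and ignore weak detections. The sketch below assumes the output tensor's scale and zero point are passed in (on the device they would come from output->params); classify_with_threshold and dequantize_score are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Convert a quantized softmax output back to a probability:
// real = scale * (q - zero_point).
inline float dequantize_score(int8_t q, float scale, int zero_point) {
  return scale * static_cast<float>(static_cast<int>(q) - zero_point);
}

// Return the argmax index, or -1 if the best score is below min_confidence.
inline int classify_with_threshold(const int8_t* scores, std::size_t count,
                                   float scale, int zero_point,
                                   float min_confidence) {
  if (count == 0) return -1;
  std::size_t best = 0;
  for (std::size_t i = 1; i < count; ++i) {
    if (scores[i] > scores[best]) best = i;
  }
  const float p = dequantize_score(scores[best], scale, zero_point);
  return p >= min_confidence ? static_cast<int>(best) : -1;
}
```

Because the switch in main() already treats anything other than 0 or 1 as "do nothing", a -1 result drops into the default branch without further changes. The threshold itself is something to tune against recorded audio.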

4.1 Op resolver, the flash-saving trick

MicroMutableOpResolver<N> only includes the kernels you Add*(). If your model uses Conv2D and FullyConnected but not LSTM, don’t register LSTM. Each kernel costs flash; for a small MCU, this matters.

A resolver that registers every op (AllOpsResolver in older TFLite Micro releases; BuiltinOpResolver is the full-TFLite counterpart) costs about 200 KB of flash. The mutable resolver with 6 ops above costs about 35 KB.

5. Sizing the tensor arena

The tensor arena is the scratch RAM TFLite Micro uses for activations. Too small and AllocateTensors() fails. Too big and you’ve wasted RAM that the rest of your firmware needs.

The right approach is empirical. Start with a generous size, run inference, then shrink to the actual usage.

// Right after AllocateTensors() succeeds:
size_t used = interpreter->arena_used_bytes();
printf("Arena used: %zu bytes\n", used);
// e.g., "Arena used: 7152 bytes"

Once you know the real usage, set kTensorArenaSize to that plus a 10% margin. For the keyword spotter above, the arena ends up around 7-8 KB. The interpreter object itself plus the op kernels take another few KB.
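
The margin arithmetic is easy to get slightly wrong, so here is one way to pin it down at compile time. This is a sketch under two assumptions of mine: the margin is rounded up, and the result is padded to a multiple of 16 bytes to match the arena's alignment. round_up and arena_size_for are hypothetical helpers.

```cpp
#include <cstddef>

// Round value up to the next multiple of `alignment` (a power of two).
constexpr std::size_t round_up(std::size_t value, std::size_t alignment) {
  return (value + alignment - 1) & ~(alignment - 1);
}

// Measured usage plus a 10% margin (rounded up), padded to 16 bytes.
constexpr std::size_t arena_size_for(std::size_t used_bytes) {
  return round_up(used_bytes + (used_bytes + 9) / 10, 16);
}

// 7152 B measured -> 7868 B with margin -> 7872 B after padding.
static_assert(arena_size_for(7152) == 7872, "unexpected arena size");
```

With this in place, kTensorArenaSize can be written as arena_size_for(7152), which keeps the measured number visible in the source.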

Memory breakdown for the keyword spotter on Cortex-M4:
  Model in flash:       ~20 KB
  Op kernels in flash:  ~35 KB
  Tensor arena (RAM):   ~8 KB
  Interpreter (RAM):    ~3 KB
  Total RAM:            ~11 KB
  Total Flash:          ~55 KB

Fits comfortably on an STM32F4 with room to spare.
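
If you want the build to fail when the ML part outgrows its share of the chip, the breakdown above can be encoded as compile-time checks. The footprint figures come from the breakdown; the budget values are hypothetical and would depend on what the rest of your firmware needs.

```cpp
#include <cstddef>

constexpr std::size_t KiB = 1024;

// Approximate figures from the memory breakdown.
constexpr std::size_t kModelFlash  = 20 * KiB;
constexpr std::size_t kKernelFlash = 35 * KiB;
constexpr std::size_t kArenaRam    = 8 * KiB;
constexpr std::size_t kInterpRam   = 3 * KiB;

// Hypothetical budgets: the slice of an STM32F4's 192 KB RAM and 1 MB
// flash that the ML part may claim. The rest belongs to the application.
constexpr std::size_t kRamBudget   = 64 * KiB;
constexpr std::size_t kFlashBudget = 256 * KiB;

constexpr std::size_t total_ram()   { return kArenaRam + kInterpRam; }
constexpr std::size_t total_flash() { return kModelFlash + kKernelFlash; }

static_assert(total_ram() <= kRamBudget, "ML RAM over budget");
static_assert(total_flash() <= kFlashBudget, "ML flash over budget");
```

Updating the constants whenever the model or op list changes keeps the budget honest as the project grows.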

6. CMSIS-NN, for ARM acceleration

CMSIS-NN is ARM’s hand-tuned kernel library for Cortex-M. TFLite Micro can use it as a drop-in replacement for the reference kernels. Speedup is 2-4x on most ops.

To enable, build TFLite Micro with OPTIMIZED_KERNEL_DIR=cmsis_nn:

# From the tflite-micro source tree
make -f tensorflow/lite/micro/tools/make/Makefile \
  TARGET=cortex_m_generic \
  TARGET_ARCH=cortex-m4+fp \
  OPTIMIZED_KERNEL_DIR=cmsis_nn \
  microlite

The output is a libmicrolite.a you link into your firmware. The CMSIS-NN-backed Conv2D on a Cortex-M4 at 168 MHz runs the keyword spotter in about 18ms per inference. The reference kernels take 70ms for the same model.
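
To get a first-order feel for what those numbers mean on a slower part, you can convert the latency to a cycle count and rescale it. The sketch below assumes the cycle count stays constant across clocks, which ignores flash wait states and memory differences, so treat the result as a rough estimate. The helper names are hypothetical.

```cpp
#include <cstdint>

// Cycles consumed by one inference: latency in microseconds times MHz.
constexpr std::uint64_t cycles_for(std::uint32_t latency_us,
                                   std::uint32_t clock_mhz) {
  return static_cast<std::uint64_t>(latency_us) * clock_mhz;
}

// Estimated latency at another clock, assuming a constant cycle count.
constexpr std::uint32_t scaled_latency_us(std::uint32_t latency_us,
                                          std::uint32_t from_mhz,
                                          std::uint32_t to_mhz) {
  return static_cast<std::uint32_t>(cycles_for(latency_us, from_mhz) / to_mhz);
}

// 18 ms at 168 MHz is about 3.0 million cycles.
static_assert(cycles_for(18000, 168) == 3024000, "cycle count");
// The same cycle count at 64 MHz (nRF52840) is about 47 ms.
static_assert(scaled_latency_us(18000, 168, 64) == 47250, "scaled latency");
```

Measuring on the actual board is still the only number to trust; this just tells you whether a target is in the right ballpark before you order it.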

7. Custom ops, when stock isn’t enough

If your model uses an op TFLite Micro doesn’t include, you have two choices: simplify the model to use only supported ops, or write a custom op.

Writing a custom op looks like this:

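// Needs tensorflow/lite/micro/kernels/kernel_util.h and micro_common.h.
// Names follow recent tflite-micro; older releases use TfLiteRegistration.
namespace tflite {
namespace ops {
namespace micro {
namespace my_custom_op {

void* Init(TfLiteContext*, const char*, size_t) { return nullptr; }

TfLiteStatus Prepare(TfLiteContext* context, TfLiteNode* node) {
    // Validate shapes, allocate any persistent state
    return kTfLiteOk;
}

TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
    // Eval works on TfLiteEvalTensor, the slimmed-down runtime tensor.
    const TfLiteEvalTensor* in = tflite::micro::GetEvalInput(context, node, 0);
    TfLiteEvalTensor* out = tflite::micro::GetEvalOutput(context, node, 0);
    // Your computation here
    return kTfLiteOk;
}

}  // namespace my_custom_op

// AddCustom() takes a pointer, so hand out a static registration.
TFLMRegistration* Register_MY_CUSTOM_OP() {
    static TFLMRegistration r =
        tflite::micro::RegisterOp(my_custom_op::Init, my_custom_op::Prepare,
                                  my_custom_op::Eval);
    return &r;
}

}}}  // namespaces

// In your resolver setup (raise the resolver capacity by one first,
// e.g. MicroMutableOpResolver<7>):
resolver.AddCustom("MyCustomOp", tflite::ops::micro::Register_MY_CUSTOM_OP());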

Custom ops are a last resort. They’re tied to your hardware, hard to test, and break when you update TFLite Micro.

8. Common Pitfalls

Pitfall 1, training with augmentation TFLite can’t represent

Random Gaussian noise during training is fine. But if your preprocessing pipeline uses any op that doesn’t have a TFLite Micro equivalent (some advanced normalizations, complex resamplers), you’ll convert to TFLite, then deploy, and the inference will fail at runtime with a cryptic kernel error. Validate the conversion before training a full epoch.

Pitfall 2, mixing float and int8 at the boundary

If you set inference_input_type=tf.int8 and then pass a float buffer to input->data.int8, the inference runs on garbage. Always quantize the input on the firmware side. The quantization scale and zero point are in the model; read them with input->params.scale and input->params.zero_point.

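// Needs <math.h> for lroundf and <stdint.h> for the fixed-width types.
int8_t quantize(float x, float scale, int zero_point) {
    // Round to nearest; a plain cast truncates toward zero and biases the input.
    int32_t q = (int32_t)lroundf(x / scale) + zero_point;
    if (q < -128) q = -128;
    if (q > 127) q = 127;
    return (int8_t)q;
}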

Pitfall 3, the arena is fine for inference but stack overflows

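The MCU’s main stack might be 4 KB by default. For a simple model like this one, Invoke() walks the ops in a loop rather than recursing, but kernel call chains, local buffers such as the MFCC array in main(), and interrupt handlers can all land on that stack. If you see hard faults during Invoke(), bump your stack size before suspecting TFLite Micro.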
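
A common way to find out how much stack you actually use is stack painting: fill the unused stack with a known pattern at boot, run the workload, then count how much of the pattern survived. The sketch below shows the idea on a plain array. On a real target you would point it at the stack region from your linker script and paint only the part below the current stack pointer; those details are board-specific and not shown. The names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kStackPaint = 0xDEADBEEF;

// Fill the region with a known pattern (do this early at boot).
inline void paint_stack(std::uint32_t* base, std::size_t words) {
  for (std::size_t i = 0; i < words; ++i) base[i] = kStackPaint;
}

// Count words, starting at the low end, that still hold the pattern.
// On Cortex-M the stack grows downward, so these were never touched.
inline std::size_t unused_stack_words(const std::uint32_t* base,
                                      std::size_t words) {
  std::size_t n = 0;
  while (n < words && base[n] == kStackPaint) ++n;
  return n;
}
```

If the unused count after a few thousand inferences is close to zero, the stack is too small, whatever the hard-fault handler says.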

Pitfall 4, forgetting alignas on the arena

tensor_arena must be aligned, typically to 16 bytes. Without alignment, some kernels (especially CMSIS-NN ones using SIMD) hard-fault. alignas(16) on the array declaration is the fix.

9. Troubleshooting

AllocateTensors() returns kTfLiteError with no detail

Almost always the tensor arena is too small. The other common cause is an op that the model uses but the resolver never registered. Build with -DTF_LITE_STATIC_MEMORY, allocate on a deliberately oversized arena, print interpreter->arena_used_bytes() once allocation succeeds, then size down.

Inference returns reasonable numbers but the classification is always class 0

Quantization input mismatch. Print input->params.scale and input->params.zero_point and verify your firmware applies the same scale and zero point. A 10x mismatch in scale will silently make every input look like the same value to the model.
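
A host-side experiment makes the effect of a scale mismatch visible. The sketch below quantizes 100 evenly spaced inputs and counts how many distinct int8 codes come out, once with a matching scale and once with a scale that is 10x too large. The values are made up for illustration, and std::set is used for convenience on the host, not as firmware code.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <set>

// Round-to-nearest affine quantization with clamping to int8.
inline int8_t quantize_value(float x, float scale, int zero_point) {
  long q = std::lround(x / scale) + zero_point;
  if (q < -128) q = -128;
  if (q > 127) q = 127;
  return static_cast<int8_t>(q);
}

// Quantize `count` evenly spaced inputs (0, step, 2*step, ...) and
// return how many distinct int8 codes they map to.
inline std::size_t distinct_codes(float step, std::size_t count,
                                  float scale, int zero_point) {
  std::set<int8_t> codes;
  for (std::size_t k = 0; k < count; ++k) {
    codes.insert(
        quantize_value(static_cast<float>(k) * step, scale, zero_point));
  }
  return codes.size();
}
```

With step 0.01 and zero point -128, a matching scale of 0.01 keeps all 100 inputs distinct, while a scale of 0.1 collapses them into 11 codes. A scale that is too small fails the other way: most inputs saturate at 127.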

CMSIS-NN build fails with “unknown architecture”

The TARGET_ARCH flag has to match your chip exactly. Cortex-M4 with FP is cortex-m4+fp. Without FP is cortex-m4. The TFLite Micro makefile is pedantic about this and the error message isn’t helpful.

10. Wrapping Up

TFLite Micro is the only sensible answer for inference on microcontrollers in April 2025. The toolchain is mature, CMSIS-NN gives you good performance on ARM, and the deployment story is straightforward if unglamorous. The keyword spotter we built fits in 55 KB of flash, 11 KB of RAM, and runs at sub-20ms on a $5 chip. That’s a real product, not a demo.

The next and final post in this series steps way back, looking at observability for edge fleets at scale. Once you’ve got TFLite Micro running on a thousand microcontrollers, the question “is fleet 47 working?” becomes the hard part.

The TFLite Micro repository moved out of TensorFlow proper and now lives at github.com/tensorflow/tflite-micro. The README is the right starting point.
