Digital Twins for Industrial Systems, Real Telemetry Edition

August 23, 2024 · 7 min read · by Muhammad Amal

TL;DR — A digital twin is just a queryable state model of an asset, kept in sync by telemetry. Skip the 3D renderings until the data layer is right. The value is in queries, not visuals.

“Digital twin” might be the most overloaded term in industrial software. Vendors use it for everything from a 3D CAD viewer to an enterprise-wide asset model platform. I’m going to use it in a narrow, useful sense. A digital twin is a queryable, structured representation of an asset’s state and history, kept in sync with reality by live telemetry, and exposing an API that other systems can use.

That definition is deliberately unglamorous. No fancy visualization. No physics simulation. Those are layers you can add when the data layer is solid. In 2024, the data layer is the lever, because most twins fail not because of bad UX but because the telemetry feeding them is unreliable.

This post describes the architecture I use for an industrial digital twin built on real telemetry. It assumes the rest of the stack from earlier posts is in place: an MQTT broker, Kafka, a time-series database, and edge gateways doing protocol translation.

What a Twin Actually Is

Strip away the marketing and a twin has four parts.

A static structure model, the asset hierarchy and metadata. What sites, areas, lines, machines exist, how they relate, their nameplate data.

A dynamic state model, the current values of every relevant signal, plus a small derived state, like operating mode, fault flags, recent throughput.

A history, accessible by time range, for the signals that matter. Not every raw sample, but enough resolution to answer business questions.

An API, REST or GraphQL or whatever your platform speaks, that lets other systems read the structure, the current state, and the history.

That’s it. Everything else is optional.

The State Model

The state model is where most twins get cluttered. The temptation is to put every raw tag into the twin’s state. Don’t. The twin should expose a curated set of named signals, each with units, ranges, and a stable identity, derived from raw telemetry by a known transformation.

For a packaging line twin, the state might be something like:

id: "plant/jkt1/packaging/line3"
type: "packaging_line"
state:
  mode: "RUNNING"          # one of {RUNNING, IDLE, FAULT, STOP}
  speed_units_per_min: 218
  current_product: "SKU-7724"
  current_batch: "B-2024-08-23-014"
  uptime_seconds_today: 21420
  fault_codes_active: []
  oee_estimate: 0.83
metrics:
  press1_bar: 5.42
  temp1_c: 72.4
  vibration_mms_rms: 1.2

The state has business meaning. mode is a state machine, not a raw boolean. oee_estimate is a calculated rollup. These are what consumers actually want to query. They’re computed by a stream processor, not by the twin’s read path.
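
One way to keep that curation explicit is a small catalog that maps each raw tag to a named signal with units, a plausible range, and a transformation. The sketch below is a minimal illustration under assumptions: the raw tag names, the scalings, and the `SignalDef` and `curate` names are hypothetical, not part of any particular platform.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class SignalDef:
    name: str                            # stable, curated name exposed by the twin
    unit: str
    lo: float                            # plausible range; values outside are dropped
    hi: float
    transform: Callable[[float], float]  # raw value -> engineering value


# Hypothetical raw tags and scalings, for illustration only.
CATALOG = {
    "PLC3.DB12.PT101": SignalDef("press1_bar", "bar", 0.0, 10.0, lambda raw: raw / 100.0),
    "PLC3.DB12.TT204": SignalDef("temp1_c", "degC", -20.0, 150.0, lambda raw: raw / 10.0),
}


def curate(raw_tag: str, raw_value: float) -> Optional[Tuple[str, float]]:
    """Return (signal_name, value) for a known tag with an in-range value, else None."""
    sig = CATALOG.get(raw_tag)
    if sig is None:
        return None                      # uncurated tags never reach the twin
    value = sig.transform(raw_value)
    if not (sig.lo <= value <= sig.hi):
        return None                      # out of range: drop rather than publish
    return sig.name, value
```

Anything not in the catalog simply never shows up in the twin, which is the point: adding a signal becomes a deliberate change rather than a side effect of a new PLC tag.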

Sync Pattern

The twin’s state is updated by a stream processor that subscribes to the Kafka telemetry topic and writes to two places. The current state lives in a fast key-value store, keyed by the asset’s canonical ID. Redis works, and so does Postgres with an INSERT ... ON CONFLICT DO UPDATE upsert. The history lives in the time-series database for full fidelity.

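# Sketch of a Faust-style consumer. app, telemetry_topic, asset_for, new_state,
# transition, state_store and history_store are assumed to be defined elsewhere.
@app.agent(telemetry_topic)
async def update_twin(stream):
    async for ev in stream:
        asset = asset_for(ev.metric_id)           # map the raw metric to its owning asset
        async with state_store.lock(asset.id):    # serialize read-modify-write per asset
            cur = await state_store.get(asset.id) or new_state(asset)
            cur = transition(cur, ev)             # business-state machine
            cur.metrics[ev.metric_name] = ev.value
            cur.updated_at = ev.ts
            await state_store.set(asset.id, cur)
        await history_store.append(asset.id, ev)  # history write happens outside the lock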

transition is the business state machine. Given the previous state and a new event, return the new state. This is where mode changes (running to fault, idle to running) are computed. Keep it pure and testable.
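
To make “pure and testable” concrete, here is a minimal sketch of such a function. The event kinds (`speed`, `fault_raised`, `fault_cleared`, `stop_cmd`, `start_cmd`) and the `LineState` and `Event` shapes are assumptions for illustration; a real line will have its own rules.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class LineState:
    mode: str = "IDLE"                        # RUNNING, IDLE, FAULT, or STOP
    fault_codes_active: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Event:
    kind: str            # hypothetical: speed, fault_raised, fault_cleared, stop_cmd, start_cmd
    code: str = ""       # fault code, used by fault events
    value: float = 0.0   # measured value, used by speed events


def transition(state: LineState, ev: Event) -> LineState:
    """Return the next state. Never mutates the input, never does I/O."""
    if ev.kind == "fault_raised":
        codes = tuple(sorted(set(state.fault_codes_active) | {ev.code}))
        return replace(state, mode="FAULT", fault_codes_active=codes)
    if ev.kind == "fault_cleared":
        codes = tuple(c for c in state.fault_codes_active if c != ev.code)
        if codes:                             # other faults still active
            return replace(state, fault_codes_active=codes)
        mode = "IDLE" if state.mode == "FAULT" else state.mode
        return replace(state, mode=mode, fault_codes_active=codes)
    if ev.kind == "stop_cmd" and state.mode != "FAULT":
        return replace(state, mode="STOP")
    if ev.kind == "start_cmd" and state.mode == "STOP":
        return replace(state, mode="IDLE")
    if ev.kind == "speed" and state.mode in ("IDLE", "RUNNING"):
        return replace(state, mode="RUNNING" if ev.value > 0 else "IDLE")
    return state                              # everything else leaves the state unchanged
```

Because the function takes a state and an event and returns a state, a unit test is just a list of events and an expected final mode, with no broker or database involved.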

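The lock per asset matters. The update is a read-modify-write cycle, so without the lock two concurrent updates on the same asset can interleave and the later write silently overwrites the earlier one. Per-asset locking is cheap if the keys are well distributed.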
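
A toy version of the locking, assuming a single process and an in-memory store (`InMemoryStateStore` and `bump` are hypothetical names). An `asyncio.Lock` only protects coroutines inside one process; with several workers you would need a lock in the store itself or a topic partitioned by asset ID.

```python
import asyncio
from collections import defaultdict


class InMemoryStateStore:
    """Toy stand-in for the state store, with one asyncio.Lock per asset ID."""

    def __init__(self):
        self._data = {}
        self._locks = defaultdict(asyncio.Lock)

    def lock(self, asset_id):
        return self._locks[asset_id]

    async def get(self, asset_id):
        await asyncio.sleep(0)           # yield control, as a real network call would
        return self._data.get(asset_id)

    async def set(self, asset_id, value):
        await asyncio.sleep(0)
        self._data[asset_id] = value


async def bump(store, asset_id):
    """Read-modify-write under the per-asset lock."""
    async with store.lock(asset_id):
        cur = await store.get(asset_id) or 0
        await store.set(asset_id, cur + 1)


async def main():
    store = InMemoryStateStore()
    # 100 concurrent updates against the same asset.
    await asyncio.gather(*(bump(store, "line3") for _ in range(100)))
    return await store.get("line3")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Remove the `async with` line and the same run loses almost every update, because all the coroutines read the old value before any of them writes.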

API

Two endpoints carry most of the value.

GET /twin/{id} returns the current state. Latency target is 50 ms p99. Backed by the KV store.

GET /twin/{id}/history?metric=...&from=...&to=...&interval=... returns a downsampled history. Latency target is a second or two for a 24-hour window at 1-minute resolution. Backed by the time-series database, using continuous aggregates.
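
In production the time-series database does the downsampling. To pin down what the `interval` parameter means, here is a plain-Python sketch of the same bucketing, assuming epoch-second timestamps and averaging as the aggregate. The function name and signature are illustrative, and buckets here are aligned to `start`, which may differ from how your database aligns them.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def downsample(samples: Iterable[Tuple[float, float]], start: float, end: float,
               interval_s: float) -> List[Tuple[float, float]]:
    """Average (ts, value) samples into fixed buckets over [start, end).

    Each bucket is labeled by its start time. Empty buckets are omitted
    rather than filled.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in samples:
        if start <= ts < end:
            bucket = start + ((ts - start) // interval_s) * interval_s
            sums[bucket] += value
            counts[bucket] += 1
    return [(b, sums[b] / counts[b]) for b in sorted(sums)]
```

A 24-hour window at 1-minute resolution is at most 1,440 points per metric, which is a comfortable response size for a dashboard.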

A third endpoint, GET /twin/{id}/events, returns business-state transitions. Mode changes, fault entries and exits, batch starts and stops. These are derived events emitted by the stream processor and stored separately, because they’re the events humans actually want to query.
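
A minimal sketch of how a stream processor could derive those events, by comparing the state before and after each update. The event kinds and the `TwinEvent` shape are assumptions for illustration; the state keys follow the packaging line example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TwinEvent:
    asset_id: str
    ts: float
    kind: str        # hypothetical kinds: mode_changed, fault_entered, fault_exited, batch_*
    detail: dict


def derive_events(asset_id: str, ts: float, prev: dict, new: dict) -> List[TwinEvent]:
    """Emit one business event per meaningful difference between two states."""
    events = []
    if prev.get("mode") != new.get("mode"):
        events.append(TwinEvent(asset_id, ts, "mode_changed",
                                {"from": prev.get("mode"), "to": new.get("mode")}))
    before = set(prev.get("fault_codes_active", []))
    after = set(new.get("fault_codes_active", []))
    for code in sorted(after - before):
        events.append(TwinEvent(asset_id, ts, "fault_entered", {"code": code}))
    for code in sorted(before - after):
        events.append(TwinEvent(asset_id, ts, "fault_exited", {"code": code}))
    old_batch, new_batch = prev.get("current_batch"), new.get("current_batch")
    if old_batch != new_batch:
        if old_batch:
            events.append(TwinEvent(asset_id, ts, "batch_stopped", {"batch": old_batch}))
        if new_batch:
            events.append(TwinEvent(asset_id, ts, "batch_started", {"batch": new_batch}))
    return events
```

Most telemetry updates change only a metric value and produce no events at all, which is why this stream stays small enough for humans to read.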

A small GraphQL layer on top is nice if your consumers are diverse. A REST shape is fine for most workloads.

Hierarchies

Industrial assets nest. Plants contain areas contain lines contain machines contain sub-assemblies. The twin needs to express this without being a pain to query.

I model it as a directed acyclic graph of asset IDs, with parent and children relationships. Aggregations roll up the tree on demand or are pre-computed on a schedule. A line’s oee_estimate is computed from its machines. A plant’s units_produced_today is summed from its lines.

-- A continuous aggregate over a line, joined to the asset hierarchy.
-- Joins inside continuous aggregates need TimescaleDB 2.10 or later.
CREATE MATERIALIZED VIEW line_5m WITH (timescaledb.continuous) AS
SELECT
    time_bucket('5 minutes', t.ts) AS bucket,
    a.line_id,
    sum(t.value) FILTER (WHERE t.metric_id = 901) AS units_produced,
    avg(t.value) FILTER (WHERE t.metric_id = 1024) AS avg_press_bar
FROM telemetry t
JOIN assets a ON a.device_id = t.device_id
GROUP BY bucket, a.line_id;

Pre-computing the aggregates at the line level is what makes the twin API fast for dashboards.
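
The on-demand variant of the rollup can be sketched in a few lines. The hierarchy, the counter values, and the function names below are hypothetical. Collecting descendants into a set first matters because in a DAG a node can be reachable through more than one parent, and a naive recursive sum would count it twice.

```python
from typing import Dict, List, Set

# Hypothetical hierarchy: parent ID -> child IDs.
CHILDREN: Dict[str, List[str]] = {
    "plant/jkt1": ["plant/jkt1/packaging"],
    "plant/jkt1/packaging": [
        "plant/jkt1/packaging/line3",
        "plant/jkt1/packaging/line4",
    ],
}

# Leaf-level counters, as they might sit in the state store.
UNITS_TODAY: Dict[str, int] = {
    "plant/jkt1/packaging/line3": 41200,
    "plant/jkt1/packaging/line4": 38950,
}


def descendants(asset_id: str) -> Set[str]:
    """The asset itself plus everything below it, each node visited once."""
    seen: Set[str] = set()
    stack = [asset_id]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(CHILDREN.get(node, []))
    return seen


def units_produced_today(asset_id: str) -> int:
    """Sum the leaf-level counter over the asset's subtree, on demand."""
    return sum(UNITS_TODAY.get(node, 0) for node in descendants(asset_id))
```

On-demand rollups like this are fine for current-state counters. For anything over a time range, the pre-computed aggregates are the better fit.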

Use Cases That Pay Back

Not every twin earns its keep. The ones I’ve seen genuinely return value:

OEE and downtime analysis. The twin’s mode state, derived from telemetry, gives you accurate uptime, downtime, and fault categorization. The traditional alternative is operator-entered codes, which are wrong about half the time.

Cross-shift handover dashboards. A single-page view of “what changed in the last shift” backed by the twin’s events endpoint is something control room teams will actually use.

Maintenance scheduling. The twin knows how long a machine has been running, how many cycles, recent vibration trends. Feeding that into a CMMS for condition-based maintenance is a clear win.
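
For the OEE and downtime case, the core calculation is small once mode transitions exist as events. The sketch below assumes transitions arrive as `(ts, new_mode)` pairs sorted by time, and it treats STOP as planned downtime that is excluded from availability. That split is an assumption; your plant’s definition of planned time may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def time_in_mode(transitions: List[Tuple[float, str]], window_end: float) -> Dict[str, float]:
    """Seconds spent in each mode, given (ts, new_mode) transitions sorted by ts.

    The first transition marks the start of the window; the last mode is
    held until window_end.
    """
    totals = defaultdict(float)
    following = transitions[1:] + [(window_end, "")]
    for (ts, mode), (next_ts, _) in zip(transitions, following):
        totals[mode] += next_ts - ts
    return dict(totals)


def availability(totals: Dict[str, float],
                 planned_modes: Tuple[str, ...] = ("RUNNING", "IDLE", "FAULT")) -> float:
    """Share of planned time spent RUNNING (the availability term of OEE)."""
    planned = sum(totals.get(m, 0.0) for m in planned_modes)
    return totals.get("RUNNING", 0.0) / planned if planned else 0.0
```

The same per-mode totals, grouped by active fault code instead of by mode, give the fault categorization without anyone typing a reason code.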

What rarely pays back, in my experience, is real-time 3D visualization of the plant. It looks great in demos. Operators ignore it because they already know what the plant looks like.

When to Add Physics

Physics-based twins, where a model predicts behavior based on first principles, are useful in narrow cases. Simulating heat exchanger fouling. Predicting tank level given inflow and outflow. Computing virtual sensors when a physical sensor is impractical.

Add physics only after the data-driven twin is solid. Physics models need calibration against real data, and the real data needs to be trustworthy first. If you do this in the wrong order, you’ll spend months debugging models when the problem is dirty telemetry.
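
The tank level case is about the simplest physics model there is, which makes it a good first one. A minimal sketch, assuming a vertical-walled tank (constant cross-section) and an incompressible liquid, so the mass balance reduces to dh/dt = (q_in - q_out) / A. The function name and parameters are illustrative.

```python
def predict_level(level_m: float, inflow_m3_s: float, outflow_m3_s: float,
                  area_m2: float, dt_s: float, max_level_m: float) -> float:
    """One explicit-Euler step of the tank mass balance dh/dt = (q_in - q_out) / A."""
    new_level = level_m + (inflow_m3_s - outflow_m3_s) / area_m2 * dt_s
    return min(max(new_level, 0.0), max_level_m)   # clamp to the physical limits
```

Comparing the predicted level against the measured one over time is where calibration starts. A steadily growing gap usually points at a flow meter or a leak, and you can only tell which if the telemetry itself is trustworthy.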

Common Pitfalls

Putting raw tags in the twin’s state. Curate. Derive business signals. The twin is a product, not a dump.
No clear ownership of the asset model. The structure model needs an owner who decides when an asset is added, renamed, or retired. Without an owner, drift sets in fast.
State store as the source of truth. The Kafka log is the source of truth. The state store is a cache. Make sure you can rebuild it from the log.
One giant twin per plant. Twins should be per-asset, with hierarchy expressing the relationships. A single giant blob is hard to update and harder to consume.
Calling everything a twin. If it’s just a Grafana dashboard, call it a dashboard. The word doesn’t add value.
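
On the state store pitfall, “rebuild it from the log” amounts to replaying events in order through the same reducer the live path uses. A toy sketch, assuming events are plain dicts with `asset_id`, `metric_name`, `value`, and `ts` fields (hypothetical shapes), and that the log retains enough history to replay from:

```python
from typing import Dict, Iterable


def apply_event(state: dict, ev: dict) -> dict:
    """Toy reducer: returns a new state with the event's metric applied."""
    metrics = dict(state.get("metrics", {}))   # copy, so the input is not mutated
    metrics[ev["metric_name"]] = ev["value"]
    return {"metrics": metrics, "updated_at": ev["ts"]}


def rebuild(log: Iterable[dict]) -> Dict[str, dict]:
    """Replay the log in order and return the current state per asset."""
    states: Dict[str, dict] = {}
    for ev in log:
        asset_id = ev["asset_id"]
        states[asset_id] = apply_event(states.get(asset_id, {}), ev)
    return states
```

If a rebuild and the live state store disagree for the same log position, one of them has a bug, and that comparison is a cheap test to run on a schedule.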

Wrapping Up

Digital twins built on real telemetry are useful when you’re disciplined about scope. A curated state model, a clean sync pattern, a small API, and a hierarchy that reflects reality. That’s the whole thing. Skip the visualization layer until the data layer earns it.

For the upstream pieces, the earlier posts on IIoT reference architecture, time series for IIoT, and OPC UA bridging cover where the data comes from. The Digital Twin Definition Language (DTDL) v3 spec from Microsoft is a useful reference if you want a formal modeling language for the structure side.

Build the twin you’d actually use. Skip the one you’d demo. That’s the test.