Observability for Edge Fleets at Scale, Patterns That Work
TL;DR: Edge observability is not data-center observability with a worse network. It’s a different problem. Run an OpenTelemetry Collector on every device, aggregate and sample locally, push (don’t pull) to a central collector, and design alerts around fleet-wide patterns, not individual devices. The patterns in this post survive WAN outages without flooding your pager.
You’ve spent the month building edge AI pipelines. The hardware is humming, the broker is clustered, the inference workers are batching, the bridge is talking to PLCs. Production. Two weeks in, the customer calls. “Site 47 hasn’t sent data since Tuesday.” You SSH in (assuming you can). Now imagine that conversation happening 50 times a year across 2000 sites. That’s the observability problem.
This post is the capstone of the April series. We’ve built the pieces; this is how you keep them running. OpenTelemetry Collector 0.115 (December 2024 release) is the workhorse. Prometheus 3.0 (November 2024) on the central side. The patterns here come from real fleets, not theoretical reasoning.
If you’ve followed the series, this post ties together the streaming inference pipeline and the industrial bridge into a single observable system.
1. The four constraints of edge observability
The textbook observability stack assumes a few things that don’t hold at the edge.
Data-center assumption    Edge reality
---------------------------------------------------------------
Network is reliable       WAN drops constantly
Disk is unlimited         8 GB eMMC, full of logs
CPU is plentiful          Inference owns the cores
Operator can SSH in       NAT, firewalls, no SSH
Every observability decision at the edge falls out of these four constraints. You can’t pull metrics if the device is behind NAT. You can’t ship every log line if the WAN is 4G with a data cap. You can’t run a heavy agent if the inference workload needs the CPU. And you can’t debug live because you literally can’t reach the device.
2. The architecture, end to end
+--------------------------------- One edge site ---------------------------------+
|                                                                                 |
|      +-------------+   +-------------+   +-------------+   +-------------+      |
|      | bridge      |   | inference   |   | mqtt broker |   | system      |      |
|      | (Go)        |   | worker (Go) |   | (mosquitto) |   | metrics     |      |
|      +------+------+   +------+------+   +------+------+   +------+------+      |
|             |                 |                 |                 |             |
|             |                OTLP (gRPC, loopback)                |             |
|             v                 v                 v                 v             |
|             +-----------------+--------+--------+-----------------+             |
|                                        |                                        |
|                                +-------+-------+                                |
|                                | otelcol agent |                                |
|                                | (per device)  |                                |
|                                | + sampling    |                                |
|                                | + aggregation |                                |
|                                | + local disk  |                                |
|                                |   buffer      |                                |
|                                +-------+-------+                                |
|                                        |                                        |
+----------------------------------------|----------------------------------------+
                                         | OTLP/HTTP over TLS (push)
                                         v
                                +-----------------+
                                | otelcol gateway |
                                | (central)       |
                                +--------+--------+
                                         |
                    +--------------------+--------------------+
                    v                    v                    v
              +-----------+        +-----------+        +-----------+
              | Prometheus|        | Loki 3.3  |        | Tempo 2.6 |
              | 3.0       |        |           |        |           |
              +-----------+        +-----------+        +-----------+
Two collectors. The agent on every device aggregates and buffers. The central gateway is the funnel everything pushes to. Backends are whatever you like; the example uses Prometheus, Loki, and Tempo.
3. The edge agent config
This is the OpenTelemetry Collector configuration that runs on every device. About 100 lines and worth understanding line by line.
# /etc/otelcol/agent.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317
      http:
        endpoint: 127.0.0.1:4318
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      filesystem:
      network:
      load:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'local-services'
          scrape_interval: 30s
          static_configs:
            - targets: ['127.0.0.1:9100', '127.0.0.1:9101', '127.0.0.1:9102']
processors:
  resource:
    attributes:
      - key: site.id
        value: ${env:SITE_ID}
        action: insert
      - key: device.id
        value: ${env:DEVICE_ID}
        action: insert
      - key: deployment.environment
        value: ${env:ENVIRONMENT}
        action: insert
  batch:
    timeout: 10s
    send_batch_size: 1024
    send_batch_max_size: 2048
  memory_limiter:
    check_interval: 5s
    limit_mib: 128
    spike_limit_mib: 32
  filter/logs:
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_WARN'
  tail_sampling:
    decision_wait: 5s
    num_traces: 1000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 500}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
exporters:
  otlphttp:
    endpoint: https://otel-gateway.example.com:4318
    tls:
      insecure: false
    headers:
      authorization: Bearer ${env:OTEL_TOKEN}
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 10000
      storage: file_storage/queue
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 5m
      max_elapsed_time: 24h
extensions:
  file_storage/queue:
    directory: /var/lib/otelcol/queue
    timeout: 10s
service:
  extensions: [file_storage/queue]
  telemetry:
    logs:
      level: warn
  pipelines:
    metrics:
      receivers: [otlp, hostmetrics, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, filter/logs, batch]
      exporters: [otlphttp]
The interesting bits:
file_storage/queue extension. This is what survives WAN outages. The collector buffers to disk and retries for up to 24 hours. After that, you’ve got bigger problems than missing telemetry.
memory_limiter is non-negotiable on edge. Without it, a backlog buildup will OOM the device.
tail_sampling keeps all error and slow traces, samples the rest at 5%. Smart sampling for low-bandwidth links.
filter/logs drops INFO and DEBUG logs at the source. They’re rarely useful at fleet scale and consume bandwidth.
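The three tail_sampling policies combine as a logical OR: a trace is kept if any one of them matches. The Go sketch below illustrates that decision only; it is not the collector’s implementation. The Trace type and keep function are hypothetical names, and the probabilistic policy is approximated by hashing the trace ID.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// Trace is a hypothetical, simplified summary of one finished trace.
type Trace struct {
	ID       string
	HasError bool
	Duration time.Duration
}

// keep returns true if any policy matches: the trace has an error, it ran
// longer than threshold, or its ID hashes into the sampled percentage.
func keep(t Trace, threshold time.Duration, percent uint32) bool {
	if t.HasError || t.Duration > threshold {
		return true
	}
	// Hashing the ID (instead of rolling dice) gives a stable decision per trace.
	h := fnv.New32a()
	h.Write([]byte(t.ID))
	return h.Sum32()%100 < percent
}

func main() {
	threshold := 500 * time.Millisecond
	traces := []Trace{
		{ID: "t1", HasError: true, Duration: 20 * time.Millisecond},
		{ID: "t2", Duration: 900 * time.Millisecond},
		{ID: "t3", Duration: 20 * time.Millisecond},
	}
	for _, t := range traces {
		fmt.Printf("%s keep=%v\n", t.ID, keep(t, threshold, 5))
	}
}
```

The error and latency checks run first, so the percentage only ever applies to traces that are both healthy and fast.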
4. Resource attributes, the key to fleet queries
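Every signal gets site.id, device.id, and deployment.environment attached at the resource level. This is what makes fleet-wide queries possible. On the Prometheus side the dots become underscores (site_id, device_id), and the attributes only show up as labels on each series if the gateway’s prometheusremotewrite exporter has resource_to_telemetry_conversion enabled; otherwise they land on target_info and need a join.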
# Sites with high inference error rate
sum(rate(inference_errors_total[5m])) by (site_id)
  / sum(rate(inference_requests_total[5m])) by (site_id) > 0.01

# Devices seen in the last day that have not reported in the last 5 minutes.
# Assumes every agent forwards a per-device up{job="otelcol"} series.
present_over_time(up{job="otelcol"}[1d])
  unless present_over_time(up{job="otelcol"}[5m])

# P99 inference latency per model, fleet-wide
histogram_quantile(0.99,
  sum(rate(inference_latency_bucket[5m])) by (le, model_id))
Make sure your application code adds the same attributes when it emits OTLP. The Go SDK makes this easy:
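import (
	"log"
	"os"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

// Build the resource once at startup; ctx is the service's startup context.
res, err := resource.New(ctx,
	resource.WithAttributes(
		semconv.ServiceName("inference-worker"),
		semconv.ServiceVersion("1.4.2"),
		attribute.String("site.id", os.Getenv("SITE_ID")),
		attribute.String("device.id", os.Getenv("DEVICE_ID")),
	),
)
if err != nil {
	log.Fatalf("building OTel resource: %v", err)
}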
Resource attributes propagate to every metric, trace, and log this process emits. Set them once at startup.
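One thing worth guarding against: os.Getenv returns an empty string when a variable is unset, so a mis-provisioned device would report with a blank site.id and vanish from per-site queries. A minimal sketch of a fail-fast check at startup follows; the identity type and loadIdentity function are hypothetical names, not part of the OpenTelemetry SDK.

```go
package main

import (
	"fmt"
	"os"
)

// identity holds the attributes every signal from this device carries.
type identity struct {
	SiteID, DeviceID string
}

// loadIdentity reads SITE_ID and DEVICE_ID through the supplied lookup
// function (os.LookupEnv in production) and rejects missing or empty values.
func loadIdentity(lookup func(string) (string, bool)) (identity, error) {
	site, ok := lookup("SITE_ID")
	if !ok || site == "" {
		return identity{}, fmt.Errorf("SITE_ID is not set")
	}
	dev, ok := lookup("DEVICE_ID")
	if !ok || dev == "" {
		return identity{}, fmt.Errorf("DEVICE_ID is not set")
	}
	return identity{SiteID: site, DeviceID: dev}, nil
}

func main() {
	id, err := loadIdentity(os.LookupEnv)
	if err != nil {
		fmt.Println("refusing to start:", err)
		return // a real service would exit non-zero here
	}
	fmt.Printf("site.id=%s device.id=%s\n", id.SiteID, id.DeviceID)
}
```

Passing the lookup function in, rather than calling os.LookupEnv directly, keeps the check testable without touching the process environment.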
5. Push, not pull
Prometheus’s pull model is a non-starter for edge devices behind NAT. You can’t dial into them. Instead, the agent pushes to a central gateway via OTLP/HTTP over TLS.
The central gateway is also an OpenTelemetry Collector, just configured differently.
# /etc/otelcol/gateway.yaml (simplified)
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        auth:
          authenticator: bearertokenauth
extensions:
  bearertokenauth:
    scheme: "Bearer"
    tokens: ${file:/etc/otelcol/tokens.txt}
processors:
  batch:
    timeout: 5s
exporters:
  prometheusremotewrite:
    endpoint: https://prometheus.internal/api/v1/write
    tls:
      ca_file: /etc/ssl/internal-ca.crt
    # Copy resource attributes (site.id, device.id, ...) onto every series as labels
    resource_to_telemetry_conversion:
      enabled: true
  loki:
    endpoint: https://loki.internal/loki/api/v1/push
  otlphttp/tempo:
    endpoint: https://tempo.internal:4318
service:
  extensions: [bearertokenauth]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/tempo]
Each edge device gets a unique bearer token. Compromised devices can be revoked without redeploying the whole fleet.
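To make the revocation property concrete, here is a small Go sketch of per-device token checking. In the setup above the collector’s bearertokenauth extension does this job; the tokenStore type, the requireToken middleware, and the token values below are hypothetical and only illustrate that revoking one token leaves every other device untouched.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// tokenStore maps bearer tokens to device IDs (hypothetical in-memory store).
type tokenStore struct {
	mu     sync.RWMutex
	tokens map[string]string // token -> device ID
}

func (s *tokenStore) device(token string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	id, ok := s.tokens[token]
	return id, ok
}

// revoke removes one device's token; other devices are unaffected.
func (s *tokenStore) revoke(token string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.tokens, token)
}

// requireToken rejects requests that do not carry a known bearer token.
func requireToken(s *tokenStore, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token, found := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !found {
			http.Error(w, "missing bearer token", http.StatusUnauthorized)
			return
		}
		if _, ok := s.device(token); !ok {
			http.Error(w, "unknown or revoked token", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newGateway wraps an accept-everything handler with the token check.
func newGateway(s *tokenStore) http.Handler {
	accept := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return requireToken(s, accept)
}

// status sends one request with the given token and returns the HTTP status.
func status(h http.Handler, token string) int {
	req := httptest.NewRequest(http.MethodPost, "/v1/metrics", nil)
	req.Header.Set("Authorization", "Bearer "+token)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	store := &tokenStore{tokens: map[string]string{"tok-a": "device-a", "tok-b": "device-b"}}
	h := newGateway(store)
	fmt.Println("before revoke:", status(h, "tok-a"), status(h, "tok-b"))
	store.revoke("tok-a")
	fmt.Println("after revoke: ", status(h, "tok-a"), status(h, "tok-b"))
}
```

A shared fleet-wide token would make the same revocation a fleet-wide redeploy, which is the trade-off the per-device scheme avoids.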
6. Alerting that survives WAN outages
The naive alert is “device X is down.” Run that across 2000 devices when the AT&T fiber backbone has a bad day and you get 2000 simultaneous pages.
Better alerts work on aggregates and ratios:
# alerts.yaml (Prometheus)
groups:
  - name: edge-fleet
    rules:
      - alert: HighFractionOfFleetUnreachable
        # Share of devices seen in the last day that went silent in the last 5 minutes.
        expr: |
          1 - (
            (count(present_over_time(up{job="otelcol"}[5m])) or vector(0))
            / count(present_over_time(up{job="otelcol"}[1d]))
          ) > 0.1
        for: 10m
        annotations:
          summary: "More than 10% of edge fleet is unreachable"
          description: "Likely WAN issue, not individual devices."
        labels:
          severity: critical
      - alert: SiteUnreachable
        # Sites that reported in the last day but have no device reporting now.
        expr: |
          count by (site_id) (present_over_time(up{job="otelcol"}[1d]))
          unless
          count by (site_id) (present_over_time(up{job="otelcol"}[5m]))
        for: 30m
        annotations:
          summary: "Site {{ $labels.site_id }} fully unreachable for 30m"
        labels:
          severity: warning
      - alert: InferenceErrorRateHigh
        expr: |
          sum by (site_id, model_id) (rate(inference_errors_total[5m])) /
          sum by (site_id, model_id) (rate(inference_requests_total[5m])) > 0.05
        for: 15m
        annotations:
          summary: "Inference error rate > 5% on {{ $labels.site_id }}/{{ $labels.model_id }}"
        labels:
          severity: warning
      - alert: DLQFlooding
        expr: |
          sum by (site_id) (rate(kafka_messages_in_topic{topic="inference.dlq"}[10m])) > 100
        for: 15m
        annotations:
          summary: "Site {{ $labels.site_id }} dead-lettering >100 msg/s"
        labels:
          severity: critical
The “HighFractionOfFleetUnreachable” alert is the one that catches WAN outages. When 10% of the fleet disappears simultaneously, it’s almost never 10% of the devices failing independently.
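The fleet-fraction idea is easy to lose inside PromQL, so here is the same logic as a short Go sketch. The function names, the 300-second window, and the 10% threshold are illustrative assumptions that mirror the alert rather than code from any real alerting system.

```go
package main

import (
	"fmt"
	"time"
)

// unreachableFraction returns the share of known devices whose last report
// (Unix seconds) is older than windowSec relative to now.
func unreachableFraction(lastSeen map[string]int64, now, windowSec int64) float64 {
	if len(lastSeen) == 0 {
		return 0
	}
	missing := 0
	for _, ts := range lastSeen {
		if now-ts > windowSec {
			missing++
		}
	}
	return float64(missing) / float64(len(lastSeen))
}

// fleetOutage reports whether enough devices went silent together to treat
// the event as one fleet-level incident instead of many device alerts.
func fleetOutage(fraction, threshold float64) bool {
	return fraction > threshold
}

func main() {
	now := time.Now().Unix()
	lastSeen := map[string]int64{
		"dev-1": now - 20,
		"dev-2": now - 30,
		"dev-3": now - 900, // silent for 15 minutes
		"dev-4": now - 10,
	}
	f := unreachableFraction(lastSeen, now, 300)
	fmt.Printf("unreachable=%.0f%% fleet-level page=%v\n", f*100, fleetOutage(f, 0.10))
}
```

The denominator is the set of devices that are known to exist, not the set currently reporting, which is what lets the ratio rise as devices go quiet.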
6.1 Synthetic checks from inside the device
The device can self-report health to the central API independent of the metrics pipeline. A 30-second heartbeat with structured status:
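// Heartbeat goroutine
import (
	"bytes"
	"context"
	"encoding/json"
	"io"
	"net/http"
	"os"
	"time"
)

// Heartbeat is the structured status each device posts every 30 seconds.
type Heartbeat struct {
	SiteID        string `json:"site_id"`
	DeviceID      string `json:"device_id"`
	Version       string `json:"version"`
	Uptime        int64  `json:"uptime_seconds"`
	InferenceOK   bool   `json:"inference_ok"`
	BrokerOK      bool   `json:"broker_ok"`
	DiskFreeBytes uint64 `json:"disk_free_bytes"`
}

// sendHeartbeats posts a heartbeat every 30 seconds until ctx is cancelled.
// buildHeartbeat (not shown) fills in the current status.
func sendHeartbeats(ctx context.Context) {
	t := time.NewTicker(30 * time.Second)
	defer t.Stop()
	client := &http.Client{Timeout: 10 * time.Second}
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			body, err := json.Marshal(buildHeartbeat())
			if err != nil {
				continue
			}
			req, err := http.NewRequestWithContext(ctx, http.MethodPost,
				"https://api.internal/heartbeat", bytes.NewReader(body))
			if err != nil {
				continue
			}
			req.Header.Set("Authorization", "Bearer "+os.Getenv("DEVICE_TOKEN"))
			req.Header.Set("Content-Type", "application/json")
			resp, err := client.Do(req)
			if err != nil {
				continue // WAN is down; try again on the next tick
			}
			io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
			resp.Body.Close()
		}
	}
}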
The heartbeat is the single endpoint you can hit when troubleshooting. “When did this device last call home?” gets you the timestamp instantly, no Prometheus query needed.
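The receiving side can be equally small. The sketch below is one possible shape for the central heartbeat endpoint, assuming an in-memory map is enough for illustration; the lastSeen type and its methods are hypothetical, and a real service would persist the timestamps and authenticate the caller.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
	"time"
)

// heartbeat carries only the fields this sketch needs from the device payload.
type heartbeat struct {
	SiteID   string `json:"site_id"`
	DeviceID string `json:"device_id"`
}

// lastSeen records when each device last called home.
type lastSeen struct {
	mu   sync.Mutex
	seen map[string]time.Time
}

func newLastSeen() *lastSeen {
	return &lastSeen{seen: make(map[string]time.Time)}
}

// ServeHTTP accepts a heartbeat and stamps it with the server's clock, so a
// device with a bad clock cannot distort the record.
func (l *lastSeen) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	var hb heartbeat
	if err := json.NewDecoder(r.Body).Decode(&hb); err != nil || hb.SiteID == "" || hb.DeviceID == "" {
		http.Error(w, "bad heartbeat", http.StatusBadRequest)
		return
	}
	l.mu.Lock()
	l.seen[hb.SiteID+"/"+hb.DeviceID] = time.Now()
	l.mu.Unlock()
	w.WriteHeader(http.StatusNoContent)
}

// get answers "when did this device last call home?".
func (l *lastSeen) get(site, device string) (time.Time, bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	t, ok := l.seen[site+"/"+device]
	return t, ok
}

// post delivers one heartbeat body to the handler and returns the HTTP status.
func post(h http.Handler, body string) int {
	req := httptest.NewRequest(http.MethodPost, "/heartbeat", strings.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	ls := newLastSeen()
	post(ls, `{"site_id":"site-47","device_id":"dev-1"}`)
	if t, ok := ls.get("site-47", "dev-1"); ok {
		fmt.Println("site-47/dev-1 last called home at", t.Format(time.RFC3339))
	}
}
```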
7. Common Pitfalls
Pitfall 1, no disk buffer on the agent
Without file_storage/queue, a 1-hour WAN outage means 1 hour of lost telemetry. With it, you buffer to disk and ship later. Cost: maybe 100 MB of disk per device. Benefit: complete telemetry across outages. This is non-negotiable for any fleet outside a data center.
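The disk figure is simple arithmetic: telemetry rate multiplied by outage length. The sketch below works it through, assuming a post-sampling rate of 1200 bytes per second, which is an illustrative number and not a measurement; substitute your own.

```go
package main

import "fmt"

// bufferBytes estimates the disk needed to hold telemetry produced at
// bytesPerSec for outageSec seconds of WAN downtime.
func bufferBytes(bytesPerSec, outageSec int64) int64 {
	return bytesPerSec * outageSec
}

// maxOutageSec is the inverse: how long a given disk budget lasts.
func maxOutageSec(budgetBytes, bytesPerSec int64) int64 {
	if bytesPerSec <= 0 {
		return 0
	}
	return budgetBytes / bytesPerSec
}

func main() {
	const rate = 1200 // assumed post-sampling telemetry rate, bytes per second
	need := bufferBytes(rate, 24*60*60)
	fmt.Printf("24h outage at %d B/s needs %.1f MB\n", rate, float64(need)/1e6)
	fmt.Printf("a 100 MB budget lasts %d hours\n", maxOutageSec(100_000_000, rate)/3600)
}
```

At that assumed rate, roughly 100 MB lines up with the 24-hour max_elapsed_time in the agent config; a chattier device needs a proportionally larger budget or a shorter retry window.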
Pitfall 2, cardinality explosion from per-device labels
Adding device.id to every metric is fine. Adding request.id is not. High-cardinality labels (anything with thousands of unique values) blow up your Prometheus index. Use traces or logs for high-cardinality data, metrics for aggregates only.
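A quick worst-case estimate shows why. The sketch below multiplies devices by metric names by the number of distinct values of each extra label; the 50 metrics per device and the label counts are assumed figures for illustration, and the result is an upper bound because not every combination necessarily occurs.

```go
package main

import "fmt"

// seriesCount estimates active series: one per device per metric, multiplied
// by the number of distinct values of every additional label.
func seriesCount(devices, metrics int64, labelValues ...int64) int64 {
	n := devices * metrics
	for _, v := range labelValues {
		n *= v
	}
	return n
}

func main() {
	fmt.Println("device.id only:              ", seriesCount(2000, 50))
	fmt.Println("plus model_id (5 values):    ", seriesCount(2000, 50, 5))
	fmt.Println("plus request.id (10k values):", seriesCount(2000, 50, 10_000))
}
```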
Pitfall 3, alert fatigue from per-device alerts
“Device X is down” alerts are noise. Real outages take down many devices at once or none. Alert on patterns (fleet fraction, regional clusters, model versions) not individuals. Individual device problems can be a daily report, not a page.
Pitfall 4, no log retention budget
Loki happily ingests gigabytes of logs per day from a large fleet. Without a retention policy, you’ll fill a hundred TB and not realize it until the bill arrives. Set retention to 7-14 days for INFO, 90 days for WARN+, and archive only what’s needed for compliance.
8. Troubleshooting
Agent OOM-killed on the edge device
Almost always the export backlog growing without bound. The sending_queue.queue_size counts batches, not bytes, so with batches of up to 2048 items a misconfigured exporter can push the queue beyond your RAM if memory_limiter isn’t catching it. Set memory_limiter.limit_mib to a hard ceiling at 30-50% of device RAM.
Metrics arrive at the gateway but Prometheus shows nothing
Either the prometheusremotewrite exporter URL is wrong (check the gateway logs for HTTP errors), or Prometheus is rejecting samples due to out-of-order timestamps. Devices with bad clocks (no NTP) can produce timestamps from 1970 or the future. Run chrony or systemd-timesyncd on every device.
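A cheap guard is to sanity-check timestamps before they leave the device or as they reach the gateway. The sketch below is an assumption about how such a check might look, not a collector feature; the plausible function and its bounds are hypothetical. The 24-hour lower bound is deliberate, since the disk buffer can legitimately deliver samples that late.

```go
package main

import (
	"fmt"
	"time"
)

// plausible reports whether a sample timestamp (Unix seconds) lies within
// [now-maxPastSec, now+maxFutureSec].
func plausible(ts, now, maxPastSec, maxFutureSec int64) bool {
	return ts >= now-maxPastSec && ts <= now+maxFutureSec
}

func main() {
	now := time.Now().Unix()
	const day, fiveMin = 24 * 60 * 60, 5 * 60
	fmt.Println("recent sample:   ", plausible(now-60, now, day, fiveMin))
	fmt.Println("clock at 1970:   ", plausible(0, now, day, fiveMin))
	fmt.Println("clock 1h ahead:  ", plausible(now+3600, now, day, fiveMin))
	fmt.Println("buffered 12h ago:", plausible(now-12*3600, now, day, fiveMin))
}
```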
Traces are sampled away, can’t reproduce a problem
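Tail sampling decides once per trace, decision_wait (5s here) after the first span arrives. A short stall inside a trace that still finishes under the 500 ms threshold is classified as fast and dropped unless the 5% probabilistic policy happens to pick it. Lower the latency threshold, or set an explicit attribute on the spans you care about and add a policy that samples on that attribute.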
9. Cost control, before it surprises you
Edge observability costs grow with fleet size in a roughly linear-but-painful way. A few patterns that have kept the bill under control on fleets I’ve worked with.
Metrics, not events, for high-volume sensor data. If you’ve got a thousand temperature sensors reporting once per second, that’s not metrics; that’s a fire hose. Push the raw readings to your time-series storage (Influx, Timescale) but only emit OpenTelemetry metrics for aggregates (per-minute mean, p99, error rate). The dashboards stay snappy and Prometheus stays small.
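As a sketch of that local aggregation, the function below collapses one window of raw readings into a count, mean, and p99 using the nearest-rank method. The summary type and summarize function are hypothetical names; in practice the result would feed an OpenTelemetry gauge or histogram while the raw readings go to the time-series store.

```go
package main

import (
	"fmt"
	"sort"
)

// summary is what gets emitted as metrics instead of the raw readings.
type summary struct {
	Count     int
	Mean, P99 float64
}

// summarize reduces one window of raw readings to count, mean, and p99
// (nearest-rank method).
func summarize(readings []float64) summary {
	n := len(readings)
	if n == 0 {
		return summary{}
	}
	sorted := append([]float64(nil), readings...) // leave the caller's slice untouched
	sort.Float64s(sorted)
	sum := 0.0
	for _, r := range sorted {
		sum += r
	}
	rank := (99*n + 99) / 100 // ceil(0.99*n) in integer arithmetic
	return summary{Count: n, Mean: sum / float64(n), P99: sorted[rank-1]}
}

func main() {
	// 60 one-second temperature readings collapse into one summary per minute.
	window := make([]float64, 60)
	for i := range window {
		window[i] = 20 + float64(i%5)*0.1
	}
	fmt.Printf("%+v\n", summarize(window))
}
```

One sensor then costs three series per minute rather than sixty samples, and the same reduction applies across the thousand sensors.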
Logs are the most expensive signal per byte. Loki indexes by labels, not by content, but storage is still real. The filter/logs processor in the agent config drops everything below WARN at the source. If you need DEBUG for a specific investigation, sample it aggressively rather than turning it on fleet-wide. A 5% sample of DEBUG plus 100% of WARN+ usually keeps you under budget while still being useful for debugging.
Trace sampling needs review every few months. Defaults that worked at 100 devices break at 10,000. Tail sampling at 5% probabilistic plus all errors will produce 50x as many traces when your fleet grows 50x. Either ratchet down the sampling percentage or add a quota per site.
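A per-site quota can be as simple as a counter that resets every window. The sketch below is a hypothetical illustration of that idea, not a collector processor; siteQuota and its methods are made-up names, the type is not safe for concurrent use as written, and it is meant to cap only the probabilistic share so error and slow traces still get through.

```go
package main

import "fmt"

// siteQuota caps how many sampled traces each site may forward per window.
type siteQuota struct {
	limit int
	used  map[string]int
}

func newSiteQuota(limit int) *siteQuota {
	return &siteQuota{limit: limit, used: make(map[string]int)}
}

// allow consumes one unit of the site's budget and reports whether the trace fits.
func (q *siteQuota) allow(site string) bool {
	if q.used[site] >= q.limit {
		return false
	}
	q.used[site]++
	return true
}

// reset starts a new window; call it from a ticker (for example once a minute).
func (q *siteQuota) reset() {
	q.used = make(map[string]int)
}

func main() {
	q := newSiteQuota(2)
	for i := 0; i < 3; i++ {
		fmt.Println("site-47 trace", i, "allowed:", q.allow("site-47"))
	}
	fmt.Println("site-12 allowed:", q.allow("site-12"))
	q.reset()
	fmt.Println("site-47 after reset:", q.allow("site-47"))
}
```

Unlike a percentage, a quota keeps total trace volume proportional to the number of sites rather than to request volume, so one noisy site cannot crowd out the rest.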
10. Wrapping Up
Edge observability is about constraints, not collection. Run a real agent on every device, aggregate locally, push to a central funnel, alert on fleet patterns. The OpenTelemetry Collector is a good fit for the agent role; you don’t need exotic tools.
This post closes the April 2025 series. We started with hardware, moved through telemetry, brokering, inference runtimes, streaming pipelines, industrial bridges, and microcontrollers. Observability is the layer that makes all of it operable. Without it you’ve got a demo. With it, you’ve got a system.
For the canonical reference on OpenTelemetry, opentelemetry.io/docs/collector is the source of truth. The Prometheus 3.0 release notes at prometheus.io cover the breaking changes from 2.x worth knowing.