Observability for Edge Fleets at Scale, Patterns That Work
April 30, 2025 · 10 min read · by Muhammad Amal

TL;DR — Edge observability is not data-center observability with worse network. It’s a different problem. Run an OpenTelemetry Collector on every device, aggregate and sample locally, push (don’t pull) to a central collector, and design alerts around fleet-wide patterns not per-device. The patterns in this post survive WAN outages without flooding your pager.

You’ve spent the month building edge AI pipelines. The hardware is humming, the broker is clustered, the inference workers are batching, the bridge is talking to PLCs. Production. Two weeks in, the customer calls. “Site 47 hasn’t sent data since Tuesday.” You SSH in (assuming you can). Now imagine that conversation happening 50 times a year across 2000 sites. That’s the observability problem.

This post is the capstone of the April series. We’ve built the pieces; this is how you keep them running. OpenTelemetry Collector 0.115 (April 2025 release) is the workhorse. Prometheus 3.0 (November 2024) on the central side. The patterns here come from real fleets, not theoretical reasoning.

If you’ve followed the series, this post ties together the streaming inference pipeline and the industrial bridge into a single observable system.

1. The four constraints of edge observability

The textbook observability stack assumes a few things that don’t hold at the edge.

Data-center assumption              Edge reality
---------------------------------------------------------------
Network is reliable                 WAN drops constantly
Disk is unlimited                   8 GB eMMC, full of logs
CPU is plentiful                    Inference owns the cores
Operator can SSH in                 NAT, firewalls, no SSH

Every observability decision at the edge falls out of these four constraints. You can’t pull metrics if the device is behind NAT. You can’t ship every log line if the WAN is 4G with a data cap. You can’t run a heavy agent if the inference workload needs the CPU. And you can’t debug live because you literally can’t reach the device.

2. The architecture, end to end

+--------------------------------- One edge site ---------------------------------+
|                                                                                 |
|  +-------------+   +-------------+   +-------------+   +-------------+          |
|  | bridge      |   | inference   |   | mqtt broker |   | system      |          |
|  | (Go)        |   | worker (Go) |   | (mosquitto) |   | metrics     |          |
|  +------+------+   +------+------+   +------+------+   +------+------+          |
|         |                 |                 |                 |                 |
|         | OTLP (gRPC, loopback)                                                 |
|         v                 v                 v                 v                 |
|         +-----------------+-----------------+-----------------+                 |
|                                  |                                              |
|                          +-------+-------+                                      |
|                          | otelcol agent |                                      |
|                          | (per device)  |                                      |
|                          | + sampling    |                                      |
|                          | + aggregation |                                      |
|                          | + local disk  |                                      |
|                          |   buffer      |                                      |
|                          +-------+-------+                                      |
|                                  |                                              |
+----------------------------------|----------------------------------------------+
                                   |  OTLP/HTTP over TLS (push)
                                   v
                          +-----------------+
                          | otelcol gateway |
                          | (central)       |
                          +--------+--------+
                                   |
              +--------------------+--------------------+
              v                    v                    v
        +-----------+        +-----------+        +-----------+
        | Prometheus|        | Loki 3.3  |        | Tempo 2.6 |
        | 3.0       |        |           |        |           |
        +-----------+        +-----------+        +-----------+

Two collectors. The agent on every device aggregates and buffers. The central gateway is the funnel everything pushes to. Backends are whatever you like; the example uses Prometheus, Loki, and Tempo.

3. The edge agent config

This is the OpenTelemetry Collector configuration that runs on every device. About 100 lines and worth understanding line by line.

# /etc/otelcol/agent.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317
      http:
        endpoint: 127.0.0.1:4318
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      filesystem:
      network:
      load:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'local-services'
          scrape_interval: 30s
          static_configs:
            - targets: ['127.0.0.1:9100', '127.0.0.1:9101', '127.0.0.1:9102']

processors:
  resource:
    attributes:
      - key: site.id
        value: ${env:SITE_ID}
        action: insert
      - key: device.id
        value: ${env:DEVICE_ID}
        action: insert
      - key: deployment.environment
        value: ${env:ENVIRONMENT}
        action: insert
  batch:
    timeout: 10s
    send_batch_size: 1024
    send_batch_max_size: 2048
  memory_limiter:
    check_interval: 5s
    limit_mib: 128
    spike_limit_mib: 32
  filter/logs:
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_WARN'
  tail_sampling:
    decision_wait: 5s
    num_traces: 1000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 500}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  otlphttp:
    endpoint: https://otel-gateway.example.com:4318
    tls:
      insecure: false
    headers:
      authorization: Bearer ${env:OTEL_TOKEN}
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 10000
      storage: file_storage/queue
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 5m
      max_elapsed_time: 24h

extensions:
  file_storage/queue:
    directory: /var/lib/otelcol/queue
    timeout: 10s

service:
  extensions: [file_storage/queue]
  telemetry:
    logs:
      level: warn
  pipelines:
    metrics:
      receivers: [otlp, hostmetrics, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, tail_sampling, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, filter/logs, batch]
      exporters: [otlphttp]

The interesting bits:

  • file_storage/queue extension. This is what survives WAN outages. The collector buffers to disk and retries for up to 24 hours. After that, you’ve got bigger problems than missing telemetry.
  • memory_limiter is non-negotiable on edge. Without it, a backlog buildup will OOM the device.
  • tail_sampling keeps all error and slow traces, samples the rest at 5%. Smart sampling for low-bandwidth links.
  • filter/logs drops INFO and DEBUG logs at the source. They’re rarely useful at fleet scale and consume bandwidth.
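
To get a feel for what retry_on_failure does during an outage, here is a rough sketch that builds a backoff schedule from the settings above. The 1.5 multiplier and the lack of jitter are assumptions made for illustration; the collector also randomizes each interval, so treat the numbers as approximate.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns the waits between retries. Each wait is the previous
// one times multiplier, capped at maxInterval, and the schedule stops once the
// accumulated waiting time would exceed maxElapsed.
func backoffSchedule(initial, maxInterval, maxElapsed time.Duration, multiplier float64) []time.Duration {
	var waits []time.Duration
	var total time.Duration
	next := initial
	for total+next <= maxElapsed {
		waits = append(waits, next)
		total += next
		next = time.Duration(float64(next) * multiplier)
		if next > maxInterval {
			next = maxInterval
		}
	}
	return waits
}

func main() {
	// initial_interval: 5s, max_interval: 5m, max_elapsed_time: 24h
	waits := backoffSchedule(5*time.Second, 5*time.Minute, 24*time.Hour, 1.5)
	fmt.Println("retries within 24h:", len(waits))
	fmt.Println("first waits:", waits[:5])
}
```

The point of the exercise: after the first few minutes the agent settles into one attempt per max_interval, so a day-long outage costs a few hundred cheap retries rather than a tight loop.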

4. Resource attributes, the key to fleet queries

Every signal gets site.id, device.id, and deployment.environment attached at the resource level. This is what makes fleet-wide queries possible.

# Sites with high inference error rate
sum(rate(inference_errors_total[5m])) by (site_id)
  / sum(rate(inference_requests_total[5m])) by (site_id) > 0.01

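# Devices seen in the last 24h that haven't reported in 5 minutes
# (absent_over_time() returns at most one series, so it can't list devices;
#  5m is Prometheus's default lookback for the instant selector)
count_over_time(up{job="otelcol"}[24h]) unless up{job="otelcol"}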

# P99 inference latency per model, fleet-wide
histogram_quantile(0.99,
  sum(rate(inference_latency_bucket[5m])) by (le, model_id))
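
The queries above say site_id while the collector config sets site.id. Classic Prometheus label names can't contain dots, so the attribute key is rewritten on the way in. A minimal sketch of that translation, assuming the classic [a-zA-Z0-9_] charset; promLabelName is a hypothetical helper, and real exporters handle more edge cases (leading digits, reserved prefixes):

```go
package main

import (
	"fmt"
	"strings"
)

// promLabelName maps an OpenTelemetry attribute key to a classic Prometheus
// label name by replacing every character outside [a-zA-Z0-9_] with '_'.
func promLabelName(attr string) string {
	var b strings.Builder
	for _, r := range attr {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '_':
			b.WriteRune(r)
		default:
			b.WriteByte('_')
		}
	}
	return b.String()
}

func main() {
	for _, k := range []string{"site.id", "device.id", "deployment.environment"} {
		fmt.Printf("%s -> %s\n", k, promLabelName(k))
	}
}
```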

Make sure your application code adds the same attributes when it emits OTLP. The Go SDK makes this easy:

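import (
    "context"
    "log"
    "os"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/sdk/resource"
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

// Build the resource once at startup; ctx is the process's root context.
res, err := resource.New(ctx,
    resource.WithAttributes(
        semconv.ServiceName("inference-worker"),
        semconv.ServiceVersion("1.4.2"),
        attribute.String("site.id", os.Getenv("SITE_ID")),
        attribute.String("device.id", os.Getenv("DEVICE_ID")),
    ),
)
if err != nil {
    // resource.New can return a usable partial resource alongside an error.
    log.Printf("building OTel resource: %v", err)
}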

Resource attributes propagate to every metric, trace, and log this process emits. Set them once at startup.

5. Push, not pull

Prometheus’s pull model is a non-starter for edge devices behind NAT. You can’t dial into them. Instead, the agent pushes to a central gateway via OTLP/HTTP over TLS.

The central gateway is also an OpenTelemetry Collector, just configured differently.

# /etc/otelcol/gateway.yaml — simplified
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        auth:
          authenticator: bearertokenauth

extensions:
  bearertokenauth:
    scheme: "Bearer"
    tokens: ${file:/etc/otelcol/tokens.txt}

processors:
  batch:
    timeout: 5s

exporters:
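  prometheusremotewrite:
    endpoint: https://prometheus.internal/api/v1/write
    tls:
      ca_file: /etc/ssl/internal-ca.crt
    # Copy resource attributes (site.id, device.id, ...) onto every series as
    # labels; without this the site_id queries and alerts have nothing to group by.
    resource_to_telemetry_conversion:
      enabled: true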
  loki:
    endpoint: https://loki.internal/loki/api/v1/push
  otlphttp/tempo:
    endpoint: https://tempo.internal:4318

service:
  extensions: [bearertokenauth]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/tempo]

Each edge device gets a unique bearer token. Compromised devices can be revoked without redeploying the whole fleet.
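
Per-device revocation is easier to reason about with a toy model. The sketch below is not how the collector's bearertokenauth extension works internally; TokenRegistry and its methods are hypothetical names for whatever issues and checks tokens in your setup. It stores only token hashes and compares them in constant time.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"sync"
)

// TokenRegistry maps device IDs to the SHA-256 hash of their bearer token.
type TokenRegistry struct {
	mu     sync.RWMutex
	hashes map[string][32]byte
}

func NewTokenRegistry() *TokenRegistry {
	return &TokenRegistry{hashes: make(map[string][32]byte)}
}

// Issue registers (or rotates) the token for one device.
func (r *TokenRegistry) Issue(deviceID, token string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.hashes[deviceID] = sha256.Sum256([]byte(token))
}

// Revoke removes a single device without touching any other.
func (r *TokenRegistry) Revoke(deviceID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.hashes, deviceID)
}

// Valid reports whether token is the current token for deviceID,
// comparing hashes in constant time.
func (r *TokenRegistry) Valid(deviceID, token string) bool {
	r.mu.RLock()
	want, ok := r.hashes[deviceID]
	r.mu.RUnlock()
	if !ok {
		return false
	}
	got := sha256.Sum256([]byte(token))
	return subtle.ConstantTimeCompare(got[:], want[:]) == 1
}

func main() {
	reg := NewTokenRegistry()
	reg.Issue("dev-001", "token-a")
	reg.Issue("dev-002", "token-b")
	reg.Revoke("dev-001") // compromised device
	fmt.Println(reg.Valid("dev-001", "token-a"), reg.Valid("dev-002", "token-b")) // false true
}
```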

6. Alerting that survives WAN outages

The naive alert is “device X is down.” Run that across 2000 devices when the AT&T fiber backbone has a bad day and you get 2000 simultaneous pages.

Better alerts work on aggregates and ratios:

# alerts.yaml (Prometheus)
groups:
  - name: edge-fleet
    rules:
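      - alert: HighFractionOfFleetUnreachable
        # Devices seen in the last 24h but silent now, as a share of all devices
        # seen in 24h. absent_over_time() yields at most one series, so it
        # cannot be used to count individual devices.
        expr: |
          count(count_over_time(up{job="otelcol"}[24h]) unless up{job="otelcol"})
            / count(count_over_time(up{job="otelcol"}[24h])) > 0.1
        for: 10m
        annotations:
          summary: "More than 10% of edge fleet is unreachable"
          description: "Likely WAN issue, not individual devices."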
        labels:
          severity: critical

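      - alert: SiteUnreachable
        # With push, a silent site has no samples at all (its sum is never 0),
        # so compare against the sites seen in the last 24h instead.
        expr: |
          count by (site_id) (count_over_time(up{job="otelcol"}[24h]))
            unless count by (site_id) (up{job="otelcol"})
        for: 30m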
        annotations:
          summary: "Site {{ $labels.site_id }} fully unreachable for 30m"
        labels:
          severity: warning

      - alert: InferenceErrorRateHigh
        expr: |
          sum by (site_id, model_id) (rate(inference_errors_total[5m])) /
          sum by (site_id, model_id) (rate(inference_requests_total[5m])) > 0.05
        for: 15m
        annotations:
          summary: "Inference error rate > 5% on {{ $labels.site_id }}/{{ $labels.model_id }}"
        labels:
          severity: warning

      - alert: DLQFlooding
        expr: |
          sum by (site_id) (rate(kafka_messages_in_topic{topic="inference.dlq"}[10m])) > 100
        for: 15m
        annotations:
          summary: "Site {{ $labels.site_id }} dead-lettering >100 msg/s"
        labels:
          severity: critical

The “HighFractionOfFleetUnreachable” alert is the one that catches WAN outages. When 10% of the fleet disappears simultaneously, it’s almost never 10% of the devices failing.
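
The reasoning behind that alert fits in a few lines. This is only an illustration of the decision, not something you'd deploy; classifyOutage and its three labels are hypothetical.

```go
package main

import "fmt"

// classifyOutage labels a set of silent devices. Above fleetThreshold the
// likeliest cause is shared (WAN, DNS, gateway); below it, look at the devices.
func classifyOutage(unreachable, total int, fleetThreshold float64) string {
	if total == 0 || unreachable == 0 {
		return "none"
	}
	if float64(unreachable)/float64(total) > fleetThreshold {
		return "fleet-wide"
	}
	return "isolated"
}

func main() {
	fmt.Println(classifyOutage(3, 2000, 0.1))   // isolated
	fmt.Println(classifyOutage(450, 2000, 0.1)) // fleet-wide
}
```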

6.1 Synthetic checks from inside the device

The device can self-report health to the central API independent of the metrics pipeline. A 30-second heartbeat with structured status:

// Heartbeat goroutine
import (
    "bytes"
    "context"
    "encoding/json"
    "net/http"
    "os"
    "time"
)

// Heartbeat is the status document each device posts every 30 seconds.
type Heartbeat struct {
    SiteID        string `json:"site_id"`
    DeviceID      string `json:"device_id"`
    Version       string `json:"version"`
    Uptime        int64  `json:"uptime_seconds"`
    InferenceOK   bool   `json:"inference_ok"`
    BrokerOK      bool   `json:"broker_ok"`
    DiskFreeBytes uint64 `json:"disk_free_bytes"`
}

// sendHeartbeats posts a heartbeat every 30 seconds until ctx is cancelled.
// buildHeartbeat (defined elsewhere) collects the current device status.
func sendHeartbeats(ctx context.Context) {
    t := time.NewTicker(30 * time.Second)
    defer t.Stop()
    client := &http.Client{Timeout: 10 * time.Second}
    for {
        select {
        case <-ctx.Done():
            return
        case <-t.C:
            body, err := json.Marshal(buildHeartbeat())
            if err != nil {
                continue
            }
            req, err := http.NewRequestWithContext(ctx, http.MethodPost,
                "https://api.internal/heartbeat", bytes.NewReader(body))
            if err != nil {
                continue
            }
            req.Header.Set("Authorization", "Bearer "+os.Getenv("DEVICE_TOKEN"))
            req.Header.Set("Content-Type", "application/json")
            resp, err := client.Do(req)
            if err != nil {
                continue // WAN is down; try again on the next tick
            }
            resp.Body.Close()
        }
    }
}

The heartbeat is the single endpoint you can hit when troubleshooting. “When did this device last call home?” gets you the timestamp instantly, no Prometheus query needed.
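
For completeness, here is roughly what the receiving side could look like. This is a sketch under assumptions: lastSeenStore is a hypothetical in-memory stand-in for whatever your central API uses, and authentication is left out to keep it short.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
	"time"
)

// lastSeenStore remembers when each device last called home (Unix seconds).
type lastSeenStore struct {
	mu   sync.Mutex
	seen map[string]int64
}

func newLastSeenStore() *lastSeenStore {
	return &lastSeenStore{seen: make(map[string]int64)}
}

func (s *lastSeenStore) Record(deviceID string, unixSec int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.seen[deviceID] = unixSec
}

func (s *lastSeenStore) LastSeen(deviceID string) (int64, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	t, ok := s.seen[deviceID]
	return t, ok
}

// ServeHTTP accepts the heartbeat JSON and stamps it with the server's clock,
// so a device with a bad clock cannot skew its own last-seen time.
func (s *lastSeenStore) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	var hb struct {
		DeviceID string `json:"device_id"`
	}
	if err := json.NewDecoder(r.Body).Decode(&hb); err != nil || hb.DeviceID == "" {
		http.Error(w, "bad heartbeat", http.StatusBadRequest)
		return
	}
	s.Record(hb.DeviceID, time.Now().Unix())
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	store := newLastSeenStore()
	// Exercise the handler in-process instead of opening a port.
	req := httptest.NewRequest(http.MethodPost, "/heartbeat",
		strings.NewReader(`{"device_id":"dev-001"}`))
	rec := httptest.NewRecorder()
	store.ServeHTTP(rec, req)
	ts, ok := store.LastSeen("dev-001")
	fmt.Println(rec.Code, ok, ts > 0)
}
```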

7. Common Pitfalls

Pitfall 1, no disk buffer on the agent

Without file_storage/queue, a 1-hour WAN outage means 1 hour of lost telemetry. With it, you buffer to disk and ship later. Cost: maybe 100 MB of disk per device. Benefit: complete telemetry across outages. This is non-negotiable for any fleet outside a data center.
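
Sizing the buffer is simple arithmetic: post-sampling telemetry rate times the longest outage you want to ride out. A small sketch; the 28 KB/s figure is an assumed rate picked for illustration, so measure your own.

```go
package main

import (
	"fmt"
	"time"
)

// bufferBytes estimates the disk needed to hold telemetry through an outage.
func bufferBytes(bytesPerSecond float64, outage time.Duration) float64 {
	return bytesPerSecond * outage.Seconds()
}

func main() {
	// Assumed rate: ~28 KB/s of telemetry after sampling and filtering.
	for _, d := range []time.Duration{time.Hour, 6 * time.Hour, 24 * time.Hour} {
		fmt.Printf("%v outage: %.0f MB\n", d, bufferBytes(28_000, d)/1e6)
	}
}
```

At that rate a one-hour outage lands near 100 MB, while a full day needs a couple of GB, which matters on an 8 GB eMMC.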

Pitfall 2, cardinality explosion from per-device labels

Adding device.id to every metric is fine. Adding request.id is not. High-cardinality labels (anything with thousands of unique values) blow up your Prometheus index. Use traces or logs for high-cardinality data, metrics for aggregates only.
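
A quick way to sanity-check a label before adding it is to multiply out the worst case. A minimal sketch; the label names and counts are made-up examples.

```go
package main

import "fmt"

// seriesUpperBound multiplies the number of distinct values per label to get
// the worst-case number of time series for a single metric.
func seriesUpperBound(distinctValues map[string]int) int64 {
	n := int64(1)
	for _, v := range distinctValues {
		n *= int64(v)
	}
	return n
}

func main() {
	labels := map[string]int{"device_id": 2000, "model_id": 5}
	fmt.Println(seriesUpperBound(labels)) // 10000

	labels["request_id"] = 1_000_000
	fmt.Println(seriesUpperBound(labels)) // 10000000000
}
```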

Pitfall 3, alert fatigue from per-device alerts

“Device X is down” alerts are noise. Real outages take down many devices at once or none. Alert on patterns (fleet fraction, regional clusters, model versions) not individuals. Individual device problems can be a daily report, not a page.

Pitfall 4, no log retention budget

Loki happily ingests gigabytes of logs per day from a large fleet. Without a retention policy, you’ll fill a hundred TB and not realize it until the bill arrives. Set retention to 7-14 days for INFO, 90 days for WARN+, and archive only what’s needed for compliance.

8. Troubleshooting

Agent OOM-killed on the edge device

Almost always a backlog growing without bound. The sending_queue.queue_size setting counts queued batches, not bytes, so a stalled or misconfigured exporter can hold far more data than the number suggests and push the collector beyond your RAM if memory_limiter isn’t catching it. Set memory_limiter.limit_mib to a hard ceiling at 30-50% of device RAM.

Metrics arrive at the gateway but Prometheus shows nothing

Either the prometheusremotewrite exporter URL is wrong (check the gateway logs for HTTP errors), or Prometheus is rejecting samples due to out-of-order timestamps. Devices with bad clocks (no NTP) can produce timestamps from 1970 or the future. Run chrony or systemd-timesyncd on every device.
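
If you want the device itself to notice a bad clock (for example, as a field in the heartbeat), a check along these lines is enough. The floor date and the function name are assumptions for illustration; any date before the fleet was deployed works as the floor.

```go
package main

import (
	"fmt"
	"time"
)

// earliestPlausible is a floor no real sample can precede (assumed value).
var earliestPlausible = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// timestampOK reports whether ts is after the floor and no more than
// maxAhead in the future relative to now.
func timestampOK(ts, now time.Time, maxAhead time.Duration) bool {
	return ts.After(earliestPlausible) && !ts.After(now.Add(maxAhead))
}

func main() {
	now := time.Now()
	fmt.Println(timestampOK(now, now, time.Minute))                // a sane clock
	fmt.Println(timestampOK(time.Unix(0, 0), now, time.Minute))    // clock reset to 1970
	fmt.Println(timestampOK(now.Add(48*time.Hour), now, time.Minute)) // clock in the future
}
```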

Traces are sampled away, can’t reproduce a problem

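Tail sampling decides after the trace finishes (with the config above, 5 seconds after its first span), which means a 1-second incident whose individual traces each stay under the 500 ms threshold gets classified as fast and mostly dropped. Lower the latency threshold, or add an explicit attribute on errors and sample on that attribute.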
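
To make the decision order concrete, here is a simplified model of the three policies from the agent config plus an attribute-based one. It is a sketch, not the processor's implementation; traceSummary and the debug.keep attribute are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// traceSummary is what a tail sampler knows once a trace has finished.
type traceSummary struct {
	HasError bool
	Duration time.Duration
	Attrs    map[string]string
}

// keep returns true if any policy matches: error status, latency over the
// threshold, an explicit keep attribute, or the probabilistic fallback.
// rnd is a uniform random number in [0, 1).
func keep(t traceSummary, latencyThreshold time.Duration, samplePct, rnd float64) bool {
	switch {
	case t.HasError:
		return true
	case t.Duration > latencyThreshold:
		return true
	case t.Attrs["debug.keep"] == "true": // hypothetical attribute set by the app
		return true
	default:
		return rnd*100 < samplePct
	}
}

func main() {
	slow := traceSummary{Duration: 800 * time.Millisecond}
	tagged := traceSummary{Duration: 40 * time.Millisecond,
		Attrs: map[string]string{"debug.keep": "true"}}
	fast := traceSummary{Duration: 40 * time.Millisecond}
	for _, t := range []traceSummary{slow, tagged, fast} {
		fmt.Println(keep(t, 500*time.Millisecond, 5, rand.Float64()))
	}
}
```

The first two always print true; the fast, untagged trace survives only about 5% of the time, which is exactly the case that bites when you try to reproduce a problem.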

9. Cost control, before it surprises you

Edge observability costs grow with fleet size in a roughly linear-but-painful way. A few patterns have kept the bill under control on fleets I’ve worked with.

Aggregates, not raw events, for high-volume sensor data. If you’ve got a thousand temperature sensors reporting once per second, that’s not a metrics workload; that’s a fire hose. Push the raw readings to your time-series storage (Influx, Timescale) and only emit OpenTelemetry metrics for aggregates (per-minute mean, p99, error rate). The dashboards stay snappy and Prometheus stays small.
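
A per-minute rollup can be as small as this. It is a sketch using the nearest-rank definition of p99; summarize is a hypothetical helper, and in production you would more likely record into an OpenTelemetry histogram.

```go
package main

import (
	"fmt"
	"sort"
)

// summarize reduces one window of raw readings to its mean and p99.
// p99 uses the nearest-rank method on a sorted copy of the input.
func summarize(readings []float64) (mean, p99 float64) {
	n := len(readings)
	if n == 0 {
		return 0, 0
	}
	sorted := append([]float64(nil), readings...)
	sort.Float64s(sorted)
	sum := 0.0
	for _, v := range sorted {
		sum += v
	}
	rank := (99*n + 99) / 100 // ceil(0.99 * n), 1-based
	return sum / float64(n), sorted[rank-1]
}

func main() {
	// 60 one-second readings from a single sensor, with one spike.
	window := make([]float64, 60)
	for i := range window {
		window[i] = 21.5
	}
	window[17] = 48.0
	mean, p99 := summarize(window)
	fmt.Printf("mean=%.2f p99=%.1f\n", mean, p99)
}
```

Sixty raw points become two numbers per sensor per minute, and the spike still shows up in the p99.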

Logs are the most expensive signal per byte. Loki indexes by labels, not by content, but storage is still real. Drop INFO-level logs at the source via the filter/logs processor. Sample DEBUG logs aggressively if you keep them at all. A 5% sample of DEBUG plus 100% of WARN+ usually keeps you under budget while still being useful for debugging.

Trace sampling needs review every few months. Defaults that worked at 100 devices break at 10,000. Tail sampling at 5% probabilistic plus all errors will produce 50x as many traces when your fleet grows 50x. Either ratchet down the sampling percentage or add a quota per site.
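
One way to ratchet the percentage down is to derive it from a fixed ingest budget instead of hard-coding it. A rough sketch with made-up numbers; it covers only the probabilistic policy, since error and slow traces are kept regardless and need their own budget.

```go
package main

import "fmt"

// samplingPercent returns the probabilistic sampling percentage that keeps
// the whole fleet within budgetPerSec sampled traces per second, capped at 100.
func samplingPercent(devices int, tracesPerDevicePerSec, budgetPerSec float64) float64 {
	total := float64(devices) * tracesPerDevicePerSec
	if total <= 0 {
		return 100
	}
	pct := budgetPerSec * 100 / total
	if pct > 100 {
		return 100
	}
	return pct
}

func main() {
	// Assumed: 10 traces/s per device and a central budget of 50 traces/s.
	for _, n := range []int{100, 1000, 5000} {
		fmt.Printf("%5d devices -> %.2f%%\n", n, samplingPercent(n, 10, 50))
	}
}
```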

10. Wrapping Up

Edge observability is about constraints, not collection. Run a real agent on every device, aggregate locally, push to a central funnel, alert on fleet patterns. The OpenTelemetry Collector is a good fit for the agent role; you don’t need exotic tools.

This post closes the April 2025 series. We started with hardware, moved through telemetry, brokering, inference runtimes, streaming pipelines, industrial bridges, and microcontrollers. Observability is the layer that makes all of it operable. Without it you’ve got a demo. With it, you’ve got a system.

For the canonical reference on OpenTelemetry, opentelemetry.io/docs/collector is the source of truth. The Prometheus 3.0 release notes at prometheus.io cover the breaking changes from 2.x worth knowing.