September Retro, One Stack to Watch Them All
TL;DR — Full Grafana stack now running for the factory project: Prometheus, Loki, Tempo, Alertmanager. ~50 metrics, ~5 GB logs/day, ~1% trace sampling. Costs $40/month. Three SLOs defined; first burn-rate alert fired once (was a real incident). Worth it.
End of September. The factory observability stack from August has matured into a full three-pillar observability platform. This retro covers what shipped and what I’d tell future-me.
What’s running
Compose stack on a single $40/month VM:
- Prometheus 2.37 — ~50 metrics, ~3K active series, 30-day retention
- Grafana 9 — 4 dashboards (fleet, line, machine, SLO)
- Loki 2.6 — ~5 GB/day logs, 14-day retention
- Tempo 1.5 — 1% sampling, ~50K traces/day, 7-day retention
- Alertmanager 0.24 — 8 alert rules, 3 receivers (Slack, PagerDuty, email)
- Promtail 2.6 — ships container logs
Total RAM: ~3 GB. Storage: ~25 GB used (compressed). CPU: <5% steady.
Cost comparison: the equivalent on Datadog would be roughly $800/month at this volume. We’re at ~$40 plus my time.
What worked
RED method for the API service. Three panels covered “is it healthy” cleanly. Operations team picked it up in minutes.
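As a rough illustration of what those three panels compute, here is a minimal sketch of RED (rate, errors, duration) over one window of requests. The `Request` record and `red_summary` helper are hypothetical names for this example; the real panels are Grafana queries against Prometheus, not Python. Counting only 5xx as errors and using a nearest-rank p95 are assumptions too.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the caller
    duration_s: float  # time taken to serve the request, in seconds


def red_summary(requests, window_s):
    """Return rate (req/s), error ratio, and p95 duration for one window."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "errors": 0.0, "p95_s": 0.0}
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank p95, done in integer arithmetic to avoid float rounding.
    idx = (95 * total + 99) // 100 - 1
    return {
        "rate": total / window_s,
        "errors": errors / total,
        "p95_s": durations[idx],
    }
```

One number per letter is the point: a stat panel per value is enough to answer “is it healthy” at a glance.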
Loki labels staying disciplined. ~10 labels total (service, level, env, etc.). No cardinality issues; queries fast.
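One way to keep labels disciplined is an allowlist applied before anything is shipped: only a few low-cardinality fields become stream labels, and everything else stays in the log line body. The sketch below is an assumption about how that could look, not the project’s actual config; `ALLOWED_LABELS` and `split_labels` are hypothetical names.

```python
# Hypothetical allowlist; the real stack has ~10 labels.
ALLOWED_LABELS = {"service", "level", "env"}


def split_labels(fields):
    """Split event fields into (stream labels, line payload).

    Allowlisted fields become labels. Everything else (request IDs,
    machine IDs, trace IDs) stays in the payload, where it can still
    be filtered at query time without creating new streams.
    """
    labels = {k: str(v) for k, v in fields.items() if k in ALLOWED_LABELS}
    payload = {k: v for k, v in fields.items() if k not in ALLOWED_LABELS}
    return labels, payload
```

For example, `split_labels({"service": "api", "level": "error", "trace_id": "abc123"})` keeps `service` and `level` as labels and leaves `trace_id` in the payload.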
Multi-window burn rate alert. First fired during a real incident (a memory leak in the data ingester that caused intermittent 503s). Caught it ~3 hours into the burn, before any customer complained. Pager goal achieved.
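For reference, the logic behind a multi-window burn-rate check can be sketched in a few lines. Burn rate is the observed error ratio divided by the error ratio the SLO allows; the alert fires only when both a long and a short window exceed the threshold, so it pages on sustained burns and clears quickly once the burn stops. The 99.9% target and the 14.4 threshold in the usage note are assumptions taken from the commonly cited Google SRE Workbook examples, not the values from this project.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold):
    """Fire only if both windows are burning at or above the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening right now.
    """
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )
```

With a 99.9% target, a 2% error ratio over the long window and 3% over the short window gives burn rates of about 20 and 30, so `should_page(0.02, 0.03, 0.999, 14.4)` is true. If the short window has already recovered to 0.1%, the same call returns false.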
Tempo trace lookup from log lines. Logs include trace_id. One click in Grafana goes from “this error log line” to “the full request trace.” Diagnosed a downstream-API issue in 5 minutes.
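The log side of that correlation is small. Below is a minimal sketch, using only the Python standard library, of a JSON formatter that stamps every line with the current trace ID. In a real service the ID would come from the active OpenTelemetry span; here a `ContextVar` stands in for it, and the logger name `factory-api` is hypothetical. The Grafana side (a derived field on the Loki data source that links the ID to Tempo) is configuration and is not shown.

```python
import contextvars
import json
import logging

# Stand-in for the active span's trace ID (assumption for this sketch).
trace_id_var = contextvars.ContextVar("trace_id", default="")


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object that includes trace_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": trace_id_var.get(),
        })


def build_logger(stream):
    """Return a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger("factory-api")  # hypothetical service name
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Request middleware would call `trace_id_var.set(...)` once per request, after which every log call in that context carries the ID with no changes at the call sites.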
SLO-driven prioritization. When budget got tight in week 2 (due to the memory leak above), we had a documented reason to delay a feature. Less arguing, more following the policy.
What didn’t work
Tempo without trace search. Tempo 1.5’s lookup-by-trace-id-only model meant “find slow traces” required going through metrics first. The upcoming Tempo 2.0 is slated to add TraceQL; will re-evaluate then.
OpenTelemetry SDK setup complexity. Auto-instrumentation worked for the Go HTTP server. Custom instrumentation for the OPC UA bridge took a day to figure out, mostly because of documentation gaps.
Promtail JSON parsing on multi-line logs. Java stack traces. Configured multiline regex; sometimes worked, sometimes didn’t. Spent a half day on it; reverted to “one log line per event” by tweaking the application’s logging config. Easier than wrestling Promtail.
Initial dashboard count. Created ~12 dashboards in week one. Half went unused. Consolidated to 4 in week three. Less is more.
What I’d cut
Tempo for this project. Honestly, at our scale, the value-to-complexity ratio of distributed tracing was lower than expected. Most production issues were single-service. The OPC UA bridge had latency issues that a trace would have helped diagnose; the other 90% of issues didn’t need one.
For a future project with more inter-service complexity, I’d add Tempo back. For this one, I might have skipped it.
Most of the default Go runtime metrics. go_gc_duration_seconds, go_memstats_*, etc. come for free with the client library, but I rarely query them. Could drop them to save a sliver of storage, but it isn’t worth the effort.
Multi-window burn rate alert for the lower-tier service. The OPC UA bridge isn’t user-facing in the same way. SLO discipline there is overkill. Simple uptime alert is enough.
What’s load-bearing now
Three things I’d defend in any code review:
- Per-service RED dashboards with threshold-color stat panels. Operations relies on them.
- The two SLO burn-rate alerts (fast + slow) for the API service. Most actionable alerts we have.
- Loki-side trace_id in every log line. The cross-pillar correlation is genuinely useful.
Two things I’d defend less strongly:
- Tempo. Useful but not load-bearing for our use cases yet.
- Custom per-machine dashboards. Operations doesn’t dig into them often.
Lessons for next time
Start with metrics + alerts. Add logs in week 2. Add tracing only if you actually need it. The three-pillar approach is best built incrementally.
Cardinality budget upfront. “Plan for 5000 series per service” forces discipline at instrumentation time, not after Prometheus OOMs.
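The arithmetic behind a cardinality budget is simple: the worst-case series count for a metric is the product of its label cardinalities, times the number of series each label combination emits (more than one for histograms). A minimal sketch, with hypothetical metric names and label counts:

```python
from math import prod

SERIES_BUDGET = 5000  # per-service budget from the text


def estimated_series(label_cardinalities, series_per_combo=1):
    """Worst-case series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values()) * series_per_combo


def check_budget(metrics):
    """metrics maps name -> (label cardinalities, series per combination).

    Returns (total estimated series, whether it fits the budget).
    """
    total = sum(estimated_series(labels, n) for labels, n in metrics.values())
    return total, total <= SERIES_BUDGET


# Hypothetical instrumentation plan for one service.
plan = {
    "http_requests_total": ({"method": 4, "status": 6, "route": 20}, 1),
    # Assumption: 12 series per combination (buckets plus _sum and _count).
    "http_request_duration_seconds": ({"route": 20}, 12),
}
```

Here `check_budget(plan)` estimates 720 series, well inside the budget. Adding a 50-value `machine_id` label to the counter alone would push that metric to 24,000 series, which is the kind of mistake this check catches at instrumentation time.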
SLOs require commitment, not just dashboards. Setting targets nobody acts on is decoration. The action policy (“delay feature work when budget < 25%”) makes the difference.
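To make the action policy concrete, here is a minimal sketch of the budget check it implies. The 25% floor comes from the policy quoted above; the 99.9% target and the request counts in the usage note are assumptions for illustration, and both function names are hypothetical.

```python
def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent in the SLO window."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


def feature_work_allowed(remaining, floor=0.25):
    """Policy: delay feature work when remaining budget drops below the floor."""
    return remaining >= floor
```

With a 99.9% target and 1,000,000 requests in the window, 1,000 failures are allowed. At 800 failures about 20% of the budget remains, so `feature_work_allowed` returns false and reliability work takes priority; at 500 failures half the budget remains and feature work continues.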
Self-hosting cost is real but small at small scale. $40/month for full observability is great. The real cost is the engineering time to set it up: ~20 hours across September.
What October looks like
October theme: Data Engineering Lite — Data synchronization strategies and ETL pipelines using Python and Go. Shift away from observability into data movement. Same shape: 13 articles, M/W/F.
Why this theme: many “is my data up to date in BigQuery?” questions surface as I move between projects. Want to write down the patterns I’ve used. Mostly Python (pandas, SQLAlchemy) with some Go where perf matters.
See you Monday.