Synthetic Monitoring and Canary Deploys, A Practical Pairing


June 26, 2024 · 7 min read · by Muhammad Amal programming

TL;DR: Canary deploys catch regressions caused by your change. Synthetic monitoring catches regressions caused by anything else. Wire synthetics in as a canary analysis step and your rollouts will fail closed when the surface symptoms appear, not after the SLO has already been breached.

Two practices that live in different chapters of the SRE handbook are actually the same machine viewed from different angles. Canary deploys gate a release on metrics; synthetic monitoring gates a service’s health on probe results. If you wire the second into the first, your release pipeline can roll back on a user-shaped failure before any real user sees it.

This is one of the cheapest wins available in 2024. Argo Rollouts (a separate project from Argo CD, installed as its own controller) treats analysis templates as first-class. Synthetic probes are a couple of containers and a cron schedule. The integration is mostly YAML.

If your team has already adopted SLOs and decent observability, this is the next thing to ship. It closes the last gap in the progressive delivery story.

The two failure modes

Canary deploys catch failure modes that show up in your service’s metrics during the canary window. Higher error rate, higher latency, lower throughput on the canary pods compared to the stable ones. That’s most regressions but not all.

What canary deploys miss:

A response that returns 200 but with empty content (the metric says everything’s fine).
A change in business logic that returns the wrong answer (correctness, not availability).
A dependency that broke unrelated to the deploy but cascades into the canary window.
An auth issue that locks out specific users (you don’t have those users’ traffic in the canary).

Synthetic monitoring catches these. It’s a known-input, known-expected-output probe that runs continuously. If a synthetic fails, you have a reproducible, characterizable defect. If a synthetic fails during a canary, your release should stop.
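To make the "200 but empty content" case concrete, here is a minimal sketch of a content-aware check. The function name checkConfirmation and the "Order confirmed" marker are assumptions for illustration, not part of any real checkout service.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// checkConfirmation is a hypothetical content check. It treats a response as
// healthy only if the status is 200 AND the body contains the expected marker,
// so a "200 with empty content" response counts as a failure.
func checkConfirmation(status int, body, wantMarker string) error {
	if status != http.StatusOK {
		return fmt.Errorf("unexpected status %d", status)
	}
	if strings.TrimSpace(body) == "" {
		return errors.New("200 with empty body")
	}
	if !strings.Contains(body, wantMarker) {
		return fmt.Errorf("body missing expected marker %q", wantMarker)
	}
	return nil
}

func main() {
	// Known input, known expected output. The marker string is an assumption.
	fmt.Println(checkConfirmation(200, "", "Order confirmed"))
	fmt.Println(checkConfirmation(200, "<h1>Order confirmed</h1>", "Order confirmed"))
}
```

An error-rate metric would count both responses as successes. Only the check that inspects the body tells them apart.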

The mechanism

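Argo Rollouts supports AnalysisTemplate resources that gate the rollout’s progression. A template can query Prometheus, or it can call an HTTP endpoint through the web provider, such as a synthetic probe runner that returns pass/fail. The examples here take the Prometheus route: the probe runner exports counters and the template queries their ratio.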

Here’s the shape:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
  namespace: checkout
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        istio:
          virtualService:
            name: checkout
      steps:
        - setWeight: 10
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: error-rate
              - templateName: latency-p95
              - templateName: synthetic-checkout-flow
            args:
              - name: service-name
                value: checkout
        - setWeight: 30
        - pause: { duration: 5m }
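        - analysis:
            templates:
              - templateName: error-rate
              - templateName: latency-p95
              - templateName: synthetic-checkout-flow
            # The templates declare service-name without a default,
            # so every analysis step has to pass it.
            args:
              - name: service-name
                value: checkout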
        - setWeight: 60
        - pause: { duration: 5m }
        - setWeight: 100
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: ghcr.io/internal/checkout:{{ .Values.image.tag }}
          ports:
            - containerPort: 8080

Three analysis templates. The first two are standard (error rate and latency from Prometheus). The third runs the synthetic. If any fails, the rollout aborts and reverts traffic to stable.

The synthetic template

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: synthetic-checkout-flow
  namespace: checkout
spec:
  args:
    - name: service-name
  metrics:
    - name: synthetic-pass-rate
      interval: 30s
      count: 4
      successCondition: result[0] >= 0.95
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.observability:9090
          query: |
            sum(rate(synthetic_check_success_total{flow="checkout-purchase",target="{{args.service-name}}-canary"}[2m]))
            /
            sum(rate(synthetic_check_total{flow="checkout-purchase",target="{{args.service-name}}-canary"}[2m]))

The synthetic itself is a CronJob or a long-running process that hits the canary endpoint, runs a multi-step flow (add to cart, checkout, verify confirmation page), and emits two Prometheus counters: synthetic_check_total and synthetic_check_success_total, labeled by flow and target (canary vs stable).

A minimal synthetic runner in Go:

package main

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	checks = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "synthetic_check_total",
		Help: "Total synthetic checks executed.",
	}, []string{"flow", "target"})
	passes = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "synthetic_check_success_total",
		Help: "Successful synthetic checks.",
	}, []string{"flow", "target"})
)

func init() {
	prometheus.MustRegister(checks, passes)
}

func runCheckoutFlow(ctx context.Context, target string) error {
	client := &http.Client{Timeout: 5 * time.Second}

	cartURL := "http://" + target + "/api/cart"
	if err := postJSON(ctx, client, cartURL, `{"sku":"TEST-001"}`); err != nil {
		return err
	}

	checkoutURL := "http://" + target + "/api/checkout"
	if err := postJSON(ctx, client, checkoutURL, `{"payment_method":"test"}`); err != nil {
		return err
	}

	return nil
}

func main() {
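	// Expose the counters for Prometheus to scrape. Crash loudly if the
	// listener cannot start, otherwise the analysis would see no data.
	go func() {
		if err := http.ListenAndServe(":9100", promhttp.Handler()); err != nil {
			panic(err)
		}
	}()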

	for {
		for _, target := range []string{"checkout-canary", "checkout-stable"} {
			ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
			err := runCheckoutFlow(ctx, target)
			checks.WithLabelValues("checkout-purchase", target).Inc()
			if err == nil {
				passes.WithLabelValues("checkout-purchase", target).Inc()
			}
			cancel()
		}
		time.Sleep(15 * time.Second)
	}
}

func postJSON(ctx context.Context, c *http.Client, url, body string) error {
	// elided for brevity
	return nil
}
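The postJSON helper above is elided. One possible shape is sketched below, assuming that any transport error or non-2xx status should count as a failed step. The probeFake function and its /fail path are hypothetical and exist only so the sketch runs without a real checkout service.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// postJSON sends body as JSON and returns an error for transport failures or
// any non-2xx status. Treating every non-2xx as a failure is an assumption.
func postJSON(ctx context.Context, c *http.Client, url, body string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, strings.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain the body so the underlying connection can be reused.
	_, _ = io.Copy(io.Discard, resp.Body)

	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("POST %s: unexpected status %d", url, resp.StatusCode)
	}
	return nil
}

// probeFake exercises postJSON against an in-process stand-in server.
// The /fail path is hypothetical and only simulates a broken canary.
func probeFake(path string) error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/fail" {
			http.Error(w, "boom", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	return postJSON(context.Background(), srv.Client(), srv.URL+path, `{"sku":"TEST-001"}`)
}

func main() {
	fmt.Println(probeFake("/api/cart")) // healthy stand-in
	fmt.Println(probeFake("/fail"))     // simulated failure
}
```

A real flow would likely also inspect the response body, since a status-only check misses the "200 with empty content" case.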

The crucial bit is labeling by target. You want to know whether the synthetic is failing against the canary, the stable, or both. If it’s failing against both, you have a real outage and the rollout abort is a side issue. If it’s failing only against canary, the rollout aborts cleanly.
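The decision table implied by the target label can be sketched in a few lines. The Verdict names are hypothetical labels for this example, and the stable-only case is not discussed in the text, so its handling here is an assumption.

```go
package main

import "fmt"

// Verdict names are hypothetical labels for this sketch.
type Verdict string

const (
	Healthy        Verdict = "healthy"
	AbortRollout   Verdict = "abort-rollout"
	Outage         Verdict = "outage-or-broken-synthetic"
	InspectStable  Verdict = "inspect-stable"
)

// classify maps per-target synthetic results to an action.
func classify(canaryOK, stableOK bool) Verdict {
	switch {
	case canaryOK && stableOK:
		return Healthy
	case !canaryOK && stableOK:
		// Only the new version is failing: the rollout should abort.
		return AbortRollout
	case !canaryOK && !stableOK:
		// Both fail: a real outage or a broken synthetic, not a canary problem.
		return Outage
	default:
		// Stable fails while canary passes (assumed handling, not from the text).
		return InspectStable
	}
}

func main() {
	fmt.Println(classify(false, true))
	fmt.Println(classify(false, false))
}
```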

Designing synthetics that matter

A few rules I’ve found hold up:

One synthetic per user-critical flow. Not per endpoint. The flow is what matters; “login → search → checkout” is the artifact, not three separate endpoint pings.
Synthetics are tests, not probes. Tag them with version, expected behavior, and an owner. They will rot the moment the product changes if you don’t.
Run synthetics against both canary and stable. Failures against canary only mean the canary is broken. Failures against both mean the synthetic is broken or the platform is broken.
Don’t use synthetics as your only correctness signal. They’re a probe set, not a test suite. Synthetics + production observation + integration tests, not synthetics alone.
Budget for synthetic noise. A synthetic that runs every 15 seconds, a couple of hundred runs an hour, will occasionally fail for transient reasons. Tune successCondition to be ≥95%, not 100%.
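The noise budget has two layers: the pass-rate threshold on each measurement and the number of failed measurements tolerated across the run. The sketch below is a simplified model of that behavior, assuming a run fails once failed measurements exceed failureLimit. It is not Argo Rollouts code, and analysisPasses is a hypothetical name.

```go
package main

import "fmt"

// analysisPasses is a simplified model of an analysis run. Each measurement
// passes if its pass rate is at or above threshold. The run fails once the
// number of failed measurements exceeds failureLimit (assumed semantics).
func analysisPasses(rates []float64, threshold float64, failureLimit int) bool {
	failed := 0
	for _, r := range rates {
		if r < threshold {
			failed++
			if failed > failureLimit {
				return false
			}
		}
	}
	return true
}

func main() {
	// Four measurements, threshold 0.95, one failure tolerated.
	fmt.Println(analysisPasses([]float64{1.0, 0.93, 0.97, 1.0}, 0.95, 1)) // one dip
	fmt.Println(analysisPasses([]float64{1.0, 0.93, 0.90, 1.0}, 0.95, 1)) // two dips
}
```

With count: 4 and failureLimit: 1, a single transient dip is absorbed and a second one stops the rollout.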

The Grafana Synthetic Monitoring docs are a decent vendor reference for the operational concerns. If you don’t want to self-host, the SaaS offerings (Grafana Synthetic Monitoring, Datadog, Checkly) are competent. Self-hosting on a couple of CronJobs is also fine.

The rollback sequence

When the analysis fails:

Argo Rollouts marks the rollout as Degraded.
Traffic is shifted back to the stable replicaset (immediate).
A Kubernetes event is emitted (your auto-remediation pipeline can pick this up).
The team is paged.

A useful follow-up is a Slack notification naming the analysis template that failed. If it was a synthetic, the on-call engineer can immediately run it by hand against staging and reproduce the failure.

This is the bit that distinguishes a real progressive delivery setup from “we do canaries.” Real progressive delivery has automated rollback on a failure signal that isn’t just metrics. Synthetics make that signal richer.
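For the Slack follow-up mentioned above, a minimal sketch of the message payload might look like this. It assumes a Slack incoming webhook, which accepts a JSON object with a text field. How the rollout name and failed template name are obtained is left out, and abortNotification is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// slackMessage is the minimal payload shape for a Slack incoming webhook.
type slackMessage struct {
	Text string `json:"text"`
}

// abortNotification builds the JSON body naming the failed analysis template.
// Where the two names come from (event, controller status) is not shown here.
func abortNotification(rollout, failedTemplate string) ([]byte, error) {
	msg := slackMessage{
		Text: fmt.Sprintf("Rollout %s aborted: analysis template %s failed", rollout, failedTemplate),
	}
	return json.Marshal(msg)
}

func main() {
	body, err := abortNotification("checkout", "synthetic-checkout-flow")
	if err != nil {
		panic(err)
	}
	// POST this body to the webhook URL configured for your channel.
	fmt.Println(string(body))
}
```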

Tying to chaos engineering

The same synthetics you run during canaries are the steady-state probe for chaos experiments. One probe set, two consumers. That’s a real saving on operational cost.

When you run a chaos experiment on the checkout service, the chaos engine’s steady-state check is the same synthetic_check_success_total ratio. When the experiment causes the ratio to drop below threshold, the experiment is marked failed and rolled back. Identical mechanics to the canary case.

Gotchas

Synthetics that share auth credentials with real users. Don’t. Use a service account with a quota-limited test customer. Otherwise your synthetics show up in business analytics as fake purchases.
Synthetics that mutate real data. Bad. Use a test account with a feature flag that skips downstream side effects (billing, email).
No retry budget on the synthetic. A flaky network will fail synthetics randomly. One failed probe shouldn’t abort a rollout; that’s why failureLimit exists.
Synthetic against stable only. You’ll never catch canary-specific issues. Always run against the canary endpoint.
Synthetic ignored when failing. The fastest way to teach a team to ignore signal is to leave a flaky synthetic on. Fix or remove. Don’t silence.
No synthetic for the rollback path. If your rollback is itself broken, you find out at the worst time. Run a synthetic against the stable service permanently, not just during deploys.
Confusing synthetics with E2E tests. E2E tests run in CI against staging. Synthetics run in production against the live service. Different consumers, different SLAs.

Wrapping Up

Synthetic monitoring and canary deploys are stronger together because they catch different failure modes. Wire the synthetic pass rate into the canary analysis and your rollouts will fail closed on the same signals that real users would see. The cost is a couple of templates, a small Go binary, and the discipline to keep the synthetics honest.

This is the last post in this month’s reliability run. We covered Digital Immune Systems, SLOs, chaos engineering, auto-remediation, eBPF and OTel, service mesh resilience, and blameless postmortems. If you implement even half of those well, you’ll have a reliability program that’s better than most. Next month I’ll shift back to language and runtime topics.
