Service Mesh Resilience, Istio Ambient vs Linkerd in 2024

June 19, 2024 · 8 min read · by Muhammad Amal programming

TL;DR: Istio Ambient reached Beta in 1.22. Linkerd 2.15 stayed Linkerd. Ambient wins on flexibility and zero-sidecar overhead. Linkerd wins on operational simplicity and tail latency. The choice depends on whether you want a Swiss Army knife or a scalpel.

I’ve been running service meshes in production since the Istio 1.0 sidecar days. The biggest practical change in 2024 is that Istio Ambient was promoted to Beta in 1.22, which means you can finally run Istio without a sidecar per pod. That removes one of the biggest objections to Istio and reopens the comparison with Linkerd, which has always been the simpler choice.

This post is about resilience: retries, timeouts, circuit breakers, mTLS, and the kinds of failure modes that the mesh is supposed to mitigate. It’s not a feature checklist; both projects do most of the same things. It’s about how they feel in operation when something is breaking.

If you haven’t read the eBPF and OpenTelemetry observability post yet, that’s the layer that tells you when the mesh’s resilience features are firing. Mesh without observability is faith-based reliability.

Two meshes, two postures

Linkerd has always been the minimalist mesh. Rust data plane (linkerd2-proxy), Go control plane, no extension points beyond what’s strictly needed. The 2.15 release continued the trend: smaller, faster, fewer knobs. The opinion is that most teams should not be tuning a mesh.

Istio is the opposite. Envoy data plane (huge feature surface), Go control plane (istiod), and now two deployment modes: classic sidecar and Ambient (which uses per-node ztunnel for L4 and optional per-namespace waypoint proxies for L7). The opinion is that you’ll want every feature eventually.

Both ship mTLS by default in 2024. Both do retries, timeouts, and outlier detection. The difference is in how much you have to know to use them.

Resilience as policy, not code

A common refrain when teams adopt a mesh: “great, now retries live in the mesh; we don’t need them in code.” Half right. The mesh handles transport-level retries (connection reset, 5xx, gRPC UNAVAILABLE) and timeouts well. It doesn’t handle idempotency. That’s still your job in code.

Concretely, mesh retries should only fire on:

TCP connect errors.
HTTP 502/503/504.
gRPC UNAVAILABLE.

Never on 400, 404, or gRPC INVALID_ARGUMENT. The first set is “transient infrastructure problem”; the second is “your request is wrong, retrying won’t help.”
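The rule above fits in a few lines. This is a minimal sketch of the classification, not either mesh's API; the function and constant names are hypothetical.

```python
from typing import Optional

# Hypothetical names; this mirrors the rule in the text, not a mesh API.
RETRYABLE_HTTP = {502, 503, 504}
RETRYABLE_GRPC = {"UNAVAILABLE"}


def should_retry(http_status: Optional[int] = None,
                 grpc_status: Optional[str] = None,
                 connect_error: bool = False) -> bool:
    """Return True only for transient infrastructure failures."""
    if connect_error:
        # The request never reached the backend, so a retry cannot duplicate work.
        return True
    if grpc_status is not None:
        return grpc_status in RETRYABLE_GRPC
    return http_status in RETRYABLE_HTTP


print(should_retry(http_status=503))                  # True
print(should_retry(http_status=404))                  # False
print(should_retry(grpc_status="INVALID_ARGUMENT"))   # False
```

Anything not on the allow-list falls through to "don't retry", which is the safe default.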

Istio Ambient retry policy

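# Route attached to the checkout Service; the namespace's waypoint enforces it.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout-route
  namespace: checkout
spec:
  parentRefs:
    - group: ""          # core API group, needed for kind: Service
      kind: Service
      name: checkout
      port: 8080
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      timeouts:
        request: 2s            # overall deadline, retries included
        backendRequest: 800ms  # deadline for a single backend attempt
      backendRefs:
        - name: checkout
          port: 8080
          weight: 100
      filters:
        - type: ExtensionRef
          extensionRef:
            group: networking.istio.io
            kind: RetryPolicy
            name: idempotent-only
---
# Illustrative only: networking.istio.io does not ship a RetryPolicy kind in 1.22.
# The fields below mirror VirtualService spec.http[].retries, which is where
# Istio reads retry settings.
apiVersion: networking.istio.io/v1
kind: RetryPolicy
metadata:
  name: idempotent-only
  namespace: checkout
spec:
  attempts: 3
  perTryTimeout: 600ms
  # gateway-error = 502/503/504; unavailable = gRPC UNAVAILABLE
  retryOn: "gateway-error,reset,connect-failure,unavailable"
  retryRemoteLocalities: true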

Ambient in 1.22 uses Gateway API for waypoint-level routing and timeouts, which is a real improvement over the VirtualService/DestinationRule combo. Cleaner mental model. Notice that the backendRequest timeout is shorter than the overall request timeout, which leaves headroom for retries.
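To make the headroom concrete, here is the arithmetic for the 2s and 800ms values above. It is a rough sketch that assumes every attempt runs to its full timeout and ignores backoff between attempts; the function name is hypothetical.

```python
def retry_headroom(request_ms: int, backend_request_ms: int):
    """Worst case: how many full attempts fit before the overall deadline."""
    if backend_request_ms <= 0:
        raise ValueError("backend_request_ms must be positive")
    full_attempts = request_ms // backend_request_ms
    leftover_ms = request_ms - full_attempts * backend_request_ms
    return full_attempts, leftover_ms


print(retry_headroom(2000, 800))  # (2, 400)
```

So in the worst case you get the original attempt plus one full retry, and a third attempt only has 400ms left before the overall deadline cuts it off.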

Linkerd 2.15 retry policy

# ServiceProfile names must match the Service's FQDN.
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: checkout.checkout.svc.cluster.local
  namespace: checkout
spec:
  routes:
    - name: GET /api
      condition:
        method: GET
        pathRegex: /api(/.*)?
      isRetryable: true   # opt in per route; only mark idempotent routes
      timeout: 2s         # overall deadline, retries included
  retryBudget:
    retryRatio: 0.2            # retries may add at most 20% extra load
    minRetriesPerSecond: 10    # floor so low-traffic services can still retry
    ttl: 10s                   # window over which the ratio is computed
Linkerd’s retry config is leaner: you mark routes as retryable and set one budget for the service. Linkerd uses retry budgets (a cap on retries as a ratio of original requests) rather than per-request retry counts, which is the safer default. It prevents retry storms by construction.

Reasonable people disagree on which is better. I lean Linkerd here because retry budgets are a more honest model of how retries interact with capacity.
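If the budget idea is new, this toy version shows why it caps load. It is a simplified sketch, not Linkerd's implementation: it counts over the lifetime of the object instead of a sliding window, and it has no minimum-retries floor. The class and method names are hypothetical.

```python
class RetryBudget:
    """Caps retries at a fixed percentage of original requests seen so far."""

    def __init__(self, retry_percent: int):
        self.retry_percent = retry_percent
        self.originals = 0
        self.retries = 0

    def record_request(self) -> None:
        """Call once per original (non-retry) request."""
        self.originals += 1

    def try_retry(self) -> bool:
        """Spend budget for one retry if the ratio allows it."""
        # Integer math avoids float rounding right at the boundary.
        if (self.retries + 1) * 100 <= self.retry_percent * self.originals:
            self.retries += 1
            return True
        return False


budget = RetryBudget(retry_percent=20)
for _ in range(10):
    budget.record_request()

# Every request failed and wants a retry; only 20% of them get one.
granted = [budget.try_retry() for _ in range(10)]
print(granted.count(True))  # 2
```

With per-request counts, ten failing requests at three retries each would have produced thirty extra calls. The budget lets through two.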

mTLS, the cheap reliability win

Both meshes give you mTLS between meshed workloads by default. The reliability angle (besides security) is that mTLS forces a strong identity check at connection time, which catches misrouted traffic earlier and more loudly than a TCP-only mesh would.

In Istio Ambient, mTLS is provided by ztunnel at the node level. No sidecar. No injection. Just enroll the namespace and traffic gets encrypted.

apiVersion: v1
kind: Namespace
metadata:
  name: checkout
  labels:
    istio.io/dataplane-mode: ambient

That’s the whole opt-in for L4 mTLS. The waypoint for L7 features is a separate object you create when you want HTTP-level policy.

In Linkerd, the linkerd-proxy sidecar is injected into each pod (still sidecar-mode; Linkerd hasn’t gone sidecar-less). mTLS is on by default for meshed pods. Same UX, different mechanism.

For broader context, the Istio Ambient overview is the canonical reading. It’s the most thorough explanation of why the data plane split into L4 (ztunnel) and L7 (waypoint) is the model going forward.

Outlier detection, the unsung feature

Outlier detection is the mesh removing a misbehaving endpoint from the load-balancing pool. Both meshes do it. It’s underused.

Linkerd does it automatically via its load balancer (which is EWMA-based and downweights slow endpoints). You don’t configure it. It just works.
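For intuition, here is roughly how an EWMA-based balancer steers away from a slow endpoint. This is a sketch of the general technique, not linkerd2-proxy's code; the smoothing factor, the pick-two-and-compare selection, and all names are assumptions for illustration.

```python
import random


class Endpoint:
    def __init__(self, name: str, alpha: float = 0.3):
        self.name = name
        self.alpha = alpha      # weight given to the newest sample
        self.ewma_ms = None     # no latency observed yet

    def observe(self, latency_ms: float) -> None:
        """Fold a new latency sample into the moving average."""
        if self.ewma_ms is None:
            self.ewma_ms = latency_ms
        else:
            self.ewma_ms = (self.alpha * latency_ms
                            + (1 - self.alpha) * self.ewma_ms)


def pick(endpoints, rng=random):
    """Sample two endpoints and take the one with the lower EWMA."""
    a, b = rng.sample(endpoints, 2)
    # Endpoints with no samples count as 0 so they get tried at least once.
    return a if (a.ewma_ms or 0.0) <= (b.ewma_ms or 0.0) else b


fast = Endpoint("fast")
slow = Endpoint("slow")
fast.observe(10)
slow.observe(10)
slow.observe(500)   # one bad response drags the average up

print(pick([fast, slow]).name)  # fast
```

Nothing is ejected here. The slow endpoint simply loses most comparisons until its average recovers, which is why there is nothing to configure.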

Istio requires explicit DestinationRule (or Ambient equivalent) configuration:

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: payments-circuit-breaker
  namespace: checkout
spec:
  host: payments.payments.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 30
    connectionPool:
      http:
        maxRequestsPerConnection: 100
        http2MaxRequests: 1000

The maxEjectionPercent: 30 is important. Istio defaults it to 10%; the failure mode is setting it too high, where a flaky dependency has most of its endpoints ejected and you’ve created the outage you were trying to prevent.
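The effect of the cap is easiest to see with numbers. A simplified sketch, assuming the cap rounds down to a whole number of endpoints (real proxies may round differently); the function names are hypothetical.

```python
def max_ejectable(total_endpoints: int, max_ejection_percent: int) -> int:
    """Upper bound on endpoints that may be ejected at the same time."""
    return (total_endpoints * max_ejection_percent) // 100


def eject(failing, total_endpoints: int, max_ejection_percent: int):
    """Eject failing endpoints in order until the cap is reached."""
    cap = max_ejectable(total_endpoints, max_ejection_percent)
    return failing[:cap]


# 10 endpoints, 7 of them tripping the consecutive-5xx threshold.
failing = [f"payments-{i}" for i in range(7)]
ejected = eject(failing, total_endpoints=10, max_ejection_percent=30)

print(len(ejected))        # 3
print(10 - len(ejected))   # 7
```

Seven endpoints stay in the pool even though four of them are unhealthy. That is the trade: some requests keep failing, but the healthy three are not left to absorb all the traffic alone.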

Picking between them

A short heuristic:

Have <100 services, want minimal ops, mostly HTTP/gRPC: Linkerd 2.15. You’ll be productive in a week.
Want VM workloads, multi-cluster federation, complex routing: Istio Ambient.
Have an existing Istio install with sidecars: stay on Istio, plan a phased Ambient migration.
Need WASM extensions: Istio.
Care about tail latency above all: Linkerd. The Rust proxy is genuinely faster at the p99.

The Ambient release closes most of the operational gap that drove teams to Linkerd. It doesn’t close all of it. Linkerd’s “boring on purpose” posture still results in fewer 3am pages.

Migration considerations

A few teams I’ve talked to are weighing migration paths. Three paths I’d flag:

Sidecar Istio to Ambient. This is the safer migration. Same control plane, same APIs largely, but you remove the sidecar tax. Phase it: pick a low-risk namespace, enroll it in Ambient, leave the rest as is. The two modes coexist. After a quarter of operation, evaluate whether to convert more.

Linkerd to Istio Ambient. Harder. Different control planes, different policy CRDs, different proxy. Don’t do this unless you have a specific feature need that Linkerd genuinely doesn’t cover (multi-cluster federation is the usual one). The operational burden of Istio is real, even with Ambient.

No mesh to either one. Easier than either migration. Start with the simpler choice (Linkerd) unless you know you’ll need Istio features.

The decision should be documented with a specific named feature need, not “Istio has more features.” It always does. That’s not a reason to pick it.

Resilience features in code anyway

A frequently-debated question: with a mesh, do you still need resilience patterns in app code? My answer is yes, for two reasons.

First, the mesh doesn’t know your business semantics. It can retry an HTTP request, but it can’t decide that “duplicate this charge” is the wrong retry to make. Idempotency keys live in your code.
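Here is the shape of an idempotency-key check, as a minimal sketch. It assumes the client sends a stable key with each logical request; the in-memory dict stands in for a durable store, and it is not safe under concurrency. All names are hypothetical.

```python
from typing import Callable, Dict


class IdempotentHandler:
    """Runs the charge once per idempotency key and replays the stored result."""

    def __init__(self, charge: Callable[[int], str]):
        self._charge = charge
        self._results: Dict[str, str] = {}  # stand-in for a durable store

    def handle(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key in self._results:
            # Seen before: a retry. Return the first result, do not charge again.
            return self._results[idempotency_key]
        result = self._charge(amount_cents)
        self._results[idempotency_key] = result
        return result


calls = []

def fake_charge(amount_cents: int) -> str:
    calls.append(amount_cents)
    return f"charge-{len(calls)}"

handler = IdempotentHandler(fake_charge)
first = handler.handle("order-42", 1999)
second = handler.handle("order-42", 1999)  # the mesh retried the same request

print(first, second, len(calls))  # charge-1 charge-1 1
```

The mesh can retry as often as its policy allows; the customer is charged once.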

Second, mesh failures happen. The proxy can crash, the control plane can be partitioned, the policy can be misconfigured. If your app has zero in-process resilience, a mesh failure becomes an application outage. Defensive timeouts and circuit breakers in code are cheap insurance.
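An in-process circuit breaker does not need a library. A minimal sketch, assuming a consecutive-failure threshold and a fixed cool-down; the names and defaults are hypothetical, and it is not thread-safe.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures, allows a probe after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let a probe request through (half-open).
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0)
breaker.record_failure()
breaker.record_failure()
print(breaker.allow())  # False
```

Wrap outbound calls with allow() before the call and record_success() or record_failure() after it. If the mesh's own outlier detection is misconfigured or the proxy is down, this still fails fast.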

What the mesh frees you from is the mechanical stuff: TLS, basic retries, simple timeouts, traffic shaping. That’s enough to be worth running.

Gotchas

Mesh-level retries plus app-level retries. This multiplies. If your app makes 3 attempts and the mesh makes 3 attempts for each of them, a single failing request can balloon into 9 calls to the backend. Pick one layer.
No mTLS skew handling. When you rotate the root certificate, both meshes need a grace window in which the old and new roots are both trusted. Plan rotations weeks ahead, not the same day.
Ambient migration with leftover sidecars. Some namespaces will still need sidecars (for WASM or EnvoyFilter). Mixing modes across namespaces is fine, but mixing within one is confusing; don’t enroll a namespace in Ambient and leave injected sidecars in it.
Outlier detection without observability. If the mesh ejects 30% of endpoints and you don’t see it, you’ll think the service is fine when nearly a third of its capacity is gone.
Treating the mesh as a policy engine. Mesh policy is fine for L4/L7 routing. For finer-grained controls (data residency, request shape), use a real admission layer.
Linkerd 2.15 on x86 only? It’s not. ARM works in 2.15. People assume otherwise from old docs.
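The retry multiplication in the first gotcha is plain arithmetic, but the answer depends on whether each layer's number means total attempts or retries on top of the first try. A quick sketch with hypothetical function names:

```python
def calls_from_attempts(app_attempts: int, mesh_attempts: int) -> int:
    """Each layer's number is the total tries it makes, first try included."""
    return app_attempts * mesh_attempts


def calls_from_retries(app_retries: int, mesh_retries: int) -> int:
    """Each layer's number is extra tries after the first one fails."""
    return (1 + app_retries) * (1 + mesh_retries)


print(calls_from_attempts(3, 3))  # 9
print(calls_from_retries(3, 3))   # 16
```

Check which meaning your client library and your mesh policy use before you trust the smaller number.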

Wrapping Up

The Istio Ambient Beta is real news. It removes the sidecar tax that was the main objection to Istio at scale. Linkerd 2.15 didn’t do anything dramatic and still doesn’t need to. Both are good. Pick on operational fit, not feature checklists.

Whatever you pick, treat the mesh as one layer in your resilience story, not the whole thing. App-level idempotency, circuit breakers in code, and proper SLOs still matter. The mesh helps. It’s not magic.

Next week I’ll write about blameless postmortems, which is the human layer that closes the resilience loop after the technical layers (mesh, observability, chaos) have done their work.
