Rate Limiting and Resilience Patterns for Modern APIs

July 30, 2025 · 10 min read · by Muhammad Amal programming

TL;DR — Rate limiting alone is not resilience. You need three layers — gateway limits for the edge, application limits for fairness, and circuit breakers plus hedged retries for the dependency graph. Each one fails differently, and you need all three.

If there’s a single piece of operational wisdom I’ve earned the hard way, it’s this: every API will eventually be hit by load it can’t handle. The interesting question is not whether it happens but how the service behaves when it does. A service with no protection becomes a paging incident. A service with naive rate limiting becomes a quieter paging incident that drops legitimate traffic. A service with the right layered defense degrades cleanly and stays up.

This post is the playbook I use across teams. It covers the four patterns that, in my experience, do most of the work: token bucket rate limiting, concurrency limits, circuit breakers, and hedged retries. The code is Go 1.24, the gateway examples are Envoy Gateway 1.3 and Kong 3.9, and the patterns transfer to whatever language you’re in.

If you’ve read the API gateway architectures post, you’ll recognize the gateway snippets here; this one focuses on the operational behavior they produce rather than the config syntax. There’s also a natural pairing with OpenTelemetry for gRPC services — none of the resilience signals matter if you can’t observe them.

1. The Three-Layer Mental Model

+-------------------+    edge rate limit (per IP, per key)
|     Gateway       | --------------------------------------
+-------------------+
         |
+-------------------+    app rate limit (per tenant, per route)
|     Service       |    concurrency limit (in-flight cap)
+-------------------+
         |
+-------------------+    circuit breaker (per dependency)
|   Dependency      |    retries with hedging
+-------------------+

Edge: protects you from external abuse. A token bucket per API key, plus a coarser per-IP fallback for unauthed traffic.
App: protects fair allocation across legitimate clients. A noisy tenant doesn’t starve a quiet one.
Downstream: protects you from a slow dependency turning into a wedged service.

The most common mistake is doing only one layer. Edge-only protects you from external attack but a misbehaving internal client still wedges you. App-only protects fairness but your unauthed endpoints are open to drive-by abuse. Downstream-only is rare and means somebody thought “circuit breaker” without thinking “rate limit.”

2. Token Bucket Rate Limiting in Go

The standard golang.org/x/time/rate package gives you a single-process token bucket:

import (
	"net/http"

	"golang.org/x/time/rate"
)

// 100 requests per second, burst of 200
var limiter = rate.NewLimiter(100, 200)

func handler(w http.ResponseWriter, r *http.Request) {
	if !limiter.Allow() {
		http.Error(w, "rate limit", http.StatusTooManyRequests)
		return
	}
	// ...
}

This is fine for single-pod, single-tenant services. The moment you have multiple pods or multiple tenants, it falls down. For multi-pod, you need a distributed counter. For multi-tenant, you need a per-tenant limiter.

2.1 Per-Tenant Limiters

type tenantLimiter struct {
	mu    sync.Mutex
	cache map[string]*rate.Limiter
	rps   rate.Limit
	burst int
}

func newTenantLimiter(rps rate.Limit, burst int) *tenantLimiter {
	return &tenantLimiter{cache: map[string]*rate.Limiter{}, rps: rps, burst: burst}
}

func (t *tenantLimiter) Allow(tenant string) bool {
	t.mu.Lock()
	l, ok := t.cache[tenant]
	if !ok {
		l = rate.NewLimiter(t.rps, t.burst)
		t.cache[tenant] = l
	}
	t.mu.Unlock()
	return l.Allow()
}

This grows unbounded if you have a long tail of tenants. Wrap it in an LRU (hashicorp/golang-lru/v2) or evict idle entries.

2.2 Distributed Rate Limiting With Redis

For multi-pod accuracy, use Redis with a Lua script for atomicity:

-- token_bucket.lua
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local burst = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local data = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1]) or burst
local ts = tonumber(data[2]) or now

local elapsed = math.max(0, now - ts)
tokens = math.min(burst, tokens + elapsed * rate)

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call("HSET", key, "tokens", tokens, "ts", now) -- HMSET is deprecated; HSET takes multiple fields
redis.call("EXPIRE", key, math.ceil(burst / rate) + 10)
return { allowed, tokens }

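type redisLimiter struct {
	c     *redis.Client
	sha   string  // SHA1 of token_bucket.lua, as returned by SCRIPT LOAD
	rate  float64 // tokens per second
	burst int
}

// Allow spends one token from the bucket stored at rl:<key>.
func (rl *redisLimiter) Allow(ctx context.Context, key string) (bool, error) {
	// The script expects "now" in seconds; keep millisecond resolution as a fraction.
	now := float64(time.Now().UnixMilli()) / 1000
	res, err := rl.c.EvalSha(ctx, rl.sha, []string{"rl:" + key},
		rl.rate, rl.burst, now, 1).Result()
	if err != nil {
		return false, err
	}
	// The script returns { allowed, tokens }; only the first element is needed.
	arr, ok := res.([]interface{})
	if !ok || len(arr) == 0 {
		return false, fmt.Errorf("unexpected script result %T", res)
	}
	allowed, _ := arr[0].(int64)
	return allowed == 1, nil
}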

The latency budget per request is one Redis round trip. With a co-located Redis (same AZ), that’s sub-millisecond and acceptable. If you can’t afford it, fall back to per-pod limiters with the rate divided by replica count — imprecise but cheap.
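
The per-pod fallback is just a division. A small sketch, where the function name and the choice to round the burst up are assumptions:

```go
package main

import "math"

// perPodLimit splits a fleet-wide rate and burst evenly across replicas.
// The burst is rounded up so every pod can admit at least one request.
func perPodLimit(fleetRPS float64, fleetBurst, replicas int) (rps float64, burst int) {
	if replicas < 1 {
		replicas = 1
	}
	rps = fleetRPS / float64(replicas)
	burst = int(math.Ceil(float64(fleetBurst) / float64(replicas)))
	if burst < 1 {
		burst = 1
	}
	return rps, burst
}
```

For example, perPodLimit(100, 200, 4) gives each pod 25 requests per second with a burst of 50. The imprecision comes from two places: load balancing is never perfectly even, and the replica count changes during rollouts and autoscaling, so the limit has to be recomputed when it does.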

3. Concurrency Limits

Rate limiting throttles requests per second. It does nothing if every request takes ten seconds. A burst of expensive requests can pile up arbitrarily until you OOM. Concurrency limits cap the number of in-flight requests, which directly bounds memory and goroutine count.

type semaphore struct{ ch chan struct{} }

func newSem(n int) *semaphore { return &semaphore{ch: make(chan struct{}, n)} }

func (s *semaphore) Acquire(ctx context.Context) error {
	select {
	case s.ch <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
func (s *semaphore) Release() { <-s.ch }

func middleware(sem *semaphore, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 200*time.Millisecond)
		defer cancel()
		if err := sem.Acquire(ctx); err != nil {
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
			return
		}
		defer sem.Release()
		next.ServeHTTP(w, r)
	})
}

Pick the cap empirically: load test the service, find the point at which p99 latency starts climbing, set the cap at 80% of that concurrency. The 200ms acquire timeout means callers fail fast instead of queueing forever.

For adaptive concurrency, AIMD (additive increase, multiplicative decrease) is the classic algorithm: it raises the limit while latency is stable and slashes it on signs of overload. Envoy’s adaptive concurrency filter follows the same latency-driven idea, though its controller is gradient-based (it compares sampled latency against a measured minimum round-trip time) rather than plain AIMD. Worth using at the gateway layer; complex enough that I wouldn’t build it in-app.
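
To make the AIMD rule itself concrete, here is a deliberately tiny sketch. The aimd type, its fields, and the idea of feeding it a single overloaded flag per sample are assumptions; a real controller also has to decide what counts as overload (latency above a target, timeouts, rejections) and how often to sample.

```go
package main

// aimd is a hypothetical additive-increase/multiplicative-decrease controller
// for a concurrency limit. It is not safe for concurrent use.
type aimd struct {
	limit   int     // current concurrency limit
	min     int     // floor, so the service never locks itself out
	max     int     // ceiling, so the limit cannot grow without bound
	backoff float64 // multiplier applied on overload, e.g. 0.9
}

// observe updates the limit after one sample and returns the new value.
func (a *aimd) observe(overloaded bool) int {
	if overloaded {
		// Multiplicative decrease: back off quickly.
		a.limit = int(float64(a.limit) * a.backoff)
		if a.limit < a.min {
			a.limit = a.min
		}
	} else if a.limit < a.max {
		// Additive increase: probe for capacity one slot at a time.
		a.limit++
	}
	return a.limit
}
```

The asymmetry is the point: capacity is regained slowly and given up fast. Note that the channel-based semaphore above has a fixed capacity, so applying a moving limit would need a resizable semaphore.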

4. Circuit Breakers

A circuit breaker stops calling a dependency that’s failing, giving it room to recover and shielding your service from cascading failures.

import "github.com/sony/gobreaker/v2"

var breaker = gobreaker.NewCircuitBreaker[*PaymentResp](gobreaker.Settings{
	Name:        "payments",
	MaxRequests: 5,                // in half-open state
	Interval:    30 * time.Second, // counter reset window
	Timeout:     10 * time.Second, // open -> half-open
	ReadyToTrip: func(c gobreaker.Counts) bool {
		return c.Requests >= 20 && c.TotalFailures*100/c.Requests >= 50
	},
})

func charge(ctx context.Context, amount int64) (*PaymentResp, error) {
	return breaker.Execute(func() (*PaymentResp, error) {
		return paymentClient.Charge(ctx, amount)
	})
}

The breaker states:

       failures over threshold
CLOSED --------------------------> OPEN
   ^                                |
   | all probes succeed             | timeout
   |                                v
HALF-OPEN <--------- (probe a few requests)

Tuning notes from experience:

ReadyToTrip should require a minimum request count (20 above). Without it, the first failure trips the breaker.
Timeout (open -> half-open) of 10-30 seconds is usually right. Shorter and you flap; longer and recovery lags.
MaxRequests in half-open is the canary: a few requests at a time to test recovery, not a flood.

The breaker is per-dependency, not per-process. A flaky payments service should not affect your inventory service calls.

5. Retries and Hedging

Retries are easy to get wrong. The right model is:

Retry only idempotent operations
Retry only on retryable errors (gRPC UNAVAILABLE, DEADLINE_EXCEEDED; HTTP 5xx and 429)
Use exponential backoff with jitter
Cap total attempts and total time
For latency-sensitive reads, consider hedging

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

func withRetry[T any](ctx context.Context, fn func(context.Context) (T, error)) (T, error) {
	var zero T
	for attempt := 0; attempt < 4; attempt++ {
		v, err := fn(ctx)
		if err == nil { return v, nil }
		if !retryable(err) { return zero, err }
		if attempt == 3 { return zero, err }
		backoff := time.Duration(1<<attempt) * 50 * time.Millisecond
		jitter := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(backoff + jitter):
		case <-ctx.Done():
			return zero, ctx.Err()
		}
	}
	return zero, errors.New("unreachable")
}

5.1 Hedging for Tail Latency

For reads where consistency allows it, send a second request if the first hasn’t answered within a set delay:

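func hedged[T any](ctx context.Context, hedgeAt time.Duration, fn func(context.Context) (T, error)) (T, error) {
	type res struct {
		v   T
		err error
	}
	ch := make(chan res, 2) // buffered so the losing goroutine never blocks
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whichever attempt is still running when we return

	run := func() { v, err := fn(ctx); ch <- res{v, err} }
	go run()

	timer := time.NewTimer(hedgeAt)
	defer timer.Stop()
	select {
	case r := <-ch:
		return r.v, r.err // primary finished before the hedge delay
	case <-timer.C:
	}
	go run() // hedge: a second attempt races the first

	r := <-ch
	if r.err != nil {
		r = <-ch // first finisher failed; wait for the other attempt
	}
	return r.v, r.err
}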

Hedge timing should be just above the p95 latency of the underlying call, which adds roughly 5% extra load. Hedging at p50 sends a second request for about half of all calls (roughly 50% more load); hedging at p99 fires too late to help. The technique works because slow tails are usually one-off (a bad pod, a long GC pause), so the second request avoids the unlucky one.

6. Gateway-Level Rate Limits

The application limits above don’t help if traffic is overwhelming before it reaches the app. Push the coarse limits to the gateway.

6.1 Envoy Gateway

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata: { name: edge-rl }
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: public
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers: [{ name: x-api-key, type: Distinct }]
          limit: { requests: 1000, unit: Minute }
        - clientSelectors:
            - sourceCIDR: { value: 0.0.0.0/0, type: Distinct }
          limit: { requests: 60, unit: Minute }

Two rules: 1000/min per API key, 60/min per IP as a fallback for unauthed traffic.

6.2 Kong

apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata: { name: edge-rl }
plugin: rate-limiting
config:
  minute: 1000
  policy: redis
  redis_host: redis.cache.svc
  fault_tolerant: true
  identifier: header
  header_name: x-api-key

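fault_tolerant is the setting to decide deliberately. With true, Kong proxies requests anyway when it can’t reach Redis, so rate limiting is effectively off until Redis recovers (fail open). With false, clients get 500 errors instead (fail closed). Set it explicitly, or you’ll discover which behavior you have only when Redis goes down.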

7. Common Pitfalls

7.1 Failing Open on Edge Limiter Outage

If your edge limiter’s backing store (Redis) goes down and your code fails open, your service is unprotected. If it fails closed, your service is down. Pick “fail open with alarm” by default, alert immediately, and degrade.
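
In application code, "fail open with alarm" can be as small as a wrapper around the limiter call. A sketch, where failOpen, the alarm callback, and the errStoreDown sentinel are all hypothetical names:

```go
package main

import "errors"

// errStoreDown is a hypothetical sentinel a limiter might return when its
// backing store cannot be reached.
var errStoreDown = errors.New("rate limit store unavailable")

// failOpen runs a limiter check. If the check itself fails, the request is
// admitted and the error is handed to alarm (for example, to bump a metric
// that pages someone).
func failOpen(check func() (bool, error), alarm func(error)) bool {
	allowed, err := check()
	if err != nil {
		alarm(err)
		return true // unprotected but up; the alarm makes that visible
	}
	return allowed
}
```

The alarm is the part that matters: failing open silently is how a limiter outage turns into an overload incident nobody saw coming.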

7.2 Retrying Non-Idempotent Writes

Retrying a POST /payments/charge results in double charges. Either make all writes idempotent (idempotency keys) or never retry them. The middle ground is a footgun.
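
The idempotency-key idea, sketched in memory. Everything here is an assumption for illustration: the type name, holding one lock across the call, and storing results in a map. A real implementation needs a durable, shared store with expiry, and has to handle a crash between performing the write and recording it.

```go
package main

import "sync"

// idempotencyStore is a hypothetical in-memory store of completed results,
// keyed by a client-supplied idempotency key.
type idempotencyStore[T any] struct {
	mu   sync.Mutex
	done map[string]T
}

func newIdempotencyStore[T any]() *idempotencyStore[T] {
	return &idempotencyStore[T]{done: map[string]T{}}
}

// do runs fn at most once per key; repeats of a completed key return the
// stored result without running fn again.
func (s *idempotencyStore[T]) do(key string, fn func() (T, error)) (T, error) {
	s.mu.Lock() // held across fn for simplicity; this serializes all callers
	defer s.mu.Unlock()
	if v, ok := s.done[key]; ok {
		return v, nil
	}
	v, err := fn()
	if err != nil {
		return v, err // failures are not recorded, so the client may retry
	}
	s.done[key] = v
	return v, nil
}
```

With this in place a retried charge carrying the same key returns the original result instead of charging again.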

7.3 Synchronous Cascading Retries

Service A retries calling B, which retries calling C, which retries calling D. A single deep stall in D becomes an exponential request fan-out. Set a Retry-Attempt header and decrement at each hop, or simply don’t retry across service boundaries — retry only at the top of the call tree.

7.4 Circuit Breaker Per Request

gobreaker.NewCircuitBreaker inside a handler creates a new breaker per request. State doesn’t persist. Build breakers at startup, share across requests.

7.5 No Budget on Concurrent Hedges

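Hedging adds load, and in the worst case doubles it: when the downstream slows down, every request crosses the hedge delay and fires a second attempt, so a struggling dependency gets twice the traffic just as it’s struggling. The fix is a global hedge budget (e.g., “at most 5% of requests may hedge”), enforced with a separate token bucket.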
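
A hedge budget can be sketched as a bucket that is refilled by traffic rather than by time: every primary request earns a fraction of a hedge, and every hedge spends a whole one. The type and its integer accounting (in hundredths of a hedge) are assumptions for illustration.

```go
package main

import "sync"

// hedgeBudget is a hypothetical budget that lets roughly `percent` percent
// of requests hedge. Credit is tracked in hundredths of a hedge so the
// arithmetic stays in integers.
type hedgeBudget struct {
	mu      sync.Mutex
	credit  int // accumulated credit; 100 buys one hedge
	percent int // credit earned per primary request, e.g. 5 for 5%
	max     int // cap on saved credit, limiting bursts of hedges
}

// onRequest records one primary request and earns credit for it.
func (b *hedgeBudget) onRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.credit += b.percent
	if b.credit > b.max {
		b.credit = b.max
	}
}

// tryHedge spends one hedge worth of credit if available.
func (b *hedgeBudget) tryHedge() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.credit < 100 {
		return false
	}
	b.credit -= 100
	return true
}
```

Call onRequest for every primary attempt and gate the second goroutine in the hedging function on tryHedge. When the downstream slows and every request wants to hedge, the budget runs dry and the extra load stays near the configured percentage.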

8. Troubleshooting

8.1 p99 Latency Doubled Right After Adding Retries

Your retries are amplifying a slow downstream. Each first attempt times out, the retry waits for jittered backoff, then completes. Result: p99 ≈ first_timeout + backoff + downstream_p50. Either lower the per-attempt deadline (so a stalled attempt is abandoned sooner and the retry starts earlier) or skip retries on this path.

8.2 Rate Limiter Counts Are Wrong After Pod Restart

Local counters reset on restart, so a restarted pod starts with a full bucket. If precise enforcement matters, you must use a distributed store such as the Redis-backed limiter. If “approximately N per second across the fleet” is good enough, per-pod limiters are fine and the reset is noise.

8.3 Circuit Breaker Flapping

Half-open probes succeed, breaker closes, real traffic surges in, downstream falls back over. Either raise MaxRequests so more probes must succeed before the breaker closes, or raise Timeout so it stays open longer.

9. Wrapping Up

Resilience is a system property, not a feature. The four patterns above — rate limits at two layers, concurrency caps, circuit breakers, and disciplined retries — are the spine of every production API I’ve shipped that’s stayed up under unexpected load. None of them is hard individually. Composing them, with sensible defaults and the right observability, is where the work is.

The Google SRE book’s chapter on handling overload is the canonical read on this topic, and the patterns here are mostly its descendants applied to 2025 stacks. With this piece, the July series is complete: schemas, transports, gateways, security, observability, and resilience. The boring fundamentals are still the ones that ship.
