Rate Limiting and Resilience Patterns for Modern APIs
TL;DR — Rate limiting alone is not resilience. You need three layers — gateway limits for the edge, application limits for fairness, and circuit breakers plus hedged retries for the dependency graph. Each one fails differently, and you need all three.
If there’s a single piece of operational wisdom I’ve earned the hard way, it’s this: every API will eventually be hit by load it can’t handle. The interesting question is not whether it happens but how the service behaves when it does. A service with no protection becomes a paging incident. A service with naive rate limiting becomes a quieter paging incident that drops legitimate traffic. A service with the right layered defense degrades cleanly and stays up.
This post is the playbook I use across teams. It covers the four patterns that, in my experience, do most of the work: token bucket rate limiting, concurrency limits, circuit breakers, and hedged retries. The code is Go 1.24, the gateway examples are Envoy Gateway 1.3 and Kong 3.9, and the patterns transfer to whatever language you’re in.
If you’ve read the API gateway architectures post, you’ll recognize the gateway snippets here; this one focuses on the operational behavior they produce rather than the config syntax. There’s also a natural pairing with OpenTelemetry for gRPC services — none of the resilience signals matter if you can’t observe them.
1. The Three-Layer Mental Model
+-------------------+   edge rate limit (per IP, per key)
|      Gateway      |
+-------------------+
          |
+-------------------+   app rate limit (per tenant, per route)
|      Service      |   concurrency limit (in-flight cap)
+-------------------+
          |
+-------------------+   circuit breaker (per dependency)
|    Dependency     |   retries with hedging
+-------------------+
- Edge: protects you from external abuse. A token bucket per API key, plus a coarser per-IP fallback for unauthed traffic.
- App: protects fair allocation across legitimate clients. A noisy tenant doesn’t starve a quiet one.
- Downstream: protects you from a slow dependency turning into a wedged service.
The most common mistake is doing only one layer. Edge-only protects you from external attack but a misbehaving internal client still wedges you. App-only protects fairness but your unauthed endpoints are open to drive-by abuse. Downstream-only is rare and means somebody thought “circuit breaker” without thinking “rate limit.”
2. Token Bucket Rate Limiting in Go
The standard golang.org/x/time/rate package gives you a single-process token bucket:
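import (
	"net/http"

	"golang.org/x/time/rate"
)

// 100 requests per second, burst of 200. One limiter shared by every request
// in this process, so it must live at package level, not inside the handler.
var limiter = rate.NewLimiter(100, 200)

func handler(w http.ResponseWriter, r *http.Request) {
	if !limiter.Allow() {
		http.Error(w, "rate limit", http.StatusTooManyRequests)
		return
	}
	// ...
}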
This is fine for single-pod, single-tenant services. The moment you have multiple pods or multiple tenants, it falls down. For multi-pod, you need a distributed counter. For multi-tenant, you need a per-tenant limiter.
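To make the refill arithmetic behind a token bucket concrete, here is a minimal hand-rolled sketch using only the standard library. It is not how golang.org/x/time/rate is implemented; the type and function names are illustrative, time is passed in as float seconds so the math is easy to trace, and it is not safe for concurrent use.

```go
package main

// bucket is an illustrative token bucket. Times are seconds as float64 so
// the refill arithmetic can be followed (and tested) with a fake clock.
type bucket struct {
	rate   float64 // tokens added per second
	burst  float64 // maximum tokens the bucket can hold
	tokens float64 // tokens currently available
	last   float64 // time of the last refill, in seconds
}

// newBucket starts full, so a fresh client can use its whole burst at once.
func newBucket(rate, burst, now float64) *bucket {
	return &bucket{rate: rate, burst: burst, tokens: burst, last: now}
}

// allow refills for the time elapsed since the last call, caps the balance
// at burst, then spends one token if one is available.
func (b *bucket) allow(now float64) bool {
	if elapsed := now - b.last; elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}
```

With a rate of 2 and a burst of 3, three back-to-back calls pass, the fourth is rejected, and half a second later exactly one more token is available.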
2.1 Per-Tenant Limiters
// tenantLimiter keeps one token bucket per tenant ID.
type tenantLimiter struct {
	mu    sync.Mutex
	cache map[string]*rate.Limiter
	rps   rate.Limit
	burst int
}

func newTenantLimiter(rps rate.Limit, burst int) *tenantLimiter {
	return &tenantLimiter{cache: map[string]*rate.Limiter{}, rps: rps, burst: burst}
}

func (t *tenantLimiter) Allow(tenant string) bool {
	// Hold the lock only for the map lookup; rate.Limiter is itself goroutine-safe.
	t.mu.Lock()
	l, ok := t.cache[tenant]
	if !ok {
		l = rate.NewLimiter(t.rps, t.burst)
		t.cache[tenant] = l
	}
	t.mu.Unlock()
	return l.Allow()
}
This grows unbounded if you have a long tail of tenants. Wrap it in an LRU (hashicorp/golang-lru/v2) or evict idle entries.
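The idle-eviction option can be sketched roughly as follows. This is a hypothetical stand-in for the cache map above, not part of any library: idleCache, get, and sweep are names invented for illustration, the value type V would be *rate.Limiter in practice, and timestamps are plain unix seconds to keep the example dependency-free.

```go
package main

import "sync"

// idleCache remembers when each entry was last used so that a periodic
// sweep can drop tenants that have gone quiet.
type idleCache[V any] struct {
	mu      sync.Mutex
	entries map[string]*entry[V]
}

type entry[V any] struct {
	val      V
	lastSeen int64 // unix seconds of the most recent get
}

func newIdleCache[V any]() *idleCache[V] {
	return &idleCache[V]{entries: map[string]*entry[V]{}}
}

// get returns the value for key, creating it with mk on first use, and
// marks the entry as seen at time now.
func (c *idleCache[V]) get(key string, now int64, mk func() V) V {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok {
		e = &entry[V]{val: mk()}
		c.entries[key] = e
	}
	e.lastSeen = now
	return e.val
}

// sweep removes entries idle for more than maxIdle seconds and reports how
// many were dropped. Intended to be called from a ticker goroutine.
func (c *idleCache[V]) sweep(now, maxIdle int64) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for k, e := range c.entries {
		if now-e.lastSeen > maxIdle {
			delete(c.entries, k)
			removed++
		}
	}
	return removed
}
```

Evicting a limiter hands that tenant a full bucket the next time it shows up, so maxIdle should be at least burst divided by rate; by then the bucket would have refilled anyway.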
2.2 Distributed Rate Limiting With Redis
For multi-pod accuracy, use Redis with a Lua script for atomicity:
-- token_bucket.lua
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local burst = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local data = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1]) or burst
local ts = tonumber(data[2]) or now
local elapsed = math.max(0, now - ts)
tokens = math.min(burst, tokens + elapsed * rate)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end
redis.call("HMSET", key, "tokens", tokens, "ts", now)
redis.call("EXPIRE", key, math.ceil(burst / rate) + 10)
return { allowed, tokens }
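// Go caller for token_bucket.lua (go-redis client).
type redisLimiter struct {
	c     *redis.Client
	sha   string  // SHA1 returned by SCRIPT LOAD for token_bucket.lua
	rate  float64 // tokens per second
	burst int
}

func (rl *redisLimiter) Allow(ctx context.Context, key string) (bool, error) {
	// Seconds with millisecond precision, matching the script's rate unit.
	now := float64(time.Now().UnixMilli()) / 1000
	res, err := rl.c.EvalSha(ctx, rl.sha, []string{"rl:" + key},
		rl.rate, rl.burst, now, 1).Result()
	if err != nil {
		return false, err
	}
	// Check the reply shape instead of panicking on an unexpected type.
	arr, ok := res.([]interface{})
	if !ok || len(arr) == 0 {
		return false, fmt.Errorf("token_bucket.lua: unexpected reply %T", res)
	}
	allowed, _ := arr[0].(int64)
	return allowed == 1, nil
}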
The latency budget per request is one Redis round trip. With a co-located Redis (same AZ), that’s sub-millisecond and acceptable. If you can’t afford it, fall back to per-pod limiters with the rate divided by replica count — imprecise but cheap.
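The per-pod fallback is simple division, but the burst needs a floor. A minimal sketch, assuming the replica count is known at startup (perPodLimit is a hypothetical helper, not an existing API):

```go
package main

// perPodLimit splits a fleet-wide rate and burst evenly across replicas.
// The split is only as accurate as the load balancer is even.
func perPodLimit(fleetRPS float64, fleetBurst, replicas int) (rps float64, burst int) {
	if replicas < 1 {
		replicas = 1
	}
	rps = fleetRPS / float64(replicas)
	burst = fleetBurst / replicas
	if burst < 1 {
		burst = 1 // a burst of 0 would reject every request
	}
	return rps, burst
}
```

For 100 requests per second with a burst of 200 across 4 pods, each pod gets 25 per second with a burst of 50. A tenant pinned to a single pod sees a quarter of its quota, which is the imprecision mentioned above.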
3. Concurrency Limits
Rate limiting throttles requests per second. It does nothing if every request takes ten seconds. A burst of expensive requests can pile up arbitrarily until you OOM. Concurrency limits cap the number of in-flight requests, which directly bounds memory and goroutine count.
// semaphore is a counting semaphore backed by a buffered channel.
type semaphore struct{ ch chan struct{} }

func newSem(n int) *semaphore { return &semaphore{ch: make(chan struct{}, n)} }

// Acquire takes a slot or gives up when ctx is done.
func (s *semaphore) Acquire(ctx context.Context) error {
	select {
	case s.ch <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (s *semaphore) Release() { <-s.ch }

func middleware(sem *semaphore, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// The timeout bounds only the wait for a slot, not the handler itself.
		ctx, cancel := context.WithTimeout(r.Context(), 200*time.Millisecond)
		defer cancel()
		if err := sem.Acquire(ctx); err != nil {
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
			return
		}
		defer sem.Release()
		next.ServeHTTP(w, r)
	})
}
Pick the cap empirically: load test the service, find the point at which p99 latency starts climbing, set the cap at 80% of that concurrency. The 200ms acquire timeout means callers fail fast instead of queueing forever.
For adaptive concurrency, the two approaches you will meet most often are AIMD (additive increase, multiplicative decrease) and the latency-gradient controller used by Envoy’s adaptive concurrency filter. Both raise the limit while latency is stable and cut it on signs of overload. Worth using at the gateway layer; a production-grade version is complex enough that I wouldn’t build it in-app.
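The AIMD rule itself is small enough to sketch, which helps when reading a gateway's tuning knobs. The following is an illustration only, not Envoy's controller: aimdLimit and its fields are hypothetical, and a real implementation would typically adjust once per sampling window rather than on every request.

```go
package main

// aimdLimit is a minimal additive-increase, multiplicative-decrease
// controller for a concurrency cap.
type aimdLimit struct {
	limit, min, max int
	backoff         float64 // multiplicative decrease factor, e.g. 0.5
}

// onSample adjusts the cap after an observation. overloaded is whatever
// signal the caller picks: a timeout, a 503, or latency above a threshold.
func (a *aimdLimit) onSample(overloaded bool) int {
	if overloaded {
		// Multiplicative decrease: shed capacity quickly.
		a.limit = int(float64(a.limit) * a.backoff)
		if a.limit < a.min {
			a.limit = a.min
		}
	} else if a.limit < a.max {
		// Additive increase: probe for headroom one slot at a time.
		a.limit++
	}
	return a.limit
}
```

The asymmetry is the point: recovery is slow and deliberate, while backing off is immediate.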
4. Circuit Breakers
A circuit breaker stops calling a dependency that’s failing, giving it room to recover and shielding your service from cascading failures.
import "github.com/sony/gobreaker/v2"

var breaker = gobreaker.NewCircuitBreaker[*PaymentResp](gobreaker.Settings{
	Name:        "payments",
	MaxRequests: 5,                // in half-open state
	Interval:    30 * time.Second, // counter reset window
	Timeout:     10 * time.Second, // open -> half-open
	ReadyToTrip: func(c gobreaker.Counts) bool {
		return c.Requests >= 20 && c.TotalFailures*100/c.Requests >= 50
	},
})

func charge(ctx context.Context, amount int64) (*PaymentResp, error) {
	return breaker.Execute(func() (*PaymentResp, error) {
		return paymentClient.Charge(ctx, amount)
	})
}
The breaker states:
          failures over threshold
CLOSED --------------------------> OPEN
   ^                                 |
   | all probes succeed              | timeout
   |                                 v
HALF-OPEN <--------------------------+   (probe a few requests)
Tuning notes from experience:
- ReadyToTrip should require a minimum request count (20 above). Without it, the first failure trips the breaker.
- Timeout (open -> half-open) of 10-30 seconds is usually right. Shorter and you flap; longer and recovery lags.
- MaxRequests in half-open is the canary: a few requests at a time to test recovery, not a flood.
The breaker is per-dependency, not per-process. A flaky payments service should not affect your inventory service calls.
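To see the three states as code rather than a diagram, here is a stripped-down state machine. It is an illustration, not gobreaker's implementation: miniBreaker counts consecutive failures instead of a failure ratio, does not cap concurrent probes in half-open, is not goroutine-safe, and takes time as unix seconds so transitions are easy to trace.

```go
package main

type state int

const (
	closed state = iota
	open
	halfOpen
)

// miniBreaker is an illustrative circuit breaker state machine.
type miniBreaker struct {
	st          state
	failures    int   // consecutive failures while closed
	successes   int   // consecutive successful probes while half-open
	openedAt    int64 // unix seconds when the breaker last opened
	maxFailures int   // closed -> open threshold
	probes      int   // successful probes needed to close again
	timeout     int64 // seconds to stay open before probing
}

// allow reports whether a call may proceed at time now. An open breaker
// moves to half-open once the timeout has elapsed.
func (b *miniBreaker) allow(now int64) bool {
	if b.st == open && now-b.openedAt >= b.timeout {
		b.st, b.successes = halfOpen, 0
	}
	return b.st != open
}

// record feeds the outcome of an allowed call back into the breaker.
func (b *miniBreaker) record(ok bool, now int64) {
	switch {
	case !ok && (b.st == halfOpen || b.failures+1 >= b.maxFailures):
		// A failed probe, or one failure too many, opens the breaker.
		b.st, b.openedAt, b.failures = open, now, 0
	case !ok:
		b.failures++
	case b.st == halfOpen:
		b.successes++
		if b.successes >= b.probes {
			b.st, b.failures = closed, 0
		}
	default:
		b.failures = 0 // a success while closed resets the streak
	}
}
```

One instance per dependency, built at startup, gives the isolation described above: the payments breaker can be open while the inventory breaker stays closed.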
5. Retries and Hedging
Retries are easy to get wrong. The right model is:
- Retry only idempotent operations
- Retry only on retryable errors (gRPC UNAVAILABLE, DEADLINE_EXCEEDED; HTTP 5xx and 429)
- Use exponential backoff with jitter
- Cap total attempts and total time
- For latency-sensitive reads, consider hedging
import (
	"context"
	"errors"
	"math/rand"
	"time"
)

// withRetry calls fn up to 4 times. Between attempts it sleeps 50ms, 100ms,
// then 200ms, each plus up to 100% random jitter. Total time is capped by
// ctx. retryable is assumed to be defined elsewhere.
func withRetry[T any](ctx context.Context, fn func(context.Context) (T, error)) (T, error) {
	var zero T
	for attempt := 0; attempt < 4; attempt++ {
		v, err := fn(ctx)
		if err == nil {
			return v, nil
		}
		if !retryable(err) {
			return zero, err
		}
		if attempt == 3 {
			return zero, err // out of attempts: surface the last error
		}
		backoff := time.Duration(1<<attempt) * 50 * time.Millisecond
		jitter := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(backoff + jitter):
		case <-ctx.Done():
			return zero, ctx.Err()
		}
	}
	return zero, errors.New("unreachable")
}
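The retryable helper is left undefined above. For the HTTP half of the list, one possible shape is sketched here; statusError is a hypothetical error type your client would need to return, and gRPC codes would need an equivalent check against the status package.

```go
package main

import (
	"errors"
	"net/http"
)

// statusError is a hypothetical error type that carries the HTTP status
// code of a failed response.
type statusError struct{ code int }

func (e *statusError) Error() string { return http.StatusText(e.code) }

// retryable reports whether err wraps a statusError with a 5xx or 429
// code. Anything else, including nil, is treated as not retryable.
func retryable(err error) bool {
	var se *statusError
	if !errors.As(err, &se) {
		return false
	}
	return se.code >= 500 || se.code == http.StatusTooManyRequests
}
```

Defaulting unknown errors to "not retryable" is the conservative choice; whether connection resets and similar transport errors belong on the list depends on whether the operation is idempotent.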
5.1 Hedging for Tail Latency
For reads where consistency allows it, send a second request if the first has not returned within a set delay, and use whichever finishes first:
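func hedged[T any](ctx context.Context, hedgeAt time.Duration, fn func(context.Context) (T, error)) (T, error) {
	type res struct {
		v   T
		err error
	}
	ch := make(chan res, 2) // buffered so the losing goroutine never blocks
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whichever attempt is still in flight on return

	call := func() { v, err := fn(ctx); ch <- res{v, err} }
	go call()

	timer := time.NewTimer(hedgeAt)
	defer timer.Stop()
	select {
	case r := <-ch:
		return r.v, r.err // primary finished before the hedge delay
	case <-timer.C:
	}

	go call() // hedge: a second attempt races the first
	r := <-ch
	if r.err != nil {
		// The first finisher failed; wait for the other attempt.
		r = <-ch
	}
	return r.v, r.err
}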
Hedge timing should be just above the p95 latency of the underlying call. Hedging at p50 sends a second request for about half of all calls, which adds roughly 50% load; hedging at p99 fires too rarely to move the tail much. The technique works because slow tails are usually one-off (a bad pod, a long GC pause), so the second request avoids the unlucky one.
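Deriving the hedge delay from recent latency samples can look like the sketch below. It uses the nearest-rank percentile; the function names and the 10% margin above p95 are assumptions for illustration, and a production system would more likely read the percentile from its metrics histogram.

```go
package main

import (
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 1) of the
// samples, or 0 for an empty slice. The input slice is not modified.
func percentile(samplesMs []float64, p float64) float64 {
	if len(samplesMs) == 0 {
		return 0
	}
	s := append([]float64(nil), samplesMs...)
	sort.Float64s(s)
	rank := int(math.Ceil(p * float64(len(s))))
	if rank < 1 {
		rank = 1
	}
	return s[rank-1]
}

// hedgeDelayMs picks a delay slightly above p95. The 10% margin is an
// assumption, not a recommendation from any library.
func hedgeDelayMs(samplesMs []float64) float64 {
	return percentile(samplesMs, 0.95) * 1.1
}
```

Recompute the delay periodically; a value tuned during a quiet week will hedge far too often once baseline latency shifts.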
6. Gateway-Level Rate Limits
The application limits above don’t help if traffic is overwhelming before it reaches the app. Push the coarse limits to the gateway.
6.1 Envoy Gateway
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata: { name: edge-rl }
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: public
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers: [{ name: x-api-key, type: Distinct }]
          limit: { requests: 1000, unit: Minute }
        - clientSelectors:
            # type: Distinct gives each source IP its own bucket;
            # without it all matching IPs share one.
            - sourceCIDR: { value: 0.0.0.0/0, type: Distinct }
          limit: { requests: 60, unit: Minute }
Two rules: 1000/min per API key, 60/min per IP as a fallback for unauthed traffic.
6.2 Kong
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata: { name: edge-rl }
plugin: rate-limiting
config:
  minute: 1000
  policy: redis
  redis_host: redis.cache.svc
  fault_tolerant: true
  limit_by: header
  header_name: x-api-key
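fault_tolerant: true is the fail-open switch. With it, Kong keeps proxying requests when it cannot reach Redis, which effectively disables rate limiting until Redis is back; with false, clients get 500 errors instead. Set it explicitly so the behavior is a decision rather than something you discover when Redis goes down.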
7. Common Pitfalls
7.1 Failing Open on Edge Limiter Outage
If your edge limiter’s backing store (Redis) goes down and your code fails open, your service is unprotected. If it fails closed, your service is down. Pick “fail open with alarm” by default, alert immediately, and degrade.
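In application code, "fail open with alarm" can be a thin wrapper around whatever limiter call can error. A minimal sketch, where allowFailOpen, errStoreDown, and the alarm callback are hypothetical names; the alarm would typically increment a metric that pages.

```go
package main

import "errors"

// errStoreDown stands in for whatever error the limiter's backing store
// returns when it is unreachable.
var errStoreDown = errors.New("limiter store unavailable")

// allowFailOpen wraps a limiter check that can fail (for example a Redis
// call). On error it admits the request and reports the failure to alarm,
// so the outage is loud even though traffic keeps flowing.
func allowFailOpen(check func() (bool, error), alarm func(error)) bool {
	allowed, err := check()
	if err != nil {
		alarm(err)
		return true
	}
	return allowed
}
```

A stricter variant would fall back to a local per-pod limiter on error instead of admitting everything, which keeps some protection in place during the outage.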
7.2 Retrying Non-Idempotent Writes
Retrying a POST /payments/charge results in double charges. Either make all writes idempotent (idempotency keys) or never retry them. The middle ground is a footgun.
7.3 Synchronous Cascading Retries
Service A retries calling B, which retries calling C, which retries calling D. A single deep stall in D becomes an exponential request fan-out. Set a Retry-Attempt header and decrement at each hop, or simply don’t retry across service boundaries — retry only at the top of the call tree.
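The header approach can be sketched as two small helpers. The header name comes from the text above; its semantics here (an integer count of retries still permitted downstream) are an assumption, as are the function names.

```go
package main

import (
	"net/http"
	"strconv"
)

// retryHeader carries the number of retries still permitted downstream.
const retryHeader = "Retry-Attempt"

// remainingRetries reads the budget from an incoming request's headers.
// A missing or malformed header is treated as the top of the call tree,
// which gets the full topLevel budget.
func remainingRetries(h http.Header, topLevel int) int {
	n, err := strconv.Atoi(h.Get(retryHeader))
	if err != nil || n < 0 {
		return topLevel
	}
	return n
}

// forward stamps outgoing headers with one fewer retry than this hop
// received, never going below zero.
func forward(out http.Header, remaining int) {
	if remaining > 0 {
		remaining--
	}
	out.Set(retryHeader, strconv.Itoa(remaining))
}
```

Each service passes remainingRetries as its attempt cap and calls forward on every outbound request. Strip or overwrite the header at the edge so external clients cannot grant themselves a budget.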
7.4 Circuit Breaker Per Request
gobreaker.NewCircuitBreaker inside a handler creates a new breaker per request. State doesn’t persist. Build breakers at startup, share across requests.
7.5 No Budget on Concurrent Hedges
When a downstream slows down, nearly every request crosses the hedge delay, so hedging approaches double the load just as the dependency is struggling. The fix is a global hedge budget (e.g., “at most 5% of requests may hedge”), enforced with a separate token bucket.
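One way to express a percentage budget as a token bucket is to have every primary request deposit a fraction of a token and every hedge spend a whole one. The sketch below assumes that design; hedgeBudget and its fields are hypothetical names.

```go
package main

import "sync"

// hedgeBudget lets roughly `ratio` of requests hedge: each primary request
// deposits ratio tokens, and each hedge spends one whole token.
type hedgeBudget struct {
	mu     sync.Mutex
	tokens float64
	ratio  float64 // e.g. 0.05 for a 5% budget
	max    float64 // cap on banked tokens so quiet periods can't fund a storm
}

// onRequest credits the budget; call it once per primary request.
func (h *hedgeBudget) onRequest() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.tokens += h.ratio
	if h.tokens > h.max {
		h.tokens = h.max
	}
}

// tryHedge reports whether a hedge may be sent, spending a token if so.
func (h *hedgeBudget) tryHedge() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.tokens < 1 {
		return false
	}
	h.tokens--
	return true
}
```

In the hedged function above, the check would sit right before the second attempt is launched: if tryHedge returns false, keep waiting on the primary instead.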
8. Troubleshooting
8.1 p99 Latency Doubled Right After Adding Retries
Your retries are amplifying a slow downstream. Each first attempt times out, the retry waits for jittered backoff, then completes. Result: p99 = first_timeout + backoff + downstream_p50. Either lower the per-attempt deadline (so retries get more chances) or skip retries on this path.
8.2 Rate Limiter Counts Are Wrong After Pod Restart
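Local counters reset on restart, so a restarted pod briefly hands every client a full burst. If precise enforcement matters, you must use a distributed store such as the Redis-backed limiter from section 2.2. If “approximately N per second across the fleet” is good enough, per-pod limiters are fine and the reset is noise.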
8.3 Circuit Breaker Flapping
Half-open probes succeed, breaker closes, real traffic surges in, downstream falls back over. Either raise MaxRequests so more consecutive probes must succeed before the breaker closes, or raise Timeout so it holds open longer.
9. Wrapping Up
Resilience is a system property, not a feature. The four patterns above — rate limits at two layers, concurrency caps, circuit breakers, and disciplined retries — are the spine of every production API I’ve shipped that’s stayed up under unexpected load. None of them is hard individually. Composing them, with sensible defaults and the right observability, is where the work is.
The Google SRE book chapter on handling overload is the canonical read on this topic, and the patterns here are mostly its descendants applied to 2025 stacks. With this piece, the July series is complete: schemas, transports, gateways, security, observability, and resilience. The boring fundamentals are still the ones that ship.