Rate Limiting Algorithms, Token Bucket, Leaky Bucket, Sliding Window
TL;DR — Token bucket: most common, allows bursts. Leaky bucket: smooths traffic to constant rate. Fixed window: simplest, boundary issues. Sliding window: smoother but more expensive. Use token bucket for “user can do X actions per minute”; sliding window when fairness near boundaries matters.
After OAuth, the operational layer. Every API needs rate limiting. Wrong algorithm = either useless or hostile to users.
Token bucket
bucket capacity: 100 tokens
refill rate: 10 tokens/sec
each request: takes 1 token
no tokens available → reject
Tokens accumulate up to the cap when nobody’s making requests. Bursts up to the cap allowed; sustained rate limited.
import time
class TokenBucket:
def __init__(self, capacity: int, refill_rate: float):
self.capacity = capacity
self.refill_rate = refill_rate
self.tokens = capacity
self.last_refill = time.time()
def consume(self, n: int = 1) -> bool:
now = time.time()
elapsed = now - self.last_refill
self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
self.last_refill = now
if self.tokens >= n:
self.tokens -= n
return True
return False
Pros: simple, burst-friendly, intuitive.
Cons: requires per-key state (last refill time + current tokens).
Use for: most user-facing rate limits (“10 logins per minute”).
Leaky bucket
bucket capacity: 100 requests
leak rate: 10 requests/sec
incoming request: added to bucket
bucket full → reject
Same shape as token bucket but inverted. Smooths traffic to constant output rate.
Less common in HTTP rate limiting because it queues requests (delays them). More common for traffic shaping (network ingress).
Fixed window
window: 1 minute
limit: 100 requests/minute
count requests in current minute
count > 100 → reject
window resets at minute boundary
class FixedWindow:
def __init__(self, limit: int, window_seconds: int):
self.limit = limit
self.window_seconds = window_seconds
self.window_start = int(time.time()) // window_seconds
self.count = 0
def allow(self) -> bool:
current_window = int(time.time()) // self.window_seconds
if current_window != self.window_start:
self.window_start = current_window
self.count = 0
if self.count < self.limit:
self.count += 1
return True
return False
Pros: simplest possible.
Cons: boundary effect. A burst at 11:59:30 (100 requests) and another at 12:00:30 (100 requests) = 200 requests in 60 seconds despite a “100/minute” limit.
Use for: very simple rate limits where the boundary issue is tolerable.
Sliding window log
Track timestamps of recent requests in a list:
class SlidingWindowLog:
def __init__(self, limit: int, window_seconds: int):
self.limit = limit
self.window_seconds = window_seconds
self.requests = []
def allow(self) -> bool:
now = time.time()
cutoff = now - self.window_seconds
self.requests = [t for t in self.requests if t > cutoff]
if len(self.requests) < self.limit:
self.requests.append(now)
return True
return False
Pros: precise.
Cons: O(N) memory per key.
Sliding window counter
Compromise between fixed and sliding:
class SlidingWindowCounter:
def __init__(self, limit: int, window_seconds: int):
self.limit = limit
self.window_seconds = window_seconds
self.current_window = int(time.time()) // window_seconds
self.current_count = 0
self.previous_count = 0
def allow(self) -> bool:
now = time.time()
current = int(now) // self.window_seconds
if current != self.current_window:
self.previous_count = self.current_count if current == self.current_window + 1 else 0
self.current_count = 0
self.current_window = current
# Linear interpolation: how much of the previous window is still in the sliding window
elapsed_in_current = now - (current * self.window_seconds)
weight = (self.window_seconds - elapsed_in_current) / self.window_seconds
approx_count = self.current_count + self.previous_count * weight
if approx_count < self.limit:
self.current_count += 1
return True
return False
Pros: O(1) state, approximate smoothness.
Cons: approximation.
Use for: when fairness near window boundaries matters but you don’t want O(N) memory.
Algorithm choice
For most rate limits:
- “X requests per minute” (general API): Token bucket
- “X failed logins per hour”: Token bucket
- “Burst tolerance important”: Token bucket with appropriate capacity
- “Strict per-second fairness”: Sliding window counter
- “Simple is fine”: Fixed window
Token bucket wins most cases. Sliding window for high-traffic APIs where the boundary effect matters.
What to limit on
Three keys:
IP address: limits by network endpoint. Bypassed by:
- Mobile devices on shared NAT (lots of users behind one IP)
- Cloud botnets (many IPs)
- Proxy services
API key / user ID: limits per identified entity. Bypassed by:
- Anonymous endpoints (login!)
Combination: per-IP for unauthenticated; per-user for authenticated.
For login: IP + email combo. Slows credential stuffing across many emails from one IP, and many IPs trying one email.
Response headers
Communicate rate limit state via standard headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1668447600
Retry-After: 15
(GitHub conventions; widely adopted. RFC standard pending.)
When you reject (429):
HTTP/1.1 429 Too Many Requests
Retry-After: 15
{"error": "rate limit exceeded; retry in 15 seconds"}
Tells the client when to try again. Good clients use this.
In-process vs distributed
The above examples are in-process. Works for single-instance services.
For multi-instance services (auto-scaling, k8s): in-process rate limits are per-pod. 10-pod deployment with 100/min limit = 1000/min cluster-wide. Useless.
Distributed rate limiting requires a shared store. Redis is the standard. Covered Wednesday in Redis rate limiting.
Common Pitfalls
Rate limit per IP only. Cloud botnets bypass. Combine with user ID where possible.
Rate limit only on login. Whole-API rate limits matter too.
No Retry-After header. Clients don’t know when to retry.
Long penalty after one rejection. Network blips, retries trigger. Use short windows (seconds), not hour-long bans.
Single-window fixed boundary issue. 200 req in 60s allowed. Use sliding window.
Same limit for all endpoints. Login is more sensitive than /health. Per-route limits.
Wrapping Up
Token bucket as default. Sliding window for boundary-sensitive cases. Per-IP + per-user combo for sensitive endpoints. Wednesday: Redis-backed distributed rate limiting.