background-shape
Rate Limiting Algorithms, Token Bucket, Leaky Bucket, Sliding Window
November 14, 2022 · 5 min read · by Muhammad Amal programming

TL;DR — Token bucket: most common, allows bursts. Leaky bucket: smooths traffic to constant rate. Fixed window: simplest, boundary issues. Sliding window: smoother but more expensive. Use token bucket for “user can do X actions per minute”; sliding window when fairness near boundaries matters.

After OAuth, the operational layer. Every API needs rate limiting. Wrong algorithm = either useless or hostile to users.

Token bucket

bucket capacity: 100 tokens
refill rate: 10 tokens/sec
each request: takes 1 token
no tokens available → reject

Tokens accumulate up to the cap when nobody’s making requests. Bursts up to the cap allowed; sustained rate limited.

import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.time()

    def consume(self, n: int = 1) -> bool:
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

Pros: simple, burst-friendly, intuitive.

Cons: requires per-key state (last refill time + current tokens).

Use for: most user-facing rate limits (“10 logins per minute”).

Leaky bucket

bucket capacity: 100 requests
leak rate: 10 requests/sec
incoming request: added to bucket
bucket full → reject

Same shape as token bucket but inverted. Smooths traffic to constant output rate.

Less common in HTTP rate limiting because it queues requests (delays them). More common for traffic shaping (network ingress).

Fixed window

window: 1 minute
limit: 100 requests/minute
count requests in current minute
count > 100 → reject
window resets at minute boundary
class FixedWindow:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = int(time.time()) // window_seconds
        self.count = 0

    def allow(self) -> bool:
        current_window = int(time.time()) // self.window_seconds
        if current_window != self.window_start:
            self.window_start = current_window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

Pros: simplest possible.

Cons: boundary effect. A burst at 11:59:30 (100 requests) and another at 12:00:30 (100 requests) = 200 requests in 60 seconds despite a “100/minute” limit.

Use for: very simple rate limits where the boundary issue is tolerable.

Sliding window log

Track timestamps of recent requests in a list:

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.requests = []

    def allow(self) -> bool:
        now = time.time()
        cutoff = now - self.window_seconds
        self.requests = [t for t in self.requests if t > cutoff]
        if len(self.requests) < self.limit:
            self.requests.append(now)
            return True
        return False

Pros: precise.

Cons: O(N) memory per key.

Sliding window counter

Compromise between fixed and sliding:

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = int(time.time()) // window_seconds
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = time.time()
        current = int(now) // self.window_seconds
        if current != self.current_window:
            self.previous_count = self.current_count if current == self.current_window + 1 else 0
            self.current_count = 0
            self.current_window = current

        # Linear interpolation: how much of the previous window is still in the sliding window
        elapsed_in_current = now - (current * self.window_seconds)
        weight = (self.window_seconds - elapsed_in_current) / self.window_seconds
        approx_count = self.current_count + self.previous_count * weight

        if approx_count < self.limit:
            self.current_count += 1
            return True
        return False

Pros: O(1) state, approximate smoothness.

Cons: approximation.

Use for: when fairness near window boundaries matters but you don’t want O(N) memory.

Algorithm choice

For most rate limits:

  • “X requests per minute” (general API): Token bucket
  • “X failed logins per hour”: Token bucket
  • “Burst tolerance important”: Token bucket with appropriate capacity
  • “Strict per-second fairness”: Sliding window counter
  • “Simple is fine”: Fixed window

Token bucket wins most cases. Sliding window for high-traffic APIs where the boundary effect matters.

What to limit on

Three keys:

IP address: limits by network endpoint. Bypassed by:

  • Mobile devices on shared NAT (lots of users behind one IP)
  • Cloud botnets (many IPs)
  • Proxy services

API key / user ID: limits per identified entity. Bypassed by:

  • Anonymous endpoints (login!)

Combination: per-IP for unauthenticated; per-user for authenticated.

For login: IP + email combo. Slows credential stuffing across many emails from one IP, and many IPs trying one email.

Response headers

Communicate rate limit state via standard headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1668447600
Retry-After: 15

(GitHub conventions; widely adopted. RFC standard pending.)

When you reject (429):

HTTP/1.1 429 Too Many Requests
Retry-After: 15

{"error": "rate limit exceeded; retry in 15 seconds"}

Tells the client when to try again. Good clients use this.

In-process vs distributed

The above examples are in-process. Works for single-instance services.

For multi-instance services (auto-scaling, k8s): in-process rate limits are per-pod. 10-pod deployment with 100/min limit = 1000/min cluster-wide. Useless.

Distributed rate limiting requires a shared store. Redis is the standard. Covered Wednesday in Redis rate limiting.

Common Pitfalls

Rate limit per IP only. Cloud botnets bypass. Combine with user ID where possible.

Rate limit only on login. Whole-API rate limits matter too.

No Retry-After header. Clients don’t know when to retry.

Long penalty after one rejection. Network blips, retries trigger. Use short windows (seconds), not hour-long bans.

Single-window fixed boundary issue. 200 req in 60s allowed. Use sliding window.

Same limit for all endpoints. Login is more sensitive than /health. Per-route limits.

Wrapping Up

Token bucket as default. Sliding window for boundary-sensitive cases. Per-IP + per-user combo for sensitive endpoints. Wednesday: Redis-backed distributed rate limiting.