Rate Limiting at Scale, Token Bucket and Sliding Window in Redis

Rate Limiting at Scale, Token Bucket and Sliding Window in Redis

July 17, 2024 · 8 min read · by Muhammad Amal programming

TL;DR — Token bucket is what you want for most API rate limiting. Sliding window log is more accurate but expensive. Sliding window counter is the pragmatic middle. Implement all three with Lua scripts and benchmark on your traffic — defaults from blog posts are almost always wrong for your case.

Rate limiting looks like a solved problem until you have to ship one at scale. Then it’s a thicket of “fixed window has burst issues”, “sliding window log uses too much memory”, “token bucket needs careful refill math”, “what about distributed clock skew”, and so on. Most of the public guides simplify too much.

This is the version I wish I’d had on the desk during a recent migration off a buggy in-memory limiter to something Redis-backed. I’ll cover the three algorithms that matter, the Redis primitives you actually need, working Go code, and the failure modes I’ve personally caused in production.

The setup assumption: you have a Go service (or several) that need a shared, distributed rate limit. Redis is your coordination layer. Latency budget for the limiter is single-digit milliseconds. If any of those don’t hold, this post is the wrong post.

Why not just use a library?

There are good Go rate limiting libraries — golang.org/x/time/rate (token bucket, in-memory), redis_rate from go-redis, uber-go/ratelimit. They work. The reason to roll your own is that the failure modes of an off-the-shelf library are not documented in a way that helps you on-call. You should know how your limiter behaves when Redis hiccups, because it will hiccup.

Reading the source of one of these libraries is also the fastest way to learn the algorithms properly. The implementations below are similar to what redis_rate does internally, with the corners I care about exposed.

Algorithm 1, fixed window

Just for completeness, because people still ship it: increment a counter keyed by user:window-start, set TTL, allow if below limit.

-- KEYS[1] = "rl:user:123:1721234400"  (window start as unix second)
-- ARGV[1] = limit
-- ARGV[2] = ttl_seconds
local n = redis.call("INCR", KEYS[1])
if n == 1 then
  redis.call("EXPIRE", KEYS[1], ARGV[2])
end
if n > tonumber(ARGV[1]) then
  return {0, n}
end
return {1, n}

Don’t ship this for a public API. The burst problem is real: a user can spend their full quota in the last second of one window and the first second of the next, doubling their effective rate. Fine for internal services with cooperative clients. Not fine when a single user can knock your service over.

Algorithm 2, sliding window log

The most accurate algorithm. Store a sorted set of request timestamps, remove old ones, check size against limit.

-- KEYS[1] = "rl:user:123"
-- ARGV[1] = now_ms
-- ARGV[2] = window_ms
-- ARGV[3] = limit
-- ARGV[4] = request_id (unique per request, e.g. UUID)
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local id = ARGV[4]

redis.call("ZREMRANGEBYSCORE", key, 0, now - window)
local count = redis.call("ZCARD", key)
if count >= limit then
  return {0, count}
end
redis.call("ZADD", key, now, id)
redis.call("PEXPIRE", key, window)
return {1, count + 1}

This works, it’s accurate, and it’s memory-hungry. Every request stores a member in the sorted set. A user doing 1000 requests/minute with a 1-minute window costs you ~1000 entries per user at all times. Multiply by active users. For a public API with a million daily actives, this is too much.

I use sliding window log only for low-cardinality, high-value limits — global limits, per-tenant limits where there are dozens of tenants, never per-user limits at scale.

Algorithm 3, sliding window counter

The pragmatic middle. Approximate the sliding window using two fixed-window counters, weighted by how much of the previous window still falls inside the sliding window.

-- KEYS[1] = base key, e.g. "rl:user:123"
-- ARGV[1] = now_ms
-- ARGV[2] = window_ms
-- ARGV[3] = limit
local base = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

local current_window = math.floor(now / window)
local previous_window = current_window - 1
local elapsed_in_current = now - (current_window * window)
local weight_of_previous = 1 - (elapsed_in_current / window)

local current_key = base .. ":" .. current_window
local previous_key = base .. ":" .. previous_window

local prev_count = tonumber(redis.call("GET", previous_key) or "0")
local curr_count = tonumber(redis.call("GET", current_key) or "0")

local effective = (prev_count * weight_of_previous) + curr_count
if effective + 1 > limit then
  return {0, math.floor(effective)}
end

redis.call("INCR", current_key)
redis.call("PEXPIRE", current_key, window * 2)
return {1, math.floor(effective) + 1}

This is the algorithm I default to for per-user API rate limiting. Memory is O(2) per user. Accuracy is within a few percent of true sliding window for any reasonable distribution of traffic. The math is simple enough to reason about in an incident.

Algorithm 4, token bucket

Token bucket is the right model when you want to allow short bursts but cap sustained throughput. It maps cleanly to “user X can make 100 requests per minute with a burst of 20”.

-- KEYS[1] = "rl:tb:user:123"
-- ARGV[1] = now_ms
-- ARGV[2] = capacity (max tokens)
-- ARGV[3] = refill_rate_per_ms (e.g. 100/60000 for 100/min)
-- ARGV[4] = cost (tokens to consume, usually 1)
local key = KEYS[1]
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local rate = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local data = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1]) or capacity
local ts = tonumber(data[2]) or now

local elapsed = math.max(0, now - ts)
tokens = math.min(capacity, tokens + elapsed * rate)

if tokens < cost then
  redis.call("HMSET", key, "tokens", tokens, "ts", now)
  redis.call("PEXPIRE", key, math.ceil(capacity / rate))
  return {0, tokens}
end

tokens = tokens - cost
redis.call("HMSET", key, "tokens", tokens, "ts", now)
redis.call("PEXPIRE", key, math.ceil(capacity / rate))
return {1, tokens}

This is what I ship for almost everything in 2024. The capacity parameter gives you the burst allowance, the rate parameter gives you sustained throughput, and the cost parameter lets you charge different amounts for different operations (e.g. a search query costs 5 tokens, a single GET costs 1).

Go wiring

The Go side is straightforward with go-redis/v9. Load the Lua script once at startup, evaluate by SHA, fall back to full eval on NOSCRIPT error.

package ratelimit

import (
    "context"
    _ "embed"
    "time"

    "github.com/redis/go-redis/v9"
)

//go:embed token_bucket.lua
var tokenBucketScript string

type TokenBucket struct {
    rdb      *redis.Client
    script   *redis.Script
    capacity int
    rate     float64 // tokens per ms
}

func NewTokenBucket(rdb *redis.Client, capacity int, perMinute int) *TokenBucket {
    return &TokenBucket{
        rdb:      rdb,
        script:   redis.NewScript(tokenBucketScript),
        capacity: capacity,
        rate:     float64(perMinute) / 60000.0,
    }
}

type Result struct {
    Allowed   bool
    Remaining int
}

func (t *TokenBucket) Allow(ctx context.Context, key string, cost int) (Result, error) {
    now := time.Now().UnixMilli()
    res, err := t.script.Run(ctx, t.rdb, []string{key}, now, t.capacity, t.rate, cost).Slice()
    if err != nil {
        return Result{}, err
    }
    allowed, _ := res[0].(int64)
    remaining, _ := res[1].(int64)
    return Result{
        Allowed:   allowed == 1,
        Remaining: int(remaining),
    }, nil
}

redis.NewScript handles the EVALSHA fast path with NOSCRIPT fallback automatically. The embedded Lua keeps the source close to the Go code that calls it, which makes review easier.

The failure mode you must plan for

Redis goes down. Or its latency goes from 1ms to 100ms. What does your service do?

Two stances, both valid, you must pick one:

Fail open. If Redis is unreachable, allow the request. You lose rate limiting until Redis recovers. Acceptable for “convenience” limits where overrun is annoying but not catastrophic.

Fail closed. If Redis is unreachable, deny the request. You preserve the limit but you’ve coupled your availability to Redis. Acceptable when the limit protects something whose overrun is worse than downtime.

In Go, the pattern looks like:

res, err := limiter.Allow(ctx, key, 1)
if err != nil {
    if failOpen {
        return true, nil
    }
    return false, err
}
return res.Allowed, nil

I default to fail-open for API rate limiting, fail-closed for limits that protect billing or quota enforcement. Pick deliberately and write it in the runbook.

For more on where rate limiting sits in the broader API stack, choosing an API gateway in 2024 covers gateway-level rate limiting — sometimes the right answer is to do this at the gateway and not in your service.

Clients want their limits

Whatever you ship, return rate limit info to the client. The de facto standard headers in 2024:

RateLimit-Limit: 100
RateLimit-Remaining: 43
RateLimit-Reset: 27

The IETF draft is converging on this format. Use it. On a 429, also include Retry-After: <seconds>. Clients with decent retry libraries (Stripe SDK, Google SDK) will honor this and back off correctly. Without it, they hammer you.

Common Pitfalls

Per-request Lua script loading. Use EVALSHA with NOSCRIPT fallback (go-redis’s Script type handles this). Sending the full Lua source every request wastes bandwidth and CPU.
Forgetting PEXPIRE. Without TTL, your Redis grows unboundedly. Every limiter key needs a TTL slightly longer than the window.
Using time.Now() on each application server. Clock skew across servers shifts your window boundaries. For most use cases this is fine; for strict accuracy, use Redis’s TIME command and pass it as a script argument.
Not testing under partition. Run your rate limiter while killing Redis. Watch what happens. Decide if that’s what you want.
Tying retry-after to the bucket refill time only. If 100 clients all hit a limit at once and you tell them all to retry in 1 second, you get a thundering herd. Add jitter on the client side, and ideally suggest a randomized backoff in your Retry-After.

Wrapping Up

Rate limiting is one of those problems where you can ship the wrong thing for a long time before it bites you. The day it bites — usually a black-friday-shaped traffic spike, or a buggy client that retries hard — is the day you learn whether you got the algorithm and the failure mode right.

My defaults in 2024: token bucket via Lua script, redis-go v9, fail-open, IETF RateLimit-* headers in responses, jittered Retry-After on 429. That setup has held up across a few hundred million requests a day in my current job without surprises. Your traffic shape may differ, but the framework — pick algorithm deliberately, plan for Redis failure, surface limits to clients — generalizes.

Why not just use a library?

Algorithm 1, fixed window

Algorithm 2, sliding window log

Algorithm 3, sliding window counter

Algorithm 4, token bucket

Go wiring

The failure mode you must plan for

Clients want their limits

Common Pitfalls

Wrapping Up

Related posts

Rate Limiting and Resilience Patterns for Modern APIs

Distributed Rate Limiting with Redis

mTLS, Service Mesh or Application Layer in 2024

OpenAPI First API Design in Go, oapi-codegen in 2024

Choosing an API Gateway in 2024, Kong, Tyk, and Envoy Gateway

Securing Go Microservices with JWT, Patterns That Hold Up

gRPC for Internal Services in Go, A buf Powered Workflow

Federated GraphQL with Apollo Router, Patterns for 2024

Let’s Start a Project