Rate Limiting at Scale, Token Bucket and Sliding Window in Redis
TL;DR — Token bucket is what you want for most API rate limiting. Sliding window log is more accurate but expensive. Sliding window counter is the pragmatic middle. Implement all three with Lua scripts and benchmark on your traffic — defaults from blog posts are almost always wrong for your case.
Rate limiting looks like a solved problem until you have to ship one at scale. Then it’s a thicket of “fixed window has burst issues”, “sliding window log uses too much memory”, “token bucket needs careful refill math”, “what about distributed clock skew”, and so on. Most of the public guides simplify too much.
This is the version I wish I’d had on the desk during a recent migration off a buggy in-memory limiter to something Redis-backed. I’ll cover the three algorithms that matter, the Redis primitives you actually need, working Go code, and the failure modes I’ve personally caused in production.
The setup assumption: you have a Go service (or several) that need a shared, distributed rate limit. Redis is your coordination layer. Latency budget for the limiter is single-digit milliseconds. If any of those don’t hold, this post is the wrong post.
Why not just use a library?
There are good Go rate limiting libraries — golang.org/x/time/rate (token bucket, in-memory), redis_rate from go-redis, uber-go/ratelimit. They work. The reason to roll your own is that the failure modes of an off-the-shelf library are not documented in a way that helps you on-call. You should know how your limiter behaves when Redis hiccups, because it will hiccup.
Reading the source of one of these libraries is also the fastest way to learn the algorithms properly. The implementations below are similar to what redis_rate does internally, with the corners I care about exposed.
Algorithm 1, fixed window
Just for completeness, because people still ship it: increment a counter keyed by user:window-start, set TTL, allow if below limit.
-- KEYS[1] = "rl:user:123:1721234400" (window start as unix second)
-- ARGV[1] = limit
-- ARGV[2] = ttl_seconds
local n = redis.call("INCR", KEYS[1])
if n == 1 then
redis.call("EXPIRE", KEYS[1], ARGV[2])
end
if n > tonumber(ARGV[1]) then
return {0, n}
end
return {1, n}
Don’t ship this for a public API. The burst problem is real: a user can spend their full quota in the last second of one window and the first second of the next, doubling their effective rate. Fine for internal services with cooperative clients. Not fine when a single user can knock your service over.
Algorithm 2, sliding window log
The most accurate algorithm. Store a sorted set of request timestamps, remove old ones, check size against limit.
-- KEYS[1] = "rl:user:123"
-- ARGV[1] = now_ms
-- ARGV[2] = window_ms
-- ARGV[3] = limit
-- ARGV[4] = request_id (unique per request, e.g. UUID)
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local id = ARGV[4]
redis.call("ZREMRANGEBYSCORE", key, 0, now - window)
local count = redis.call("ZCARD", key)
if count >= limit then
return {0, count}
end
redis.call("ZADD", key, now, id)
redis.call("PEXPIRE", key, window)
return {1, count + 1}
This works, it’s accurate, and it’s memory-hungry. Every request stores a member in the sorted set. A user doing 1000 requests/minute with a 1-minute window costs you ~1000 entries per user at all times. Multiply by active users. For a public API with a million daily actives, this is too much.
I use sliding window log only for low-cardinality, high-value limits — global limits, per-tenant limits where there are dozens of tenants, never per-user limits at scale.
Algorithm 3, sliding window counter
The pragmatic middle. Approximate the sliding window using two fixed-window counters, weighted by how much of the previous window still falls inside the sliding window.
-- KEYS[1] = base key, e.g. "rl:user:123"
-- ARGV[1] = now_ms
-- ARGV[2] = window_ms
-- ARGV[3] = limit
local base = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local current_window = math.floor(now / window)
local previous_window = current_window - 1
local elapsed_in_current = now - (current_window * window)
local weight_of_previous = 1 - (elapsed_in_current / window)
local current_key = base .. ":" .. current_window
local previous_key = base .. ":" .. previous_window
local prev_count = tonumber(redis.call("GET", previous_key) or "0")
local curr_count = tonumber(redis.call("GET", current_key) or "0")
local effective = (prev_count * weight_of_previous) + curr_count
if effective + 1 > limit then
return {0, math.floor(effective)}
end
redis.call("INCR", current_key)
redis.call("PEXPIRE", current_key, window * 2)
return {1, math.floor(effective) + 1}
This is the algorithm I default to for per-user API rate limiting. Memory is O(2) per user. Accuracy is within a few percent of true sliding window for any reasonable distribution of traffic. The math is simple enough to reason about in an incident.
Algorithm 4, token bucket
Token bucket is the right model when you want to allow short bursts but cap sustained throughput. It maps cleanly to “user X can make 100 requests per minute with a burst of 20”.
-- KEYS[1] = "rl:tb:user:123"
-- ARGV[1] = now_ms
-- ARGV[2] = capacity (max tokens)
-- ARGV[3] = refill_rate_per_ms (e.g. 100/60000 for 100/min)
-- ARGV[4] = cost (tokens to consume, usually 1)
local key = KEYS[1]
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local rate = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local data = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1]) or capacity
local ts = tonumber(data[2]) or now
local elapsed = math.max(0, now - ts)
tokens = math.min(capacity, tokens + elapsed * rate)
if tokens < cost then
redis.call("HMSET", key, "tokens", tokens, "ts", now)
redis.call("PEXPIRE", key, math.ceil(capacity / rate))
return {0, tokens}
end
tokens = tokens - cost
redis.call("HMSET", key, "tokens", tokens, "ts", now)
redis.call("PEXPIRE", key, math.ceil(capacity / rate))
return {1, tokens}
This is what I ship for almost everything in 2024. The capacity parameter gives you the burst allowance, the rate parameter gives you sustained throughput, and the cost parameter lets you charge different amounts for different operations (e.g. a search query costs 5 tokens, a single GET costs 1).
Go wiring
The Go side is straightforward with go-redis/v9. Load the Lua script once at startup, evaluate by SHA, fall back to full eval on NOSCRIPT error.
package ratelimit
import (
"context"
_ "embed"
"time"
"github.com/redis/go-redis/v9"
)
//go:embed token_bucket.lua
var tokenBucketScript string
type TokenBucket struct {
rdb *redis.Client
script *redis.Script
capacity int
rate float64 // tokens per ms
}
func NewTokenBucket(rdb *redis.Client, capacity int, perMinute int) *TokenBucket {
return &TokenBucket{
rdb: rdb,
script: redis.NewScript(tokenBucketScript),
capacity: capacity,
rate: float64(perMinute) / 60000.0,
}
}
type Result struct {
Allowed bool
Remaining int
}
func (t *TokenBucket) Allow(ctx context.Context, key string, cost int) (Result, error) {
now := time.Now().UnixMilli()
res, err := t.script.Run(ctx, t.rdb, []string{key}, now, t.capacity, t.rate, cost).Slice()
if err != nil {
return Result{}, err
}
allowed, _ := res[0].(int64)
remaining, _ := res[1].(int64)
return Result{
Allowed: allowed == 1,
Remaining: int(remaining),
}, nil
}
redis.NewScript handles the EVALSHA fast path with NOSCRIPT fallback automatically. The embedded Lua keeps the source close to the Go code that calls it, which makes review easier.
The failure mode you must plan for
Redis goes down. Or its latency goes from 1ms to 100ms. What does your service do?
Two stances, both valid, you must pick one:
Fail open. If Redis is unreachable, allow the request. You lose rate limiting until Redis recovers. Acceptable for “convenience” limits where overrun is annoying but not catastrophic.
Fail closed. If Redis is unreachable, deny the request. You preserve the limit but you’ve coupled your availability to Redis. Acceptable when the limit protects something whose overrun is worse than downtime.
In Go, the pattern looks like:
res, err := limiter.Allow(ctx, key, 1)
if err != nil {
if failOpen {
return true, nil
}
return false, err
}
return res.Allowed, nil
I default to fail-open for API rate limiting, fail-closed for limits that protect billing or quota enforcement. Pick deliberately and write it in the runbook.
For more on where rate limiting sits in the broader API stack, choosing an API gateway in 2024 covers gateway-level rate limiting — sometimes the right answer is to do this at the gateway and not in your service.
Clients want their limits
Whatever you ship, return rate limit info to the client. The de facto standard headers in 2024:
RateLimit-Limit: 100
RateLimit-Remaining: 43
RateLimit-Reset: 27
The IETF draft is converging on this format. Use it. On a 429, also include Retry-After: <seconds>. Clients with decent retry libraries (Stripe SDK, Google SDK) will honor this and back off correctly. Without it, they hammer you.
Common Pitfalls
- Per-request Lua script loading. Use
EVALSHAwithNOSCRIPTfallback (go-redis’sScripttype handles this). Sending the full Lua source every request wastes bandwidth and CPU. - Forgetting
PEXPIRE. Without TTL, your Redis grows unboundedly. Every limiter key needs a TTL slightly longer than the window. - Using
time.Now()on each application server. Clock skew across servers shifts your window boundaries. For most use cases this is fine; for strict accuracy, use Redis’sTIMEcommand and pass it as a script argument. - Not testing under partition. Run your rate limiter while killing Redis. Watch what happens. Decide if that’s what you want.
- Tying retry-after to the bucket refill time only. If 100 clients all hit a limit at once and you tell them all to retry in 1 second, you get a thundering herd. Add jitter on the client side, and ideally suggest a randomized backoff in your
Retry-After.
Wrapping Up
Rate limiting is one of those problems where you can ship the wrong thing for a long time before it bites you. The day it bites — usually a black-friday-shaped traffic spike, or a buggy client that retries hard — is the day you learn whether you got the algorithm and the failure mode right.
My defaults in 2024: token bucket via Lua script, redis-go v9, fail-open, IETF RateLimit-* headers in responses, jittered Retry-After on 429. That setup has held up across a few hundred million requests a day in my current job without surprises. Your traffic shape may differ, but the framework — pick algorithm deliberately, plan for Redis failure, surface limits to clients — generalizes.