High Availability Postgres with Patroni 4.0, A Step by Step Setup
TL;DR — Patroni 4.0 with etcd and HAProxy gives you a battle-tested Postgres HA setup in a few hours. The tricky parts are network partitions, failover timing, and making your application reconnect cleanly. Plan for these or your “highly available” cluster becomes “occasionally available.”
I’ve built or inherited Patroni clusters at six different companies. Every single one had failover bugs in the first month of production, and every single bug was something the documentation warned about and somebody ignored. HA Postgres is not a thing you set up once and forget. It’s a system that needs ongoing care, monitoring, and occasional drills.
Patroni 4.0 (released late 2024) is the version I’d recommend for new deployments. It supports Postgres 17 cleanly, the new health-check probes are better, and the Kubernetes integration finally stopped being a second-class citizen. This guide walks through a real production setup on bare VMs. If you’re running on Kubernetes, the concepts translate but the operator landscape (Zalando, CrunchyData, StackGres) is its own topic.
Before you start, you should be comfortable administering Postgres and understand the basics of streaming replication.
1. The Architecture
A minimum production Patroni setup has:
- 3 Postgres nodes (1 primary, 2 replicas).
- 3 etcd nodes (for consensus, can run on the same boxes).
- 2 HAProxy nodes (for client routing).
- Either keepalived or a cloud load balancer for HAProxy itself.
        +---------+        +---------+
        |   App   |        |   App   |
        +----+----+        +----+----+
             |                  |
             v                  v
           +-----------------------+
           |      HAProxy VIP      |
           +-----------+-----------+
                       |
           +-----------+-----------+
           |                       |
      +----v----+             +----v----+
      |HAProxy 1|             |HAProxy 2|
      +----+----+             +----+----+
           |                       |
           +-----------+-----------+
                       |
        +--------------+--------------+
        |              |              |
   +----v----+    +----v----+    +----v----+
   | Patroni |    | Patroni |    | Patroni |
   | + PG 1  |    | + PG 2  |    | + PG 3  |
   +----+----+    +----+----+    +----+----+
        |              |              |
   +----v--------------v--------------v----+
   |             etcd cluster              |
   |            (3 nodes, Raft)            |
   +---------------------------------------+
The flow: HAProxy uses Patroni’s REST API to know which Postgres node is the primary. Patroni nodes coordinate via etcd to decide who’s primary at any moment. Applications connect to HAProxy.
2. Setting Up etcd
etcd is your source of truth. Don’t skimp on it.
Install on three nodes
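# On each node, as root
curl -L https://github.com/etcd-io/etcd/releases/download/v3.5.17/etcd-v3.5.17-linux-amd64.tar.gz -o etcd.tar.gz
tar xzf etcd.tar.gz
mv etcd-v3.5.17-linux-amd64/etcd* /usr/local/bin/
# The systemd unit runs as the etcd user and writes to /var/lib/etcd
useradd --system --home-dir /var/lib/etcd --shell /usr/sbin/nologin etcd
install -d -o etcd -g etcd -m 700 /var/lib/etcd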
Systemd unit
# /etc/systemd/system/etcd.service
[Unit]
Description=etcd
After=network.target
[Service]
Type=notify
User=etcd
ExecStart=/usr/local/bin/etcd \
--name=etcd-1 \
--data-dir=/var/lib/etcd \
--listen-client-urls=http://0.0.0.0:2379 \
--advertise-client-urls=http://10.0.0.1:2379 \
--listen-peer-urls=http://0.0.0.0:2380 \
--initial-advertise-peer-urls=http://10.0.0.1:2380 \
--initial-cluster=etcd-1=http://10.0.0.1:2380,etcd-2=http://10.0.0.2:2380,etcd-3=http://10.0.0.3:2380 \
--initial-cluster-state=new \
--initial-cluster-token=patroni-cluster
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
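The unit above is for the first node. On the other two, change --name and use that node’s own IP in the two advertise URLs. After starting all three nodes, check the cluster: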
ETCDCTL_API=3 etcdctl endpoint status \
--endpoints=10.0.0.1:2379,10.0.0.2:2379,10.0.0.3:2379 \
--write-out=table
You should see one leader and two followers. If not, fix it before continuing.
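The reason for three etcd nodes is Raft’s majority rule: a write only succeeds when more than half of the members agree. A small sketch of that arithmetic follows; the function names are mine for illustration and are not part of etcd.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a Raft cluster with `members` voting nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum_size(members)


for n in (1, 2, 3, 5):
    print(f"{n} nodes: quorum={quorum_size(n)}, can lose {tolerated_failures(n)}")
```

Note that two nodes tolerate no more failures than one, which is why the next useful sizes after three are five and seven.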
3. Installing Patroni
# On each Postgres node
apt install -y postgresql-17 python3-pip
# The Postgres driver is an optional extra, so request it explicitly
pip install --break-system-packages "patroni[psycopg3,etcd3]==4.0.0"
Stop and disable the default Postgres systemd unit. Patroni will manage Postgres directly.
systemctl stop postgresql
systemctl disable postgresql
Patroni config
# /etc/patroni/patroni.yml
scope: pgcluster
name: pg-node-1
namespace: /service/
restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.11:8008
etcd3:
  hosts:
    - 10.0.0.1:2379
    - 10.0.0.2:2379
    - 10.0.0.3:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    synchronous_mode: true
    synchronous_mode_strict: false
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        shared_buffers: 16GB
        effective_cache_size: 48GB
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10
        hot_standby: 'on'
        wal_log_hints: 'on'
        archive_mode: 'on'
        archive_command: 'wal-g wal-push %p'
  initdb:
    - encoding: UTF8
    - data-checksums
  pg_hba:
    - host replication replicator 10.0.0.0/16 scram-sha-256
    - host all all 10.0.0.0/16 scram-sha-256
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.11:5432
  data_dir: /var/lib/postgresql/17/main
  bin_dir: /usr/lib/postgresql/17/bin
  authentication:
    replication:
      username: replicator
      password: REDACTED
    superuser:
      username: postgres
      password: REDACTED
Critical settings:
- ttl: 30 sets how long a leader’s lease lasts. Lower means faster failover but more sensitivity to network blips.
- loop_wait: 10 sets how often Patroni checks state. Lower is more responsive.
- synchronous_mode: true means at least one replica must ack before commit. This protects against data loss on failover.
- wal_log_hints: 'on' is required for pg_rewind to work, which makes recovery faster after failover.
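The three timing values are not independent. As I understand Patroni’s guidance, ttl should be at least loop_wait + 2 * retry_timeout, otherwise a healthy leader can lose its lease during a single slow DCS round trip. Treat that rule as an assumption to confirm against the docs for your version. The helper below is a hypothetical pre-flight check, not something Patroni ships.

```python
def check_dcs_timing(ttl: int, loop_wait: int, retry_timeout: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means the values fit."""
    problems = []
    if loop_wait < 1 or retry_timeout < 1:
        problems.append("loop_wait and retry_timeout must be positive")
    # Assumed rule: the lease must outlive one loop plus two DCS retries.
    needed = loop_wait + 2 * retry_timeout
    if ttl < needed:
        problems.append(f"ttl={ttl} is below loop_wait + 2*retry_timeout = {needed}")
    return problems


# The values from the config in this guide sit exactly on the boundary.
print(check_dcs_timing(ttl=30, loop_wait=10, retry_timeout=10))
# Shrinking ttl alone for faster failover breaks the relationship.
print(check_dcs_timing(ttl=20, loop_wait=10, retry_timeout=10))
```

If you want faster failover, lower all three together rather than only ttl.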
Start Patroni
# /etc/systemd/system/patroni.service
[Unit]
Description=Patroni
After=network.target etcd.service
[Service]
Type=simple
User=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
Start on the first node, wait for it to bootstrap as primary, then start the other two as replicas.
patronictl -c /etc/patroni/patroni.yml list
You should see one Leader and two replicas, all running or streaming. With synchronous_mode: true, one of the replicas is listed as Sync Standby.
4. HAProxy For Client Routing
HAProxy uses Patroni’s REST API to route writes to the primary.
# /etc/haproxy/haproxy.cfg
global
    maxconn 8192
defaults
    mode tcp
    timeout connect 5s
    timeout client 30m
    timeout server 30m
listen postgres_primary
    bind *:5000
    option httpchk OPTIONS /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node-1 10.0.0.11:5432 check port 8008
    server pg-node-2 10.0.0.12:5432 check port 8008
    server pg-node-3 10.0.0.13:5432 check port 8008
listen postgres_replicas
    bind *:5001
    balance roundrobin
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node-1 10.0.0.11:5432 check port 8008
    server pg-node-2 10.0.0.12:5432 check port 8008
    server pg-node-3 10.0.0.13:5432 check port 8008
The OPTIONS /primary health check hits Patroni’s REST API, which returns 200 only on the current primary. The /replica endpoint returns 200 only on running replicas. This means failover Just Works for the HAProxy layer.
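To see the same decision HAProxy makes, you can probe the endpoint yourself. The sketch below uses GET instead of OPTIONS and treats anything other than exactly one 200 as "no known primary". The function names, the injectable probe, and the host list are assumptions for illustration only.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET on `url`, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 from a node that is not the primary
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure


def find_primary(
    hosts: Iterable[str],
    probe: Callable[[str], int] = http_status,
    port: int = 8008,
) -> Optional[str]:
    """Return the single host whose /primary check answers 200, else None."""
    healthy = [h for h in hosts if probe(f"http://{h}:{port}/primary") == 200]
    # Zero answers means no primary yet; two or more deserves an alert, not a guess.
    return healthy[0] if len(healthy) == 1 else None
```

Calling find_primary(["10.0.0.11", "10.0.0.12", "10.0.0.13"]) during a failover drill shows the moment the answer changes, which is handy when comparing against what HAProxy reports.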
Applications connect to port 5000 for writes and port 5001 for reads. Run two HAProxy nodes behind keepalived or a cloud load balancer for HA on this layer too.
on-marked-down shutdown-sessions
This is the key directive. When a node becomes unhealthy (failover happened), HAProxy aggressively terminates existing connections so clients reconnect to the new primary. Without this, connections to the old primary hang until they time out.
5. Failover Testing
You must test failover. Not in theory, but in a production-like environment, and regularly.
Manual switchover
patronictl -c /etc/patroni/patroni.yml switchover \
--leader pg-node-1 --candidate pg-node-2
A switchover is a controlled handoff. Patroni waits for replication to fully catch up, demotes the leader cleanly, and promotes the candidate. No data loss, minimal downtime (under 5 seconds typically).
Failure simulation
# On the primary node
systemctl stop patroni
# or kill -9 the postgres process for a harder failure
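Watch what happens with patronictl list from another node. A clean stop releases the leader key, so a replica is usually promoted within a few seconds. After a hard failure of the whole node, promotion can take up to ttl seconds (30 in the config above). Killing only the postgres process is a different test: Patroni first tries to restart Postgres in place, so the node may recover without any failover. With synchronous_mode: true, the sync replica is preferred.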
Test all the failure modes:
- Clean shutdown of Patroni (graceful).
- SIGKILL on Postgres (hard).
- Network partition (use iptables to drop traffic).
- Disk full on primary.
- etcd node failure.
- Two simultaneous failures (e.g., primary + one etcd node).
Each of these has different recovery characteristics. The two-failure cases are the ones that surprise teams. For example, if etcd loses quorum at the same time as a Postgres failure, Patroni can’t promote anyone because it can’t write the new leader’s identity to etcd. The cluster waits, which is the safe behavior but feels broken to people watching dashboards.
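To compare failure modes you need a number, not a feeling. One approach is to run a cheap probe (for example a SELECT 1 through HAProxy) once a second during each drill, record whether it succeeded, and compute the longest outage afterwards. The sketch below covers only the arithmetic; the sample format is an assumption.

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest stretch, in seconds, from a first failed probe to the next success.

    `samples` is a time-ordered sequence of (timestamp_seconds, probe_succeeded).
    An outage still open at the end is measured up to the last sample.
    """
    longest = 0.0
    outage_start = None
    last_ts = None
    for ts, ok in samples:
        last_ts = ts
        if not ok and outage_start is None:
            outage_start = ts  # first failure of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None and last_ts is not None:
        longest = max(longest, last_ts - outage_start)
    return longest


# Probes at t=0..9; writes failed from t=1 until they worked again at t=8.
drill = [(0, True), (1, False), (2, False), (5, False), (8, True), (9, True)]
print(longest_outage(drill))
```

Record the result per failure mode and keep the history, so a regression in failover time shows up in the next drill.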
Failback patterns
After a failover, the former primary is now a replica. There are two schools of thought on what to do next.
The first school: bring it back as the primary. This requires a switchover back, which is a controlled operation but means another brief blip. Useful when one node has materially better hardware than the others.
The second school: leave it. The new primary is fine. The cluster is balanced. Don’t introduce more change. This is what I do unless there’s a reason.
# If you do want to switch back
patronictl -c /etc/patroni/patroni.yml switchover \
--leader pg-node-2 --candidate pg-node-1
The downside of leaving things is that over a year, with several failovers, your “primary preference” becomes random. If your monitoring assumes a specific node is primary, fix the monitoring, not the cluster.
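Fixing the monitoring usually means asking the cluster who the leader is instead of hard-coding a hostname. Patroni’s REST API has a /cluster endpoint that returns the member list as JSON. The sketch below assumes each member carries name and role fields and that the leader’s role is the string "leader". The sample payload is hypothetical and trimmed to those fields.

```python
from typing import Optional


def leader_name(cluster: dict) -> Optional[str]:
    """Return the name of the member reporting role 'leader', or None."""
    for member in cluster.get("members", []):
        if member.get("role") == "leader":
            return member.get("name")
    return None


# Hypothetical response, trimmed to the fields used here.
sample = {
    "members": [
        {"name": "pg-node-1", "role": "replica", "state": "streaming"},
        {"name": "pg-node-2", "role": "leader", "state": "running"},
        {"name": "pg-node-3", "role": "sync_standby", "state": "streaming"},
    ]
}
print(leader_name(sample))  # prints pg-node-2
```

A check built this way alerts on "no leader" rather than on "pg-node-1 is not the leader", which is the condition you care about.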
6. Application Patterns
Your application needs to handle failover gracefully. Two patterns matter.
Connection retry
Any query can fail with “connection broken” during a failover. The application should retry, ideally with exponential backoff:
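// Go example: retry with exponential backoff (100ms, 200ms, 400ms, ...)
var err error
for attempt := 0; attempt < 5; attempt++ {
	if err = pool.Ping(ctx); err == nil {
		break
	}
	backoff := time.Duration(1<<attempt) * 100 * time.Millisecond
	time.Sleep(backoff)
}
if err != nil {
	// Still down after about 3 seconds of retries: surface the error
	log.Printf("database unreachable: %v", err)
}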
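A ping loop tells you when the database is back, but the query that failed still has to be run again. A minimal Python sketch of wrapping the operation itself follows, with jitter added so that a fleet of clients does not reconnect in lockstep. The helper name and defaults are mine. With psycopg you would likely pass retry_on=(psycopg.OperationalError,), and you should only retry statements that are safe to run twice.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, OSError),
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `attempts` tries have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # 0.1s, 0.2s, 0.4s, ... plus up to 50% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

The sleep parameter exists so tests can inject a fake and run instantly.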
Read/write splitting
Use the two HAProxy ports. Writes go to port 5000, reads go to port 5001. Be careful with read-after-write consistency. A write commits on the primary but the replica might be a few hundred milliseconds behind. If your app needs read-after-write, route those reads to the primary too.
A common pattern: stamp a “freshness” requirement on each query at the application layer. Anything that just rendered a form submission, anything that follows a POST in the same user session, anything that reads data the user might have just changed — those go to the primary. Everything else can use a replica. Most data-platform frameworks have hooks for this; if yours doesn’t, build it into your DB access layer.
# Python pattern
import psycopg_pool

primary_pool = psycopg_pool.ConnectionPool("postgresql://haproxy:5000/db")
replica_pool = psycopg_pool.ConnectionPool("postgresql://haproxy:5001/db")

def write(query, params):
    # Writes always go through port 5000, which HAProxy routes to the primary
    with primary_pool.connection() as conn:
        return conn.execute(query, params)

def read(query, params, fresh=False):
    # fresh=True sends the read to the primary for read-after-write consistency
    pool = primary_pool if fresh else replica_pool
    with pool.connection() as conn:
        return conn.execute(query, params).fetchall()
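Deciding when a read counts as fresh can be automated with a per-session time window: after a session writes, send its reads to the primary for a couple of seconds. This is a heuristic sketch under my own assumptions (an in-memory map, a 2 second window, a hypothetical FreshnessRouter class); it does not guarantee consistency if replica lag exceeds the window.

```python
import time
from typing import Callable, Dict


class FreshnessRouter:
    """Route a session's reads to the primary for a short window after it writes."""

    def __init__(
        self,
        window_seconds: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self._clock = clock
        self._last_write: Dict[str, float] = {}

    def note_write(self, session_id: str) -> None:
        """Record that this session just wrote something."""
        self._last_write[session_id] = self._clock()

    def target_for_read(self, session_id: str) -> str:
        """Return 'primary' inside the window after a write, else 'replica'."""
        last = self._last_write.get(session_id)
        if last is not None and self._clock() - last < self.window_seconds:
            return "primary"
        return "replica"
```

The result maps directly onto the fresh flag: call read(query, params, fresh=router.target_for_read(sid) == "primary"). In a multi-process deployment the map would need to live in shared storage such as the session cookie, and old entries should be expired.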
Common Pitfalls
1. Two-node Patroni
People try to save money with two Postgres nodes. Don’t. With two nodes you can’t tell a network partition from a node failure. Patroni will refuse to fail over in that case, leaving you with no primary. Three nodes minimum.
2. Putting etcd on the Postgres nodes only
If your three etcd nodes are co-located with your three Postgres nodes, a network partition that splits one Postgres node also splits one etcd node. That’s fine. But if you lose two of three Postgres nodes (and thus two of three etcd nodes), you’ve lost etcd quorum and Patroni can’t function. Consider running etcd on separate hardware for very high reliability.
3. Ignoring archive_command
Without archive_command, your only safety net is the streaming replicas. If all replicas are corrupted somehow, you can’t recover beyond the WAL still on disk. Always set up WAL archiving with wal-g or pgbackrest.
4. Synchronous replication misconfigured
synchronous_mode_strict: true means commits block if no sync replica is available. This protects data but means a sync replica failure stalls writes. synchronous_mode_strict: false means commits succeed even without a sync replica, which protects availability but allows data loss in rare scenarios. Choose deliberately.
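The combinations are easier to reason about as a table. The function below only restates the behavior described in this section; it is a thinking aid, not Patroni code.

```python
def commit_behavior(synchronous_mode: bool, strict: bool, sync_replica_up: bool) -> str:
    """Describe what a commit on the primary does under each combination."""
    if not synchronous_mode:
        return "commits without waiting for any replica"
    if sync_replica_up:
        return "commits after the synchronous replica acknowledges"
    if strict:
        return "blocks until a synchronous replica is available"
    return "commits without a replica acknowledgement"


for strict in (False, True):
    for replica_up in (True, False):
        print(f"strict={strict!s:5} replica_up={replica_up!s:5} ->",
              commit_behavior(True, strict, replica_up))
```

The row to discuss with your team is strict=False with no replica up: writes keep flowing, and those are the transactions at risk if the primary then fails.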
Troubleshooting
Failover doesn’t happen
Check etcd quorum:
ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table
If etcd is unhealthy, Patroni can’t make decisions. Fix etcd first.
Replica falls behind, gets stuck
SELECT * FROM pg_stat_replication;
Check replay_lag. If it’s growing, the replica can’t keep up. Either your replica hardware is too weak or you have a long-running query blocking replay (max_standby_streaming_delay). Postgres 17 made this less common but it still happens.
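replay_lag is a time interval. To compare lag against maximum_lag_on_failover, which is in bytes, subtract the replica’s replay_lsn from the primary’s current WAL position. An LSN is printed as two hexadecimal numbers separated by a slash, where the first is the high 32 bits. A small sketch of the conversion follows; the function names are mine.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))


MAX_LAG_ON_FAILOVER = 1048576  # bytes, the value used in patroni.yml in this guide

lag = lag_bytes("0/5000000", "0/4F00000")
print(lag, lag <= MAX_LAG_ON_FAILOVER)  # prints 1048576 True
```

A replica that is consistently over the limit would not be considered for promotion, so alert on it before you need a failover.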
Split brain after weird network event
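Patroni 4.0 is good at preventing this, but if it happens (two nodes both think they’re primary), shut down one immediately. A restart is not enough, because the node would come straight back up. Stop Patroni on that node, which also stops the Postgres instance it manages:
# On the node you are taking out of service
systemctl stop patroni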
Then investigate etcd state and Patroni logs. This is the scenario you most need backups and WAL archives for. Sort out which node has the most committed transactions, keep that one as primary, and rebuild the other from a base backup plus WAL replay. Don’t try to merge writes from a split brain; you’ll lose data or duplicate IDs.
HAProxy stuck routing to old primary
If clients are still hitting an old node after a failover, HAProxy’s health checks didn’t fire fast enough or aren’t configured correctly. Check show stat on the HAProxy admin socket:
echo "show stat" | socat stdio /var/run/haproxy/admin.sock | grep postgres
Look at the chkfail counter and the status column. If the old primary still shows UP, your health check interval is too long or the check isn’t actually hitting Patroni’s /primary endpoint. The inter 3s fall 3 rise 2 configuration above means a node is marked DOWN after 9 seconds of failed checks. That’s usually fine but it does mean 9 seconds of routing to a dead primary in the worst case.
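The 9 second figure is just inter multiplied by fall, and the same arithmetic tells you how long a recovered node waits before it receives traffic again. A sketch for trying out other values before editing haproxy.cfg follows. The function names are mine, and the model ignores the time an individual check spends waiting on a timeout.

```python
def seconds_until_marked_down(inter: float, fall: int) -> float:
    """Worst-case time from a node dying to HAProxy marking it DOWN."""
    if inter <= 0 or fall < 1:
        raise ValueError("inter must be positive and fall at least 1")
    # The node can die right after a passing check, so allow `fall` full intervals.
    return inter * fall


def seconds_until_marked_up(inter: float, rise: int) -> float:
    """Worst-case time from a node recovering to HAProxy marking it UP."""
    if inter <= 0 or rise < 1:
        raise ValueError("inter must be positive and rise at least 1")
    return inter * rise


for inter, fall, rise in ((3, 3, 2), (2, 2, 2), (1, 3, 2)):
    print(f"inter={inter}s fall={fall} rise={rise}: "
          f"down after {seconds_until_marked_down(inter, fall)}s, "
          f"up after {seconds_until_marked_up(inter, rise)}s")
```

Shorter intervals cut the window but raise the odds of flapping on a brief network blip, the same trade-off as with ttl.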
The Postgres replication documentation covers the underlying replication settings Patroni manages.
Wrapping Up
A Patroni HA setup is more nuanced than the install docs suggest, but the underlying design is sound. Three Postgres nodes, three etcd nodes, two HAProxy nodes with health-checked failover. Test failover regularly. Have backups separate from replication. That’s the formula.
The next step is monitoring. The pg_stat_io deep dive covers how to know whether your nodes are healthy under load.