
June 16, 2025 · 10 min read · by Muhammad Amal · programming

TL;DR — Patroni 4.0 with etcd and HAProxy gives you a battle-tested Postgres HA setup in a few hours. The tricky parts are network partitions, failover timing, and making your application reconnect cleanly. Plan for these or your “highly available” cluster becomes “occasionally available.”

I’ve built or inherited Patroni clusters at six different companies. Every single one had failover bugs in the first month of production, and every single bug was something the documentation warned about and somebody ignored. HA Postgres is not a thing you set up once and forget. It’s a system that needs ongoing care, monitoring, and occasional drills.

Patroni 4.0 (released late 2024) is the version I’d recommend for new deployments. It supports Postgres 17 cleanly, the new health-check probes are better, and the Kubernetes integration finally stopped being a second-class citizen. This guide walks through a real production setup on bare VMs. If you’re running on Kubernetes, the concepts translate but the operator landscape (Zalando, CrunchyData, StackGres) is its own topic.

Before you start, you should be comfortable with Postgres replication and understand the basics of streaming replication.

1. The Architecture

A minimum production Patroni setup has:

3 Postgres nodes (1 primary, 2 replicas).
3 etcd nodes (for consensus, can run on the same boxes).
2 HAProxy nodes (for client routing).
Either keepalived or a cloud load balancer for HAProxy itself.

        +---------+   +---------+
        | App     |   | App     |
        +----+----+   +----+----+
             |             |
             v             v
           +-----------------+
           |    HAProxy VIP  |
           +--------+--------+
                    |
        +-----------+-----------+
        |                       |
   +----v----+             +----v----+
   |HAProxy 1|             |HAProxy 2|
   +----+----+             +----+----+
        |                       |
        +-----------+-----------+
                    |
       +------------+------------+
       |            |            |
  +----v----+  +----v----+  +----v----+
  | Patroni |  | Patroni |  | Patroni |
  | + PG 1  |  | + PG 2  |  | + PG 3  |
  +----+----+  +----+----+  +----+----+
       |            |            |
  +----v------------v------------v----+
  |           etcd cluster            |
  |          (3 nodes, Raft)          |
  +-----------------------------------+

The flow: HAProxy uses Patroni’s REST API to know which Postgres node is the primary. Patroni nodes coordinate via etcd to decide who’s primary at any moment. Applications connect to HAProxy.

2. Setting Up etcd

etcd is your source of truth. Don’t skimp on it.

Install on three nodes

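# On each node, as root
curl -L https://github.com/etcd-io/etcd/releases/download/v3.5.17/etcd-v3.5.17-linux-amd64.tar.gz -o etcd.tar.gz
tar xzf etcd.tar.gz
mv etcd-v3.5.17-linux-amd64/etcd* /usr/local/bin/
# The systemd unit below runs as the etcd user, so create it along with the data directory
useradd --system --shell /usr/sbin/nologin etcd
install -d -o etcd -m 0700 /var/lib/etcd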

Systemd unit

# /etc/systemd/system/etcd.service
[Unit]
Description=etcd
After=network.target

[Service]
Type=notify
User=etcd
ExecStart=/usr/local/bin/etcd \
  --name=etcd-1 \
  --data-dir=/var/lib/etcd \
  --listen-client-urls=http://0.0.0.0:2379 \
  --advertise-client-urls=http://10.0.0.1:2379 \
  --listen-peer-urls=http://0.0.0.0:2380 \
  --initial-advertise-peer-urls=http://10.0.0.1:2380 \
  --initial-cluster=etcd-1=http://10.0.0.1:2380,etcd-2=http://10.0.0.2:2380,etcd-3=http://10.0.0.3:2380 \
  --initial-cluster-state=new \
  --initial-cluster-token=patroni-cluster

Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

After starting all three nodes:

ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints=10.0.0.1:2379,10.0.0.2:2379,10.0.0.3:2379 \
  --write-out=table

You should see one leader and two followers. If not, fix it before continuing.

3. Installing Patroni

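# On each Postgres node (postgresql-17 usually comes from the PGDG apt repository)
apt install -y postgresql-17 python3-pip
# Patroni needs a Postgres driver as well as the etcd client, so install both extras
pip install --break-system-packages "patroni[psycopg3,etcd3]==4.0.0"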

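Stop and disable the default Postgres systemd unit. Patroni will manage Postgres directly. On Debian and Ubuntu the package also creates a default cluster in /var/lib/postgresql/17/main, the same path used as data_dir below, so drop it and let Patroni run initdb into an empty directory.

systemctl stop postgresql
systemctl disable postgresql
pg_dropcluster --stop 17 main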

Patroni config

# /etc/patroni/patroni.yml
scope: pgcluster
name: pg-node-1
namespace: /service/

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.11:8008

etcd3:
  hosts:
    - 10.0.0.1:2379
    - 10.0.0.2:2379
    - 10.0.0.3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    synchronous_mode: true
    synchronous_mode_strict: false
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        shared_buffers: 16GB
        effective_cache_size: 48GB
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10
        hot_standby: 'on'
        wal_log_hints: 'on'
        archive_mode: 'on'
        archive_command: 'wal-g wal-push %p'

  initdb:
    - encoding: UTF8
    - data-checksums

  pg_hba:
    - host replication replicator 10.0.0.0/16 scram-sha-256
    - host all all 10.0.0.0/16 scram-sha-256

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.11:5432
  data_dir: /var/lib/postgresql/17/main
  bin_dir: /usr/lib/postgresql/17/bin
  authentication:
    replication:
      username: replicator
      password: REDACTED
    superuser:
      username: postgres
      password: REDACTED

Critical settings:

ttl: 30 sets how long a leader’s lease lasts. Lower means faster failover but more sensitivity to network blips.
loop_wait: 10 sets how often Patroni checks state. Lower is more responsive.
synchronous_mode: true makes commits wait for at least one replica to acknowledge, which protects against data loss on failover. With synchronous_mode_strict: false, Patroni drops that requirement while no replica is available.
wal_log_hints: 'on' lets pg_rewind work (pg_rewind needs either this or data checksums), so a former primary can rejoin as a replica without a full base backup.
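The timing settings are not independent. Patroni’s documentation gives the rule loop_wait + 2 * retry_timeout <= ttl, and it is easy to break when somebody lowers ttl to get faster failover. A minimal sketch of a pre-deploy check, assuming you have the three values in hand (check_patroni_timing is a hypothetical helper, not part of Patroni):

```python
def check_patroni_timing(ttl: int, loop_wait: int, retry_timeout: int) -> list[str]:
    """Return a list of problems with the DCS timing settings (empty if none)."""
    problems = []
    if min(ttl, loop_wait, retry_timeout) <= 0:
        problems.append("ttl, loop_wait and retry_timeout must all be positive")
    # Rule from the Patroni docs: one loop plus two retried DCS calls
    # must fit inside the leader lease.
    needed = loop_wait + 2 * retry_timeout
    if needed > ttl:
        problems.append(f"loop_wait + 2*retry_timeout = {needed} exceeds ttl = {ttl}")
    return problems


# Values from the config above pass; halving ttl alone does not.
print(check_patroni_timing(ttl=30, loop_wait=10, retry_timeout=10))
print(check_patroni_timing(ttl=15, loop_wait=10, retry_timeout=10))
```

If you want a shorter ttl, scale loop_wait and retry_timeout down with it.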

Start Patroni

# /etc/systemd/system/patroni.service
[Unit]
Description=Patroni
After=network.target etcd.service

[Service]
Type=simple
User=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

Start on the first node, wait for it to bootstrap as primary, then start the other two as replicas.

patronictl -c /etc/patroni/patroni.yml list

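Should show one Leader and two replicas, all in a running or streaming state. With synchronous_mode on, one of the replicas is listed as Sync Standby.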

4. HAProxy For Client Routing

HAProxy uses Patroni’s REST API to route writes to the primary.

# /etc/haproxy/haproxy.cfg
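global
    maxconn 8192
    # Admin socket, used by the "show stat" command in Troubleshooting
    stats socket /var/run/haproxy/admin.sock mode 660 level admin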

defaults
    mode tcp
    timeout connect 5s
    timeout client 30m
    timeout server 30m

listen postgres_primary
    bind *:5000
    option httpchk OPTIONS /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node-1 10.0.0.11:5432 check port 8008
    server pg-node-2 10.0.0.12:5432 check port 8008
    server pg-node-3 10.0.0.13:5432 check port 8008

listen postgres_replicas
    bind *:5001
    balance roundrobin
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node-1 10.0.0.11:5432 check port 8008
    server pg-node-2 10.0.0.12:5432 check port 8008
    server pg-node-3 10.0.0.13:5432 check port 8008

The OPTIONS /primary health check hits Patroni’s REST API, which returns 200 only on the current primary. The /replica endpoint returns 200 only on running replicas. This means failover Just Works for the HAProxy layer.
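You can run the same check by hand, which helps when you suspect HAProxy and Patroni disagree. A minimal sketch in Python, assuming the REST API listens on port 8008 as configured above; http_status and find_primary are hypothetical helper names, and the probe is injectable so the logic can be exercised without a live cluster:

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code for a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # Patroni answers 503 when the node's role does not match
    except (urllib.error.URLError, OSError):
        return None


def find_primary(
    nodes: list[str],
    probe: Callable[[str], Optional[int]] = http_status,
) -> Optional[str]:
    """Return the first node whose /primary endpoint answers 200, else None."""
    for node in nodes:
        if probe(f"http://{node}:8008/primary") == 200:
            return node
    return None
```

Calling find_primary(["10.0.0.11", "10.0.0.12", "10.0.0.13"]) during a drill tells you what Patroni itself reports, independent of HAProxy’s view.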

Applications connect to port 5000 for writes and port 5001 for reads. Run two HAProxy nodes behind keepalived or a cloud load balancer for HA on this layer too.

on-marked-down shutdown-sessions

This is the key directive. When a node becomes unhealthy (failover happened), HAProxy aggressively terminates existing connections so clients reconnect to the new primary. Without this, connections to the old primary hang until they time out.

5. Failover Testing

You must test failover. Not in theory: in a production-like environment, and regularly.

Manual switchover

patronictl -c /etc/patroni/patroni.yml switchover \
  --leader pg-node-1 --candidate pg-node-2

A switchover is a controlled handoff. Patroni waits for replication to fully catch up, demotes the leader cleanly, and promotes the candidate. No data loss, minimal downtime (under 5 seconds typically).

Failure simulation

# On the primary node
systemctl stop patroni
# or kill -9 the postgres process for a harder failure

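Watch what happens with patronictl list from another node. A clean stop releases the leader key right away, so a replica is typically promoted within a few seconds. After a hard failure of the whole node, expect promotion within roughly ttl seconds (30 in the config above), once the lease expires. If you kill only the postgres process, Patroni normally tries to restart it in place first (bounded by primary_start_timeout) rather than failing over. With synchronous_mode: true, only the synchronous replica is eligible for automatic promotion.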

Test all the failure modes:

Clean shutdown of Patroni (graceful).
SIGKILL on Postgres (hard).
Network partition (use iptables to drop traffic).
Disk full on primary.
etcd node failure.
Two simultaneous failures (e.g., primary + one etcd node).

Each of these has different recovery characteristics. The two-failure cases are the ones that surprise teams. For example, if etcd loses quorum at the same time as a Postgres failure, Patroni can’t promote anyone because it can’t write the new leader’s identity to etcd. The cluster waits, which is the safe behavior but feels broken to people watching dashboards.
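A drill is more useful when it produces a number. Assuming you run a probe loop during the test (say, SELECT 1 through HAProxy port 5000 once a second, not shown here) and record a timestamp plus success or failure for each attempt, this sketch computes the longest outage the application would have seen. The function name and the sample format are my own choices for illustration:

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest gap in seconds between a first failed probe and the next success.

    samples: (timestamp_seconds, probe_succeeded) pairs in time order.
    An outage that never recovers within the samples is not counted.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts  # outage begins at the first failure
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None  # recovered
    return longest
```

Record the result for each failure mode in the list above and compare it between drills; a number that creeps upward is an early warning.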

Failback patterns

After a failover, the former primary is now a replica. There are two schools of thought on what to do next.

The first school: bring it back as the primary. This requires a switchover back, which is a controlled operation but means another brief blip. Useful when one node has materially better hardware than the others.

The second school: leave it. The new primary is fine. The cluster is balanced. Don’t introduce more change. This is what I do unless there’s a reason.

# If you do want to switch back
patronictl -c /etc/patroni/patroni.yml switchover \
  --leader pg-node-2 --candidate pg-node-1

The downside of leaving things is that over a year, with several failovers, your “primary preference” becomes random. If your monitoring assumes a specific node is primary, fix the monitoring, not the cluster.

6. Application Patterns

Your application needs to handle failover gracefully. Two patterns matter.

Connection retry

Any query can fail with “connection broken” during a failover. The application should retry, ideally with exponential backoff:

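// Go example. Assumes pool is a *pgxpool.Pool (github.com/jackc/pgx/v5/pgxpool)
// and that context, fmt, and time are imported.
func waitForDB(ctx context.Context, pool *pgxpool.Pool) error {
    var err error
    for attempt := 0; attempt < 5; attempt++ {
        if err = pool.Ping(ctx); err == nil {
            return nil
        }
        // Back off 100ms, 200ms, 400ms, 800ms, 1.6s between attempts.
        backoff := time.Duration(1<<attempt) * 100 * time.Millisecond
        time.Sleep(backoff)
    }
    return fmt.Errorf("database unreachable after 5 attempts: %w", err)
}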
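The Go loop only pings. In practice you want the same backoff wrapped around the operation itself, with some jitter so every client doesn’t reconnect in the same instant. A minimal Python sketch under those assumptions; retry_with_backoff is a hypothetical helper, and the default retriable exception is a stand-in (with psycopg you would likely pass psycopg.OperationalError):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    retriable: tuple = (ConnectionError,),
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op(); on a retriable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(attempts):
        try:
            return op()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("attempts must be at least 1")
```

Only wrap operations that are safe to run twice. A write whose commit acknowledgement was lost in the failover may already have been applied.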

Read/write splitting

Use the two HAProxy ports. Writes go to port 5000, reads go to port 5001. Be careful with read-after-write consistency. A write commits on the primary but the replica might be a few hundred milliseconds behind. If your app needs read-after-write, route those reads to the primary too.

A common pattern: stamp a “freshness” requirement on each query at the application layer. Anything that just rendered a form submission, anything that follows a POST in the same user session, anything that reads data the user might have just changed — those go to the primary. Everything else can use a replica. Most data-platform frameworks have hooks for this; if yours doesn’t, build it into your DB access layer.

# Python pattern
import psycopg_pool
primary_pool = psycopg_pool.ConnectionPool("postgresql://haproxy:5000/db")
replica_pool = psycopg_pool.ConnectionPool("postgresql://haproxy:5001/db")

def write(query, params):
    with primary_pool.connection() as conn:
        return conn.execute(query, params)

def read(query, params, fresh=False):
    pool = primary_pool if fresh else replica_pool
    with pool.connection() as conn:
        return conn.execute(query, params).fetchall()
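The fresh flag has to come from somewhere. One way to derive it, sketched here as an assumption rather than a recommendation, is to remember when each session last wrote and send its reads to the primary for a short window afterwards. FreshnessRouter and the two-second window are hypothetical; the in-process dict only works for a single application process:

```python
import time
from typing import Callable, Dict


class FreshnessRouter:
    """Route a session's reads to the primary for a short window after it writes."""

    def __init__(self, window_seconds: float = 2.0,
                 clock: Callable[[], float] = time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self._last_write: Dict[str, float] = {}

    def note_write(self, session_id: str) -> None:
        """Record that this session just wrote something."""
        self._last_write[session_id] = self.clock()

    def needs_primary(self, session_id: str) -> bool:
        """True while the session's last write is younger than the window."""
        last = self._last_write.get(session_id)
        return last is not None and self.clock() - last < self.window
```

Usage would look like read(query, params, fresh=router.needs_primary(session_id)), with router.note_write(session_id) called after each write. Pick a window comfortably larger than the replica lag you actually observe.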

Common Pitfalls

1. Two-node Patroni

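People try to save money with two Postgres nodes. Don’t. If etcd lives on the same two boxes, losing either one loses quorum, and the surviving node can’t tell a network partition from a peer failure. Patroni will refuse to fail over in that case, leaving you with no primary. Even with etcd hosted elsewhere, a single failover leaves you running on one node with no replica. Three nodes minimum.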

2. Putting etcd on the Postgres nodes only

If your three etcd nodes are co-located with your three Postgres nodes, a network partition that splits one Postgres node also splits one etcd node. That’s fine. But if you lose two of three Postgres nodes (and thus two of three etcd nodes), you’ve lost etcd quorum and Patroni can’t function. Consider running etcd on separate hardware for very high reliability.
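The quorum arithmetic behind this is worth having in front of you when you size the etcd cluster. A small sketch (the function names are mine) of how many members a Raft cluster needs for a majority and how many it can lose:

```python
def quorum(members: int) -> int:
    """Votes needed for a Raft majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, survives {tolerated_failures(n)} failure(s)")
```

Three members survive one failure and five survive two. Two members survive none, and four survive no more than three do, which is why even-sized clusters buy you nothing.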

3. Ignoring archive_command

Without archive_command, your only safety net is the streaming replicas. If all replicas are corrupted somehow, you can’t recover beyond the WAL still on disk. Always set up WAL archiving with wal-g or pgbackrest.

4. Synchronous replication misconfigured

synchronous_mode_strict: true means commits block if no sync replica is available. This protects data but means a sync replica failure stalls writes. synchronous_mode_strict: false means commits succeed even without a sync replica, which protects availability but allows data loss in rare scenarios. Choose deliberately.

Troubleshooting

Failover doesn’t happen

Check etcd quorum:

ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table

If etcd is unhealthy, Patroni can’t make decisions. Fix etcd first.

Replica falls behind, gets stuck

SELECT * FROM pg_stat_replication;

Check replay_lag. If it’s growing, the replica can’t keep up. Either your replica hardware is too weak or you have a long-running query blocking replay (max_standby_streaming_delay). Postgres 17 made this less common but it still happens.
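replay_lag is a time interval, while maximum_lag_on_failover in the Patroni config is in bytes. To compare like with like, you can turn the sent_lsn and replay_lsn columns into byte positions (Postgres can also do this for you with pg_wal_lsn_diff). A sketch, with hypothetical helper names, that is only a simplified stand-in for Patroni’s own eligibility check:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '0/3000060' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))


MAXIMUM_LAG_ON_FAILOVER = 1048576  # from the Patroni config earlier in this guide


def within_failover_lag(primary_lsn: str, replica_lsn: str) -> bool:
    """True if the replica's byte lag is within maximum_lag_on_failover."""
    return lag_bytes(primary_lsn, replica_lsn) <= MAXIMUM_LAG_ON_FAILOVER
```

A replica that sits above that threshold for long stretches is one Patroni may decline to promote when you need it.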

Split brain after weird network event

Patroni 4.0 is good at preventing this, but if it happens (two nodes both think they’re primary), take one out of service immediately by stopping Patroni on it, which also stops the Postgres instance it manages:

# On the node you are taking out of service
systemctl stop patroni

Then investigate etcd state and Patroni logs. This is the scenario you most need backups and WAL archives for. Sort out which node has the most committed transactions, keep that one as primary, and rebuild the other from a base backup plus WAL replay. Don’t try to merge writes from a split brain; you’ll lose data or duplicate IDs.

HAProxy stuck routing to old primary

If clients are still hitting an old node after a failover, HAProxy’s health checks didn’t fire fast enough or aren’t configured correctly. Check show stat on the HAProxy admin socket:

echo "show stat" | socat stdio /var/run/haproxy/admin.sock | grep postgres

Look at the chkfail counter and the status column. If the old primary still shows UP, your health check interval is too long or the check isn’t actually hitting Patroni’s /primary endpoint. The inter 3s fall 3 rise 2 configuration above means a node is marked DOWN after 9 seconds of failed checks. That’s usually fine but it does mean 9 seconds of routing to a dead primary in the worst case.
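The same arithmetic applies in the other direction: the new primary is not routed to until it has passed rise consecutive checks. A back-of-envelope sketch, ignoring check timeouts and the offset between check cycles (the function names are mine):

```python
def seconds_to_mark_down(inter: float, fall: int) -> float:
    """Approximate time for HAProxy to mark a failing server DOWN."""
    return inter * fall


def seconds_to_mark_up(inter: float, rise: int) -> float:
    """Approximate time for HAProxy to mark a healthy server UP."""
    return inter * rise


# Values from "default-server inter 3s fall 3 rise 2"
print(seconds_to_mark_down(inter=3, fall=3))  # 9
print(seconds_to_mark_up(inter=3, rise=2))    # 6
```

So after Patroni promotes a replica, budget roughly another 6 seconds before HAProxy sends writes to it. Lowering inter tightens both numbers at the cost of more health-check traffic against the Patroni REST API.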

The Postgres replication documentation covers the underlying replication settings Patroni manages.

Wrapping Up

A Patroni HA setup is more nuanced than the install docs suggest, but the underlying design is sound. Three Postgres nodes, three etcd nodes, two HAProxy nodes with health-checked failover. Test failover regularly. Have backups separate from replication. That’s the formula.

The next step is monitoring. The pg_stat_io deep dive covers how to know whether your nodes are healthy under load.