Webhook Reliability Patterns, Retries, Idempotency, Signatures

September 11, 2024 · 8 min read · by Muhammad Amal programming

TL;DR — Outbound webhooks are a distributed system you forgot you built. Retry with exponential backoff and jitter, sign with HMAC-SHA256, include an idempotency key, log every delivery attempt. The patterns are not hard. Skipping them is what gets you paged at 3am.

The webhook is the lowest-prestige piece of infrastructure in your stack. It’s also the one most likely to break in a way that’s hard to debug six months later. I’ve spent enough on-call shifts staring at half-delivered webhook payloads to have strong opinions about how they should be built.

This post is for the senior backend engineer who’s standing up an outbound webhook capability — to n8n, to Make, to a customer’s endpoint, doesn’t matter. The same patterns apply. The code is in Go 1.23 but the shapes translate.

If you’ve followed the post on exposing APIs to citizen devs, webhooks are the missing direction. They’re the calls from you to the workflow platform.

The four properties of a webhook you can trust

A production webhook system has four properties. If you’re missing any of them, you’ve got a problem waiting to happen.

At-least-once delivery. The receiver gets the event eventually, even if their endpoint was down at the moment you fired.
Idempotency support. When the receiver gets the same event twice (and they will), they can detect and dedupe.
Authenticity. The receiver can prove the payload came from you, untampered.
Observability. You can answer “what happened to event X” without grepping production logs.

I’ll cover each. The order matters because each builds on the previous.

At-least-once delivery, the retry loop

The naive webhook is “POST and forget.” It’s also the source of every webhook bug. The producer fires the request synchronously inside the business event handler. The receiver is slow or down. The producer times out, marks the event as “failed,” and moves on. Now the event is permanently dropped and you have no recovery.

The correct shape is to decouple the event from the delivery attempt. The business event creates a delivery record in a queue. A separate worker drains the queue and makes the HTTP call. Failed deliveries get rescheduled with backoff.

A minimal delivery worker in Go 1.23 — this is the pattern, not the production code:

package webhooks

import (
	"bytes"
	"context"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"math"
	"math/rand"
	"net/http"
	"time"
)

type Delivery struct {
	ID         string
	EndpointID string
	URL        string
	Secret     []byte
	Payload    []byte
	Attempt    int
	NextRunAt  time.Time
}

type Result int

const (
	ResultSuccess Result = iota
	ResultRetryable
	ResultPermanent
)

func Deliver(ctx context.Context, d Delivery) (Result, error) {
	req, err := http.NewRequestWithContext(ctx, "POST", d.URL, bytes.NewReader(d.Payload))
	if err != nil {
		return ResultPermanent, err
	}

	ts := time.Now().UTC().Format(time.RFC3339)
	sig := signPayload(d.Secret, ts, d.Payload)

	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Webhook-Id", d.ID)
	req.Header.Set("X-Webhook-Timestamp", ts)
	req.Header.Set("X-Webhook-Signature", sig)
	req.Header.Set("X-Webhook-Attempt", fmt.Sprintf("%d", d.Attempt))

	client := &http.Client{Timeout: 10 * time.Second}
	res, err := client.Do(req)
	if err != nil {
		return ResultRetryable, err
	}
	defer res.Body.Close()

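	switch {
	case res.StatusCode >= 200 && res.StatusCode < 300:
		return ResultSuccess, nil
	case res.StatusCode == 408 || res.StatusCode == 429:
		// Request Timeout and Too Many Requests are transient: retry with backoff.
		return ResultRetryable, fmt.Errorf("transient client error: %d", res.StatusCode)
	case res.StatusCode == 410 || res.StatusCode == 404:
		return ResultPermanent, fmt.Errorf("endpoint gone: %d", res.StatusCode)
	case res.StatusCode >= 400 && res.StatusCode < 500:
		return ResultPermanent, fmt.Errorf("client error: %d", res.StatusCode)
	default:
		return ResultRetryable, fmt.Errorf("server error: %d", res.StatusCode)
	}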
}

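// NextBackoff returns 2^attempt seconds, clamped to 15 minutes,
// plus up to 50% random jitter on top.
func NextBackoff(attempt int) time.Duration {
	base := math.Pow(2, float64(attempt)) * float64(time.Second)
	ceiling := float64(15 * time.Minute) // named ceiling to avoid shadowing the builtin cap
	if base > ceiling {
		base = ceiling
	}
	jitter := rand.Float64() * base * 0.5
	return time.Duration(base + jitter)
}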

A few things to call out.

Exponential backoff with jitter. The NextBackoff function doubles each attempt up to a 15-minute ceiling, with up to 50% jitter added. Without jitter, every webhook to the same endpoint that failed at the same time retries at the same time. Thundering herds are real. AWS published the classic Marc Brooker piece on exponential backoff and jitter — read it if you haven’t.
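NextBackoff adds jitter on top of the capped delay. The AWS article also describes a "full jitter" variant that picks the delay uniformly between zero and the capped value. A minimal sketch of that variant follows. The names cappedDelay, FullJitterBackoff and newSeeded are mine, not from any library, and the random source is injected only to keep the function testable.

```go
package main

import (
	"math/rand"
	"time"
)

const (
	baseDelay = time.Second
	maxDelay  = 15 * time.Minute
)

// cappedDelay returns baseDelay * 2^attempt, clamped to maxDelay.
// The loop stops doubling once the cap is reached, so it cannot overflow.
func cappedDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// FullJitterBackoff picks a delay uniformly in [0, cappedDelay(attempt)).
func FullJitterBackoff(rnd *rand.Rand, attempt int) time.Duration {
	return time.Duration(rnd.Int63n(int64(cappedDelay(attempt))))
}

// newSeeded builds a deterministic source, handy in tests.
func newSeeded(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}
```

Full jitter spreads retries across the whole interval, at the cost of occasionally retrying almost immediately. Whether that trade suits your receivers is a judgment call.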

Permanent vs retryable. 4xx responses other than 408 and 429 mean the receiver has rejected the event, so retrying won’t help; mark them permanent. 408, 429, 5xx and network errors are retryable. 410 Gone is the signal to disable the endpoint entirely.

Attempt cap. Not shown in the function but mandatory in production: give up after roughly 24 hours of retries. Uncapped doubling gets there around attempt 16; with the 15-minute ceiling in NextBackoff it takes roughly 70 to 100 attempts, depending on jitter. Move the delivery to a dead-letter table and alert. Retrying forever is how you eat your own queue.
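To sanity-check how many attempts fit in a retry budget, you can sum the schedule. This sketch ignores jitter and assumes the same shape as NextBackoff (2^attempt seconds, capped at 15 minutes) with attempts numbered from 0. The names baseBackoff, AttemptsWithin and NextStep are illustrative. Under those assumptions a 24-hour budget holds 104 retry delays.

```go
package main

import "time"

const (
	retryCap    = 15 * time.Minute
	retryBudget = 24 * time.Hour
)

// baseBackoff mirrors the un-jittered part of the schedule:
// 2^attempt seconds, clamped to retryCap.
func baseBackoff(attempt int) time.Duration {
	d := time.Second
	for i := 0; i < attempt && d < retryCap; i++ {
		d *= 2
	}
	if d > retryCap {
		d = retryCap
	}
	return d
}

// AttemptsWithin counts how many consecutive retry delays fit inside budget.
func AttemptsWithin(budget time.Duration) int {
	var elapsed time.Duration
	n := 0
	for elapsed+baseBackoff(n) <= budget {
		elapsed += baseBackoff(n)
		n++
	}
	return n
}

// NextStep decides what to do after a failed attempt. It returns the delay
// before the next try, or deadLetter=true when that delay would exceed the budget.
func NextStep(elapsed time.Duration, attempt int) (delay time.Duration, deadLetter bool) {
	delay = baseBackoff(attempt)
	if elapsed+delay > retryBudget {
		return 0, true
	}
	return delay, false
}
```

Capping by elapsed time rather than by attempt number keeps the 24-hour promise intact if you later change the backoff curve.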

Idempotency, the receiver’s side

Every webhook event carries an idempotency key. I put it in the payload itself rather than relying on the delivery ID, because the delivery ID changes per attempt but the event identity doesn’t.

{
  "event_id": "evt_01J7M4P8YKQX5RZ3V0BHWMNJZA",
  "event_type": "invoice.paid",
  "occurred_at": "2024-09-11T03:14:22Z",
  "data": {
    "invoice_id": "inv_42",
    "amount_cents": 12500
  }
}

event_id is a ULID generated when the business event happens. It’s stable across retries. The receiver dedupes on it, ideally with a Postgres unique constraint on a processed_events table:

CREATE TABLE processed_events (
  event_id TEXT PRIMARY KEY,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- In the handler, transactionally
INSERT INTO processed_events (event_id) VALUES ($1)
ON CONFLICT (event_id) DO NOTHING;
-- If 0 rows affected, the event was already processed

This is the entire idempotency story for most webhook receivers: run ON CONFLICT DO NOTHING in the same transaction as the side-effect, and return early if the row already existed. Prune old rows from the table on a retention policy.
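For a Go receiver, the same idea can be sketched with database/sql. This assumes the processed_events table above and a Postgres driver that reports RowsAffected. MarkProcessed, HandleOnce, execer and the in-memory memExec stand-in are hypothetical names for illustration.

```go
package main

import (
	"context"
	"database/sql"
)

// execer is the subset of *sql.Tx this sketch needs, so a fake can stand in.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

const insertProcessed = `INSERT INTO processed_events (event_id) VALUES ($1)
ON CONFLICT (event_id) DO NOTHING`

// MarkProcessed records eventID and reports whether it was new.
func MarkProcessed(ctx context.Context, tx execer, eventID string) (bool, error) {
	res, err := tx.ExecContext(ctx, insertProcessed, eventID)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return false, err
	}
	return n == 1, nil
}

// HandleOnce runs sideEffect in the same transaction as the dedupe insert,
// and skips it when the event was already processed.
func HandleOnce(ctx context.Context, db *sql.DB, eventID string, sideEffect func(*sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // returns ErrTxDone after Commit, which is ignored here
	fresh, err := MarkProcessed(ctx, tx, eventID)
	if err != nil {
		return err
	}
	if !fresh {
		return nil // duplicate delivery: nothing to do
	}
	if err := sideEffect(tx); err != nil {
		return err // rollback also undoes the dedupe row, so a retry can succeed
	}
	return tx.Commit()
}

// memExec is an in-memory stand-in used only to exercise MarkProcessed.
type memExec struct{ seen map[string]bool }

type rowCount int64

func (r rowCount) LastInsertId() (int64, error) { return 0, nil }
func (r rowCount) RowsAffected() (int64, error) { return int64(r), nil }

func (m *memExec) ExecContext(_ context.Context, _ string, args ...any) (sql.Result, error) {
	id := args[0].(string)
	if m.seen[id] {
		return rowCount(0), nil
	}
	m.seen[id] = true
	return rowCount(1), nil
}
```

Because the dedupe row and the side-effect commit together, a crash halfway through leaves neither behind and the producer’s next retry is processed normally.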

The n8n side of this is a little fiddlier. The “Webhook” trigger node doesn’t natively dedupe. You build the dedup as the first node — a small function node that checks against an external KV store or a Redis SETNX.

// n8n Function node, first node after the Webhook trigger
// Dedupe on the payload's event_id, which is stable across retries
const eventId = $json.body.event_id;
const redis = $getWorkflowStaticData('global').redis; // Init separately
const wasNew = await redis.set(`evt:${eventId}`, '1', 'EX', 86400, 'NX');
if (!wasNew) {
  return []; // Skip downstream, already processed
}
return $input.all();

Signatures, HMAC-SHA256 is fine

Sign the payload with HMAC-SHA256 and a per-endpoint shared secret. Don’t get clever. Don’t roll your own. Match the GitHub or Stripe pattern — most receivers already know how to verify.

The producer side, building on the snippet above:

func signPayload(secret []byte, timestamp string, payload []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(timestamp))
	mac.Write([]byte("."))
	mac.Write(payload)
	return "v1=" + hex.EncodeToString(mac.Sum(nil))
}

Two important details. The signature covers the timestamp and the body, not just the body, so an attacker can’t attach a fresh timestamp to a captured payload. And the receiver checks the timestamp against a window (typically five minutes), so a captured request can only be replayed inside that window.

The receiver side, in a Node 20 Express handler:

import express from 'express';
import crypto from 'crypto';

const app = express();
app.use(express.raw({ type: 'application/json' }));

app.post('/webhook', (req, res) => {
  const sig = req.header('x-webhook-signature');
  const ts = req.header('x-webhook-timestamp');
  const id = req.header('x-webhook-id');
  if (!sig || !ts || !id) return res.status(400).end();

  const tsMs = Date.parse(ts);
  if (Number.isNaN(tsMs) || Math.abs(Date.now() - tsMs) > 5 * 60 * 1000) {
    return res.status(400).end(); // Unparseable or stale timestamp
  }

  const expected = 'v1=' + crypto
    .createHmac('sha256', process.env.WEBHOOK_SECRET)
    .update(ts + '.')
    .update(req.body)
    .digest('hex');

  const a = Buffer.from(sig);
  const b = Buffer.from(expected);
  if (a.length !== b.length || !crypto.timingSafeEqual(a, b)) {
    return res.status(401).end();
  }

  // Now safe to parse and process
  const event = JSON.parse(req.body.toString());
  // ... dedup on event.event_id, then handle
  res.status(200).end();
});

app.listen(3000);

The timingSafeEqual is the part people skip. Don’t. Constant-time comparison closes a timing side-channel on signature verification. It costs nothing.
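If the receiver is written in Go, the equivalent check might look like the sketch below. It assumes the same "v1=" + hex scheme and RFC 3339 timestamps as the producer code. Verify, sign, the error values and exampleNow are illustrative names, and the clock is passed in so the window is testable.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"time"
)

const tolerance = 5 * time.Minute

var (
	ErrStale        = errors.New("timestamp missing, malformed, or outside tolerance")
	ErrBadSignature = errors.New("signature mismatch")
)

// exampleNow is a fixed clock value for demonstrations; pass time.Now() in real code.
var exampleNow = time.Date(2024, 9, 11, 3, 15, 0, 0, time.UTC)

// sign reproduces the producer's scheme: HMAC-SHA256 over "<timestamp>.<body>".
func sign(secret []byte, timestamp string, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(timestamp))
	mac.Write([]byte("."))
	mac.Write(body)
	return "v1=" + hex.EncodeToString(mac.Sum(nil))
}

// Verify checks the timestamp window, then compares signatures in constant time.
func Verify(secret []byte, timestamp, signature string, body []byte, now time.Time) error {
	ts, err := time.Parse(time.RFC3339, timestamp)
	if err != nil {
		return ErrStale
	}
	if d := now.Sub(ts); d > tolerance || d < -tolerance {
		return ErrStale
	}
	expected := sign(secret, timestamp, body)
	// hmac.Equal is the standard library's constant-time comparison.
	if !hmac.Equal([]byte(signature), []byte(expected)) {
		return ErrBadSignature
	}
	return nil
}
```

As in the Express handler, verify against the raw request bytes before any JSON parsing. Re-serialized JSON rarely matches the original byte for byte.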

Observability for outbound webhooks

You need a queryable record of every delivery attempt. Not just the final outcome — every attempt, with response code, response body (truncated), latency, and the attempt number.

The minimum table:

CREATE TABLE webhook_deliveries (
  id BIGSERIAL PRIMARY KEY,
  event_id TEXT NOT NULL,
  endpoint_id TEXT NOT NULL,
  attempt INT NOT NULL,
  attempted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  status_code INT,
  latency_ms INT,
  error TEXT,
  response_body_truncated TEXT
);

CREATE INDEX ON webhook_deliveries (event_id);
CREATE INDEX ON webhook_deliveries (endpoint_id, attempted_at);

When the customer asks “did you deliver event X to my endpoint,” the answer is a single query, not a war room. When you want to find endpoints that are consistently slow or failing, the query is straightforward. When the compliance team asks for delivery records, you have them.
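On the producer side, each attempt has to be turned into one of those rows. A rough sketch of that step is below. AttemptRecord, NewAttemptRecord, readCapped and the 2048-byte limit are assumptions for illustration, not part of the schema above.

```go
package main

import (
	"io"
	"strings"
	"time"
)

// maxBodyBytes caps how much of the receiver's response is stored (assumed value).
const maxBodyBytes = 2048

// AttemptRecord mirrors one row of webhook_deliveries.
type AttemptRecord struct {
	EventID               string
	EndpointID            string
	Attempt               int
	StatusCode            int // 0 means no response arrived; store it as NULL
	LatencyMS             int64
	Error                 string
	ResponseBodyTruncated string
}

// readCapped reads at most maxBodyBytes from r, so a huge response
// cannot bloat the delivery log or the worker's memory.
func readCapped(r io.Reader) ([]byte, error) {
	return io.ReadAll(io.LimitReader(r, maxBodyBytes))
}

// NewAttemptRecord builds the row for one delivery attempt.
func NewAttemptRecord(eventID, endpointID string, attempt, status int,
	latency time.Duration, body []byte, err error) AttemptRecord {
	if len(body) > maxBodyBytes {
		body = body[:maxBodyBytes]
	}
	rec := AttemptRecord{
		EventID:    eventID,
		EndpointID: endpointID,
		Attempt:    attempt,
		StatusCode: status,
		LatencyMS:  latency.Milliseconds(),
		// Drop bytes left invalid by cutting mid-rune, since a TEXT column rejects them.
		ResponseBodyTruncated: strings.ToValidUTF8(string(body), ""),
	}
	if err != nil {
		rec.Error = err.Error()
	}
	return rec
}
```

Write the row on every attempt, including the ones that never got a response. Those are the rows you’ll want during an incident.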

Common pitfalls

What I see go wrong in the wild.

Synchronous delivery in the business handler. The transaction that creates the invoice also tries to POST the webhook. Receiver is slow, transaction holds a lock, things grind. Always decouple. Write a delivery record in the same transaction, deliver async.
No max-attempt cap. Retrying for a week eats queue capacity and floods endpoints. 24 hours is enough. After that, dead-letter and notify.
Missing idempotency key in payload. Without it, the receiver can’t dedupe even if they want to. The producer owes them this.
Putting secrets in the URL. I’ve seen webhook URLs like https://hooks.example.com/abc?token=.... The token ends up in everyone’s access logs. Sign in headers, don’t put auth in the URL.
No timeout on the HTTP client. The default Go http.Client has no timeout. A slow receiver will hold your worker forever. Always set Timeout.
Ignoring 410 Gone. Receivers should be able to permanently turn off an endpoint. Honour 410. Disable the endpoint and stop trying.
Hiding retries from the receiver. Sending the full payload unchanged on retries is correct, but put the attempt counter in a header so the receiver can detect retries and decide. Some receivers want to behave differently on first delivery vs retry.
Not testing receiver downtime. Run a chaos drill — block your receiver for 30 minutes during a working day, verify the queue drains correctly afterward. The first time you discover backpressure is broken should not be in a real outage.
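Before a full chaos drill, the retry path can be exercised in a unit test with a fake transport that simulates an outage. The sketch below is one way to do it. flakyTransport, deliverUntilOK, newDrillClient and the receiver.test URL are hypothetical, and the loop skips the backoff sleep a real worker would do between attempts.

```go
package main

import (
	"bytes"
	"errors"
	"net/http"
)

// flakyTransport answers the first `failures` requests with 503, then 200.
// No sockets are opened; the transport fabricates the responses.
type flakyTransport struct {
	failures int
	calls    int
}

func (f *flakyTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	f.calls++
	status := http.StatusOK
	if f.calls <= f.failures {
		status = http.StatusServiceUnavailable
	}
	return &http.Response{
		StatusCode: status,
		Header:     make(http.Header),
		Body:       http.NoBody,
		Request:    req,
	}, nil
}

// newDrillClient returns a client wired to a receiver that is "down"
// for the given number of requests.
func newDrillClient(failures int) (*http.Client, *flakyTransport) {
	ft := &flakyTransport{failures: failures}
	return &http.Client{Transport: ft}, ft
}

// deliverUntilOK posts payload until a 2xx arrives or maxAttempts is used up.
// It returns the number of attempts made.
func deliverUntilOK(client *http.Client, url string, payload []byte, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		res, err := client.Post(url, "application/json", bytes.NewReader(payload))
		if err != nil {
			continue // network error: treat as retryable
		}
		res.Body.Close()
		if res.StatusCode >= 200 && res.StatusCode < 300 {
			return attempt, nil
		}
	}
	return maxAttempts, errors.New("retries exhausted")
}
```

For example, newDrillClient(3) followed by deliverUntilOK(client, "http://receiver.test/hook", payload, 10) should succeed on the fourth attempt. This covers the retry logic only; it doesn’t replace blocking a real receiver and watching the queue drain.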

What’s next

Outbound webhooks are the kind of system where the patterns are well-known but the implementation discipline is what separates “works in tests” from “works under load.” Retries with jitter, HMAC signatures, idempotency keys, per-attempt logging. None of this is exotic. All of it is required.

Next post in the series goes the other direction — building reusable connectors for n8n and Make that wrap your internal APIs and your webhook receivers behind a single drop-in node. That’s where the leverage really shows up for citizen-developer enablement. See you Monday.
