webhooks · reliability · retry · backpressure · incident response

Webhook Retry Storms: How Cascading Failures Amplify Outages

When a destination goes down, every upstream sender starts retrying simultaneously. Here's how retry storms form, why they extend outages, and the patterns that prevent them.

Marcus Webb
Platform Engineer
April 4, 2026
9 min read

Retry logic is supposed to make webhook delivery more reliable. In practice, poorly designed retry behavior can turn a 5-minute outage into a 45-minute one. The mechanism is a retry storm: the destination goes down, senders keep retrying, and when the destination comes back up, it immediately gets hit by every backlogged retry at once — and crashes again.

This post explains exactly how retry storms form, the specific design choices that make them worse, and the patterns that break the cycle.


How a Retry Storm Develops

The sequence is predictable:

  1. Your destination endpoint experiences a spike, a bad deploy, or a dependency failure. It starts returning 503.
  2. Every sender that receives a 503 schedules a retry. If you have 10 upstream senders, all 10 are now retrying on the same schedule.
  3. The destination recovers after 5 minutes. The backlog of retries — all scheduled at roughly the same time — fires simultaneously.
  4. The destination, which just barely recovered, gets hit with 10× its normal request rate in the first 30 seconds. It crashes again.
  5. All 10 senders see the new 503s and schedule another round of retries.

This is a textbook thundering herd problem. The destination's recovery triggers the next failure. Without intervention, the system oscillates between brief recovery windows and repeated crashes until someone manually drains the retry queue or sheds load.


The Design Choices That Make It Worse

Not all retry implementations are equally dangerous. The following choices amplify retry storm severity:

  - Fixed retry interval: all retries from all senders fire at the same time.
  - No jitter on backoff: retries stay synchronized even with exponential backoff.
  - Unbounded retry concurrency: no cap on how many retries can be in flight simultaneously.
  - Immediate first retry: amplifies the initial spike before backoff kicks in.
  - Retrying on 429 Too Many Requests on your own schedule: adds to the load that caused the rate limit in the first place.
  - Same retry schedule for all destinations: a slow destination blocks retries for healthy ones.

The most common mistake is implementing exponential backoff without jitter, which looks like this:

Attempt 1: wait 30s → all senders retry at T+30
Attempt 2: wait 60s → all senders retry at T+90
Attempt 3: wait 120s → all senders retry at T+210

If your destination went down at T=0 and all your senders ingested events in the same 5-minute window, they will all be on the same retry schedule. Exponential backoff just spaces out the synchronized waves — it doesn't break them up.


The Fix: Full Jitter Backoff

The correct approach is to add randomness — jitter — to each retry interval so that retries from different senders (and different events from the same sender) spread out over time.

Full jitter picks a random value between 0 and the computed backoff interval:

go
package delivery

import (
    "math/rand"
    "time"
)

func retryDelay(attempt int, baseDelay, maxDelay time.Duration) time.Duration {
    // Exponential backoff: base * 2^attempt, doubled iteratively so a
    // large attempt number can't overflow a bit shift.
    exp := baseDelay
    for i := 0; i < attempt && exp < maxDelay; i++ {
        exp *= 2
    }
    if exp > maxDelay {
        exp = maxDelay
    }
    // Full jitter: uniform random in [0, exp). rand.Int63n panics on a
    // non-positive argument, so baseDelay must be positive.
    return time.Duration(rand.Int63n(int64(exp)))
}

With full jitter and a base delay of 30 seconds, five attempts from a single sender might fire at:

Attempt 1: T + 18s
Attempt 2: T + 47s
Attempt 3: T + 112s
Attempt 4: T + 380s
Attempt 5: T + 1150s

Across 10 senders with independent jitter, the retries spread across the entire backoff window rather than concentrating at the same moment. When the destination recovers, it sees a gradual ramp-up instead of a spike.

The AWS Architecture Blog's "Exponential Backoff And Jitter" post is the foundational reference here: "equal jitter" is better than no jitter, but "full jitter" produces the best load distribution across the retry window. Use full jitter.
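
To make the contrast concrete, here is a small standalone simulation (entirely illustrative; the ten senders and the choice of the third attempt are arbitrary) comparing the fixed schedule against full jitter:

go
package main

import (
    "fmt"
    "math/rand"
    "time"
)

const base = 30 * time.Second

// retryAt returns the cumulative time of the Nth retry, either on a
// fixed exponential schedule or with full jitter applied per attempt.
func retryAt(attempt int, jittered bool) time.Duration {
    var total time.Duration
    for i := 1; i <= attempt; i++ {
        exp := base * (1 << i) // 60s, 120s, 240s, ...
        if jittered {
            total += time.Duration(rand.Int63n(int64(exp)))
        } else {
            total += exp
        }
    }
    return total
}

func main() {
    for sender := 1; sender <= 10; sender++ {
        // Fixed: every sender prints the same time. Jittered: spread out.
        fmt.Printf("sender %2d: fixed=%-8v jittered=%v\n",
            sender, retryAt(3, false), retryAt(3, true))
    }
}

With the fixed schedule, every sender prints the same cumulative time (7m0s for the third attempt); with jitter, the times scatter across the window.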


Per-Destination Retry Isolation

A subtler problem: if your retry worker uses a shared queue without per-destination isolation, a stormy destination starves healthy ones.

Imagine destination A is down and has 50,000 queued retries. Destination B is healthy but shares the same worker pool. Workers spend most of their time attempting (and failing) delivery to A, while B's events pile up waiting for worker capacity.

The fix is per-destination retry queues with independent concurrency limits:

sql
-- Job queue with per-destination tracking
CREATE TABLE delivery_jobs (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_id        UUID NOT NULL REFERENCES events(id),
    destination_id  UUID NOT NULL REFERENCES destinations(id),
    attempt_number  INT NOT NULL DEFAULT 1,
    scheduled_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    locked_at       TIMESTAMPTZ,
    locked_by       TEXT
);

CREATE INDEX delivery_jobs_by_destination ON delivery_jobs
    (destination_id, scheduled_at)
    WHERE locked_at IS NULL;
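
Writing a job into this table is a plain insert. A minimal sketch, assuming Go's database/sql with a Postgres driver (the function name is ours):

go
package queue

import "database/sql"

// enqueue schedules the first delivery attempt for an event; the
// attempt_number and scheduled_at column defaults do the rest.
func enqueue(db *sql.DB, eventID, destinationID string) error {
    _, err := db.Exec(
        `INSERT INTO delivery_jobs (event_id, destination_id)
         VALUES ($1, $2)`,
        eventID, destinationID,
    )
    return err
}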

Your worker poll query should enforce a per-destination concurrency limit:

sql
WITH in_flight AS (
    SELECT destination_id, COUNT(*) AS count
    FROM delivery_jobs
    WHERE locked_at IS NOT NULL
      AND locked_at > now() - INTERVAL '2 minutes'
    GROUP BY destination_id
)
SELECT dj.*
FROM delivery_jobs dj
LEFT JOIN in_flight inf ON inf.destination_id = dj.destination_id
WHERE dj.locked_at IS NULL
  AND dj.scheduled_at <= now()
  AND (inf.count IS NULL OR inf.count < 10)  -- max 10 in-flight per destination
ORDER BY dj.scheduled_at
LIMIT 100
FOR UPDATE OF dj SKIP LOCKED;

This caps how many simultaneous delivery attempts any single destination can absorb, regardless of how large its backlog is. Destination A's storm can no longer consume all worker capacity.
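
To show how the poll query gets used, here is a hedged sketch of the claim step, again assuming database/sql with Postgres. The select and the lock update run in one transaction so a claimed job is never visible to two workers; pollQuery stands for the statement above with the select list narrowed to dj.id:

go
package queue

import "database/sql"

// pollQuery is the SELECT ... FOR UPDATE SKIP LOCKED statement shown
// above, narrowed to return only dj.id (elided here for brevity).
const pollQuery = `...`

// claimBatch locks up to 100 due jobs for this worker and returns
// their ids. Select and update share one transaction.
func claimBatch(db *sql.DB, workerID string) ([]string, error) {
    tx, err := db.Begin()
    if err != nil {
        return nil, err
    }
    defer tx.Rollback() // no-op after a successful Commit

    rows, err := tx.Query(pollQuery)
    if err != nil {
        return nil, err
    }
    var ids []string
    for rows.Next() {
        var id string
        if err := rows.Scan(&id); err != nil {
            rows.Close()
            return nil, err
        }
        ids = append(ids, id)
    }
    rows.Close()
    if err := rows.Err(); err != nil {
        return nil, err
    }

    // Mark each claimed job so other workers (and the in_flight CTE)
    // can see it is taken.
    for _, id := range ids {
        if _, err := tx.Exec(
            `UPDATE delivery_jobs
             SET locked_at = now(), locked_by = $1
             WHERE id = $2`,
            workerID, id,
        ); err != nil {
            return nil, err
        }
    }
    return ids, tx.Commit()
}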


Detecting a Retry Storm in Progress

Your observability stack should surface retry storms before they become extended incidents. The signal to watch is the ratio of retry attempts to first attempts for a given destination over a rolling window.

sql
SELECT
    destination_id,
    COUNT(*) FILTER (WHERE attempt_number = 1) AS first_attempts,
    COUNT(*) FILTER (WHERE attempt_number > 1) AS retry_attempts,
    ROUND(
        COUNT(*) FILTER (WHERE attempt_number > 1)::numeric /
        NULLIF(COUNT(*) FILTER (WHERE attempt_number = 1), 0),
        2
    ) AS retry_ratio
FROM delivery_attempts
WHERE created_at > now() - INTERVAL '15 minutes'
GROUP BY destination_id
HAVING COUNT(*) FILTER (WHERE attempt_number > 1) > 100
ORDER BY retry_ratio DESC;

A retry ratio above 5 for a given destination over a 15-minute window is a clear signal that something is wrong. Alert on this. Do not wait for the dead-letter queue to fill up.

GetHook surfaces destination health in the dashboard — when a destination's error rate crosses a threshold, it's flagged automatically so you can act before the retry backlog grows large enough to become a storm.


Circuit Breakers as a Last Resort

When a destination's failure rate is high enough that retrying is actively harmful, a circuit breaker can stop the retry loop entirely. The circuit opens (stops sending) when failures exceed a threshold, and probes periodically with a single request to check for recovery.

The mechanics are straightforward:

  - Closed (normal): all attempts proceed; opens when the error rate exceeds the threshold over the window.
  - Open (failing): no attempts are made and deliveries fail immediately; moves to half-open after a cooldown period.
  - Half-open (probing): a single attempt is allowed through; closes if it succeeds, reopens if it fails.
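
A minimal sketch of that state machine in Go. For brevity it opens on a consecutive-failure count rather than a windowed error rate, and the threshold and cooldown values are illustrative, not recommendations:

go
package delivery

import (
    "errors"
    "sync"
    "time"
)

type state int

const (
    closed state = iota
    open
    halfOpen
)

// ErrOpen is returned while the breaker is refusing attempts.
var ErrOpen = errors.New("circuit open: skipping delivery attempt")

// Breaker guards a single destination.
type Breaker struct {
    mu        sync.Mutex
    state     state
    failures  int
    openedAt  time.Time
    threshold int           // consecutive failures before opening
    cooldown  time.Duration // how long to stay open before probing
}

func NewBreaker() *Breaker {
    return &Breaker{threshold: 5, cooldown: 60 * time.Second}
}

// Allow reports whether a delivery attempt may proceed right now.
func (b *Breaker) Allow() error {
    b.mu.Lock()
    defer b.mu.Unlock()
    switch b.state {
    case open:
        if time.Since(b.openedAt) < b.cooldown {
            return ErrOpen
        }
        b.state = halfOpen // cooldown elapsed: admit a single probe
        return nil
    case halfOpen:
        return ErrOpen // a probe is already in flight
    default:
        return nil
    }
}

// Record feeds an attempt's outcome back into the state machine.
func (b *Breaker) Record(success bool) {
    b.mu.Lock()
    defer b.mu.Unlock()
    switch {
    case success:
        b.state, b.failures = closed, 0
    case b.state == halfOpen:
        b.state, b.openedAt = open, time.Now() // probe failed: reopen
    default:
        b.failures++
        if b.failures >= b.threshold {
            b.state, b.openedAt = open, time.Now()
        }
    }
}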

Circuit breakers complement retry logic — they are not a replacement for it. The right hierarchy is: full jitter backoff to space out retries, per-destination concurrency limits to prevent queue starvation, and circuit breakers to stop retrying a destination that is clearly not recovering.

What to avoid: a circuit breaker that opens so aggressively it triggers on transient blips, or one with a cooldown so long that recovered destinations stay blocked for hours.


The Retry Policy Knobs That Matter

When designing or configuring retry behavior, these are the parameters with the most impact on storm risk:

  - Max attempts: 5. Beyond five, you're likely looking at a long-term outage, not a transient failure.
  - Base delay: 30 seconds. Enough time for most transient failures to resolve.
  - Max delay: 1 hour. Caps the backoff ceiling and prevents multi-day retry windows.
  - Jitter strategy: full jitter. Best load distribution during recovery.
  - Per-destination concurrency: 10. Prevents one bad destination from consuming all workers.
  - Retry on 4xx: never, except 429. 4xx errors are consumer bugs, not transient failures.
  - Retry on 429: yes, but respect Retry-After. Honor the destination's backpressure signal.

The Retry-After header on 429 responses deserves special attention. If your destination is rate-limiting you, the correct behavior is to schedule the retry at the time the destination says it will be ready — not at your standard exponential backoff interval. Ignoring Retry-After and retrying on your own schedule makes you part of the overload problem you're trying to recover from.
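
A sketch of what honoring the header looks like in Go. The header can carry either delta-seconds or an HTTP date; when it is missing or unparsable, this falls back to the full-jitter retryDelay from earlier (nextDelay is our name, not a standard API):

go
package delivery

import (
    "net/http"
    "strconv"
    "time"
)

// nextDelay prefers the destination's Retry-After header over our own
// backoff schedule. retryDelay is the full-jitter function above.
func nextDelay(resp *http.Response, attempt int) time.Duration {
    if ra := resp.Header.Get("Retry-After"); ra != "" {
        // Form 1: delta-seconds, e.g. "Retry-After: 120"
        if secs, err := strconv.Atoi(ra); err == nil && secs >= 0 {
            return time.Duration(secs) * time.Second
        }
        // Form 2: HTTP date, e.g. "Retry-After: Tue, 07 Apr 2026 16:00:00 GMT"
        if t, err := http.ParseTime(ra); err == nil {
            if d := time.Until(t); d > 0 {
                return d
            }
        }
    }
    // No usable header: fall back to full-jitter backoff.
    return retryDelay(attempt, 30*time.Second, time.Hour)
}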


What Happens at the Dead-Letter Queue

Events that exhaust all retry attempts land in the dead-letter queue. A large dead-letter queue is a lagging indicator of a retry storm — it tells you the storm already happened and ran to completion.

The more useful signal is the retry queue depth, not the DLQ depth. Monitor the number of pending retries per destination: in the schema above, that's the count of unlocked delivery_jobs rows with attempt_number > 1 for a given destination_id. A rapidly growing retry queue for a specific destination is an early-warning signal, not a post-mortem artifact.

When you do need to replay from the DLQ after a storm, replay gradually — not all at once. A bulk replay that dumps 50,000 events onto a destination that just recovered is the same thundering herd problem in a different disguise. Replay with a rate limit: 100 events/minute, observe the destination's response rate, and ramp up only once it's clearly stable.
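
A sketch of that ramp-limited replay loop, where replayOne and destinationHealthy are hypothetical hooks standing in for your delivery call and health check:

go
package delivery

import (
    "context"
    "time"
)

// replayDLQ re-delivers dead-lettered events at a fixed rate and backs
// off whenever the destination looks unhealthy again. replayOne and
// destinationHealthy are hypothetical hooks, not a real API.
func replayDLQ(ctx context.Context, eventIDs []string,
    replayOne func(id string) error,
    destinationHealthy func() bool,
) error {
    // 100 events/minute: one event every 600ms.
    ticker := time.NewTicker(600 * time.Millisecond)
    defer ticker.Stop()

    for _, id := range eventIDs {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
        // Pause entirely if the destination degrades mid-replay; a bulk
        // dump here would recreate the thundering herd.
        for !destinationHealthy() {
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(30 * time.Second):
            }
        }
        // A failed replay should go back through normal retry logic
        // rather than tight-looping here.
        _ = replayOne(id)
    }
    return nil
}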


Retry storms are a reliability failure that looks like a destination problem but is often an infrastructure design problem. The destination went down briefly, but the retry behavior is what turned it into an extended outage. Full jitter, per-destination isolation, and circuit breakers are not optional polish — they are the mechanism by which your retry logic stays beneficial rather than becoming the source of its own incidents.

If you want retry behavior that includes full jitter, per-destination concurrency limits, and automatic dead-letter handling without building it yourself, see how GetHook handles delivery and retry →
