webhooks · reliability · retry · backpressure · incident response

Webhook Retry Storms: How Cascading Failures Amplify Outages

When a destination goes down, every upstream sender starts retrying simultaneously. Here's how retry storms form, why they extend outages, and the patterns that prevent them.

Marcus Webb
Platform Engineer
April 4, 2026
9 min read

Retry logic is supposed to make webhook delivery more reliable. In practice, poorly designed retry behavior can turn a 5-minute outage into a 45-minute one. The mechanism is a retry storm: the destination goes down, senders keep retrying, and when the destination comes back up, it immediately gets hit by every backlogged retry at once — and crashes again.

This post explains exactly how retry storms form, the specific design choices that make them worse, and the patterns that break the cycle.


How a Retry Storm Develops

The sequence is predictable:

  1. Your destination endpoint experiences a spike, a bad deploy, or a dependency failure. It starts returning 503.
  2. Every sender that receives a 503 schedules a retry. If you have 10 upstream senders, all 10 are now retrying on the same schedule.
  3. The destination recovers after 5 minutes. The backlog of retries — all scheduled at roughly the same time — fires simultaneously.
  4. The destination, which just barely recovered, gets hit with 10× its normal request rate in the first 30 seconds. It crashes again.
  5. All 10 senders see the new 503s and schedule another round of retries.

This is a textbook thundering herd problem. The destination's recovery triggers the next failure. Without intervention, the system oscillates between brief recovery windows and repeated crashes until someone manually drains the retry queue or sheds load.


The Design Choices That Make It Worse

Not all retry implementations are equally dangerous. The following choices amplify retry storm severity:

  - Fixed retry interval: all retries from all senders fire at the same time.
  - No jitter on backoff: retries stay synchronized even with exponential backoff.
  - Unbounded retry concurrency: no cap on how many retries can be in flight simultaneously.
  - Immediate first retry: amplifies the initial spike before backoff kicks in.
  - Retrying on 429 Too Many Requests on your own schedule: adds to the load that caused the rate limit in the first place.
  - Same retry schedule for all destinations: a slow destination blocks retries for healthy ones.

The most common mistake is implementing exponential backoff without jitter, which looks like this:

Attempt 1: wait 30s → all senders retry at T+30
Attempt 2: wait 60s → all senders retry at T+90
Attempt 3: wait 120s → all senders retry at T+210

If your destination went down at T=0 and all your senders ingested events in the same 5-minute window, they will all be on the same retry schedule. Exponential backoff just spaces out the synchronized waves — it doesn't break them up.


The Fix: Full Jitter Backoff

The correct approach is to add randomness — jitter — to each retry interval so that retries from different senders (and different events from the same sender) spread out over time.

Full jitter picks a random value between 0 and the computed backoff interval:

go
package delivery

import (
    "math/rand"
    "time"
)

func retryDelay(attempt int, baseDelay, maxDelay time.Duration) time.Duration {
    // Exponential backoff: base * 2^attempt, doubled iteratively so a
    // large attempt number can't overflow a bit shift.
    exp := baseDelay
    for i := 0; i < attempt && exp < maxDelay; i++ {
        exp *= 2
    }
    if exp > maxDelay {
        exp = maxDelay
    }
    // Full jitter: uniform random in [0, exp). rand.Int63n panics on a
    // non-positive argument, so baseDelay must be positive.
    return time.Duration(rand.Int63n(int64(exp)))
}

With full jitter and a base delay of 30 seconds, five attempts from a single sender might fire at:

Attempt 1: T + 18s
Attempt 2: T + 47s
Attempt 3: T + 112s
Attempt 4: T + 380s
Attempt 5: T + 1150s

Across 10 senders with independent jitter, the retries spread across the entire backoff window rather than concentrating at the same moment. When the destination recovers, it sees a gradual ramp-up instead of a spike.

The AWS Architecture Blog's "Exponential Backoff And Jitter" post is the foundational reference here: "equal jitter" is better than no jitter, but "full jitter" produces the best load distribution across the retry window. Use full jitter.
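
To make the contrast concrete, here is a small standalone simulation (entirely illustrative; the ten senders and the choice of the third attempt are arbitrary) comparing the fixed schedule against full jitter:

go
package main

import (
    "fmt"
    "math/rand"
    "time"
)

const base = 30 * time.Second

// retryAt returns the cumulative time of the Nth retry, either on a
// fixed exponential schedule or with full jitter applied per attempt.
func retryAt(attempt int, jittered bool) time.Duration {
    var total time.Duration
    for i := 1; i <= attempt; i++ {
        exp := base * (1 << i) // 60s, 120s, 240s, ...
        if jittered {
            total += time.Duration(rand.Int63n(int64(exp)))
        } else {
            total += exp
        }
    }
    return total
}

func main() {
    for sender := 1; sender <= 10; sender++ {
        // Fixed: every sender prints the same time. Jittered: spread out.
        fmt.Printf("sender %2d: fixed=%-8v jittered=%v\n",
            sender, retryAt(3, false), retryAt(3, true))
    }
}

With the fixed schedule, every sender prints the same cumulative time (7m0s for the third attempt); with jitter, the times scatter across the window.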


Per-Destination Retry Isolation

A subtler problem: if your retry worker uses a shared queue without per-destination isolation, a stormy destination starves healthy ones.

Imagine destination A is down and has 50,000 queued retries. Destination B is healthy but shares the same worker pool. Workers spend most of their time attempting (and failing) delivery to A, while B's events pile up waiting for worker capacity.

The fix is per-destination retry queues with independent concurrency limits:

sql
-- Job queue with per-destination tracking
CREATE TABLE delivery_jobs (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_id        UUID NOT NULL REFERENCES events(id),
    destination_id  UUID NOT NULL REFERENCES destinations(id),
    attempt_number  INT NOT NULL DEFAULT 1,
    scheduled_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    locked_at       TIMESTAMPTZ,
    locked_by       TEXT
);

CREATE INDEX delivery_jobs_by_destination ON delivery_jobs
    (destination_id, scheduled_at)
    WHERE locked_at IS NULL;
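
Writing a job into this table is a plain insert. A minimal sketch, assuming Go's database/sql with a Postgres driver (the function name is ours):

go
package queue

import "database/sql"

// enqueue schedules the first delivery attempt for an event; the
// attempt_number and scheduled_at column defaults do the rest.
func enqueue(db *sql.DB, eventID, destinationID string) error {
    _, err := db.Exec(
        `INSERT INTO delivery_jobs (event_id, destination_id)
         VALUES ($1, $2)`,
        eventID, destinationID,
    )
    return err
}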

Your worker poll query should enforce a per-destination concurrency limit:

sql
WITH in_flight AS (
    SELECT destination_id, COUNT(*) AS count
    FROM delivery_jobs
    WHERE locked_at IS NOT NULL
      AND locked_at > now() - INTERVAL '2 minutes'
    GROUP BY destination_id
)
SELECT dj.*
FROM delivery_jobs dj
LEFT JOIN in_flight inf ON inf.destination_id = dj.destination_id
WHERE dj.locked_at IS NULL
  AND dj.scheduled_at <= now()
  AND (inf.count IS NULL OR inf.count < 10)  -- max 10 in-flight per destination
ORDER BY dj.scheduled_at
LIMIT 100
FOR UPDATE OF dj SKIP LOCKED;

This caps how many simultaneous delivery attempts any single destination can absorb, regardless of how large its backlog is. Destination A's storm can no longer consume all worker capacity.
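
To show how the poll query gets used, here is a hedged sketch of the claim step, again assuming database/sql with Postgres. The select and the lock update run in one transaction so a claimed job is never visible to two workers; pollQuery stands for the statement above with the select list narrowed to dj.id:

go
package queue

import "database/sql"

// pollQuery is the SELECT ... FOR UPDATE SKIP LOCKED statement shown
// above, narrowed to return only dj.id (elided here for brevity).
const pollQuery = `...`

// claimBatch locks up to 100 due jobs for this worker and returns
// their ids. Select and update share one transaction.
func claimBatch(db *sql.DB, workerID string) ([]string, error) {
    tx, err := db.Begin()
    if err != nil {
        return nil, err
    }
    defer tx.Rollback() // no-op after a successful Commit

    rows, err := tx.Query(pollQuery)
    if err != nil {
        return nil, err
    }
    var ids []string
    for rows.Next() {
        var id string
        if err := rows.Scan(&id); err != nil {
            rows.Close()
            return nil, err
        }
        ids = append(ids, id)
    }
    rows.Close()
    if err := rows.Err(); err != nil {
        return nil, err
    }

    // Mark each claimed job so other workers (and the in_flight CTE)
    // can see it is taken.
    for _, id := range ids {
        if _, err := tx.Exec(
            `UPDATE delivery_jobs
             SET locked_at = now(), locked_by = $1
             WHERE id = $2`,
            workerID, id,
        ); err != nil {
            return nil, err
        }
    }
    return ids, tx.Commit()
}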


Detecting a Retry Storm in Progress

Your observability stack should surface retry storms before they become extended incidents. The signal to watch is the ratio of retry attempts to first attempts for a given destination over a rolling window.

sql
SELECT
    destination_id,
    COUNT(*) FILTER (WHERE attempt_number = 1) AS first_attempts,
    COUNT(*) FILTER (WHERE attempt_number > 1) AS retry_attempts,
    ROUND(
        COUNT(*) FILTER (WHERE attempt_number > 1)::numeric /
        NULLIF(COUNT(*) FILTER (WHERE attempt_number = 1), 0),
        2
    ) AS retry_ratio
FROM delivery_attempts
WHERE created_at > now() - INTERVAL '15 minutes'
GROUP BY destination_id
HAVING COUNT(*) FILTER (WHERE attempt_number > 1) > 100
ORDER BY retry_ratio DESC;

A retry ratio above 5 for a given destination over a 15-minute window is a clear signal that something is wrong. Alert on this. Do not wait for the dead-letter queue to fill up.

GetHook surfaces destination health in the dashboard — when a destination's error rate crosses a threshold, it's flagged automatically so you can act before the retry backlog grows large enough to become a storm.


Circuit Breakers as a Last Resort

When a destination's failure rate is high enough that retrying is actively harmful, a circuit breaker can stop the retry loop entirely. The circuit opens (stops sending) when failures exceed a threshold, and probes periodically with a single request to check for recovery.

The mechanics are straightforward:

  - Closed (normal): all attempts proceed; opens when the error rate exceeds the threshold over the window.
  - Open (failing): no attempts are made and deliveries fail immediately; moves to half-open after a cooldown period.
  - Half-open (probing): a single attempt is allowed through; closes if it succeeds, reopens if it fails.
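
A minimal sketch of that state machine in Go. For brevity it opens on a consecutive-failure count rather than a windowed error rate, and the threshold and cooldown values are illustrative, not recommendations:

go
package delivery

import (
    "errors"
    "sync"
    "time"
)

type state int

const (
    closed state = iota
    open
    halfOpen
)

// ErrOpen is returned while the breaker is refusing attempts.
var ErrOpen = errors.New("circuit open: skipping delivery attempt")

// Breaker guards a single destination.
type Breaker struct {
    mu        sync.Mutex
    state     state
    failures  int
    openedAt  time.Time
    threshold int           // consecutive failures before opening
    cooldown  time.Duration // how long to stay open before probing
}

func NewBreaker() *Breaker {
    return &Breaker{threshold: 5, cooldown: 60 * time.Second}
}

// Allow reports whether a delivery attempt may proceed right now.
func (b *Breaker) Allow() error {
    b.mu.Lock()
    defer b.mu.Unlock()
    switch b.state {
    case open:
        if time.Since(b.openedAt) < b.cooldown {
            return ErrOpen
        }
        b.state = halfOpen // cooldown elapsed: admit a single probe
        return nil
    case halfOpen:
        return ErrOpen // a probe is already in flight
    default:
        return nil
    }
}

// Record feeds an attempt's outcome back into the state machine.
func (b *Breaker) Record(success bool) {
    b.mu.Lock()
    defer b.mu.Unlock()
    switch {
    case success:
        b.state, b.failures = closed, 0
    case b.state == halfOpen:
        b.state, b.openedAt = open, time.Now() // probe failed: reopen
    default:
        b.failures++
        if b.failures >= b.threshold {
            b.state, b.openedAt = open, time.Now()
        }
    }
}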

Circuit breakers complement retry logic — they are not a replacement for it. The right hierarchy is: full jitter backoff to space out retries, per-destination concurrency limits to prevent queue starvation, and circuit breakers to stop retrying a destination that is clearly not recovering.

What to avoid: a circuit breaker that opens so aggressively it triggers on transient blips, or one with a cooldown so long that recovered destinations stay blocked for hours.


The Retry Policy Knobs That Matter

When designing or configuring retry behavior, these are the parameters with the most impact on storm risk:

  - Max attempts: 5. Beyond five, you're likely looking at a long-term outage, not a transient failure.
  - Base delay: 30 seconds. Enough time for most transient failures to resolve.
  - Max delay: 1 hour. Caps the backoff ceiling and prevents multi-day retry windows.
  - Jitter strategy: full jitter. Best load distribution during recovery.
  - Per-destination concurrency: 10. Prevents one bad destination from consuming all workers.
  - Retry on 4xx: never, except 429. 4xx errors are consumer bugs, not transient failures.
  - Retry on 429: yes, but respect Retry-After. Honor the destination's backpressure signal.

The Retry-After header on 429 responses deserves special attention. If your destination is rate-limiting you, the correct behavior is to schedule the retry at the time the destination says it will be ready — not at your standard exponential backoff interval. Ignoring Retry-After and retrying on your own schedule makes you part of the overload problem you're trying to recover from.
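
A sketch of what honoring the header looks like in Go. The header can carry either delta-seconds or an HTTP date; when it is missing or unparsable, this falls back to the full-jitter retryDelay from earlier (nextDelay is our name, not a standard API):

go
package delivery

import (
    "net/http"
    "strconv"
    "time"
)

// nextDelay prefers the destination's Retry-After header over our own
// backoff schedule. retryDelay is the full-jitter function above.
func nextDelay(resp *http.Response, attempt int) time.Duration {
    if ra := resp.Header.Get("Retry-After"); ra != "" {
        // Form 1: delta-seconds, e.g. "Retry-After: 120"
        if secs, err := strconv.Atoi(ra); err == nil && secs >= 0 {
            return time.Duration(secs) * time.Second
        }
        // Form 2: HTTP date, e.g. "Retry-After: Tue, 07 Apr 2026 16:00:00 GMT"
        if t, err := http.ParseTime(ra); err == nil {
            if d := time.Until(t); d > 0 {
                return d
            }
        }
    }
    // No usable header: fall back to full-jitter backoff.
    return retryDelay(attempt, 30*time.Second, time.Hour)
}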


What Happens at the Dead-Letter Queue

Events that exhaust all retry attempts land in the dead-letter queue. A large dead-letter queue is a lagging indicator of a retry storm — it tells you the storm already happened and ran to completion.

The more useful signal is the retry queue depth, not the DLQ depth. Monitor the number of pending retries per destination: in the schema above, that's the count of unlocked delivery_jobs rows with attempt_number > 1 for a given destination_id. A rapidly growing retry queue for a specific destination is an early-warning signal, not a post-mortem artifact.

When you do need to replay from the DLQ after a storm, replay gradually — not all at once. A bulk replay that dumps 50,000 events onto a destination that just recovered is the same thundering herd problem in a different disguise. Replay with a rate limit: 100 events/minute, observe the destination's response rate, and ramp up only once it's clearly stable.
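
A sketch of that ramp-limited replay loop, where replayOne and destinationHealthy are hypothetical hooks standing in for your delivery call and health check:

go
package delivery

import (
    "context"
    "time"
)

// replayDLQ re-delivers dead-lettered events at a fixed rate and backs
// off whenever the destination looks unhealthy again. replayOne and
// destinationHealthy are hypothetical hooks, not a real API.
func replayDLQ(ctx context.Context, eventIDs []string,
    replayOne func(id string) error,
    destinationHealthy func() bool,
) error {
    // 100 events/minute: one event every 600ms.
    ticker := time.NewTicker(600 * time.Millisecond)
    defer ticker.Stop()

    for _, id := range eventIDs {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
        // Pause entirely if the destination degrades mid-replay; a bulk
        // dump here would recreate the thundering herd.
        for !destinationHealthy() {
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(30 * time.Second):
            }
        }
        // A failed replay should go back through normal retry logic
        // rather than tight-looping here.
        _ = replayOne(id)
    }
    return nil
}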


Retry storms are a reliability failure that looks like a destination problem but is often an infrastructure design problem. The destination went down briefly, but the retry behavior is what turned it into an extended outage. Full jitter, per-destination isolation, and circuit breakers are not optional polish — they are the mechanism by which your retry logic stays beneficial rather than becoming the source of its own incidents.

If you want retry behavior that includes full jitter, per-destination concurrency limits, and automatic dead-letter handling without building it yourself, see how GetHook handles delivery and retry →
