Retry logic is supposed to make webhook delivery more reliable. In practice, poorly designed retry behavior can turn a 5-minute outage into a 45-minute one. The mechanism is a retry storm: the destination goes down, senders keep retrying, and when the destination comes back up, it immediately gets hit by every backlogged retry at once — and crashes again.
This post explains exactly how retry storms form, the specific design choices that make them worse, and the patterns that break the cycle.
How a Retry Storm Develops
The sequence is predictable:
1. Your destination endpoint experiences a spike, a bad deploy, or a dependency failure. It starts returning 503.
2. Every sender that receives a 503 schedules a retry. If you have 10 upstream senders, all 10 are now retrying on the same schedule.
3. The destination recovers after 5 minutes. The backlog of retries — all scheduled at roughly the same time — fires simultaneously.
4. The destination, which just barely recovered, gets hit with 10× its normal request rate in the first 30 seconds. It crashes again.
5. All 10 senders see the new 503s and schedule another round of retries.
This is a textbook thundering herd problem. The destination's recovery triggers the next failure. Without intervention, the system oscillates between brief recovery windows and repeated crashes until someone manually drains the retry queue or sheds load.
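The synchronization at the heart of this cycle is easy to model. The following is an illustrative sketch, not code from any real sender: ten senders all retrying on the same fixed 30-second interval, bucketed by the second their retries land.

```go
package main

import "fmt"

// retryBuckets models senders that all retry on the same fixed interval:
// it returns, for each second, how many retries land in that second.
func retryBuckets(senders, attempts, intervalSec int) map[int]int {
	hits := map[int]int{}
	for s := 0; s < senders; s++ {
		for a := 1; a <= attempts; a++ {
			hits[a*intervalSec]++
		}
	}
	return hits
}

func main() {
	// 10 senders, 3 retries each, fixed 30-second interval.
	for t, n := range retryBuckets(10, 3, 30) {
		fmt.Printf("t=%ds: %d simultaneous retries\n", t, n)
	}
	// Every retry wave arrives as a single synchronized spike of 10.
}
```

No retry ever arrives alone: the destination sees zero traffic, then the full wave, repeating on every interval.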
The Design Choices That Make It Worse
Not all retry implementations are equally dangerous. The following choices amplify retry storm severity:
| Design choice | Why it's dangerous |
|---|---|
| Fixed retry interval | All retries from all senders fire at the same time |
| No jitter on backoff | Synchronized retries even with exponential backoff |
| Unbounded retry concurrency | No cap on how many retries can be in-flight simultaneously |
| Immediate first retry | Amplifies initial spike before backoff kicks in |
| Retrying on 429 Too Many Requests | Adds to the load that caused the rate limit in the first place |
| Same retry schedule for all destinations | A slow destination blocks retries for healthy ones |
The most common mistake is implementing exponential backoff without jitter, which looks like this:

```
Attempt 1: wait 30s  → all senders retry at T+30
Attempt 2: wait 60s  → all senders retry at T+90
Attempt 3: wait 120s → all senders retry at T+210
```

If your destination went down at T=0 and all your senders ingested events in the same 5-minute window, they will all be on the same retry schedule. Exponential backoff just spaces out the synchronized waves — it doesn't break them up.
The Fix: Full Jitter Backoff
The correct approach is to add randomness — jitter — to each retry interval so that retries from different senders (and different events from the same sender) spread out over time.
Full jitter picks a random value between 0 and the computed backoff interval:
```go
import (
	"math/rand"
	"time"
)

// retryDelay returns a full-jitter backoff delay for the given attempt.
func retryDelay(attempt int, baseDelay, maxDelay time.Duration) time.Duration {
	// Exponential backoff ceiling: base * 2^attempt
	exp := baseDelay * (1 << attempt)
	if exp <= 0 || exp > maxDelay { // exp <= 0 guards against shift overflow
		exp = maxDelay
	}
	// Full jitter: uniform random in [0, exp)
	return time.Duration(rand.Int63n(int64(exp)))
}
```

With full jitter and a base delay of 30 seconds, five attempts from a single sender might fire at:
```
Attempt 1: T + 18s
Attempt 2: T + 47s
Attempt 3: T + 112s
Attempt 4: T + 380s
Attempt 5: T + 910s
```

Across 10 senders with independent jitter, the retries spread across the entire backoff window rather than concentrating at the same moment. When the destination recovers, it sees a gradual ramp-up instead of a spike.
The AWS Architecture Blog published foundational research on this: "equal jitter" is better than no jitter, but "full jitter" produces the best load distribution across the retry window. Use full jitter.
Per-Destination Retry Isolation
A subtler problem: if your retry worker uses a shared queue without per-destination isolation, a stormy destination starves healthy ones.
Imagine destination A is down and has 50,000 queued retries. Destination B is healthy but shares the same worker pool. Workers spend most of their time attempting (and failing) delivery to A, while B's events pile up waiting for worker capacity.
The fix is per-destination retry queues with independent concurrency limits:
```sql
-- Job queue with per-destination tracking
CREATE TABLE delivery_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_id UUID NOT NULL REFERENCES events(id),
    destination_id UUID NOT NULL REFERENCES destinations(id),
    attempt_number INT NOT NULL DEFAULT 1,
    scheduled_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    locked_at TIMESTAMPTZ,
    locked_by TEXT
);

CREATE INDEX delivery_jobs_by_destination ON delivery_jobs
    (destination_id, scheduled_at)
    WHERE locked_at IS NULL;
```

Your worker poll query should enforce a per-destination concurrency limit:
```sql
WITH in_flight AS (
    SELECT destination_id, COUNT(*) AS count
    FROM delivery_jobs
    WHERE locked_at IS NOT NULL
      AND locked_at > now() - INTERVAL '2 minutes'
    GROUP BY destination_id
)
SELECT dj.*
FROM delivery_jobs dj
LEFT JOIN in_flight inf ON inf.destination_id = dj.destination_id
WHERE dj.locked_at IS NULL
  AND dj.scheduled_at <= now()
  AND (inf.count IS NULL OR inf.count < 10) -- max 10 in-flight per destination
ORDER BY dj.scheduled_at
LIMIT 100
FOR UPDATE OF dj SKIP LOCKED;
```

This caps how many simultaneous delivery attempts any single destination can absorb, regardless of how large its backlog is. Destination A's storm can no longer consume all worker capacity.
Detecting a Retry Storm in Progress
Your observability stack should surface retry storms before they become extended incidents. The signal to watch is the ratio of retry attempts to first attempts for a given destination over a rolling window.
```sql
SELECT
    destination_id,
    COUNT(*) FILTER (WHERE attempt_number = 1) AS first_attempts,
    COUNT(*) FILTER (WHERE attempt_number > 1) AS retry_attempts,
    ROUND(
        COUNT(*) FILTER (WHERE attempt_number > 1)::numeric /
        NULLIF(COUNT(*) FILTER (WHERE attempt_number = 1), 0),
        2
    ) AS retry_ratio
FROM delivery_attempts
WHERE created_at > now() - INTERVAL '15 minutes'
GROUP BY destination_id
HAVING COUNT(*) FILTER (WHERE attempt_number > 1) > 100
ORDER BY retry_ratio DESC;
```

A retry ratio above 5 for a given destination over a 15-minute window is a clear signal that something is wrong. Alert on this. Do not wait for the dead-letter queue to fill up.
GetHook surfaces destination health in the dashboard — when a destination's error rate crosses a threshold, it's flagged automatically so you can act before the retry backlog grows large enough to become a storm.
Circuit Breakers as a Last Resort
When a destination's failure rate is high enough that retrying is actively harmful, a circuit breaker can stop the retry loop entirely. The circuit opens (stops sending) when failures exceed a threshold, and probes periodically with a single request to check for recovery.
The mechanics are straightforward:
| State | Behavior | Transition |
|---|---|---|
| Closed (normal) | All attempts proceed normally | Open when error rate > threshold over window |
| Open (failing) | No attempts; fail immediately | Half-open after cooldown period |
| Half-open (probing) | Allow one attempt through | Closed if it succeeds; Open if it fails |
Circuit breakers complement retry logic — they are not a replacement for it. The right hierarchy is: full jitter backoff to space out retries, per-destination concurrency limits to prevent queue starvation, and circuit breakers to stop retrying a destination that is clearly not recovering.
What to avoid: a circuit breaker that opens so aggressively it triggers on transient blips, or one with a cooldown so long that recovered destinations stay blocked for hours.
The Retry Policy Knobs That Matter
When designing or configuring retry behavior, the parameters that have the most impact on storm risk:
| Parameter | Recommended default | Why |
|---|---|---|
| Max attempts | 5 | Beyond 5, you're likely looking at a long-term outage, not a transient failure |
| Base delay | 30 seconds | Enough time for most transient failures to resolve |
| Max delay | 1 hour | Caps the backoff ceiling; prevents multi-day retry windows |
| Jitter strategy | Full jitter | Best load distribution during recovery |
| Per-destination concurrency | 10 | Prevents one bad destination from consuming all workers |
| Retry on 4xx | Never (except 429) | 4xx errors are consumer bugs, not transient failures |
| Retry on 429 | Yes, but respect Retry-After | Honor the destination's backpressure signal |
The Retry-After header on 429 responses deserves special attention. If your destination is rate-limiting you, the correct behavior is to schedule the retry at the time the destination says it will be ready — not at your standard exponential backoff interval. Ignoring Retry-After and retrying on your own schedule makes you part of the overload problem you're trying to recover from.
What Happens at the Dead-Letter Queue
Events that exhaust all retry attempts land in the dead-letter queue. A large dead-letter queue is a lagging indicator of a retry storm — it tells you the storm already happened and ran to completion.
The more useful signal is the retry queue depth, not the DLQ depth. Monitor COUNT(*) WHERE status = 'retry_scheduled' AND destination_id = X in your events table. A rapidly growing retry queue for a specific destination is an early-warning signal, not a post-mortem artifact.
When you do need to replay from the DLQ after a storm, replay gradually — not all at once. A bulk replay that dumps 50,000 events onto a destination that just recovered is the same thundering herd problem in a different disguise. Replay with a rate limit: 100 events/minute, observe the destination's response rate, and ramp up only once it's clearly stable.
Retry storms are a reliability failure that looks like a destination problem but is often an infrastructure design problem. The destination went down briefly, but the retry behavior is what turned it into an extended outage. Full jitter, per-destination isolation, and circuit breakers are not optional polish — they are the mechanism by which your retry logic stays beneficial rather than becoming the source of its own incidents.
If you want retry behavior that includes full jitter, per-destination concurrency limits, and automatic dead-letter handling without building it yourself, see how GetHook handles delivery and retry →