Every webhook delivery system implements retries. Most implement the same thing: exponential backoff with jitter, five attempts, done. That gets you 80% of the way there. The remaining 20% — the part that actually differentiates reliable infrastructure from brittle infrastructure — requires a policy that reasons about why a delivery failed, not just that it failed.
This post covers how to design a retry policy that handles the full complexity of real webhook delivery: differentiating retryable from non-retryable failures, using destination health state to avoid futile retries, adjusting retry schedules for business-critical events, and knowing when to stop.
## Start with the Right Mental Model
Retries exist to recover from transient failures. They do not exist to compensate for:
- Misconfigured destinations (wrong URL, wrong auth; these will keep failing)
- Malformed payloads (your bug, not their availability problem)
- Destinations that have decided to reject your events (4xx on every attempt)
Treating all failures as transient causes retry storms, wastes capacity, and misleads your delivery metrics. The first design decision is a clean separation between retryable and non-retryable failure modes.
## Classifying Failures Before Retrying
The HTTP response status is the primary signal. Here is a sensible production classification:
| Response | Class | Action |
|---|---|---|
| 200–299 | Success | Mark delivered, no retry |
| 408 Request Timeout | Transient | Retry with backoff |
| 429 Too Many Requests | Transient | Retry, respect Retry-After header |
| 500 Internal Server Error | Transient | Retry with backoff |
| 502 Bad Gateway | Transient | Retry with backoff |
| 503 Service Unavailable | Transient | Retry, respect Retry-After header |
| 504 Gateway Timeout | Transient | Retry with backoff |
| 400 Bad Request | Permanent | Dead-letter immediately |
| 401 Unauthorized | Permanent | Dead-letter, alert operator |
| 403 Forbidden | Permanent | Dead-letter, alert operator |
| 404 Not Found | Permanent | Dead-letter, disable destination after N occurrences |
| 410 Gone | Permanent | Dead-letter, disable destination immediately |
| Network timeout (no response) | Transient | Retry with backoff |
| DNS failure | Transient (short) | Retry, but with tighter bound |
| TLS handshake failure | Transient | Retry with exponential backoff |
A few things to note here:
**400 Bad Request is permanent.** If the destination rejects the payload as malformed, retrying the identical payload will produce the same result. This is a bug in your payload construction, not a destination availability problem.

**401 and 403 are permanent.** The credentials are wrong or the destination has revoked access. Retrying does not fix credentials. Dead-letter and alert your operator immediately so they can investigate.

**410 Gone is an explicit signal** that the destination has been decommissioned. Disable it; do not retry.

**429 Too Many Requests is transient,** but respect the Retry-After header rather than applying your standard backoff schedule. If the destination says "wait 60 seconds," wait 60 seconds: not 30, not 120.
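The table translates directly into a classifier. Here is a minimal sketch; `FailureClass` and the catch-all treatment of unlisted status codes are assumptions for illustration, not a fixed API:

```go
type FailureClass int

const (
	ClassSuccess FailureClass = iota
	ClassTransient
	ClassPermanent
)

// classifyStatus maps an HTTP status code to a failure class,
// mirroring the table above.
func classifyStatus(status int) FailureClass {
	switch {
	case status >= 200 && status <= 299:
		return ClassSuccess
	case status == 408, status == 429, status == 500,
		status == 502, status == 503, status == 504:
		return ClassTransient
	case status >= 400 && status <= 499:
		// Unlisted 4xx codes are treated as permanent: the request
		// itself was rejected, so an identical retry will fail too.
		return ClassPermanent
	default:
		// Unknown status codes default to transient so an unfamiliar
		// response never silently drops an event.
		return ClassTransient
	}
}
```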
## The Baseline: Exponential Backoff with Jitter
For transient failures, exponential backoff with jitter is still the right foundation:
```go
import (
	"math"
	"math/rand"
	"time"
)

func nextAttemptAt(attemptNumber int, baseDelay, maxDelay time.Duration) time.Time {
	// Exponential: baseDelay * 2^attemptNumber, capped at maxDelay.
	delay := baseDelay * time.Duration(math.Pow(2, float64(attemptNumber)))
	if delay > maxDelay {
		delay = maxDelay
	}
	// Add ±25% jitter to spread load from simultaneous failures.
	jitter := time.Duration(rand.Int63n(int64(delay/2))) - delay/4
	delay += jitter
	return time.Now().Add(delay)
}
```

Without jitter, a burst of simultaneous failures (a downstream service restart, a brief network partition) all schedule their first retry at the same moment, causing a synchronized retry storm. Jitter staggers these across a window, converting a spike into a ramp.
A standard production schedule with a 30-second base and a 1-hour cap (note that this ladder widens faster than strict doubling; real schedules are usually hand-tuned):
| Attempt | Delay (no jitter) | Delivery window |
|---|---|---|
| 1 | Immediate | T+0s |
| 2 | 30 seconds | T+30s |
| 3 | 2 minutes | T+2m30s |
| 4 | 10 minutes | T+12m30s |
| 5 | 1 hour | T+1h12m30s |
Five attempts gives you roughly 75 minutes of retry coverage, closer to 90 with maximum jitter. That covers the large majority of transient outages: deployments, brief database restarts, spot instance preemptions.
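If you want the table's exact ladder rather than a strict power-of-two progression, the simplest encoding is a delay slice. A sketch, reusing the jitter line (and imports) from above:

```go
// retryLadder is the delay before attempts 2 through 5 from the
// table above. Attempt 1 is immediate.
var retryLadder = []time.Duration{
	30 * time.Second,
	2 * time.Minute,
	10 * time.Minute,
	1 * time.Hour,
}

// ladderDelay returns the jittered delay before a retry.
// retryNumber is 1-based: 1 = first retry, i.e. attempt 2.
func ladderDelay(retryNumber int) time.Duration {
	idx := retryNumber - 1
	if idx >= len(retryLadder) {
		idx = len(retryLadder) - 1 // hold at the cap past the ladder's end
	}
	delay := retryLadder[idx]
	jitter := time.Duration(rand.Int63n(int64(delay/2))) - delay/4
	return delay + jitter
}
```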
## Layering in Destination Health State
The schedule above assumes each event is retried independently. In practice, if destination dst_abc123 returned a 503 on 100 consecutive deliveries over the last 10 minutes, retrying event #101 immediately is wasteful. The destination is clearly unavailable.
A destination health state machine lets you apply a different retry policy to destinations that are demonstrably down:
```go
type DestinationHealth int

const (
	HealthHealthy     DestinationHealth = iota
	HealthDegraded    // some recent failures
	HealthUnhealthy   // majority of recent attempts failed
	HealthCircuitOpen // all retries suspended
)

func classifyHealth(recentAttempts []Attempt) DestinationHealth {
	if len(recentAttempts) < 5 {
		return HealthHealthy
	}
	failures := 0
	for _, a := range recentAttempts {
		if a.Outcome != OutcomeSuccess {
			failures++
		}
	}
	rate := float64(failures) / float64(len(recentAttempts))
	switch {
	case rate >= 0.95:
		return HealthCircuitOpen
	case rate >= 0.6:
		return HealthUnhealthy
	case rate >= 0.2:
		return HealthDegraded
	default:
		return HealthHealthy
	}
}
```

With health state classified, modify the retry schedule per destination:
```go
func retryDelayForHealth(attemptNumber int, health DestinationHealth) time.Duration {
	base := nextDelay(attemptNumber) // standard exponential backoff
	switch health {
	case HealthDegraded:
		// Add 20% extra delay: the destination is struggling.
		return base + base/5
	case HealthUnhealthy:
		// Double the delay to avoid piling on.
		return base * 2
	case HealthCircuitOpen:
		// Suspend; re-check after a fixed window.
		return 15 * time.Minute
	default:
		return base
	}
}
```

When the circuit is open, you are not cancelling events; you are deferring them. Every queued event gets scheduled to re-attempt after the open window. If the destination recovers during that window (you can detect this with periodic probe requests), the circuit closes and normal delivery resumes. If it does not recover, events continue accumulating in the retry queue until your maximum retention period expires.
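Recovery detection can be as simple as a periodic probe. A minimal sketch, assuming the destination will answer a lightweight HEAD request; if it only accepts real webhook POSTs, probe with the next queued event instead:

```go
import (
	"context"
	"net/http"
	"time"
)

// probeDestination reports whether a circuit-open destination is
// reachable again. Any HTTP response, even a 4xx, proves the host is
// back; timeouts and connection errors mean it is not.
func probeDestination(ctx context.Context, url string) bool {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodHead, url, nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode < 500
}
```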
## Respecting Retry-After
The Retry-After header is a contract. When a destination responds with 429 and includes Retry-After: 120, it is telling you not to retry before that window elapses. Ignoring it will get your delivery IP rate-limited more aggressively.
```go
import (
	"net/http"
	"strconv"
	"time"
)

func nextAttemptFromResponse(resp *http.Response, defaultDelay time.Duration) time.Duration {
	// Per the classification table, both 429 and 503 can carry Retry-After.
	if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode == http.StatusServiceUnavailable {
		if ra := resp.Header.Get("Retry-After"); ra != "" {
			// Retry-After can be a delta-seconds integer or an HTTP-date.
			if seconds, err := strconv.Atoi(ra); err == nil {
				return time.Duration(seconds) * time.Second
			}
			if t, err := http.ParseTime(ra); err == nil {
				if delay := time.Until(t); delay > 0 {
					return delay
				}
			}
		}
	}
	return defaultDelay
}
```

One edge case: a Retry-After value of 86400 (24 hours) or larger. Accept it for critical events; for bulk events, you may want to cap it at your maximum retry window and dead-letter if the destination is requesting a longer deferral than your SLA allows.
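That cap is a one-liner against the priority budgets defined in the next section. A hypothetical wiring, assuming the `maxRetryDuration` helper shown below:

```go
// capRetryAfter bounds a destination-requested deferral by the
// event's retry budget. A false return means the request exceeds
// the SLA and the caller should dead-letter instead of waiting.
func capRetryAfter(requested time.Duration, priority int) (time.Duration, bool) {
	budget := maxRetryDuration(priority) // defined in the next section
	if requested > budget {
		return 0, false
	}
	return requested, true
}
```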
## Priority-Differentiated Retry Budgets
Not all events deserve the same retry persistence. A failed payment notification should retry for 24 hours. A nightly analytics digest should retry for 2 hours and move on.
Map retry budget to event priority:
| Priority | Max attempts | Max retry duration | Dead-letter behavior |
|---|---|---|---|
| Critical | 10 | 24 hours | Alert on-call + retain 30 days |
| High | 7 | 6 hours | Create incident in dashboard |
| Normal | 5 | 90 minutes | Log and retain 7 days |
| Bulk | 3 | 30 minutes | Log only |
Implement this by carrying the priority through to the retry scheduler:
```go
func maxRetryDuration(priority int) time.Duration {
	switch priority {
	case PriorityCritical:
		return 24 * time.Hour
	case PriorityHigh:
		return 6 * time.Hour
	case PriorityNormal:
		return 90 * time.Minute
	default: // bulk
		return 30 * time.Minute
	}
}

// shouldDeadLetter reports whether either budget is exhausted:
// total event age or attempt count.
func shouldDeadLetter(event *Event) bool {
	if time.Since(event.CreatedAt) > maxRetryDuration(event.Priority) {
		return true
	}
	if event.AttemptsCount >= maxAttempts(event.Priority) {
		return true
	}
	return false
}
```

This gives you a retry policy that is proportional to business impact. You are spending more retry capacity where it matters and less where it does not.
## What to Do When You Give Up
A dead-lettered event is not a deleted event. It is an event you have stopped retrying, but not stopped caring about. Your dead-letter handling should differ by priority:
**Critical events:** Create an alert. The on-call engineer should know a critical event was undeliverable. Include the destination URL, the last response code, the number of attempts, and a direct link to the event in the dashboard for replay.

**High and normal events:** Write to a dead-letter queue or table. Make replay easy: a single API call per event, a bulk replay by destination ID and time range, or automatic replay when a destination recovers.

**Bulk events:** Log the failure and move on. Bulk events are often idempotent at the business level; the next scheduled run will cover the gap. A sketch of this per-priority dispatch follows.
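A minimal sketch of that dispatch, where `alertOnCall` and `writeDeadLetterQueue` are hypothetical stand-ins for your own alerting and storage, as are the `Event` fields:

```go
import "log"

// handleDeadLetter routes an undeliverable event by priority,
// per the behaviors described above.
func handleDeadLetter(event *Event, lastStatus int) {
	switch event.Priority {
	case PriorityCritical:
		// Page a human with everything needed to diagnose and replay.
		alertOnCall(event.ID, event.DestinationURL, lastStatus, event.AttemptsCount)
		writeDeadLetterQueue(event)
	case PriorityHigh, PriorityNormal:
		writeDeadLetterQueue(event) // retained and easy to replay
	default: // bulk
		log.Printf("dead-letter (bulk): event=%s dest=%s", event.ID, event.DestinationURL)
	}
}
```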
GetHook exposes a POST /v1/events/{id}/replay endpoint that re-queues a dead-lettered event with a fresh attempt counter, preserving the original payload and metadata. For bulk replay after a destination recovery, query the dead-letter state by destination ID and replay the window.
```bash
# Replay all dead-lettered events for a destination from the last 6 hours
curl -X POST https://api.gethook.to/v1/events/replay-bulk \
  -H "Authorization: Bearer hk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "destination_id": "dst_abc123",
    "status": "dead_letter",
    "from": "2026-04-21T06:00:00Z",
    "to": "2026-04-21T12:00:00Z"
  }'
```

## Instrumentation for Retry Policies
A retry policy you cannot observe is a policy you cannot tune. The metrics that matter:
| Metric | What it tells you |
|---|---|
| delivery.retry_rate by destination | How often a destination requires retry; high values signal a reliability problem |
| delivery.dead_letter_rate by priority | Critical events hitting dead-letter is a P0; bulk events dead-lettering is expected |
| delivery.attempt_distribution (histogram) | Most events should succeed on attempt 1; attempt 5+ should be rare |
| destination.circuit_open_duration | Time spent with circuit open; directly impacts delivery latency |
| retry.429_rate by destination | How often you're being rate-limited; consider reducing your concurrency for that destination |
Set alerts on delivery.dead_letter_rate for critical events — any non-zero value warrants investigation. For destination.circuit_open_duration, alert when any destination has had an open circuit for more than 30 minutes.
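For concreteness, here is one way the counters might be emitted using Prometheus client_golang; the metric names are adapted to Prometheus conventions and are illustrative, not GetHook's exact instrumentation:

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Deliveries that required at least one retry, per destination.
	retryTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "delivery_retry_total",
		Help: "Deliveries that required at least one retry.",
	}, []string{"destination"})

	// Events that exhausted their retry budget, per priority.
	deadLetterTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "delivery_dead_letter_total",
		Help: "Events that exhausted their retry budget.",
	}, []string{"priority"})

	// Attempt number on which delivery finally succeeded.
	attemptsToSuccess = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "delivery_attempts_to_success",
		Help:    "Attempt number on which delivery succeeded.",
		Buckets: []float64{1, 2, 3, 4, 5, 7, 10},
	})
)
```

Increment `retryTotal` when scheduling a retry and observe `attemptsToSuccess` on final success; the rates in the table fall out of these counters at query time.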
Exponential backoff handles the easy case. A production retry policy needs to handle the full space: permanent vs. transient failures, destination health degradation, rate limit signals, and proportional persistence based on business impact. Each layer compounds the others — a destination health-aware policy with priority-differentiated budgets and proper dead-letter handling will recover more events, consume less retry capacity, and make on-call incidents far easier to diagnose than a naive "retry five times" approach.
If you want a delivery engine with configurable retry policies, health-aware circuit breaking, and one-click replay from the dashboard, start with GetHook.