Every webhook delivery system implements retries. Most implement the same thing: exponential backoff with jitter, five attempts, done. That gets you 80% of the way there. The remaining 20% — the part that actually differentiates reliable infrastructure from brittle infrastructure — requires a policy that reasons about why a delivery failed, not just that it failed.
This post covers how to design a retry policy that handles the full complexity of real webhook delivery: differentiating retryable from non-retryable failures, using destination health state to avoid futile retries, adjusting retry schedules for business-critical events, and knowing when to stop.
## Start with the Right Mental Model
Retries exist to recover from transient failures. They do not exist to compensate for:
- Misconfigured destinations (wrong URL, wrong auth; these will keep failing)
- Malformed payloads (your bug, not their availability problem)
- Destinations that have decided to reject your events (4xx on every attempt)
Treating all failures as transient causes retry storms, wastes capacity, and misleads your delivery metrics. The first design decision is a clean separation between retryable and non-retryable failure modes.
## Classifying Failures Before Retrying
The HTTP response status is the primary signal. Here is a sensible production classification:
| Response | Class | Action |
|---|---|---|
| 200–299 | Success | Mark delivered, no retry |
| 408 Request Timeout | Transient | Retry with backoff |
| 429 Too Many Requests | Transient | Retry, respect Retry-After header |
| 500 Internal Server Error | Transient | Retry with backoff |
| 502 Bad Gateway | Transient | Retry with backoff |
| 503 Service Unavailable | Transient | Retry, respect Retry-After header |
| 504 Gateway Timeout | Transient | Retry with backoff |
| 400 Bad Request | Permanent | Dead-letter immediately |
| 401 Unauthorized | Permanent | Dead-letter, alert operator |
| 403 Forbidden | Permanent | Dead-letter, alert operator |
| 404 Not Found | Permanent | Dead-letter, disable destination after N occurrences |
| 410 Gone | Permanent | Dead-letter, disable destination immediately |
| Network timeout (no response) | Transient | Retry with backoff |
| DNS failure | Transient (short) | Retry, but with tighter bound |
| TLS handshake failure | Transient | Retry with exponential backoff |
A few things to note here:
**400 Bad Request is permanent.** If the destination rejects the payload as malformed, retrying the identical payload will produce the same result. This is a bug in your payload construction, not a destination availability problem.

**401 and 403 are permanent.** The credentials are wrong or the destination has revoked access. Retrying does not fix credentials. Dead-letter and alert your operator immediately so they can investigate.

**410 Gone is an explicit signal** that the destination has been decommissioned. Disable it; do not retry.

**429 Too Many Requests is transient,** but respect the Retry-After header rather than applying your standard backoff schedule. If the destination says "wait 60 seconds," wait 60 seconds: not 30, not 120.
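The table translates directly into a classifier. Here is a minimal sketch; `FailureClass` and the catch-all treatment of unlisted status codes are assumptions for illustration, not a fixed API:

```go
type FailureClass int

const (
	ClassSuccess FailureClass = iota
	ClassTransient
	ClassPermanent
)

// classifyStatus maps an HTTP status code to a failure class,
// mirroring the table above.
func classifyStatus(status int) FailureClass {
	switch {
	case status >= 200 && status <= 299:
		return ClassSuccess
	case status == 408, status == 429, status == 500,
		status == 502, status == 503, status == 504:
		return ClassTransient
	case status >= 400 && status <= 499:
		// Unlisted 4xx codes are treated as permanent: the request
		// itself was rejected, so an identical retry will fail too.
		return ClassPermanent
	default:
		// Unknown status codes default to transient so an unfamiliar
		// response never silently drops an event.
		return ClassTransient
	}
}
```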
## The Baseline: Exponential Backoff with Jitter
For transient failures, exponential backoff with jitter is still the right foundation:
```go
import (
	"math"
	"math/rand"
	"time"
)

func nextAttemptAt(attemptNumber int, baseDelay, maxDelay time.Duration) time.Time {
	// Exponential: baseDelay * 2^attemptNumber, capped at maxDelay.
	delay := baseDelay * time.Duration(math.Pow(2, float64(attemptNumber)))
	if delay > maxDelay {
		delay = maxDelay
	}
	// Add ±25% jitter to spread load from simultaneous failures.
	jitter := time.Duration(rand.Int63n(int64(delay/2))) - delay/4
	delay += jitter
	return time.Now().Add(delay)
}
```

Without jitter, a burst of simultaneous failures (a downstream service restart, a brief network partition) all schedule their first retry at the same moment, causing a synchronized retry storm. Jitter staggers these across a window, converting a spike into a ramp.
A standard production schedule with a 30-second base and a 1-hour cap (note that this ladder widens faster than strict doubling; real schedules are usually hand-tuned):
| Attempt | Delay (no jitter) | Delivery window |
|---|---|---|
| 1 | Immediate | T+0s |
| 2 | 30 seconds | T+30s |
| 3 | 2 minutes | T+2m30s |
| 4 | 10 minutes | T+12m30s |
| 5 | 1 hour | T+1h12m30s |
Five attempts gives you roughly 75 minutes of retry coverage, closer to 90 with maximum jitter. That covers the large majority of transient outages: deployments, brief database restarts, spot instance preemptions.
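If you want the table's exact ladder rather than a strict power-of-two progression, the simplest encoding is a delay slice. A sketch, reusing the jitter line (and imports) from above:

```go
// retryLadder is the delay before attempts 2 through 5 from the
// table above. Attempt 1 is immediate.
var retryLadder = []time.Duration{
	30 * time.Second,
	2 * time.Minute,
	10 * time.Minute,
	1 * time.Hour,
}

// ladderDelay returns the jittered delay before a retry.
// retryNumber is 1-based: 1 = first retry, i.e. attempt 2.
func ladderDelay(retryNumber int) time.Duration {
	idx := retryNumber - 1
	if idx >= len(retryLadder) {
		idx = len(retryLadder) - 1 // hold at the cap past the ladder's end
	}
	delay := retryLadder[idx]
	jitter := time.Duration(rand.Int63n(int64(delay/2))) - delay/4
	return delay + jitter
}
```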
## Layering in Destination Health State
The schedule above assumes each event is retried independently. In practice, if destination dst_abc123 returned a 503 on 100 consecutive deliveries over the last 10 minutes, retrying event #101 immediately is wasteful. The destination is clearly unavailable.
A destination health state machine lets you apply a different retry policy to destinations that are demonstrably down:
```go
type DestinationHealth int

const (
	HealthHealthy     DestinationHealth = iota
	HealthDegraded    // some recent failures
	HealthUnhealthy   // majority of recent attempts failed
	HealthCircuitOpen // all retries suspended
)

func classifyHealth(recentAttempts []Attempt) DestinationHealth {
	if len(recentAttempts) < 5 {
		return HealthHealthy
	}
	failures := 0
	for _, a := range recentAttempts {
		if a.Outcome != OutcomeSuccess {
			failures++
		}
	}
	rate := float64(failures) / float64(len(recentAttempts))
	switch {
	case rate >= 0.95:
		return HealthCircuitOpen
	case rate >= 0.6:
		return HealthUnhealthy
	case rate >= 0.2:
		return HealthDegraded
	default:
		return HealthHealthy
	}
}
```

With health state classified, modify the retry schedule per destination:
```go
func retryDelayForHealth(attemptNumber int, health DestinationHealth) time.Duration {
	base := nextDelay(attemptNumber) // standard exponential backoff
	switch health {
	case HealthDegraded:
		// Add 20% extra delay: the destination is struggling.
		return base + base/5
	case HealthUnhealthy:
		// Double the delay to avoid piling on.
		return base * 2
	case HealthCircuitOpen:
		// Suspend; re-check after a fixed window.
		return 15 * time.Minute
	default:
		return base
	}
}
```

When the circuit is open, you are not cancelling events; you are deferring them. Every queued event gets scheduled to re-attempt after the open window. If the destination recovers during that window (you can detect this with periodic probe requests), the circuit closes and normal delivery resumes. If it does not recover, events continue accumulating in the retry queue until your maximum retention period expires.
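Recovery detection can be as simple as a periodic probe. A minimal sketch, assuming the destination will answer a lightweight HEAD request; if it only accepts real webhook POSTs, probe with the next queued event instead:

```go
import (
	"context"
	"net/http"
	"time"
)

// probeDestination reports whether a circuit-open destination is
// reachable again. Any HTTP response, even a 4xx, proves the host is
// back; timeouts and connection errors mean it is not.
func probeDestination(ctx context.Context, url string) bool {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodHead, url, nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode < 500
}
```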
## Respecting Retry-After
The Retry-After header is a contract. When a destination responds with 429 and includes Retry-After: 120, it is telling you not to retry before that window elapses. Ignoring it will get your delivery IP rate-limited more aggressively.
```go
import (
	"net/http"
	"strconv"
	"time"
)

func nextAttemptFromResponse(resp *http.Response, defaultDelay time.Duration) time.Duration {
	// Per the classification table, both 429 and 503 can carry Retry-After.
	if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode == http.StatusServiceUnavailable {
		if ra := resp.Header.Get("Retry-After"); ra != "" {
			// Retry-After can be a delta-seconds integer or an HTTP-date.
			if seconds, err := strconv.Atoi(ra); err == nil {
				return time.Duration(seconds) * time.Second
			}
			if t, err := http.ParseTime(ra); err == nil {
				if delay := time.Until(t); delay > 0 {
					return delay
				}
			}
		}
	}
	return defaultDelay
}
```

One edge case: a Retry-After value of 86400 (24 hours) or larger. Accept it for critical events; for bulk events, you may want to cap it at your maximum retry window and dead-letter if the destination is requesting a longer deferral than your SLA allows.
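That cap is a one-liner against the priority budgets defined in the next section. A hypothetical wiring, assuming the `maxRetryDuration` helper shown below:

```go
// capRetryAfter bounds a destination-requested deferral by the
// event's retry budget. A false return means the request exceeds
// the SLA and the caller should dead-letter instead of waiting.
func capRetryAfter(requested time.Duration, priority int) (time.Duration, bool) {
	budget := maxRetryDuration(priority) // defined in the next section
	if requested > budget {
		return 0, false
	}
	return requested, true
}
```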
## Priority-Differentiated Retry Budgets
Not all events deserve the same retry persistence. A failed payment notification should retry for 24 hours. A nightly analytics digest should retry for 2 hours and move on.
Map retry budget to event priority:
| Priority | Max attempts | Max retry duration | Dead-letter behavior |
|---|---|---|---|
| Critical | 10 | 24 hours | Alert on-call + retain 30 days |
| High | 7 | 6 hours | Create incident in dashboard |
| Normal | 5 | 90 minutes | Log and retain 7 days |
| Bulk | 3 | 30 minutes | Log only |
Implement this by carrying the priority through to the retry scheduler:
```go
func maxRetryDuration(priority int) time.Duration {
	switch priority {
	case PriorityCritical:
		return 24 * time.Hour
	case PriorityHigh:
		return 6 * time.Hour
	case PriorityNormal:
		return 90 * time.Minute
	default: // bulk
		return 30 * time.Minute
	}
}

// shouldDeadLetter reports whether either budget is exhausted:
// total event age or attempt count.
func shouldDeadLetter(event *Event) bool {
	if time.Since(event.CreatedAt) > maxRetryDuration(event.Priority) {
		return true
	}
	if event.AttemptsCount >= maxAttempts(event.Priority) {
		return true
	}
	return false
}
```

This gives you a retry policy that is proportional to business impact. You are spending more retry capacity where it matters and less where it does not.
## What to Do When You Give Up
A dead-lettered event is not a deleted event. It is an event you have stopped retrying, but not stopped caring about. Your dead-letter handling should differ by priority:
**Critical events:** Create an alert. The on-call engineer should know a critical event was undeliverable. Include the destination URL, the last response code, the number of attempts, and a direct link to the event in the dashboard for replay.

**High and normal events:** Write to a dead-letter queue or table. Make replay easy: a single API call per event, a bulk replay by destination ID and time range, or automatic replay when a destination recovers.

**Bulk events:** Log the failure and move on. Bulk events are often idempotent at the business level; the next scheduled run will cover the gap. A sketch of this per-priority dispatch follows.
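A minimal sketch of that dispatch, where `alertOnCall` and `writeDeadLetterQueue` are hypothetical stand-ins for your own alerting and storage, as are the `Event` fields:

```go
import "log"

// handleDeadLetter routes an undeliverable event by priority,
// per the behaviors described above.
func handleDeadLetter(event *Event, lastStatus int) {
	switch event.Priority {
	case PriorityCritical:
		// Page a human with everything needed to diagnose and replay.
		alertOnCall(event.ID, event.DestinationURL, lastStatus, event.AttemptsCount)
		writeDeadLetterQueue(event)
	case PriorityHigh, PriorityNormal:
		writeDeadLetterQueue(event) // retained and easy to replay
	default: // bulk
		log.Printf("dead-letter (bulk): event=%s dest=%s", event.ID, event.DestinationURL)
	}
}
```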
GetHook exposes a POST /v1/events/{id}/replay endpoint that re-queues a dead-lettered event with a fresh attempt counter, preserving the original payload and metadata. For bulk replay after a destination recovery, query the dead-letter state by destination ID and replay the window.
```bash
# Replay all dead-lettered events for a destination from the last 6 hours
curl -X POST https://api.gethook.to/v1/events/replay-bulk \
  -H "Authorization: Bearer hk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "destination_id": "dst_abc123",
    "status": "dead_letter",
    "from": "2026-04-21T06:00:00Z",
    "to": "2026-04-21T12:00:00Z"
  }'
```

## Instrumentation for Retry Policies
A retry policy you cannot observe is a policy you cannot tune. The metrics that matter:
| Metric | What it tells you |
|---|---|
| delivery.retry_rate by destination | How often a destination requires retry; high values signal a reliability problem |
| delivery.dead_letter_rate by priority | Critical events hitting dead-letter is a P0; bulk events dead-lettering is expected |
| delivery.attempt_distribution (histogram) | Most events should succeed on attempt 1; attempt 5+ should be rare |
| destination.circuit_open_duration | Time spent with circuit open; directly impacts delivery latency |
| retry.429_rate by destination | How often you're being rate-limited; consider reducing your concurrency for that destination |
Set alerts on delivery.dead_letter_rate for critical events — any non-zero value warrants investigation. For destination.circuit_open_duration, alert when any destination has had an open circuit for more than 30 minutes.
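For concreteness, here is one way the counters might be emitted using Prometheus client_golang; the metric names are adapted to Prometheus conventions and are illustrative, not GetHook's exact instrumentation:

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Deliveries that required at least one retry, per destination.
	retryTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "delivery_retry_total",
		Help: "Deliveries that required at least one retry.",
	}, []string{"destination"})

	// Events that exhausted their retry budget, per priority.
	deadLetterTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "delivery_dead_letter_total",
		Help: "Events that exhausted their retry budget.",
	}, []string{"priority"})

	// Attempt number on which delivery finally succeeded.
	attemptsToSuccess = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "delivery_attempts_to_success",
		Help:    "Attempt number on which delivery succeeded.",
		Buckets: []float64{1, 2, 3, 4, 5, 7, 10},
	})
)
```

Increment `retryTotal` when scheduling a retry and observe `attemptsToSuccess` on final success; the rates in the table fall out of these counters at query time.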
Exponential backoff handles the easy case. A production retry policy needs to handle the full space: permanent vs. transient failures, destination health degradation, rate limit signals, and proportional persistence based on business impact. Each layer compounds the others — a destination health-aware policy with priority-differentiated budgets and proper dead-letter handling will recover more events, consume less retry capacity, and make on-call incidents far easier to diagnose than a naive "retry five times" approach.
If you want a delivery engine with configurable retry policies, health-aware circuit breaking, and one-click replay from the dashboard, start with GetHook.