
Webhook Retry Policy Design: Beyond Exponential Backoff

Exponential backoff is the right starting point for webhook retries, but a production retry policy needs to handle HTTP status codes differently, respect destination health state, and avoid hammering an endpoint that is provably down.

Camille Beaumont
Backend Architect
April 21, 2026
11 min read

Every webhook delivery system implements retries. Most implement the same thing: exponential backoff with jitter, five attempts, done. That gets you 80% of the way there. The remaining 20% — the part that actually differentiates reliable infrastructure from brittle infrastructure — requires a policy that reasons about why a delivery failed, not just that it failed.

This post covers how to design a retry policy that handles the full complexity of real webhook delivery: differentiating retryable from non-retryable failures, using destination health state to avoid futile retries, adjusting retry schedules for business-critical events, and knowing when to stop.


Start with the Right Mental Model

Retries exist to recover from transient failures. They do not exist to compensate for:

  • Misconfigured destinations (wrong URL, wrong auth — these will keep failing)
  • Malformed payloads (your bug, not their availability problem)
  • Destinations that have decided to reject your events (4xx on every attempt)

Treating all failures as transient causes retry storms, wastes capacity, and misleads your delivery metrics. The first design decision is a clean separation between retryable and non-retryable failure modes.


Classifying Failures Before Retrying

The HTTP response status is the primary signal. Here is a sensible production classification:

| Response | Class | Action |
| --- | --- | --- |
| 200–299 | Success | Mark delivered, no retry |
| 408 Request Timeout | Transient | Retry with backoff |
| 429 Too Many Requests | Transient | Retry, respect Retry-After header |
| 500 Internal Server Error | Transient | Retry with backoff |
| 502 Bad Gateway | Transient | Retry with backoff |
| 503 Service Unavailable | Transient | Retry, respect Retry-After header |
| 504 Gateway Timeout | Transient | Retry with backoff |
| 400 Bad Request | Permanent | Dead-letter immediately |
| 401 Unauthorized | Permanent | Dead-letter, alert operator |
| 403 Forbidden | Permanent | Dead-letter, alert operator |
| 404 Not Found | Permanent | Dead-letter, disable destination after N occurrences |
| 410 Gone | Permanent | Dead-letter, disable destination immediately |
| Network timeout (no response) | Transient | Retry with backoff |
| DNS failure | Transient (short) | Retry, but with tighter bound |
| TLS handshake failure | Transient | Retry with exponential backoff |

A few things to note here:

400 Bad Request is permanent. If the destination rejects the payload as malformed, retrying the identical payload will produce the same result. This is a bug in your payload construction, not a destination availability problem.

401 and 403 are permanent. The credentials are wrong or the destination has revoked access. Retrying does not fix credentials. Dead-letter and alert your operator immediately so they can investigate.

410 Gone is an explicit signal that the destination has been decommissioned. Disable it, do not retry.

429 Too Many Requests is transient, but respect the Retry-After header rather than applying your standard backoff schedule. If the destination says "wait 60 seconds," wait 60 seconds — not 30, not 120.
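In code, the classification above reduces to a small function. A minimal sketch: the FailureClass type and classifyStatus name are illustrative, not from any particular library.

```go
type FailureClass int

const (
    ClassSuccess FailureClass = iota
    ClassTransient
    ClassPermanent
)

// classifyStatus collapses the classification table into a retry decision.
// status == 0 means no HTTP response arrived at all (network timeout,
// DNS failure, TLS handshake failure). Treating unlisted 5xx codes as
// transient is a policy choice, not something the table mandates.
func classifyStatus(status int) FailureClass {
    switch {
    case status >= 200 && status < 300:
        return ClassSuccess
    case status == 0, status == 408, status == 429, status >= 500:
        return ClassTransient
    default:
        return ClassPermanent
    }
}
```

Keeping this decision in one function makes the policy auditable: every status code your delivery loop can encounter maps to exactly one action.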


The Baseline: Exponential Backoff with Jitter

For transient failures, exponential backoff with jitter is still the right foundation:

```go
// Backoff schedule for transient failures: 30s, 2m, 10m, 1h. Roughly
// exponential; retries past the end of the schedule reuse the final cap.
var backoffSchedule = []time.Duration{
    30 * time.Second,
    2 * time.Minute,
    10 * time.Minute,
    time.Hour,
}

func nextAttemptAt(attemptNumber int) time.Time {
    idx := attemptNumber
    if idx >= len(backoffSchedule) {
        idx = len(backoffSchedule) - 1
    }
    delay := backoffSchedule[idx]

    // Add ±25% jitter to spread load from simultaneous failures
    jitter := time.Duration(rand.Int63n(int64(delay/2))) - delay/4
    delay += jitter

    return time.Now().Add(delay)
}
```

Without jitter, a burst of simultaneous failures (a downstream service restart, a brief network partition) all schedule their first retry at the same moment, causing a synchronized retry storm. Jitter staggers these across a window, converting a spike into a ramp.

A standard production schedule with a 30-second base and 1-hour cap:

| Attempt | Delay (no jitter) | Delivery window |
| --- | --- | --- |
| 1 | Immediate | T+0s |
| 2 | 30 seconds | T+30s |
| 3 | 2 minutes | T+2m30s |
| 4 | 10 minutes | T+12m30s |
| 5 | 1 hour | T+1h12m30s |

Five attempts give you a little over 70 minutes of retry coverage (somewhat more once jitter is applied). That covers the large majority of transient outages: deployments, brief database restarts, spot instance preemptions.


Layering in Destination Health State

The schedule above assumes each event is retried independently. In practice, if destination dst_abc123 returned a 503 on 100 consecutive deliveries over the last 10 minutes, retrying event #101 immediately is wasteful. The destination is clearly unavailable.

A destination health state machine lets you apply a different retry policy to destinations that are demonstrably down:

```go
type DestinationHealth int

const (
    HealthHealthy    DestinationHealth = iota
    HealthDegraded                     // some recent failures
    HealthUnhealthy                    // majority of recent attempts failed
    HealthCircuitOpen                  // all retries suspended
)

func classifyHealth(recentAttempts []Attempt) DestinationHealth {
    if len(recentAttempts) < 5 {
        return HealthHealthy
    }

    failures := 0
    for _, a := range recentAttempts {
        if a.Outcome != OutcomeSuccess {
            failures++
        }
    }

    rate := float64(failures) / float64(len(recentAttempts))
    switch {
    case rate >= 0.95:
        return HealthCircuitOpen
    case rate >= 0.6:
        return HealthUnhealthy
    case rate >= 0.2:
        return HealthDegraded
    default:
        return HealthHealthy
    }
}
```

With health state classified, modify the retry schedule per destination:

```go
func retryDelayForHealth(attemptNumber int, health DestinationHealth) time.Duration {
    base := nextDelay(attemptNumber) // standard exponential backoff

    switch health {
    case HealthDegraded:
        // Add 20% extra delay — destination is struggling
        return base + base/5
    case HealthUnhealthy:
        // Double the delay — avoid piling on
        return base * 2
    case HealthCircuitOpen:
        // Suspend — re-check after a fixed window
        return 15 * time.Minute
    default:
        return base
    }
}
```

When the circuit is open, you are not cancelling events — you are deferring them. Every queued event gets scheduled to re-attempt after the open window. If the destination recovers during that window (you can detect this with periodic probe requests), the circuit closes and normal delivery resumes. If it does not recover, events continue accumulating in the retry queue until your maximum retention period expires.


Respecting Retry-After

The Retry-After header is a contract. When a destination responds with 429 and includes Retry-After: 120, it is telling you not to retry before that window elapses. Ignoring it will get your delivery IP rate-limited more aggressively.

```go
func nextAttemptFromResponse(resp *http.Response, defaultDelay time.Duration) time.Duration {
    // 429 and 503 are the statuses that commonly carry Retry-After
    if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode == http.StatusServiceUnavailable {
        if ra := resp.Header.Get("Retry-After"); ra != "" {
            // Retry-After can be a delta-seconds integer or an HTTP-date
            if seconds, err := strconv.Atoi(ra); err == nil {
                return time.Duration(seconds) * time.Second
            }
            if t, err := http.ParseTime(ra); err == nil {
                delay := time.Until(t)
                if delay > 0 {
                    return delay
                }
            }
        }
    }
    return defaultDelay
}
```

One edge case: a Retry-After value of 86400 (24 hours) or larger. Accept it for critical events; for bulk events, you may want to cap it at your maximum retry window and dead-letter if the destination is requesting a longer deferral than your SLA allows.


Priority-Differentiated Retry Budgets

Not all events deserve the same retry persistence. A failed payment notification should retry for 24 hours. A nightly analytics digest should retry for 2 hours and move on.

Map retry budget to event priority:

| Priority | Max attempts | Max retry duration | Dead-letter behavior |
| --- | --- | --- | --- |
| Critical | 10 | 24 hours | Alert on-call + retain 30 days |
| High | 7 | 6 hours | Create incident in dashboard |
| Normal | 5 | 90 minutes | Log and retain 7 days |
| Bulk | 3 | 30 minutes | Log only |

Implement this by carrying the priority through to the retry scheduler:

```go
func maxRetryDuration(priority int) time.Duration {
    switch priority {
    case PriorityCritical:
        return 24 * time.Hour
    case PriorityHigh:
        return 6 * time.Hour
    case PriorityNormal:
        return 90 * time.Minute
    default: // bulk
        return 30 * time.Minute
    }
}

func maxAttempts(priority int) int {
    switch priority {
    case PriorityCritical:
        return 10
    case PriorityHigh:
        return 7
    case PriorityNormal:
        return 5
    default: // bulk
        return 3
    }
}

func shouldDeadLetter(event *Event) bool {
    if time.Since(event.CreatedAt) > maxRetryDuration(event.Priority) {
        return true
    }
    return event.AttemptsCount >= maxAttempts(event.Priority)
}
```

This gives you a retry policy that is proportional to business impact. You are spending more retry capacity where it matters and less where it does not.


What to Do When You Give Up

A dead-lettered event is not a deleted event. It is an event you have stopped retrying, but not stopped caring about. Your dead-letter handling should differ by priority:

Critical events: Create an alert. The on-call engineer should know a critical event was undeliverable. Include the destination URL, the last response code, the number of attempts, and a direct link to the event in the dashboard for replay.

High and normal events: Write to a dead-letter queue or table. Make replay easy — a single API call per event, a bulk replay by destination ID and time range, or automatic replay when a destination recovers.

Bulk events: Log the failure and move on. Bulk events are often idempotent at the business level — the next scheduled run will cover the gap.

GetHook exposes a POST /v1/events/{id}/replay endpoint that re-queues a dead-lettered event with a fresh attempt counter, preserving the original payload and metadata. For bulk replay after a destination recovery, query the dead-letter state by destination ID and replay the window.

```bash
# Replay all dead-lettered events for a destination from the last 6 hours
curl -X POST https://api.gethook.to/v1/events/replay-bulk \
  -H "Authorization: Bearer hk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "destination_id": "dst_abc123",
    "status": "dead_letter",
    "from": "2026-04-21T06:00:00Z",
    "to": "2026-04-21T12:00:00Z"
  }'
```

Instrumentation for Retry Policies

A retry policy you cannot observe is a policy you cannot tune. The metrics that matter:

| Metric | What it tells you |
| --- | --- |
| delivery.retry_rate by destination | How often a destination requires retry — high values signal a reliability problem |
| delivery.dead_letter_rate by priority | Critical events hitting dead-letter is a P0; bulk events is expected |
| delivery.attempt_distribution (histogram) | Most events should succeed on attempt 1; attempt 5+ should be rare |
| destination.circuit_open_duration | Time spent with circuit open — directly impacts delivery latency |
| retry.429_rate by destination | How often you're being rate-limited; consider reducing your concurrency for that destination |

Set alerts on delivery.dead_letter_rate for critical events — any non-zero value warrants investigation. For destination.circuit_open_duration, alert when any destination has had an open circuit for more than 30 minutes.
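As one concrete example, delivery.retry_rate for a destination can be computed from recent delivery records. The Delivery struct and its field names here are illustrative:

```go
// Delivery summarizes one event's delivery outcome to a destination.
type Delivery struct {
    Attempts int  // total attempts made
    Success  bool // whether the event was eventually delivered
}

// retryRate is the fraction of recent deliveries that needed more than
// one attempt. Persistently high values signal an unreliable destination.
func retryRate(recent []Delivery) float64 {
    if len(recent) == 0 {
        return 0
    }
    retried := 0
    for _, d := range recent {
        if d.Attempts > 1 {
            retried++
        }
    }
    return float64(retried) / float64(len(recent))
}
```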


Exponential backoff handles the easy case. A production retry policy needs to handle the full space: permanent vs. transient failures, destination health degradation, rate limit signals, and proportional persistence based on business impact. Each layer compounds the others — a destination health-aware policy with priority-differentiated budgets and proper dead-letter handling will recover more events, consume less retry capacity, and make on-call incidents far easier to diagnose than a naive "retry five times" approach.

If you want a delivery engine with configurable retry policies, health-aware circuit breaking, and one-click replay from the dashboard, start with GetHook.

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.