Retry logic is the first tool most teams reach for when webhook delivery fails. It works well when failures are transient — a destination that was briefly overloaded recovers, the retry succeeds, and everything continues. But retry logic has a blind spot: it doesn't distinguish between a destination that's temporarily slow and one that's been misconfigured, decommissioned, or silently broken for three days.
When you retry into a broken destination indefinitely, you accumulate a backlog of queued jobs, burn delivery worker capacity on requests that will never succeed, and delay delivery to all your other destinations as the queue grows. The fix is destination health scoring: a lightweight signal that tells your delivery system when to back off, notify the customer, or pause delivery entirely.
What "Unhealthy" Actually Means
Not every failed HTTP response means the destination is broken. A 429 means your delivery rate exceeds the destination's capacity — the fix is throttling, not disabling. A 503 that clears in two minutes is noise. A consistent stream of 500s over 30 minutes is a real signal.
The useful failure categories are:
| Outcome | Likely Cause | Response |
|---|---|---|
| HTTP 4xx (not 429) | Misconfigured destination, expired auth, wrong URL | Likely broken — alert immediately |
| HTTP 5xx | Destination is erroring internally | Degrade score, retry with backoff |
| Connection timeout | Host unreachable or TLS failure | Degrade score aggressively |
| DNS resolution failure | Destination URL is invalid or domain no longer exists | Pause immediately |
| HTTP 429 | Rate limited by destination | Throttle, do not degrade health score |
| HTTP 200/201 | Success | Recover health score |
The key insight is that 4xx responses (other than 429) should degrade the health score faster than 5xx responses. A 404 or 401 is unlikely to resolve on its own. A 503 might.
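If you want this classification encoded rather than left to convention, a small helper on the delivery worker can map each outcome to a handling decision. The sketch below is illustrative: the weight values are assumptions, and the simple sliding-window score described next treats every failure equally, but the weights show one way to express the "4xx degrades faster" idea if you build a weighted variant.

```go
// Sketch: classify a delivery outcome into a handling decision.
// The weight values are illustrative, not fixed recommendations.
type outcomeAction struct {
	healthWeight  int  // how heavily this failure should count against health (0 = ignore)
	retry         bool // retry with backoff
	throttle      bool // slow down sends instead of degrading health
	pauseNow      bool // pause the destination immediately
	alertCustomer bool // notify the customer right away
}

func classifyOutcome(statusCode int, connErr error, dnsFailure bool) outcomeAction {
	switch {
	case dnsFailure:
		// The domain no longer resolves; retrying is pointless.
		return outcomeAction{healthWeight: 3, pauseNow: true, alertCustomer: true}
	case connErr != nil:
		// Timeout or TLS failure: degrade aggressively but keep retrying.
		return outcomeAction{healthWeight: 3, retry: true}
	case statusCode == 429:
		// Rate limited: throttle, don't touch the health score.
		return outcomeAction{throttle: true, retry: true}
	case statusCode >= 500:
		// Destination erroring internally: degrade, retry with backoff.
		return outcomeAction{healthWeight: 1, retry: true}
	case statusCode >= 400:
		// 401/404/etc.: unlikely to resolve on its own.
		return outcomeAction{healthWeight: 3, alertCustomer: true}
	default:
		// 2xx: success, recovers the health score.
		return outcomeAction{}
	}
}
```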
The Health Score Model
A simple and durable model is a sliding-window success rate computed over the last N delivery attempts per destination. You don't need a time-series database for this — your delivery attempts table already has the data.
```sql
-- Health score: success rate over the last 50 attempts
SELECT
destination_id,
COUNT(*) AS total,
SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) AS successes,
ROUND(
100.0 * SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) / COUNT(*),
1
) AS success_rate_pct
FROM (
SELECT destination_id, outcome
FROM delivery_attempts
WHERE destination_id = $1
ORDER BY created_at DESC
LIMIT 50
) recent
GROUP BY destination_id;
```

This gives you a percentage. You can define thresholds:
| Success Rate | Status | Action |
|---|---|---|
| 95–100% | Healthy | No action |
| 75–94% | Degraded | Alert customer, increase retry interval |
| 50–74% | Critical | Alert customer, consider pause |
| < 50% | Unhealthy | Auto-pause delivery |
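The "increase retry interval" action for the degraded and critical bands can be as simple as stretching the normal backoff schedule by the destination's current status. A minimal sketch; the multipliers are assumptions, not recommendations:

```go
// Sketch: scale the retry backoff by destination health so a struggling
// endpoint gets more breathing room. The multipliers are illustrative.
func retryDelay(base time.Duration, attempt int, healthStatus string) time.Duration {
	if attempt > 10 {
		attempt = 10 // cap the exponent so the shift below can't overflow
	}
	delay := base * time.Duration(1<<attempt) // standard exponential backoff

	switch healthStatus {
	case "degraded":
		delay *= 2
	case "critical":
		delay *= 4
	}
	// "unhealthy" destinations are paused and never reach the retry path.
	return delay
}
```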
The window size (50 attempts) is a tradeoff. A smaller window reacts faster but produces more false positives for bursty failures. A larger window is more accurate but slower to catch a newly broken destination. 50 is a reasonable starting point; tune it based on your typical delivery volume per destination.
Storing and Updating the Score
You don't want to run the scoring query on every delivery attempt — at high volume, that's an expensive aggregation per delivery. Instead, maintain a materialized health state on the destination row and update it asynchronously.
Add a few columns to your destinations table:
```sql
ALTER TABLE destinations
ADD COLUMN health_status TEXT NOT NULL DEFAULT 'healthy',
-- healthy | degraded | critical | unhealthy | paused
ADD COLUMN health_score INT NOT NULL DEFAULT 100,
-- 0–100, higher is better
ADD COLUMN health_checked_at TIMESTAMPTZ,
ADD COLUMN consecutive_failures INT NOT NULL DEFAULT 0,
ADD COLUMN paused_at TIMESTAMPTZ,
ADD COLUMN pause_reason TEXT;
```

After each delivery attempt, your worker updates the destination's consecutive failure count. A separate health evaluation job — running every minute via your Postgres-backed job queue — recomputes the sliding-window score and transitions health status.
```go
// Called by the health evaluation job for each active destination
func (s *DestinationStore) EvaluateHealth(ctx context.Context, destID uuid.UUID) error {
const window = 50
var total, successes int
err := s.db.QueryRowContext(ctx, `
SELECT
COUNT(*) AS total,
COALESCE(SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END), 0) AS successes -- COALESCE so zero rows scans as 0, not NULL
FROM (
SELECT outcome
FROM delivery_attempts
WHERE destination_id = $1
ORDER BY created_at DESC
LIMIT $2
) recent
`, destID, window).Scan(&total, &successes)
if err != nil {
return fmt.Errorf("query health: %w", err)
}
if total == 0 {
return nil // no data, leave as healthy
}
score := int(float64(successes) / float64(total) * 100)
status := healthStatus(score)
_, err = s.db.ExecContext(ctx, `
UPDATE destinations
SET
health_score = $2,
health_status = $3,
health_checked_at = NOW(),
paused_at = CASE WHEN $3 = 'unhealthy' AND paused_at IS NULL THEN NOW() ELSE paused_at END,
pause_reason = CASE WHEN $3 = 'unhealthy' AND paused_at IS NULL THEN 'auto: health score below threshold' ELSE pause_reason END
WHERE id = $1
`, destID, score, status)
return err
}
func healthStatus(score int) string {
switch {
case score >= 95:
return "healthy"
case score >= 75:
return "degraded"
case score >= 50:
return "critical"
default:
return "unhealthy"
}
}
```

When the status transitions to unhealthy, the row is automatically paused. Your delivery worker's dispatch query skips destinations where paused_at IS NOT NULL:
```sql
SELECT d.*
FROM destinations d
JOIN routes r ON r.destination_id = d.id
WHERE r.source_id = $1
  AND d.paused_at IS NULL;
```

Early Warning: Consecutive Failure Tracking
The sliding window catches persistent degradation but is slow to react to sudden breaks. If a destination has been healthy for months and then starts returning 404 on every attempt, you'll see 47 successes and 3 failures in your 50-attempt window — a 94% score that still reads as barely degraded and sits nowhere near the auto-pause threshold.
Add a consecutive failure counter that runs in parallel with the window-based score:
```go
func (w *DeliveryWorker) recordAttempt(ctx context.Context, destID uuid.UUID, outcome string) error {
if outcome == "success" {
// Reset consecutive failures on any success
_, err := w.db.ExecContext(ctx,
`UPDATE destinations SET consecutive_failures = 0 WHERE id = $1`,
destID,
)
return err
}
// Increment and check threshold
var consecutiveFailures int
err := w.db.QueryRowContext(ctx, `
UPDATE destinations
SET consecutive_failures = consecutive_failures + 1
WHERE id = $1
RETURNING consecutive_failures
`, destID).Scan(&consecutiveFailures)
if err != nil {
return err
}
// Hard pause after 10 consecutive failures, regardless of window score
if consecutiveFailures >= 10 {
_, err = w.db.ExecContext(ctx, `
UPDATE destinations
SET
paused_at = NOW(),
pause_reason = 'auto: 10 consecutive failures',
health_status = 'unhealthy'
WHERE id = $1 AND paused_at IS NULL
`, destID)
}
return err
}
```

This catches the cliff scenario: a destination that was working perfectly and then breaks completely will hit 10 consecutive failures and pause well before the sliding window degrades below the threshold.
Recovery: Automatic or Manual?
Once a destination is paused, you have two options for resuming delivery.
Manual resume is simpler and safer. The customer (or your support team) investigates the failure, fixes the destination, and clicks "Resume delivery" in the dashboard. This prevents a fixed-but-fragile destination from being hammered by a backlog immediately after resuming.
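Mechanically, a manual resume is just a state reset. A minimal sketch, assuming the column names from the earlier migration and a hypothetical ResumeDestination method; re-entering as degraded rather than healthy is one reasonable choice, so the sliding window has to earn the healthy badge back:

```go
// Sketch: manual "Resume delivery". Clears the pause state and resets the
// consecutive-failure counter; the destination re-enters rotation as
// "degraded" until the sliding-window score recovers on its own.
func (s *DestinationStore) ResumeDestination(ctx context.Context, destID uuid.UUID) error {
	_, err := s.db.ExecContext(ctx, `
		UPDATE destinations
		SET
			paused_at = NULL,
			pause_reason = NULL,
			consecutive_failures = 0,
			health_status = 'degraded'
		WHERE id = $1 AND paused_at IS NOT NULL
	`, destID)
	return err
}
```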
Automatic recovery with a probe works well for destinations that self-heal. Every 5 minutes, your worker sends a synthetic probe request to paused destinations. If the probe succeeds, the destination transitions back to degraded status and delivery resumes with a reduced concurrency cap until the sliding window score recovers to healthy.
The probe request should be a lightweight, idempotent payload — a ping event type that the destination can ignore:
```json
{
"id": "probe_01HX9P3...",
"type": "webhook.probe",
"created_at": "2026-04-04T09:00:00Z",
"data": {}
}
```

Don't send probe requests to destinations that were paused due to a 4xx error. A 404 or 401 won't self-heal — probing them wastes requests and confuses the customer. Only auto-probe destinations paused due to 5xx or network errors.
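A sketch of what the probe loop might look like. It assumes a pause_category column ('server_error' vs 'client_error') that the schema above doesn't define, plus a hypothetical ProbeWorker type; the reduced-concurrency resume described above is left out:

```go
// Sketch: runs every few minutes against paused destinations. pause_category
// is an assumed column recording *why* the destination was paused; without
// it you can't tell probe-worthy pauses (5xx, network) from dead ones (4xx).
type ProbeWorker struct {
	db         *sql.DB
	httpClient *http.Client
}

func (w *ProbeWorker) probePausedDestinations(ctx context.Context) error {
	rows, err := w.db.QueryContext(ctx, `
		SELECT id, url
		FROM destinations
		WHERE paused_at IS NOT NULL
		  AND pause_category = 'server_error' -- never probe 4xx-paused destinations
	`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var destID uuid.UUID
		var url string
		if err := rows.Scan(&destID, &url); err != nil {
			return err
		}

		payload, _ := json.Marshal(map[string]any{
			"id":         "probe_" + uuid.NewString(),
			"type":       "webhook.probe",
			"created_at": time.Now().UTC().Format(time.RFC3339),
			"data":       map[string]any{},
		})
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
		if err != nil {
			continue
		}
		req.Header.Set("Content-Type", "application/json")

		resp, err := w.httpClient.Do(req)
		if err != nil {
			continue // still unreachable; leave it paused
		}
		resp.Body.Close()

		if resp.StatusCode >= 200 && resp.StatusCode < 300 {
			// Probe succeeded: unpause into degraded and let the window recover.
			_, _ = w.db.ExecContext(ctx, `
				UPDATE destinations
				SET paused_at = NULL, pause_reason = NULL,
				    consecutive_failures = 0, health_status = 'degraded'
				WHERE id = $1
			`, destID)
		}
	}
	return rows.Err()
}
```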
What to Surface to Customers
Health scoring is only valuable if customers can see and act on it. The minimum viable surface area:
- A health badge on each destination in the dashboard (`Healthy`, `Degraded`, `Paused`)
- A timestamp and reason on paused destinations ("Paused automatically: 10 consecutive failures since 2026-04-04 09:14 UTC")
- An email alert when a destination transitions from `degraded` to `unhealthy`
- A "Resume delivery" button that clears `paused_at` and resets `consecutive_failures`
Don't send an alert for every single delivery failure — that's noise. Alert on state transitions: the first time a destination enters degraded, and when it enters unhealthy. One alert per transition is the right signal density.
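One way to keep alerts pinned to transitions is to have the health evaluation job compare the previous status with the new one and enqueue a notification only when they differ. A sketch that could be folded into EvaluateHealth above (the auto-pause handling is omitted here); the notifications table and alertWorthy helper are hypothetical:

```go
// Sketch: record the status change and alert only when the status actually
// transitions. The notifications table is a hypothetical queue that an
// alerting worker drains and turns into emails or webhooks.
func (s *DestinationStore) transitionStatus(ctx context.Context, destID uuid.UUID, newStatus string, score int) error {
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	var oldStatus string
	if err := tx.QueryRowContext(ctx,
		`SELECT health_status FROM destinations WHERE id = $1 FOR UPDATE`,
		destID,
	).Scan(&oldStatus); err != nil {
		return err
	}

	if _, err := tx.ExecContext(ctx, `
		UPDATE destinations
		SET health_status = $2, health_score = $3, health_checked_at = NOW()
		WHERE id = $1
	`, destID, newStatus, score); err != nil {
		return err
	}

	// One notification per transition, not one per failed delivery.
	if oldStatus != newStatus && alertWorthy(newStatus) {
		if _, err := tx.ExecContext(ctx, `
			INSERT INTO notifications (destination_id, kind, detail)
			VALUES ($1, 'health_transition', $2 || ' -> ' || $3)
		`, destID, oldStatus, newStatus); err != nil {
			return err
		}
	}
	return tx.Commit()
}

func alertWorthy(status string) bool {
	return status == "degraded" || status == "unhealthy"
}
```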
GetHook exposes destination health status as a field on every GET /v1/destinations/{id} response, so customers can also detect paused destinations programmatically, either by polling or by feeding the status into a monitoring integration (webhooks about their webhooks).
Backfill vs. Discard for Paused Destinations
When you resume a paused destination, you have a choice: deliver the events that accumulated while the destination was paused, or discard them and start fresh.
The right answer depends on the event type:
| Event Type | On Resume |
|---|---|
| Idempotent, time-insensitive (e.g., record updates) | Backfill — the destination wants the latest state |
| Time-sensitive (e.g., fraud alerts, real-time triggers) | Discard or let customer choose — stale events may cause incorrect behavior |
| Ordered sequences (e.g., state machine transitions) | Backfill in strict order — out-of-order delivery is worse than no delivery |
The safest default is to hold events for a configurable retention window (24–72 hours) and let the customer decide on resume. Expose a replay_on_resume flag in the destination configuration.
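On resume, the decision is a single branch over that flag. A sketch, assuming a held_events table for events retained during the pause, a replay_on_resume column on the destination row, and that it runs before the resume handler clears paused_at; none of these exist in the schema above:

```go
// Sketch: decide what to do with events that accumulated during the pause.
// held_events and replay_on_resume are assumed, not part of the earlier schema.
// Call this before clearing paused_at, since paused_at marks the backlog window.
func (s *DestinationStore) handleBacklogOnResume(ctx context.Context, destID uuid.UUID) error {
	var replay bool
	var pausedAt sql.NullTime
	err := s.db.QueryRowContext(ctx,
		`SELECT replay_on_resume, paused_at FROM destinations WHERE id = $1`,
		destID,
	).Scan(&replay, &pausedAt)
	if err != nil {
		return err
	}
	if !pausedAt.Valid {
		return nil // destination isn't paused, so nothing was held
	}

	if replay {
		// Re-enqueue held events; the dispatch query is responsible for
		// delivering them in created_at order for ordered sequences.
		_, err = s.db.ExecContext(ctx, `
			UPDATE held_events
			SET status = 'queued'
			WHERE destination_id = $1 AND created_at >= $2
		`, destID, pausedAt.Time)
		return err
	}

	// Customer opted out of backfill: discard the backlog and start fresh.
	_, err = s.db.ExecContext(ctx, `
		UPDATE held_events
		SET status = 'discarded'
		WHERE destination_id = $1 AND created_at >= $2
	`, destID, pausedAt.Time)
	return err
}
```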
Putting It Together
A complete health scoring system is less than 200 lines of Go and SQL. The value is disproportionate to the implementation cost: customers stop filing support tickets asking why their integration stopped working three days ago, your delivery worker stops burning cycles on permanently broken endpoints, and your queue stays shallow.
The pieces:
- A `consecutive_failures` counter updated inline in the delivery worker
- A `health_evaluation` job running every minute over your Postgres job queue
- A pause gate in the dispatch query (`paused_at IS NULL`)
- A probe job for 5xx-paused destinations
- A state-transition alert (email or webhook) to the customer