webhooks · reliability · infrastructure · delivery · observability

Webhook Destination Health Scoring: Detecting and Disabling Unhealthy Endpoints

Blindly retrying every failed webhook delivery wastes resources and delays real alerts. A health scoring system lets you identify degraded destinations early, pause delivery automatically, and surface the right signal to your customers.

Tomasz Brzezinski
Staff Infrastructure Engineer
April 4, 2026
10 min read

Retry logic is the first tool most teams reach for when webhook delivery fails. It works well when failures are transient — a destination that was briefly overloaded recovers, the retry succeeds, and everything continues. But retry logic has a blind spot: it doesn't distinguish between a destination that's temporarily slow and one that's been misconfigured, decommissioned, or silently broken for three days.

When you retry into a broken destination indefinitely, you accumulate a backlog of queued jobs, burn delivery worker capacity on requests that will never succeed, and delay delivery to all your other destinations as the queue grows. The fix is destination health scoring: a lightweight signal that tells your delivery system when to back off, notify the customer, or pause delivery entirely.


What "Unhealthy" Actually Means

Not every failed HTTP response means the destination is broken. A 429 means your delivery rate exceeds the destination's capacity — the fix is throttling, not disabling. A 503 that clears in two minutes is noise. A consistent stream of 500s over 30 minutes is a real signal.

The useful failure categories are:

| Outcome | Likely Cause | Response |
| --- | --- | --- |
| HTTP 4xx (not 429) | Misconfigured destination, expired auth, wrong URL | Likely broken — alert immediately |
| HTTP 5xx | Destination is erroring internally | Degrade score, retry with backoff |
| Connection timeout | Host unreachable or TLS failure | Degrade score aggressively |
| DNS resolution failure | Destination URL is invalid or domain no longer exists | Pause immediately |
| HTTP 429 | Rate limited by destination | Throttle, do not degrade health score |
| HTTP 200/201 | Success | Recover health score |

The key insight is that 4xx responses (other than 429) should degrade the health score faster than 5xx responses. A 404 or 401 is unlikely to resolve on its own. A 503 might.
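The mapping in the table can live in a small classifier that the delivery worker consults after each attempt. A sketch, where the action strings and the `netErr` parameter are illustrative rather than a fixed API:

```go
// classifyOutcome maps a delivery result onto the actions in the table
// above. statusCode is 0 when the request never completed; netErr then
// names the transport failure. The action names are illustrative.
func classifyOutcome(statusCode int, netErr string) string {
    switch {
    case netErr == "dns":
        return "pause" // invalid URL or dead domain: stop immediately
    case netErr != "":
        return "degrade_aggressively" // timeout, TLS, connection refused
    case statusCode == 429:
        return "throttle" // back off, but don't touch the health score
    case statusCode >= 200 && statusCode < 300:
        return "recover"
    case statusCode >= 500:
        return "degrade" // destination erroring internally; retry with backoff
    case statusCode >= 400:
        return "alert" // misconfiguration won't self-heal: degrade fast
    default:
        return "degrade"
    }
}
```

Keeping the classification in one place means the health evaluator, the retry scheduler, and the alerting path all agree on what a given response means.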


The Health Score Model

A simple and durable model is a sliding-window success rate computed over the last N delivery attempts per destination. You don't need a time-series database for this — your delivery attempts table already has the data.

```sql
-- Health score: success rate over the last 50 attempts
SELECT
    destination_id,
    COUNT(*) AS total,
    SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) AS successes,
    ROUND(
        100.0 * SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) / COUNT(*),
        1
    ) AS success_rate_pct
FROM (
    SELECT destination_id, outcome
    FROM delivery_attempts
    WHERE destination_id = $1
    ORDER BY created_at DESC
    LIMIT 50
) recent
GROUP BY destination_id;
```

This gives you a percentage. You can define thresholds:

| Success Rate | Status | Action |
| --- | --- | --- |
| 95–100% | Healthy | No action |
| 75–94% | Degraded | Alert customer, increase retry interval |
| 50–74% | Critical | Alert customer, consider pause |
| < 50% | Unhealthy | Auto-pause delivery |

The window size (50 attempts) is a tradeoff. A smaller window reacts faster but produces more false positives for bursty failures. A larger window is more accurate but slower to catch a newly broken destination. 50 is a reasonable starting point; tune it based on your typical delivery volume per destination.


Storing and Updating the Score

You don't want to run the scoring query on every delivery attempt — at high volume, that's an expensive aggregation per delivery. Instead, maintain a materialized health state on the destination row and update it asynchronously.

Add a few columns to your destinations table:

```sql
ALTER TABLE destinations
    ADD COLUMN health_status TEXT NOT NULL DEFAULT 'healthy',
    -- healthy | degraded | critical | unhealthy | paused
    ADD COLUMN health_score  INT  NOT NULL DEFAULT 100,
    -- 0–100, higher is better
    ADD COLUMN health_checked_at TIMESTAMPTZ,
    ADD COLUMN consecutive_failures INT NOT NULL DEFAULT 0,
    ADD COLUMN paused_at TIMESTAMPTZ,
    ADD COLUMN pause_reason TEXT;
```

After each delivery attempt, your worker updates the destination's consecutive failure count. A separate health evaluation job — running every minute via your Postgres-backed job queue — recomputes the sliding-window score and transitions health status.

```go
// Called by the health evaluation job for each active destination
func (s *DestinationStore) EvaluateHealth(ctx context.Context, destID uuid.UUID) error {
    const window = 50

    var total, successes int
    err := s.db.QueryRowContext(ctx, `
        SELECT
            COUNT(*) AS total,
            -- COALESCE: SUM returns NULL when there are no rows, which
            -- would otherwise fail the Scan into an int
            COALESCE(SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END), 0) AS successes
        FROM (
            SELECT outcome
            FROM delivery_attempts
            WHERE destination_id = $1
            ORDER BY created_at DESC
            LIMIT $2
        ) recent
    `, destID, window).Scan(&total, &successes)
    if err != nil {
        return fmt.Errorf("query health: %w", err)
    }

    if total == 0 {
        return nil // no attempts yet; leave the status unchanged
    }

    score := int(float64(successes) / float64(total) * 100)
    status := healthStatus(score)

    _, err = s.db.ExecContext(ctx, `
        UPDATE destinations
        SET
            health_score      = $2,
            health_status     = $3,
            health_checked_at = NOW(),
            paused_at         = CASE WHEN $3 = 'unhealthy' AND paused_at IS NULL THEN NOW() ELSE paused_at END,
            pause_reason      = CASE WHEN $3 = 'unhealthy' AND paused_at IS NULL THEN 'auto: health score below threshold' ELSE pause_reason END
        WHERE id = $1
    `, destID, score, status)
    return err
}

func healthStatus(score int) string {
    switch {
    case score >= 95:
        return "healthy"
    case score >= 75:
        return "degraded"
    case score >= 50:
        return "critical"
    default:
        return "unhealthy"
    }
}
```

When the status transitions to unhealthy, the row is automatically paused. Your delivery worker's dispatch query skips destinations where paused_at IS NOT NULL:

```sql
SELECT d.*
FROM destinations d
JOIN routes r ON r.destination_id = d.id
WHERE r.source_id = $1
  AND d.paused_at IS NULL;
```

Early Warning: Consecutive Failure Tracking

The sliding window catches persistent degradation but is slow to react to sudden breaks. If a destination has been healthy for months and then starts returning 404 on every attempt, you'll see 47 successes and 3 failures in your 50-attempt window — a 94% score that doesn't trigger any alert.

Add a consecutive failure counter that runs in parallel with the window-based score:

```go
func (w *DeliveryWorker) recordAttempt(ctx context.Context, destID uuid.UUID, outcome string) error {
    if outcome == "success" {
        // Reset consecutive failures on any success
        _, err := w.db.ExecContext(ctx,
            `UPDATE destinations SET consecutive_failures = 0 WHERE id = $1`,
            destID,
        )
        return err
    }

    // Increment and check threshold
    var consecutiveFailures int
    err := w.db.QueryRowContext(ctx, `
        UPDATE destinations
        SET consecutive_failures = consecutive_failures + 1
        WHERE id = $1
        RETURNING consecutive_failures
    `, destID).Scan(&consecutiveFailures)
    if err != nil {
        return err
    }

    // Hard pause after 10 consecutive failures, regardless of window score
    if consecutiveFailures >= 10 {
        _, err = w.db.ExecContext(ctx, `
            UPDATE destinations
            SET
                paused_at    = NOW(),
                pause_reason = 'auto: 10 consecutive failures',
                health_status = 'unhealthy'
            WHERE id = $1 AND paused_at IS NULL
        `, destID)
    }
    return err
}
```

This catches the cliff scenario: a destination that was working perfectly and then breaks completely will hit 10 consecutive failures and pause well before the sliding window degrades below the threshold.


Recovery: Automatic or Manual?

Once a destination is paused, you have two options for resuming delivery.

Manual resume is simpler and safer. The customer (or your support team) investigates the failure, fixes the destination, and clicks "Resume delivery" in the dashboard. This prevents a fixed-but-fragile destination from being hammered by a backlog immediately after resuming.
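The manual resume can be a single guarded UPDATE that clears the pause and resets the counter, sketched here against the columns added earlier. It re-enters at degraded rather than healthy, so the status only recovers fully once the sliding-window score does:

```sql
UPDATE destinations
SET
    paused_at            = NULL,
    pause_reason         = NULL,
    consecutive_failures = 0,
    health_status        = 'degraded'
WHERE id = $1
  AND paused_at IS NOT NULL;
```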

Automatic recovery with a probe works well for destinations that self-heal. Every 5 minutes, your worker sends a synthetic probe request to paused destinations. If the probe succeeds, the destination transitions back to degraded status and delivery resumes with a reduced concurrency cap until the sliding window score recovers to healthy.

The probe request should be a lightweight, idempotent payload — a ping event type that the destination can ignore:

```json
{
  "id": "probe_01HX9P3...",
  "type": "webhook.probe",
  "created_at": "2026-04-04T09:00:00Z",
  "data": {}
}
```

Don't send probe requests to destinations that were paused due to a 4xx error. A 404 or 401 won't self-heal — probing them wastes requests and confuses the customer. Only auto-probe destinations paused due to 5xx or network errors.


What to Surface to Customers

Health scoring is only valuable if customers can see and act on it. The minimum viable surface area:

  • A health badge on each destination in the dashboard (Healthy, Degraded, Paused)
  • A timestamp and reason on paused destinations ("Paused automatically: 10 consecutive failures since 2026-04-04 09:14 UTC")
  • An email alert when a destination transitions from degraded to unhealthy
  • A "Resume delivery" button that clears paused_at and resets consecutive_failures

Don't send an alert for every single delivery failure — that's noise. Alert on state transitions: the first time a destination enters degraded, and when it enters unhealthy. One alert per transition is the right signal density.
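Transition detection is just a comparison of old and new status with recoveries filtered out. A sketch, where the numeric severity ranking is an assumption layered on the four statuses above:

```go
// severity orders the four statuses so recoveries can be told apart
// from degradations.
var severity = map[string]int{
    "healthy":   0,
    "degraded":  1,
    "critical":  2,
    "unhealthy": 3,
}

// shouldAlert fires only on a downward transition into degraded or
// unhealthy: one alert per transition, silence on recovery and on the
// degraded -> critical step.
func shouldAlert(oldStatus, newStatus string) bool {
    if severity[newStatus] <= severity[oldStatus] {
        return false // recovery or no change
    }
    return newStatus == "degraded" || newStatus == "unhealthy"
}
```

The health evaluation job can call this with the status it read before the update and the status it is about to write, and enqueue the alert in the same transaction.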

GetHook exposes destination health status as a field on every GET /v1/destinations/{id} response, so customers can also poll it — or wire up a monitoring integration that sends webhooks about their webhooks — to detect paused destinations programmatically.


Backfill vs. Discard for Paused Destinations

When you resume a paused destination, you have a choice: deliver the events that accumulated while the destination was paused, or discard them and start fresh.

The right answer depends on the event type:

| Event Type | On Resume |
| --- | --- |
| Idempotent, time-insensitive (e.g., record updates) | Backfill — the destination wants the latest state |
| Time-sensitive (e.g., fraud alerts, real-time triggers) | Discard or let customer choose — stale events may cause incorrect behavior |
| Ordered sequences (e.g., state machine transitions) | Backfill in strict order — out-of-order delivery is worse than no delivery |

The safest default is to hold events for a configurable retention window (24–72 hours) and let the customer decide on resume. Expose a replay_on_resume flag in the destination configuration.
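The resume decision then reduces to the flag plus an event classification. A sketch, where `eventKind` is a hypothetical per-event tag rather than a field defined earlier:

```go
// resumeAction decides what happens to events held while the
// destination was paused. replayOnResume is the destination-level flag
// from the configuration; eventKind is a hypothetical per-event
// classification, not a field defined earlier.
func resumeAction(replayOnResume bool, eventKind string) string {
    if !replayOnResume {
        return "discard"
    }
    if eventKind == "ordered" {
        return "backfill_in_order" // preserve sequence or don't deliver
    }
    return "backfill"
}
```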


Putting It Together

A complete health scoring system is less than 200 lines of Go and SQL. The value is disproportionate to the implementation cost: customers stop filing support tickets asking why their integration stopped working three days ago, your delivery worker stops burning cycles on permanently broken endpoints, and your queue stays shallow.

The pieces:

  1. A consecutive_failures counter updated inline in the delivery worker
  2. A health_evaluation job running every minute over your Postgres job queue
  3. A pause gate in the dispatch query (paused_at IS NULL)
  4. A probe job for 5xx-paused destinations
  5. A state-transition alert (email or webhook) to the customer

Set up destination health monitoring with GetHook →

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.