Retry logic is the first tool most teams reach for when webhook delivery fails. It works well when failures are transient — a destination that was briefly overloaded recovers, the retry succeeds, and everything continues. But retry logic has a blind spot: it doesn't distinguish between a destination that's temporarily slow and one that's been misconfigured, decommissioned, or silently broken for three days.
When you retry into a broken destination indefinitely, you accumulate a backlog of queued jobs, burn delivery worker capacity on requests that will never succeed, and delay delivery to all your other destinations as the queue grows. The fix is destination health scoring: a lightweight signal that tells your delivery system when to back off, notify the customer, or pause delivery entirely.
What "Unhealthy" Actually Means
Not every failed HTTP response means the destination is broken. A 429 means your delivery rate exceeds the destination's capacity — the fix is throttling, not disabling. A 503 that clears in two minutes is noise. A consistent stream of 500s over 30 minutes is a real signal.
The useful failure categories are:
| Outcome | Likely Cause | Response |
|---|---|---|
| HTTP 4xx (not 429) | Misconfigured destination, expired auth, wrong URL | Likely broken — alert immediately |
| HTTP 5xx | Destination is erroring internally | Degrade score, retry with backoff |
| Connection timeout | Host unreachable or TLS failure | Degrade score aggressively |
| DNS resolution failure | Destination URL is invalid or domain no longer exists | Pause immediately |
| HTTP 429 | Rate limited by destination | Throttle, do not degrade health score |
| HTTP 200/201 | Success | Recover health score |
The key insight is that 4xx responses (other than 429) should degrade the health score faster than 5xx responses. A 404 or 401 is unlikely to resolve on its own. A 503 might.
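If you want this classification encoded rather than left to convention, a small helper on the delivery worker can map each outcome to a handling decision. The sketch below is illustrative: the weight values are assumptions, and the simple sliding-window score described next treats every failure equally, but the weights show one way to express the "4xx degrades faster" idea if you build a weighted variant.

```go
// Sketch: classify a delivery outcome into a handling decision.
// The weight values are illustrative, not fixed recommendations.
type outcomeAction struct {
	healthWeight  int  // how heavily this failure should count against health (0 = ignore)
	retry         bool // retry with backoff
	throttle      bool // slow down sends instead of degrading health
	pauseNow      bool // pause the destination immediately
	alertCustomer bool // notify the customer right away
}

func classifyOutcome(statusCode int, connErr error, dnsFailure bool) outcomeAction {
	switch {
	case dnsFailure:
		// The domain no longer resolves; retrying is pointless.
		return outcomeAction{healthWeight: 3, pauseNow: true, alertCustomer: true}
	case connErr != nil:
		// Timeout or TLS failure: degrade aggressively but keep retrying.
		return outcomeAction{healthWeight: 3, retry: true}
	case statusCode == 429:
		// Rate limited: throttle, don't touch the health score.
		return outcomeAction{throttle: true, retry: true}
	case statusCode >= 500:
		// Destination erroring internally: degrade, retry with backoff.
		return outcomeAction{healthWeight: 1, retry: true}
	case statusCode >= 400:
		// 401/404/etc.: unlikely to resolve on its own.
		return outcomeAction{healthWeight: 3, alertCustomer: true}
	default:
		// 2xx: success, recovers the health score.
		return outcomeAction{}
	}
}
```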
The Health Score Model
A simple and durable model is a sliding-window success rate computed over the last N delivery attempts per destination. You don't need a time-series database for this — your delivery attempts table already has the data.
```sql
-- Health score: success rate over the last 50 attempts
SELECT
destination_id,
COUNT(*) AS total,
SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) AS successes,
ROUND(
100.0 * SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) / COUNT(*),
1
) AS success_rate_pct
FROM (
SELECT destination_id, outcome
FROM delivery_attempts
WHERE destination_id = $1
ORDER BY created_at DESC
LIMIT 50
) recent
GROUP BY destination_id;
```

This gives you a percentage. You can define thresholds:
| Success Rate | Status | Action |
|---|---|---|
| 95–100% | Healthy | No action |
| 75–94% | Degraded | Alert customer, increase retry interval |
| 50–74% | Critical | Alert customer, consider pause |
| < 50% | Unhealthy | Auto-pause delivery |
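The "increase retry interval" action for the degraded and critical bands can be as simple as stretching the normal backoff schedule by the destination's current status. A minimal sketch; the multipliers are assumptions, not recommendations:

```go
// Sketch: scale the retry backoff by destination health so a struggling
// endpoint gets more breathing room. The multipliers are illustrative.
func retryDelay(base time.Duration, attempt int, healthStatus string) time.Duration {
	if attempt > 10 {
		attempt = 10 // cap the exponent so the shift below can't overflow
	}
	delay := base * time.Duration(1<<attempt) // standard exponential backoff

	switch healthStatus {
	case "degraded":
		delay *= 2
	case "critical":
		delay *= 4
	}
	// "unhealthy" destinations are paused and never reach the retry path.
	return delay
}
```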
The window size (50 attempts) is a tradeoff. A smaller window reacts faster but produces more false positives for bursty failures. A larger window is more accurate but slower to catch a newly broken destination. 50 is a reasonable starting point; tune it based on your typical delivery volume per destination.
Storing and Updating the Score
You don't want to run the scoring query on every delivery attempt — at high volume, that's an expensive aggregation per delivery. Instead, maintain a materialized health state on the destination row and update it asynchronously.
Add a few columns to your destinations table:
```sql
ALTER TABLE destinations
ADD COLUMN health_status TEXT NOT NULL DEFAULT 'healthy',
-- healthy | degraded | critical | unhealthy | paused
ADD COLUMN health_score INT NOT NULL DEFAULT 100,
-- 0–100, higher is better
ADD COLUMN health_checked_at TIMESTAMPTZ,
ADD COLUMN consecutive_failures INT NOT NULL DEFAULT 0,
ADD COLUMN paused_at TIMESTAMPTZ,
ADD COLUMN pause_reason TEXT;
```

After each delivery attempt, your worker updates the destination's consecutive failure count. A separate health evaluation job — running every minute via your Postgres-backed job queue — recomputes the sliding-window score and transitions health status.
```go
// Called by the health evaluation job for each active destination
func (s *DestinationStore) EvaluateHealth(ctx context.Context, destID uuid.UUID) error {
const window = 50
var total, successes int
err := s.db.QueryRowContext(ctx, `
SELECT
COUNT(*) AS total,
COALESCE(SUM(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END), 0) AS successes -- COALESCE so zero rows scans as 0, not NULL
FROM (
SELECT outcome
FROM delivery_attempts
WHERE destination_id = $1
ORDER BY created_at DESC
LIMIT $2
) recent
`, destID, window).Scan(&total, &successes)
if err != nil {
return fmt.Errorf("query health: %w", err)
}
if total == 0 {
return nil // no data, leave as healthy
}
score := int(float64(successes) / float64(total) * 100)
status := healthStatus(score)
_, err = s.db.ExecContext(ctx, `
UPDATE destinations
SET
health_score = $2,
health_status = $3,
health_checked_at = NOW(),
paused_at = CASE WHEN $3 = 'unhealthy' AND paused_at IS NULL THEN NOW() ELSE paused_at END,
pause_reason = CASE WHEN $3 = 'unhealthy' AND paused_at IS NULL THEN 'auto: health score below threshold' ELSE pause_reason END
WHERE id = $1
`, destID, score, status)
return err
}
func healthStatus(score int) string {
switch {
case score >= 95:
return "healthy"
case score >= 75:
return "degraded"
case score >= 50:
return "critical"
default:
return "unhealthy"
}
}
```

When the status transitions to unhealthy, the row is automatically paused. Your delivery worker's dispatch query skips destinations where paused_at IS NOT NULL:
```sql
SELECT d.*
FROM destinations d
JOIN routes r ON r.destination_id = d.id
WHERE r.source_id = $1
  AND d.paused_at IS NULL;
```

Early Warning: Consecutive Failure Tracking
The sliding window catches persistent degradation but is slow to react to sudden breaks. If a destination has been healthy for months and then starts returning 404 on every attempt, you'll see 47 successes and 3 failures in your 50-attempt window — a 94% score that still reads as barely degraded and sits nowhere near the auto-pause threshold.
Add a consecutive failure counter that runs in parallel with the window-based score:
```go
func (w *DeliveryWorker) recordAttempt(ctx context.Context, destID uuid.UUID, outcome string) error {
if outcome == "success" {
// Reset consecutive failures on any success
_, err := w.db.ExecContext(ctx,
`UPDATE destinations SET consecutive_failures = 0 WHERE id = $1`,
destID,
)
return err
}
// Increment and check threshold
var consecutiveFailures int
err := w.db.QueryRowContext(ctx, `
UPDATE destinations
SET consecutive_failures = consecutive_failures + 1
WHERE id = $1
RETURNING consecutive_failures
`, destID).Scan(&consecutiveFailures)
if err != nil {
return err
}
// Hard pause after 10 consecutive failures, regardless of window score
if consecutiveFailures >= 10 {
_, err = w.db.ExecContext(ctx, `
UPDATE destinations
SET
paused_at = NOW(),
pause_reason = 'auto: 10 consecutive failures',
health_status = 'unhealthy'
WHERE id = $1 AND paused_at IS NULL
`, destID)
}
return err
}
```

This catches the cliff scenario: a destination that was working perfectly and then breaks completely will hit 10 consecutive failures and pause well before the sliding window degrades below the threshold.
Recovery: Automatic or Manual?
Once a destination is paused, you have two options for resuming delivery.
Manual resume is simpler and safer. The customer (or your support team) investigates the failure, fixes the destination, and clicks "Resume delivery" in the dashboard. This prevents a fixed-but-fragile destination from being hammered by a backlog immediately after resuming.
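Mechanically, a manual resume is just a state reset. A minimal sketch, assuming the column names from the earlier migration and a hypothetical ResumeDestination method; re-entering as degraded rather than healthy is one reasonable choice, so the sliding window has to earn the healthy badge back:

```go
// Sketch: manual "Resume delivery". Clears the pause state and resets the
// consecutive-failure counter; the destination re-enters rotation as
// "degraded" until the sliding-window score recovers on its own.
func (s *DestinationStore) ResumeDestination(ctx context.Context, destID uuid.UUID) error {
	_, err := s.db.ExecContext(ctx, `
		UPDATE destinations
		SET
			paused_at = NULL,
			pause_reason = NULL,
			consecutive_failures = 0,
			health_status = 'degraded'
		WHERE id = $1 AND paused_at IS NOT NULL
	`, destID)
	return err
}
```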
Automatic recovery with a probe works well for destinations that self-heal. Every 5 minutes, your worker sends a synthetic probe request to paused destinations. If the probe succeeds, the destination transitions back to degraded status and delivery resumes with a reduced concurrency cap until the sliding window score recovers to healthy.
The probe request should be a lightweight, idempotent payload — a ping event type that the destination can ignore:
```json
{
"id": "probe_01HX9P3...",
"type": "webhook.probe",
"created_at": "2026-04-04T09:00:00Z",
"data": {}
}
```

Don't send probe requests to destinations that were paused due to a 4xx error. A 404 or 401 won't self-heal — probing them wastes requests and confuses the customer. Only auto-probe destinations paused due to 5xx or network errors.
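A sketch of what the probe loop might look like. It assumes a pause_category column ('server_error' vs 'client_error') that the schema above doesn't define, plus a hypothetical ProbeWorker type; the reduced-concurrency resume described above is left out:

```go
// Sketch: runs every few minutes against paused destinations. pause_category
// is an assumed column recording *why* the destination was paused; without
// it you can't tell probe-worthy pauses (5xx, network) from dead ones (4xx).
type ProbeWorker struct {
	db         *sql.DB
	httpClient *http.Client
}

func (w *ProbeWorker) probePausedDestinations(ctx context.Context) error {
	rows, err := w.db.QueryContext(ctx, `
		SELECT id, url
		FROM destinations
		WHERE paused_at IS NOT NULL
		  AND pause_category = 'server_error' -- never probe 4xx-paused destinations
	`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var destID uuid.UUID
		var url string
		if err := rows.Scan(&destID, &url); err != nil {
			return err
		}

		payload, _ := json.Marshal(map[string]any{
			"id":         "probe_" + uuid.NewString(),
			"type":       "webhook.probe",
			"created_at": time.Now().UTC().Format(time.RFC3339),
			"data":       map[string]any{},
		})
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
		if err != nil {
			continue
		}
		req.Header.Set("Content-Type", "application/json")

		resp, err := w.httpClient.Do(req)
		if err != nil {
			continue // still unreachable; leave it paused
		}
		resp.Body.Close()

		if resp.StatusCode >= 200 && resp.StatusCode < 300 {
			// Probe succeeded: unpause into degraded and let the window recover.
			_, _ = w.db.ExecContext(ctx, `
				UPDATE destinations
				SET paused_at = NULL, pause_reason = NULL,
				    consecutive_failures = 0, health_status = 'degraded'
				WHERE id = $1
			`, destID)
		}
	}
	return rows.Err()
}
```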
What to Surface to Customers
Health scoring is only valuable if customers can see and act on it. The minimum viable surface area:
- A health badge on each destination in the dashboard (`Healthy`, `Degraded`, `Paused`)
- A timestamp and reason on paused destinations ("Paused automatically: 10 consecutive failures since 2026-04-04 09:14 UTC")
- An email alert when a destination transitions from `degraded` to `unhealthy`
- A "Resume delivery" button that clears `paused_at` and resets `consecutive_failures`
Don't send an alert for every single delivery failure — that's noise. Alert on state transitions: the first time a destination enters degraded, and when it enters unhealthy. One alert per transition is the right signal density.
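One way to keep alerts pinned to transitions is to have the health evaluation job compare the previous status with the new one and enqueue a notification only when they differ. A sketch that could be folded into EvaluateHealth above (the auto-pause handling is omitted here); the notifications table and alertWorthy helper are hypothetical:

```go
// Sketch: record the status change and alert only when the status actually
// transitions. The notifications table is a hypothetical queue that an
// alerting worker drains and turns into emails or webhooks.
func (s *DestinationStore) transitionStatus(ctx context.Context, destID uuid.UUID, newStatus string, score int) error {
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	var oldStatus string
	if err := tx.QueryRowContext(ctx,
		`SELECT health_status FROM destinations WHERE id = $1 FOR UPDATE`,
		destID,
	).Scan(&oldStatus); err != nil {
		return err
	}

	if _, err := tx.ExecContext(ctx, `
		UPDATE destinations
		SET health_status = $2, health_score = $3, health_checked_at = NOW()
		WHERE id = $1
	`, destID, newStatus, score); err != nil {
		return err
	}

	// One notification per transition, not one per failed delivery.
	if oldStatus != newStatus && alertWorthy(newStatus) {
		if _, err := tx.ExecContext(ctx, `
			INSERT INTO notifications (destination_id, kind, detail)
			VALUES ($1, 'health_transition', $2 || ' -> ' || $3)
		`, destID, oldStatus, newStatus); err != nil {
			return err
		}
	}
	return tx.Commit()
}

func alertWorthy(status string) bool {
	return status == "degraded" || status == "unhealthy"
}
```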
GetHook exposes destination health status as a field on every GET /v1/destinations/{id} response, so customers can also detect paused destinations programmatically, either by polling or by feeding the status into a monitoring integration (webhooks about their webhooks).
Backfill vs. Discard for Paused Destinations
When you resume a paused destination, you have a choice: deliver the events that accumulated while the destination was paused, or discard them and start fresh.
The right answer depends on the event type:
| Event Type | On Resume |
|---|---|
| Idempotent, time-insensitive (e.g., record updates) | Backfill — the destination wants the latest state |
| Time-sensitive (e.g., fraud alerts, real-time triggers) | Discard or let customer choose — stale events may cause incorrect behavior |
| Ordered sequences (e.g., state machine transitions) | Backfill in strict order — out-of-order delivery is worse than no delivery |
The safest default is to hold events for a configurable retention window (24–72 hours) and let the customer decide on resume. Expose a replay_on_resume flag in the destination configuration.
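On resume, the decision is a single branch over that flag. A sketch, assuming a held_events table for events retained during the pause, a replay_on_resume column on the destination row, and that it runs before the resume handler clears paused_at; none of these exist in the schema above:

```go
// Sketch: decide what to do with events that accumulated during the pause.
// held_events and replay_on_resume are assumed, not part of the earlier schema.
// Call this before clearing paused_at, since paused_at marks the backlog window.
func (s *DestinationStore) handleBacklogOnResume(ctx context.Context, destID uuid.UUID) error {
	var replay bool
	var pausedAt sql.NullTime
	err := s.db.QueryRowContext(ctx,
		`SELECT replay_on_resume, paused_at FROM destinations WHERE id = $1`,
		destID,
	).Scan(&replay, &pausedAt)
	if err != nil {
		return err
	}
	if !pausedAt.Valid {
		return nil // destination isn't paused, so nothing was held
	}

	if replay {
		// Re-enqueue held events; the dispatch query is responsible for
		// delivering them in created_at order for ordered sequences.
		_, err = s.db.ExecContext(ctx, `
			UPDATE held_events
			SET status = 'queued'
			WHERE destination_id = $1 AND created_at >= $2
		`, destID, pausedAt.Time)
		return err
	}

	// Customer opted out of backfill: discard the backlog and start fresh.
	_, err = s.db.ExecContext(ctx, `
		UPDATE held_events
		SET status = 'discarded'
		WHERE destination_id = $1 AND created_at >= $2
	`, destID, pausedAt.Time)
	return err
}
```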
Putting It Together
A complete health scoring system is less than 200 lines of Go and SQL. The value is disproportionate to the implementation cost: customers stop filing support tickets asking why their integration stopped working three days ago, your delivery worker stops burning cycles on permanently broken endpoints, and your queue stays shallow.
The pieces:
- A `consecutive_failures` counter updated inline in the delivery worker
- A `health_evaluation` job running every minute over your Postgres job queue
- A pause gate in the dispatch query (`paused_at IS NULL`)
- A probe job for 5xx-paused destinations
- A state-transition alert (email or webhook) to the customer