Webhook delivery failures are a special category of incident. Unlike a downed API endpoint, the failure is often silent from the sender's perspective — you sent the event, you got a 200 back from your gateway, and the consumer has no idea they missed anything. By the time an alert fires, the failure may have been accumulating for minutes or hours.
This post is a structured runbook for diagnosing webhook delivery incidents. It is not a post about theory. It is the decision tree we have refined through production incidents — the questions you ask, in the order you ask them, so you can stop guessing and start closing.
## Before the Incident: What You Need to Have
A runbook is only useful if your observability is in place before things break. The minimum viable monitoring stack for a webhook delivery system:
| Signal | What to measure | Alert threshold |
|---|---|---|
| Delivery success rate | `delivered / (delivered + failed)` per destination, 5-min window | Below 95% for any destination |
| Retry queue depth | `COUNT(*) WHERE status = 'retry_scheduled'` per destination | Above 500 and growing |
| Dead-letter rate | Events entering DLQ per minute | Any sustained rate above baseline |
| Worker lag | `now() - MIN(next_attempt_at)` for unlocked jobs | Above 2 minutes |
| Ingest latency (p99) | Time from HTTP request received to event persisted | Above 500 ms |
| Destination response time (p95) | HTTP round-trip to each destination | Above destination's configured timeout |
If you do not have these signals, the runbook below will require you to query the database manually for every step. That is still workable — the queries are provided — but set up the dashboards before the next incident.
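If you are bootstrapping that first signal by hand, here is a minimal sketch of the success-rate query, assuming the same `delivery_attempts` and `destinations` schema used throughout the steps below:

```sql
-- Delivery success rate per destination over the last 5 minutes.
-- Alert if success_pct drops below 95 for any destination.
SELECT
  d.id,
  d.name,
  ROUND(
    COUNT(*) FILTER (WHERE da.outcome = 'success')::numeric /
    NULLIF(COUNT(*), 0) * 100, 1
  ) AS success_pct
FROM delivery_attempts da
JOIN destinations d ON d.id = da.destination_id
WHERE da.created_at > now() - INTERVAL '5 minutes'
GROUP BY d.id, d.name;
```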
## Step 1: Determine the Blast Radius
The first question is not "what broke" — it is "how much broke and for whom."
```sql
SELECT
  d.id,
  d.name,
  d.url,
  COUNT(*) FILTER (WHERE e.status = 'delivered') AS delivered,
  COUNT(*) FILTER (WHERE e.status IN ('retry_scheduled', 'dead_letter')) AS failed,
  COUNT(*) FILTER (WHERE e.status = 'retry_scheduled') AS in_retry,
  COUNT(*) FILTER (WHERE e.status = 'dead_letter') AS dead_lettered,
  MIN(e.created_at) AS oldest_affected_event
FROM events e
JOIN routes r ON r.id = e.route_id
JOIN destinations d ON d.id = r.destination_id
WHERE e.created_at > now() - INTERVAL '1 hour'
GROUP BY d.id, d.name, d.url
HAVING COUNT(*) FILTER (WHERE e.status != 'delivered') > 0
ORDER BY failed DESC;
```

This tells you immediately whether the failure is:
- **Isolated to one destination** — likely a destination-side issue (firewall change, deploy, certificate expiry)
- **Affecting all destinations** — likely an infrastructure issue (worker down, database overloaded, network partition)
- **Affecting a subset of destinations** — likely a routing, pattern-matching, or configuration issue (see the sketch below)
Do not proceed past this step without a clear answer. The next steps diverge sharply based on blast radius.
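For the subset case, slicing the same window by route instead of destination can separate a routing or configuration change from a destination outage. This is a sketch against the same `events`/`routes`/`destinations` schema as the query above; a route whose failures begin at one sharp timestamp usually points to a config change.

```sql
-- Failures per route over the last hour. A single failing route behind a
-- healthy destination suggests a routing or pattern-matching issue.
SELECT
  r.id AS route_id,
  d.name AS destination,
  COUNT(*) FILTER (WHERE e.status = 'retry_scheduled') AS in_retry,
  COUNT(*) FILTER (WHERE e.status = 'dead_letter') AS dead_lettered,
  MIN(e.created_at) AS first_failure
FROM events e
JOIN routes r ON r.id = e.route_id
JOIN destinations d ON d.id = r.destination_id
WHERE e.created_at > now() - INTERVAL '1 hour'
  AND e.status != 'delivered'
GROUP BY r.id, d.name
ORDER BY dead_lettered DESC;
```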
## Step 2: Is the Worker Running?
If the delivery worker is not running, events will accumulate in the queue without being processed. The retry queue depth grows. No delivery attempts are being made.
Check for recent delivery attempts:
```sql
SELECT
  MAX(created_at) AS last_attempt,
  now() - MAX(created_at) AS time_since_last_attempt,
  COUNT(*) AS attempts_last_10_min
FROM delivery_attempts
WHERE created_at > now() - INTERVAL '10 minutes';
```

If `attempts_last_10_min` is zero and your event volume is non-trivial, the worker is not polling. Check the worker process directly:
```bash
# If running as a systemd service
systemctl status gethook-worker

# If running in Kubernetes
kubectl get pods -l app=gethook-worker -n production
kubectl logs -l app=gethook-worker --tail=100 -n production

# If running in Docker
docker ps | grep worker
docker logs gethook-worker --tail=100
```

The most common causes of a stopped worker:
- OOM kill (check `dmesg` or the orchestrator's event log)
- Panic during delivery (check logs for `panic:` or `runtime error:`)
- Database connection pool exhausted (check for `too many clients` in worker logs)
- Graceful shutdown that was not followed by a restart (deploy gone wrong)
**Resolution:** Restart the worker. Once it is running, verify that delivery attempts resume within 30 seconds.
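Once the worker is back up, confirm it is actually draining the backlog and not just alive. Here is a sketch of the worker-lag signal from the monitoring table, assuming a hypothetical `locked_at` column that marks jobs currently held by a worker; adjust to however your queue tracks in-flight jobs:

```sql
-- Worker lag: age of the oldest job that is ready to run but unclaimed.
-- locked_at is an assumption about the queue schema; see note above.
SELECT now() - MIN(next_attempt_at) AS worker_lag
FROM events
WHERE status IN ('queued', 'retry_scheduled')
  AND next_attempt_at <= now()
  AND locked_at IS NULL;
```

This should fall back under the 2-minute threshold within a few minutes of the restart.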
## Step 3: Are Delivery Attempts Being Made But Failing?
If the worker is running but delivery is failing, look at recent attempt outcomes:
```sql
SELECT
  da.destination_id,
  d.name,
  d.url,
  da.outcome,
  da.response_status,
  COUNT(*) AS count,
  MAX(da.created_at) AS most_recent
FROM delivery_attempts da
JOIN destinations d ON d.id = da.destination_id
WHERE da.created_at > now() - INTERVAL '30 minutes'
  AND da.outcome != 'success'
GROUP BY da.destination_id, d.name, d.url, da.outcome, da.response_status
ORDER BY count DESC;
```

Match the outcome to the likely cause:
| Outcome | Response status | Most likely cause |
|---|---|---|
| `timeout` | — | Destination is slow or unresponsive; check destination health independently |
| `network_error` | — | DNS failure, connection refused, TLS handshake failure |
| `http_4xx` | 401 or 403 | Authentication config changed (signing secret, auth header) |
| `http_4xx` | 404 | Destination URL changed or endpoint was removed |
| `http_4xx` | 410 | Destination intentionally gone; stop retrying |
| `http_4xx` | 429 | Destination is rate-limiting your delivery attempts |
| `http_5xx` | 500–503 | Destination is throwing errors; may be transient |
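One row in that table warrants immediate action: a 410 means the destination has declared itself gone, so continued retries only burn worker capacity. A hedged sketch of draining its pending retries, using the same `events`/`routes` schema and status values as the queries above:

```sql
-- Stop retrying a destination that returned 410 Gone: move its pending
-- retries straight to the dead-letter queue for later review.
UPDATE events
SET status = 'dead_letter'
WHERE id IN (
  SELECT e.id
  FROM events e
  JOIN routes r ON r.id = e.route_id
  WHERE r.destination_id = '<destination-id>'
    AND e.status = 'retry_scheduled'
);
```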
For `network_error`, the response body is usually empty. Get the last attempt's error detail:
```sql
SELECT
  da.id,
  da.destination_id,
  da.outcome,
  da.response_body,
  da.created_at
FROM delivery_attempts da
WHERE da.destination_id = '<destination-id>'
  AND da.outcome = 'network_error'
ORDER BY da.created_at DESC
LIMIT 5;
```

Then verify connectivity to the destination from your delivery worker's network:
```bash
# Test TCP connectivity
curl -v --max-time 10 https://destination.example.com/webhooks

# Check TLS certificate validity
openssl s_client -connect destination.example.com:443 -servername destination.example.com \
  </dev/null 2>&1 | grep -E "Verify|expire|subject"
```

A TLS certificate expiry is one of the most common causes of sudden `network_error` failures with no other infrastructure change. Check it early.
## Step 4: Destination-Specific Isolation
If you have confirmed the failure is limited to one or a few destinations, the issue is almost certainly on the destination side or in how you are authenticating to it.
Run through this checklist:
```bash
# 1. Verify the destination URL is still correct
curl -I --max-time 5 https://destination.example.com/webhooks

# 2. Test with a manual delivery (replace values as appropriate)
TIMESTAMP=$(date +%s)
PAYLOAD='{"id":"test_evt","type":"test.ping","created_at":"2026-04-14T00:00:00Z"}'
SECRET="your-signing-secret"
SIG=$(echo -n "${TIMESTAMP}.${PAYLOAD}" | openssl dgst -sha256 -hmac "$SECRET" | awk '{print $2}')

curl -X POST https://destination.example.com/webhooks \
  -H "Content-Type: application/json" \
  -H "Webhook-Signature: t=${TIMESTAMP},v1=${SIG}" \
  -d "$PAYLOAD" \
  -v --max-time 10
```

If the manual delivery succeeds but automated delivery is failing, the problem is likely in your signing secret configuration or the automated delivery's auth headers. Compare the signing secret stored in your delivery system against what the destination expects.
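To run that comparison, first pull what your side believes the secret is. The `signing_secret` column below is an assumption for illustration; many systems store only a reference into a secrets manager instead:

```sql
-- Inspect the signing configuration for the affected destination.
-- signing_secret is a hypothetical column; adapt to your schema.
SELECT id, name, url, signing_secret
FROM destinations
WHERE id = '<destination-id>';
```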
If the manual delivery also fails, the problem is squarely on the destination side. Contact the destination operator. Until the destination recovers, you have two options:

- Let the retry queue accumulate and replay once the destination is healthy (appropriate for short outages with few events; see the sketch after this list)
- Route to a fallback destination if one is configured
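If you take the first option, pause the retry schedule so the queue stops hammering a destination you already know is down. A minimal sketch, using the same schema as the replay query in Step 5; size the deferral to the expected outage:

```sql
-- Defer pending retries for a known-down destination by one hour.
UPDATE events
SET next_attempt_at = now() + INTERVAL '1 hour'
WHERE id IN (
  SELECT e.id
  FROM events e
  JOIN routes r ON r.id = e.route_id
  WHERE r.destination_id = '<destination-id>'
    AND e.status = 'retry_scheduled'
);
```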
## Step 5: Replay Strategy After Recovery
Once the underlying issue is resolved, you have a backlog to clear. Do not replay everything at once.
First, confirm the destination is healthy:
```bash
# Check the destination's health endpoint if it has one
curl -I --max-time 5 https://destination.example.com/health

# Send a test event and verify a 2xx response
```

Then replay in controlled batches. If you are using a SQL-backed job queue, you can update the `next_attempt_at` for dead-lettered events in waves:
```sql
-- Requeue the oldest 100 dead-lettered events for a specific destination
UPDATE events
SET status = 'queued',
    next_attempt_at = now(),
    attempts_count = 0
WHERE id IN (
  SELECT e.id
  FROM events e
  JOIN routes r ON r.id = e.route_id
  WHERE r.destination_id = '<destination-id>'
    AND e.status = 'dead_letter'
  ORDER BY e.created_at ASC
  LIMIT 100
);
```

Wait 2–3 minutes. Check that those 100 events deliver successfully before requeuing the next batch. A destination that just recovered may not be at full capacity immediately — a sudden flood of replay traffic can knock it over again.
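Between batches, a quick check against the attempt log (the same `delivery_attempts` schema as Step 3) tells you whether it is safe to continue:

```sql
-- Outcomes for the destination since the batch was requeued. Proceed
-- only if successes dominate; otherwise pause and reassess.
SELECT
  COUNT(*) FILTER (WHERE outcome = 'success') AS succeeded,
  COUNT(*) FILTER (WHERE outcome != 'success') AS failed
FROM delivery_attempts
WHERE destination_id = '<destination-id>'
  AND created_at > now() - INTERVAL '5 minutes';
```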
GetHook's replay feature applies this pacing automatically: you can replay dead-lettered events for a destination with a configurable rate cap, and the system surfaces delivery success in real time so you can confirm recovery before accelerating.
## Step 6: Verify and Close
Before marking the incident resolved, confirm all three of the following:
1. The backlog is draining:
```sql
SELECT status, COUNT(*)
FROM events
WHERE created_at > '<incident-start-time>'
GROUP BY status;
```

The `retry_scheduled` and `dead_letter` counts should be decreasing.
2. The delivery success rate has returned to baseline:
```sql
SELECT
  date_trunc('minute', da.created_at) AS minute,
  COUNT(*) FILTER (WHERE da.outcome = 'success') AS successes,
  COUNT(*) FILTER (WHERE da.outcome != 'success') AS failures,
  ROUND(
    COUNT(*) FILTER (WHERE da.outcome = 'success')::numeric /
    NULLIF(COUNT(*), 0) * 100, 1
  ) AS success_pct
FROM delivery_attempts da
WHERE da.created_at > now() - INTERVAL '30 minutes'
GROUP BY 1
ORDER BY 1;
```

Success percentage should be at or above your normal baseline (typically 98–99%).
3. No new events are entering the dead-letter queue:
```sql
SELECT COUNT(*), MAX(created_at) AS most_recent_dlq_entry
FROM events
WHERE status = 'dead_letter'
  AND created_at > now() - INTERVAL '10 minutes';
```

If all three checks pass, the incident is resolved.
## Post-Incident: What to Document
A runbook improves over time only if you document what happened. After each webhook delivery incident, capture:
- **Timeline:** When did the first event fail? When did the alert fire? When was the issue identified? When was it resolved?
- **Root cause:** One sentence. "TLS certificate on destination expired." "Worker OOM-killed at 03:17 due to 64 KB payload spike."
- **Blast radius:** How many events were affected? How many destinations? Which customers?
- **Time to detect vs. time to resolve:** If detection took longer than resolution, invest in alerting. If resolution took longer than detection, invest in tooling.
- **Runbook gaps:** Did you hit a step where this runbook did not help? Add the missing step.
The most expensive webhook incidents are the ones you cannot explain. If you cannot write a one-sentence root cause after resolution, you have not finished the incident — you have paused it.
Webhook delivery incidents are rarely mysterious once you have the right queries and know the order to run them. The work is in setting up the observability before the incident, and in documenting the outcomes after.
If you want a delivery system that surfaces delivery status, retry queues, and dead-letter counts in a real-time dashboard — so your runbook starts at step 4 instead of step 1 — set up GetHook in under 10 minutes.