webhooks · incident response · reliability · operations · observability

Webhook Incident Runbooks: A Field Guide to Diagnosing Delivery Failures Under Pressure

When webhook delivery breaks at 2 AM, you need a decision tree, not a philosophy. Here's the structured runbook we use to triage, diagnose, and resolve webhook delivery incidents systematically.

Yuki Tanaka
Founding Engineer
April 14, 2026
10 min read

Webhook delivery failures are a special category of incident. Unlike a downed API endpoint, the failure is often silent from the sender's perspective — you sent the event, you got a 200 back from your gateway, and the consumer has no idea they missed anything. By the time an alert fires, the failure may have been accumulating for minutes or hours.

This post is a structured runbook for diagnosing webhook delivery incidents. It is not a post about theory. It is the decision tree we have refined through production incidents — the questions you ask, in the order you ask them, so you can stop guessing and start closing the incident.


Before the Incident: What You Need to Have

A runbook is only useful if your observability is in place before things break. The minimum viable monitoring stack for a webhook delivery system:

| Signal | What to measure | Alert threshold |
| --- | --- | --- |
| Delivery success rate | delivered / (delivered + failed) per destination, 5-min window | Below 95% for any destination |
| Retry queue depth | COUNT(*) WHERE status = 'retry_scheduled' per destination | Above 500 and growing |
| Dead-letter rate | Events entering DLQ per minute | Any sustained rate above baseline |
| Worker lag | now() - MIN(next_attempt_at) for unlocked jobs | Above 2 minutes |
| Ingest latency (p99) | Time from HTTP request received to event persisted | Above 500 ms |
| Destination response time (p95) | HTTP round-trip to each destination | Above destination's configured timeout |

If you do not have these signals, the runbook below will require you to query the database manually for every step. That is still workable — the queries are provided — but set up the dashboards before the next incident.
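
If you have none of these dashboards yet, even a cron job that runs one of these checks and exits non-zero below threshold will page you. A minimal sketch, assuming $DATABASE_URL points at the delivery database and the events/routes/destinations schema used in the queries throughout this post:

bash
#!/usr/bin/env bash
# check_delivery_rate.sh: page if any destination's 5-minute success rate drops below 95%.
set -euo pipefail

failing=$(psql "$DATABASE_URL" -At -c "
    SELECT d.name
    FROM events e
    JOIN routes r ON r.id = e.route_id
    JOIN destinations d ON d.id = r.destination_id
    WHERE e.created_at > now() - INTERVAL '5 minutes'
    GROUP BY d.name
    HAVING COUNT(*) FILTER (WHERE e.status = 'delivered')::numeric
         / NULLIF(COUNT(*), 0) < 0.95;
")

if [ -n "$failing" ]; then
    echo "Delivery success rate below 95% for: $failing" >&2
    exit 1  # non-zero exit lets cron or your monitor fire the alert
fi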


Step 1: Determine the Blast Radius

The first question is not "what broke" — it is "how much broke and for whom."

sql
SELECT
    d.id,
    d.name,
    d.url,
    COUNT(*) FILTER (WHERE e.status = 'delivered') AS delivered,
    COUNT(*) FILTER (WHERE e.status IN ('retry_scheduled', 'dead_letter')) AS failed,
    COUNT(*) FILTER (WHERE e.status = 'retry_scheduled') AS in_retry,
    COUNT(*) FILTER (WHERE e.status = 'dead_letter') AS dead_lettered,
    MIN(e.created_at) AS oldest_affected_event
FROM events e
JOIN routes r ON r.id = e.route_id
JOIN destinations d ON d.id = r.destination_id
WHERE e.created_at > now() - INTERVAL '1 hour'
GROUP BY d.id, d.name, d.url
HAVING COUNT(*) FILTER (WHERE e.status != 'delivered') > 0
ORDER BY failed DESC;

This tells you immediately whether the failure is:

  • Isolated to one destination — likely a destination-side issue (firewall change, deploy, certificate expiry)
  • Affecting all destinations — likely an infrastructure issue (worker down, database overloaded, network partition)
  • Affecting a subset of destinations — likely a routing, pattern-matching, or configuration issue

Do not proceed past this step without a clear answer. The next steps diverge sharply based on blast radius.


Step 2: Is the Worker Running?

If the delivery worker is not running, events will accumulate in the queue without being processed. The retry queue depth grows. No delivery attempts are being made.

Check for recent delivery attempts:

sql
SELECT
    MAX(created_at) AS last_attempt,
    now() - MAX(created_at) AS time_since_last_attempt,
    COUNT(*) AS attempts_last_10_min
FROM delivery_attempts
WHERE created_at > now() - INTERVAL '10 minutes';
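
The same condition is visible from the queue side. This mirrors the worker-lag signal from the monitoring table above, assuming (as elsewhere in this post) that pending work lives in events with a next_attempt_at column:

sql
-- Oldest due-but-unprocessed work; a growing interval means the worker is stalled
SELECT
    COUNT(*) AS due_events,
    now() - MIN(next_attempt_at) AS worker_lag
FROM events
WHERE status IN ('queued', 'retry_scheduled')
  AND next_attempt_at <= now();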

If attempts_last_10_min is zero and your event volume is non-trivial, the worker is not polling. Check the worker process directly:

bash
# If running as a systemd service
systemctl status gethook-worker

# If running in Kubernetes
kubectl get pods -l app=gethook-worker -n production
kubectl logs -l app=gethook-worker --tail=100 -n production

# If running in Docker
docker ps | grep worker
docker logs gethook-worker --tail=100

The most common causes of a stopped worker:

  • OOM kill (check dmesg or the orchestrator's event log)
  • Panic during delivery (check logs for panic: or runtime error:)
  • Database connection pool exhausted (check for too many clients in worker logs)
  • Graceful shutdown that was not followed by a restart (deploy gone wrong)
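
The first two causes are quick to check from the shell. A sketch, reusing the Kubernetes labels from the commands above (adjust to your deployment):

bash
# Cause 1: OOM kills on the host
dmesg --ctime | grep -iE "out of memory|killed process" | tail -5

# Cause 1, Kubernetes variant: was the last container termination OOMKilled?
kubectl get pods -l app=gethook-worker -n production -o \
    jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

# Cause 2: scan recent logs for panics
kubectl logs -l app=gethook-worker -n production --tail=500 | grep -E "panic:|runtime error:"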

Resolution: Restart the worker. Once it is running, verify that delivery attempts resume within 30 seconds.


Step 3: Are Delivery Attempts Being Made But Failing?

If the worker is running but delivery is failing, look at recent attempt outcomes:

sql
SELECT
    da.destination_id,
    d.name,
    d.url,
    da.outcome,
    da.response_status,
    COUNT(*) AS count,
    MAX(da.created_at) AS most_recent
FROM delivery_attempts da
JOIN destinations d ON d.id = da.destination_id
WHERE da.created_at > now() - INTERVAL '30 minutes'
  AND da.outcome != 'success'
GROUP BY da.destination_id, d.name, d.url, da.outcome, da.response_status
ORDER BY count DESC;

Match the outcome to the likely cause:

| Outcome | Response status | Most likely cause |
| --- | --- | --- |
| timeout | — | Destination is slow or unresponsive; check destination health independently |
| network_error | — | DNS failure, connection refused, TLS handshake failure |
| http_4xx | 401 or 403 | Authentication config changed (signing secret, auth header) |
| http_4xx | 404 | Destination URL changed or endpoint was removed |
| http_4xx | 410 | Destination intentionally gone; stop retrying |
| http_4xx | 429 | Destination is rate-limiting your delivery attempts |
| http_5xx | 500–503 | Destination is throwing errors; may be transient |
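
One row calls for immediate action: on a confirmed 410, stop retrying instead of letting the queue churn. A sketch against the schema used in this post:

sql
-- Dead-letter pending retries for a destination that is intentionally gone (410)
UPDATE events
SET status = 'dead_letter'
WHERE status = 'retry_scheduled'
  AND route_id IN (
      SELECT id FROM routes WHERE destination_id = '<destination-id>'
  );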

For network_error, there is no HTTP response to inspect; the error detail is recorded in place of a response body. Pull the most recent attempts:

sql
SELECT
    da.id,
    da.destination_id,
    da.outcome,
    da.response_body,
    da.created_at
FROM delivery_attempts da
WHERE da.destination_id = '<destination-id>'
  AND da.outcome = 'network_error'
ORDER BY da.created_at DESC
LIMIT 5;

Then verify connectivity to the destination from your delivery worker's network:

bash
# Test TCP connectivity
curl -v --max-time 10 https://destination.example.com/webhooks

# Check TLS certificate validity
openssl s_client -connect destination.example.com:443 -servername destination.example.com \
    </dev/null 2>/dev/null | openssl x509 -noout -subject -dates

A TLS certificate expiry is one of the most common causes of sudden network_error failures with no other infrastructure change. Check it early.
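
openssl can also evaluate the expiry window directly, which is handy in a scripted check:

bash
# Exit non-zero if the certificate expires within the next 7 days (604800 seconds)
openssl s_client -connect destination.example.com:443 \
    -servername destination.example.com </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate -checkend 604800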


Step 4: Destination-Specific Isolation

If you have confirmed the failure is limited to one or a few destinations, the issue is almost certainly on the destination side or in how you are authenticating to it.

Run through this checklist:

bash
# 1. Verify the destination URL is still correct
curl -I --max-time 5 https://destination.example.com/webhooks

# 2. Test with a manual delivery (replace values as appropriate)
TIMESTAMP=$(date +%s)
PAYLOAD='{"id":"test_evt","type":"test.ping","created_at":"2026-04-14T00:00:00Z"}'
SECRET="your-signing-secret"
SIG=$(echo -n "${TIMESTAMP}.${PAYLOAD}" | openssl dgst -sha256 -hmac "$SECRET" | awk '{print $2}')

curl -X POST https://destination.example.com/webhooks \
    -H "Content-Type: application/json" \
    -H "Webhook-Signature: t=${TIMESTAMP},v1=${SIG}" \
    -d "$PAYLOAD" \
    -v --max-time 10

If the manual delivery succeeds but automated delivery is failing, the problem is likely in your signing secret configuration or the automated delivery's auth headers. Compare the signing secret stored in your delivery system against what the destination expects.
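
Comparing secrets over a call is error-prone; comparing fingerprints is not. A sketch, assuming the secret lives in a signing_secret column on destinations (the column name is illustrative):

sql
-- Share a short fingerprint, never the secret itself
SELECT id, name, LEFT(md5(signing_secret), 8) AS secret_fingerprint
FROM destinations
WHERE id = '<destination-id>';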

If the manual delivery also fails, the problem is squarely on the destination side. Contact the destination operator. Until the destination recovers, you have two options:

  1. Let the retry queue accumulate and replay once the destination is healthy (appropriate for short outages with few events)
  2. Route to a fallback destination if one is configured


Step 5: Replay Strategy After Recovery

Once the underlying issue is resolved, you have a backlog to clear. Do not replay everything at once.

First, confirm the destination is healthy:

bash
# Check the destination's health endpoint if it has one
curl -I --max-time 5 https://destination.example.com/health

# Send a test event (the manual delivery from Step 4 works) and verify a 2xx response

Then replay in controlled batches. If you are using a SQL-backed job queue, you can update the next_attempt_at for dead-lettered events in waves:

sql
-- Requeue the oldest 100 dead-lettered events for a specific destination
UPDATE events
SET status = 'queued',
    next_attempt_at = now(),
    attempts_count = 0
WHERE id IN (
    SELECT e.id
    FROM events e
    JOIN routes r ON r.id = e.route_id
    WHERE r.destination_id = '<destination-id>'
      AND e.status = 'dead_letter'
    ORDER BY e.created_at ASC
    LIMIT 100
);

Wait 2–3 minutes. Check that those 100 events deliver successfully before requeuing the next batch. A destination that just recovered may not be at full capacity immediately — a sudden flood of replay traffic can knock it over again.
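
For a large backlog, the same wave pattern is worth scripting. A minimal sketch, assuming psql access and the schema above (the batch size and pause are starting points, not prescriptions):

bash
#!/usr/bin/env bash
# Replay dead-lettered events for one destination in paced waves of 100.
set -euo pipefail
dest_id="$1"

while :; do
    requeued=$(psql "$DATABASE_URL" -At -c "
        WITH batch AS (
            UPDATE events
            SET status = 'queued', next_attempt_at = now(), attempts_count = 0
            WHERE id IN (
                SELECT e.id
                FROM events e
                JOIN routes r ON r.id = e.route_id
                WHERE r.destination_id = '$dest_id'
                  AND e.status = 'dead_letter'
                ORDER BY e.created_at ASC
                LIMIT 100
            )
            RETURNING id
        )
        SELECT COUNT(*) FROM batch;
    ")
    [ "$requeued" -eq 0 ] && break  # backlog cleared
    echo "Requeued $requeued events; pausing before the next wave"
    sleep 180  # give the destination 2-3 minutes to absorb each wave
done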

GetHook's replay feature applies this pacing automatically: you can replay dead-lettered events for a destination with a configurable rate cap, and the system surfaces delivery success in real time so you can confirm recovery before accelerating.


Step 6: Verify and Close

Before marking the incident resolved, confirm all three of the following:

1. The backlog is draining:

sql
SELECT status, COUNT(*) 
FROM events 
WHERE created_at > '<incident-start-time>'
GROUP BY status;

The retry_scheduled and dead_letter counts should be decreasing.

2. The delivery success rate has returned to baseline:

sql
SELECT
    date_trunc('minute', da.created_at) AS minute,
    COUNT(*) FILTER (WHERE da.outcome = 'success') AS successes,
    COUNT(*) FILTER (WHERE da.outcome != 'success') AS failures,
    ROUND(
        COUNT(*) FILTER (WHERE da.outcome = 'success')::numeric /
        NULLIF(COUNT(*), 0) * 100, 1
    ) AS success_pct
FROM delivery_attempts da
WHERE da.created_at > now() - INTERVAL '30 minutes'
GROUP BY 1
ORDER BY 1;

Success percentage should be at or above your normal baseline (typically 98–99%).

3. No new events are entering the dead-letter queue:

sql
SELECT COUNT(*), MAX(created_at) AS most_recent_dlq_entry
FROM events
WHERE status = 'dead_letter'
  AND created_at > now() - INTERVAL '10 minutes';

If all three checks pass, the incident is resolved.


Post-Incident: What to Document

A runbook improves over time only if you document what happened. After each webhook delivery incident, capture:

  • Timeline: When did the first event fail? When was the alert fired? When was the issue identified? When was it resolved?
  • Root cause: One sentence. "TLS certificate on destination expired." "Worker OOM-killed at 03:17 due to 64 KB payload spike."
  • Blast radius: How many events were affected? How many destinations? Which customers?
  • Time to detect vs. time to resolve: If detect took longer than resolve, invest in alerting. If resolve took longer than detect, invest in tooling.
  • Runbook gaps: Did you hit a step where this runbook did not help? Add the missing step.

The most expensive webhook incidents are the ones you cannot explain. If you cannot write a one-sentence root cause after resolution, you have not finished the incident — you have paused it.


Webhook delivery incidents are rarely mysterious once you have the right queries and know the order to run them. The work is in setting up the observability before the incident, and in documenting the outcomes after.

If you want a delivery system that surfaces delivery status, retry queues, and dead-letter counts in a real-time dashboard — so your runbook starts at step 4 instead of step 1 — set up GetHook in under 10 minutes.

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.