reliability · engineering · architecture · dead letter queue

Dead Letter Queues: What To Do When Webhooks Stop Retrying

After 5 failed delivery attempts, an event lands in the dead-letter queue. Now what? A systematic guide to triaging DLQ events, diagnosing root causes, replaying safely, and preventing accumulation.

Dmitri Volkov
Distributed Systems Engineer
January 15, 2026
9 min read

The dead-letter queue (DLQ) is where events go to die — or, more optimistically, where they wait for human intervention.

After your retry policy exhausts its attempts (5 tries over ~90 minutes in GetHook's default schedule), an event gets marked as dead_letter. No more automatic retries. No more chances. The event sits there until someone decides what to do with it.

Most teams ignore DLQ accumulation until it becomes a crisis. This post covers how to treat your DLQ as a first-class operational concern.


Why Events End Up in the DLQ

Understanding the root cause before replaying is critical. Replaying into the same broken destination just fills the DLQ again.

Category 1: Transient Destination Failures

Your destination was down for longer than your retry window. A deployment went wrong, a certificate expired, a database ran out of connections. By the time all 5 retries were exhausted, the outage had already outlasted the ~90-minute retry window.

Resolution: Fix the destination, then replay. These events are safe to replay.

Category 2: Permanent Destination Errors

The destination is returning 4xx errors. Your API contract changed and the destination endpoint was removed. The authentication credentials rotated but the destination config wasn't updated.

Resolution: Fix the root cause first. Replaying before the fix just produces another round of failures and may trigger unintended side effects.

Category 3: Payload Schema Errors

The event payload doesn't match what the destination expects. A schema change was deployed to the sender but the receiver wasn't updated to handle the new format.

Resolution: Deploy the receiver fix first, then replay. Consider whether you need a transformation layer.

Category 4: Rate Limiting by Destination

Your destination is returning 429 Too Many Requests. Your retry logic interpreted this as a transient failure and kept retrying — but each retry made the rate limiting worse (thundering herd on recovery).

Resolution: Slow the retries down rather than retrying harder. GetHook respects Retry-After headers from destinations; make sure your retry configuration leaves headroom under the destination's rate limits.
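If you run your own delivery loop, honoring Retry-After looks roughly like the sketch below. This is a minimal illustration, assuming the header arrives in its delta-seconds form; the URL, payload, and fallback delay are placeholders, not GetHook behavior:

```python
import time

import requests

def deliver_with_retry_after(url: str, payload: dict, max_attempts: int = 5) -> bool:
    """Deliver a webhook, backing off per the destination's Retry-After header."""
    fallback = 60  # seconds to wait when no Retry-After header is present
    for attempt in range(max_attempts):
        try:
            resp = requests.post(url, json=payload, timeout=10)
        except requests.RequestException:
            time.sleep(fallback)
            continue
        if resp.ok:
            return True
        if resp.status_code == 429:
            # Honor the destination's requested delay (assumes delta-seconds form)
            time.sleep(int(resp.headers.get("Retry-After", fallback)))
        else:
            time.sleep(fallback)
    return False
```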


The DLQ Triage Process

When you discover events in the dead-letter queue, resist the urge to immediately replay everything. Follow this sequence:

Step 1: Quantify the accumulation

```sql
SELECT
  destination_id,
  COUNT(*) as dead_letter_count,
  MIN(created_at) as oldest_event,
  MAX(created_at) as newest_event
FROM events
WHERE status = 'dead_letter'
GROUP BY destination_id
ORDER BY dead_letter_count DESC;
```

Knowing which destination has the most DLQ events tells you where to focus.

Step 2: Sample the failure reasons

Look at delivery attempts for DLQ events:

```sql
SELECT
  da.outcome,
  da.response_status,
  da.error_message,
  COUNT(*) as count
FROM delivery_attempts da
JOIN events e ON e.id = da.event_id
WHERE e.status = 'dead_letter'
  AND da.attempt_number = 5  -- Final attempt
GROUP BY da.outcome, da.response_status, da.error_message
ORDER BY count DESC;
```

If 95% of final attempts show connection_refused, the destination was down. If they show http_4xx, there's a schema or auth problem.

Step 3: Verify the destination is healthy now

Before replaying, confirm the destination accepts test traffic:

```bash
curl -X POST https://your-destination.com/webhooks \
  -H "Content-Type: application/json" \
  -d '{"test": true}' \
  -w "\nHTTP Status: %{http_code}\n"
```

Don't replay until you get a 200.

Step 4: Replay in batches

Don't replay 50,000 events simultaneously. Your destination has finite capacity. A flood of replayed events can cause the very outage you just recovered from.

Replay in batches of 100–500 events, watching your destination's error rate between batches:

```
Batch 1 (100 events) → watch for 2 minutes → check error rate
Batch 2 (200 events) → watch for 2 minutes → check error rate
Batch 3 (500 events) → if stable, increase batch size
```
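In code, the ramp-up is just a loop with a pause and an error-rate check between batches. A sketch, where replay_batch and error_rate are placeholders for your replay call and your destination error-rate metric:

```python
import time

BATCH_SIZES = [100, 200, 500]   # start small, grow if the destination stays stable
ERROR_CEILING = 0.01            # stop replaying if more than 1% of deliveries fail

def replay_dlq(event_ids, replay_batch, error_rate):
    """Replay DLQ events in ramped batches, watching the destination between each."""
    i, step = 0, 0
    while i < len(event_ids):
        size = BATCH_SIZES[min(step, len(BATCH_SIZES) - 1)]
        replay_batch(event_ids[i:i + size])
        time.sleep(120)  # watch for 2 minutes before deciding to continue
        if error_rate() > ERROR_CEILING:
            raise RuntimeError("destination degrading under replay load; stopping")
        i += size
        step += 1
```

If the error rate climbs mid-replay, stop and go back to step 2 rather than pushing through.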

Replay Safety: Which Events Are Safe?

Not all events are idempotent. Before replaying, classify your event types:

| Event Type | Safe to Replay? | Notes |
| --- | --- | --- |
| user.created | ✅ With idempotency key | Check if user already exists |
| payment.succeeded | ⚠️ Dangerous | Can trigger duplicate charges if handler isn't idempotent |
| order.shipped | ⚠️ Dangerous | Can send duplicate shipping notifications |
| file.uploaded | ✅ Usually safe | If handler checks file existence first |
| webhook.test | ✅ Always safe | No real side effects |
| subscription.cancelled | ❌ High risk | Replaying could re-run cancellation side effects or trigger double refunds |

Rule: If your handler creates external side effects (emails, charges, shipments), ensure it checks idempotency before acting.
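One common shape for that check is an atomic set-if-absent write keyed on the event ID, performed before any side effect. A sketch using Redis as an example store; charge_customer and send_receipt_email are hypothetical handlers:

```python
import redis

store = redis.Redis()  # any shared store with an atomic set-if-absent works

def charge_customer(event): ...      # placeholder side effects
def send_receipt_email(event): ...

def handle_payment_succeeded(event: dict) -> None:
    """Process a payment webhook at most once per event ID."""
    # SET with nx=True is atomic: it returns None if the key already existed
    first_time = store.set(f"processed:{event['id']}", 1, nx=True, ex=7 * 86400)
    if not first_time:
        return  # duplicate delivery or a DLQ replay: safe no-op
    charge_customer(event)           # side effects run only on the first delivery
    send_receipt_email(event)
```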


Implementing DLQ Monitoring

The DLQ should have an alert, not a dashboard you check manually. Here are the metrics to monitor:

Alert: DLQ accumulation rate

Fire an alert when more than X events enter the DLQ per hour. The threshold depends on your volume:

| Daily event volume | DLQ alert threshold |
| --- | --- |
| < 10,000 events/day | > 10 events/hour |
| 10,000–100,000 events/day | > 100 events/hour |
| > 1M events/day | > 1,000 events/hour |

Alert: DLQ age

If an event has been in the DLQ for more than 24 hours without manual intervention, something is being ignored. Alert on it.

Metric: DLQ rate as percentage of total events

DLQ rate = dead_letter events / total events (per day)

Target: < 0.3%. Alert if: > 1%.

A rising DLQ rate over multiple days is a stronger signal than a single spike — it indicates a systemic problem, not a one-time outage.
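Both the rate and the age signal can be computed from the same events table used in the triage queries. A sketch of the evaluation logic, where alert is a placeholder for your paging hook:

```python
from datetime import datetime, timedelta, timezone

RATE_CEILING = 0.01               # alert above a 1% daily DLQ rate
AGE_CEILING = timedelta(hours=24) # alert on events ignored for over a day

def check_dlq_health(rows, alert):
    """Evaluate DLQ rate and age over one day of (status, created_at) rows."""
    now = datetime.now(timezone.utc)
    total = dead = 0
    oldest_dead = None
    for status, created_at in rows:
        total += 1
        if status == "dead_letter":
            dead += 1
            if oldest_dead is None or created_at < oldest_dead:
                oldest_dead = created_at
    if total and dead / total > RATE_CEILING:
        alert(f"DLQ rate is {dead / total:.2%} (target < 0.3%)")
    if oldest_dead is not None and now - oldest_dead > AGE_CEILING:
        alert(f"oldest DLQ event has waited {now - oldest_dead}")
```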


Preventing DLQ Accumulation

The best DLQ strategy is prevention.

Increase retry window for known long outages

If your destination regularly has maintenance windows longer than 90 minutes, increase your retry schedule. With GetHook, you can configure custom retry policies per route:

```json
{
  "retry_policy": {
    "max_attempts": 8,
    "schedule": [0, 60, 300, 900, 3600, 14400, 43200, 86400]
  }
}
```

This extends the retry window to 24 hours for critical routes.

Implement circuit breakers on destinations

Rather than continuing to attempt delivery to a destination that is clearly down, pause delivery and accumulate events in a retry_scheduled state, then resume when the destination recovers. This prevents retry storms and gives you more time before events reach the DLQ.
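The core of such a breaker is a failure counter plus a cooldown. A minimal sketch; the thresholds are illustrative, not GetHook defaults:

```python
import time

class DestinationBreaker:
    """Pause delivery to a destination after consecutive failures."""

    def __init__(self, failure_threshold: int = 10, cooldown_s: float = 300):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_delivery(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, allow one probe delivery through (half-open state)
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None  # close the circuit
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Open: park pending events as retry_scheduled instead of delivering
                self.opened_at = time.monotonic()
```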

Add health check probing

Before attempting to deliver a batch of events to a destination, probe its health endpoint. If it returns non-200, skip the batch for this cycle and reschedule. This is more efficient than attempting delivery and collecting failure responses.
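The probe itself can be a GET with a short timeout, treating anything other than a 200 (or an unreachable host) as unhealthy. A sketch; the /healthz path and reschedule_batch are assumptions about your setup:

```python
import requests

def destination_is_healthy(health_url: str) -> bool:
    """Probe a destination's health endpoint before delivering a batch."""
    try:
        return requests.get(health_url, timeout=5).status_code == 200
    except requests.RequestException:
        return False  # unreachable counts as unhealthy

# Usage: skip this cycle instead of burning delivery attempts
# if not destination_is_healthy("https://your-destination.com/healthz"):
#     reschedule_batch()
```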


DLQ as a Debugging Tool

The DLQ is also one of your best debugging tools. When your team ships a breaking change, events start failing immediately. The DLQ gives you:

  • A complete record of which events were affected
  • The exact payloads that triggered failures
  • The response bodies from your destination (what error was returned)
  • The timestamp sequence to understand when the problem started

This makes post-incident analysis much easier than reconstructing events from logs.


GetHook's DLQ Design

GetHook's DLQ implementation is designed for operational simplicity:

  • Per-event visibility — each dead-letter event shows all 5 delivery attempts, response codes, and error messages
  • Bulk replay — replay all DLQ events for a destination in one operation, with rate limiting to prevent thundering herd
  • Replay audit trail — replayed events create a new event record linked to the original, so you can track what was replayed and when
  • DLQ webhooks — configure a webhook.dead_letter notification to your ops channel when events start accumulating

The goal is to make the DLQ something your on-call team can act on in minutes, not hours.

View dead-letter queue documentation →

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.