The dead-letter queue (DLQ) is where events go to die — or, more optimistically, where they wait for human intervention.
After your retry policy exhausts its attempts (5 tries over ~90 minutes in GetHook's default schedule), an event gets marked as dead_letter. No more automatic retries. No more chances. The event sits there until someone decides what to do with it.
Most teams ignore DLQ accumulation until it becomes a crisis. This post covers how to treat your DLQ as a first-class operational concern.
Why Events End Up in the DLQ
Understanding the root cause before replaying is critical. Replaying into the same broken destination just fills the DLQ again.
Category 1: Transient Destination Failures
Your destination was down for longer than your retry window. A deployment went wrong, a certificate expired, a database ran out of connections. By the time all 5 retries were exhausted, the outage had outlasted the ~90-minute retry window.
Resolution: Fix the destination, then replay. These events are safe to replay.
Category 2: Permanent Destination Errors
The destination is returning 4xx errors. Your API contract changed and the destination endpoint was removed. The authentication credentials rotated but the destination config wasn't updated.
Resolution: Fix the root cause first. Replaying before the fix wastes events and may trigger unintended side effects.
Category 3: Payload Schema Errors
The event payload doesn't match what the destination expects. A schema change was deployed to the sender but the receiver wasn't updated to handle the new format.
Resolution: Deploy the receiver fix first, then replay. Consider whether you need a transformation layer.
Category 4: Rate Limiting by Destination
Your destination is returning 429 Too Many Requests. Your retry logic interpreted this as a transient failure and kept retrying — but each retry made the rate limiting worse (thundering herd on recovery).
Resolution: Address the rate limiting strategy. GetHook respects Retry-After headers from destinations; ensure your retry configuration accounts for rate limits.
The DLQ Triage Process
When you discover events in the dead-letter queue, resist the urge to immediately replay everything. Follow this sequence:
Step 1: Quantify the accumulation
```sql
SELECT
  destination_id,
  COUNT(*) AS dead_letter_count,
  MIN(created_at) AS oldest_event,
  MAX(created_at) AS newest_event
FROM events
WHERE status = 'dead_letter'
GROUP BY destination_id
ORDER BY dead_letter_count DESC;
```

Knowing which destination has the most DLQ events tells you where to focus.
Step 2: Sample the failure reasons
Look at delivery attempts for DLQ events:
```sql
SELECT
  da.outcome,
  da.response_status,
  da.error_message,
  COUNT(*) AS count
FROM delivery_attempts da
JOIN events e ON e.id = da.event_id
WHERE e.status = 'dead_letter'
  AND da.attempt_number = 5 -- final attempt
GROUP BY da.outcome, da.response_status, da.error_message
ORDER BY count DESC;
```

If 95% of final attempts show `connection_refused`, the destination was down. If they show `http_4xx`, there's a schema or auth problem.
Step 3: Verify the destination is healthy now
Before replaying, confirm the destination accepts test traffic:
```bash
curl -X POST https://your-destination.com/webhooks \
  -H "Content-Type: application/json" \
  -d '{"test": true}' \
  -w "\nHTTP Status: %{http_code}\n"
```

Don't replay until you get a 200.
Step 4: Replay in batches
Don't replay 50,000 events simultaneously. Your destination has finite capacity. A flood of replayed events can cause the very outage you just recovered from.
Replay in batches of 100–500 events, watching your destination's error rate between batches:
```
Batch 1 (100 events) → watch for 2 minutes → check error rate
Batch 2 (200 events) → watch for 2 minutes → check error rate
Batch 3 (500 events) → if stable, increase batch size
```

Replay Safety: Which Events Are Safe?
Not all events are idempotent. Before replaying, classify your event types:
| Event Type | Safe to Replay? | Notes |
|---|---|---|
| `user.created` | ✅ With idempotency key | Check if user already exists |
| `payment.succeeded` | ⚠️ Dangerous | Can trigger duplicate charges if handler isn't idempotent |
| `order.shipped` | ⚠️ Dangerous | Can send duplicate shipping notifications |
| `file.uploaded` | ✅ Usually safe | If handler checks file existence first |
| `webhook.test` | ✅ Always safe | No real side effects |
| `subscription.cancelled` | ❌ High risk | Replaying can re-fire cancellation side effects, such as double refunds |
Rule: If your handler creates external side effects (emails, charges, shipments), ensure it checks idempotency before acting.
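That rule can be made concrete with an idempotency check keyed on the event ID. A minimal sketch, assuming each event carries a stable `id` field; a real handler would back this with a database unique constraint, not an in-memory set:

```python
# Sketch of an idempotent webhook handler. The in-memory set stands in
# for a persistent store of already-processed event IDs.

processed_ids: set[str] = set()

def handle_payment_succeeded(event: dict) -> str:
    event_id = event["id"]
    if event_id in processed_ids:
        return "skipped"        # replayed event: no duplicate charge
    processed_ids.add(event_id)
    # ...create the charge record, send the receipt email, etc.
    return "processed"
```

With this shape, replaying the whole DLQ for a destination is safe by construction: the second delivery of any event becomes a no-op.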
Implementing DLQ Monitoring
The DLQ should have an alert, not a dashboard you check manually. Here are the metrics to monitor:
Alert: DLQ accumulation rate
Fire an alert when more than X events enter the DLQ per hour. The threshold depends on your volume:
| Daily event volume | DLQ alert threshold |
|---|---|
| < 10,000 events/day | > 10 events/hour |
| 10,000–100,000 events/day | > 100 events/hour |
| > 1M events/day | > 1,000 events/hour |
Alert: DLQ age
If an event has been in the DLQ for more than 24 hours without manual intervention, something is being ignored. Alert on it.
Metric: DLQ rate as percentage of total events
```
DLQ rate = dead_letter events / total events (per day)
```

Target: < 0.3%. Alert if > 1%.
A rising DLQ rate over multiple days is a stronger signal than a single spike — it indicates a systemic problem, not a one-time outage.
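The metric and its thresholds are simple to encode; a sketch using the targets from this post (the function names are illustrative):

```python
# Sketch: daily DLQ-rate metric with the thresholds used above
# (< 0.3% healthy, alert above 1%).

def dlq_rate(dead_letter_count: int, total_events: int) -> float:
    """DLQ rate as a fraction of total events for the day."""
    if total_events == 0:
        return 0.0
    return dead_letter_count / total_events

def dlq_alert(dead_letter_count: int, total_events: int) -> bool:
    """Fire when the daily DLQ rate exceeds 1%."""
    return dlq_rate(dead_letter_count, total_events) > 0.01
```

For the trend signal, evaluate `dlq_rate` per day and alert on several consecutive days of increase rather than on a single elevated value.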
Preventing DLQ Accumulation
The best DLQ strategy is prevention.
Increase retry window for known long outages
If your destination regularly has maintenance windows longer than 90 minutes, increase your retry schedule. With GetHook, you can configure custom retry policies per route:
```json
{
  "retry_policy": {
    "max_attempts": 8,
    "schedule": [0, 60, 300, 900, 3600, 14400, 43200, 86400]
  }
}
```

This extends the retry window to 24 hours for critical routes.
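A quick sanity check on that schedule, assuming each value is an offset in seconds from the initial delivery attempt — which is what makes the final attempt land 24 hours out:

```python
# Sketch: turn a retry schedule (offsets in seconds from first attempt,
# an assumption about the config format) into wall-clock attempt times.

from datetime import datetime, timedelta

def attempt_times(received_at: datetime, schedule: list[int]) -> list[datetime]:
    """Wall-clock time of each delivery attempt."""
    return [received_at + timedelta(seconds=s) for s in schedule]
```

The last offset, 86400 s, is exactly 24 hours after the first attempt, matching the extended window described above.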
Implement circuit breakers on destinations
Rather than continuing to attempt delivery to a clearly down destination, pause delivery and accumulate events in a `retry_scheduled` state. Resume when the destination recovers. This prevents retry storms and gives you more time before events reach the DLQ.
Add health check probing
Before attempting to deliver a batch of events to a destination, probe its health endpoint. If it returns non-200, skip the batch for this cycle and reschedule. This is more efficient than attempting delivery and collecting failure responses.
DLQ as a Debugging Tool
The DLQ is also one of your best debugging tools. When your team ships a breaking change, events start failing immediately. The DLQ gives you:
- A complete record of which events were affected
- The exact payloads that triggered failures
- The response bodies from your destination (what error was returned)
- The timestamp sequence, to understand when the problem started
This makes post-incident analysis much easier than reconstructing events from logs.
GetHook's DLQ Design
GetHook's DLQ implementation is designed for operational simplicity:
- Per-event visibility — each dead-letter event shows all 5 delivery attempts, response codes, and error messages
- Bulk replay — replay all DLQ events for a destination in one operation, with rate limiting to prevent thundering herd
- Replay audit trail — replayed events create a new event record linked to the original, so you can track what was replayed and when
- DLQ webhooks — configure a `webhook.dead_letter` notification to your ops channel when events start accumulating
The goal is to make the DLQ something your on-call team can act on in minutes, not hours.