Every engineering team ships bugs. Most bugs are containable — a bad API response, a broken UI, a mishandled edge case. But when the bug lives in your webhook handler, the damage is different. Events have already been processed. Side effects have already fired. Orders have already been updated (or not). Payments have already been recorded (incorrectly). And unlike a UI bug, there's no user staring at a screen who will immediately report it.
The operational question when you catch this isn't "how do I fix the handler?" — that's the easy part. It's: "how do I re-process only the events my broken handler touched, in a way that produces correct state without creating new problems?"
This post walks through the mechanics of doing that safely.
## The Anatomy of the Problem
A bad-deploy replay scenario typically has three boundaries:
- The deploy timestamp: when the broken handler went live
- The fix timestamp: when the corrected handler was deployed
- The affected event types: which webhook events were mishandled
The replay window is everything between the deploy timestamp and the fix timestamp. Not every event in that window is necessarily affected, only those that hit the broken code path. If your bug was in the order.fulfilled handler, you only need to replay order.fulfilled events. Replaying payment.captured events from the same window would be harmless but pointless.
Before writing any replay logic, produce this table explicitly:
| Field | Value |
|---|---|
| Bug introduced at | 2026-04-12 09:14:00 UTC |
| Fix deployed at | 2026-04-12 13:02:00 UTC |
| Affected event types | order.fulfilled |
| Events affected (count) | 247 |
| Side effects to re-execute | Inventory update, fulfillment notification |
| Side effects already correct | Email to customer (separate system, not broken) |
Getting this table right before touching anything is the most important step. Incomplete scope leads to under-replay (you miss events) or over-replay (you re-process events that were handled correctly and create duplicates).
## Querying the Affected Events
If you're using a durable event store, the affected events are already in your database. Pull them using the time window and event type filter:
```sql
SELECT
    id,
    source_id,
    received_at,
    payload,
    status
FROM events
WHERE received_at >= '2026-04-12 09:14:00+00'
  AND received_at < '2026-04-12 13:02:00+00'
  AND payload->>'type' = 'order.fulfilled'
  AND status IN ('delivered', 'dead_letter')
ORDER BY received_at ASC;
```

Two things to note about this query:
Filter by status. You only want events that were processed during the broken window — delivered or dead_letter. Events that are still queued or retry_scheduled will be processed by your now-fixed handler automatically. Do not replay those.
Order by received_at ASC. If the events represent state transitions that depend on each other — an order moving from created to processing to fulfilled — processing them out of order produces incorrect state. Always replay in the order they originally arrived.
## Idempotency Is Your Safety Net
Before triggering any replay, verify that your handler is idempotent. If it isn't, stop and fix that first.
An idempotent handler produces the same result regardless of how many times it's called with the same input. For webhook handlers, this typically means:
```go
// Non-idempotent: creates a duplicate fulfillment record every time
func handleOrderFulfilled(ctx context.Context, event OrderFulfilledEvent) error {
    return db.InsertFulfillmentRecord(ctx, event.OrderID, event.FulfilledAt)
}

// Idempotent: upserts on the stable event ID
func handleOrderFulfilled(ctx context.Context, event OrderFulfilledEvent) error {
    return db.UpsertFulfillmentRecord(ctx, UpsertParams{
        EventID:     event.ID, // stable, unique per event
        OrderID:     event.OrderID,
        FulfilledAt: event.FulfilledAt,
    })
}
```

The UpsertFulfillmentRecord call uses the webhook event's id as a natural idempotency key. If the event has already been processed correctly after your fix, re-processing it produces the same row with no duplicate side effects.
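For concreteness, here's a minimal sketch of what that upsert could look like against Postgres. The fulfillment_records table and the database/sql wiring are illustrative assumptions, not the post's actual schema:

```go
import (
    "context"
    "database/sql"
    "time"
)

type UpsertParams struct {
    EventID     string
    OrderID     string
    FulfilledAt time.Time
}

// UpsertFulfillmentRecord inserts the fulfillment row keyed on the event ID.
// A replayed event hits the ON CONFLICT branch and leaves the row unchanged,
// so calling this any number of times yields the same state.
func UpsertFulfillmentRecord(ctx context.Context, db *sql.DB, p UpsertParams) error {
    _, err := db.ExecContext(ctx, `
        INSERT INTO fulfillment_records (event_id, order_id, fulfilled_at)
        VALUES ($1, $2, $3)
        ON CONFLICT (event_id) DO NOTHING`,
        p.EventID, p.OrderID, p.FulfilledAt)
    return err
}
```

DO NOTHING works here because webhook payloads are immutable: a second delivery of the same event carries the same data, so there's nothing to update.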
If your downstream systems don't support idempotent writes natively, add an idempotency_keys table:
```sql
CREATE TABLE idempotency_keys (
    key         TEXT PRIMARY KEY,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    result_code INT
);

-- In your handler:
INSERT INTO idempotency_keys (key, result_code)
VALUES ($1, $2)
ON CONFLICT (key) DO NOTHING
RETURNING key;
-- If no row is returned, this event was already processed; skip it.
```

With this guard in place, replaying an event that was already correctly processed after your fix is a no-op.
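Wired into a handler, the guard might look like the following. This is a sketch assuming database/sql against Postgres; claiming the key and doing the side-effecting work in one transaction ensures they commit or roll back together:

```go
import (
    "context"
    "database/sql"
    "errors"
)

func handleWithGuard(ctx context.Context, db *sql.DB, eventID string, work func(*sql.Tx) error) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op after a successful Commit

    // Claim the key. On conflict, no row comes back and Scan yields ErrNoRows.
    // result_code is nullable, so inserting only the key is fine.
    var claimed string
    err = tx.QueryRowContext(ctx, `
        INSERT INTO idempotency_keys (key) VALUES ($1)
        ON CONFLICT (key) DO NOTHING
        RETURNING key`, eventID).Scan(&claimed)
    if errors.Is(err, sql.ErrNoRows) {
        return nil // already processed: safe to skip
    }
    if err != nil {
        return err
    }

    // The side-effecting work runs only for first-time keys.
    if err := work(tx); err != nil {
        return err
    }
    return tx.Commit()
}
```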
## Replay Strategy: Batch or Stream?
You have 247 events to replay. How you feed them through your handler matters.
Batch replay (all at once) is fastest but dangerous if your handler calls downstream APIs with rate limits, or if your fix introduced a state dependency (event N must complete before event N+1 starts).
Streamed replay with concurrency control is slower but safer. You process a configurable number of events in parallel, rate-limited to protect downstream systems.
A minimal replay script in Go:
```go
func replayEvents(ctx context.Context, store *EventStore, handler EventHandler) error {
    const (
        batchSize = 10
        // With workerCount > 1, events within a batch may complete out of
        // order. If your events encode order-dependent state transitions,
        // set this to 1.
        workerCount         = 3
        delayBetweenBatches = 500 * time.Millisecond
    )

    events, err := store.QueryAffectedEvents(ctx, QueryParams{
        After:      time.Date(2026, 4, 12, 9, 14, 0, 0, time.UTC),
        Before:     time.Date(2026, 4, 12, 13, 2, 0, 0, time.UTC),
        EventType:  "order.fulfilled",
        Statuses:   []string{"delivered", "dead_letter"},
        OrderByAsc: true,
    })
    if err != nil {
        return fmt.Errorf("query affected events: %w", err)
    }
    log.Printf("replaying %d events", len(events))

    sem := make(chan struct{}, workerCount) // caps in-flight handlers
    var wg sync.WaitGroup
    var mu sync.Mutex
    var failed []string

    for i, ev := range events {
        sem <- struct{}{}
        wg.Add(1)
        go func(ev Event) {
            defer wg.Done()
            defer func() { <-sem }()
            if err := handler.Handle(ctx, ev); err != nil {
                mu.Lock()
                failed = append(failed, ev.ID.String())
                mu.Unlock()
                log.Printf("replay failed for event %s: %v", ev.ID, err)
            } else {
                log.Printf("replayed event %s", ev.ID)
            }
        }(ev)

        // Throttle between batches
        if (i+1)%batchSize == 0 {
            wg.Wait()
            time.Sleep(delayBetweenBatches)
        }
    }
    wg.Wait()

    if len(failed) > 0 {
        return fmt.Errorf("replay completed with %d failures: %v", len(failed), failed)
    }
    return nil
}
```

The workerCount = 3 and delayBetweenBatches = 500ms are conservative defaults. Tune them based on your downstream systems' throughput. If your fulfillment API is rate-limited to 10 req/s, keep the worker count low and add a longer delay.
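If the downstream limit is hard rather than approximate, a token-bucket limiter is more precise than hand-tuned sleeps. Here's a sketch using golang.org/x/time/rate; the 10 req/s figure is an assumed limit, and sequential processing also preserves arrival order for free:

```go
import (
    "context"

    "golang.org/x/time/rate"
)

// replayWithRateLimit processes events one at a time, blocking before each
// call until the limiter grants a token, so throughput never exceeds the cap.
func replayWithRateLimit(ctx context.Context, events []Event, handler EventHandler) error {
    limiter := rate.NewLimiter(rate.Limit(10), 1) // 10 tokens/sec, burst of 1

    for _, ev := range events {
        if err := limiter.Wait(ctx); err != nil {
            return err // context cancelled or deadline exceeded
        }
        if err := handler.Handle(ctx, ev); err != nil {
            return err
        }
    }
    return nil
}
```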
## Tracking Replay Status
A replay that runs for 30 minutes and fails silently halfway through is worse than a replay that never ran. You need to track which events have been replayed and which haven't.
Add a replay_attempts table (or, if you already have a delivery attempts table, tag replays with a replay_run_id):
```sql
ALTER TABLE delivery_attempts
    ADD COLUMN replay_run_id UUID;

-- When querying replay progress:
SELECT
    COUNT(*) FILTER (WHERE da.replay_run_id = $1) AS replayed,
    COUNT(*) FILTER (WHERE da.replay_run_id IS NULL AND e.status = 'delivered') AS not_yet_replayed,
    COUNT(*) FILTER (WHERE da.replay_run_id = $1 AND da.outcome = 'success') AS replay_succeeded,
    COUNT(*) FILTER (WHERE da.replay_run_id = $1 AND da.outcome != 'success') AS replay_failed
FROM delivery_attempts da
JOIN events e ON e.id = da.event_id
WHERE e.received_at >= $2
  AND e.received_at < $3
  AND e.payload->>'type' = $4;
```

This lets you query replay progress at any point, resume a partial replay from where it left off (skip events where replay_run_id IS NOT NULL), and produce a final audit record of what was replayed and what failed.
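Resuming an interrupted run then becomes a query change rather than new infrastructure: select only the affected events with no successful attempt for this run and feed them through the same script. A sketch, reusing the hypothetical EventStore and QueryParams from earlier and assuming Event.ID scans from a UUID column:

```go
// QueryUnreplayedEvents returns affected events with no successful attempt
// tagged with this replay run, so a restarted script picks up where it stopped.
func (s *EventStore) QueryUnreplayedEvents(ctx context.Context, runID string, p QueryParams) ([]Event, error) {
    rows, err := s.db.QueryContext(ctx, `
        SELECT e.id, e.payload
        FROM events e
        WHERE e.received_at >= $1
          AND e.received_at <  $2
          AND e.payload->>'type' = $3
          AND NOT EXISTS (
              SELECT 1 FROM delivery_attempts da
              WHERE da.event_id = e.id
                AND da.replay_run_id = $4
                AND da.outcome = 'success')
        ORDER BY e.received_at ASC`,
        p.After, p.Before, p.EventType, runID)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var events []Event
    for rows.Next() {
        var ev Event
        if err := rows.Scan(&ev.ID, &ev.Payload); err != nil {
            return nil, err
        }
        events = append(events, ev)
    }
    return events, rows.Err()
}
```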
## Side Effects You Must Not Replay
Not every side effect should run again. This is the hardest part to get right and the most common source of replay-induced incidents.
| Side effect | Replay safe? | Reason |
|---|---|---|
| Update internal DB record (idempotent upsert) | Yes | Same result on re-run |
| Push notification to mobile | No | User gets duplicate notification |
| Email confirmation | No | User gets duplicate email |
| Charge a payment | No | Catastrophic if run twice |
| Update an inventory count (non-idempotent) | No | Decrement applied twice; count drifts or goes negative |
| Enqueue a downstream job (idempotent check) | Yes, with guard | Check job doesn't already exist |
| Call a partner API | Depends | Check their idempotency key support |
The pattern for suppressing non-replayable side effects during a replay run is a feature flag on the handler:
```go
type HandlerConfig struct {
    IsReplay bool
    // ... other config
}

func handleOrderFulfilled(ctx context.Context, event OrderFulfilledEvent, cfg HandlerConfig) error {
    // Always safe to re-run
    if err := db.UpsertFulfillmentRecord(ctx, event); err != nil {
        return err
    }

    // Skip notifications during replay; the user already received these
    if !cfg.IsReplay {
        if err := notifications.SendFulfillmentEmail(ctx, event.OrderID); err != nil {
            return err
        }
    }
    return nil
}
```

Set IsReplay = true when running your replay script. The handler re-executes the core data mutation but skips external notifications.
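One way to thread the flag through the earlier replay script without changing its call sites is a thin adapter that satisfies the same EventHandler interface. handleEvent here stands in for your per-event-type dispatch and is hypothetical:

```go
import "context"

// replayHandler wraps the production dispatch with IsReplay forced on,
// so replayEvents needs no changes to suppress notifications.
type replayHandler struct {
    cfg HandlerConfig
}

func (h replayHandler) Handle(ctx context.Context, ev Event) error {
    return handleEvent(ctx, ev, h.cfg) // handleEvent: your per-type dispatch (hypothetical)
}

// Kick off the replay with notifications suppressed:
//   err := replayEvents(ctx, store, replayHandler{cfg: HandlerConfig{IsReplay: true}})
```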
## Using GetHook's Built-In Replay
If your inbound events flow through GetHook, you don't need to write the replay infrastructure from scratch. GetHook stores every inbound event durably and exposes a replay endpoint:
```bash
# Replay a specific event
curl -X POST https://api.gethook.to/v1/events/{event_id}/replay \
  -H "Authorization: Bearer hk_..."

# For bulk replay, query your event list with filters and loop
curl "https://api.gethook.to/v1/events?type=order.fulfilled&after=2026-04-12T09:14:00Z&before=2026-04-12T13:02:00Z&status=delivered" \
  -H "Authorization: Bearer hk_..." | jq -r '.data[].id' | while read id; do
  curl -s -X POST "https://api.gethook.to/v1/events/$id/replay" \
    -H "Authorization: Bearer hk_..." > /dev/null
  echo "replayed $id"
done
```

The replayed events flow through the same delivery pipeline (signature verification, routing, retry logic) but carry a replayed status tag so you can distinguish them in your delivery logs. Your idempotency guards in the handler remain your protection against duplicate side effects; the gateway re-delivers, you decide what to re-execute.
## Post-Replay Audit
Once the replay is complete, verify correctness before closing the incident:
```sql
-- Check that all affected events now have a successful replay attempt
SELECT
    e.id,
    e.received_at,
    da_original.outcome AS original_outcome,
    da_replay.outcome   AS replay_outcome
FROM events e
JOIN delivery_attempts da_original
    ON da_original.event_id = e.id AND da_original.replay_run_id IS NULL
LEFT JOIN delivery_attempts da_replay
    ON da_replay.event_id = e.id AND da_replay.replay_run_id = $1
WHERE e.received_at >= $2
  AND e.received_at < $3
  AND e.payload->>'type' = $4
  AND da_replay.outcome IS DISTINCT FROM 'success'
ORDER BY e.received_at;
```

Any row returned by this query is an event that was in the affected window but did not replay successfully. Work through those failures individually; they may represent events where downstream state is genuinely ambiguous and requires manual resolution.
A bad deploy affecting your webhook handler is recoverable if you have durable event storage and idempotent handlers. The replay itself is the easy part; the discipline is knowing exactly which events to replay, which side effects to suppress, and how to verify that the replay produced correct state. Get those three right and a four-hour handler outage becomes a one-hour remediation.
Set up durable event storage and one-click replay with GetHook