webhooks · reliability · event replay · debugging · operations

Replaying Webhooks After a Bad Deploy: Re-Processing Events Your Handler Got Wrong

You shipped a bug in your webhook handler and events were processed incorrectly for the last four hours. Here's how to replay only the affected events, in the right order, without re-triggering side effects you've already fixed.

Yuki Tanaka
Founding Engineer
April 12, 2026
10 min read

Every engineering team ships bugs. Most bugs are containable — a bad API response, a broken UI, a mishandled edge case. But when the bug lives in your webhook handler, the damage is different. Events have already been processed. Side effects have already fired. Orders have already been updated (or not). Payments have already been recorded (incorrectly). And unlike a UI bug, there's no user staring at a screen who will immediately report it.

The operational question when you catch this isn't "how do I fix the handler?" — that's the easy part. It's: "how do I re-process only the events my broken handler touched, in a way that produces correct state without creating new problems?"

This post walks through the mechanics of doing that safely.


The Anatomy of the Problem

A bad-deploy replay scenario typically has three boundaries:

  1. The deploy timestamp — when the broken handler went live
  2. The fix timestamp — when the corrected handler was deployed
  3. The affected event types — which webhook events were mishandled

The replay window is everything between timestamps 1 and 2. Not all events in that window are necessarily affected — only those that hit the broken code path. If your bug was in the `order.fulfilled` handler, you only need to replay `order.fulfilled` events. Replaying `payment.captured` events from the same window would be harmless but pointless.

Before writing any replay logic, produce this table explicitly:

| Field | Value |
| --- | --- |
| Bug introduced at | 2026-04-12 09:14:00 UTC |
| Fix deployed at | 2026-04-12 13:02:00 UTC |
| Affected event types | `order.fulfilled` |
| Events affected (count) | 247 |
| Side effects to re-execute | Inventory update, fulfillment notification |
| Side effects already correct | Email to customer (separate system, not broken) |

Getting this table right before touching anything is the most important step. Incomplete scope leads to under-replay (you miss events) or over-replay (you re-process events that were handled correctly and create duplicates).


Querying the Affected Events

If you're using a durable event store, the affected events are already in your database. Pull them using the time window and event type filter:

```sql
SELECT
    id,
    source_id,
    received_at,
    payload,
    status
FROM events
WHERE received_at >= '2026-04-12 09:14:00+00'
  AND received_at <  '2026-04-12 13:02:00+00'
  AND payload->>'type' = 'order.fulfilled'
  AND status IN ('delivered', 'dead_letter')
ORDER BY received_at ASC;
```

Two things to note about this query:

**Filter by status.** You only want events that were processed during the broken window — `delivered` or `dead_letter`. Events that are still `queued` or `retry_scheduled` will be processed by your now-fixed handler automatically. Do not replay those.

**Order by `received_at` ASC.** If the events represent state transitions that depend on each other — an order moving from created to processing to fulfilled — processing them out of order produces incorrect state. Always replay in the order they originally arrived.


Idempotency Is Your Safety Net

Before triggering any replay, verify that your handler is idempotent. If it isn't, stop and fix that first.

An idempotent handler produces the same result regardless of how many times it's called with the same input. For webhook handlers, this typically means:

```go
// Non-idempotent: creates a duplicate fulfillment record every time
func handleOrderFulfilled(ctx context.Context, event OrderFulfilledEvent) error {
    return db.InsertFulfillmentRecord(ctx, event.OrderID, event.FulfilledAt)
}

// Idempotent: upserts on the stable event ID
func handleOrderFulfilled(ctx context.Context, event OrderFulfilledEvent) error {
    return db.UpsertFulfillmentRecord(ctx, UpsertParams{
        EventID:     event.ID,   // stable, unique per event
        OrderID:     event.OrderID,
        FulfilledAt: event.FulfilledAt,
    })
}
```

The `UpsertFulfillmentRecord` call uses the webhook event's `id` as a natural idempotency key. If the event has already been processed correctly after your fix, re-processing it produces the same row — no duplicate side effects.

If your downstream systems don't support idempotent writes natively, add an `idempotency_keys` table:

```sql
CREATE TABLE idempotency_keys (
    key         TEXT PRIMARY KEY,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    result_code INT
);

-- In your handler:
INSERT INTO idempotency_keys (key, result_code)
VALUES ($1, $2)
ON CONFLICT (key) DO NOTHING
RETURNING key;

-- If no row returned, this event was already processed — skip.
```

With this guard in place, replaying an event that was already correctly processed after your fix is a no-op.


Replay Strategy: Batch or Stream?

You have 247 events to replay. How you feed them through your handler matters.

Batch replay (all at once) is fastest but dangerous if your handler calls downstream APIs with rate limits, or if your fix introduced a state dependency (event N must complete before event N+1 starts).

Streamed replay with concurrency control is slower but safer. You process a configurable number of events in parallel, rate-limited to protect downstream systems.

A minimal replay script in Go:

```go
func replayEvents(ctx context.Context, store *EventStore, handler EventHandler) error {
    const (
        batchSize           = 10
        workerCount         = 3
        delayBetweenBatches = 500 * time.Millisecond
    )

    events, err := store.QueryAffectedEvents(ctx, QueryParams{
        After:      time.Date(2026, 4, 12, 9, 14, 0, 0, time.UTC),
        Before:     time.Date(2026, 4, 12, 13, 2, 0, 0, time.UTC),
        EventType:  "order.fulfilled",
        Statuses:   []string{"delivered", "dead_letter"},
        OrderByAsc: true,
    })
    if err != nil {
        return fmt.Errorf("query affected events: %w", err)
    }

    log.Printf("replaying %d events", len(events))

    sem := make(chan struct{}, workerCount)
    var wg sync.WaitGroup
    var mu sync.Mutex
    var failed []string

    for i, ev := range events {
        sem <- struct{}{}
        wg.Add(1)
        go func(ev Event) {
            defer wg.Done()
            defer func() { <-sem }()

            if err := handler.Handle(ctx, ev); err != nil {
                mu.Lock()
                failed = append(failed, ev.ID.String())
                mu.Unlock()
                log.Printf("replay failed for event %s: %v", ev.ID, err)
            } else {
                log.Printf("replayed event %s", ev.ID)
            }
        }(ev)

        // Throttle between batches
        if (i+1)%batchSize == 0 {
            wg.Wait()
            time.Sleep(delayBetweenBatches)
        }
    }
    wg.Wait()

    if len(failed) > 0 {
        return fmt.Errorf("replay completed with %d failures: %v", len(failed), failed)
    }
    return nil
}
```

The `workerCount = 3` and `delayBetweenBatches = 500ms` are conservative defaults. Tune them based on your downstream systems' throughput. If your fulfillment API is rate-limited to 10 req/s, keep worker count low and add a longer delay.


Tracking Replay Status

A replay that runs for 30 minutes and fails silently halfway through is worse than a replay that never ran. You need to track which events have been replayed and which haven't.

Add a replay_attempts table (or, if you already have a delivery attempts table, tag replays with a replay_run_id):

```sql
ALTER TABLE delivery_attempts
    ADD COLUMN replay_run_id UUID;

-- When querying replay progress:
SELECT
    COUNT(*) FILTER (WHERE replay_run_id = $1) AS replayed,
    COUNT(*) FILTER (WHERE replay_run_id IS NULL AND status = 'delivered') AS not_yet_replayed,
    COUNT(*) FILTER (WHERE replay_run_id = $1 AND outcome = 'success') AS replay_succeeded,
    COUNT(*) FILTER (WHERE replay_run_id = $1 AND outcome != 'success') AS replay_failed
FROM delivery_attempts da
JOIN events e ON e.id = da.event_id
WHERE e.received_at >= $2
  AND e.received_at <  $3
  AND e.payload->>'type' = $4;
```

This lets you query replay progress at any point, resume a partial replay from where it left off (skip events where `replay_run_id IS NOT NULL`), and produce a final audit record of what was replayed and what failed.


Side Effects You Must Not Replay

Not every side effect should run again. This is the hardest part to get right and the most common source of replay-induced incidents.

| Side effect | Replay safe? | Reason |
| --- | --- | --- |
| Update internal DB record (idempotent upsert) | Yes | Same result on re-run |
| Push notification to mobile | No | User gets duplicate notification |
| Email confirmation | No | User gets duplicate email |
| Charge a payment | No | Catastrophic if run twice |
| Update an inventory count (non-idempotent) | No | Count goes negative |
| Enqueue a downstream job (idempotent check) | Yes, with guard | Check job doesn't already exist |
| Call a partner API | Depends | Check their idempotency key support |

The pattern for suppressing non-replayable side effects during a replay run is a feature flag on the handler:

```go
type HandlerConfig struct {
    IsReplay bool
    // ... other config
}

func handleOrderFulfilled(ctx context.Context, event OrderFulfilledEvent, cfg HandlerConfig) error {
    // Always safe to re-run
    if err := db.UpsertFulfillmentRecord(ctx, event); err != nil {
        return err
    }

    // Skip notifications during replay — user already received these
    if !cfg.IsReplay {
        if err := notifications.SendFulfillmentEmail(ctx, event.OrderID); err != nil {
            return err
        }
    }

    return nil
}
```

Set `IsReplay = true` when running your replay script. The handler re-executes the core data mutation but skips external notifications.


Using GetHook's Built-In Replay

If your inbound events flow through GetHook, you don't need to write the replay infrastructure from scratch. GetHook stores every inbound event durably and exposes a replay endpoint:

```bash
# Replay a specific event
curl -X POST https://api.gethook.to/v1/events/{event_id}/replay \
  -H "Authorization: Bearer hk_..."

# For bulk replay, query your event list with filters and loop
curl "https://api.gethook.to/v1/events?type=order.fulfilled&after=2026-04-12T09:14:00Z&before=2026-04-12T13:02:00Z&status=delivered" \
  -H "Authorization: Bearer hk_..." | jq -r '.data[].id' | while read id; do
    curl -s -X POST "https://api.gethook.to/v1/events/$id/replay" \
      -H "Authorization: Bearer hk_..." > /dev/null
    echo "replayed $id"
  done
```

The replayed events flow through the same delivery pipeline — signature verification, routing, retry logic — but with a replayed status tag so you can distinguish them in your delivery logs. Your idempotency guards in the handler remain your protection against duplicate side effects; the gateway re-delivers, you decide what to re-execute.


Post-Replay Audit

Once the replay is complete, verify correctness before closing the incident:

```sql
-- Check that all affected events now have a successful replay attempt
SELECT
    e.id,
    e.received_at,
    da_original.outcome AS original_outcome,
    da_replay.outcome   AS replay_outcome
FROM events e
JOIN delivery_attempts da_original
    ON da_original.event_id = e.id AND da_original.replay_run_id IS NULL
LEFT JOIN delivery_attempts da_replay
    ON da_replay.event_id = e.id AND da_replay.replay_run_id = $1
WHERE e.received_at >= $2
  AND e.received_at <  $3
  AND e.payload->>'type' = $4
  AND da_replay.outcome IS DISTINCT FROM 'success'
ORDER BY e.received_at;
```

Any row returned by this query is an event that was in the affected window but did not replay successfully. Work through those failures individually — they may represent events where downstream state is genuinely ambiguous and requires manual resolution.


A bad deploy affecting your webhook handler is recoverable if you have durable event storage and idempotent handlers. The replay itself is the easy part; the discipline is knowing exactly which events to replay, which side effects to suppress, and how to verify that the replay produced correct state. Get those three right and a four-hour handler outage becomes a one-hour remediation.

Set up durable event storage and one-click replay with GetHook
