webhooks · reliability · architecture · onboarding · replay

The Webhook Cold Start Problem: Safely Bootstrapping a New Consumer Endpoint

Subscribing a new endpoint to an existing webhook stream is harder than it looks. Here's how to catch up on missed events, avoid re-processing old ones, and get your new consumer to steady state without data loss or duplicate side effects.

Priya Nair
Developer Advocate
April 11, 2026
9 min read

When you register a brand-new webhook endpoint, you face a problem that rarely gets discussed: what happens to the events that fired before your endpoint existed? For many use cases, the answer is "nothing" — you only care about future events. But for use cases where events represent state transitions — order status changes, subscription lifecycle events, payment reconciliation — missing the historical window means starting with an incomplete picture.

This is the webhook cold start problem. It has four sub-problems that you need to solve in sequence:

  1. Backfill: How do you catch up on events that fired before your endpoint was registered?
  2. Ordering: During backfill, how do you handle events arriving out of order?
  3. Deduplication: During the cutover from backfill to live stream, how do you avoid processing the same event twice?
  4. Validation: How do you know when your endpoint has reached steady state?

This post works through each of these. The solutions apply whether you're building your own webhook platform or integrating with a third-party provider.


Why Cold Starts Are Easy to Get Wrong

The instinct when onboarding a new consumer is to start it, register the webhook, and trust that everything from that point forward will arrive. For stateless events — "a user clicked something" — this is fine. For stateful events — "an order moved to status X" — you now have a consumer that knows about state transitions from timestamp T onwards, but has no view of the state that existed before T.

The concrete failure mode: your new order management system registers a webhook for order.status_changed. It starts receiving events. Thirty minutes later, you discover that 12 orders were in a payment_failed state before your endpoint came online. Your system has no record of them. When the payment provider sends order.status_changed for one of those orders moving to payment_retry, your consumer has no context for the transition and either drops the event or creates corrupt state.

Backfilling is not optional for stateful consumers. It's part of the bootstrap.


Phase 1: Backfill via Event Replay

Most production webhook platforms expose an event history API — a paginated feed of past events that you can query. Before you start processing live events, you need to replay history up to the moment your subscription became active.

The pattern is a cursor-based replay loop:

go
func backfill(client *WebhookClient, subscriptionStart time.Time) error {
    cursor := ""
    for {
        resp, err := client.ListEvents(ListEventsParams{
            Before:  subscriptionStart,
            After:   subscriptionStart.Add(-72 * time.Hour), // 3-day window
            Cursor:  cursor,
            Limit:   100,
        })
        if err != nil {
            return fmt.Errorf("backfill fetch: %w", err)
        }

        for _, event := range resp.Events {
            if err := processEvent(event); err != nil {
                // Log but don't abort — partial backfill is better than none
                log.Printf("backfill event %s failed: %v", event.ID, err)
            }
        }

        if !resp.HasMore {
            break
        }
        cursor = resp.NextCursor
    }
    return nil
}

Key decisions in this loop:

How far back to go. This depends on your domain. If you're reconciling payment state, you might need 7 days of history. If you're syncing inventory counts, 24 hours is probably enough. Pick a window that covers your longest meaningful state transition lifecycle.

What to do with failures. A single event processing failure during backfill should not abort the entire backfill. Log the failure, continue, and handle it manually later. A failed backfill that stops at 40% completion is worse than a completed backfill with 3 skipped events.

Ordering during backfill. Most event history APIs return events newest-first. If your state transitions are order-sensitive, you need to reverse the page order — collect all pages into a buffer, then process from oldest to newest. Alternatively, use an API parameter like sort=asc if the provider supports it.


Phase 2: The Cutover Window

The hardest part of bootstrapping is the gap between "backfill started" and "live stream processing started." During this window, new events are arriving on the live subscription, but you're still processing backfill history. You need to buffer the live stream without processing it yet.

Here's the sequence:

T=0   Create subscription. Start buffering incoming events.
T=1   Begin backfill from (T=0 - window) to T=0.
T=N   Backfill completes.
T=N+1 Drain the buffer. Begin processing live stream.

The buffer is the key mechanism. Your webhook endpoint should accept incoming events during the backfill, write them to a holding table, and return 200 immediately — without processing them. Once backfill is complete, you drain the buffer in order.

sql
CREATE TABLE webhook_buffer (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_id      TEXT NOT NULL UNIQUE,  -- provider's event ID for deduplication
    event_type    TEXT NOT NULL,
    payload       JSONB NOT NULL,
    received_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed_at  TIMESTAMPTZ
);

CREATE INDEX webhook_buffer_unprocessed ON webhook_buffer (received_at)
    WHERE processed_at IS NULL;

The event_id column with a UNIQUE constraint is your deduplication key. If a backfill event and a live buffered event have the same ID, the insert will fail — and you can safely ignore the duplicate.


Phase 3: Deduplication at the Seam

The overlap between the tail end of backfill and the start of the live buffer is where duplicates appear. An event that fired at T=0-2s might show up in both your backfill replay and your live subscription. You need to process it exactly once.

The cleanest approach is an idempotency table — a record of every event ID you have already processed:

sql
CREATE TABLE processed_events (
    event_id    TEXT PRIMARY KEY,
    processed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Before processing any event — whether from backfill or live stream — check and record atomically:

go
func processOnce(db *sql.DB, eventID string, fn func() error) error {
    tx, err := db.Begin()
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Try to claim this event ID. ON CONFLICT DO NOTHING makes the insert
    // a no-op if another path (backfill, buffer drain, or a racing worker)
    // already claimed it.
    res, err := tx.Exec(
        `INSERT INTO processed_events (event_id) VALUES ($1) ON CONFLICT DO NOTHING`,
        eventID,
    )
    if err != nil {
        return fmt.Errorf("claim event: %w", err)
    }

    // Zero rows affected means the event was already processed — skip it.
    rows, err := res.RowsAffected()
    if err != nil {
        return fmt.Errorf("rows affected: %w", err)
    }
    if rows == 0 {
        return nil
    }

    if err := fn(); err != nil {
        return err // tx.Rollback() fires via defer, releasing the claim
    }

    return tx.Commit()
}

The rows-affected check is the crux: with INSERT ... ON CONFLICT DO NOTHING, zero rows inserted means the event ID already exists, so the event was already processed and fn never runs. The transaction boundary ensures that even if two workers race on the same event ID, only one commits the claim and performs the side effects — the loser either blocks until the winner commits and then sees zero rows, or rolls back cleanly.


Ordering Guarantees During Drain

When you drain the buffer, process events in the order you received them (received_at ASC), not in the order the provider created them. For most event types this is the same thing. For providers that deliver retries out of chronological order, the distinction matters.

Event source       | Recommended ordering key
Backfill replay    | event.created_at ASC (provider timestamp)
Live buffer drain  | webhook_buffer.received_at ASC
Steady-state live  | Process as received, rely on idempotency

If your domain requires strict causal ordering (event B must be processed after event A for the same entity), group events by entity ID during drain and process each entity's events sequentially. You can still process different entities concurrently.


Phase 4: Validating Steady State

How do you know the bootstrap is complete and your consumer is healthy? Define success criteria before you start, not after.

A useful checklist:

Check                                   | How to verify
Backfill completed without fatal errors | Log a completion record with error count
Buffer fully drained                    | SELECT COUNT(*) FROM webhook_buffer WHERE processed_at IS NULL returns 0
No processing gaps                      | Compare event count from provider API vs. your processed_events table
Live stream latency normal              | P99 time from event creation to processing < SLA threshold
No unusual error rate                   | Error rate on live events matches baseline from similar consumers

The event count comparison is worth elaborating on. Most providers expose an event count endpoint or let you count events from the history API. After draining your buffer, query both:

bash
# Events the provider says fired in your bootstrap window
PROVIDER_COUNT=$(curl -s "https://api.provider.com/events?after=T0&before=T1&count=true" \
  -H "Authorization: Bearer $API_KEY" | jq '.total_count')

# Events you recorded as processed in the same window
YOUR_COUNT=$(psql -tAc "
  SELECT COUNT(*) FROM processed_events
  WHERE processed_at BETWEEN '$T0' AND '$T1'
")

echo "Provider: $PROVIDER_COUNT | Ours: $YOUR_COUNT | Gap: $((PROVIDER_COUNT - YOUR_COUNT))"

A nonzero gap is not necessarily a problem — providers sometimes include internal or system events in their count that are not delivered to subscribers. But a gap larger than a few percent warrants investigation before you declare the consumer healthy.


When the Provider Doesn't Support Replay

Not every webhook provider has a replay API. Stripe does. GitHub does. Many SaaS platforms don't. If you're integrating with a provider that only delivers events going forward, your bootstrap strategy changes:

  1. Seed from the REST API. Before registering the webhook, call the provider's REST API to fetch current state for all relevant entities. Write that state to your database. This is your backfill substitute.
  2. Register the webhook subscription after the REST fetch completes. Any events fired during the REST fetch will appear in your live stream. Your idempotency layer handles events that duplicate state already fetched.
  3. Reconcile periodically. For the first 24–48 hours after bootstrap, run a scheduled job that calls the REST API and compares state against what your webhook stream has delivered. Differences are events you missed during the bootstrap window.

This is more work than replay-based bootstrap, but it's the practical answer for providers that don't expose event history.

GetHook's replay API lets you re-deliver any event by ID or replay all events for a source within a time window — which makes bootstrapping new destinations against existing sources straightforward without building your own replay infrastructure.


Checklist for a Safe Cold Start

Step | Action
1    | Create the subscription and start buffering. Record subscription_start timestamp.
2    | Backfill from provider event history up to subscription_start.
3    | Process backfill events oldest-first. Record each event ID in processed_events.
4    | Drain the live buffer in received_at ASC order. Skip duplicates via ON CONFLICT DO NOTHING.
5    | Switch to live stream processing.
6    | Validate: compare event counts, check buffer is empty, verify error rate.
7    | Tear down the buffer after a 24-hour monitoring window.

The cold start problem is one of those distributed systems details that seems minor until it causes an incident. Getting it right at the start saves you from a messy post-hoc reconciliation job — and gives you a reliable, auditable record of exactly which events your consumer has processed from day one.


Ready to build webhook consumers with replay, deduplication, and full event history built in? Start with GetHook — or explore the event replay docs to see how replay-based bootstrapping works against your existing sources.

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.