reliability · disaster recovery · webhooks · infrastructure · architecture

Webhook Gateway Disaster Recovery: What to Do When Your Gateway Goes Down

Your webhook gateway is the single point of contact between upstream providers and your system. When it goes down, events stop flowing. Here's how to design for recoverability — without losing a single event.

Jordan Okafor
Senior Backend Engineer
April 11, 2026
10 min read

Most teams think about webhook reliability in terms of retries: if a delivery fails, retry it. What they don't think about until it's too late is what happens when the gateway itself fails — the process responsible for accepting, persisting, and forwarding events.

When your webhook gateway goes down, the consequences depend on a single question: does the upstream provider buffer your events, or does it discard them?

The answer varies by provider and is almost never documented prominently. Stripe buffers and retries for up to 72 hours. Many smaller providers retry for 30 minutes or less. Some don't retry at all. If your gateway is down for longer than the provider's retry window, those events are gone.

This post covers how to design your gateway for recoverability — what durability guarantees to build in, how to reason about replay windows, and how to handle the operational runbook when things actually go wrong.


The Failure Modes That Actually Kill You

Not all gateway outages are equal. The ones that cause event loss typically involve one of these scenarios:

| Failure mode | Event loss risk | Why |
| --- | --- | --- |
| Process crash, DB healthy | Low | Events already persisted survive; process restarts and resumes |
| Network partition (gateway can't reach DB) | Medium | Events accepted over network but not persisted; in-memory queue lost on restart |
| Full host failure with ephemeral storage | High | Any in-flight state is gone; depends entirely on provider retry behavior |
| DB failure with process healthy | Medium | Gateway can't persist; must reject inbound events (which triggers provider retries) |
| Deployment with no graceful drain | Low–Medium | In-flight requests that haven't committed to DB may be dropped |

The pattern here is clear: event loss happens when you accept an event (return a 2xx) before you've persisted it durably. If the process dies between "return 200" and "write to Postgres," the event is lost from your side. The upstream provider believes it delivered successfully and won't retry.

The fix is conceptually simple: do not return 2xx until the event is committed to durable storage. Accept, write, then respond. This adds a database round-trip to your ingest latency, but it buys the durability guarantee that makes recovery possible.


Ingest Durability: The Foundation

Here's the ingest path that ensures no event is acknowledged before it's persisted:

go
func (h *IngestHandler) Handle(w http.ResponseWriter, r *http.Request) {
    // Verify signature before touching the DB
    body, err := io.ReadAll(io.LimitReader(r.Body, 2<<20))
    if err != nil {
        httpx.BadRequest(w, "failed to read body")
        return
    }
    if err := h.verifier.Verify(r, body); err != nil {
        httpx.Unauthorized(w, "invalid signature")
        return
    }

    // Write to DB — this must succeed before we respond
    event := &events.Event{
        ID:        uuid.New(), // prefer a time-ordered ID (e.g. UUIDv7) if IDs double as cursor positions
        AccountID: h.accountID,
        SourceID:  h.source.ID,
        Status:    events.StatusReceived,
        Payload:   body,
        ReceivedAt: time.Now().UTC(),
    }
    if err := h.store.Insert(r.Context(), event); err != nil {
        // Return 500 — provider will retry
        httpx.InternalError(w, err)
        return
    }

    // Only now do we acknowledge receipt
    httpx.Ok(w, map[string]string{"id": event.ID.String()})
}

Two things to notice:

First, we return a 500 on database failure, not a 200. This is the correct behavior. A 500 tells the upstream provider "I didn't get this, please retry." A 200 tells it "I have this, you can move on." The former triggers provider retry logic; the latter permanently discards the event from the provider's perspective.

Second, we pass the request context to the DB write. If the client disconnects before the write completes, the context is cancelled and the in-flight query is aborted. One caveat: cancellation can race with the commit itself, so a write the provider never saw acknowledged may still have landed. The provider sees a network error and retries either way, which means your insert path should tolerate duplicates.
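Because a provider retry after an ambiguous failure can deliver the same event twice, it's worth making duplicates harmless at the schema level. This is a sketch, not part of the schema above: the provider_event_id column and index name are hypothetical, and it assumes the provider includes a stable event ID in its payload or headers.

```sql
-- Hypothetical column and index names; assumes a stable provider event ID.
ALTER TABLE events ADD COLUMN provider_event_id TEXT;

CREATE UNIQUE INDEX events_source_provider_event_uniq
    ON events (source_id, provider_event_id);

-- ON CONFLICT DO NOTHING turns a provider retry into a no-op;
-- RETURNING id comes back empty when the event was already stored.
INSERT INTO events (id, account_id, source_id, status, payload, provider_event_id, received_at)
VALUES ($1, $2, $3, 'received', $4, $5, now())
ON CONFLICT (source_id, provider_event_id) DO NOTHING
RETURNING id;
```

Either way the handler returns 200, because the event is durably stored exactly once.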


Deployment Safety: Graceful Draining

A surprisingly common source of event loss is deployments. If your process is killed mid-request before the DB write commits, the provider sees a dropped connection. A provider with a healthy retry policy will resend, but for providers with short or nonexistent retry windows, that event is gone.

The solution is a graceful shutdown sequence:

go
func main() {
    srv := &http.Server{
        Addr:    ":8080",
        Handler: buildRouter(),
    }

    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)

    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatalf("server error: %v", err)
        }
    }()

    <-quit
    log.Println("shutdown signal received, draining in-flight requests...")

    // Give in-flight requests up to 30 seconds to complete
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if err := srv.Shutdown(ctx); err != nil {
        log.Fatalf("server forced to shutdown: %v", err)
    }
    log.Println("shutdown complete")
}

http.Server.Shutdown stops accepting new connections immediately and waits for all in-flight requests to complete before returning. Set the timeout conservatively — 30 seconds is enough for any reasonable DB write to complete. Your load balancer should stop routing to the instance as soon as the SIGTERM lands, so new requests won't pile up during the drain window.

In Kubernetes, set terminationGracePeriodSeconds to at least 45 (the 30s drain window plus a buffer), and configure a preStop sleep of 5 seconds to give the load balancer time to deregister the pod before SIGTERM arrives.
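Expressed as a pod spec fragment (container name hypothetical), those settings look like:

```yaml
spec:
  terminationGracePeriodSeconds: 45   # 30s drain window plus buffer
  containers:
    - name: gateway
      lifecycle:
        preStop:
          exec:
            # Give the load balancer time to deregister the pod
            # before SIGTERM arrives (requires `sleep` in the image).
            command: ["sleep", "5"]
```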


Database High Availability: Don't Let the DB Be Your Single Point of Failure

If your gateway's Postgres instance is on a single host with no replica, you've traded one single point of failure for another. A Postgres primary failure means your gateway can't persist events — and correctly returns 500s to providers — but the provider's retry window is now your recovery time objective.

The standard configuration for production:

  • Streaming replication to at least one standby. Write-ahead log (WAL) ships to the replica synchronously or asynchronously depending on your durability/latency trade-off.
  • Synchronous replication for the ingest path. With synchronous_commit = on and a synchronous standby, a committed write is on at least two hosts before you return 200. A single host failure loses no committed data.
  • Automatic failover via Patroni or similar. Manual failover during an incident is slow and error-prone. Automatic failover with a short election timeout (10–30 seconds) keeps your 500 window short.

sql
-- On the primary: verify synchronous replication is configured
SHOW synchronous_commit;       -- should be 'on' or 'remote_apply'
SHOW synchronous_standby_names; -- should list your standby

-- Check replication lag from the primary
SELECT
    application_name,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    (sent_lsn - replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;

If replication lag is consistently above a few megabytes, your standby can't keep up with write throughput. Either tune your write patterns, scale up the standby, or accept that failover will replay some WAL — and that replay window becomes your recovery time.
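If a monitoring agent scrapes pg_stat_replication, the LSNs arrive as strings in Postgres's hi/lo hex format (e.g. 16/B374D848: high 32 bits, slash, low 32 bits). A minimal Go sketch of computing lag in bytes from those strings; the function names are ours, not part of any library:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts a Postgres LSN string like "16/B374D848" into an
// absolute byte position: high 32 bits and low 32 bits, both hex.
func parseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("malformed LSN %q", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, err
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, err
	}
	return h<<32 | l, nil
}

// lagBytes returns sent minus replay: how far the standby is behind.
func lagBytes(sentLSN, replayLSN string) (uint64, error) {
	sent, err := parseLSN(sentLSN)
	if err != nil {
		return 0, err
	}
	replay, err := parseLSN(replayLSN)
	if err != nil {
		return 0, err
	}
	return sent - replay, nil
}

func main() {
	lag, _ := lagBytes("16/B374D848", "16/B374D840")
	fmt.Println(lag) // 8
}
```

This is the same arithmetic the `sent_lsn - replay_lsn` expression performs inside Postgres, just done client-side for alerting.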


Replay Windows and Cursor Tracking

Even with durable ingest and HA Postgres, you need a plan for replaying missed events after an outage. The core question: which events did your delivery worker process, and which did it not?

This is a cursor problem. Your worker needs to track the last successfully processed event position so that after a restart, it picks up exactly where it left off — not from the beginning, not from "now."

A simple cursor table:

sql
CREATE TABLE worker_cursors (
    worker_id         TEXT PRIMARY KEY,
    last_event_id     UUID,
    last_processed_at TIMESTAMPTZ,
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);

The worker updates this cursor atomically with the delivery attempt:

sql
-- In a single transaction:
BEGIN;

-- Mark the event as delivered
UPDATE events
SET status = 'delivered', delivered_at = now()
WHERE id = $1;

-- Advance the cursor
INSERT INTO worker_cursors (worker_id, last_event_id, last_processed_at, updated_at)
VALUES ($2, $1, now(), now())
ON CONFLICT (worker_id) DO UPDATE
SET last_event_id = EXCLUDED.last_event_id,
    last_processed_at = EXCLUDED.last_processed_at,
    updated_at = EXCLUDED.updated_at;

COMMIT;

On restart, the worker reads its cursor and resumes from that position:

sql
-- Assumes event IDs sort in arrival order (UUIDv7 or a bigserial
-- position column); random UUIDv4 IDs will not work as a cursor.
SELECT id, payload, status
FROM events
WHERE id > (SELECT last_event_id FROM worker_cursors WHERE worker_id = $1)
  AND status IN ('received', 'queued')
ORDER BY id
LIMIT 100;

This gives you exactly-once cursor semantics within Postgres: events processed before the crash have their status updated, and events after the cursor position are picked up on restart. Delivery itself remains at-least-once, though. The outbound HTTP call happens outside the transaction, so a crash between a successful delivery and the commit means the event is sent again after restart. Destination endpoints need to handle duplicates idempotently.
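Because the outbound HTTP call sits outside the transaction, a destination can see the same event more than once. On the destination side, dedup can be as simple as remembering event IDs already handled. A minimal in-memory sketch (the type and method names are ours; a production endpoint would back this with a table or a cache with a TTL):

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper tracks which event IDs have already been processed, so a
// redelivered event (at-least-once semantics) is handled exactly once.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// FirstSeen records id and reports whether this was its first appearance.
func (d *Deduper) FirstSeen(id string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, dup := d.seen[id]; dup {
		return false
	}
	d.seen[id] = struct{}{}
	return true
}

func main() {
	d := NewDeduper()
	fmt.Println(d.FirstSeen("evt_123")) // true: first delivery, process it
	fmt.Println(d.FirstSeen("evt_123")) // false: redelivery, skip it
}
```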


Runbook: When the Gateway Actually Goes Down

Having the right architecture is necessary but not sufficient. You need a runbook your on-call engineer can follow at 3am without making things worse.

| Step | Action | Why |
| --- | --- | --- |
| 1. Confirm scope | Check health endpoint, DB connectivity, worker status separately | Determines which component failed |
| 2. Check provider retry windows | Look up each active source's provider docs | Establishes your time pressure |
| 3. Restore service | Restart process, restore DB, promote standby, whichever applies | Get ingest accepting again |
| 4. Audit ingest gap | Query events table for gaps in received_at timestamps | Identifies missing events by time range |
| 5. Contact providers if needed | For critical providers, use their event log UI to identify events sent during the gap | Manual recovery for events outside the provider retry window |
| 6. Replay from provider | Use the provider's event replay or redelivery API where one exists (Stripe's Events API supports this; support varies widely by provider) | Fills in events that didn't survive the gap |
| 7. Write the post-mortem | Document gap duration, events lost (if any), root cause, fix | Prevents recurrence |

Step 4 is worth elaborating. A gap in your received_at timestamps is the clearest signal that your gateway was down and not accepting events. If you see no events between 14:32 and 14:47, that's your outage window. Query your events table:

sql
SELECT
    source_id,
    date_trunc('minute', received_at) AS minute,
    COUNT(*) AS event_count
FROM events
WHERE received_at BETWEEN '2026-04-11 14:00:00' AND '2026-04-11 15:00:00'
GROUP BY source_id, minute
ORDER BY minute;

A missing minute in an otherwise active source is a confirmed gap. A source that normally processes 50 events/minute dropping to 0 for 15 minutes, then recovering, is your outage fingerprint.


Testing Your Recovery Path

Disaster recovery plans that have never been tested don't work when you need them. Run a quarterly chaos drill:

bash
# Kill the gateway process (simulates crash)
kill -9 $(pgrep -f gethook-api)

# Send a test event directly to the source endpoint
curl -X POST https://your-gateway.example.com/ingest/src_test123 \
  -H "Content-Type: application/json" \
  -d '{"type": "test.event", "id": "drill-001"}'

# Expect 502 or connection refused — gateway is down

# Restart the gateway
./bin/gethook-api &

# Verify the cursor position and that the worker resumes correctly
psql $DATABASE_URL -c "SELECT * FROM worker_cursors WHERE worker_id = 'primary-worker';"

# Verify no events were lost from before the crash
psql $DATABASE_URL -c "SELECT COUNT(*) FROM events WHERE status = 'delivered' AND received_at > now() - INTERVAL '10 minutes';"

Run this against staging with real event volume, not just a single test curl. The failure mode that bites you is usually at load — in-flight requests that were mid-commit when the process died.

GetHook's delivery worker uses a Postgres-backed cursor with transactional status updates, so restarts are safe by default — the worker picks up from its last committed position without manual intervention. But if you're running your own infrastructure, building and testing this cursor behavior yourself is non-negotiable before going to production.


Webhook gateway disaster recovery is not glamorous, but it's the difference between an outage that loses events and one that doesn't. The core principle is straightforward: persist before acknowledging, drain before shutting down, and know exactly where your worker left off. Everything else — HA Postgres, replay windows, provider retry budget — is detail work on top of that foundation.

Get started with GetHook and skip building durable ingest infrastructure from scratch →

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.