Most teams think about webhook reliability in terms of retries: if a delivery fails, retry it. What they don't think about until it's too late is what happens when the gateway itself fails — the process responsible for accepting, persisting, and forwarding events.
When your webhook gateway goes down, the consequences depend on a single question: does the upstream provider buffer your events, or does it discard them?
The answer varies by provider and is almost never documented prominently. Stripe buffers and retries for up to 72 hours. Many smaller providers retry for 30 minutes or less. Some don't retry at all. If your gateway is down for longer than the provider's retry window, those events are gone.
This post covers how to design your gateway for recoverability — what durability guarantees to build in, how to reason about replay windows, and how to handle the operational runbook when things actually go wrong.
## The Failure Modes That Actually Kill You
Not all gateway outages are equal. The ones that cause event loss typically involve one of the scenarios below:
| Failure mode | Event loss risk | Why |
|---|---|---|
| Process crash, DB healthy | Low | Events already persisted survive; process restarts and resumes |
| Network partition (gateway can't reach DB) | Medium | Events accepted over network but not persisted; in-memory queue lost on restart |
| Full host failure with ephemeral storage | High | Any in-flight state is gone; depends entirely on provider retry behavior |
| DB failure with process healthy | Medium | Gateway can't persist; must reject inbound events (which triggers provider retries) |
| Deployment with no graceful drain | Low-Medium | In-flight requests that haven't committed to DB may be dropped |
The pattern here is clear: event loss happens when you accept an event (return a 2xx) before you've persisted it durably. If the process dies between "return 200" and "write to Postgres," the event is lost from your side. The upstream provider believes it delivered successfully and won't retry.
The fix is deceptively simple: do not return 2xx until the event is committed to durable storage. Accept, write, then respond. This adds a database round-trip to your ingest latency, but it gives you the durability guarantee that makes recovery possible.
## Ingest Durability: The Foundation
Here's the ingest path that ensures no event is acknowledged before it's persisted:
```go
func (h *IngestHandler) Handle(w http.ResponseWriter, r *http.Request) {
	// Verify signature before touching the DB
	body, err := io.ReadAll(io.LimitReader(r.Body, 2<<20))
	if err != nil {
		httpx.BadRequest(w, "failed to read body")
		return
	}
	if err := h.verifier.Verify(r, body); err != nil {
		httpx.Unauthorized(w, "invalid signature")
		return
	}

	// Write to DB — this must succeed before we respond
	event := &events.Event{
		ID:         uuid.New(),
		AccountID:  h.accountID,
		SourceID:   h.source.ID,
		Status:     events.StatusReceived,
		Payload:    body,
		ReceivedAt: time.Now().UTC(),
	}
	if err := h.store.Insert(r.Context(), event); err != nil {
		// Return 500 — provider will retry
		httpx.InternalError(w, err)
		return
	}

	// Only now do we acknowledge receipt
	httpx.Ok(w, map[string]string{"id": event.ID.String()})
}
```

Two things to notice:
First, we return a 500 on database failure, not a 200. This is the correct behavior. A 500 tells the upstream provider "I didn't get this, please retry." A 200 tells it "I have this, you can move on." The former triggers provider retry logic; the latter permanently discards the event from the provider's perspective.
Second, we're relying on the request context for the DB write. If the client disconnects before the write completes, the context is cancelled and the write aborts cleanly — no partial commit. The provider sees a network error and retries.
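For concreteness, here is roughly what `h.store.Insert` could look like behind the handler: a minimal sketch assuming `database/sql` over Postgres and an `events` table whose columns mirror the struct fields. The `Store` type and column names are illustrative, not GetHook's actual implementation.

```go
type Store struct {
	db *sql.DB
}

// Insert persists a single event. The caller passes the request context, so a
// client disconnect cancels the in-flight query rather than leaving it racing
// the response.
func (s *Store) Insert(ctx context.Context, e *events.Event) error {
	_, err := s.db.ExecContext(ctx, `
		INSERT INTO events (id, account_id, source_id, status, payload, received_at)
		VALUES ($1, $2, $3, $4, $5, $6)`,
		e.ID, e.AccountID, e.SourceID, e.Status, e.Payload, e.ReceivedAt)
	return err
}
```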
## Deployment Safety: Graceful Draining
A surprisingly common source of event loss is deployments. If your process is killed mid-request before the DB write commits, the event is lost.
The solution is a graceful shutdown sequence:
```go
func main() {
	srv := &http.Server{
		Addr:    ":8080",
		Handler: buildRouter(),
	}

	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	<-quit
	log.Println("shutdown signal received, draining in-flight requests...")

	// Give in-flight requests up to 30 seconds to complete
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Fatalf("server forced to shutdown: %v", err)
	}
	log.Println("shutdown complete")
}
```

`http.Server.Shutdown` stops accepting new connections immediately and waits for all in-flight requests to complete before returning. Set the timeout conservatively — 30 seconds is enough for any reasonable DB write to complete. Your load balancer should stop routing to the instance as soon as the SIGTERM lands, so new requests won't pile up during the drain window.
In Kubernetes, set `terminationGracePeriodSeconds` to at least 45 (the 30s drain window plus a buffer), and configure a `preStop` sleep of 5 seconds to give the load balancer time to deregister the pod before SIGTERM arrives.
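As a sketch, the relevant pod spec fields look something like this (names are illustrative, and the container image needs a `sleep` binary for the `preStop` hook):

```yaml
spec:
  terminationGracePeriodSeconds: 45   # 30s drain window plus buffer before SIGKILL
  containers:
    - name: gethook-api
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]   # let the load balancer deregister before SIGTERM
```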
## Database High Availability: Don't Let the DB Be Your Single Point of Failure
If your gateway's Postgres instance is on a single host with no replica, you've traded one single point of failure for another. A Postgres primary failure means your gateway can't persist events — and correctly returns 500s to providers — but the provider's retry window is now your recovery time objective.
The standard configuration for production:
- **Streaming replication to at least one standby.** Write-ahead log (WAL) ships to the replica synchronously or asynchronously depending on your durability/latency trade-off.
- **Synchronous replication for the ingest path.** With `synchronous_commit = on` and a synchronous standby, a committed write is on at least two hosts before you return 200. A single host failure loses no committed data.
- **Automatic failover via Patroni or similar.** Manual failover during an incident is slow and error-prone. Automatic failover with a short election timeout (10–30 seconds) keeps your 500 window short.
```sql
-- On the primary: verify synchronous replication is configured
SHOW synchronous_commit;         -- should be 'on' or 'remote_apply'
SHOW synchronous_standby_names;  -- should list your standby

-- Check replication lag from the primary
SELECT
    application_name,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;
```

If replication lag is consistently above a few megabytes, your standby can't keep up with write throughput. Either tune your write patterns, scale up the standby, or accept that failover will replay some WAL — and that replay window becomes your recovery time.
## Replay Windows and Cursor Tracking
Even with durable ingest and HA Postgres, you need a plan for replaying missed events after an outage. The core question: which events did your delivery worker process, and which did it not?
This is a cursor problem. Your worker needs to track the last successfully processed event position so that after a restart, it picks up exactly where it left off — not from the beginning, not from "now."
A simple cursor table:
```sql
CREATE TABLE worker_cursors (
    worker_id         TEXT PRIMARY KEY,
    last_event_id     UUID,
    last_processed_at TIMESTAMPTZ,
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

The worker updates this cursor atomically with the delivery attempt:
```sql
-- In a single transaction:
BEGIN;

-- Mark the event as delivered
UPDATE events
SET status = 'delivered', delivered_at = now()
WHERE id = $1;

-- Advance the cursor
INSERT INTO worker_cursors (worker_id, last_event_id, last_processed_at, updated_at)
VALUES ($2, $1, now(), now())
ON CONFLICT (worker_id) DO UPDATE
    SET last_event_id = EXCLUDED.last_event_id,
        last_processed_at = EXCLUDED.last_processed_at,
        updated_at = EXCLUDED.updated_at;

COMMIT;
```

On restart, the worker reads its cursor and resumes from that position:
```sql
SELECT id, payload, status
FROM events
WHERE id > COALESCE(
        (SELECT last_event_id FROM worker_cursors WHERE worker_id = $1),
        '00000000-0000-0000-0000-000000000000'::uuid)  -- no cursor yet: start from the beginning
  AND status IN ('received', 'queued')
ORDER BY id
LIMIT 100;
```

This gives you exactly-once cursor semantics within Postgres. Events processed before the crash have their status updated; events after the cursor position are picked up again on restart. One caveat: cursoring on `id` assumes IDs sort in ingest order, so use time-ordered UUIDs (v7) or a monotonic position column; with random v4 IDs, cursor on `received_at` instead.
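Tying the two statements together, the worker's per-event step might look like the sketch below. It assumes `database/sql`, the schema above, a hypothetical `Event` struct for the fetched row, and a hypothetical `deliver` function that POSTs the payload to its destination.

```go
// processOne delivers one event and, only if delivery succeeds, marks it
// delivered and advances the cursor in the same transaction.
func processOne(ctx context.Context, db *sql.DB, workerID string, ev Event) error {
	// Deliver first; on failure the event keeps its status and is retried later.
	if err := deliver(ctx, ev); err != nil {
		return err
	}

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	if _, err := tx.ExecContext(ctx,
		`UPDATE events SET status = 'delivered', delivered_at = now() WHERE id = $1`,
		ev.ID); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx, `
		INSERT INTO worker_cursors (worker_id, last_event_id, last_processed_at, updated_at)
		VALUES ($1, $2, now(), now())
		ON CONFLICT (worker_id) DO UPDATE
		   SET last_event_id = EXCLUDED.last_event_id,
		       last_processed_at = EXCLUDED.last_processed_at,
		       updated_at = EXCLUDED.updated_at`,
		workerID, ev.ID); err != nil {
		return err
	}
	return tx.Commit()
}
```

If the process dies anywhere before `Commit`, neither the status nor the cursor moves and the event is re-fetched on restart; the destination may see the same event twice, which is why webhook consumers should treat deliveries as at-least-once.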
## Runbook: When the Gateway Actually Goes Down
Having the right architecture is necessary but not sufficient. You need a runbook your on-call engineer can follow at 3am without making things worse.
| Step | Action | Why |
|---|---|---|
| 1. Confirm scope | Check health endpoint, DB connectivity, worker status separately | Determines which component failed |
| 2. Check provider retry windows | Look up each active source's provider docs | Establishes your time pressure |
| 3. Restore service | Restart process, restore DB, promote standby — whichever applies | Get ingest accepting again |
| 4. Audit ingest gap | Query events table for gap in received_at timestamps | Identifies missing events by time range |
| 5. Contact providers if needed | For critical providers, use their event log UI to identify events sent during gap | Manual recovery for events outside provider retry window |
| 6. Replay from provider | Use the provider's event replay API if available (Stripe exposes a full events API; GitHub only supports redelivering individual deliveries) | Fills in events that didn't survive the gap |
| 7. Update post-mortem | Document gap duration, events lost (if any), root cause, fix | Prevents recurrence |
Step 4 is worth elaborating. A gap in your received_at timestamps is the clearest signal that your gateway was down and not accepting events. If you see no events between 14:32 and 14:47, that's your outage window. Query your events table:
```sql
SELECT
    source_id,
    date_trunc('minute', received_at) AS minute,
    COUNT(*) AS event_count
FROM events
WHERE received_at BETWEEN '2026-04-11 14:00:00' AND '2026-04-11 15:00:00'
GROUP BY source_id, minute
ORDER BY minute;
```

A missing minute in an otherwise active source is a confirmed gap. A source that normally processes 50 events/minute dropping to 0 for 15 minutes, then recovering, is your outage fingerprint.
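If you'd rather have the gap surfaced for you, generate the expected minutes and left-join the per-minute counts. This is a sketch against the same events table; minutes with zero events are your gap:

```sql
WITH minutes AS (
  SELECT generate_series(
           '2026-04-11 14:00:00'::timestamptz,
           '2026-04-11 15:00:00'::timestamptz,
           interval '1 minute') AS minute
)
SELECT m.minute, count(e.id) AS event_count
FROM minutes m
LEFT JOIN events e
       ON date_trunc('minute', e.received_at) = m.minute
GROUP BY m.minute
HAVING count(e.id) = 0
ORDER BY m.minute;
```

Once the window is confirmed, step 6 is mechanical for providers that expose an events API. With Stripe, for example, you can list the events created during the gap (epoch seconds) and re-ingest anything your gateway never recorded; the environment variables here are placeholders:

```sh
curl -G https://api.stripe.com/v1/events \
  -u "$STRIPE_SECRET_KEY:" \
  -d "created[gte]=$GAP_START_EPOCH" \
  -d "created[lte]=$GAP_END_EPOCH" \
  -d "limit=100"
```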
## Testing Your Recovery Path
Disaster recovery plans that have never been tested don't work when you need them. Run a quarterly chaos drill:
```sh
# Kill the gateway process (simulates crash)
kill -9 $(pgrep -f gethook-api)

# Send a test event directly to the source endpoint
curl -X POST https://your-gateway.example.com/ingest/src_test123 \
  -H "Content-Type: application/json" \
  -d '{"type": "test.event", "id": "drill-001"}'
# Expect 502 or connection refused — gateway is down

# Restart the gateway
./bin/gethook-api &

# Verify the cursor position and that the worker resumes correctly
psql $DATABASE_URL -c "SELECT * FROM worker_cursors WHERE worker_id = 'primary-worker';"

# Verify no events were lost from before the crash
psql $DATABASE_URL -c "SELECT COUNT(*) FROM events WHERE status = 'delivered' AND received_at > now() - INTERVAL '10 minutes';"
```

Run this against staging with real event volume, not just a single test curl. The failure mode that bites you is usually at load — in-flight requests that were mid-commit when the process died.
GetHook's delivery worker uses a Postgres-backed cursor with transactional status updates, so restarts are safe by default — the worker picks up from its last committed position without manual intervention. But if you're running your own infrastructure, building and testing this cursor behavior yourself is non-negotiable before going to production.
Webhook gateway disaster recovery is not glamorous, but it's the difference between an outage that loses events and one that doesn't. The core principle is straightforward: persist before acknowledging, drain before shutting down, and know exactly where your worker left off. Everything else — HA Postgres, replay windows, provider retry budget — is detail work on top of that foundation.
Get started with GetHook and skip building durable ingest infrastructure from scratch →