Every webhook provider has a timeout budget. Stripe gives you 30 seconds. GitHub gives you 10. Shopify gives you 5. Exceed that budget, and the provider treats your endpoint as failed and schedules a retry — even if your handler eventually succeeds.
This sounds simple until you're running database migrations, processing image uploads, or calling three downstream services inside a webhook handler. Suddenly a "quick handler" takes 12 seconds, and you have a retry storm on your hands.
This post covers the mechanics of webhook acknowledgment, patterns for staying inside timeout budgets, and how async handoff actually works in practice.
What Providers Actually Measure
When a provider sends a webhook, the clock starts the moment their request leaves their servers. They measure time to first byte of your HTTP response — not time to process, not time to persist, not time to trigger downstream work.
Here's what different providers will do when your endpoint times out:
| Provider | Timeout | Retry schedule | Max attempts |
|---|---|---|---|
| Stripe | 30s | 1h, 1h, 2h, 4h, 8h, 20h, 1d, 3d | ~8 over 3 days |
| GitHub | 10s | ~1min, gradual backoff | ~3 |
| Shopify | 5s | 1min, 5min, 1h, 8h, 24h, 48h, 72h | ~7 over 72h |
| Twilio | 15s | 5min, 10min, 30min, 1h, 4h | 5 |
| PagerDuty | 16s | 30s, 1min, 2min, 5min | ~5 |
| SendGrid | 30s | None (best-effort) | 1 |
A few observations from this table: SendGrid does not retry at all, so if you miss it, you miss it. GitHub's retry window is extremely narrow. Shopify's 5-second budget is the tightest of the major providers and catches most teams off-guard the first time they integrate.
The Anti-Pattern: Synchronous Handlers
The most common webhook reliability mistake is processing synchronously inside the handler:
```go
func (h *WebhookHandler) HandleStripe(w http.ResponseWriter, r *http.Request) {
	body, _ := io.ReadAll(r.Body)

	// Signature verification — fast, ~1ms
	if !verifyStripe(body, r.Header.Get("Stripe-Signature"), h.secret) {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}

	var event StripeEvent
	json.Unmarshal(body, &event)

	// ❌ Everything below happens synchronously, inside the provider's timeout window
	if event.Type == "customer.subscription.created" {
		h.db.CreateSubscriptionRecord(r.Context(), event) // 50-300ms normally, 5s+ under load
		h.email.SendWelcomeEmail(r.Context(), event)      // 200-2000ms, external HTTP call
		h.crm.UpdateCustomerPlan(r.Context(), event)      // 100-800ms, external HTTP call
		h.billing.ProvisionLimits(r.Context(), event)     // 100-500ms, might hit DB lock
		h.slack.NotifyTeam(r.Context(), event)            // 200-1000ms, external HTTP call
	}

	w.WriteHeader(http.StatusOK)
}
```

In steady state, this might complete in 800ms. Under load — when your database is slow, when Slack's API is degraded, or when you're in the middle of a deploy — this blows past 5 seconds easily. The provider retries. Your idempotency layer (if you have one) saves you from duplicate charges. If you don't have one, you're double-provisioning customers.
The Right Pattern: Acknowledge First, Process Second
The fix is a strict two-phase design:
1. Receive and persist — verify the signature, write the raw event to a durable store, return 200.
2. Process asynchronously — a worker reads from the store and executes your business logic.
```go
func (h *WebhookHandler) HandleStripe(w http.ResponseWriter, r *http.Request) {
	// Cap the body read at 10MB to bound memory use.
	body, err := io.ReadAll(io.LimitReader(r.Body, 10<<20))
	if err != nil {
		http.Error(w, "read error", http.StatusBadRequest)
		return
	}

	sig := r.Header.Get("Stripe-Signature")
	if !verifyStripe(body, sig, h.secret) {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}

	// ✅ Only fast, low-risk operations before the 200
	eventID, err := h.queue.Enqueue(r.Context(), QueuedEvent{
		Source:     "stripe",
		Body:       body,
		Headers:    extractHeaders(r.Header),
		ReceivedAt: time.Now().UTC(),
	})
	if err != nil {
		// Queue write failed — return 500, provider will retry
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}

	w.Header().Set("X-Event-ID", eventID)
	w.WriteHeader(http.StatusOK)
}
```

The enqueue operation should be fast — a single INSERT into Postgres or a push to a local queue. Target under 50ms. Combined with body reading and signature verification, the whole acknowledge path should complete in under 250ms — well inside even Shopify's 5-second limit.
Choosing the Right Queue Backend
Your queue backend determines how much durability and throughput you get for the async handoff.
| Queue | Durability | Latency | Throughput | Operational overhead |
|---|---|---|---|---|
| Postgres (FOR UPDATE SKIP LOCKED) | High | 5–50ms | 1K–10K/s | None (already running) |
| Redis (LPUSH/BRPOP) | Low (AOF only) | 1–5ms | 100K+/s | Separate service, config |
| SQS | High | 20–100ms | Effectively unlimited | AWS dependency |
| RabbitMQ | High | 2–10ms | 50K+/s | Separate service, ops |
| Kafka | Very high | 10–50ms | Very high | High ops complexity |
For most teams at up to 50K events/day, Postgres is the right answer. You already have it. It gives you transactional guarantees — you can INSERT the event and enqueue the job in the same transaction. If your process crashes between the enqueue and the job pickup, nothing is lost.
```sql
-- Insert event and job atomically
BEGIN;

INSERT INTO webhook_events (id, source, body, received_at)
VALUES ($1, $2, $3, $4);

INSERT INTO webhook_jobs (id, event_id, status, next_attempt_at)
VALUES (gen_random_uuid(), $1, 'queued', NOW());

COMMIT;
```

Redis is faster but gives you a narrow durability window with default configuration. If your Redis process crashes between the enqueue and the worker pickup, the event is gone. For webhook workloads where the provider will retry on your behalf, this may be acceptable — but verify the provider's retry behavior before relying on it.
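On the worker side, the FOR UPDATE SKIP LOCKED pattern from the table above lets multiple workers claim jobs concurrently without blocking on each other's row locks. A sketch against the webhook_jobs table (the claimed_at column is an assumed addition to the schema shown earlier):

```sql
-- Claim one queued job; SKIP LOCKED makes concurrent workers
-- pick different rows instead of waiting on each other.
UPDATE webhook_jobs
SET status = 'processing', claimed_at = NOW()
WHERE id = (
  SELECT id FROM webhook_jobs
  WHERE status = 'queued' AND next_attempt_at <= NOW()
  ORDER BY next_attempt_at
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
RETURNING id, event_id;
```

Run this in a loop per worker; an empty result means the queue is drained and the worker can sleep briefly before polling again.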
Timeout Budget Accounting
Here's a model for thinking about your budget before you write a single line of handler code:
```
Provider timeout (e.g., 5s for Shopify)
  - Network round-trip from provider:      ~20-80ms
  - TLS handshake (if new connection):     ~50-100ms
  - Reverse proxy overhead (nginx/LB):     ~5-20ms
  - Request body read (10KB payload):      ~1-5ms
  - Signature verification (HMAC-SHA256):  ~1-2ms
  - Queue INSERT (Postgres on same AZ):    ~5-50ms
  - Response write + flush:                ~1-5ms
  ---------------------------------------------------
  Budget consumed:                         ~85-260ms
  Remaining margin:                        ~4.7-4.9s
```

That margin looks comfortable — until you factor in:
- Cold database connections on a freshly restarted process: add 200-500ms
- Connection pool exhaustion during traffic spikes: add 500ms-2s of wait
- GC pause in a garbage-collected runtime: add 50-500ms
- P99 queue INSERT under load: can be 10-20x the P50
The practical rule: target under 200ms for the acknowledge-and-enqueue path, measured at P99. If you're regularly hitting 500ms+, something is wrong — investigate before it causes intermittent timeouts under load.
Handling Queue Write Failures
What happens if your queue write fails? You have two options:
Option A: Return 500, let the provider retry. This works if the provider's retry window is long enough and your failure is transient. Stripe will retry for 3 days. GitHub will retry 3 times in a narrow window — less forgiving.
Option B: Accept the event anyway and log for manual recovery. Write the raw event to a fallback store (even a local file) and return 200. This prevents provider retries from amplifying a failure, but requires manual intervention to process the logged events.
For most teams, Option A is correct. The provider's retry is your safety net for transient failures. The danger is if your queue is consistently unavailable — in that case, provider retries will exhaust and events will be lost. Alert on queue write failure rates, not just queue depths.
Testing Your Timeout Margin
Add a test that verifies your handler's P99 latency under simulated load. Here's a simple benchmark using Go's testing package:
```go
func BenchmarkWebhookAcknowledge(b *testing.B) {
	// Set up test server with real Postgres queue
	srv, db := setupTestServer(b)
	defer db.Close()

	body := generateStripePayload()
	sig := signStripePayload(body, testSecret)

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			req := httptest.NewRequest("POST", "/webhooks/stripe", bytes.NewReader(body))
			req.Header.Set("Stripe-Signature", sig)
			rr := httptest.NewRecorder()
			srv.HandleStripe(rr, req)
			if rr.Code != http.StatusOK {
				b.Errorf("expected 200, got %d", rr.Code)
			}
		}
	})
}
```

Run this with -benchtime=10s -count=3 and look at the ns/op. Bear in mind that ns/op is a mean, not a tail percentile — if the mean under 10 concurrent goroutines is anywhere near 200ms, your P99 is well past it. If you don't see comfortable headroom, trace where the latency is coming from before going to production.
What GetHook Does for You
When you use GetHook as your ingest layer, the acknowledge-and-enqueue path is handled for you. Events arrive at your GetHook source endpoint, are verified, persisted to a durable Postgres-backed queue, and your 200 is returned — all within the provider's timeout budget. Your application receives a clean, pre-verified event via outbound delivery with its own retry logic, decoupled from the provider's retry window.
This means the pattern described in this post — acknowledge fast, process async — is the default architecture, not something you have to build and maintain.
Summary
The five-second rule is not optional — it's enforced by providers through retries and eventual event abandonment. Design your webhook handlers around it from day one:
- Verify the signature and return 200 in under 200ms at P99.
- Write the raw event to a durable queue before returning 200.
- Process business logic asynchronously in a worker.
- Benchmark your acknowledge path under realistic concurrency.
- Alert on queue write failures, not just slow processing.
Reliability is a property of the whole system, but the acknowledge path is where it starts. Get that right, and everything downstream is recoverable.
Set up a GetHook source and get acknowledgment + retry handled automatically →