webhooks · reliability · backend · engineering

Webhook Acknowledgment Patterns: Timeout Budgets and the 5-Second Rule

Most webhook failures start before your business logic ever runs. Understanding acknowledgment timeouts, async handoff patterns, and provider retry behavior is the foundation of a reliable webhook consumer.

Tomasz Brzezinski
Staff Infrastructure Engineer
April 2, 2026
9 min read

Every webhook provider has a timeout budget. Stripe gives you 30 seconds. GitHub gives you 10. Shopify gives you 5. Exceed that budget, and the provider treats your endpoint as failed and schedules a retry — even if your handler eventually succeeds.

This sounds simple until you're running database migrations, processing image uploads, or calling three downstream services inside a webhook handler. Suddenly a "quick handler" takes 12 seconds, and you have a retry storm on your hands.

This post covers the mechanics of webhook acknowledgment, patterns for staying inside timeout budgets, and how async handoff actually works in practice.


What Providers Actually Measure

When a provider sends a webhook, the clock starts the moment their request leaves their servers. They measure time to first byte of your HTTP response — not time to process, not time to persist, not time to trigger downstream work.
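
If you want to see roughly the same number the provider sees (minus network and TLS overhead), a small piece of middleware that records the time from request arrival to the first response write is enough. This is a minimal sketch; the wrapper type and the record callback are illustrative, not part of any provider SDK:

go
// timingWriter records how long the handler took to start writing its
// response -- roughly what the provider's time-to-first-byte clock
// measures once network overhead is subtracted.
type timingWriter struct {
    http.ResponseWriter
    start    time.Time
    record   func(time.Duration)
    recorded bool
}

func (t *timingWriter) WriteHeader(code int) {
    t.observe()
    t.ResponseWriter.WriteHeader(code)
}

func (t *timingWriter) Write(b []byte) (int, error) {
    t.observe() // covers handlers that never call WriteHeader explicitly
    return t.ResponseWriter.Write(b)
}

func (t *timingWriter) observe() {
    if !t.recorded {
        t.record(time.Since(t.start))
        t.recorded = true
    }
}

// AckLatency wraps a webhook handler and feeds acknowledgment latency
// into whatever metrics system you already use.
func AckLatency(next http.Handler, record func(time.Duration)) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        next.ServeHTTP(&timingWriter{ResponseWriter: w, start: time.Now(), record: record}, r)
    })
}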

Here's what different providers will do when your endpoint times out:

Provider    Timeout   Retry schedule                      Max attempts
Stripe      30s       1h, 1h, 2h, 4h, 8h, 20h, 1d, 3d     ~8 over 3 days
GitHub      10s       ~1min, gradual backoff              ~3
Shopify     5s        1min, 5min, 1h, 8h, 24h, 48h, 72h   ~7 over 48h
Twilio      15s       5min, 10min, 30min, 1h, 4h          5
PagerDuty   16s       30s, 1min, 2min, 5min               ~5
SendGrid    30s       None (best-effort)                  1

A few observations from this table: SendGrid does not retry at all, so if you miss it, you miss it. GitHub's retry window is extremely narrow. Shopify's 5-second budget is the tightest of the major providers and catches most teams off-guard the first time they integrate.


The Anti-Pattern: Synchronous Handlers

The most common webhook reliability mistake is processing synchronously inside the handler:

go
func (h *WebhookHandler) HandleStripe(w http.ResponseWriter, r *http.Request) {
    body, _ := io.ReadAll(r.Body)

    // Signature verification — fast, ~1ms
    if !verifyStripe(body, r.Header.Get("Stripe-Signature"), h.secret) {
        http.Error(w, "unauthorized", http.StatusUnauthorized)
        return
    }

    var event StripeEvent
    json.Unmarshal(body, &event)

    // ❌ Everything below happens synchronously, inside the provider's timeout window
    if event.Type == "customer.subscription.created" {
        h.db.CreateSubscriptionRecord(r.Context(), event)     // 50-300ms normally, 5s+ under load
        h.email.SendWelcomeEmail(r.Context(), event)          // 200-2000ms, external HTTP call
        h.crm.UpdateCustomerPlan(r.Context(), event)          // 100-800ms, external HTTP call
        h.billing.ProvisionLimits(r.Context(), event)         // 100-500ms, might hit DB lock
        h.slack.NotifyTeam(r.Context(), event)                // 200-1000ms, external HTTP call
    }

    w.WriteHeader(http.StatusOK)
}

In steady state, this might complete in 800ms. Under load, when your database is slow, when Slack's API is degraded, or when you're in the middle of a deploy, this blows past 5 seconds easily. The provider retries. Your idempotency layer (if you have one) saves you from duplicate charges. If you don't have one, you're double-provisioning customers.


The Right Pattern: Acknowledge First, Process Second

The fix is a strict two-phase design:

  1. Receive and persist — verify the signature, write the raw event to a durable store, return 200
  2. Process asynchronously — a worker reads from the store and executes your business logic
go
func (h *WebhookHandler) HandleStripe(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(io.LimitReader(r.Body, 10<<20))
    if err != nil {
        http.Error(w, "read error", http.StatusBadRequest)
        return
    }

    sig := r.Header.Get("Stripe-Signature")
    if !verifyStripe(body, sig, h.secret) {
        http.Error(w, "unauthorized", http.StatusUnauthorized)
        return
    }

    // ✅ Only fast, low-risk operations before the 200
    eventID, err := h.queue.Enqueue(r.Context(), QueuedEvent{
        Source:    "stripe",
        Body:      body,
        Headers:   extractHeaders(r.Header),
        ReceivedAt: time.Now().UTC(),
    })
    if err != nil {
        // Queue write failed — return 500, provider will retry
        http.Error(w, "internal error", http.StatusInternalServerError)
        return
    }

    w.Header().Set("X-Event-ID", eventID)
    w.WriteHeader(http.StatusOK)
}

The enqueue operation should be fast: a single INSERT into Postgres or a push to a local queue, targeting under 50ms. Add body reading and signature verification and the whole acknowledge path should land around 100-250ms, well inside even Shopify's 5-second limit.
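
For completeness, here is what that verification step might look like. This is a minimal sketch of the verifyStripe helper used above, assuming Stripe's t=<timestamp>,v1=<hex HMAC> header format; in production, the official stripe-go library's webhook.ConstructEvent does the same work and is the safer choice:

go
// verifyStripe checks a Stripe-Signature header of the form
// "t=1712054400,v1=5257a869e7..." against an HMAC-SHA256 of
// "<timestamp>.<body>" computed with the endpoint's signing secret.
func verifyStripe(body []byte, header, secret string) bool {
    var timestamp, signature string
    for _, part := range strings.Split(header, ",") {
        switch {
        case strings.HasPrefix(part, "t="):
            timestamp = strings.TrimPrefix(part, "t=")
        case strings.HasPrefix(part, "v1="):
            signature = strings.TrimPrefix(part, "v1=")
        }
    }
    if timestamp == "" || signature == "" {
        return false
    }

    // Reject stale timestamps to limit the replay window.
    ts, err := strconv.ParseInt(timestamp, 10, 64)
    if err != nil || time.Since(time.Unix(ts, 0)) > 5*time.Minute {
        return false
    }

    mac := hmac.New(sha256.New, []byte(secret))
    mac.Write([]byte(timestamp + "."))
    mac.Write(body)
    expected := hex.EncodeToString(mac.Sum(nil))

    // Constant-time comparison to avoid leaking signature bytes.
    return hmac.Equal([]byte(expected), []byte(signature))
}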


Choosing the Right Queue Backend

Your queue backend determines how much durability and throughput you get for the async handoff.

Queue                               Durability       Latency    Throughput              Operational overhead
Postgres (FOR UPDATE SKIP LOCKED)   High             5–50ms     1K–10K/s                None (already running)
Redis (LPUSH/BRPOP)                 Low (AOF only)   1–5ms      100K+/s                 Separate service, config
SQS                                 High             20–100ms   Effectively unlimited   AWS dependency
RabbitMQ                            High             2–10ms     50K+/s                  Separate service, ops
Kafka                               Very high        10–50ms    Very high               High ops complexity

For most teams at up to 50K events/day, Postgres is the right answer. You already have it. It gives you transactional guarantees — you can INSERT the event and enqueue the job in the same transaction. If your process crashes between the enqueue and the job pickup, nothing is lost.

sql
-- Insert event and job atomically
BEGIN;

INSERT INTO webhook_events (id, source, body, received_at)
VALUES ($1, $2, $3, $4);

INSERT INTO webhook_jobs (id, event_id, status, next_attempt_at)
VALUES (gen_random_uuid(), $1, 'queued', NOW());

COMMIT;
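
The other half of the two-phase design is the worker that drains this table. Here is a minimal sketch using FOR UPDATE SKIP LOCKED against the schema above; the 'done' status and the flat one-minute retry are assumptions, and a real worker would back off exponentially and cap the attempt count:

go
// claimAndProcess pulls one queued job, runs the business logic, and
// marks it done -- all inside a transaction, so a crashed worker simply
// releases its row lock and another worker picks the job up again.
func (w *Worker) claimAndProcess(ctx context.Context) error {
    tx, err := w.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    var jobID, eventID string
    var body []byte
    err = tx.QueryRowContext(ctx, `
        SELECT j.id, e.id, e.body
        FROM webhook_jobs j
        JOIN webhook_events e ON e.id = j.event_id
        WHERE j.status = 'queued' AND j.next_attempt_at <= NOW()
        ORDER BY j.next_attempt_at
        FOR UPDATE OF j SKIP LOCKED
        LIMIT 1`).Scan(&jobID, &eventID, &body)
    if err == sql.ErrNoRows {
        return nil // nothing queued; caller sleeps and polls again
    }
    if err != nil {
        return err
    }

    if err := w.process(ctx, eventID, body); err != nil {
        // Processing failed: push the next attempt out and keep the job queued.
        _, _ = tx.ExecContext(ctx,
            `UPDATE webhook_jobs SET next_attempt_at = NOW() + INTERVAL '1 minute' WHERE id = $1`, jobID)
        return tx.Commit()
    }

    if _, err := tx.ExecContext(ctx,
        `UPDATE webhook_jobs SET status = 'done' WHERE id = $1`, jobID); err != nil {
        return err
    }
    return tx.Commit()
}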

Redis is faster but gives you a narrow durability window with default configuration. If your Redis process crashes between the enqueue and the worker pickup, the event is gone. For webhook workloads where the provider will retry on your behalf, this may be acceptable — but verify the provider's retry behavior before relying on it.
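
If you do go with Redis, the LPUSH/BRPOP pair from the table looks roughly like this with the go-redis client (a sketch assuming go-redis v9; the key name is illustrative):

go
// Producer side: push the raw event onto a list. O(1), typically 1-5ms.
func enqueueRedis(ctx context.Context, rdb *redis.Client, body []byte) error {
    return rdb.LPush(ctx, "webhooks:incoming", body).Err()
}

// Consumer side: block until an event is available, then process it.
func consumeRedis(ctx context.Context, rdb *redis.Client, process func([]byte) error) error {
    for {
        // BRPOP returns [key, value]; a zero timeout blocks indefinitely.
        res, err := rdb.BRPop(ctx, 0, "webhooks:incoming").Result()
        if err != nil {
            return err
        }
        if err := process([]byte(res[1])); err != nil {
            // The event is already popped -- on failure it is gone unless
            // you re-push it or move it through a processing list with BLMOVE.
            return err
        }
    }
}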


Timeout Budget Accounting

Here's a model for thinking about your budget before you write a single line of handler code:

Provider timeout (e.g., 5s for Shopify)
  - Network round-trip from provider:        ~20-80ms
  - TLS handshake (if new connection):       ~50-100ms
  - Reverse proxy overhead (nginx/LB):       ~5-20ms
  - Request body read (10KB payload):        ~1-5ms
  - Signature verification (HMAC-SHA256):    ~1-2ms
  - Queue INSERT (Postgres on same AZ):      ~5-50ms
  - Response write + flush:                  ~1-5ms
---------------------------------------------------
  Budget consumed:                           ~85-260ms
  Remaining margin:                          ~4.7-4.9s

That margin looks comfortable — until you factor in:

  • Cold database connections on a freshly restarted process: add 200-500ms
  • Connection pool exhaustion during traffic spikes: add 500ms-2s of wait
  • GC pause in a garbage-collected runtime: add 50-500ms
  • P99 queue INSERT under load: can be 10-20x the P50

The practical rule: target under 200ms for the acknowledge-and-enqueue path, measured at P99. If you're regularly hitting 500ms+, something is wrong — investigate before it causes intermittent timeouts under load.
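
One way to enforce that rule rather than just observe it is to put an explicit deadline on the acknowledge path, so a slow queue write turns into a fast 500 (which the provider retries) instead of a hung request. A sketch of the enqueue call from earlier with a deadline added; the 2-second ceiling is an arbitrary choice, picked to sit well under the tightest provider budget you serve:

go
// Enforce an internal deadline on the acknowledge path. If the queue
// write cannot finish well inside the provider's budget, fail fast and
// let the provider's retry schedule do its job.
ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
defer cancel()

eventID, err := h.queue.Enqueue(ctx, QueuedEvent{
    Source:     "stripe",
    Body:       body,
    Headers:    extractHeaders(r.Header),
    ReceivedAt: time.Now().UTC(),
})
if err != nil {
    // Covers both queue errors and context.DeadlineExceeded.
    http.Error(w, "internal error", http.StatusInternalServerError)
    return
}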


Handling Queue Write Failures

What happens if your queue write fails? You have two options:

Option A: Return 500, let the provider retry. This works if the provider's retry window is long enough and your failure is transient. Stripe will retry for 3 days. GitHub will retry 3 times in a narrow window — less forgiving.

Option B: Accept the event anyway and log for manual recovery. Write the raw event to a fallback store (even a local file) and return 200. This prevents provider retries from amplifying a failure, but requires manual intervention to process the logged events.

For most teams, Option A is correct. The provider's retry is your safety net for transient failures. The danger is if your queue is consistently unavailable — in that case, provider retries will exhaust and events will be lost. Alert on queue write failure rates, not just queue depths.
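
In practice that means instrumenting the enqueue call itself, not just the worker. A sketch using a Prometheus counter (metric and label names are illustrative):

go
// One counter per source; alert on its rate, not on queue depth, since
// provider retries are the only thing standing between you and lost events.
var webhookEnqueueFailures = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "webhook_enqueue_failures_total",
        Help: "Webhook events that could not be written to the durable queue.",
    },
    []string{"source"},
)

func init() { prometheus.MustRegister(webhookEnqueueFailures) }

// In the handler, count every failed queue write on the path that returns 500:
if err != nil {
    webhookEnqueueFailures.WithLabelValues("stripe").Inc()
    http.Error(w, "internal error", http.StatusInternalServerError)
    return
}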


Testing Your Timeout Margin

Add a test that verifies your handler's P99 latency under simulated load. Here's a simple benchmark using Go's testing package:

go
func BenchmarkWebhookAcknowledge(b *testing.B) {
    // Set up test server with real Postgres queue
    srv, db := setupTestServer(b)
    defer db.Close()

    body := generateStripePayload()
    sig := signStripePayload(body, testSecret)

    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            req := httptest.NewRequest("POST", "/webhooks/stripe", bytes.NewReader(body))
            req.Header.Set("Stripe-Signature", sig)
            rr := httptest.NewRecorder()

            srv.HandleStripe(rr, req)

            if rr.Code != http.StatusOK {
                b.Errorf("expected 200, got %d", rr.Code)
            }
        }
    })
}

Run this with -benchtime=10s -count=3 and look at the ns/op, using -cpu=10 (or b.SetParallelism) so RunParallel simulates roughly ten concurrent goroutines of realistic load. The benchmark reports a mean rather than a P99, so treat it as a smoke test: a mean anywhere near 200ms means your P99 is already in trouble. If that happens, trace where the latency is coming from before going to production.


What GetHook Does for You

When you use GetHook as your ingest layer, the acknowledge-and-enqueue path is handled for you. Events arrive at your GetHook source endpoint, are verified, persisted to a durable Postgres-backed queue, and your 200 is returned — all within the provider's timeout budget. Your application receives a clean, pre-verified event via outbound delivery with its own retry logic, decoupled from the provider's retry window.

This means the pattern described in this post — acknowledge fast, process async — is the default architecture, not something you have to build and maintain.


Summary

The five-second rule is not optional — it's enforced by providers through retries and eventual event abandonment. Design your webhook handlers around it from day one:

  1. Verify the signature and return 200 in under 200ms at P99.
  2. Write the raw event to a durable queue before returning 200.
  3. Process business logic asynchronously in a worker.
  4. Benchmark your acknowledge path under realistic concurrency.
  5. Alert on queue write failures, not just slow processing.

Reliability is a property of the whole system, but the acknowledge path is where it starts. Get that right, and everything downstream is recoverable.

Set up a GetHook source and get acknowledgment + retry handled automatically →

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.