Webhook reliability is usually discussed from the sender's perspective: retry schedules, dead-letter queues, exponential backoff. But the most common root cause of failed deliveries in production is not a broken sender — it's a slow or overloaded consumer endpoint that times out before it can return a 200.
When your endpoint takes 35 seconds to respond and the sender's timeout is 30 seconds, the sender sees a network timeout, not a success. It reschedules the delivery. Your application may have processed the event perfectly — but the sender has no way to know that, so it will deliver the same event again. Now you have a duplicate. If the slow endpoint is caused by load (a burst of incoming events), every retry adds more load, and you have the conditions for a retry storm.
This post covers how senders interpret timeouts, how to design consumer endpoints that respond fast regardless of processing complexity, and how to measure response time as a first-class reliability metric.
How Senders Interpret Your Response
Webhook senders distinguish between several failure modes, and they do not all behave the same way:
| Response type | What the sender sees | Typical action |
|---|---|---|
| 200 OK within timeout | Success | No retry |
| 2xx within timeout | Success (some senders require exactly 200) | No retry |
| 4xx (except 429) | Consumer error — event rejected | Usually no retry; may dead-letter immediately |
| 5xx | Transient server error | Retry with backoff |
| 429 Too Many Requests | Consumer is rate-limiting | Retry, ideally honoring Retry-After |
| TCP timeout / no response | Network or processing timeout | Retry with backoff |
| Connection refused | Endpoint is down | Retry with backoff |
The critical insight: a timeout and a 503 look identical to most retry logic. Both result in rescheduled delivery. The difference is that a timeout means your handler may have already done the work — so the retry creates a duplicate execution risk that a clean 503 does not.
The other key insight is the timeout budget: most webhook senders use a timeout between 5 and 30 seconds. GitHub's webhook delivery timeout is 10 seconds. Stripe's is 30 seconds. Shopify's is 5 seconds. If your endpoint occasionally takes 12 seconds to respond, you will see retries from Shopify and GitHub even when your handler ultimately succeeds.
The Slow Handler Anti-Pattern
The most common consumer design mistake is doing all processing synchronously inside the HTTP handler:
// Anti-pattern: everything happens before you respond
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	body, _ := io.ReadAll(r.Body)
	// Verify signature
	if !verifySignature(body, r.Header.Get("Webhook-Signature")) {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}
	var event WebhookEvent
	json.Unmarshal(body, &event)
	// This can take 5–30 seconds depending on your database and downstream APIs:
	processEventSynchronously(event)
	w.WriteHeader(http.StatusOK)
}

If processEventSynchronously calls your database, sends an email, hits a third-party API, or does anything non-trivial, you are holding the sender's connection open for the duration. Under load — when your database is slow or your downstream API is degraded — this turns every webhook delivery into a ticking timeout clock.
The Correct Pattern: Accept Fast, Process Async
The correct design is to do the minimum necessary work before responding, then hand off processing to a background worker:
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, 1<<20) // 1 MB limit
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "request too large", http.StatusRequestEntityTooLarge)
		return
	}
	// Verify before anything else — unauthenticated requests get dropped immediately
	if !verifySignature(body, r.Header.Get("Webhook-Signature"), signingSecret) {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}
	// Persist the raw payload to your queue for async processing
	jobID, err := enqueueEvent(r.Context(), body)
	if err != nil {
		// If you cannot enqueue, return 500 — the sender should retry
		http.Error(w, "failed to enqueue", http.StatusInternalServerError)
		return
	}
	// Respond immediately — the sender's job is done
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{"queued": jobID})
}

The three operations before the response — body read, signature verification, and enqueue — should each complete in under 5 ms under normal conditions. Total handler latency under 15 ms is achievable and keeps you well inside even Shopify's 5-second timeout.
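Signature verification itself should be a pure in-memory HMAC check, with no network or database calls. Here is a minimal sketch of a verifySignature helper, assuming the provider sends a hex-encoded HMAC-SHA256 of the raw body in the Webhook-Signature header — real schemes vary (GitHub prefixes "sha256=", Stripe includes a timestamp), so adapt the parsing to what you actually receive:

// Sketch only: assumes a hex-encoded HMAC-SHA256 of the raw body.
// Uses crypto/hmac, crypto/sha256, and encoding/hex.
func verifySignature(body []byte, signatureHeader string, secret []byte) bool {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	expected := mac.Sum(nil)

	received, err := hex.DecodeString(signatureHeader)
	if err != nil {
		return false
	}
	// hmac.Equal is a constant-time comparison, which avoids leaking
	// signature bytes through response timing.
	return hmac.Equal(expected, received)
}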
The enqueueEvent function writes to a local Postgres table or an in-process channel. It is not a call to an external service. If your "queue" is a third-party message broker with variable network latency, you've moved the timeout risk into the enqueue step.
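As a rough sketch, a Postgres-backed enqueueEvent can be a single INSERT — the webhook_events table, its columns, and the package-level db handle here are illustrative, not a fixed schema:

// Illustrative sketch: enqueue is one INSERT into a local table.
// db is assumed to be a package-level *sql.DB.
func enqueueEvent(ctx context.Context, body []byte) (string, error) {
	var jobID string
	err := db.QueryRowContext(ctx,
		`INSERT INTO webhook_events (payload, status, received_at)
		 VALUES ($1, 'pending', NOW())
		 RETURNING id`,
		body,
	).Scan(&jobID)
	return jobID, err
}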
Choosing Your Queue for the Enqueue Step
The enqueue step should be as fast and reliable as possible. Your options, ranked by latency and operational simplicity:
| Queue type | Typical enqueue latency | Reliability | Notes |
|---|---|---|---|
| In-process channel (buffered) | < 1 µs | Low — lost on crash | Only safe for non-critical events |
| Local Postgres table | 1–5 ms | High | Durable; survives restarts; simple to operate |
| Redis / Valkey LIST | < 1 ms | Medium | Requires Redis; AOF for durability |
| External broker (SQS, Kafka, RabbitMQ) | 5–50 ms | High | Network hop on critical path; adds timeout risk |
For most teams, a local Postgres table with FOR UPDATE SKIP LOCKED for the background worker is the right choice. It is durable, requires no infrastructure beyond the database you already run, and enqueue latency is consistent. This is the pattern GetHook uses for its delivery worker — a durable queue that keeps enqueue on the fast path and processes events asynchronously.
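A minimal sketch of that background worker, reusing the illustrative webhook_events table from above — handleEvent stands in for whatever your processing logic is:

// Sketch: claim and process one pending event per call. FOR UPDATE
// SKIP LOCKED lets several worker processes poll the same table without
// blocking on or double-claiming each other's rows.
func processNext(ctx context.Context, db *sql.DB) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	var id string
	var payload []byte
	err = tx.QueryRowContext(ctx,
		`SELECT id, payload FROM webhook_events
		 WHERE status = 'pending'
		 ORDER BY received_at
		 LIMIT 1
		 FOR UPDATE SKIP LOCKED`,
	).Scan(&id, &payload)
	if errors.Is(err, sql.ErrNoRows) {
		return nil // queue is empty
	}
	if err != nil {
		return err
	}

	// handleEvent is a placeholder for your actual processing logic.
	if err := handleEvent(ctx, payload); err != nil {
		return err // row stays 'pending' and is retried on a later poll
	}

	if _, err := tx.ExecContext(ctx,
		`UPDATE webhook_events SET status = 'done' WHERE id = $1`, id); err != nil {
		return err
	}
	return tx.Commit()
}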
What to Measure
Response time should be a first-class metric on your consumer endpoints, not an afterthought. The metrics that matter:
p50 / p95 / p99 response time — Your average (mean) response time hides the tail. A p99 of 8 seconds means 1 in 100 requests takes 8 seconds or longer. For a consumer processing 10,000 events per day, that's 100 timeout-risk deliveries daily.
Timeout rate — Track HTTP requests from known webhook senders that do not complete within your target budget (e.g., 4 seconds). This is distinct from your server's response time; it represents the fraction of requests where the sender would have given up.
Enqueue error rate — If your enqueue step fails, the consumer correctly returns 5xx and the sender retries. Track this rate separately from processing errors so you can distinguish "my queue is full" from "my processing logic is broken."
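If you are not recording these numbers yet, a thin piece of middleware around the ingest handler can capture per-request latency. A sketch, writing to the webhook_ingest_requests table used by the query below — the asynchronous insert keeps measurement itself off the critical path:

// Sketch: wrap the webhook handler and record how long each request took.
func measureIngest(db *sql.DB, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		elapsed := time.Since(start)

		// Record asynchronously so the measurement never slows the response.
		go func() {
			_, _ = db.Exec(
				`INSERT INTO webhook_ingest_requests (created_at, response_ms)
				 VALUES (NOW(), $1)`,
				elapsed.Milliseconds(),
			)
		}()
	})
}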
A simple query to surface slow responses from your access log or application metrics:
SELECT
	date_trunc('hour', created_at) AS hour,
	COUNT(*) AS total_requests,
	PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY response_ms) AS p50_ms,
	PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_ms) AS p95_ms,
	PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_ms) AS p99_ms,
	COUNT(*) FILTER (WHERE response_ms > 4000) AS at_risk_count
FROM webhook_ingest_requests
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY 1
ORDER BY 1 DESC;

Alert when your p99 exceeds 60% of your target timeout budget. If your target is 5 seconds, alert at p99 > 3 seconds. This gives you headroom to investigate before the timeout rate starts climbing.
Handling Slow Dependencies Gracefully
Even with async processing, your enqueue step can become slow if it depends on a degraded resource. The mitigations:
Statement timeouts on enqueue. Set a short query timeout on your enqueue INSERT — 200 ms is reasonable. If the database is slow enough to miss this, return a 503 so the sender retries. A fast failure is better than holding the connection open until the sender's 30-second timeout expires.
ctx, cancel := context.WithTimeout(r.Context(), 200*time.Millisecond)
defer cancel()
jobID, err := enqueueEventWithContext(ctx, body)
if err != nil {
	if errors.Is(err, context.DeadlineExceeded) {
		http.Error(w, "queue temporarily unavailable", http.StatusServiceUnavailable)
		return
	}
	// ... other error handling
}

In-memory overflow buffer. During database degradation, a small in-memory channel (1,000–5,000 entries) can absorb bursts. Accept events into the channel, respond 200, and drain the channel to the database once it is healthy again. The risk is data loss on process restart — only appropriate for events where losing a small window is acceptable.
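A sketch of that overflow path — the channel size, errOverloaded sentinel, and retry interval are illustrative:

// Illustrative overflow buffer: try the durable enqueue first with a short
// timeout, fall back to a bounded channel, and signal overload only when
// both fail. Anything still in the channel is lost if the process restarts.
var overflow = make(chan []byte, 5000)

var errOverloaded = errors.New("overflow buffer full")

func enqueueWithOverflow(ctx context.Context, body []byte) error {
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()
	if _, err := enqueueEvent(ctx, body); err == nil {
		return nil
	}
	select {
	case overflow <- body:
		return nil // buffered in memory; drained to the database later
	default:
		return errOverloaded // buffer full — caller should return 503
	}
}

// drainOverflow runs in a background goroutine and keeps retrying the
// durable enqueue until the database accepts each buffered event.
func drainOverflow(ctx context.Context) {
	for body := range overflow {
		for {
			if _, err := enqueueEvent(ctx, body); err == nil {
				break
			}
			time.Sleep(time.Second) // database still degraded; retry shortly
		}
	}
}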
Graceful degradation headers. If you know your system is under load, return 429 Too Many Requests with a Retry-After header rather than letting requests queue up and time out. Senders that honor Retry-After will back off, reducing inbound pressure while you recover.
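A sketch of that early exit at the top of the handler — isOverloaded and the 30-second hint are placeholders for whatever load signal and recovery estimate you trust (queue depth, enqueue error rate):

// Sketch: shed load explicitly instead of timing out.
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	if isOverloaded() { // placeholder for your own load check
		w.Header().Set("Retry-After", "30") // seconds; well-behaved senders back off
		http.Error(w, "overloaded, retry later", http.StatusTooManyRequests)
		return
	}
	// ... normal verify-and-enqueue path
}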
Idempotency as a Timeout Safety Net
No matter how fast your handler is, timeouts will happen under extreme conditions. The consumer's defense is idempotency: processing the same event twice should produce the same result as processing it once.
The standard pattern is to store the event's unique identifier (from the Webhook-ID or X-Request-ID header, or extracted from the payload) before processing, and check for it before doing work:
func processEvent(ctx context.Context, db *sql.DB, eventID string, payload []byte) error {
	// Insert the event ID first. ON CONFLICT DO NOTHING produces no row for
	// a duplicate, so Scan returns sql.ErrNoRows for already-seen events.
	var inserted bool
	err := db.QueryRowContext(ctx,
		`INSERT INTO processed_events (event_id, processed_at)
		 VALUES ($1, NOW())
		 ON CONFLICT (event_id) DO NOTHING
		 RETURNING true`,
		eventID,
	).Scan(&inserted)
	if errors.Is(err, sql.ErrNoRows) {
		// Conflict — event was already processed
		return nil
	}
	if err != nil {
		return err
	}
	// Safe to process — first time seeing this event_id.
	// In production, run the insert and the work in one transaction so a
	// failed attempt does not leave the event permanently marked as processed.
	return doActualWork(ctx, payload)
}

Idempotency means a retry caused by a timeout is harmless rather than dangerous. It is not a substitute for fast response times — it is a safety net for when timeouts occur despite your best efforts.
Putting It Together
The reliability properties of your webhook consumer depend on three things working together:
- Fast handler response — verify, enqueue, respond within your timeout budget (target < 1 second, hard maximum < half the sender's timeout)
- Durable async processing — a persistent queue that survives restarts and processes events even when downstream dependencies are slow
- Idempotent processing — retries caused by timeouts produce no side effects
If you are receiving webhooks from multiple providers with different timeout budgets (Shopify at 5 seconds, Stripe at 30 seconds), design for the strictest one. A handler that responds in under 2 seconds works for every provider. A handler that sometimes takes 10 seconds works for Stripe but not Shopify — and will generate unnecessary retries and duplicate deliveries from Shopify in production.
GetHook's ingest path follows this pattern exactly: every inbound event is persisted to a Postgres-backed queue before the 200 is returned. Verification and enqueue together take under 10 ms at p99, which keeps GetHook well within every major provider's timeout window even under load.
Your consumer endpoint's p99 response time is your webhook reliability SLA. If your p99 is slow, your reliability is low — regardless of how good your sender's retry logic is. Measure it, alert on it, and design your handler to be fast by construction rather than fast by luck.
If you want to receive webhooks without worrying about consumer timeouts on your side, GetHook handles ingest, verification, and queuing for you — your application only sees processed, deduplicated events from a stable internal queue.