Most webhook reliability writing focuses on the sender side: retries, exponential backoff, dead-letter queues. These matter. But the consumer side has an equally important failure surface that gets less attention. When your consumer endpoint is slow, overloaded, or partially degraded, your response — not your gateway's retry logic — determines whether events are processed reliably or silently dropped.
This post is about designing webhook consumers that fail well. We'll cover how to signal backpressure correctly, how to degrade components independently so a downstream database issue doesn't take down your ingest endpoint, and how to recover to steady state without triggering a retry storm on the sender side.
The Two Failure Modes No One Designs For
When a webhook consumer is overloaded, it typically falls into one of two failure modes, both of which are worse than the alternative we'll describe:
Failure mode 1: Slow processing with delayed responses. The endpoint accepts the request, begins processing synchronously, hits a slow database query or downstream API call, and returns 200 OK after 25 seconds. The sender logs a success. Your processing queue is backing up. The next 100 requests take 30 seconds each. Your server runs out of threads. New requests start timing out.
Failure mode 2: Returning 500 under load. The endpoint detects it's overwhelmed and returns 500 Internal Server Error. The sender schedules a retry. Fine — but if 40 senders all received that 500 at roughly the same time, their retries arrive together: a synchronized burst well above your normal request rate. The first wave of retries hits a still-overloaded consumer. More 500s. More retries. Retry storm.
The correct design avoids both: accept events fast, defer processing, and return 429 when you genuinely cannot accept more work rather than 500 when your downstream is struggling.
Principle 1: Decouple Ingest from Processing
The first architectural decision that makes consumer-side reliability tractable is separating ingest from processing. Your HTTP endpoint's only job is to write the event to a durable buffer — a database table, a queue, or a message broker — and return immediately.
```go
func (h *WebhookHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
body, err := io.ReadAll(io.LimitReader(r.Body, 5<<20))
if err != nil {
http.Error(w, "read error", http.StatusBadRequest)
return
}
if !h.verifySignature(r.Header, body) {
http.Error(w, "invalid signature", http.StatusUnauthorized)
return
}
// Write to the buffer. This is the only IO the handler does.
if err := h.buffer.Enqueue(r.Context(), body); err != nil {
// Buffer is full or unavailable — signal backpressure
if errors.Is(err, ErrBufferFull) {
w.Header().Set("Retry-After", "30")
http.Error(w, "temporarily unavailable", http.StatusTooManyRequests)
return
}
http.Error(w, "internal error", http.StatusInternalServerError)
return
}
w.WriteHeader(http.StatusAccepted) // 202, not 200 — we've accepted, not processed
}
```

Notice a few things:
- We return `202 Accepted`, not `200 OK`. This is semantically correct: the event has been accepted for processing, not processed. Most well-implemented senders treat 2xx identically, but the distinction matters for your own observability — you can distinguish "accepted" from "fully processed" in your logs.
- Buffer full returns `429 Too Many Requests` with a `Retry-After` header, not `503 Service Unavailable`. Many senders treat 503 as a hard failure; 429 signals that you're intentionally rate-limiting them and they should back off on a schedule.
- No business logic, no database queries, no downstream API calls in the handler. The handler's latency should be deterministic and low — measured in single-digit milliseconds.
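One piece the handler relies on but doesn't show is `verifySignature`. As a minimal sketch, assuming an HMAC-SHA256 scheme with a hex-encoded digest in an `X-Webhook-Signature` header and a `signingSecret` field on the handler (both are assumptions; use whatever your sender actually documents):

```go
// Minimal sketch of signature verification, assuming HMAC-SHA256 over the raw
// body with a hex-encoded digest in X-Webhook-Signature. Requires crypto/hmac,
// crypto/sha256, and encoding/hex. The header name and h.signingSecret are
// assumptions, not part of the handler shown above.
func (h *WebhookHandler) verifySignature(header http.Header, body []byte) bool {
    provided, err := hex.DecodeString(header.Get("X-Webhook-Signature"))
    if err != nil {
        return false
    }
    mac := hmac.New(sha256.New, h.signingSecret)
    mac.Write(body)
    // Constant-time comparison avoids leaking timing information.
    return hmac.Equal(provided, mac.Sum(nil))
}
```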
Principle 2: Size Your Buffer Explicitly
If your buffer is a Postgres table (the simplest durable option), define a maximum queue depth and enforce it at enqueue time:
```sql
CREATE TABLE webhook_ingest_buffer (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_id TEXT NOT NULL UNIQUE,
payload JSONB NOT NULL,
received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
locked_until TIMESTAMPTZ,
attempts INT NOT NULL DEFAULT 0
);
-- Note: Postgres rejects now() in a partial index predicate (it isn't IMMUTABLE),
-- so keep the index unconditional and let the poll query filter on lock state.
CREATE INDEX webhook_buffer_pending ON webhook_ingest_buffer (received_at);
```

At enqueue time, check depth before inserting:
```go
const maxBufferDepth = 10_000
func (b *PostgresBuffer) Enqueue(ctx context.Context, payload []byte) error {
var depth int
err := b.db.QueryRowContext(ctx,
`SELECT COUNT(*) FROM webhook_ingest_buffer
WHERE locked_until IS NULL OR locked_until < now()`).Scan(&depth)
if err != nil {
return fmt.Errorf("depth check: %w", err)
}
if depth >= maxBufferDepth {
return ErrBufferFull
}
_, err = b.db.ExecContext(ctx,
`INSERT INTO webhook_ingest_buffer (event_id, payload)
VALUES ($1, $2)
ON CONFLICT (event_id) DO NOTHING`,
extractEventID(payload), payload)
return err
}
```

A buffer depth of 10,000 events is a reasonable starting point for most workloads. At 100 events/second throughput, this is 100 seconds of buffer — enough time to restart a crashed worker without dropping inbound events. Tune the number based on your event volume and your workers' per-event processing time.
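The `locked_until` and `attempts` columns exist for the worker side, which we'll build in Principle 5. The `Poll` call used there isn't shown in this post; one plausible sketch, assuming at-least-once semantics with a 60-second lease and an `Event` struct carrying `ID` and `Payload` (all assumptions), looks like this:

```go
// Sketch of the claim step: lease up to `limit` pending events for 60 seconds
// and bump their attempt count. The Event struct and the lease length are
// assumptions, not something the post prescribes.
func (b *PostgresBuffer) Poll(ctx context.Context, limit int) ([]Event, error) {
    rows, err := b.db.QueryContext(ctx, `
        UPDATE webhook_ingest_buffer
        SET locked_until = now() + interval '60 seconds',
            attempts = attempts + 1
        WHERE id IN (
            SELECT id FROM webhook_ingest_buffer
            WHERE locked_until IS NULL OR locked_until < now()
            ORDER BY received_at
            LIMIT $1
            FOR UPDATE SKIP LOCKED
        )
        RETURNING event_id, payload`, limit)
    if err != nil {
        return nil, fmt.Errorf("poll: %w", err)
    }
    defer rows.Close()

    var events []Event
    for rows.Next() {
        var e Event
        if err := rows.Scan(&e.ID, &e.Payload); err != nil {
            return nil, err
        }
        events = append(events, e)
    }
    return events, rows.Err()
}
```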
Principle 3: Use 429, Not 500, Under Load
The difference between 429 Too Many Requests and 500 Internal Server Error matters a great deal for sender behavior:
| Status code | Sender interpretation | Retry behavior |
|---|---|---|
| 200 OK | Event received and processed | No retry |
| 202 Accepted | Event queued for processing | No retry (success) |
| 429 Too Many Requests | Sender is throttled, try later | Back off per Retry-After header |
| 503 Service Unavailable | Transient error | Retry with backoff (often aggressive) |
| 500 Internal Server Error | Handler crash | Retry with backoff (often aggressive) |
When your buffer is full, 429 is the semantically correct response and the operationally correct one. It tells the sender: "You're healthy, I'm not, wait 30 seconds." A well-implemented sender — and most production webhook gateways, including GetHook — will honor the Retry-After value and pause delivery to your endpoint specifically, without affecting delivery to other destinations.
500 tells the sender something went wrong internally, which triggers normal backoff retry. The problem: normal backoff retry starts at a short interval (often 30 seconds) and retries quickly in the first few rounds. If 50 senders all got 500 at the same time, your consumer gets hammered with retries in a tight window. 429 with a 30-second Retry-After distributes that load.
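The handler in Principle 1 hard-codes `Retry-After: 30`, which is a fine default. If you want the hint to track reality, one sketch of deriving it from the backlog and a measured drain rate (a hypothetical helper, not something the handler above requires) might look like this:

```go
// Sketch of a dynamic Retry-After derived from queue depth. drainPerSecond is
// whatever your workers actually sustain; the half-full recovery target and
// the 5s/300s clamps are arbitrary choices, not recommendations from the post.
func retryAfterSeconds(depth, maxDepth, drainPerSecond int) int {
    if drainPerSecond <= 0 {
        return 30 // no estimate available, fall back to a fixed hint
    }
    excess := depth - maxDepth/2 // aim to drain back to half-full
    if excess <= 0 {
        return 30
    }
    secs := (excess + drainPerSecond - 1) / drainPerSecond // ceiling division
    if secs < 5 {
        secs = 5
    }
    if secs > 300 {
        secs = 300
    }
    return secs
}
```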
Principle 4: Fail Components Independently
Your consumer has multiple failure surfaces: the ingest handler, the processing worker, the downstream database, and any third-party APIs your business logic calls. Design these to fail independently.
A common mistake is having ingest health checks that fail when the database is slow:
```go
// BAD: this causes 503 on every request if the DB is slow
func healthCheck(db *sql.DB) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
if err := db.PingContext(r.Context()); err != nil {
http.Error(w, "db unavailable", http.StatusServiceUnavailable)
return
}
w.WriteHeader(http.StatusOK)
}
}
```

If your load balancer uses this health check, a slow database query causes the load balancer to remove your ingest endpoint from the pool entirely. No more ingest. The sender gets connection refused or no route to host — and that triggers very aggressive retry behavior.
Instead, separate health checks by concern:
```go
// Liveness: is the process alive and able to accept connections?
mux.HandleFunc("/healthz/live", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK) // Always healthy if the process is running
})
// Readiness: is the buffer available for writes?
mux.HandleFunc("/healthz/ready", func(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
defer cancel()
depth, err := buffer.Depth(ctx)
if err != nil || depth >= maxBufferDepth {
http.Error(w, "not ready", http.StatusServiceUnavailable)
return
}
w.WriteHeader(http.StatusOK)
})
```

The liveness check tells Kubernetes the pod is alive — don't kill it. The readiness check tells the load balancer whether to send traffic — and it only fails when the buffer itself is full or unavailable, not when downstream processing is slow. Your ingest endpoint stays reachable even when your processing workers are backed up.
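The `buffer.Depth` call above is the same claimable-row count that `Enqueue` computes inline. A sketch of it against the Postgres buffer from Principle 2 (and `Enqueue` could reuse it instead of its inline query):

```go
// Count events that are currently claimable: never leased, or lease expired.
func (b *PostgresBuffer) Depth(ctx context.Context) (int, error) {
    var depth int
    err := b.db.QueryRowContext(ctx,
        `SELECT COUNT(*) FROM webhook_ingest_buffer
         WHERE locked_until IS NULL OR locked_until < now()`).Scan(&depth)
    return depth, err
}
```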
Principle 5: Bounded Concurrency in Workers
Your workers pull from the buffer and process events. Unbounded concurrency here creates problems: too many concurrent database connections, too many concurrent calls to a third-party API that has its own rate limits, and excessive memory use from goroutine stacks.
A semaphore pattern bounds concurrency without blocking the main worker loop:
```go
type Worker struct {
buffer *PostgresBuffer
semaphore chan struct{}
processor EventProcessor
}
func NewWorker(buffer *PostgresBuffer, maxConcurrent int, processor EventProcessor) *Worker {
return &Worker{
buffer: buffer,
semaphore: make(chan struct{}, maxConcurrent),
processor: processor,
}
}
func (w *Worker) Run(ctx context.Context) {
for {
select {
case <-ctx.Done():
return
default:
}
events, err := w.buffer.Poll(ctx, 10) // fetch up to 10 events
if err != nil || len(events) == 0 {
time.Sleep(500 * time.Millisecond)
continue
}
for _, event := range events {
w.semaphore <- struct{}{} // acquire slot
go func(e Event) {
defer func() { <-w.semaphore }() // release slot
if err := w.processor.Process(ctx, e); err != nil {
log.Printf("event %s failed: %v", e.ID, err)
w.buffer.Nack(ctx, e.ID) // return to queue
return
}
w.buffer.Ack(ctx, e.ID) // remove on success so the lease doesn't re-deliver it (Ack sketched below)
}(event)
}
}
}
```

With `maxConcurrent = 20`, you have at most 20 events in-flight at any time. This translates directly to at most 20 concurrent database connections and at most 20 concurrent calls to any downstream API your processor touches.
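The Ack and Nack calls used by the worker aren't defined anywhere in this post. Continuing the Postgres buffer from Principle 2, a sketch of both; the exact semantics (delete on success, release the lease on failure) are assumptions consistent with the Poll sketch earlier:

```go
// Completion calls for the lease-based buffer. Ack removes the row once the
// event is processed; Nack releases the lease so the event becomes claimable
// again. Deleting on Ack narrows the dedup window to in-flight events; keep
// the row with a processed marker if you need longer-lived deduplication.
func (b *PostgresBuffer) Ack(ctx context.Context, eventID string) error {
    _, err := b.db.ExecContext(ctx,
        `DELETE FROM webhook_ingest_buffer WHERE event_id = $1`, eventID)
    return err
}

func (b *PostgresBuffer) Nack(ctx context.Context, eventID string) error {
    // Releasing the lease immediately is the simplest option; a delay that
    // grows with the attempts column, plus a max-attempts cutoff, is a
    // sensible refinement.
    _, err := b.db.ExecContext(ctx,
        `UPDATE webhook_ingest_buffer SET locked_until = NULL WHERE event_id = $1`,
        eventID)
    return err
}
```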
Recovery: The Retry Storm Problem
When your consumer recovers after a period of unavailability, you face the inverse of the degradation problem: a large backlog of pending retries about to hit a freshly recovered endpoint simultaneously.
The shape of this problem:
```
[Degradation period: 10 minutes]
- 5,000 events were rejected or undelivered
- All 5,000 are now scheduled for retry
- They arrive in a burst within 60 seconds of your endpoint recovering
- Your endpoint hits peak load exactly when it's most fragile (just recovered)
```

Two things help here. First, the Retry-After header during degradation: if senders honor it, they stagger their retry times rather than piling up at the same timestamp. Second, buffer depth limiting during recovery: your endpoint is already applying backpressure via 429 if the buffer fills up during the recovery burst, which tells senders to slow down.
The worst outcome is a consumer that was returning 500 during degradation (encouraging aggressive retry) and then came back online without any rate limiting. That combination — aggressive pending retries hitting an unprotected endpoint — is the textbook retry storm.
Observability for Consumer Health
The metrics that tell you whether your consumer is healthy:
| Metric | What to alert on |
|---|---|
| webhook.ingest.buffer_depth | > 70% of max capacity |
| webhook.ingest.p99_latency_ms | > 50ms (ingest should be fast) |
| webhook.ingest.http_429_rate | Any nonzero rate (you're throttling) |
| webhook.worker.processing_lag_seconds | > 30s (worker falling behind) |
| webhook.worker.error_rate | > 1% sustained |
| webhook.worker.concurrency_saturation | Semaphore at 100% utilization sustained |
Buffer depth and processing lag are the leading indicators. By the time you're returning 429, you're already in a degraded state. Alerting at 70% buffer depth gives you time to investigate and add capacity before you start rejecting events.
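The exact metrics stack doesn't matter. As one stdlib-only illustration, a hypothetical `exportBufferDepth` helper could feed the buffer-depth gauge from the table above using `expvar` (the 10-second interval and the metric name here are arbitrary choices, not part of the post's design):

```go
// Expose buffer depth via the standard library's expvar package, served at
// /debug/vars on the default mux. Swap in your real metrics client as needed.
var bufferDepthGauge = expvar.NewInt("webhook.ingest.buffer_depth")

func exportBufferDepth(ctx context.Context, buffer *PostgresBuffer) {
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if depth, err := buffer.Depth(ctx); err == nil {
                bufferDepthGauge.Set(int64(depth))
            }
        }
    }
}
```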
Checklist
| Concern | Recommendation |
|---|---|
| Handler latency | Ingest-only, no business logic. Target < 10ms p99. |
| Backpressure signal | 429 with Retry-After when buffer is full, not 500. |
| Buffer durability | Postgres or a message broker — not in-memory. |
| Buffer depth limit | Explicit maximum; return 429 at the limit. |
| Health checks | Separate liveness and readiness probes. |
| Worker concurrency | Bounded via semaphore; size to downstream capacity. |
| Deduplication | event_id UNIQUE constraint on the buffer table. |
| Recovery | Buffer depth + 429 naturally throttle recovery bursts. |
Building a webhook consumer that degrades gracefully is straightforward once the principle is clear: accept fast, process separately, signal backpressure explicitly. The handler should be nearly stateless. The buffer should be durable and bounded. Workers should be concurrency-limited. Health checks should reflect ingest capacity, not processing health.
If you want the gateway side — retries, dead-letter queues, Retry-After honoring, and per-destination concurrency controls — handled for you, GetHook does all of that out of the box. Your consumer still owns the backpressure signaling; GetHook does the rest. Get started here.