Tags: webhooks, reliability, backpressure, architecture, distributed-systems

Designing for Webhook Consumer Failures: Backpressure, Graceful Degradation, and Recovery

When your webhook consumer can't keep up, the failure mode matters more than the failure itself. Here's how to design consumer endpoints that degrade gracefully, signal backpressure correctly, and recover cleanly — without losing events or triggering retry storms.

Dmitri Volkov
Distributed Systems Engineer
April 18, 2026
11 min read

Most webhook reliability writing focuses on the sender side: retries, exponential backoff, dead-letter queues. These matter. But the consumer side has an equally important failure surface that gets less attention. When your consumer endpoint is slow, overloaded, or partially degraded, your response — not your gateway's retry logic — determines whether events are processed reliably or silently dropped.

This post is about designing webhook consumers that fail well. We'll cover how to signal backpressure correctly, how to degrade components independently so a downstream database issue doesn't take down your ingest endpoint, and how to recover to steady state without triggering a retry storm on the sender side.


The Two Failure Modes No One Designs For

When a webhook consumer is overloaded, it typically falls into one of two failure modes, both of which are worse than the alternative we'll describe:

Failure mode 1: Slow processing with delayed responses. The endpoint accepts the request, begins processing synchronously, hits a slow database query or downstream API call, and returns 200 OK after 25 seconds. The sender logs a success. Your processing queue is backing up. The next 100 requests take 30 seconds each. Your server runs out of threads. New requests start timing out.

Failure mode 2: Returning 500 under load. The endpoint detects it's overwhelmed and returns 500 Internal Server Error. The sender schedules a retry. Fine — but if 40 senders are all retrying simultaneously, your retry load is now 40x your original load arriving in bursts. The first wave of retries hits a still-overloaded consumer. More 500s. More retries. Retry storm.

The correct design avoids both: accept events fast, defer processing, and return 429 when you genuinely cannot accept more work rather than 500 when your downstream is struggling.


Principle 1: Decouple Ingest from Processing

The first architectural decision that makes consumer-side reliability tractable is separating ingest from processing. Your HTTP endpoint's only job is to write the event to a durable buffer — a database table, a queue, or a message broker — and return immediately.

```go
func (h *WebhookHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(io.LimitReader(r.Body, 5<<20))
    if err != nil {
        http.Error(w, "read error", http.StatusBadRequest)
        return
    }

    if !h.verifySignature(r.Header, body) {
        http.Error(w, "invalid signature", http.StatusUnauthorized)
        return
    }

    // Write to the buffer. This is the only IO the handler does.
    if err := h.buffer.Enqueue(r.Context(), body); err != nil {
        // Buffer is full or unavailable — signal backpressure
        if errors.Is(err, ErrBufferFull) {
            w.Header().Set("Retry-After", "30")
            http.Error(w, "temporarily unavailable", http.StatusTooManyRequests)
            return
        }
        http.Error(w, "internal error", http.StatusInternalServerError)
        return
    }

    w.WriteHeader(http.StatusAccepted) // 202, not 200 — we've accepted, not processed
}
```

Notice a few things:

  • We return 202 Accepted, not 200 OK. This is semantically correct: the event has been accepted for processing, not processed. Most well-implemented senders treat 2xx identically, but the distinction matters for your own observability — you can distinguish "accepted" from "fully processed" in your logs.
  • Buffer full returns 429 Too Many Requests with a Retry-After header, not 503 Service Unavailable. Many senders treat 503 as a hard failure; 429 signals that you're intentionally rate-limiting them and they should back off on a schedule.
  • No business logic, no database queries, no downstream API calls in the handler. The handler's latency should be deterministic and low — measured in single-digit milliseconds.

Principle 2: Size Your Buffer Explicitly

If your buffer is a Postgres table (the simplest durable option), define a maximum queue depth and enforce it at enqueue time:

```sql
CREATE TABLE webhook_ingest_buffer (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_id     TEXT NOT NULL UNIQUE,
    payload      JSONB NOT NULL,
    received_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    locked_until TIMESTAMPTZ,
    attempts     INT NOT NULL DEFAULT 0
);

CREATE INDEX webhook_buffer_pending ON webhook_ingest_buffer (received_at)
    WHERE locked_until IS NULL OR locked_until < now();
```

At enqueue time, check depth before inserting:

```go
const maxBufferDepth = 10_000

func (b *PostgresBuffer) Enqueue(ctx context.Context, payload []byte) error {
    var depth int
    err := b.db.QueryRowContext(ctx,
        `SELECT COUNT(*) FROM webhook_ingest_buffer
         WHERE locked_until IS NULL OR locked_until < now()`).Scan(&depth)
    if err != nil {
        return fmt.Errorf("depth check: %w", err)
    }
    if depth >= maxBufferDepth {
        return ErrBufferFull
    }

    _, err = b.db.ExecContext(ctx,
        `INSERT INTO webhook_ingest_buffer (event_id, payload)
         VALUES ($1, $2)
         ON CONFLICT (event_id) DO NOTHING`,
        extractEventID(payload), payload)
    return err
}
```

A buffer depth of 10,000 events is a reasonable starting point for most workloads. At 100 events/second throughput, this is 100 seconds of buffer — enough time to restart a crashed worker without dropping inbound events. Tune the number based on your event volume and your workers' per-event processing time.
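Workers also need a way to claim pending events from this table without stepping on each other. One common approach is `FOR UPDATE SKIP LOCKED`, which lets multiple workers poll concurrently without blocking. A sketch against the schema above (the 60-second lock window and batch size are illustrative):

```sql
-- Claim up to 10 pending events: lock them for 60 seconds so no other
-- worker picks them up, and bump the attempt counter.
UPDATE webhook_ingest_buffer
SET locked_until = now() + interval '60 seconds',
    attempts     = attempts + 1
WHERE id IN (
    SELECT id FROM webhook_ingest_buffer
    WHERE locked_until IS NULL OR locked_until < now()
    ORDER BY received_at
    LIMIT 10
    FOR UPDATE SKIP LOCKED
)
RETURNING id, event_id, payload;
```

If a worker crashes mid-batch, its lock simply expires and the events become pollable again — which is also why processing should be idempotent.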


Principle 3: Use 429, Not 500, Under Load

The difference between 429 Too Many Requests and 500 Internal Server Error matters a great deal for sender behavior:

| Status code | Sender interpretation | Retry behavior |
|---|---|---|
| 200 OK | Event received and processed | No retry |
| 202 Accepted | Event queued for processing | No retry (success) |
| 429 Too Many Requests | Sender is throttled, try later | Back off per Retry-After header |
| 503 Service Unavailable | Transient error | Retry with backoff (often aggressive) |
| 500 Internal Server Error | Handler crash | Retry with backoff (often aggressive) |

When your buffer is full, 429 is the semantically correct response and the operationally correct one. It tells the sender: "You're healthy, I'm not, wait 30 seconds." A well-implemented sender — and most production webhook gateways, including GetHook — will honor the Retry-After value and pause delivery to your endpoint specifically, without affecting delivery to other destinations.

500 tells the sender something went wrong internally, which triggers normal backoff retry. The problem: normal backoff retry starts at a short interval (often 30 seconds) and retries quickly in the first few rounds. If 50 senders all got 500 at the same time, your consumer gets hammered with retries in a tight window. 429 with a 30-second Retry-After distributes that load.


Principle 4: Fail Components Independently

Your consumer has multiple failure surfaces: the ingest handler, the processing worker, the downstream database, and any third-party APIs your business logic calls. Design these to fail independently.

A common mistake is having ingest health checks that fail when the database is slow:

```go
// BAD: this causes 503 on every request if the DB is slow
func healthCheck(db *sql.DB) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        if err := db.PingContext(r.Context()); err != nil {
            http.Error(w, "db unavailable", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}
```

If your load balancer uses this health check, a slow database query causes the load balancer to remove your ingest endpoint from the pool entirely. No more ingest. The sender gets connection refused or no route to host — and that triggers very aggressive retry behavior.

Instead, separate health checks by concern:

```go
// Liveness: is the process alive and able to accept connections?
mux.HandleFunc("/healthz/live", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK) // Always healthy if the process is running
})

// Readiness: is the buffer available for writes?
mux.HandleFunc("/healthz/ready", func(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
    defer cancel()

    depth, err := buffer.Depth(ctx)
    if err != nil || depth >= maxBufferDepth {
        http.Error(w, "not ready", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
})
```

The liveness check tells Kubernetes the pod is alive — don't kill it. The readiness check tells the load balancer whether to send traffic — and it only fails when the buffer itself is full or unavailable, not when downstream processing is slow. Your ingest endpoint stays reachable even when your processing workers are backed up.
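If the consumer runs on Kubernetes, the two endpoints might be wired up like this (port, intervals, and thresholds are illustrative, not from the post's deployment):

```yaml
# Sketch of probe wiring for the endpoints above.
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3     # restart only if the process itself is dead
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2     # stop routing traffic when the buffer is full
```

The asymmetry is deliberate: a failing readiness probe removes the pod from the service without restarting it, so a full buffer drains instead of being killed.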


Principle 5: Bounded Concurrency in Workers

Your workers pull from the buffer and process events. Unbounded concurrency here creates problems: too many concurrent database connections, too many concurrent calls to a third-party API that has its own rate limits, and excessive memory use from goroutine stacks.

A semaphore pattern bounds concurrency without blocking the main worker loop:

```go
type Worker struct {
    buffer    *PostgresBuffer
    semaphore chan struct{}
    processor EventProcessor
}

func NewWorker(buffer *PostgresBuffer, maxConcurrent int, processor EventProcessor) *Worker {
    return &Worker{
        buffer:    buffer,
        semaphore: make(chan struct{}, maxConcurrent),
        processor: processor,
    }
}

func (w *Worker) Run(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        default:
        }

        events, err := w.buffer.Poll(ctx, 10) // fetch up to 10 events
        if err != nil || len(events) == 0 {
            time.Sleep(500 * time.Millisecond)
            continue
        }

        for _, event := range events {
            w.semaphore <- struct{}{} // acquire slot
            go func(e Event) {
                defer func() { <-w.semaphore }() // release slot
                if err := w.processor.Process(ctx, e); err != nil {
                    log.Printf("event %s failed: %v", e.ID, err)
                    w.buffer.Nack(ctx, e.ID) // return to queue for another attempt
                    return
                }
                w.buffer.Ack(ctx, e.ID) // remove from buffer so it isn't re-polled
            }(event)
        }
    }
}
```

With maxConcurrent = 20, you have at most 20 events in-flight at any time. This translates directly to at most 20 concurrent database connections and at most 20 concurrent calls to any downstream API your processor touches.
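The bound is easy to verify in isolation. A self-contained sketch (names are illustrative) that pushes 200 tasks through a semaphore of size 20 and records the peak observed concurrency:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakConcurrency launches `tasks` goroutines through a semaphore of size
// maxConcurrent and reports the highest number seen in flight at once.
func peakConcurrency(tasks, maxConcurrent int) int64 {
	sem := make(chan struct{}, maxConcurrent)
	var inFlight, peak int64
	var wg sync.WaitGroup

	for i := 0; i < tasks; i++ {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks when maxConcurrent are busy
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			n := atomic.AddInt64(&inFlight, 1)
			// Record the highest concurrency observed so far.
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			time.Sleep(time.Millisecond) // simulate per-event work
			atomic.AddInt64(&inFlight, -1)
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println(peakConcurrency(200, 20) <= 20) // true: never more than 20 in flight
}
```

Note that acquiring the slot *before* spawning the goroutine is what makes the outer loop pause when all slots are busy — acquire inside the goroutine and you'd spawn all 200 immediately.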


Recovery: The Retry Storm Problem

When your consumer recovers after a period of unavailability, you face the inverse of the degradation problem: a large backlog of pending retries about to hit a freshly recovered endpoint simultaneously.

The shape of this problem:

[Degradation period: 10 minutes]
  - 5,000 events were rejected or undelivered
  - All 5,000 are now scheduled for retry
  - They arrive in a burst within 60 seconds of your endpoint recovering
  - Your endpoint hits peak load exactly when it's most fragile (just recovered)

Two things help here. First, the Retry-After header during degradation: if senders honor it, they stagger their retry times rather than piling up at the same timestamp. Second, buffer depth limiting during recovery: your endpoint is already applying backpressure via 429 if the buffer fills up during the recovery burst, which tells senders to slow down.
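A refinement worth considering: jitter the Retry-After value itself, so senders throttled at the same instant don't all come back at the same instant. A sketch, with illustrative base and spread values:

```go
package main

import (
	"fmt"
	"math/rand"
)

// jitteredRetryAfter spreads throttled senders uniformly over
// [base, base+spread) seconds instead of piling them on one timestamp.
// Tune base and spread to how long your buffer typically takes to drain.
func jitteredRetryAfter(base, spread int) int {
	return base + rand.Intn(spread)
}

func main() {
	v := jitteredRetryAfter(30, 30)
	fmt.Println(v >= 30 && v < 60) // true
	// In the handler: w.Header().Set("Retry-After", strconv.Itoa(v))
}
```

With 50 senders and a 30-second spread, the recovery burst arrives as roughly two senders per second instead of 50 at once.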

The worst outcome is a consumer that was returning 500 during degradation (encouraging aggressive retry) and then came back online without any rate limiting. That combination — aggressive pending retries hitting an unprotected endpoint — is the textbook retry storm.


Observability for Consumer Health

The metrics that tell you whether your consumer is healthy:

| Metric | What to alert on |
|---|---|
| webhook.ingest.buffer_depth | > 70% of max capacity |
| webhook.ingest.p99_latency_ms | > 50 ms (ingest should be fast) |
| webhook.ingest.http_429_rate | Any nonzero rate (you're throttling) |
| webhook.worker.processing_lag_seconds | > 30 s (worker falling behind) |
| webhook.worker.error_rate | > 1% sustained |
| webhook.worker.concurrency_saturation | Semaphore at 100% utilization sustained |

Buffer depth and processing lag are the leading indicators. By the time you're returning 429, you're already in a degraded state. Alerting at 70% buffer depth gives you time to investigate and add capacity before you start rejecting events.
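The 70% rule is cheap to encode next to whatever metrics pipeline you use. A sketch with illustrative names and thresholds:

```go
package main

import "fmt"

// bufferAlert classifies buffer utilization against the early-warning
// threshold discussed above. The 70% cutoff is a starting point, not a law.
func bufferAlert(depth, max int) string {
	switch {
	case depth >= max:
		return "rejecting" // buffer full: the endpoint is already returning 429
	case depth*10 >= max*7:
		return "warn" // >= 70%: investigate and add capacity now
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(bufferAlert(7_500, 10_000)) // warn
	fmt.Println(bufferAlert(2_000, 10_000)) // ok
}
```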


Checklist

| Concern | Recommendation |
|---|---|
| Handler latency | Ingest-only, no business logic. Target < 10 ms p99. |
| Backpressure signal | 429 with Retry-After when buffer is full, not 500. |
| Buffer durability | Postgres or a message broker — not in-memory. |
| Buffer depth limit | Explicit maximum; return 429 at the limit. |
| Health checks | Separate liveness and readiness probes. |
| Worker concurrency | Bounded via semaphore; size to downstream capacity. |
| Deduplication | event_id UNIQUE constraint on the buffer table. |
| Recovery | Buffer depth + 429 naturally throttle recovery bursts. |

Building a webhook consumer that degrades gracefully is straightforward once the principle is clear: accept fast, process separately, signal backpressure explicitly. The handler should be nearly stateless. The buffer should be durable and bounded. Workers should be concurrency-limited. Health checks should reflect ingest capacity, not processing health.

If you want the gateway side — retries, dead-letter queues, Retry-After honoring, and per-destination concurrency controls — handled for you, GetHook does all of that out of the box. Your consumer still owns the backpressure signaling; GetHook does the rest. Get started here.
