webhooks · infrastructure · reliability · postgres · queues

Webhook Backpressure: Handling a Growing Delivery Queue Without Dropping Events

When your delivery queue grows faster than your workers drain it, you have a backpressure problem. Here's how to detect it early, respond to it gracefully, and design your webhook infrastructure so a slow destination can't take down the rest.

Camille Beaumont
Backend Architect
March 28, 2026
9 min read

Retry logic and exponential backoff are well-understood in webhook infrastructure. What gets less attention is the failure mode that precedes retry exhaustion: your delivery queue growing faster than your workers can drain it.

This is backpressure. It happens when one or more destinations slow down or go offline, your worker pool size is fixed, and new events keep arriving. Retries pile up behind fresh events, queue depth climbs, and delivery latency grows for every tenant on the platform — including the ones whose destinations are perfectly healthy.

This post covers how to detect backpressure early, how to isolate its blast radius, and how to design your queue and worker architecture so a single misbehaving destination can't degrade delivery for everyone else.


Understanding the Failure Mode

Consider a Postgres-backed delivery queue — the pattern used by GetHook and many comparable systems. The worker loop looks roughly like this:

  1. A SELECT ... FOR UPDATE SKIP LOCKED query picks up the next event with next_attempt_at <= now()
  2. The worker issues an HTTP POST to the destination
  3. If the destination times out (after, say, 30 seconds), the worker marks the event for retry and moves on
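
In SQL, that pickup step looks roughly like this — a minimal sketch, assuming an events table with status and next_attempt_at columns:

```sql
-- Claim the next due event. SKIP LOCKED lets concurrent workers
-- pass over rows that another worker has already locked.
SELECT *
FROM events
WHERE status IN ('queued', 'retry_scheduled')
  AND next_attempt_at <= now()
ORDER BY next_attempt_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
```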

Step 3 is where the math bites: if a single destination has 100 pending events and each attempt burns a 30-second timeout, that's 3,000 seconds of timeout-bound worker time. Even with SKIP LOCKED allowing multiple workers to run concurrently, a destination that consistently times out can saturate your worker pool. Every occupied worker slot spends 30 seconds on a request it already knows will fail, while fresh events from healthy destinations sit waiting.

The queue depth metric tells you this is happening. The worker saturation metric tells you why.


The Key Metrics to Watch

Before building any mitigation, you need visibility into queue behavior. These are the four metrics that matter:

| Metric | Query | Alert threshold |
|---|---|---|
| Queue depth (total pending) | SELECT COUNT(*) FROM events WHERE status IN ('queued', 'retry_scheduled') | > 1,000 sustained for 5 min |
| Queue depth per destination | Group the query above by destination_id | > 200 for any single destination |
| Worker saturation | Active workers / total workers | > 85% sustained for 2 min |
| P95 queue wait time | now() - queued_at for delivered events | > 30s |

Queue depth per destination is the signal that tells you whether you have a platform-wide problem or a destination-specific one. If one destination has 800 pending events and all others are near zero, the problem is isolated — but it's still consuming workers that could be processing healthy traffic.

```sql
-- Queue depth by destination, ordered by worst offenders
SELECT
    d.id,
    d.name,
    d.url,
    COUNT(*) AS pending_count,
    MIN(e.created_at) AS oldest_event,
    MAX(e.attempts_count) AS max_attempts
FROM events e
JOIN destinations d ON e.destination_id = d.id
WHERE e.status IN ('queued', 'retry_scheduled')
  AND e.next_attempt_at <= now() + INTERVAL '10 minutes'
GROUP BY d.id, d.name, d.url
ORDER BY pending_count DESC
LIMIT 20;
```

Run this query on a schedule (or expose it as a dashboard endpoint) and you'll know immediately which destinations are causing congestion.
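
The P95 wait-time metric takes one more query. A sketch, assuming delivered events keep a delivered status plus queued_at and delivered_at timestamps:

```sql
-- P95 queue wait across events delivered in the last 15 minutes.
-- percentile_cont over an interval returns an interval.
SELECT percentile_cont(0.95) WITHIN GROUP (
    ORDER BY delivered_at - queued_at
) AS p95_queue_wait
FROM events
WHERE status = 'delivered'
  AND delivered_at > now() - INTERVAL '15 minutes';
```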


Isolation: Per-Destination Concurrency Limits

The core architectural fix is to prevent any single destination from consuming more than a bounded fraction of your worker pool. This is concurrency limiting at the destination level.

With a Postgres queue, you can enforce this in the worker's pickup query:

```sql
-- Pick up the next event, but skip destinations that already have
-- too many in-flight deliveries (tracked in a separate table).
SELECT e.*
FROM events e
WHERE e.status IN ('queued', 'retry_scheduled')
  AND e.next_attempt_at <= now()
  AND (
    SELECT COUNT(*) FROM delivery_attempts da
    WHERE da.destination_id = e.destination_id
      AND da.started_at > now() - INTERVAL '60 seconds'
      AND da.outcome IS NULL  -- still in flight
  ) < 5  -- max 5 concurrent per destination
ORDER BY e.next_attempt_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
```

The subquery counts in-flight deliveries for the candidate event's destination. If that destination already has 5 active deliveries, the worker skips it and picks the next eligible event instead.

This has a cost: the subquery adds latency to every worker pickup. At low queue depths this is negligible. At high queue depths, consider maintaining an in-memory or Redis counter for in-flight counts instead of querying delivery attempts. For a Postgres-first stack, a lightweight destination_inflight table with INSERT ... ON CONFLICT DO UPDATE and periodic cleanup works well at moderate scale.
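
A sketch of that counter approach — the destination_inflight table and its column names are illustrative, not part of the schema above:

```sql
CREATE TABLE destination_inflight (
    destination_id UUID PRIMARY KEY REFERENCES destinations(id),
    inflight       INT NOT NULL DEFAULT 0,
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- On pickup: bump the counter for the event's destination.
INSERT INTO destination_inflight (destination_id, inflight)
VALUES ($1, 1)
ON CONFLICT (destination_id) DO UPDATE
SET inflight = destination_inflight.inflight + 1,
    updated_at = now();

-- On completion (success or failure): release the slot.
-- GREATEST floors at zero in case a crash loses a decrement.
UPDATE destination_inflight
SET inflight = GREATEST(inflight - 1, 0),
    updated_at = now()
WHERE destination_id = $1;
```

The pickup query then compares inflight < 5 with a cheap single-row lookup instead of counting delivery_attempts rows.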


Circuit Breaking per Destination

Concurrency limits bound the ongoing damage from a slow destination. Circuit breaking prevents you from even trying destinations that are clearly down.

A simple three-state circuit breaker per destination:

| State | Behavior | Transition |
|---|---|---|
| Closed (normal) | Deliver as usual | → Open after N consecutive failures |
| Open (broken) | Skip delivery, return immediately | → Half-open after cooldown period |
| Half-open (testing) | Allow one delivery attempt | → Closed on success, Open on failure |

```go
package circuit

import "time"

type CircuitState int

const (
    CircuitClosed   CircuitState = iota
    CircuitOpen
    CircuitHalfOpen
)

// DestinationCircuit tracks breaker state for one destination. This sketch
// is not goroutine-safe; guard it with a mutex (or store state centrally,
// as below) when it's shared across workers.
type DestinationCircuit struct {
    State            CircuitState
    ConsecutiveFails int
    OpenedAt         time.Time
    Threshold        int           // fails before opening
    CooldownDuration time.Duration // how long to stay open
}

// ShouldAttempt reports whether the worker should try a delivery now,
// moving an open circuit to half-open once the cooldown has elapsed.
func (c *DestinationCircuit) ShouldAttempt() bool {
    switch c.State {
    case CircuitClosed:
        return true
    case CircuitOpen:
        if time.Since(c.OpenedAt) >= c.CooldownDuration {
            c.State = CircuitHalfOpen
            return true
        }
        return false
    case CircuitHalfOpen:
        // Simplification: every caller during half-open gets through.
        // Strict "one probe at a time" needs extra coordination.
        return true
    default:
        return true
    }
}

// RecordResult updates the breaker after a delivery attempt completes.
func (c *DestinationCircuit) RecordResult(success bool) {
    if success {
        c.State = CircuitClosed
        c.ConsecutiveFails = 0
        return
    }
    c.ConsecutiveFails++
    if c.ConsecutiveFails >= c.Threshold {
        c.State = CircuitOpen
        c.OpenedAt = time.Now()
    }
}
```

When a circuit is open, the worker skips events for that destination without making an HTTP request. This is important: you still want to advance next_attempt_at on those events so they don't pile up at the front of the queue, but you don't want to burn a worker slot on a connection you know will fail.
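
Wired into the worker, it looks something like this — processEvent, Event, deliver, and rescheduleWithoutAttempt are hypothetical placeholders, not part of the code above:

```go
// processEvent sketches the breaker check around a delivery. The Event
// type and the helper functions are illustrative placeholders.
func processEvent(ev Event, c *DestinationCircuit) {
    if !c.ShouldAttempt() {
        // Circuit is open: advance next_attempt_at without an HTTP
        // call so the event doesn't pin the front of the queue.
        rescheduleWithoutAttempt(ev)
        return
    }
    err := deliver(ev) // the actual HTTP POST to the destination
    c.RecordResult(err == nil)
}
```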

Store circuit state in Postgres or a shared cache so all worker instances see the same state. A single-row table per destination works at modest scale:

```sql
CREATE TABLE destination_circuits (
    destination_id      UUID PRIMARY KEY REFERENCES destinations(id),
    state               TEXT NOT NULL DEFAULT 'closed', -- closed|open|half_open
    consecutive_fails   INT NOT NULL DEFAULT 0,
    opened_at           TIMESTAMPTZ,
    updated_at          TIMESTAMPTZ NOT NULL DEFAULT now()
);
```
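
Recording a failure then becomes one atomic upsert — a sketch, assuming a hardcoded threshold of 5 consecutive failures:

```sql
-- Record a failed attempt; open the circuit at the threshold.
INSERT INTO destination_circuits (destination_id, consecutive_fails)
VALUES ($1, 1)
ON CONFLICT (destination_id) DO UPDATE
SET consecutive_fails = destination_circuits.consecutive_fails + 1,
    state = CASE
        WHEN destination_circuits.consecutive_fails + 1 >= 5 THEN 'open'
        ELSE destination_circuits.state
    END,
    opened_at = CASE
        WHEN destination_circuits.consecutive_fails + 1 >= 5 THEN now()
        ELSE destination_circuits.opened_at
    END,
    updated_at = now();
```
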

Separating Retry Traffic from Fresh Traffic

A subtler form of backpressure happens when retries crowd out fresh events. A healthy destination receiving a new event shouldn't wait behind 500 retry attempts for a broken destination.

The fix is to use separate logical queues for first-attempt delivery and retry delivery. With Postgres, this is just a column on the events table:

```sql
ALTER TABLE events ADD COLUMN attempt_class TEXT NOT NULL DEFAULT 'first';
-- 'first' | 'retry'
```
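
When a delivery fails and gets rescheduled, the worker flips the class. A sketch, assuming a simple exponential backoff and the attempts_count column from the monitoring query earlier:

```sql
-- Reschedule a failed event as retry traffic with exponential backoff.
UPDATE events
SET status = 'retry_scheduled',
    attempt_class = 'retry',
    attempts_count = attempts_count + 1,
    next_attempt_at = now() + INTERVAL '1 minute' * POWER(2, attempts_count)
WHERE id = $1;
```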

Your workers can then prioritize: always prefer attempt_class = 'first' events, fall back to retries when the first-attempt queue is empty.

```sql
SELECT * FROM events
WHERE status IN ('queued', 'retry_scheduled')
  AND next_attempt_at <= now()
ORDER BY
    -- first-attempt events before retries
    CASE WHEN attempt_class = 'first' THEN 0 ELSE 1 END ASC,
    next_attempt_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
```

This ensures that even during a retry storm — say, after a widespread destination outage clears and thousands of retries become due simultaneously — new events from healthy integrations still get prompt delivery.


Tenant-Level Fairness

On a multi-tenant platform, backpressure from one tenant's misbehaving destination shouldn't degrade delivery for other tenants. The same isolation principle applies at the account level.

Add a per-account fair-queuing pass to your worker pickup:

```sql
-- Cap in-flight deliveries per account so one tenant can't hog workers
SELECT e.*
FROM events e
WHERE e.status IN ('queued', 'retry_scheduled')
  AND e.next_attempt_at <= now()
  AND e.account_id NOT IN (
    -- Skip accounts that already have too many in-flight
    SELECT account_id FROM events
    WHERE status = 'delivering'
    GROUP BY account_id
    HAVING COUNT(*) >= 10
  )
ORDER BY e.next_attempt_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
```

True weighted fair queuing is more complex than this, but even a simple cap on in-flight per account prevents the worst case: a single tenant flooding the queue and starving everyone else.


Operational Runbook

When your queue depth alert fires, here is the decision tree:

  1. Run the per-destination query above. Is one destination responsible for the majority of the queue depth?
  2. If yes: Check that destination's circuit state. If the circuit is still closed, manually open it (a one-line update, shown after this list) or lower the consecutive-failure threshold. Review whether the destination's SLA warrants pausing delivery entirely.
  3. If no (spread across many destinations): You have a worker capacity problem. Scale out your worker fleet. Check whether worker timeouts are set appropriately — a 30-second destination timeout with 10 workers means you can only process 20 events per minute against timing-out destinations.
  4. After mitigation: Watch the queue depth trend. It should flatten and then decline. If it continues rising, the mitigation wasn't sufficient.
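
For step 2, forcing a circuit open is a single update against the destination_circuits table from earlier:

```sql
-- Manually open the circuit for a misbehaving destination.
UPDATE destination_circuits
SET state = 'open', opened_at = now(), updated_at = now()
WHERE destination_id = $1;
```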

GetHook surfaces queue depth and per-destination in-flight counts in the dashboard so you can skip the raw SQL queries during an incident. But regardless of the tool, the diagnostic logic is the same.


Backpressure is not a rare edge case. Any webhook platform running at real scale will encounter it: destinations go down, retry storms happen, and worker pools have finite size. The teams that handle it gracefully are the ones who built for it before the incident, not during it.

The patterns here — per-destination concurrency limits, circuit breaking, priority queues for first attempts, and tenant-level fairness — compose into a delivery system that degrades gracefully rather than catastrophically. None of them require introducing new infrastructure; they're all implementable on top of a Postgres queue.

If you want a webhook gateway with these reliability patterns already built in, get started with GetHook.
