Exponential backoff with a maximum retry count is the standard answer to failed webhook deliveries. It works well for transient failures — a destination that returns a 500, recovers in 30 seconds, and accepts the next attempt just fine. But backoff alone doesn't handle a different failure mode: a destination that is completely down for hours.
When a destination goes down for an extended period, your retry queue fills with attempts that all fail. Your worker threads are burning time making HTTP connections that time out. And other destinations — perfectly healthy ones — experience increased latency because your worker pool is clogged with hopeless retries.
Circuit breakers solve this. The pattern comes from electrical engineering: a circuit breaker "trips" when it detects a fault, opens the circuit to stop the flow of current, and closes again only when conditions are safe. Applied to webhook delivery, a circuit breaker tracks per-destination failure rates, stops attempting delivery when a destination looks unhealthy, and probes for recovery before resuming normal delivery.
The Three States
A circuit breaker has three states:
| State | Behavior | Transition |
|---|---|---|
| Closed | Delivery proceeds normally. Failures are counted. | Opens when the failure threshold is crossed. |
| Open | Delivery is suspended. No attempts are made. | Moves to Half-Open after a cooldown period. |
| Half-Open | A single probe delivery is attempted. | Closes if the probe succeeds. Reopens if it fails. |
The key insight is the Half-Open state. Rather than waiting for an operator to manually re-enable a destination, the circuit breaker automatically probes for recovery. This gives you hands-off healing in the common case (the destination recovers on its own) while still protecting your workers from sustained retry storms.
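The table maps directly onto a small amount of code. A minimal Go sketch of the states (the constant names are illustrative; the string values match what the Postgres table below stores):

// CircuitState mirrors the three states above. The string values are the ones
// the state column in the Postgres table below will hold.
type CircuitState string

const (
    StateClosed   CircuitState = "closed"    // deliver normally, count failures
    StateOpen     CircuitState = "open"      // suspend delivery until the next probe time
    StateHalfOpen CircuitState = "half_open" // one probe in flight; its outcome decides what happens next
)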
Tracking State in Postgres
You don't need Redis or a dedicated service to implement circuit breakers. Postgres is sufficient, and if you're already using it as your job queue, adding circuit breaker state is a small extension.
Add a destination_circuit_breakers table:
CREATE TABLE destination_circuit_breakers (
    destination_id UUID PRIMARY KEY REFERENCES destinations(id) ON DELETE CASCADE,
    account_id UUID NOT NULL REFERENCES accounts(id),
    state TEXT NOT NULL DEFAULT 'closed',
    -- closed | open | half_open
    consecutive_failures INT NOT NULL DEFAULT 0,
    open_cycles INT NOT NULL DEFAULT 0,
    -- consecutive open transitions; drives the cooldown doubling below
    last_failure_at TIMESTAMPTZ,
    opened_at TIMESTAMPTZ,
    next_probe_at TIMESTAMPTZ,
    last_success_at TIMESTAMPTZ,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX dcb_account_state ON destination_circuit_breakers (account_id, state);

Keep this table separate from destinations. Circuit breaker state is operational data — it changes frequently and has a different lifecycle than destination configuration.
The Delivery Worker Integration
The circuit breaker check lives in your delivery worker, immediately before each HTTP attempt:
func (w *Worker) deliverEvent(ctx context.Context, evt *Event, dst *Destination) error {
    cb, err := w.cbStore.Get(ctx, dst.ID)
    if err != nil {
        return fmt.Errorf("circuit breaker lookup: %w", err)
    }

    switch cb.State {
    case "open":
        if cb.NextProbeAt != nil && time.Now().Before(*cb.NextProbeAt) {
            // Destination is open and not yet ready to probe.
            // Re-schedule for after the probe window opens.
            return w.eventStore.RescheduleAfter(ctx, evt.ID, *cb.NextProbeAt)
        }
        // Transition to half-open and attempt a probe.
        if err := w.cbStore.TransitionToHalfOpen(ctx, dst.ID); err != nil {
            return err
        }
    case "half_open":
        // Another worker may have beaten us to the probe. Skip this event
        // and let the probe outcome decide whether the circuit closes or reopens.
        return w.eventStore.RescheduleAfter(ctx, evt.ID, time.Now().Add(30*time.Second))
    }

    // Attempt delivery.
    outcome, err := w.forwarder.Deliver(ctx, evt, dst)
    if err != nil {
        return err
    }
    if outcome.Success {
        return w.cbStore.RecordSuccess(ctx, dst.ID)
    }
    return w.cbStore.RecordFailure(ctx, dst.ID, outcome)
}

The RescheduleAfter function pushes the event's next_attempt_at forward without consuming a retry attempt — the event is not failed, just deferred. This is important: circuit breaker deferral is not the same as a retry. The retry counter only increments on actual HTTP attempts.
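RescheduleAfter itself is not shown above. A minimal sketch, assuming the events table and columns referenced later in this post (the names are assumptions, not a prescribed schema):

// RescheduleAfter defers an event by moving its next_attempt_at forward.
// It deliberately does not touch the attempt counter: deferral is not a retry.
// Table and column names are illustrative.
func (s *EventStore) RescheduleAfter(ctx context.Context, eventID uuid.UUID, at time.Time) error {
    _, err := s.db.ExecContext(ctx, `
        UPDATE events
        SET next_attempt_at = $2
        WHERE id = $1`,
        eventID, at)
    return err
}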
Opening and Closing the Circuit
The RecordFailure and RecordSuccess functions implement the state machine transitions:
const (
    openThreshold   = 5                // consecutive failures before opening
    openCooldown    = 10 * time.Minute // initial open duration
    maxOpenCooldown = 4 * time.Hour    // maximum cooldown after repeated failures
)

func (s *CBStore) RecordFailure(ctx context.Context, dstID uuid.UUID, outcome DeliveryOutcome) error {
    tx, err := s.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    cb, err := s.getForUpdate(tx, ctx, dstID)
    if err != nil {
        return err
    }

    cb.ConsecutiveFailures++
    cb.LastFailureAt = ptr(time.Now())

    if cb.State == "half_open" || cb.ConsecutiveFailures >= openThreshold {
        cb.State = "open"
        cb.OpenedAt = ptr(time.Now())
        // Exponential cooldown: double on each consecutive open cycle.
        // Cap the shift so the multiplier cannot overflow; the cooldown cap
        // is reached long before the shift cap matters.
        cycle := min(cb.OpenCycles, 10)
        cooldown := min(openCooldown*time.Duration(1<<cycle), maxOpenCooldown)
        cb.NextProbeAt = ptr(time.Now().Add(cooldown))
        cb.OpenCycles++
    }

    if err := s.save(tx, ctx, cb); err != nil {
        return err
    }
    return tx.Commit()
}

func (s *CBStore) RecordSuccess(ctx context.Context, dstID uuid.UUID) error {
    return s.update(ctx, dstID, func(cb *CircuitBreaker) {
        cb.State = "closed"
        cb.ConsecutiveFailures = 0
        cb.OpenCycles = 0
        cb.LastSuccessAt = ptr(time.Now())
        cb.OpenedAt = nil
        cb.NextProbeAt = nil
    })
}

The doubling cooldown matters. A destination that trips the circuit breaker, recovers briefly, and then fails again is exhibiting an instability pattern. With the defaults above, successive open cycles wait 10 minutes, 20 minutes, 40 minutes, and so on, capped at 4 hours. Holding the circuit open longer before the next probe reduces the pressure you put on a struggling service.
Choosing Your Thresholds
The right thresholds depend on your traffic patterns and acceptable false-positive rates. These are reasonable starting defaults:
| Parameter | Default | Notes |
|---|---|---|
| Open threshold | 5 consecutive failures | Low enough to catch real outages, high enough to ignore flaps |
| Initial cooldown | 10 minutes | Most transient outages recover in under 10 minutes |
| Maximum cooldown | 4 hours | Prevents indefinite suppression; forces a re-check |
| Probe timeout | 10 seconds | Shorter than normal delivery timeout to detect slow recovery |
| Half-open slot count | 1 | Only one probe at a time; prevents thundering herd on recovery |
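The single-probe slot is easiest to enforce in the open-to-half-open transition itself. One possible sketch of the TransitionToHalfOpen call used by the worker above is a compare-and-set UPDATE, so that when several workers reach the probe window at once, only the first one claims the probe (ErrProbeInProgress is a hypothetical sentinel; the caller can treat it like the half_open case and defer the event):

// ErrProbeInProgress signals that another worker already claimed the probe slot.
var ErrProbeInProgress = errors.New("circuit breaker probe already in progress")

// TransitionToHalfOpen flips an open breaker to half_open only if it is still
// open: a compare-and-set, so concurrent workers elect exactly one prober.
func (s *CBStore) TransitionToHalfOpen(ctx context.Context, dstID uuid.UUID) error {
    res, err := s.db.ExecContext(ctx, `
        UPDATE destination_circuit_breakers
        SET state = 'half_open', updated_at = now()
        WHERE destination_id = $1 AND state = 'open'`,
        dstID)
    if err != nil {
        return err
    }
    if n, _ := res.RowsAffected(); n == 0 {
        return ErrProbeInProgress
    }
    return nil
}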
For high-volume destinations where a single failure spike should not trigger an open state, consider using a sliding window instead of consecutive failures: open the circuit if more than 80% of deliveries in the last 60 seconds failed, with a minimum sample size of 10 attempts.
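A sketch of that windowed check, with the counts assumed to come from a per-destination query over recent delivery attempts (the thresholds and function name are illustrative):

const (
    windowFailureRate = 0.80 // open if more than 80% of recent deliveries failed
    windowMinSamples  = 10   // require a minimum sample size before judging
)

// shouldOpenWindowed decides whether to open the circuit from the failure rate
// over the last window (for example, 60 seconds) rather than consecutive failures.
// attempts and failures would come from a query over that window.
func shouldOpenWindowed(attempts, failures int) bool {
    if attempts < windowMinSamples {
        return false // not enough traffic to distinguish an outage from noise
    }
    return float64(failures)/float64(attempts) > windowFailureRate
}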
Surfacing Circuit Breaker State to Customers
A circuit breaker that silently defers events is invisible to your customers — and invisibility is a support nightmare. When a destination trips a circuit breaker, notify the customer.
At minimum, expose circuit breaker state on the destination detail endpoint:
{
  "id": "dst_01HX...",
  "name": "Production Order Service",
  "url": "https://api.acme.com/webhooks/orders",
  "circuit_breaker": {
    "state": "open",
    "consecutive_failures": 5,
    "opened_at": "2026-03-29T14:30:00Z",
    "next_probe_at": "2026-03-29T14:40:00Z",
    "last_success_at": "2026-03-29T14:28:00Z"
  }
}

Add a status page indicator when a destination's circuit is open. Make it obvious. Customers debugging webhook delivery issues should not need to read logs to understand that their endpoint has been unreachable for 45 minutes.
GetHook surfaces circuit breaker state in the destination detail view and includes it in webhook delivery observability dashboards, so you can see at a glance which destinations are healthy and which are suspended.
Events During an Open Circuit
While the circuit is open, events destined for the suspended endpoint accumulate in a deferred state. When the circuit closes — either because the probe succeeded or because an operator manually reset it — those events need to be delivered.
The mechanics are simple: on circuit close, query for all events where destination_id = $1 AND status = 'retry_scheduled' AND next_attempt_at > now() and reset their next_attempt_at to now(). The worker will pick them up in normal priority order.
One caution: a destination that was down for 4 hours may have thousands of deferred events. Resuming all of them simultaneously will spike the destination. Rate-limit the resume by releasing events in batches — for example, no more than 100 events per second to a newly recovered destination — to give it a chance to stabilize before absorbing the backlog.
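A sketch of that batched release, reusing the query above (the batch size, one-second ticker, and ResumeDeferred name are illustrative):

const resumeBatchSize = 100 // events released per second to a recovered destination

// ResumeDeferred releases a recovered destination's deferred backlog in batches
// so the endpoint is not hit with hours of events at once.
func (s *EventStore) ResumeDeferred(ctx context.Context, dstID uuid.UUID) error {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        res, err := s.db.ExecContext(ctx, `
            UPDATE events
            SET next_attempt_at = now()
            WHERE id IN (
                SELECT id FROM events
                WHERE destination_id = $1
                  AND status = 'retry_scheduled'
                  AND next_attempt_at > now()
                ORDER BY next_attempt_at
                LIMIT $2)`,
            dstID, resumeBatchSize)
        if err != nil {
            return err
        }
        if n, _ := res.RowsAffected(); n == 0 {
            return nil // backlog drained
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
            // Release at most one batch per second.
        }
    }
}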
What the Circuit Breaker Does Not Do
Circuit breakers handle the "destination is down" case. They don't replace:
- Retry logic — circuit breakers defer; retries consume retry budget. Both are needed.
- Dead letter queues — events that exhaust their retry budget before the circuit closes should go to a DLQ, not vanish.
- Alerting — a circuit breaker opening is an operational event worth paging on. Wire it to your alerting system.
- Event ordering guarantees — if your consumers depend on strict event ordering, circuit breaker deferral breaks that ordering. Design consumers to be order-tolerant, or implement per-destination delivery queues with ordering guarantees.
Adding a circuit breaker to your webhook delivery pipeline is a one-time investment that pays off every time a downstream service has an incident. Without one, every outage cascades into a retry storm that affects your entire delivery infrastructure. With one, the damage is contained to the affected destination while everything else continues normally.