When your webhook infrastructure needs to route a single inbound event to multiple destinations, you enter fanout territory. The canonical use case: a payment provider sends payment.succeeded to your ingest endpoint, and your system must deliver it to your order service, your analytics pipeline, your fraud detection system, your accounting integration, and every customer who has subscribed to payment events through your platform.
Fanout at 2 destinations is trivial. At 50 it requires deliberate design. This post covers the failure modes, the architectural trade-offs, and the implementation patterns that keep fanout reliable under load.
Why Fanout Is Not Just a Loop
The naive implementation looks like this:
```go
destinations := getDestinationsForEvent(event)
for _, dest := range destinations {
    deliver(event, dest)
}
```

This has several problems:
Sequential delivery means destination 50 waits for destinations 1–49 to complete. If each delivery takes 200ms, you're looking at 10 seconds of sequential work for a 50-destination fanout. This blocks your worker and inflates end-to-end latency.
One failure blocks the rest. If destination 23 is down, you have to decide: continue to destinations 24–50 and mark 23 as failed, or abort and retry everything? Neither is clean without explicit state tracking per destination.
No independent retry. A synchronous loop can't retry destination 23 independently of the others. If you retry the loop, destinations 1–22 and 24–50 receive duplicate deliveries.
The fix is to decompose fanout into one delivery job per destination, created atomically when the event arrives.
The Right Data Model
Fanout reliability starts with the right schema. You need to track delivery state independently for each destination:
```sql
-- One row per (event, destination) pair
CREATE TABLE delivery_attempts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_id UUID NOT NULL REFERENCES events(id),
destination_id UUID NOT NULL REFERENCES destinations(id),
attempt_number INT NOT NULL DEFAULT 1,
status TEXT NOT NULL DEFAULT 'queued',
-- queued | delivering | delivered | failed | dead_letter
http_status INT,
outcome TEXT,
-- success | timeout | network_error | http_4xx | http_5xx
scheduled_at TIMESTAMPTZ NOT NULL DEFAULT now(),
attempted_at TIMESTAMPTZ,
UNIQUE (event_id, destination_id, attempt_number)
);
CREATE INDEX ON delivery_attempts (status, scheduled_at)
WHERE status IN ('queued', 'failed');
```

The `UNIQUE` constraint on `(event_id, destination_id, attempt_number)` prevents duplicate delivery records from being inserted, even under concurrent worker processes.
The partial index on `(status, scheduled_at)` is what the worker uses to claim work — only rows in `queued` or `failed` status appear in the index.
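One payoff of attempt-numbered rows: retries become plain inserts. As a sketch (the 30-second backoff is a placeholder, not a recommendation), scheduling attempt n+1 looks like this:

```sql
-- Schedule the next attempt as a new row. If a concurrent worker already
-- scheduled it, ON CONFLICT makes this a harmless no-op.
INSERT INTO delivery_attempts
    (event_id, destination_id, attempt_number, status, scheduled_at)
VALUES
    ($1, $2, $3 + 1, 'queued', now() + interval '30 seconds')
ON CONFLICT (event_id, destination_id, attempt_number) DO NOTHING;
```

Each attempt gets its own row, so the full attempt history stays queryable after the fact.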
Atomically Expanding Events into Fanout Jobs
When an inbound event arrives, you need to atomically:
1. Persist the event
2. Resolve all matching destinations
3. Insert one delivery job per destination
This must happen in a single transaction. If you persist the event but crash before inserting delivery jobs, those destinations never receive anything.
```go
func (s *EventStore) IngestAndFanout(ctx context.Context, event *Event) error {
tx, err := s.db.BeginTx(ctx, nil)
if err != nil {
return err
}
defer tx.Rollback()
// 1. Insert the event
if err := insertEvent(ctx, tx, event); err != nil {
return err
}
// 2. Resolve matching routes and destinations
destinations, err := resolveDestinations(ctx, tx, event)
if err != nil {
return err
}
if len(destinations) == 0 {
return tx.Commit() // No fanout needed
}
// 3. Insert one delivery job per destination
for _, dest := range destinations {
if err := insertDeliveryJob(ctx, tx, event.ID, dest.ID); err != nil {
return err
}
}
return tx.Commit()
}
```

If this transaction commits, every destination has a delivery job. If it fails, nothing is persisted — the provider's retry will re-send the event and you'll try again.
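insertEvent and insertDeliveryJob are left abstract above. A minimal insertDeliveryJob, assuming the schema from earlier and uuid.UUID identifiers, can lean on the column defaults and the unique constraint:

```go
// Sketch: insert one delivery job, letting the schema defaults fill in
// attempt_number, status, and scheduled_at. ON CONFLICT keeps a re-ingested
// provider retry from creating duplicate jobs.
func insertDeliveryJob(ctx context.Context, tx *sql.Tx, eventID, destinationID uuid.UUID) error {
    _, err := tx.ExecContext(ctx, `
        INSERT INTO delivery_attempts (event_id, destination_id)
        VALUES ($1, $2)
        ON CONFLICT (event_id, destination_id, attempt_number) DO NOTHING`,
        eventID, destinationID)
    return err
}
```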
Worker Design for Parallel Fanout Delivery
With delivery jobs in Postgres, workers claim and execute them in parallel using FOR UPDATE SKIP LOCKED:
```sql
UPDATE delivery_attempts
SET status = 'delivering', attempted_at = now()
WHERE id IN (
SELECT id FROM delivery_attempts
WHERE status = 'queued'
AND scheduled_at <= now()
ORDER BY scheduled_at
LIMIT 10
FOR UPDATE SKIP LOCKED
)
RETURNING *;
```

`FOR UPDATE SKIP LOCKED` lets multiple workers run simultaneously without contention — each worker grabs a batch of jobs that no other worker is currently processing. This is the key to parallel fanout without an external queue system.
For a 50-destination event with 10 workers each claiming 10 jobs, all 50 deliveries complete in roughly one round, provided each worker delivers its claimed batch concurrently — limited by the slowest destination's response time, not by sequential chaining.
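A worker round might look like the sketch below. The uuid.UUID fields follow the schema; claimJobs wraps the claim query above, and deliver and recordOutcome are hypothetical stand-ins for the signed HTTP POST and the status update:

```go
// Sketch of one worker round: claim a batch, deliver each job in its own
// goroutine, record each outcome independently.
type Job struct {
    ID            uuid.UUID
    EventID       uuid.UUID
    DestinationID uuid.UUID
    AttemptNumber int
}

func runWorkerRound(ctx context.Context, db *sql.DB) error {
    jobs, err := claimJobs(ctx, db, 10) // the FOR UPDATE SKIP LOCKED query
    if err != nil {
        return err
    }
    var wg sync.WaitGroup
    for _, job := range jobs {
        wg.Add(1)
        go func(j Job) {
            defer wg.Done()
            outcome := deliver(ctx, j)         // HTTP POST to the destination
            recordOutcome(ctx, db, j, outcome) // delivered / failed / dead_letter
        }(job)
    }
    wg.Wait() // the round is bounded by the slowest destination, not the sum
    return nil
}
```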
Handling Partial Fanout Failures
With independent delivery jobs, partial failure is the default case — some destinations succeed, others fail, others time out. This is actually the correct behavior. What matters is how you handle each case.
| Destination Outcome | Action |
|---|---|
| 2xx response | Mark delivered, done |
| 4xx response (except 429) | Mark dead_letter, do not retry (client error — retrying won't fix it) |
| 429 Too Many Requests | Retry with backoff, respecting Retry-After header if present |
| 5xx response | Retry with exponential backoff |
| Connection timeout | Retry — destination may be temporarily unreachable |
| Network error | Retry — transient infrastructure issue |
| TLS handshake failure | Retry once; if persistent, dead-letter with clear error |
The critical rule for 4xx: do not retry them (429 aside). A 404 Not Found means the endpoint no longer exists. Retrying it 5 times is noise. A 401 Unauthorized means the signing secret doesn't match — the customer needs to fix their configuration. Retrying doesn't help and inflates your retry queue.
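In code, the table reduces to a small classifier plus a backoff schedule. A sketch (the backoff base and cap are arbitrary choices here, and the TLS case is folded into the generic retry path):

```go
// classify maps a delivery result to the next action. err covers timeouts,
// network errors, and TLS failures; status is the HTTP response code.
func classify(status int, err error) string {
    switch {
    case err != nil:
        return "retry" // timeout / network / TLS: treat as transient
    case status >= 200 && status < 300:
        return "delivered"
    case status == 429:
        return "retry" // honor Retry-After when the destination sends it
    case status >= 400 && status < 500:
        return "dead_letter" // client error: retrying won't fix it
    default:
        return "retry" // 5xx: exponential backoff
    }
}

// nextDelay computes the wait before the next attempt. A Retry-After value
// from a 429 takes precedence over the exponential schedule.
func nextDelay(attempt int, retryAfter time.Duration) time.Duration {
    if retryAfter > 0 {
        return retryAfter
    }
    d := time.Duration(1<<attempt) * time.Second // 2s, 4s, 8s, ...
    if d > 5*time.Minute {
        d = 5 * time.Minute
    }
    return d
}
```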
Fanout at Scale: When 50 Destinations Becomes 5,000
Everything described above works well up to a few hundred destinations per event. At larger fanout counts (multi-tenant platforms where a single event triggers deliveries to thousands of customer endpoints), you need to think about write amplification.
Inserting 5,000 delivery job rows per event, at 100 events/second, is 500,000 row inserts per second. That's a write load that will saturate most Postgres instances, so measure against your own hardware before committing to this design.
The practical limits to watch:
| Scale | Concern |
|---|---|
| 1–100 destinations | No special consideration needed |
| 100–1,000 destinations | Batch inserts (INSERT with multi-row VALUES) to reduce round trips |
| 1,000–10,000 destinations | Consider a fanout expansion service that runs asynchronously |
| 10,000+ destinations | Partition delivery jobs by account or region; purpose-built fanout queue |
For most SaaS webhook platforms, the 100–1,000 destination range is the realistic ceiling. Batch insert your delivery jobs in chunks of 100 to keep transaction size manageable.
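The chunking loop below calls insertDeliveryJobsBatch, which isn't defined elsewhere in this post. One plausible shape (a sketch, assuming a Destination struct with a uuid.UUID ID field) is a single multi-row VALUES statement per chunk:

```go
// Sketch: batch insert with a multi-row VALUES clause, so a 100-destination
// chunk costs one round trip instead of 100.
func insertDeliveryJobsBatch(ctx context.Context, tx *sql.Tx, eventID uuid.UUID, dests []Destination) error {
    var sb strings.Builder
    args := make([]any, 0, len(dests)*2)
    sb.WriteString("INSERT INTO delivery_attempts (event_id, destination_id) VALUES ")
    for i, dest := range dests {
        if i > 0 {
            sb.WriteString(", ")
        }
        fmt.Fprintf(&sb, "($%d, $%d)", i*2+1, i*2+2)
        args = append(args, eventID, dest.ID)
    }
    sb.WriteString(" ON CONFLICT (event_id, destination_id, attempt_number) DO NOTHING")
    _, err := tx.ExecContext(ctx, sb.String(), args...)
    return err
}
```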
```go
// Batch insert delivery jobs in chunks
const chunkSize = 100
for i := 0; i < len(destinations); i += chunkSize {
end := i + chunkSize
if end > len(destinations) {
end = len(destinations)
}
chunk := destinations[i:end]
if err := insertDeliveryJobsBatch(ctx, tx, event.ID, chunk); err != nil {
return err
}
}
```

Ordering Guarantees (or Lack Thereof)
Fanout inherently breaks strict ordering. If you have destinations A, B, and C for a given source, and event E1 is followed by event E2:
- E1 may be delivered to A before E2
- E2 may be delivered to B before E1 (if E1 is retrying)
- E1 and E2 may arrive at C simultaneously if two workers claim them at the same time
If ordering matters for a specific destination, you need per-destination sequencing — a mechanism that ensures no E2 delivery is attempted for a destination until E1 is confirmed delivered.
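One way to implement this (a sketch, assuming an events.sequence column that increases monotonically per source, which the schema above doesn't include) is a claim query that skips any job with an earlier, still-pending sibling for the same destination:

```sql
-- Sketch: claim jobs for a destination only in sequence order. A job is
-- skipped while an earlier event's job for the same destination is still
-- queued, delivering, or awaiting retry.
SELECT da.id
FROM delivery_attempts da
JOIN events e ON e.id = da.event_id
WHERE da.status = 'queued'
  AND da.scheduled_at <= now()
  AND NOT EXISTS (
    SELECT 1
    FROM delivery_attempts prior
    JOIN events pe ON pe.id = prior.event_id
    WHERE prior.destination_id = da.destination_id
      AND prior.status IN ('queued', 'delivering', 'failed')
      AND pe.sequence < e.sequence
  )
ORDER BY e.sequence
LIMIT 10
FOR UPDATE OF da SKIP LOCKED;
```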
This is expensive: it serializes delivery for that destination, eliminating the parallelism that makes fanout fast. Use it only where ordering is a hard requirement (e.g., financial ledger updates where event order directly affects state).
The practical approach: design your destination handlers to be idempotent and order-tolerant. Include sequence numbers in your event payload so consumers can detect and handle out-of-order delivery themselves.
Observing Fanout Health
Standard delivery metrics don't tell the full story for fanout. You need per-event fanout visibility:
- Fanout completion rate: what percentage of events have all destinations delivered (not just one)?
- Fanout partial failure rate: events where at least one destination is in dead-letter
- Per-destination success rate: which specific destinations are consistently failing?
- Fanout lag: time between event receipt and the last destination receiving delivery
A useful query for identifying events with incomplete fanout:
```sql
SELECT
e.id AS event_id,
e.created_at,
COUNT(*) FILTER (WHERE da.status = 'delivered') AS delivered_count,
COUNT(*) FILTER (WHERE da.status = 'dead_letter') AS dead_letter_count,
COUNT(*) FILTER (WHERE da.status IN ('queued', 'delivering')) AS pending_count,
COUNT(*) AS total_destinations
FROM events e
JOIN delivery_attempts da ON da.event_id = e.id
WHERE e.created_at > now() - interval '1 hour'
GROUP BY e.id, e.created_at
HAVING COUNT(*) FILTER (WHERE da.status IN ('queued', 'delivering', 'dead_letter')) > 0
ORDER BY e.created_at DESC;
```

This surfaces events that haven't fully fanned out within the last hour — the starting point for any fanout incident investigation.
How GetHook Handles Fanout
GetHook's route model is designed for fanout from the start. A single source can have multiple routes, each mapping to a different destination with its own retry policy, signing secret, and timeout configuration. When an event arrives, GetHook expands it into per-destination delivery jobs atomically, processes them in parallel with independent retry state, and surfaces per-destination outcomes in the event timeline.
For platforms building customer-facing webhook infrastructure with many subscribers per event, the delivery isolation per destination means one slow or failed customer endpoint never delays delivery to the others.
If you're building fanout into your platform, start with GetHook's route configuration to avoid rebuilding the delivery infrastructure from scratch.
Summary
Reliable webhook fanout requires:
- Decompose fanout into one delivery job per destination at ingest time, atomically
- Use `FOR UPDATE SKIP LOCKED` to let workers process jobs in parallel without contention
- Track delivery state independently per destination — partial failure is normal and manageable
- Don't retry 4xx errors (except 429); they indicate a configuration problem, not a transient failure
- For high-destination-count fanout, batch insert delivery jobs to manage write amplification
- Add per-event fanout visibility metrics — aggregate success rate hides partial failures
The patterns here scale from 2 destinations to thousands without architectural changes. The hardest part isn't the mechanics — it's recognizing early that a loop won't get you there.