observability · webhooks · monitoring · reliability · sre

The Four Golden Signals Applied to Webhook Delivery Pipelines

Latency, traffic, errors, and saturation were designed for request-response systems. Here's how to reframe each signal for the unique failure modes of webhook delivery infrastructure.

Lena Hartmann
Infrastructure Engineer
April 22, 2026
10 min read

The four golden signals — latency, traffic, errors, and saturation — come from Google's Site Reliability Engineering book. They were designed to give on-call engineers a concise, actionable picture of a system's health. The original framing assumes a synchronous request-response model: a request comes in, something happens, a response goes out, and you measure how long that took and whether it succeeded.

Webhook delivery pipelines do not fit that model cleanly. A webhook event is accepted at ingest, persisted, queued, delivered asynchronously, and potentially retried multiple times across a window of hours. The "latency" is not milliseconds between request and response — it's minutes between event creation and confirmed delivery. The "errors" are not HTTP 5xx responses from your own service — they're delivery failures at remote endpoints you do not control. The "traffic" is not requests per second at your load balancer — it's a combination of ingest rate and delivery attempt rate, which can diverge significantly during retry storms.

This post maps each signal to its webhook-specific equivalent, the metric you should actually be tracking, and the alert threshold that makes operational sense.


Signal 1: Latency — End-to-End Delivery Time

In a traditional API, latency is the time from request received to response sent. For webhook delivery, the meaningful latency metric is time from event creation to first successful delivery attempt.

This is your delivery latency, and it tells you whether your pipeline is keeping up with inbound event volume. A p50 of 1.2 seconds is excellent. A p99 of 45 seconds suggests your queue is backing up during traffic spikes. A p99 of 8 minutes means your retry logic is the source of most "delivery" time — not your initial delivery attempt.

The key query for tracking delivery latency, run against your events and delivery_attempts tables:

sql
-- earliest successful attempt per event (may be a retry)
WITH first_success AS (
    SELECT DISTINCT ON (da.event_id)
        da.event_id,
        da.created_at
    FROM delivery_attempts da
    WHERE da.outcome = 'success'
    ORDER BY da.event_id, da.created_at
)
SELECT
    percentile_cont(0.50) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (
            fs.created_at - e.created_at
        ))
    ) AS p50_seconds,
    percentile_cont(0.95) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (
            fs.created_at - e.created_at
        ))
    ) AS p95_seconds,
    percentile_cont(0.99) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (
            fs.created_at - e.created_at
        ))
    ) AS p99_seconds
FROM events e
JOIN first_success fs
    ON fs.event_id = e.id
WHERE e.created_at >= NOW() - INTERVAL '1 hour';

Track this as a time-series metric, not just a point-in-time snapshot. Alert on p95 exceeding your SLA threshold (typically 30 seconds for most use cases, 5 seconds for critical event types if you've configured priority queuing).

A secondary latency metric worth tracking: time from event creation to first delivery attempt, regardless of outcome. This isolates queue latency from destination latency. If this number climbs, your delivery workers are behind on the queue. If the overall delivery latency climbs but this number is stable, your destinations are slow to respond.
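
If you want both of these as live metrics rather than ad-hoc queries, two histograms are enough. The sketch below assumes the Prometheus Go client; the metric names, bucket layout, and the ObserveAttempt hook are illustrative choices, not a required schema:

go
// Minimal instrumentation sketch, assuming the Prometheus Go client.
package metrics

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Ingest-to-first-attempt: isolates queue latency from destination latency.
    firstAttemptLatency = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "webhook_ingest_to_first_attempt_seconds",
        Help:    "Time from event creation to the first delivery attempt.",
        Buckets: prometheus.ExponentialBuckets(0.1, 2, 12), // ~0.1s to ~200s
    })
    // Ingest-to-first-success: the headline delivery latency.
    firstSuccessLatency = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "webhook_ingest_to_first_success_seconds",
        Help:    "Time from event creation to the first successful delivery.",
        Buckets: prometheus.ExponentialBuckets(0.1, 2, 16), // ~0.1s to ~55min
    })
)

// ObserveAttempt records one delivery attempt once its outcome is known.
// firstSuccess should be true only for the event's first successful delivery.
func ObserveAttempt(eventCreatedAt time.Time, attemptNumber int, firstSuccess bool) {
    elapsed := time.Since(eventCreatedAt).Seconds()
    if attemptNumber == 1 {
        firstAttemptLatency.Observe(elapsed)
    }
    if firstSuccess {
        firstSuccessLatency.Observe(elapsed)
    }
}

Once these exist, the p95 alert above becomes an ordinary quantile rule in whatever alerting stack you already run.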

Latency metric | What it measures | Healthy threshold
Ingest-to-first-attempt | Queue processing speed | p99 < 5s
Ingest-to-first-success | Full delivery pipeline | p99 < 30s
Ingest-to-final-outcome | Including retries | p99 < 2h
HTTP response time (per attempt) | Destination responsiveness | p99 < 10s

Signal 2: Traffic — Ingest Rate vs. Attempt Rate

In a traditional system, traffic is straightforward: requests per second at the edge. In a webhook pipeline, you have two distinct traffic signals that can diverge:

Ingest rate — events received per second at your ingest endpoint. This reflects upstream activity: your customers' systems generating events, third-party providers sending webhooks, or your own application emitting outbound events.

Attempt rate — delivery attempts per second from your worker pool. Under normal conditions, attempt rate tracks ingest rate with a small multiplier (accounting for fan-out to multiple destinations and occasional first-attempt failures). During a retry storm, attempt rate can be 5–10× ingest rate as your workers process both new events and the accumulated backlog of scheduled retries.

Track both. The ratio between them is meaningful:

go
type TrafficMetrics struct {
    IngestRate   float64 // events/sec over last 60s
    AttemptRate  float64 // attempts/sec over last 60s
    FanoutRatio  float64 // avg destinations per event (steady-state)
    RetryRatio   float64 // (AttemptRate / IngestRate) / FanoutRatio
}

func (m TrafficMetrics) IsRetryStorm() bool {
    // If attempts per event are more than 2x what fan-out alone explains,
    // you're in retry amplification territory
    return m.RetryRatio > 2.0
}

A healthy RetryRatio is between 1.0 and 1.3 — some events fail on first attempt and need one retry. A ratio above 2.0 suggests systematic delivery failures at one or more destinations. A ratio above 4.0 is a retry storm: you have destinations that are consistently failing, and your workers are spending most of their capacity re-delivering events that will probably fail again.

Alert on RetryRatio > 2.5 for more than five consecutive minutes. At that point, circuit breaking one or more destinations will reduce attempt rate faster than any other intervention.
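
One way to encode that sustained-threshold condition, assuming the check runs once per minute and sits alongside the TrafficMetrics struct above (the alert type and field names are illustrative):

go
// RetryRatioAlert fires only when the ratio stays above the threshold for
// several consecutive evaluations, which filters out one-minute blips.
type RetryRatioAlert struct {
    Threshold       float64 // e.g. 2.5
    RequiredMinutes int     // e.g. 5
    consecutiveOver int
}

// Evaluate is called once per minute with fresh TrafficMetrics.
func (a *RetryRatioAlert) Evaluate(m TrafficMetrics) bool {
    if m.RetryRatio > a.Threshold {
        a.consecutiveOver++
    } else {
        a.consecutiveOver = 0
    }
    return a.consecutiveOver >= a.RequiredMinutes
}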


Signal 3: Errors — Delivery Failures vs. System Errors

This is where webhook pipelines diverge most sharply from traditional systems. You have two completely different error categories:

System errors — failures in your own infrastructure. Your worker crashes, the database is unreachable, your ingest endpoint returns 500. These are errors you own and must fix.

Delivery errors — failures at remote destination endpoints. The customer's server returns 503. The endpoint times out. The TLS certificate is expired. These are errors you do not own, but you must track, surface, and route around.

Conflating these two error types in a single error rate metric is a significant operational mistake. Your error rate will look inflated whenever a customer's endpoint is having issues, even if your infrastructure is perfectly healthy.

Track them separately:

sql
-- System errors: internal failures in your pipeline
SELECT
    DATE_TRUNC('minute', created_at) AS minute,
    COUNT(*) FILTER (WHERE outcome = 'network_error') AS network_errors,
    COUNT(*) FILTER (WHERE outcome = 'timeout') AS timeouts,
    COUNT(*) AS total_attempts
FROM delivery_attempts
WHERE created_at >= NOW() - INTERVAL '1 hour'
GROUP BY 1
ORDER BY 1;

-- Destination health: errors attributable to specific endpoints
SELECT
    destination_id,
    COUNT(*) FILTER (WHERE outcome = 'http_5xx') AS http_5xx,
    COUNT(*) FILTER (WHERE outcome = 'http_4xx') AS http_4xx,
    COUNT(*) FILTER (WHERE outcome = 'timeout') AS timeouts,
    COUNT(*) AS total_attempts,
    ROUND(
        100.0 * COUNT(*) FILTER (WHERE outcome = 'success') / NULLIF(COUNT(*), 0),
        1
    ) AS success_rate_pct
FROM delivery_attempts
WHERE created_at >= NOW() - INTERVAL '1 hour'
GROUP BY destination_id
ORDER BY success_rate_pct ASC
LIMIT 20;

Your system error rate alert threshold should be low: alert if network_error rate exceeds 0.5% of attempts, or if timeout rate on attempts with a 30-second timeout exceeds 1%. These numbers suggest infrastructure problems, not destination problems.
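
Those two thresholds reduce to a simple check over per-minute counts like the ones the first query above produces. The type below is an illustrative sketch, not part of any particular metrics library:

go
type AttemptCounts struct {
    NetworkErrors int
    Timeouts      int
    Total         int
}

// SystemErrorAlert returns true when either infrastructure-level error rate
// crosses its threshold: 0.5% network errors or 1% timeouts.
func (c AttemptCounts) SystemErrorAlert() bool {
    if c.Total == 0 {
        return false
    }
    networkRate := float64(c.NetworkErrors) / float64(c.Total)
    timeoutRate := float64(c.Timeouts) / float64(c.Total)
    return networkRate > 0.005 || timeoutRate > 0.01
}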

Your destination error rate threshold is softer: a destination with a 10% error rate over five minutes is unhealthy and should be considered for circuit breaking. A destination with a 100% error rate over two minutes should be immediately circuit-broken.
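
As a sketch, the destination-level decision might look like this, assuming you already maintain rolling failure rates per destination over two- and five-minute windows (the struct and its fields are illustrative):

go
type DestinationHealth struct {
    FailureRate2m float64 // fraction of failed attempts over the last 2 minutes
    FailureRate5m float64 // fraction of failed attempts over the last 5 minutes
}

// ShouldCircuitBreak applies the thresholds above: break immediately on total
// failure over a short window, and flag sustained elevated failure rates as
// candidates for circuit breaking.
func (h DestinationHealth) ShouldCircuitBreak() bool {
    if h.FailureRate2m >= 1.0 {
        return true // 100% failure for two minutes: break immediately
    }
    return h.FailureRate5m >= 0.10 // 10% over five minutes: unhealthy
}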

GetHook tracks delivery error rates at the destination level and surfaces per-destination health scores in the dashboard. When a destination's health score drops below a threshold, delivery workers automatically back off — reducing retry noise without manual intervention.


Signal 4: Saturation — Queue Depth and Worker Headroom

Saturation measures how close your system is to its capacity limit. In a request-response system this is usually CPU utilization or connection pool exhaustion. In a webhook pipeline, the primary saturation signal is queue depth: the number of events waiting to be delivered.

Queue depth is a leading indicator of delivery latency. When your workers are keeping up with ingest, queue depth stays near zero. When workers fall behind — due to slow destinations, retry accumulation, or a traffic spike — queue depth grows, and latency climbs shortly after.

Track queue depth broken down by status:

sql
SELECT
    status,
    COUNT(*) AS count,
    MIN(created_at) AS oldest_event,
    MAX(created_at) AS newest_event
FROM events
WHERE status IN ('queued', 'delivering', 'retry_scheduled')
GROUP BY status;

Saturated delivery pipelines show a growing queued count. A large retry_scheduled count with a very old oldest_event timestamp means events are cycling through retries without clearing — a sign of persistently unhealthy destinations, not worker throughput issues.

Secondary saturation signals:

Worker thread utilization — if all your delivery worker goroutines or threads are busy simultaneously, you have no headroom for spikes. Track active workers as a percentage of total worker capacity. Alert at 80% sustained utilization.

Database connection pool exhaustion — webhook pipelines are database-heavy: every delivery attempt writes a row, every retry updates event state, and the queue poll query runs continuously. If your connection pool approaches its limit, latency spikes across all operations. Track idle connections / total connections and alert if idle drops below 20%.

Destination concurrency limits — if you enforce per-destination delivery concurrency (to avoid overwhelming a single customer endpoint), track how often that limit is hit. High contention on a destination's concurrency slot is a sign the destination is slow to respond, even if it's not failing outright.
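
A minimal sketch of the worker-utilization and connection-pool signals above, assuming a fixed-size worker pool and a database/sql pool with SetMaxOpenConns configured; the type and function names are illustrative:

go
package saturation

import "database/sql"

type WorkerPool struct {
    Active int // workers currently executing a delivery
    Total  int // configured pool size
}

// Utilization is the fraction of busy workers; alert above 0.80 sustained.
func (p WorkerPool) Utilization() float64 {
    if p.Total == 0 {
        return 0
    }
    return float64(p.Active) / float64(p.Total)
}

// PoolHeadroom is the fraction of the connection pool not currently in use;
// alert when it drops below 0.20.
func PoolHeadroom(db *sql.DB) float64 {
    stats := db.Stats()
    if stats.MaxOpenConnections == 0 {
        return 1 // no configured limit, so headroom is effectively unbounded
    }
    return float64(stats.MaxOpenConnections-stats.InUse) / float64(stats.MaxOpenConnections)
}

Both values are cheap to compute on every scrape or poll cycle, so there is no reason not to export them continuously.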

Saturation metric | Alert threshold | Meaning
Queue depth (queued) | > 10,000 | Workers falling behind
Queue depth (retry_scheduled) | > 50,000 | Retry storm accumulation
Worker utilization | > 80% for 5 min | Near capacity
DB connection pool | < 20% idle | Connection pressure
Oldest queued event age | > 5 min | Delivery SLA breach risk

Putting It Together: A Minimal Dashboard

You do not need a complex observability platform to get value from these four signals. A minimal webhook health dashboard has four rows:

  1. Latency: p50/p95/p99 ingest-to-first-success, plotted over time
  2. Traffic: ingest rate and attempt rate on the same chart, with retry ratio as a third line
  3. Errors: system error rate (%) and top-5 unhealthy destinations by success rate
  4. Saturation: queue depth by status, worker utilization %, oldest queued event age

If you can answer these four questions in under 30 seconds during an incident, your observability is in good shape:

  • Is the pipeline keeping up? (queue depth + latency)
  • Is the error rate elevated on my infrastructure? (system errors)
  • Which destination is causing problems? (destination error rates)
  • How close am I to capacity? (worker utilization + connection pool)

The Webhook-Specific Blind Spots

Two failure modes that the four golden signals do not directly catch:

Silent delivery success on wrong payload. Your pipeline delivers an event, the destination returns 200, your success rate is perfect — but the destination is silently discarding events because of a schema mismatch. This is not an infrastructure problem; it's a contract problem. Catch it with consumer-driven contract tests in CI, not production monitoring.

Correct delivery to the wrong destination. Your routing logic has a bug that sends payment.failed events to a low-priority destination instead of the expected high-priority one. Events are delivered, latency looks fine, errors are low — but the customer is not receiving the events they expect. Validate routing configuration with integration tests that assert on destination-level delivery, not just pipeline-level delivery.
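
A sketch of such a test, with a hypothetical hard-coded routing table standing in for whatever your routing layer actually loads from configuration:

go
package routing_test

import "testing"

// Hypothetical routing table; in a real pipeline this comes from your routing
// configuration, not a hard-coded map.
var routes = map[string]string{
    "payment.failed":  "payments-high-priority",
    "invoice.created": "billing-standard",
}

func route(eventType string) string { return routes[eventType] }

func TestCriticalEventsRouteToHighPriority(t *testing.T) {
    if got := route("payment.failed"); got != "payments-high-priority" {
        t.Fatalf("payment.failed routed to %q, want payments-high-priority", got)
    }
}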

The four golden signals tell you whether your pipeline is functioning. They do not tell you whether it's doing the right thing. Both types of monitoring are necessary.


If you want delivery latency percentiles, per-destination error rates, and queue depth metrics without building the instrumentation layer yourself, GetHook surfaces all four golden signals out of the box. Configure your sources and destinations, and you get a pre-built health dashboard on day one.

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.