
Webhook Observability: Metrics, Logs, and Alerts You Actually Need

Most teams only discover webhook problems when customers complain. Here's how to build proper observability for your webhook infrastructure — the signals that matter, the dashboards that help, and the alerts that wake you up at the right time.

Lena Hartmann
Infrastructure Engineer
January 29, 2026
10 min read

Webhooks are invisible by default. A request comes in, processing happens (or doesn't), and the only way you know something went wrong is when a customer emails you saying their order didn't ship.

The problem isn't that webhooks are hard to observe — it's that most teams don't instrument them properly until there's an incident. By then, you're flying blind during the investigation.

This guide covers the observability stack for webhook infrastructure: what to measure, how to log it, and which alerts are worth setting up.


The Three Layers of Webhook Observability

Layer 1: Delivery Metrics

These tell you whether events are reaching their destinations:

| Metric | Description | Target |
|---|---|---|
| delivery.success_rate | % of deliveries that return 2xx on first attempt | > 97% |
| delivery.retry_rate | % of events that require at least one retry | < 5% |
| delivery.dead_letter_rate | % of events that exhaust all retries | < 0.3% |
| delivery.latency_p50 | Median time from event receipt to first delivery attempt | < 500ms |
| delivery.latency_p99 | 99th percentile delivery latency | < 5s |
| delivery.throughput | Events delivered per second | (baseline) |

Layer 2: Ingest Metrics

These tell you whether events are arriving correctly at your ingest layer:

| Metric | Description | Target |
|---|---|---|
| ingest.requests_per_second | Rate of incoming webhook events | (baseline) |
| ingest.signature_failures | Events rejected due to HMAC mismatch | < 0.1% |
| ingest.oversized_payloads | Requests rejected due to payload size | < 0.01% |
| ingest.latency_p99 | Time to accept and persist an event | < 200ms |

Layer 3: Queue Metrics

These tell you about the health of your delivery pipeline:

| Metric | Description | Target |
|---|---|---|
| queue.depth | Events waiting to be delivered | < 1000 (typically) |
| queue.age_max | Age of oldest undelivered event | < 60s (steady state) |
| queue.worker_utilization | % of workers actively delivering | < 80% sustained |
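
If you're instrumenting this yourself, defining the underlying metrics takes only a few lines. Here's a minimal sketch using Prometheus' Go client — metric and label names are illustrative, chosen to mirror the tables above:

```go
package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Counter per delivery outcome; success_rate is derived at query time
    // as success / (success + failure).
    DeliveryAttempts = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "delivery_attempts_total",
        Help: "Delivery attempts by outcome and destination.",
    }, []string{"outcome", "destination_id"})

    // Histogram for delivery latency; p50/p99 come from histogram quantiles.
    DeliveryLatency = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "delivery_latency_seconds",
        Help:    "Time from event receipt to first delivery attempt.",
        Buckets: prometheus.DefBuckets,
    })

    // Gauge for queue depth, updated by the worker pool.
    QueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "queue_depth",
        Help: "Events waiting to be delivered.",
    })
)
```

Derived values like success_rate and latency_p99 are then computed at query time rather than in application code.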

What to Log for Each Event

Every event should produce structured log entries at three stages: receipt, delivery attempt, and final outcome.

Ingest Log

```json
{
  "timestamp": "2026-01-29T14:22:01.332Z",
  "event": "ingest.received",
  "event_id": "evt_01HX...",
  "source_id": "src_abc123",
  "source_name": "stripe-production",
  "payload_bytes": 1842,
  "signature_valid": true,
  "latency_ms": 12
}
```

Delivery Attempt Log

```json
{
  "timestamp": "2026-01-29T14:22:01.450Z",
  "event": "delivery.attempted",
  "event_id": "evt_01HX...",
  "destination_id": "dst_xyz789",
  "destination_url": "https://api.acme.com/webhooks",
  "attempt_number": 1,
  "http_status": 200,
  "outcome": "success",
  "response_latency_ms": 87,
  "next_attempt_at": null
}
```

Delivery Failure Log

```json
{
  "timestamp": "2026-01-29T14:35:22.110Z",
  "event": "delivery.failed",
  "event_id": "evt_01HX...",
  "destination_id": "dst_xyz789",
  "attempt_number": 2,
  "http_status": 503,
  "outcome": "http_5xx",
  "response_body": "Service Unavailable",
  "next_attempt_at": "2026-01-29T14:37:22.000Z",
  "will_retry": true
}
```

Why structured logs matter

Structured JSON logs are machine-readable. You can query them with any log aggregation tool — Datadog, Grafana Loki, CloudWatch Logs Insights, Elasticsearch — using field-level filters.

Unstructured logs like "Delivered event evt_01HX to dst_xyz789 in 87ms" are readable by humans but unqueryable at scale.
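
In Go, for example, the standard library's log/slog package emits exactly this shape. A minimal sketch of the delivery-attempt entry above (field values are placeholders):

```go
package main

import (
    "log/slog"
    "os"
)

func main() {
    // The JSON handler writes one structured object per log line.
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // Mirrors the delivery.attempted entry shown earlier.
    logger.Info("delivery.attempted",
        slog.String("event_id", "evt_01HX..."),
        slog.String("destination_id", "dst_xyz789"),
        slog.Int("attempt_number", 1),
        slog.Int("http_status", 200),
        slog.String("outcome", "success"),
        slog.Int("response_latency_ms", 87),
    )
}
```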


Building the Observability Dashboard

A webhook observability dashboard should answer four questions at a glance:

1. Is everything working right now?

Current success rate (last 5 minutes) vs. the last 24-hour baseline. A drop of more than 3% from baseline is worth investigating.

2. Is there a backlog building?

Queue depth over time. A queue depth that rises and doesn't come back down means workers can't keep up with ingest volume — either you need to scale out workers, or there's a systemic delivery failure.

3. Which destinations are struggling?

Per-destination error rate for the last hour. This lets you pinpoint whether a problem is systemic or isolated to one customer's endpoint.

4. How long is delivery taking?

P50/P95/P99 delivery latency. Spikes in P99 often indicate destination slowness before full failures start.

Dashboard layout (recommended)

Row 1: Current health
  - Overall success rate (last 5m) — big number, green/red
  - Queue depth — gauge
  - Dead-letter count (last hour) — counter

Row 2: Trends (last 24h)
  - Delivery success rate — line chart
  - Events per second (ingest + delivered) — area chart
  - Delivery latency P50/P95/P99 — line chart

Row 3: Per-destination breakdown
  - Table: destination name | success rate | avg latency | DLQ count

Alerts That Actually Matter

The goal of alerting is to be woken up exactly when action is required — not for every transient blip, and not too late to prevent customer impact.

Alert 1: Delivery success rate drop

```yaml
condition: delivery.success_rate(5m) < 0.95
severity: P1 (page on-call)
message: "Webhook delivery success rate dropped to {value}% in the last 5 minutes"
```

A sustained drop below 95% means a significant fraction of events are failing. This is customer-facing.

Alert 2: Dead-letter queue accumulating

```yaml
condition: increase(delivery.dead_letter_count[1h]) > 10
severity: P2 (Slack notification)
message: "DLQ accumulating: {count} events dead-lettered in the last hour"
```

A rising DLQ rate isn't immediately customer-impacting (dead-lettered events can still be replayed), but left unaddressed it becomes one.

Alert 3: Queue depth not draining

```yaml
condition: queue.depth > 5000 AND delta(queue.depth[10m]) > 0
severity: P1
message: "Webhook queue depth is {depth} and growing — workers may be stuck"
```

A growing queue with no drain is a sign the worker process has stalled or crashed.

Alert 4: Ingest latency spike

```yaml
condition: ingest.latency_p99(5m) > 1000ms
severity: P2
message: "Webhook ingest P99 latency is {value}ms — events may be slow to accept"
```

Slow ingest means upstream providers are left waiting for an acknowledgment. Providers with short timeouts (like Shopify's 5s) may start timing out and retrying, which creates duplicate deliveries.

Alert 5: Signature failure spike

```yaml
condition: ingest.signature_failure_rate(5m) > 0.001
severity: P2
message: "Webhook signature failures at {rate}% — possible misconfiguration or attack"
```

A sudden spike in signature failures could mean a secret was rotated upstream without updating your config, or someone is probing your endpoint.
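
The counter behind this alert is typically incremented right inside signature verification. A sketch, assuming HMAC-SHA256 and a hypothetical SignatureFailures counter (the header name varies by provider):

```go
import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "net/http"
)

// verifySignature checks the HMAC-SHA256 signature and records failures,
// giving the signature-failure alert a signal to fire on.
func verifySignature(r *http.Request, body, secret []byte) bool {
    mac := hmac.New(sha256.New, secret)
    mac.Write(body)
    expected := hex.EncodeToString(mac.Sum(nil))

    // Header name is a placeholder; check your provider's docs.
    got := r.Header.Get("X-Webhook-Signature")

    // hmac.Equal gives a constant-time comparison.
    if !hmac.Equal([]byte(got), []byte(expected)) {
        SignatureFailures.Inc() // hypothetical counter, e.g. a promauto.NewCounter
        return false
    }
    return true
}
```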


Sampling Strategy for High-Volume Systems

At 1M+ events/day, logging every event is expensive. Use a tiered sampling strategy:

| Event type | Sample rate |
|---|---|
| Successful deliveries (first attempt) | 1% |
| Successful deliveries (retry) | 100% |
| Failed deliveries | 100% |
| Dead-letter events | 100% |
| Signature failures | 100% |

This captures everything you need for debugging while reducing log volume by ~95%.

Always log 100% of failures and retries. You can reduce success-path sampling significantly without losing signal quality.
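
The decision itself is a few lines of code. A sketch of the tiered policy from the table above (the outcome strings are illustrative):

```go
import "math/rand"

// shouldLog implements the tiered sampling table: keep every failure,
// retry, and dead-letter event; sample first-attempt successes at 1%.
func shouldLog(outcome string, attemptNumber int) bool {
    if outcome != "success" {
        return true // failures, dead-letters, signature failures: 100%
    }
    if attemptNumber > 1 {
        return true // successful retries: 100%
    }
    return rand.Float64() < 0.01 // first-attempt successes: 1%
}
```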


Correlating Webhook Events with Application Traces

For distributed tracing (OpenTelemetry, Datadog APM, etc.), propagate trace context through the webhook pipeline:

On ingest: Extract traceparent header if present (some webhook senders include it). Create a new root span otherwise. Attach event_id to the span.
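
A minimal ingest-side sketch with the OpenTelemetry Go SDK (the handler shape, tracer name, and ID helper are illustrative):

```go
import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/trace"
)

func ingestHandler(w http.ResponseWriter, r *http.Request) {
    // Continue the sender's trace if a traceparent header is present; assumes
    // otel.SetTextMapPropagator(propagation.TraceContext{}) ran at startup.
    ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

    eventID := newEventID() // hypothetical ID generator

    // Root span (or child of the sender's span) for this event's lifecycle.
    ctx, span := otel.Tracer("webhook-ingest").Start(ctx, "webhook.ingest",
        trace.WithAttributes(attribute.String("event.id", eventID)))
    defer span.End()

    // ... verify signature, persist the event, enqueue delivery with ctx ...
    _ = ctx
    w.WriteHeader(http.StatusAccepted)
}
```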

On delivery: Create a child span under the original ingest span. This connects the delivery attempt to the original receipt in your trace waterfall.

```go
import (
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// tracer is a trace.Tracer obtained once at startup, e.g. otel.Tracer("webhook-delivery").
ctx, span := tracer.Start(ctx, "webhook.deliver",
    trace.WithAttributes(
        attribute.String("event.id", eventID),
        attribute.String("destination.id", destID),
        attribute.Int("attempt.number", attemptNum),
    ),
)
defer span.End()
```

This lets you query "show me all traces where webhook delivery took more than 2 seconds" — invaluable for diagnosing slow destinations.


GetHook's Built-in Observability

GetHook provides structured observability out of the box:

  • Event timeline — per-event view of all delivery attempts, response codes, latencies, and outcomes
  • Destination health — per-destination success rate, error rate, and P99 latency over the last 24 hours
  • DLQ dashboard — dead-letter events with one-click replay and root-cause details
  • Webhook logs — structured, searchable delivery logs retained for 90 days

For teams that want to export to their own observability stack, GetHook supports log forwarding to any HTTP endpoint — pipe events into Datadog, Elastic, or your internal log system.

View GetHook observability features →

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.