Webhooks are invisible by default. A request comes in, processing happens (or doesn't), and the only way you know something went wrong is when a customer emails you saying their order didn't ship.
The problem isn't that webhooks are hard to observe — it's that most teams don't instrument them properly until there's an incident. By then, you're flying blind during the investigation.
This guide covers the observability stack for webhook infrastructure: what to measure, how to log it, and which alerts are worth setting up.
The Three Layers of Webhook Observability
Layer 1: Delivery Metrics
These tell you whether events are reaching their destinations:
| Metric | Description | Target |
|---|---|---|
| `delivery.success_rate` | % of deliveries that return 2xx on first attempt | > 97% |
| `delivery.retry_rate` | % of events that require at least one retry | < 5% |
| `delivery.dead_letter_rate` | % of events that exhaust all retries | < 0.3% |
| `delivery.latency_p50` | Median time from event receipt to first delivery attempt | < 500ms |
| `delivery.latency_p99` | 99th percentile delivery latency | < 5s |
| `delivery.throughput` | Events delivered per second | (baseline) |
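As a concrete sketch, here's one way to emit these with Prometheus' Go client (`prometheus/client_golang`). Metric and label names are illustrative, not a fixed schema; `delivery.success_rate` then falls out as a counter ratio at query time rather than being recorded directly:

```go
// Delivery-layer instrumentation sketch using prometheus/client_golang.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	deliveryAttempts = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "delivery_attempts_total",
			Help: "Delivery attempts by outcome (success, http_5xx, timeout, ...).",
		},
		[]string{"outcome"},
	)
	deliveryLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "delivery_latency_seconds",
		Help:    "Time from event receipt to first delivery attempt.",
		Buckets: prometheus.ExponentialBuckets(0.05, 2, 10), // 50ms .. ~25.6s
	})
)

func init() {
	prometheus.MustRegister(deliveryAttempts, deliveryLatency)
}

// RecordDelivery is called once per delivery attempt.
func RecordDelivery(outcome string, receivedAt time.Time) {
	deliveryAttempts.WithLabelValues(outcome).Inc()
	deliveryLatency.Observe(time.Since(receivedAt).Seconds())
}
```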
Layer 2: Ingest Metrics
These tell you whether events are arriving correctly at your ingest layer:
| Metric | Description | Target |
|---|---|---|
| `ingest.requests_per_second` | Rate of incoming webhook events | (baseline) |
| `ingest.signature_failures` | Events rejected due to HMAC mismatch | < 0.1% |
| `ingest.oversized_payloads` | Requests rejected due to payload size | < 0.01% |
| `ingest.latency_p99` | Time to accept and persist an event | < 200ms |
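`ingest.latency_p99` comes from timing the accept-and-persist path inside the handler itself. A minimal sketch, assuming a hypothetical `persistEvent` helper that validates and durably stores the event:

```go
// Ingest-latency sketch: time the accept-and-persist path and record it in
// the histogram behind ingest.latency_p99.
package ingest

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var ingestLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "ingest_latency_seconds",
	Help:    "Time to accept and persist an incoming event.",
	Buckets: prometheus.DefBuckets,
})

func init() { prometheus.MustRegister(ingestLatency) }

// persistEvent is hypothetical: verify the signature, write the event to
// durable storage, and enqueue it for delivery.
func persistEvent(r *http.Request) error { return nil }

func ingestHandler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	if err := persistEvent(r); err != nil {
		http.Error(w, "failed to persist event", http.StatusInternalServerError)
		return
	}
	ingestLatency.Observe(time.Since(start).Seconds())
	w.WriteHeader(http.StatusAccepted)
}
```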
Layer 3: Queue Metrics
These tell you about the health of your delivery pipeline:
| Metric | Description | Target |
|---|---|---|
| `queue.depth` | Events waiting to be delivered | < 1000 (typically) |
| `queue.age_max` | Age of oldest undelivered event | < 60s (steady state) |
| `queue.worker_utilization` | % of workers actively delivering | < 80% sustained |
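Unlike the counters above, depth and age are point-in-time values, so they're exported as gauges and refreshed on a ticker. A sketch assuming hypothetical `queueDepth` and `oldestEnqueuedAt` accessors into your queue backend:

```go
// Queue-health sketch: depth and oldest-event age exported as gauges,
// sampled periodically rather than incremented per event.
package queuemetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	depthGauge = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "queue_depth",
		Help: "Events waiting to be delivered.",
	})
	ageGauge = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "queue_age_max_seconds",
		Help: "Age of the oldest undelivered event.",
	})
)

func init() { prometheus.MustRegister(depthGauge, ageGauge) }

// Hypothetical accessors into the queue backend.
func queueDepth() int             { return 0 }
func oldestEnqueuedAt() time.Time { return time.Now() }

// Poll samples queue health every interval until stop is closed.
func Poll(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			depthGauge.Set(float64(queueDepth()))
			ageGauge.Set(time.Since(oldestEnqueuedAt()).Seconds())
		case <-stop:
			return
		}
	}
}
```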
What to Log for Each Event
Every event should produce structured log entries at three stages: receipt, delivery attempt, and final outcome.
Ingest Log
```json
{
  "timestamp": "2026-01-29T14:22:01.332Z",
  "event": "ingest.received",
  "event_id": "evt_01HX...",
  "source_id": "src_abc123",
  "source_name": "stripe-production",
  "payload_bytes": 1842,
  "signature_valid": true,
  "latency_ms": 12
}
```

Delivery Attempt Log
```json
{
  "timestamp": "2026-01-29T14:22:01.450Z",
  "event": "delivery.attempted",
  "event_id": "evt_01HX...",
  "destination_id": "dst_xyz789",
  "destination_url": "https://api.acme.com/webhooks",
  "attempt_number": 1,
  "http_status": 200,
  "outcome": "success",
  "response_latency_ms": 87,
  "next_attempt_at": null
}
```

Delivery Failure Log
```json
{
  "timestamp": "2026-01-29T14:35:22.110Z",
  "event": "delivery.failed",
  "event_id": "evt_01HX...",
  "destination_id": "dst_xyz789",
  "attempt_number": 2,
  "http_status": 503,
  "outcome": "http_5xx",
  "response_body": "Service Unavailable",
  "next_attempt_at": "2026-01-29T14:37:22.000Z",
  "will_retry": true
}
```

Why structured logs matter
Structured JSON logs are machine-readable. You can query them with any log aggregation tool — Datadog, Grafana Loki, CloudWatch Logs Insights, Elasticsearch — using field-level filters.
Unstructured logs like "Delivered event evt_01HX to dst_xyz789 in 87ms" are readable by humans but unqueryable at scale.
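In Go, the standard library's `log/slog` JSON handler is enough to produce entries like the ones above. A minimal sketch (slog emits the event name under its standard `msg` key; use `HandlerOptions.ReplaceAttr` if you need it named `event` exactly):

```go
// Structured-logging sketch with the standard library's log/slog JSON
// handler. Field values mirror the ingest example above.
package main

import (
	"log/slog"
	"os"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Emits one JSON object per call, with each key queryable downstream.
	logger.Info("ingest.received",
		"event_id", "evt_01HX...",
		"source_id", "src_abc123",
		"source_name", "stripe-production",
		"payload_bytes", 1842,
		"signature_valid", true,
		"latency_ms", 12,
	)
}
```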
Building the Observability Dashboard
A webhook observability dashboard should answer four questions at a glance:
1. Is everything working right now?
Current success rate (last 5 minutes) vs. the last 24-hour baseline. A drop of more than 3% from baseline is worth investigating.
2. Is there a backlog building?
Queue depth over time. A rising queue depth that doesn't decline means workers can't keep up with ingest volume — either scale workers or look for a systemic delivery failure.
3. Which destinations are struggling?
Per-destination error rate for the last hour. This lets you pinpoint whether a problem is systemic or isolated to one customer's endpoint.
4. How long is delivery taking?
P50/P95/P99 delivery latency. Spikes in P99 often indicate destination slowness before full failures start.
Dashboard layout (recommended)
Row 1: Current health
- Overall success rate (last 5m) — big number, green/red
- Queue depth — gauge
- Dead-letter count (last hour) — counter
Row 2: Trends (last 24h)
- Delivery success rate — line chart
- Events per second (ingest + delivered) — area chart
- Delivery latency P50/P95/P99 — line chart
Row 3: Per-destination breakdown
- Table: destination name | success rate | avg latency | DLQ count

Alerts That Actually Matter
The goal of alerting is to be woken up exactly when action is required — not for every transient blip, and not too late to prevent customer impact.
Alert 1: Delivery success rate drop
```yaml
condition: delivery.success_rate(5m) < 0.95
severity: P1 (page on-call)
message: "Webhook delivery success rate dropped to {value}% in the last 5 minutes"
```

A sustained drop below 95% means a significant fraction of events are failing. This is customer-facing.
Alert 2: Dead-letter queue accumulating
```yaml
condition: rate(delivery.dead_letter_count[1h]) > 10
severity: P2 (Slack notification)
message: "DLQ accumulating: {count} events in dead-letter in the last hour"
```

A rising DLQ rate isn't immediately customer-impacting (retries are still happening), but left unaddressed it becomes exactly that.
Alert 3: Queue depth not draining
```yaml
condition: queue.depth > 5000 AND rate(queue.depth[10m]) > 0
severity: P1
message: "Webhook queue depth is {depth} and growing — workers may be stuck"
```

A growing queue with no drain is a sign the worker process has stalled or crashed.
Alert 4: Ingest latency spike
```yaml
condition: ingest.latency_p99(5m) > 1000ms
severity: P2
message: "Webhook ingest P99 latency is {value}ms — events may be slow to accept"
```

If ingest is slow to accept events, upstream providers with short timeouts (like Shopify's 5s) may start timing out and retrying, compounding the load.
Alert 5: Signature failure spike
```yaml
condition: rate(ingest.signature_failures[5m]) > 0.01
severity: P2
message: "Webhook signature failures at {rate}% — possible misconfiguration or attack"
```

A sudden spike in signature failures could mean a secret was rotated upstream without updating your config, or someone is probing your endpoint.
Sampling Strategy for High-Volume Systems
At 1M+ events/day, logging every event is expensive. Use a tiered sampling strategy:
| Event type | Sample rate |
|---|---|
| Successful deliveries (first attempt) | 1% |
| Successful deliveries (retry) | 100% |
| Failed deliveries | 100% |
| Dead-letter events | 100% |
| Signature failures | 100% |
This captures everything you need for debugging while reducing log volume by ~95%.
Always log 100% of failures and retries. You can reduce success-path sampling significantly without losing signal quality.
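The decision itself is a few lines. A sketch with a hypothetical `shouldLog` function, using the outcome names from the logs above:

```go
// Tiered log-sampling sketch implementing the table above: keep all
// failures and retried successes, sample ~1% of first-attempt successes.
package logging

import "math/rand/v2"

func shouldLog(outcome string, attempt int) bool {
	switch {
	case outcome != "success": // failures, DLQ, signature errors: always log
		return true
	case attempt > 1: // successes that needed a retry: always log
		return true
	default: // first-attempt successes: sample 1%
		return rand.Float64() < 0.01
	}
}
```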
Correlating Webhook Events with Application Traces
For distributed tracing (OpenTelemetry, Datadog APM, etc.), propagate trace context through the webhook pipeline:
On ingest: extract the `traceparent` header if present (some webhook senders include it); otherwise start a new root span. Attach the `event_id` to the span.
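A sketch of that ingest step with the OpenTelemetry Go SDK (the handler wiring and `event_id` value are illustrative; if no `traceparent` arrives, the extracted context carries no remote parent and the span below becomes a new root):

```go
// Ingest-side trace propagation sketch (OpenTelemetry Go SDK). Assumes a
// configured global TracerProvider and TextMapPropagator.
package ingest

import (
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/propagation"
)

func handleWebhook(w http.ResponseWriter, r *http.Request) {
	// Pull traceparent/tracestate out of the request headers, if present.
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

	// Child of the sender's trace when traceparent was present; a new root otherwise.
	ctx, span := otel.Tracer("webhook-ingest").Start(ctx, "webhook.ingest")
	defer span.End()

	span.SetAttributes(attribute.String("event.id", "evt_01HX...")) // illustrative ID

	// ... verify signature, persist, and enqueue, passing ctx downstream
	_ = ctx
	w.WriteHeader(http.StatusAccepted)
}
```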
On delivery: Create a child span under the original ingest span. This connects the delivery attempt to the original receipt in your trace waterfall.
```go
ctx, span := tracer.Start(ctx, "webhook.deliver",
	trace.WithAttributes(
		attribute.String("event.id", eventID),
		attribute.String("destination.id", destID),
		attribute.Int("attempt.number", attemptNum),
	),
)
defer span.End()
```

This lets you query "show me all traces where webhook delivery took more than 2 seconds" — invaluable for diagnosing slow destinations.
GetHook's Built-in Observability
GetHook provides structured observability out of the box:
- **Event timeline** — per-event view of all delivery attempts, response codes, latencies, and outcomes
- **Destination health** — per-destination success rate, error rate, and P99 latency over the last 24 hours
- **DLQ dashboard** — dead-letter events with one-click replay and root-cause details
- **Webhook logs** — structured, searchable delivery logs retained for 90 days
For teams that want to export to their own observability stack, GetHook supports log forwarding to any HTTP endpoint — pipe events into Datadog, Elastic, or your internal log system.