
Webhook Observability: Metrics, Logs, and Alerts You Actually Need

Most teams only discover webhook problems when customers complain. Here's how to build proper observability for your webhook infrastructure — the signals that matter, the dashboards that help, and the alerts that wake you up at the right time.

Lena Hartmann
Infrastructure Engineer
January 29, 2026
10 min read

Webhooks are invisible by default. A request comes in, processing happens (or doesn't), and the only way you know something went wrong is when a customer emails you saying their order didn't ship.

The problem isn't that webhooks are hard to observe — it's that most teams don't instrument them properly until there's an incident. By then, you're flying blind during the investigation.

This guide covers the observability stack for webhook infrastructure: what to measure, how to log it, and which alerts are worth setting up.


The Three Layers of Webhook Observability

Layer 1: Delivery Metrics

These tell you whether events are reaching their destinations:

| Metric | Description | Target |
|---|---|---|
| delivery.success_rate | % of deliveries that return 2xx on first attempt | > 97% |
| delivery.retry_rate | % of events that require at least one retry | < 5% |
| delivery.dead_letter_rate | % of events that exhaust all retries | < 0.3% |
| delivery.latency_p50 | Median time from event receipt to first delivery attempt | < 500ms |
| delivery.latency_p99 | 99th percentile delivery latency | < 5s |
| delivery.throughput | Events delivered per second | (baseline) |

Layer 2: Ingest Metrics

These tell you whether events are arriving correctly at your ingest layer:

| Metric | Description | Target |
|---|---|---|
| ingest.requests_per_second | Rate of incoming webhook events | (baseline) |
| ingest.signature_failures | Events rejected due to HMAC mismatch | < 0.1% |
| ingest.oversized_payloads | Requests rejected due to payload size | < 0.01% |
| ingest.latency_p99 | Time to accept and persist an event | < 200ms |

Layer 3: Queue Metrics

These tell you about the health of your delivery pipeline:

| Metric | Description | Target |
|---|---|---|
| queue.depth | Events waiting to be delivered | < 1000 (typically) |
| queue.age_max | Age of oldest undelivered event | < 60s (steady state) |
| queue.worker_utilization | % of workers actively delivering | < 80% sustained |
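
If you're instrumenting this yourself, defining the underlying metrics takes only a few lines. Here's a minimal sketch using Prometheus' Go client — metric and label names are illustrative, chosen to mirror the tables above:

```go
package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Counter per delivery outcome; success_rate is derived at query time
    // as success / (success + failure).
    DeliveryAttempts = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "delivery_attempts_total",
        Help: "Delivery attempts by outcome and destination.",
    }, []string{"outcome", "destination_id"})

    // Histogram for delivery latency; p50/p99 come from histogram quantiles.
    DeliveryLatency = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "delivery_latency_seconds",
        Help:    "Time from event receipt to first delivery attempt.",
        Buckets: prometheus.DefBuckets,
    })

    // Gauge for queue depth, updated by the worker pool.
    QueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "queue_depth",
        Help: "Events waiting to be delivered.",
    })
)
```

Derived values like success_rate and latency_p99 are then computed at query time rather than in application code.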

What to Log for Each Event

Every event should produce structured log entries at three stages: receipt, delivery attempt, and final outcome.

Ingest Log

```json
{
  "timestamp": "2026-01-29T14:22:01.332Z",
  "event": "ingest.received",
  "event_id": "evt_01HX...",
  "source_id": "src_abc123",
  "source_name": "stripe-production",
  "payload_bytes": 1842,
  "signature_valid": true,
  "latency_ms": 12
}
```

Delivery Attempt Log

```json
{
  "timestamp": "2026-01-29T14:22:01.450Z",
  "event": "delivery.attempted",
  "event_id": "evt_01HX...",
  "destination_id": "dst_xyz789",
  "destination_url": "https://api.acme.com/webhooks",
  "attempt_number": 1,
  "http_status": 200,
  "outcome": "success",
  "response_latency_ms": 87,
  "next_attempt_at": null
}
```

Delivery Failure Log

```json
{
  "timestamp": "2026-01-29T14:35:22.110Z",
  "event": "delivery.failed",
  "event_id": "evt_01HX...",
  "destination_id": "dst_xyz789",
  "attempt_number": 2,
  "http_status": 503,
  "outcome": "http_5xx",
  "response_body": "Service Unavailable",
  "next_attempt_at": "2026-01-29T14:37:22.000Z",
  "will_retry": true
}
```

Why structured logs matter

Structured JSON logs are machine-readable. You can query them with any log aggregation tool — Datadog, Grafana Loki, CloudWatch Logs Insights, Elasticsearch — using field-level filters.

Unstructured logs like "Delivered event evt_01HX to dst_xyz789 in 87ms" are readable by humans but unqueryable at scale.
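
In Go, for example, the standard library's log/slog package emits exactly this shape. A minimal sketch of the delivery-attempt entry above (field values are placeholders):

```go
package main

import (
    "log/slog"
    "os"
)

func main() {
    // The JSON handler writes one structured object per log line.
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // Mirrors the delivery.attempted entry shown earlier.
    logger.Info("delivery.attempted",
        slog.String("event_id", "evt_01HX..."),
        slog.String("destination_id", "dst_xyz789"),
        slog.Int("attempt_number", 1),
        slog.Int("http_status", 200),
        slog.String("outcome", "success"),
        slog.Int("response_latency_ms", 87),
    )
}
```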


Building the Observability Dashboard

A webhook observability dashboard should answer four questions at a glance:

1. Is everything working right now?

Current success rate (last 5 minutes) vs. the last 24-hour baseline. A drop of more than 3% from baseline is worth investigating.

2. Is there a backlog building?

Queue depth over time. A queue depth that rises and doesn't come back down means workers can't keep up with ingest volume — either you need to scale out workers, or there's a systemic delivery failure.

3. Which destinations are struggling?

Per-destination error rate for the last hour. This lets you pinpoint whether a problem is systemic or isolated to one customer's endpoint.

4. How long is delivery taking?

P50/P95/P99 delivery latency. Spikes in P99 often indicate destination slowness before full failures start.

Dashboard layout (recommended)

Row 1: Current health
  - Overall success rate (last 5m) — big number, green/red
  - Queue depth — gauge
  - Dead-letter count (last hour) — counter

Row 2: Trends (last 24h)
  - Delivery success rate — line chart
  - Events per second (ingest + delivered) — area chart
  - Delivery latency P50/P95/P99 — line chart

Row 3: Per-destination breakdown
  - Table: destination name | success rate | avg latency | DLQ count

Alerts That Actually Matter

The goal of alerting is to be woken up exactly when action is required — not for every transient blip, and not too late to prevent customer impact.

Alert 1: Delivery success rate drop

```yaml
condition: delivery.success_rate(5m) < 0.95
severity: P1 (page on-call)
message: "Webhook delivery success rate dropped to {value}% in the last 5 minutes"
```

A sustained drop below 95% means a significant fraction of events are failing. This is customer-facing.

Alert 2: Dead-letter queue accumulating

```yaml
condition: increase(delivery.dead_letter_count[1h]) > 10
severity: P2 (Slack notification)
message: "DLQ accumulating: {count} events dead-lettered in the last hour"
```

A rising DLQ rate isn't immediately customer-impacting (dead-lettered events can still be replayed), but left unaddressed it becomes one.

Alert 3: Queue depth not draining

```yaml
condition: queue.depth > 5000 AND delta(queue.depth[10m]) > 0
severity: P1
message: "Webhook queue depth is {depth} and growing — workers may be stuck"
```

A growing queue with no drain is a sign the worker process has stalled or crashed.

Alert 4: Ingest latency spike

```yaml
condition: ingest.latency_p99(5m) > 1000ms
severity: P2
message: "Webhook ingest P99 latency is {value}ms — events may be slow to accept"
```

Slow ingest means upstream providers are left waiting for an acknowledgment. Providers with short timeouts (like Shopify's 5s) may start timing out and retrying, which creates duplicate deliveries.

Alert 5: Signature failure spike

```yaml
condition: ingest.signature_failure_rate(5m) > 0.001
severity: P2
message: "Webhook signature failures at {rate}% — possible misconfiguration or attack"
```

A sudden spike in signature failures could mean a secret was rotated upstream without updating your config, or someone is probing your endpoint.
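
The counter behind this alert is typically incremented right inside signature verification. A sketch, assuming HMAC-SHA256 and a hypothetical SignatureFailures counter (the header name varies by provider):

```go
import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "net/http"
)

// verifySignature checks the HMAC-SHA256 signature and records failures,
// giving the signature-failure alert a signal to fire on.
func verifySignature(r *http.Request, body, secret []byte) bool {
    mac := hmac.New(sha256.New, secret)
    mac.Write(body)
    expected := hex.EncodeToString(mac.Sum(nil))

    // Header name is a placeholder; check your provider's docs.
    got := r.Header.Get("X-Webhook-Signature")

    // hmac.Equal gives a constant-time comparison.
    if !hmac.Equal([]byte(got), []byte(expected)) {
        SignatureFailures.Inc() // hypothetical counter, e.g. a promauto.NewCounter
        return false
    }
    return true
}
```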


Sampling Strategy for High-Volume Systems

At 1M+ events/day, logging every event is expensive. Use a tiered sampling strategy:

| Event type | Sample rate |
|---|---|
| Successful deliveries (first attempt) | 1% |
| Successful deliveries (retry) | 100% |
| Failed deliveries | 100% |
| Dead-letter events | 100% |
| Signature failures | 100% |

This captures everything you need for debugging while reducing log volume by ~95%.

Always log 100% of failures and retries. You can reduce success-path sampling significantly without losing signal quality.
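
The decision itself is a few lines of code. A sketch of the tiered policy from the table above (the outcome strings are illustrative):

```go
import "math/rand"

// shouldLog implements the tiered sampling table: keep every failure,
// retry, and dead-letter event; sample first-attempt successes at 1%.
func shouldLog(outcome string, attemptNumber int) bool {
    if outcome != "success" {
        return true // failures, dead-letters, signature failures: 100%
    }
    if attemptNumber > 1 {
        return true // successful retries: 100%
    }
    return rand.Float64() < 0.01 // first-attempt successes: 1%
}
```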


Correlating Webhook Events with Application Traces

For distributed tracing (OpenTelemetry, Datadog APM, etc.), propagate trace context through the webhook pipeline:

On ingest: Extract traceparent header if present (some webhook senders include it). Create a new root span otherwise. Attach event_id to the span.
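
A minimal ingest-side sketch with the OpenTelemetry Go SDK (the handler shape, tracer name, and ID helper are illustrative):

```go
import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/trace"
)

func ingestHandler(w http.ResponseWriter, r *http.Request) {
    // Continue the sender's trace if a traceparent header is present; assumes
    // otel.SetTextMapPropagator(propagation.TraceContext{}) ran at startup.
    ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

    eventID := newEventID() // hypothetical ID generator

    // Root span (or child of the sender's span) for this event's lifecycle.
    ctx, span := otel.Tracer("webhook-ingest").Start(ctx, "webhook.ingest",
        trace.WithAttributes(attribute.String("event.id", eventID)))
    defer span.End()

    // ... verify signature, persist the event, enqueue delivery with ctx ...
    _ = ctx
    w.WriteHeader(http.StatusAccepted)
}
```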

On delivery: Create a child span under the original ingest span. This connects the delivery attempt to the original receipt in your trace waterfall.

```go
import (
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// tracer is a trace.Tracer obtained once at startup, e.g. otel.Tracer("webhook-delivery").
ctx, span := tracer.Start(ctx, "webhook.deliver",
    trace.WithAttributes(
        attribute.String("event.id", eventID),
        attribute.String("destination.id", destID),
        attribute.Int("attempt.number", attemptNum),
    ),
)
defer span.End()
```

This lets you query "show me all traces where webhook delivery took more than 2 seconds" — invaluable for diagnosing slow destinations.


GetHook's Built-in Observability

GetHook provides structured observability out of the box:

  • Event timeline — per-event view of all delivery attempts, response codes, latencies, and outcomes
  • Destination health — per-destination success rate, error rate, and P99 latency over the last 24 hours
  • DLQ dashboard — dead-letter events with one-click replay and root-cause details
  • Webhook logs — structured, searchable delivery logs retained for 90 days

For teams that want to export to their own observability stack, GetHook supports log forwarding to any HTTP endpoint — pipe events into Datadog, Elastic, or your internal log system.

View GetHook observability features →

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.