observability · webhooks · metrics · alerting · reliability

Webhook Consumer Observability: Metrics and Alerts on the Receiving End

Most webhook observability guides focus on the sender. If you're the team receiving webhooks, you have a different set of blind spots — and different metrics to watch. Here's how to instrument your consumer endpoints so failures surface before your users notice them.

Lena Hartmann
Infrastructure Engineer
April 26, 2026
9 min read

When people talk about webhook observability, they almost always mean sender-side visibility: did the event leave the gateway, how many retries did it take, did it reach the destination? That perspective matters — but if you're the team consuming webhooks from Stripe, GitHub, Shopify, or your own platform, the sender's dashboard doesn't tell you what you need to know.

Your consumer endpoint is a black box from the sender's perspective. They see HTTP status codes and response times. You see whatever your application happens to log. That asymmetry means you can have a serious processing failure — duplicate events silently overwriting data, background jobs crashing after responding 200, schema changes breaking deserialization — with no alert firing until a customer complains.

This post covers the metrics, logs, and alert patterns that belong on the consumer side of a webhook integration.


The Consumer's Observability Problem

A webhook consumer has a different failure surface than a webhook sender. Here's what can go wrong on your end, independently of whether the sender thinks delivery succeeded:

| Failure Mode | Sender Sees | You See |
| --- | --- | --- |
| Handler responds 200, background job crashes | Delivery: success | Silent data loss |
| Duplicate event processed twice | Delivery: success (retry) | Double-charge, double-email, etc. |
| Signature verification skipped in a code change | Delivery: success | Unverified payloads processed |
| Schema field renamed upstream | Delivery: success | Handler reads zero-value, wrong behavior |
| Handler times out, returns 500 | Delivery: retry | Increased load, possible duplicate processing |
| Event processed but not acknowledged idempotently | Delivery: retry after timeout | Duplicate side effects |

None of these show up as red in the sender's dashboard. They all require consumer-side instrumentation to detect.


Metric 1: Handler Invocation Rate and Error Rate

The most basic signal is how often your handler is called and how often it returns a non-2xx response.

```go
import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    webhookRequests = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "webhook_consumer_requests_total",
        Help: "Total webhook requests received, by source and event type",
    }, []string{"source", "event_type"})

    webhookErrors = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "webhook_consumer_errors_total",
        Help: "Webhook handler errors, by source, event type, and error class",
    }, []string{"source", "event_type", "error_class"})

    webhookDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "webhook_consumer_duration_seconds",
        Help:    "Handler duration in seconds",
        Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0},
    }, []string{"source", "event_type"})
)

// statusRecorder captures the status code the wrapped handler writes,
// since http.ResponseWriter offers no way to read it back.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

// statusClass buckets a status code into "4xx" or "5xx" for the error_class label.
func statusClass(status int) string {
    if status >= 500 {
        return "5xx"
    }
    return "4xx"
}

func InstrumentedHandler(source string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        eventType := r.Header.Get("X-Event-Type") // adjust per provider
        start := time.Now()

        rw := &statusRecorder{ResponseWriter: w, status: 200}
        next(rw, r)

        duration := time.Since(start).Seconds()
        webhookRequests.WithLabelValues(source, eventType).Inc()
        webhookDuration.WithLabelValues(source, eventType).Observe(duration)

        if rw.status >= 400 {
            webhookErrors.WithLabelValues(source, eventType, statusClass(rw.status)).Inc()
        }
    }
}
```

The source label (e.g., "stripe", "github") is critical. Without it, a spike in errors from one provider is invisible against the baseline from others.
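As a sketch of how the wrapper might be wired up, register one instrumented route per provider and expose the metrics via the standard promhttp handler. Here handleStripe and handleGitHub are hypothetical provider-specific handlers, not names from this article:

```go
import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // One route per provider, so the source label is always populated.
    // handleStripe and handleGitHub are hypothetical handler functions.
    http.HandleFunc("/webhooks/stripe", InstrumentedHandler("stripe", handleStripe))
    http.HandleFunc("/webhooks/github", InstrumentedHandler("github", handleGitHub))

    // Expose the counters and histograms for Prometheus to scrape.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}
```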

Alert when webhook_consumer_errors_total exceeds 5% of webhook_consumer_requests_total over a 5-minute window for any given (source, event_type) pair. A ratio-based threshold catches real failures without paging on a handful of transient network errors.


Metric 2: Processing Latency Distribution

Your handler's response time directly determines whether the sender retries. Most providers time out between 5 and 30 seconds. If your p99 response time is 18 seconds and a provider's timeout is 20 seconds, you have almost no margin for variance — a GC pause or a slow database query puts you into retry territory.

Track latency percentiles, not just averages:

```sql
-- In Postgres, if you log handler durations to a table:
SELECT
    source,
    event_type,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY duration_ms) AS p50_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) AS p99_ms
FROM webhook_handler_log
WHERE received_at >= NOW() - INTERVAL '1 hour'
GROUP BY source, event_type
ORDER BY p99_ms DESC;
```

Alert when your p95 response time for any handler exceeds 50% of the provider's known timeout. For Stripe (30-second timeout), alert at 15 seconds. For GitHub (10-second timeout), alert at 5 seconds.


Metric 3: Signature Verification Failure Rate

Every webhook consumer should verify signatures. What most teams don't track is the failure rate of that verification.

A nonzero verification failure rate means one of three things:

  1. Someone is sending forged requests to your endpoint (your URL leaked or is being probed)
  2. The provider rotated their signing secret and you haven't updated yours
  3. There's a bug in your verification code

```go
import (
    "bytes"
    "io"
    "net/http"
)

var signatureFailures = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "webhook_consumer_signature_failures_total",
    Help: "Requests that failed signature verification",
}, []string{"source"})

func verifyAndHandle(source string, secret []byte, w http.ResponseWriter, r *http.Request) {
    // Cap the body read, then restore it so the downstream handler can read it again.
    body, err := io.ReadAll(io.LimitReader(r.Body, 5<<20))
    if err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }
    r.Body = io.NopCloser(bytes.NewReader(body))

    if !verifySignature(source, body, r.Header, secret) {
        signatureFailures.WithLabelValues(source).Inc()
        http.Error(w, "unauthorized", http.StatusUnauthorized)
        return
    }
    // proceed to handler
}
```
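What verifySignature does depends on the provider. As one hedged illustration, a GitHub-style HMAC-SHA256 check over the raw body looks roughly like this; GitHub really does send an X-Hub-Signature-256 header with a sha256= prefix, but other providers use different headers and encodings, so treat this as a sketch rather than a drop-in:

```go
import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "net/http"
    "strings"
)

// verifySignature compares an HMAC-SHA256 digest of the raw body against the
// signature header. Assumes a GitHub-style "sha256=<hex>" header value;
// adjust the header name and encoding per provider.
func verifySignature(source string, body []byte, h http.Header, secret []byte) bool {
    sig := strings.TrimPrefix(h.Get("X-Hub-Signature-256"), "sha256=")
    mac := hmac.New(sha256.New, secret)
    mac.Write(body)
    want := hex.EncodeToString(mac.Sum(nil))
    // hmac.Equal compares in constant time, so response timing doesn't
    // leak how much of a forged digest matched.
    return hmac.Equal([]byte(want), []byte(sig))
}
```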

Alert immediately on any signature failure in production. A single failure is worth investigating. A burst of failures is an incident.


Metric 4: Duplicate Event Rate

Providers deliver webhooks at-least-once. Retries after timeouts, infrastructure hiccups, and failover events all produce duplicates. If you're not tracking your duplicate rate, you don't know whether your idempotency layer is working.

The right pattern is to log deduplication outcomes at the handler level:

```go
import "context"

var (
    duplicateEvents = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "webhook_consumer_duplicates_total",
        Help: "Events that were deduplicated (already processed)",
    }, []string{"source", "event_type"})

    freshEvents = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "webhook_consumer_fresh_events_total",
        Help: "Events processed for the first time",
    }, []string{"source", "event_type"})
)

func handleWithDedup(ctx context.Context, source, eventType, providerEventID string, process func() error) error {
    inserted, err := insertDedup(ctx, source, providerEventID)
    if err != nil {
        return err
    }
    if !inserted {
        duplicateEvents.WithLabelValues(source, eventType).Inc()
        return nil // already processed, respond 200 to silence retries
    }

    freshEvents.WithLabelValues(source, eventType).Inc()
    return process()
}
```
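insertDedup can be a single Postgres statement whose row count tells you whether the event is new. A minimal sketch, assuming a webhook_dedup table keyed on (source, provider_event_id) and a package-level *sql.DB (both assumptions; the table name and handle are hypothetical):

```go
import (
    "context"
    "database/sql"
)

// db is assumed to be initialized at startup.
var db *sql.DB

// insertDedup records the event key and reports whether this is the first
// sighting. ON CONFLICT DO NOTHING makes the insert a no-op for duplicates,
// so RowsAffected distinguishes fresh events from replays.
//
// Assumed schema:
//   CREATE TABLE webhook_dedup (
//       source            text NOT NULL,
//       provider_event_id text NOT NULL,
//       PRIMARY KEY (source, provider_event_id)
//   );
func insertDedup(ctx context.Context, source, providerEventID string) (bool, error) {
    res, err := db.ExecContext(ctx,
        `INSERT INTO webhook_dedup (source, provider_event_id)
         VALUES ($1, $2)
         ON CONFLICT DO NOTHING`,
        source, providerEventID)
    if err != nil {
        return false, err
    }
    n, err := res.RowsAffected()
    return n > 0, err
}
```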

A normal duplicate rate is 0.1–2%, driven by at-least-once delivery guarantees. A rate above 5% indicates a retry storm from the sender — your endpoint may be responding too slowly, or the sender is experiencing delivery failures on their end.


Metric 5: Processing Queue Depth

If your webhook handler responds 200 and then dispatches work to a background queue (a common pattern to stay within timeout windows), the queue depth for that work is a critical signal.

```sql
-- If you're using a database-backed job queue (e.g., with a webhook_jobs table),
-- alert when pending jobs older than 5 minutes exceed a threshold:

SELECT COUNT(*) AS stuck_jobs
FROM webhook_jobs
WHERE status = 'pending'
  AND created_at < NOW() - INTERVAL '5 minutes';
```

A growing queue of stale pending jobs means your background workers are unhealthy — even though your webhook handler is responding 200 and the sender considers delivery successful.

This is the "silent failure" that bites teams hardest. The sending side sees 100% delivery success. Your customers see stale data and missing notifications. Queue depth is your only window into what's actually happening.
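To make the backlog alertable alongside the Prometheus metrics above, one option is to poll that query and export the count as a gauge. A sketch, assuming the same webhook_jobs table and a *sql.DB handle; the metric name and polling interval are choices, not conventions:

```go
import (
    "context"
    "database/sql"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var stuckJobs = promauto.NewGauge(prometheus.GaugeOpts{
    Name: "webhook_consumer_stuck_jobs",
    Help: "Pending webhook jobs older than five minutes",
})

// pollQueueDepth refreshes the gauge every 30 seconds until ctx is cancelled.
func pollQueueDepth(ctx context.Context, db *sql.DB) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            var n int
            err := db.QueryRowContext(ctx,
                `SELECT COUNT(*) FROM webhook_jobs
                 WHERE status = 'pending'
                   AND created_at < NOW() - INTERVAL '5 minutes'`).Scan(&n)
            if err != nil {
                continue // surfacing query errors is left to your error logging
            }
            stuckJobs.Set(float64(n))
        }
    }
}
```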


Structured Log Fields Every Handler Should Emit

Metrics tell you that something is wrong. Logs tell you what. Every webhook handler invocation should emit a structured log line with:

```json
{
  "event": "webhook.received",
  "source": "stripe",
  "event_type": "payment_intent.succeeded",
  "provider_event_id": "evt_1ABC...",
  "idempotency_outcome": "fresh",
  "signature_valid": true,
  "handler_duration_ms": 43,
  "http_status": 200,
  "request_id": "req_XYZ...",
  "timestamp": "2026-04-26T10:00:00.000Z"
}
```
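With Go's standard log/slog and a JSON handler, emitting that record might look like the sketch below. The field values are placeholders wired through from earlier steps, and note that slog's default JSON output carries the message under "msg" rather than "event":

```go
import (
    "log/slog"
    "time"
)

// logWebhookReceived emits one structured line per handler invocation.
// Assumes the default logger was configured with slog.NewJSONHandler at startup.
func logWebhookReceived(source, eventType, providerEventID, outcome, requestID string,
    signatureValid bool, start time.Time, status int) {
    slog.Info("webhook.received",
        "source", source,
        "event_type", eventType,
        "provider_event_id", providerEventID,
        "idempotency_outcome", outcome, // "fresh" or "duplicate" from the dedup step
        "signature_valid", signatureValid,
        "handler_duration_ms", time.Since(start).Milliseconds(),
        "http_status", status,
        "request_id", requestID,
    )
}
```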

The provider_event_id field is especially important. When a sender (or GetHook's delivery dashboard) shows you a specific event ID and asks "did you process this?", you need to be able to answer in seconds. A log query on provider_event_id should give you the complete history of what happened to that event on your end.


Alert Summary

| Alert | Condition | Severity |
| --- | --- | --- |
| High error rate | Error rate > 5% for any (source, event_type) over 5 min | Warning |
| Handler latency degradation | p95 > 50% of provider timeout | Warning |
| Any signature failure | webhook_consumer_signature_failures_total increases | Critical |
| Duplicate rate spike | Duplicate rate > 10% for any source | Warning |
| Background queue backlog | Pending jobs older than 5 min > threshold | Critical |
| Handler invocation drop | Request rate drops > 50% vs 1-week baseline | Warning |

The last alert — invocation drop — is often overlooked. If Stripe or GitHub stops sending webhooks to your endpoint (firewall change, endpoint URL change, secret mismatch that makes them disable delivery), your error rate goes to zero because no requests arrive. Baselining on historical volume and alerting on drops catches this failure mode.


Consumer-side observability is not glamorous. It's a set of counters, histograms, and structured log fields that most teams add only after a production incident. Add them before that happens.

If you're using GetHook to receive or forward webhooks, the delivery attempt records and structured event logs give you the sender-side view. Pair that with the consumer-side instrumentation above, and you have full end-to-end visibility across the entire webhook lifecycle.

Set up your webhook infrastructure with GetHook →
