Tags: opentelemetry, observability, distributed tracing, debugging

Tracing a Webhook Through Your Entire Stack with OpenTelemetry

Webhook failures are hard to debug because the event touches multiple services before it either delivers or dies. Here's how to wire OpenTelemetry end-to-end so you can follow a single webhook from ingest to delivery in one trace waterfall.

Dmitri Volkov
Distributed Systems Engineer
March 24, 2026
11 min read

A customer opens a support ticket: "Our order fulfillment webhook stopped firing at 3 PM yesterday." You have logs. You have metrics. But the event bounced through an ingest service, a Postgres-backed queue, a delivery worker, an HTTP client, and an external endpoint — and you have no single view that connects all of those hops.

This is the canonical webhook debugging problem. Logs from each service are in separate files or separate dashboards. Correlating them means manually stitching together timestamps and event IDs, and hoping nothing was dropped.

OpenTelemetry gives you a better tool: distributed traces that follow a webhook through every service boundary, represented as a single waterfall. This post shows you how to instrument a webhook pipeline end-to-end — from ingest to delivery — and what you can actually see once it's wired up.


How Trace Context Propagates (and Why Webhooks Are Awkward)

In a typical request/response system, trace context propagates via HTTP headers. The caller adds a traceparent header, the callee extracts it and creates a child span, and the trace grows naturally.
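For reference, a traceparent value is four dash-separated hex fields: version, trace ID, parent span ID, and flags. A minimal sketch that assembles one by hand (BuildTraceparent is a hypothetical helper for illustration, not an OTel API; the IDs are the example values from the W3C spec):

```go
package main

import "fmt"

// BuildTraceparent assembles a W3C traceparent header value: version "00",
// a 32-hex-char trace ID, a 16-hex-char parent span ID, and trace flags
// ("01" = sampled, "00" = not sampled).
func BuildTraceparent(traceID, spanID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

func main() {
	h := BuildTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	fmt.Println(h) // 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
}
```

In practice you never build this string yourself; the propagator injects and extracts it, as the ingest handler below shows.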

Webhooks break this model in two ways:

Problem 1: Ingest is asynchronous. When a provider (Stripe, GitHub, Shopify) sends you a webhook, they don't include a traceparent header — they're not part of your tracing system. The ingest leg and the delivery leg are in different processes, often on different machines, separated by a queue.

Problem 2: You're the bridge. The sender's trace context (if any) and your delivery worker's trace context are disconnected. You need to manufacture the linkage yourself.

The solution is to create a root span at ingest and carry the trace_id through the queue in your event record. When the delivery worker picks up the event, it reconstructs the span context and creates child spans under the same trace.


Instrumenting the Ingest Layer

The ingest handler is where you create the root span. If the sender includes a traceparent header, extract it as the parent. Otherwise, start a new root trace.

```go
import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/trace"
)

func (h *IngestHandler) Handle(w http.ResponseWriter, r *http.Request) {
    // Extract incoming trace context if the sender provided one.
    // Most providers won't, so this creates a new root context.
    prop := otel.GetTextMapPropagator()
    ctx := prop.Extract(r.Context(), propagation.HeaderCarrier(r.Header))

    tracer := otel.Tracer("gethook/ingest")
    ctx, span := tracer.Start(ctx, "webhook.ingest",
        trace.WithSpanKind(trace.SpanKindServer),
        trace.WithAttributes(
            // source was resolved earlier from the request path.
            attribute.String("source.id", source.ID.String()),
            attribute.String("source.name", source.Name),
            attribute.Int("payload.bytes", int(r.ContentLength)),
        ),
    )
    defer span.End()

    // Verify signature, persist event...
    event, err := h.store.CreateEvent(ctx, source.AccountID, payload)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        httpx.InternalError(w, err)
        return
    }

    // Persist the trace context so the delivery worker can link to it.
    traceID := span.SpanContext().TraceID().String()
    spanID := span.SpanContext().SpanID().String()
    if err := h.store.SetEventTraceContext(ctx, event.ID, traceID, spanID); err != nil {
        // Non-fatal: the event is saved, only the trace linkage is lost.
        span.RecordError(err)
    }

    span.SetAttributes(attribute.String("event.id", event.ID.String()))
    httpx.Created(w, event)
}
```

The key line is SetEventTraceContext — you're storing the trace_id and span_id alongside the event in Postgres. This is the bridge across the async boundary.

Schema addition

```sql
ALTER TABLE events
    ADD COLUMN trace_id  TEXT,
    ADD COLUMN span_id   TEXT;
```

These two columns are the only thing you need to reconnect the ingest span to the delivery span. They add minimal storage overhead (32 + 16 hex chars per event).
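The store method behind this could look like the following sketch. The Store type and the method signature are assumptions made to match the handler above, not GetHook's actual code:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// setEventTraceContextSQL persists the IDs created at ingest so the delivery
// worker can rebuild the span context later. Column names match the
// ALTER TABLE statement above.
const setEventTraceContextSQL = `UPDATE events SET trace_id = $1, span_id = $2 WHERE id = $3`

// Store is a hypothetical Postgres-backed event store.
type Store struct{ db *sql.DB }

func (s *Store) SetEventTraceContext(ctx context.Context, eventID, traceID, spanID string) error {
	res, err := s.db.ExecContext(ctx, setEventTraceContextSQL, traceID, spanID, eventID)
	if err != nil {
		return fmt.Errorf("set trace context for event %s: %w", eventID, err)
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return fmt.Errorf("event %s not found", eventID)
	}
	return nil
}

func main() { fmt.Println(setEventTraceContextSQL) }
```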


Instrumenting the Delivery Worker

The worker polls the queue, picks up pending events, and delivers them. It needs to reconstruct the span context from the stored trace_id and span_id, then create child spans under it.

```go
func (w *Worker) deliverEvent(ctx context.Context, event *events.Event) error {
    // Reconstruct the span context from the trace/span IDs stored at ingest.
    traceID, err := trace.TraceIDFromHex(event.TraceID)
    if err != nil {
        // Ingest wasn't instrumented yet — start a new root trace.
        return w.deliverWithNewTrace(ctx, event)
    }
    spanID, err := trace.SpanIDFromHex(event.SpanID)
    if err != nil {
        return w.deliverWithNewTrace(ctx, event)
    }

    remoteCtx := trace.NewSpanContext(trace.SpanContextConfig{
        TraceID:    traceID,
        SpanID:     spanID,
        TraceFlags: trace.FlagsSampled,
        Remote:     true,
    })

    ctx = trace.ContextWithRemoteSpanContext(ctx, remoteCtx)

    tracer := otel.Tracer("gethook/worker")
    ctx, span := tracer.Start(ctx, "webhook.deliver",
        trace.WithSpanKind(trace.SpanKindClient),
        trace.WithAttributes(
            attribute.String("event.id", event.ID.String()),
            attribute.String("destination.id", event.DestinationID.String()),
            attribute.Int("attempt.number", event.AttemptsCount+1),
        ),
    )
    defer span.End()

    resp, err := w.forwarder.Forward(ctx, event)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    span.SetAttributes(
        attribute.Int("http.response.status_code", resp.StatusCode),
        attribute.Int64("http.response.latency_ms", resp.LatencyMs),
    )

    if resp.StatusCode >= 400 {
        span.SetStatus(codes.Error, "destination returned non-2xx")
    }

    return nil
}
```

The trace.NewSpanContext call rebuilds the span context of the span created at ingest. The delivery span appears as its child in the same trace — despite running minutes or hours later in a different process.


What the Trace Waterfall Looks Like

Once instrumented, a successful webhook delivery produces a trace like this:

```
webhook.ingest                              [12ms]
  │
  └── webhook.deliver (attempt 1)           [342ms]
        │
        └── http.client POST /webhooks      [318ms]
```

A failed delivery with retries looks like:

```
webhook.ingest                              [11ms]
  │
  ├── webhook.deliver (attempt 1)           [5003ms]
  │     └── http.client POST /webhooks      [TIMEOUT after 5000ms]
  │
  ├── webhook.deliver (attempt 2)           [503ms]
  │     └── http.client POST /webhooks      [503 Service Unavailable]
  │
  └── webhook.deliver (attempt 3)           [87ms]
        └── http.client POST /webhooks      [200 OK]
```

Each delivery attempt is a separate child span under the same root trace. The retry schedule gap between attempts is visible as time between spans. You can immediately see: when did the ingest happen, how long each attempt took, what the destination returned, and which attempt finally succeeded.


Span Attributes Worth Adding

Beyond the basics, these attributes dramatically improve the usefulness of traces for debugging:

| Attribute | Where to set it | Why it matters |
|---|---|---|
| event.id | Ingest + every delivery span | Cross-reference with database |
| event.type | Ingest | Filter traces by event type |
| source.id / source.name | Ingest | Which webhook source triggered this |
| destination.id / destination.url | Delivery | Which endpoint was called |
| attempt.number | Delivery | Distinguish first attempt from retries |
| http.response.status_code | Delivery | Filter for 4xx/5xx |
| http.response.latency_ms | Delivery | Find slow destinations |
| retry.scheduled_at | Worker | When the next attempt is queued |
| queue.wait_ms | Worker | How long event sat in queue before pickup |

The queue.wait_ms attribute is particularly useful. It's the gap between when the event was received and when the worker picked it up. If this number grows, your worker pool is undersized or your queue is backing up.


Configuring the OTel Exporter

You need to configure where your traces go. For most teams, this means a collector or a direct exporter to your observability backend:

```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

func initTracer(endpoint string) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracehttp.New(context.Background(),
        otlptracehttp.WithEndpoint(endpoint),
        otlptracehttp.WithInsecure(), // use WithTLSClientConfig in production
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName("gethook-api"),
            semconv.ServiceVersion("1.0.0"),
            semconv.DeploymentEnvironment("production"),
        )),
    )

    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))

    return tp, nil
}
```

The ParentBased(TraceIDRatioBased(0.1)) sampler keeps 10% of new root traces; child spans follow their parent's decision, so sampled traces stay complete. For webhook infrastructure at moderate volume (sub-100k events/day), you can use AlwaysSample. At higher volumes, 10% still gives you enough data for tail-latency analysis while keeping your trace storage costs reasonable.


Sampling Strategy for Webhook Pipelines

Uniform sampling doesn't work well for webhooks. You always want to capture failures, timeouts, and retries — even if you downsample successes.

Use a head-based sampler at ingest combined with a tail-based collector rule:

| Condition | Sample rate |
|---|---|
| First-attempt success | 5% |
| First-attempt failure | 100% |
| Any retry attempt | 100% |
| Dead-letter outcome | 100% |
| Delivery latency > 2s | 100% |

Most observability backends (Grafana Tempo, Jaeger, Honeycomb, Datadog APM) support tail-based sampling rules. Configure them to always keep traces containing error spans or high-latency spans.
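The table's conditions boil down to a small decision function. Here is a sketch of the logic in plain Go, with a simplified Span type standing in for the collector's internal representation (this is not any collector's actual API):

```go
package main

import "fmt"

// Span is a simplified stand-in for a span in a buffered trace.
type Span struct {
	HasError   bool
	IsRetry    bool
	DurationMs int64
}

// keepTrace applies the tail-sampling rules from the table: always keep
// traces containing an error, a retry, or a span slower than 2s; otherwise
// keep 5% of clean first-attempt successes. coin is a uniform [0,1) draw.
func keepTrace(spans []Span, coin float64) bool {
	for _, s := range spans {
		if s.HasError || s.IsRetry || s.DurationMs > 2000 {
			return true
		}
	}
	return coin < 0.05
}

func main() {
	retried := []Span{{DurationMs: 5003, HasError: true}, {DurationMs: 87, IsRetry: true}}
	fmt.Println(keepTrace(retried, 0.9)) // true: contains error and retry spans
}
```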


Querying Traces for Debugging

Once traces are flowing, here are the queries you'll actually use:

Find the trace for a specific event ID: Most backends support attribute-based search. Query for event.id = "evt_01HX..." and you'll get the full waterfall.

Find all traces where delivery exceeded 3 seconds: Query for spans with span.name = "webhook.deliver" and duration > 3000ms.

Find all traces with dead-letter outcomes: Query for spans where retry.outcome = "dead_letter".

Find all traces for a specific destination: Query for destination.id = "dst_xyz789" to see the delivery history for one endpoint over time.

GetHook stores event_id, source_id, and destination_id on every event record, which makes it easy to pivot from a trace to the raw event data in the database and back.


The Payoff

Before distributed tracing, debugging "why did this webhook fail at 3 PM?" required: querying the ingest log, querying the delivery worker log, querying the retry scheduler, querying the destination response log, and mentally stitching them together by timestamp and event ID.

After tracing: search for the event ID in your trace backend, open the waterfall, and see the complete timeline — ingest, queue wait, all delivery attempts, response codes, latencies — in a single view.

This is the difference between a 45-minute debugging session and a 3-minute one.

If you want to start with instrumented webhook infrastructure out of the box, GetHook is designed with observability as a first-class concern. The delivery attempt timeline in the dashboard gives you the same event-level view, and the structured log export lets you pipe events into your own trace backend.

Get started with GetHook →

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.