A customer opens a support ticket: "Our order fulfillment webhook stopped firing at 3 PM yesterday." You have logs. You have metrics. But the event bounced through an ingest service, a Postgres-backed queue, a delivery worker, an HTTP client, and an external endpoint — and you have no single view that connects all of those hops.
This is the canonical webhook debugging problem. Logs from each service are in separate files or separate dashboards. Correlating them means manually stitching together timestamps and event IDs, and hoping nothing was dropped.
OpenTelemetry gives you a better tool: distributed traces that follow a webhook through every service boundary, represented as a single waterfall. This post shows you how to instrument a webhook pipeline end-to-end — from ingest to delivery — and what you can actually see once it's wired up.
How Trace Context Propagates (and Why Webhooks Are Awkward)
In a typical request/response system, trace context propagates via HTTP headers. The caller adds a traceparent header, the callee extracts it and creates a child span, and the trace grows naturally.
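For reference, the W3C traceparent header that carries this context is a single line: a version, a 16-byte trace ID, the parent span ID, and a flags byte. The values below are the spec's own example:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```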
Webhooks break this model in two ways:
Problem 1: Ingest is asynchronous. When a provider (Stripe, GitHub, Shopify) sends you a webhook, they don't include a traceparent header — they're not part of your tracing system. The ingest leg and the delivery leg are in different processes, often on different machines, separated by a queue.
Problem 2: You're the bridge. The sender's trace context (if any) and your delivery worker's trace context are disconnected. You need to manufacture the linkage yourself.
The solution is to create a root span at ingest and carry the trace_id through the queue in your event record. When the delivery worker picks up the event, it reconstructs the span context and creates child spans under the same trace.
Instrumenting the Ingest Layer
The ingest handler is where you create the root span. If the sender includes a traceparent header, extract it as the parent. Otherwise, start a new root trace.
```go
import (
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)
func (h *IngestHandler) Handle(w http.ResponseWriter, r *http.Request) {
// Extract incoming trace context if the sender provided one.
// Most providers won't, so this creates a new root context.
prop := otel.GetTextMapPropagator()
ctx := prop.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
tracer := otel.Tracer("gethook/ingest")
// source is the webhook source record for this request, resolved earlier
// in the handler (lookup omitted for brevity).
ctx, span := tracer.Start(ctx, "webhook.ingest",
trace.WithSpanKind(trace.SpanKindServer),
trace.WithAttributes(
attribute.String("source.id", source.ID.String()),
attribute.String("source.name", source.Name),
attribute.Int("payload.bytes", int(r.ContentLength)),
),
)
defer span.End()
// Verify signature, persist event...
event, err := h.store.CreateEvent(ctx, source.AccountID, payload)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
httpx.InternalError(w, err)
return
}
// Persist the trace context so the delivery worker can link to it.
traceID := span.SpanContext().TraceID().String()
spanID := span.SpanContext().SpanID().String()
h.store.SetEventTraceContext(ctx, event.ID, traceID, spanID)
span.SetAttributes(attribute.String("event.id", event.ID.String()))
httpx.Created(w, event)
}
```

The key line is SetEventTraceContext — you're storing the trace_id and span_id alongside the event in Postgres. This is the bridge across the async boundary.
Schema addition
```sql
ALTER TABLE events
ADD COLUMN trace_id TEXT,
ADD COLUMN span_id TEXT;
```

These two columns are the only thing you need to reconnect the ingest span to the delivery span. They add minimal storage overhead (32 + 16 hex chars per event).
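The SetEventTraceContext store method itself isn't shown above. A minimal sketch — assuming a database/sql-backed store and google/uuid event IDs, both of which are assumptions rather than code from this post — is just an UPDATE against those two columns:

```go
import (
	"context"
	"database/sql"

	"github.com/google/uuid"
)

// Store wraps the Postgres handle used by the ingest handler (assumed shape).
type Store struct {
	db *sql.DB
}

// SetEventTraceContext persists the ingest trace context on the event row so
// the delivery worker can rebuild the parent span context later.
func (s *Store) SetEventTraceContext(ctx context.Context, eventID uuid.UUID, traceID, spanID string) error {
	_, err := s.db.ExecContext(ctx,
		`UPDATE events SET trace_id = $1, span_id = $2 WHERE id = $3`,
		traceID, spanID, eventID,
	)
	return err
}
```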
Instrumenting the Delivery Worker
The worker polls the queue, picks up pending events, and delivers them. It needs to reconstruct the span context from the stored trace_id and span_id, then create child spans under it.
```go
func (w *Worker) deliverEvent(ctx context.Context, event *events.Event) error {
// Reconstruct the span context from stored trace/span IDs.
traceID, err := trace.TraceIDFromHex(event.TraceID)
if err != nil {
// Ingest wasn't instrumented yet — start a new root trace.
return w.deliverWithNewTrace(ctx, event)
}
spanID, err := trace.SpanIDFromHex(event.SpanID)
if err != nil {
	return w.deliverWithNewTrace(ctx, event)
}
remoteCtx := trace.NewSpanContext(trace.SpanContextConfig{
TraceID: traceID,
SpanID: spanID,
TraceFlags: trace.FlagsSampled,
Remote: true,
})
ctx = trace.ContextWithRemoteSpanContext(ctx, remoteCtx)
tracer := otel.Tracer("gethook/worker")
ctx, span := tracer.Start(ctx, "webhook.deliver",
trace.WithSpanKind(trace.SpanKindClient),
trace.WithAttributes(
attribute.String("event.id", event.ID.String()),
attribute.String("destination.id", event.DestinationID.String()),
attribute.Int("attempt.number", event.AttemptsCount+1),
),
)
defer span.End()
resp, err := w.forwarder.Forward(ctx, event)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
return err
}
span.SetAttributes(
attribute.Int("http.response.status_code", resp.StatusCode),
attribute.Int64("http.response.latency_ms", resp.LatencyMs),
)
if resp.StatusCode >= 400 {
span.SetStatus(codes.Error, "destination returned non-2xx")
}
return nil
}
```

The trace.NewSpanContext call reconstructs the parent span context that was created at ingest. The delivery worker span appears as a child in the same trace — despite running minutes or hours later in a different process.
What the Trace Waterfall Looks Like
Once instrumented, a successful webhook delivery produces a trace like this:
```
webhook.ingest [12ms]
│
└── webhook.deliver (attempt 1) [342ms]
    │
    └── http.client POST /webhooks [318ms]
```

A failed delivery with retries looks like:
```
webhook.ingest [11ms]
│
├── webhook.deliver (attempt 1) [5003ms]
│   └── http.client POST /webhooks [TIMEOUT after 5000ms]
│
├── webhook.deliver (attempt 2) [503ms]
│   └── http.client POST /webhooks [503 Service Unavailable]
│
└── webhook.deliver (attempt 3) [87ms]
    └── http.client POST /webhooks [200 OK]
```

Each delivery attempt is a separate child span under the same root trace. The retry schedule gap between attempts is visible as the time between spans. At a glance you can see when the ingest happened, how long each attempt took, what the destination returned, and which attempt finally succeeded.
Span Attributes Worth Adding
Beyond the basics, these attributes dramatically improve the usefulness of traces for debugging:
| Attribute | Where to set it | Why it matters |
|---|---|---|
| event.id | Ingest + every delivery span | Cross-reference with database |
| event.type | Ingest | Filter traces by event type |
| source.id / source.name | Ingest | Which webhook source triggered this |
| destination.id / destination.url | Delivery | Which endpoint was called |
| attempt.number | Delivery | Distinguish first attempt from retries |
| http.response.status_code | Delivery | Filter for 4xx/5xx |
| http.response.latency_ms | Delivery | Find slow destinations |
| retry.scheduled_at | Worker | When the next attempt is queued |
| queue.wait_ms | Worker | How long the event sat in the queue before pickup |
The queue.wait_ms attribute is particularly useful. It's the gap between when the event was received and when the worker picked it up. If this number grows, your worker pool is undersized or your queue is backing up.
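Recording it is straightforward once the event row carries its ingest timestamp. A minimal sketch — the receivedAt parameter and the helper name are illustrative, not part of the code above:

```go
import (
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordQueueWait annotates the delivery span with how long the event sat in
// the queue between ingest and worker pickup. receivedAt is assumed to be the
// ingest timestamp stored on the event row.
func recordQueueWait(span trace.Span, receivedAt time.Time) {
	span.SetAttributes(
		attribute.Int64("queue.wait_ms", time.Since(receivedAt).Milliseconds()),
	)
}
```

Call it from deliverEvent right after the span is started, so the attribute is present on every attempt.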
Configuring the OTel Exporter
You need to configure where your traces go. For most teams, this means a collector or a direct exporter to your observability backend:
```go
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)
func initTracer(endpoint string) (*sdktrace.TracerProvider, error) {
exporter, err := otlptracehttp.New(context.Background(),
otlptracehttp.WithEndpoint(endpoint),
otlptracehttp.WithInsecure(), // use WithTLSClientConfig in production
)
if err != nil {
return nil, err
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
sdktrace.WithResource(resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceName("gethook-api"),
semconv.ServiceVersion("1.0.0"),
semconv.DeploymentEnvironment("production"),
)),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
))
return tp, nil
}
```

The TraceIDRatioBased(0.1) sampler captures 10% of traces. For webhook infrastructure at moderate volume (sub-100k events/day), you can use AlwaysSample. At higher volumes, 10% still gives you enough data for tail-latency analysis while keeping your trace storage costs reasonable.
Sampling Strategy for Webhook Pipelines
Uniform sampling doesn't work well for webhooks. You always want to capture failures, timeouts, and retries — even if you downsample successes.
Use a head-based sampler at ingest combined with a tail-based collector rule:
| Condition | Sample rate |
|---|---|
| First-attempt success | 5% |
| First-attempt failure | 100% |
| Any retry attempt | 100% |
| Dead-letter outcome | 100% |
| Delivery latency > 2s | 100% |
Most observability backends (Grafana Tempo, Jaeger, Honeycomb, Datadog APM) support tail-based sampling rules. Configure them to always keep traces containing error spans or high-latency spans.
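As a concrete sketch, here is roughly what that rule set looks like with the OpenTelemetry Collector's tail_sampling processor (from collector-contrib). The policy names are arbitrary, and the thresholds mirror the table above:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # wait for late spans before deciding on a trace
    policies:
      - name: keep-errors         # failures, timeouts, dead-letters, retried attempts
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow           # delivery latency above 2s
        type: latency
        latency:
          threshold_ms: 2000
      - name: sample-successes    # baseline sample of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Policies are OR-ed together: a trace is kept if any policy matches, so any trace containing a failed or slow attempt survives even while the clean-success baseline is sampled down.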
Querying Traces for Debugging
Once traces are flowing, here are the queries you'll actually use:
Find the trace for a specific event ID:
Most backends support attribute-based search. Query for event.id = "evt_01HX..." and you'll get the full waterfall.
Find all traces where delivery exceeded 3 seconds:
Query for spans with span.name = "webhook.deliver" and duration > 3000ms.
Find all traces with dead-letter outcomes:
Query for spans where retry.outcome = "dead_letter".
Find all traces for a specific destination:
Query for destination.id = "dst_xyz789" to see the delivery history for one endpoint over time.
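If your backend is Grafana Tempo, the TraceQL equivalents of those four searches look roughly like this (same order, using the placeholder IDs from above):

```
{ span.event.id = "evt_01HX..." }
{ name = "webhook.deliver" && duration > 3s }
{ span.retry.outcome = "dead_letter" }
{ span.destination.id = "dst_xyz789" }
```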
GetHook stores event_id, source_id, and destination_id on every event record, which makes it easy to pivot from a trace to the raw event data in the database and back.
The Payoff
Before distributed tracing, debugging "why did this webhook fail at 3 PM?" required: querying the ingest log, querying the delivery worker log, querying the retry scheduler, querying the destination response log, and mentally stitching them together by timestamp and event ID.
After tracing: search for the event ID in your trace backend, open the waterfall, and see the complete timeline — ingest, queue wait, all delivery attempts, response codes, latencies — in a single view.
This is the difference between a 45-minute debugging session and a 3-minute one.
If you want to start with instrumented webhook infrastructure out of the box, GetHook is designed with observability as a first-class concern. The delivery attempt timeline in the dashboard gives you the same event-level view, and the structured log export lets you pipe events into your own trace backend.