Your webhook delivery pipeline is a chain of at least four moving parts: the ingest endpoint, the persistence layer, the delivery queue, and the worker process that makes the outbound HTTP call. Each link can fail independently. A deployment can break the worker without touching the ingest endpoint. A database migration can silently stall the queue without returning errors. The ingest endpoint can accept events and return 200 OK while the downstream worker is wedged.
Standard monitoring — uptime checks on your API, process health on your workers, database connection pool metrics — tells you whether each component is alive. It does not tell you whether an event traveling through all of them right now will actually arrive at its destination. That gap is where synthetic end-to-end testing lives.
Synthetic testing means deliberately sending known test events through your real production pipeline on a schedule and asserting that they arrive at a controlled destination within an expected time window. It catches delivery regressions the moment they happen rather than the moment a customer files a ticket.
## What Synthetic Testing Catches That Health Checks Miss
Before writing any code, it's worth being specific about the failure modes that synthetic testing surfaces and health checks do not.
| Failure | Standard Health Check | Synthetic E2E Test |
|---|---|---|
| Worker process is running but not polling the queue | Not caught — process is "up" | Caught — event never delivered |
| Queue backlog spike — events enqueued but not dispatched | Not caught unless queue depth alerting is configured | Caught — delivery latency exceeds threshold |
| Route misconfiguration — event type not matching any destination | Not caught | Caught — canary event goes undelivered |
| HMAC signing failure — worker signs with stale key | Not caught — delivery returns 2xx if destination ignores signatures | Caught if canary destination verifies the signature |
| Network connectivity between worker and destination broken | Not caught by ingest or queue health | Caught — delivery fails with network error |
| Database index bloat slowing queue polling to a crawl | Not caught at P50 — only visible at P99 | Caught — latency threshold exceeded even at modest load |
The worker-not-polling failure is particularly insidious. A worker process that crashed and restarted with a config error might be running but consuming no events. Your process health check sees a running process. Your API health check sees a healthy ingest endpoint. Nothing alerts. Your customers start noticing that their webhooks stopped arriving.
## The Architecture of a Canary Pipeline
You need three components to run synthetic e2e tests:
- A canary sender — a scheduled job that injects a known test event into your pipeline
- A canary receiver — an HTTP endpoint you control that accepts and records delivery
- An assertion service — something that checks whether the sent event was received within the latency SLA
The canary sender and assertion service can be the same job, with a delay between send and check. The receiver is a separate endpoint — it can be as simple as a serverless function that writes to a table.
```
Canary Sender ──► POST /ingest/{token} ──► Queue ──► Worker ──► Canary Receiver
                                                                      │
                                                             records received_at
                                                                      │
Assertion Job ◄── checks: received_at ≤ sent_at + SLA? ◄──────────────┘
```

## Step 1: Build the Canary Receiver
The receiver is a purpose-built HTTP endpoint that accepts webhook deliveries and records them. Keep it simple — it should do almost nothing that could fail independently.
```go
package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"time"

	_ "github.com/lib/pq" // Postgres driver, registered for database/sql
)

// CanaryRecord is the shape stored in canary_receipts.
type CanaryRecord struct {
	EventID    string    `json:"id"`
	ReceivedAt time.Time `json:"received_at"`
	Source     string    `json:"source"`
}

func canaryHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var payload struct {
			ID   string `json:"id"`
			Type string `json:"type"`
		}
		if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		_, err := db.ExecContext(r.Context(), `
			INSERT INTO canary_receipts (event_id, received_at, source)
			VALUES ($1, NOW(), $2)
			ON CONFLICT (event_id) DO NOTHING
		`, payload.ID, r.Header.Get("X-Canary-Source"))
		if err != nil {
			log.Printf("canary insert failed: %v", err)
			// Still return 200 — don't cause the delivery worker to retry.
		}
		w.WriteHeader(http.StatusOK)
	}
}
```

The `ON CONFLICT (event_id) DO NOTHING` handles the case where your delivery layer retries a canary event — you only want to record the first receipt. The handler always returns 200 OK even on a database failure, so a receiver-side hiccup doesn't trigger delivery retries that would pollute your latency measurements.
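That idempotency contract is worth pinning down in a test. The sketch below swaps the Postgres table for an in-memory map (a stand-in, not the real storage) and simulates the delivery worker retrying the same canary event:

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"sync"
)

// memoryStore stands in for the canary_receipts table so the handler's
// idempotency can be exercised without Postgres.
type memoryStore struct {
	mu       sync.Mutex
	receipts map[string]string // event_id -> source
}

// recordFirst keeps only the first receipt per event_id, mirroring
// ON CONFLICT (event_id) DO NOTHING.
func (s *memoryStore) recordFirst(eventID, source string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, seen := s.receipts[eventID]; !seen {
		s.receipts[eventID] = source
	}
}

func canaryHandlerMem(store *memoryStore) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var payload struct {
			ID string `json:"id"`
		}
		if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		store.recordFirst(payload.ID, r.Header.Get("X-Canary-Source"))
		w.WriteHeader(http.StatusOK) // always 200: never trigger worker retries
	}
}

// deliverTwice simulates the delivery worker retrying the same canary
// event; it returns both status codes plus the final receipt count.
func deliverTwice() (first, second, receipts int) {
	store := &memoryStore{receipts: map[string]string{}}
	h := canaryHandlerMem(store)
	codes := make([]int, 0, 2)
	for i := 0; i < 2; i++ {
		req := httptest.NewRequest(http.MethodPost, "/canary",
			bytes.NewReader([]byte(`{"id":"canary_123"}`)))
		req.Header.Set("X-Canary-Source", "primary")
		rec := httptest.NewRecorder()
		h(rec, req)
		codes = append(codes, rec.Code)
	}
	return codes[0], codes[1], len(store.receipts)
}
```

Both deliveries get a 200, and only one receipt is recorded — exactly the behavior the latency measurements depend on.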
## Step 2: The Canary Schema
```sql
CREATE TABLE canary_events (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_id    TEXT NOT NULL UNIQUE,   -- the ID you inject into the pipeline
    pipeline    TEXT NOT NULL,          -- "primary", "high-priority", etc.
    sent_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    sla_seconds INT NOT NULL DEFAULT 30,
    received_at TIMESTAMPTZ,            -- filled by receiver
    -- EXTRACT(MILLISECONDS ...) returns only the seconds field of the
    -- interval; EPOCH * 1000 gives the total latency in milliseconds.
    latency_ms  INT GENERATED ALWAYS AS (
        (EXTRACT(EPOCH FROM (received_at - sent_at)) * 1000)::INT
    ) STORED
);

CREATE TABLE canary_receipts (
    event_id    TEXT PRIMARY KEY,
    received_at TIMESTAMPTZ NOT NULL,
    source      TEXT
);

CREATE INDEX canary_events_unresolved ON canary_events (sent_at)
    WHERE received_at IS NULL;
```

The `latency_ms` generated column gives you delivery latency as a first-class metric without a JOIN — useful for dashboards and alerting queries.
## Step 3: The Sender Job
The sender fires on a cron schedule — every 60 seconds is a reasonable cadence for a production pipeline. It creates a uniquely identifiable event and ingests it through the real ingest endpoint.
```bash
#!/usr/bin/env bash
# canary-send.sh — run every 60 seconds
set -euo pipefail

PIPELINE="${PIPELINE:-primary}"
SLA_SECONDS="${SLA_SECONDS:-30}"
INGEST_URL="${INGEST_URL:?INGEST_URL must be set}" # e.g. https://ingest.gethook.to/ingest/src_abc123

EVENT_ID="canary_$(date +%s)_$(openssl rand -hex 4)"

# Record the send in the canary table
psql "$CANARY_DB_URL" -c "
  INSERT INTO canary_events (event_id, pipeline, sla_seconds)
  VALUES ('$EVENT_ID', '$PIPELINE', $SLA_SECONDS)
"

# Fire the event into the real ingest pipeline; treat a curl failure as
# status 000 so set -e doesn't exit before the alert below fires
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  -X POST "$INGEST_URL" \
  -H "Content-Type: application/json" \
  -H "X-Canary-Source: $PIPELINE" \
  -d "{
    \"id\": \"$EVENT_ID\",
    \"type\": \"canary.ping\",
    \"sent_at\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
    \"pipeline\": \"$PIPELINE\"
  }") || HTTP_STATUS="000"

if [ "$HTTP_STATUS" != "200" ]; then
  echo "ALERT: canary ingest returned $HTTP_STATUS" >&2
  exit 1
fi
echo "canary event $EVENT_ID sent, HTTP $HTTP_STATUS"
```

The `X-Canary-Source` header identifies this as a canary delivery in the receiver logs, so you can distinguish canary traffic from real customer events in your observability stack.
## Step 4: The Assertion Job
The assertion job runs slightly after the SLA window. For a 30-second SLA, run it every 60 seconds with a 45-second lookback. It finds canary events that were sent more than `sla_seconds` ago but have not been received.
```sql
-- Canary events past their SLA with no receipt
SELECT
    ce.event_id,
    ce.pipeline,
    ce.sent_at,
    ce.sla_seconds,
    EXTRACT(EPOCH FROM (NOW() - ce.sent_at))::INT AS age_seconds,
    cr.received_at
FROM canary_events ce
LEFT JOIN canary_receipts cr ON cr.event_id = ce.event_id
WHERE
    ce.sent_at > NOW() - INTERVAL '10 minutes'  -- don't look too far back
    AND ce.sent_at < NOW() - (ce.sla_seconds || ' seconds')::INTERVAL
    AND cr.received_at IS NULL;
```

If this query returns any rows, you have a delivery regression. Fire an alert: PagerDuty, Slack, or wherever your on-call team watches.
Also update resolved canary records with receipt data for latency tracking:
```sql
UPDATE canary_events ce
SET received_at = cr.received_at
FROM canary_receipts cr
WHERE cr.event_id = ce.event_id
  AND ce.received_at IS NULL;
```

## What to Measure
Once canary events are flowing, you have a continuous latency signal from ingest to delivery. Track:
| Metric | Alert Threshold |
|---|---|
| `canary_sla_miss_rate` (missed / sent per 5 min) | > 0% over 3 consecutive windows |
| `canary_latency_p50` | > 2× baseline |
| `canary_latency_p99` | > SLA threshold |
| `canary_ingest_failure_rate` | > 0 (ingest returned non-200) |
| `canary_age_max` (oldest unresolved event) | > 2× SLA |
The `canary_age_max` metric is particularly useful for catching slow degradations — a worker that's polling at half speed shows up as gradual latency creep before it becomes a full SLA miss.
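A sketch of that creep detection, assuming canary latency samples are already in hand; nearest-rank percentiles are plenty for dashboard alerting, and the 2× factor matches the thresholds in the table above:

```go
package main

import "sort"

// percentileMs returns the pth percentile (0-100) of latency samples
// in milliseconds, using the nearest-rank method.
func percentileMs(samples []int, p float64) int {
	if len(samples) == 0 {
		return 0
	}
	s := append([]int(nil), samples...)
	sort.Ints(s)
	rank := int(float64(len(s))*p/100.0 + 0.5)
	if rank < 1 {
		rank = 1
	}
	if rank > len(s) {
		rank = len(s)
	}
	return s[rank-1]
}

// latencyCreep flags the "worker polling at half speed" signal:
// current p50 drifting past 2x the baseline p50, well before a full
// SLA miss shows up.
func latencyCreep(baseline, current []int) bool {
	return percentileMs(current, 50) > 2*percentileMs(baseline, 50)
}
```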
## Isolating Canary Traffic in Production
Canary events should not pollute your customers' event feeds or dashboards. A few approaches:
**Use a dedicated canary source.** Create a source specifically for canary testing, with a route to your canary receiver. Canary events never touch customer destinations. This is the cleanest isolation.

**Filter in your event listing queries.** Add a `type != 'canary.ping'` filter to any customer-facing event list. Canary events are still stored and visible in your internal ops tooling, but hidden from customer dashboards.

**Tag canary events explicitly.** Use a metadata field or custom header to mark events as synthetic. Some teams add a `"canary": true` field to the payload. This lets you filter in logs, metrics, and traces without needing a dedicated source.
GetHook's event filtering lets you route by `event_type_pattern`, which means a pattern like `canary.*` can send all canary events exclusively to a dedicated destination without any overlap with customer routes.
## Multi-Region and Multi-Pipeline Canaries
If you run delivery workers across multiple regions or priority queues (e.g., a high-priority queue for paid accounts, a standard queue for free accounts), run a separate canary per path.
| Canary | Pipeline | SLA |
|---|---|---|
| `canary-primary-us` | Standard queue, US worker | 60s |
| `canary-primary-eu` | Standard queue, EU worker | 60s |
| `canary-priority-us` | High-priority queue, US worker | 15s |
Each canary has its own source, its own route to a regional receiver, and its own SLA threshold. A regression in the EU pipeline shows up immediately without obscuring the US pipeline's health signal — and vice versa.
## Avoiding Alert Fatigue
Synthetic testing produces false positives when your test infrastructure itself fails. The most common culprits:
- The canary sender script fails because `psql` isn't available in the container
- The canary receiver is deployed to a non-production environment and the URL isn't updated
- The assertion job runs before the SLA window has fully elapsed for the first batch of events after a restart
Guard against these by: tracking sender failures as a separate metric (not as a delivery SLA miss), running the assertion job with a generous buffer after the SLA (`sla_seconds + 15`), and alerting on "canary sender hasn't fired in 3 minutes" as a separate signal from "canary event missed SLA."
The goal is a signal with high specificity: when the alert fires, you have high confidence the delivery pipeline is broken, not just that the monitoring script is misconfigured.
Synthetic end-to-end testing is the difference between finding out your delivery pipeline is broken from a customer and finding out from your own alerting — before the customer notices. The implementation is straightforward; the operational discipline of keeping the canary tests healthy and the SLA thresholds calibrated is what makes it stick.
If you're building on GetHook and want to add canary delivery to your monitoring setup, start here to configure a dedicated source and route for your test events.