Most teams treat webhook consumer deployments as atomic: flip the DNS, watch the logs for 60 seconds, and call it done. This works until it doesn't. A subtle parsing bug, a database query that assumes a field is never null, a changed retry backoff that turns a 2-second operation into a 30-second one — these failures surface minutes or hours after deployment, long after the deploy window has closed and the team has moved on.
The pattern that eliminates this class of deployment risk is canary delivery: routing a controlled percentage of live webhook traffic to a new consumer version while the old version handles the rest. You measure error rates, latency, and business outcomes across both cohorts and only promote when the new version proves itself on real events.
This is standard practice for HTTP request routing. It's underused for webhook consumers because most teams don't think of their handlers as stateful services that need progressive rollout. They are.
Why Webhook Canary Deployments Differ from HTTP Canaries
For a web API, canary routing is straightforward: put a load balancer in front, shift 5% of requests to the new version, watch for elevated error rates.
Webhook consumers complicate this in three important ways:
Events are not stateless. A webhook consumer typically writes to a database, triggers downstream jobs, or updates external state. If 5% of order.created events are processed by a handler that writes a malformed record, the resulting data corruption doesn't reveal itself as an immediate 500 — it surfaces as a business anomaly days later. You need to compare outcomes, not just response codes.
Producers don't control routing. With a web API, your load balancer sees every request. With webhooks, the event originates at a third-party provider. To do canary routing, you need the gateway layer — not the producer — to own the traffic split.
Retry semantics are asymmetric. If your canary handler returns a 500, the gateway retries. But a retry goes to the same destination, not back to the stable handler. A flapping canary generates retry pressure that spills back into your delivery queue and can inflate latency for unrelated events.
Each of these demands deliberate handling, and together they shape how you architect canary support in your delivery layer.
Traffic Splitting at the Gateway Layer
The right place to own traffic splitting is the webhook gateway — the component that receives events and decides which destinations to forward to. Your consumer code should not need to know it's receiving canary traffic.
The delivery layer implements a weighted routing table per event source:
```go
type WeightedDestination struct {
    DestinationID string
    Weight        int    // 0–100; weights per source must sum to 100
    Label         string // "stable" or "canary"
}

type CanaryRoute struct {
    SourceID     string
    EventPattern string // "order.*", "*", etc.
    Destinations []WeightedDestination
}
```

At delivery time, use a deterministic hash rather than pure randomness. This ensures that if an event is retried, it routes to the same destination cohort — not flipped mid-flight:
```go
// Stable hash: same event always lands in the same destination cohort.
// Uses hash/fnv from the standard library.
func selectDestination(route CanaryRoute, eventID string) WeightedDestination {
    h := fnv.New32a()
    h.Write([]byte(eventID))
    n := int(h.Sum32() % 100)

    cumulative := 0
    for _, d := range route.Destinations {
        cumulative += d.Weight
        if n < cumulative {
            return d
        }
    }
    return route.Destinations[len(route.Destinations)-1]
}
```

Store the routing decision in the delivery attempt record. Every attempt must record which label it was routed to so your analysis queries can cleanly compare cohorts.
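A minimal sketch of what that attempt record could carry; the field names here are illustrative rather than a prescribed schema:

```go
import "time"

// Illustrative delivery-attempt record. The piece that matters for canary
// analysis is DestinationLabel, persisted at routing time.
type DeliveryAttempt struct {
    EventID            string
    DestinationID      string
    DestinationLabel   string // "stable" or "canary", copied from the chosen WeightedDestination
    Outcome            string // e.g. "success", "http_5xx", "timeout", "network_error"
    ResponseDurationMs int    // matches response_duration_ms in the analysis query below
    AttemptedAt        time.Time
}
```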
Instrumenting for Canary Comparison
Response codes are necessary but not sufficient. Here is what to compare between cohorts:
| Signal | Alert threshold |
|---|---|
| Delivery success rate | > 2 percentage-point regression |
| P95 handler latency | > 25% regression vs. stable |
| Retry rate | > 1.5x stable baseline |
| Dead-letter rate | Any increase at all |
| Business metric (e.g., DB rows written) | Deviation > 1pp from expected % |
The business metric check is the one teams skip and the one that catches the most subtle bugs. If you're routing 5% of order.created events to the canary, roughly 5% of resulting database writes should originate from the canary handler. If the canary returns 200 but the write count is off, you have a silent failure.
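One way to automate that comparison is to attribute each application write back to the delivery attempt that caused it, then compare the canary's share of writes to the share of traffic it was routed. The sketch below assumes your rows keep the originating event ID; the orders table and source_event_id column are hypothetical names, not part of any schema described above:

```go
import (
    "context"
    "database/sql"
    "math"
)

// checkWriteAttribution compares the canary's share of resulting database
// writes to the traffic share it was routed (canaryWeightPct, e.g. 5).
// Table and column names are illustrative.
func checkWriteAttribution(ctx context.Context, db *sql.DB, canaryWeightPct float64) (bool, error) {
    var canaryWrites, totalWrites int64
    err := db.QueryRowContext(ctx, `
        SELECT
            COALESCE(SUM(CASE WHEN da.destination_label = 'canary' THEN 1 ELSE 0 END), 0),
            COUNT(*)
        FROM orders o
        JOIN delivery_attempts da ON da.event_id = o.source_event_id
        WHERE da.outcome = 'success'
          AND da.attempted_at > NOW() - INTERVAL '10 minutes'
    `).Scan(&canaryWrites, &totalWrites)
    if err != nil {
        return false, err
    }
    if totalWrites == 0 {
        return true, nil // nothing written yet; nothing to compare
    }
    actualPct := float64(canaryWrites) / float64(totalWrites) * 100
    // More than 1 percentage point of drift from the routed share is the
    // silent-failure signal described above.
    return math.Abs(actualPct-canaryWeightPct) <= 1.0, nil
}
```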
Run this query every 5 minutes during a canary window:
```sql
SELECT
    da.destination_label,
    COUNT(*) AS attempts,
    ROUND(AVG(da.response_duration_ms)) AS avg_latency_ms,
    SUM(CASE WHEN da.outcome = 'success' THEN 1 ELSE 0 END) * 100.0
        / NULLIF(COUNT(*), 0) AS success_pct,
    SUM(CASE WHEN da.outcome IN ('http_5xx', 'timeout', 'network_error')
        THEN 1 ELSE 0 END) * 100.0
        / NULLIF(COUNT(*), 0) AS error_pct
FROM delivery_attempts da
JOIN events e ON da.event_id = e.id
WHERE e.source_id = $1
  AND da.attempted_at > NOW() - INTERVAL '10 minutes'
  AND da.destination_label IS NOT NULL
GROUP BY da.destination_label;
```

If error_pct for the canary exceeds stable by more than 2 points, trigger an automatic rollback.
Handling Canary Rollbacks
Rollback for a webhook canary differs from HTTP rollback. With HTTP, shifting traffic is instantaneous. With webhooks, events already delivered to the canary have already been processed — or have failed. A rollback requires three steps:
1. Set the canary weight to 0. New events stop routing to the canary immediately.
2. Replay dead-lettered canary events against the stable destination.
3. If the canary wrote incorrect state, run a compensating job to re-process those events against the stable handler.
Step 3 is the costly one. Limit its scope by keeping the canary weight at 5% and windows short — 30 minutes of traffic, not 4 hours:
```bash
# Replay canary dead-letter events against the stable destination
curl -X POST https://api.gethook.to/v1/events/replay \
  -H "Authorization: Bearer hk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "destination_id": "dst_stable_abc123",
    "filter": {
      "original_destination_label": "canary",
      "status": "dead_letter",
      "from": "2026-04-14T09:00:00Z",
      "to": "2026-04-14T09:45:00Z"
    }
  }'
```

This is only possible if your delivery records store destination_label — which is why writing it at routing time is non-negotiable.
Shadow Mode: Validate Without Risk
For high-risk deployments — a logic rewrite, a dependency upgrade that changes serialization behavior — run the canary in shadow mode before committing any real traffic.
In shadow mode, every event is delivered to both stable and canary destinations. The canary's response is recorded but ignored: retries don't fire on failure, and its outcome has no effect on event status. You get full visibility into how the new handler behaves against live payloads with zero blast radius.
```go
type DeliveryMode int

const (
    DeliveryModeNormal DeliveryMode = iota
    DeliveryModeShadow // deliver, record outcome, ignore for retry/status
)
```

Shadow mode doubles delivery work, but for a 30-minute validation window before a major release, the cost is negligible compared to the debugging time you'd spend after a botched cutover.
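To make the semantics concrete, here is one way a dispatcher might apply shadow mode after an attempt completes. The hook functions are placeholders for your own persistence, retry queueing, and status updates, not a prescribed interface:

```go
// Placeholders for persistence, retry queueing, and event-status updates.
type Dispatcher struct {
    recordAttempt   func(DeliveryAttempt)
    scheduleRetry   func(DeliveryAttempt)
    markEventFailed func(eventID string)
}

func (d *Dispatcher) handleOutcome(mode DeliveryMode, attempt DeliveryAttempt, succeeded bool, retriesLeft int) {
    d.recordAttempt(attempt) // always persist, so cohorts can be compared later

    if mode == DeliveryModeShadow {
        return // shadow outcomes never trigger retries or affect event status
    }
    if succeeded {
        return
    }
    if retriesLeft > 0 {
        d.scheduleRetry(attempt)
    } else {
        d.markEventFailed(attempt.EventID)
    }
}
```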
A full progressive rollout looks like this:
| Phase | Stable weight | Canary weight | Mode | Duration |
|---|---|---|---|---|
| Shadow validation | 100% | 100% (shadow) | Shadow | 30 min |
| Canary 5% | 95% | 5% | Normal | 30 min |
| Canary 25% | 75% | 25% | Normal | 30 min |
| Full cutover | 0% | 100% | Normal | Ongoing |
| Stable decommissioned | — | — | — | After 24h |
Each phase gate is an automated check: success rate, latency, business metric. Pass all three, promote to the next phase. Fail any one, roll back and page the on-call.
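A sketch of what such a gate might look like, applying the thresholds from the signal table above. CohortStats is a hypothetical struct populated from the cohort comparison query, and writeShareOK carries the result of the business-metric check:

```go
// CohortStats would be filled in from the cohort comparison query.
type CohortStats struct {
    SuccessPct   float64
    P95LatencyMs float64
}

type GateResult struct {
    Passed  bool
    Reasons []string
}

// evaluateGate applies the signal-table thresholds: success rate, latency,
// and the business-metric (write-share) check.
func evaluateGate(stable, canary CohortStats, writeShareOK bool) GateResult {
    var reasons []string
    if stable.SuccessPct-canary.SuccessPct > 2.0 {
        reasons = append(reasons, "success rate regressed by more than 2 points")
    }
    if canary.P95LatencyMs > stable.P95LatencyMs*1.25 {
        reasons = append(reasons, "p95 latency regressed by more than 25%")
    }
    if !writeShareOK {
        reasons = append(reasons, "business metric deviated from routed share")
    }
    return GateResult{Passed: len(reasons) == 0, Reasons: reasons}
}
```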
Three Constraints Worth Knowing
Not all event types need canaries. Reserve the full protocol for handlers that write to your primary database, trigger financial transactions, or have complex branching logic. For simpler consumers, shadow mode alone is sufficient.
Canary routing interacts with ordered delivery. If your consumer expects events for a given resource to arrive in order, routing one to stable and another to canary breaks that assumption. For ordered streams, route by entity ID rather than event ID: all events for a given order_id go to the same cohort. Swap the hash input in selectDestination from eventID to entityID.
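A minimal sketch of that swap, reusing the stable hash from earlier; the entityID parameter is whatever groups your ordered stream, such as an order_id:

```go
// For ordered streams, derive the cohort from the entity rather than the event,
// so every event for the same resource lands on the same destination.
func selectDestinationForEntity(route CanaryRoute, entityID string) WeightedDestination {
    return selectDestination(route, entityID) // same stable hash, different key
}
```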
Rollback must be a one-command operation. If rolling back requires a Terraform apply or a Helm chart update, operators will hesitate to trigger it. That hesitation turns a 5% error spike into a 40-minute outage. Your gateway needs a weight API callable from CI/CD in under 5 seconds.
Canary delivery for webhooks requires more upfront tooling than HTTP canaries — stable hashing, per-attempt destination labels, and replay-aware rollback. The investment pays for itself the first time it catches a bug that would have corrupted production data across every event for the next hour.
GetHook supports multiple destinations per route and per-event delivery tracking, which gives you the infrastructure building blocks for canary routing today. See the routing docs →