Webhooks fail in interesting ways. A destination goes down mid-batch, your delivery worker times out waiting for a response that never arrives, or a DNS change routes traffic to a host that returns 200 for everything — including actual error payloads. These aren't edge cases. They're routine failure modes that production webhook infrastructure hits regularly.
The problem is that none of them surface in standard integration tests. You test the happy path: event arrives, gets signed, gets delivered, destination returns 200. Retry logic, circuit breakers, and timeout handling are code paths that only activate under failure conditions you haven't written a test for yet.
Chaos engineering is the practice of deliberately inducing those failures to find weaknesses before your users do. For webhook delivery pipelines, it's one of the highest-return reliability investments a team can make.
## The Failure Modes That Matter
Before you inject failures, you need a catalog of what actually goes wrong in webhook delivery:
| Failure Mode | Root Cause | Typical Symptom |
|---|---|---|
| Destination timeout | Slow consumer, long processing | Events stuck in delivering |
| Destination 5xx | Consumer crash, deploy, DB failure | Retry queue grows, eventual DLQ |
| Destination 4xx | Bad endpoint, auth mismatch | Events fail immediately, no retry |
| Network partition | Firewall rule, routing change | TCP connection hangs indefinitely |
| DNS failure | Misconfigured TTL, expired domain | Delivery fails at resolution step |
| Intermittent 5xx | Overloaded consumer, rate limit | Retry succeeds on attempt 2 or 3 |
| Flapping destination | Consumer repeatedly crashes/recovers | Retry exhausted, premature DLQ |
| Slow response | Consumer doing synchronous work | Worker thread/goroutine pool exhaustion |
| Connection reset | Load balancer restart, pod eviction | Partial delivery, ambiguous retry |
| Large payload rejection | 413 from consumer | Permanent failure, retry is wasteful |
Not every webhook system handles all of these correctly. The only way to know is to test them.
## Building a Chaos Destination Server
The fastest way to inject failures is a local destination server that returns whatever failure mode you want to exercise. Here's a minimal Go server that simulates the most common failure patterns:
```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"strconv"
	"sync/atomic"
	"time"
)

func main() {
	mux := http.NewServeMux()

	// /ok — always succeeds
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// /timeout — hangs for 10 minutes (tests worker timeout enforcement)
	mux.HandleFunc("/timeout", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(10 * time.Minute)
	})

	// /slow?ms=<n> — delays by query param (tests slow-response handling)
	mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		ms, _ := strconv.Atoi(r.URL.Query().Get("ms"))
		if ms > 0 {
			time.Sleep(time.Duration(ms) * time.Millisecond)
		}
		w.WriteHeader(http.StatusOK)
	})

	// /flap — alternates between 200 and 500 on each request
	var flapCount atomic.Int64
	mux.HandleFunc("/flap", func(w http.ResponseWriter, r *http.Request) {
		n := flapCount.Add(1)
		if n%2 == 0 {
			w.WriteHeader(http.StatusOK)
		} else {
			w.WriteHeader(http.StatusInternalServerError)
		}
	})

	// /fail?after=<n> — 200 for first n requests, then permanent 500
	var successes atomic.Int64
	mux.HandleFunc("/fail", func(w http.ResponseWriter, r *http.Request) {
		after, _ := strconv.ParseInt(r.URL.Query().Get("after"), 10, 64)
		if successes.Load() < after {
			successes.Add(1)
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusInternalServerError)
		fmt.Fprint(w, `{"error":"service unavailable"}`)
	})

	// /reset — closes connection immediately without a response
	mux.HandleFunc("/reset", func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		conn, _, _ := hj.Hijack()
		conn.Close()
	})

	port := os.Getenv("PORT")
	if port == "" {
		port = "9000"
	}
	log.Printf("chaos destination running on :%s", port)
	log.Fatal(http.ListenAndServe(":"+port, mux))
}
```

Point your webhook delivery system at each endpoint in turn and verify the behavior you expect: timeouts should trigger retries, 5xx should back off, 4xx should stop immediately, and connection resets should retry with a fresh connection.
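Those expectations are easier to verify when they're written down as code. Here's one shape a worker-side classification helper might take; this is a sketch of a reasonable policy, not a prescription, and the exact status-code handling is yours to decide:

```go
package delivery

import (
	"errors"
	"net"
	"net/http"
)

// Outcome is what the worker should do with one delivery attempt.
type Outcome int

const (
	Delivered Outcome = iota // 2xx: mark the event delivered
	Retry                    // timeout, transport error, 429, or 5xx: schedule a retry
	Fail                     // any other 4xx: permanent failure, do not retry
)

// Classify maps a delivery attempt's response (or transport error) to an Outcome.
func Classify(resp *http.Response, err error) Outcome {
	if err != nil {
		var netErr net.Error
		if errors.As(err, &netErr) && netErr.Timeout() {
			return Retry // destination hung past our deadline
		}
		return Retry // connection reset, DNS failure, etc.: retry on a fresh connection
	}
	switch {
	case resp.StatusCode >= 200 && resp.StatusCode < 300:
		return Delivered
	case resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500:
		return Retry
	default:
		return Fail
	}
}
```

Treating 429 and 5xx as retriable while failing fast on other 4xx is the policy the chaos endpoints above exercise; if your policy differs, adjust your expectations for each endpoint to match.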
## Network-Level Chaos with Toxiproxy
The chaos server above tests application-level failures. For network-level failures — packet loss, latency spikes, bandwidth throttling — Toxiproxy is the right tool.
Toxiproxy is a TCP proxy that sits between your delivery worker and the destination, with a REST API for injecting conditions at runtime:
```bash
# Start Toxiproxy
docker run -d -p 8474:8474 -p 9001:9001 ghcr.io/shopify/toxiproxy

# Create a proxy: localhost:9001 → your chaos destination on :9000
curl -s -X POST http://localhost:8474/proxies \
  -H 'Content-Type: application/json' \
  -d '{"name":"dest","listen":"0.0.0.0:9001","upstream":"localhost:9000"}'

# Add latency: 300ms delay, 50ms jitter
curl -s -X POST http://localhost:8474/proxies/dest/toxics \
  -H 'Content-Type: application/json' \
  -d '{"name":"latency","type":"latency","attributes":{"latency":300,"jitter":50}}'

# Simulate a bandwidth cap that kills connections (network partition)
curl -s -X POST http://localhost:8474/proxies/dest/toxics \
  -H 'Content-Type: application/json' \
  -d '{"name":"partition","type":"limit_data","attributes":{"bytes":0}}'

# Remove a toxic
curl -s -X DELETE http://localhost:8474/proxies/dest/toxics/latency
```

With Toxiproxy in place, point your delivery worker at `localhost:9001` instead of the real destination. Your worker sees an upstream with genuine network characteristics — not just a test handler returning a status code.
This is how you find bugs like "our HTTP client doesn't set a dial timeout, so a network partition hangs the goroutine for 30 minutes" or "our retry backoff has no ceiling, so a recovering destination gets hammered as it comes back online."
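Both bugs have small fixes once you know to look for them. Here's a sketch of the two guards, assuming you control the worker's HTTP client; the specific timeout and ceiling values are illustrative, not recommendations:

```go
package delivery

import (
	"net"
	"net/http"
	"time"
)

// NewClient builds the worker's HTTP client with explicit connect and
// end-to-end deadlines, so a partitioned destination fails the attempt
// instead of parking a goroutine forever.
func NewClient() *http.Client {
	return &http.Client{
		Timeout: 10 * time.Second, // hard cap on the whole request
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout: 3 * time.Second, // TCP connect must succeed quickly
			}).DialContext,
			TLSHandshakeTimeout:   3 * time.Second,
			ResponseHeaderTimeout: 10 * time.Second,
		},
	}
}

// Backoff returns the delay before a given retry attempt: exponential, but
// capped, so a recovering destination isn't hammered and delays can't grow
// without bound.
func Backoff(attempt int) time.Duration {
	const ceiling = 5 * time.Minute
	d := time.Duration(1<<uint(attempt)) * time.Second
	if d > ceiling || d <= 0 { // the shift overflows for large attempt counts
		return ceiling
	}
	return d
}
```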
## The Chaos Test Matrix
Run through this matrix when validating a new delivery system or a meaningful change to retry logic:
| Scenario | Inject via | Expected behavior | Failure signal |
|---|---|---|---|
| Destination times out | /timeout or Toxiproxy timeout toxic | Retry after timeout, exponential backoff | Worker goroutine leak |
| Destination returns 500 | Static 500 server or /flap | Retry up to max attempts, then DLQ | Events disappear with no DLQ entry |
| Destination returns 400 | Static 400 server | No retry, permanent failure recorded | 400 incorrectly triggers retry |
| Network partition | Toxiproxy bandwidth limit | Retry with fresh TCP connection | Delivery hangs indefinitely |
| Intermittent 5xx | /flap | Eventual success within retry budget | Never recovers despite destination recovery |
| Worker restart during delivery | Kill worker mid-batch | In-flight events retry after restart | Event loss or double delivery |
| Destination recovers after DLQ | /fail?after=5 then swap to /ok | Manual replay succeeds | Replay re-DLQs immediately |
| Slow destination (2s response) | /slow?ms=2000 | Delivered once within timeout | Timeout fires, duplicate delivered |
Document which behaviors your system exhibits and which you consider bugs. Not all deviations are bugs: delivering twice is sometimes acceptable, sometimes catastrophic. The point is an explicit decision, not a surprise under production load.
## Running Chaos Scenarios in CI
Add a subset of chaos tests to your integration test suite. The full matrix above is for periodic manual validation; CI should cover the most critical invariants:
- Timeout safety: the delivery worker must not leak goroutines when a destination hangs indefinitely.
- Retry on 5xx, no retry on 4xx: 500 responses must consume retry budget; 400 responses must fail permanently without retrying.
- DLQ completeness: after max retry attempts, events must be queryable in the `dead_letter` state with the full attempt history attached.
- Worker restart safety: no event should be permanently lost after an unclean worker shutdown.
The test structure is straightforward. Start the chaos destination as a fixture, configure your delivery system to target it, emit events, and then assert on final event state in your database. A clean invariant to check: send 100 events through a `/flap` destination, and after the retry cycle completes, every event should be in either `delivered` or `dead_letter`. Zero events should be stuck in `delivering` or `queued`. That single assertion catches a surprising share of real bugs.
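In Go test form, that invariant might look roughly like the following. The `chaos` and `pipeline` packages, their methods, and the state names are stand-ins for your own fixtures and schema, not a real API:

```go
package delivery_test

import (
	"fmt"
	"testing"
	"time"
	// "chaos" and "pipeline" below are hypothetical fixture packages;
	// substitute your own test helpers for starting the chaos server
	// and driving the delivery system under test.
)

// TestFlapDestinationSettles asserts that after the retry cycle drains,
// every event is in a terminal state.
func TestFlapDestinationSettles(t *testing.T) {
	dest := chaos.Start(t)                 // the chaos destination server, started as a fixture
	p := pipeline.New(t, dest.URL+"/flap") // delivery system under test, pointed at /flap

	for i := 0; i < 100; i++ {
		p.Emit(t, fmt.Sprintf("evt-%d", i))
	}
	p.WaitForRetriesToDrain(t, 2*time.Minute)

	counts := p.CountByState(t) // e.g. map[string]int keyed by event state
	if terminal := counts["delivered"] + counts["dead_letter"]; terminal != 100 {
		t.Fatalf("want 100 terminal events, got %d (states: %v)", terminal, counts)
	}
	if counts["delivering"] != 0 || counts["queued"] != 0 {
		t.Fatalf("events stuck in non-terminal states: %v", counts)
	}
}
```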
## What Chaos Testing Reveals About Your Retry Design
The most common finding from chaos testing webhook delivery systems is that retry logic is too optimistic. Systems built around the happy path tend to have:
- No retry window cap: events stay in `retry_scheduled` for days, hiding real failures from operators.
- Retry on 4xx: non-retriable errors consume retry budget and queue capacity on events that will never succeed.
- No circuit breaker: a destination returning 500 on every attempt gets hammered every retry cycle instead of being paused and given time to recover (a minimal breaker sketch follows this list).
- Goroutine leaks on timeout: HTTP clients without explicit dial timeouts accumulate hung goroutines as destinations slow down under load.
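For the circuit-breaker gap in particular, even a small per-destination breaker changes how your system behaves under sustained failure. A deliberately minimal sketch, with arbitrary threshold and cool-off values:

```go
package delivery

import (
	"sync"
	"time"
)

// breaker pauses deliveries to one destination after repeated failures and
// lets traffic resume once a cool-off has elapsed. Production breakers
// usually add a half-open trial state and metrics.
type breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time
}

const (
	failureThreshold = 5
	coolOff          = 30 * time.Second
)

// Allow reports whether the worker should attempt delivery right now.
func (b *breaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil)
}

// Record feeds the result of an attempt back into the breaker.
func (b *breaker) Record(success bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if success {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= failureThreshold {
		b.openUntil = time.Now().Add(coolOff)
		b.failures = 0
	}
}
```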
None of these are obvious from reading code. They're obvious when you run the chaos matrix and watch what actually happens.
GetHook's delivery engine includes per-destination circuit breaking, configurable retry budgets, and enforced delivery timeouts, so many of these failure modes are handled at the infrastructure layer when you route events through it. That said, your consumer still needs to handle duplicate delivery correctly — retries can and do deliver the same event more than once, and idempotency is always your responsibility.
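On the consumer side, handling duplicates usually means deduplicating on a unique event ID before doing any work. A minimal sketch, where the `X-Event-Id` header name is illustrative and a real implementation would persist seen IDs rather than hold them in memory:

```go
package consumer

import (
	"net/http"
	"sync"
)

// seen tracks processed event IDs. In production this would be a database
// table or a Redis set with a TTL, not an in-memory map.
var seen sync.Map

// handleWebhook processes each event at most once even when the provider
// retries a delivery. "X-Event-Id" is an illustrative header name; use
// whatever unique identifier your provider actually sends.
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	id := r.Header.Get("X-Event-Id")
	if id == "" {
		http.Error(w, "missing event id", http.StatusBadRequest)
		return
	}
	if _, dup := seen.LoadOrStore(id, struct{}{}); dup {
		w.WriteHeader(http.StatusOK) // duplicate: acknowledge without reprocessing
		return
	}
	// ... perform the real side effects exactly once here ...
	w.WriteHeader(http.StatusOK)
}
```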
## Starting Small
You don't need a full chaos engineering program to get value from this. Start with two tests:
- Timeout test: configure your delivery worker to point at an endpoint that never responds. Confirm the delivery times out, the event moves to retry state, and no goroutines are left hanging (a sketch of the leak check follows this list).
- 5xx exhaustion test: configure a destination that always returns 500. Run the full retry cycle to completion and confirm the event lands in DLQ with the correct attempt count and that it remains queryable for replay.
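For the timeout test, the leak check can be as simple as comparing goroutine counts before and after. `worker.Deliver` and `chaosURL` below are stand-ins for your own delivery call and chaos server address:

```go
package delivery_test

import (
	"runtime"
	"testing"
	"time"
)

// TestTimeoutDoesNotLeakGoroutines points the worker at the /timeout endpoint
// and checks that the goroutine count settles back down afterward.
func TestTimeoutDoesNotLeakGoroutines(t *testing.T) {
	before := runtime.NumGoroutine()

	for i := 0; i < 20; i++ {
		worker.Deliver(chaosURL + "/timeout") // each attempt should hit the client timeout
	}

	time.Sleep(2 * time.Second) // give timed-out attempts a moment to unwind
	after := runtime.NumGoroutine()
	if after > before+2 { // small slack for runtime background goroutines
		t.Fatalf("goroutine leak: %d goroutines before, %d after", before, after)
	}
}
```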
Those two tests will surface the majority of reliability bugs in a webhook delivery system. Add scenarios from the matrix as you find gaps in your coverage.
A webhook pipeline that you've run through a chaos matrix is one you can operate with confidence. With one you haven't tested under failure, you're discovering its behavior at the same time your customers are.
See how GetHook handles delivery failures, retries, and dead-letter queues out of the box →