Webhooks fail in interesting ways. A destination goes down mid-batch, your delivery worker times out waiting for a response that never arrives, or a DNS change routes traffic to a host that returns 200 for everything — including actual error payloads. These aren't edge cases. They're routine failure modes that production webhook infrastructure hits regularly.
The problem is that none of them surface in standard integration tests. You test the happy path: event arrives, gets signed, gets delivered, destination returns 200. Retry logic, circuit breakers, and timeout handling are code paths that only activate under failure conditions you haven't written a test for yet.
Chaos engineering is the practice of deliberately inducing those failures to find weaknesses before your users do. For webhook delivery pipelines, it's one of the highest-return reliability investments a team can make.
## The Failure Modes That Matter
Before you inject failures, you need a catalog of what actually goes wrong in webhook delivery:
| Failure Mode | Root Cause | Typical Symptom |
|---|---|---|
| Destination timeout | Slow consumer, long processing | Events stuck in delivering |
| Destination 5xx | Consumer crash, deploy, DB failure | Retry queue grows, eventual DLQ |
| Destination 4xx | Bad endpoint, auth mismatch | Events fail immediately, no retry |
| Network partition | Firewall rule, routing change | TCP connection hangs indefinitely |
| DNS failure | Misconfigured TTL, expired domain | Delivery fails at resolution step |
| Intermittent 5xx | Overloaded consumer, rate limit | Retry succeeds on attempt 2 or 3 |
| Flapping destination | Consumer repeatedly crashes/recovers | Retry exhausted, premature DLQ |
| Slow response | Consumer doing synchronous work | Worker thread/goroutine pool exhaustion |
| Connection reset | Load balancer restart, pod eviction | Partial delivery, ambiguous retry |
| Large payload rejection | 413 from consumer | Permanent failure, retry is wasteful |
Not every webhook system handles all of these correctly. The only way to know is to test them.
## Building a Chaos Destination Server
The fastest way to inject failures is a local destination server that returns whatever failure mode you want to exercise. Here's a minimal Go server that simulates the most common failure patterns:
```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"strconv"
	"sync/atomic"
	"time"
)

func main() {
	mux := http.NewServeMux()

	// /ok — always succeeds
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// /timeout — hangs for 10 minutes (tests worker timeout enforcement)
	mux.HandleFunc("/timeout", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(10 * time.Minute)
	})

	// /slow?ms=<n> — delays by query param (tests slow-response handling)
	mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		ms, _ := strconv.Atoi(r.URL.Query().Get("ms"))
		if ms > 0 {
			time.Sleep(time.Duration(ms) * time.Millisecond)
		}
		w.WriteHeader(http.StatusOK)
	})

	// /flap — alternates between 200 and 500 on each request
	var flapCount atomic.Int64
	mux.HandleFunc("/flap", func(w http.ResponseWriter, r *http.Request) {
		n := flapCount.Add(1)
		if n%2 == 0 {
			w.WriteHeader(http.StatusOK)
		} else {
			w.WriteHeader(http.StatusInternalServerError)
		}
	})

	// /fail?after=<n> — 200 for first n requests, then permanent 500
	var successes atomic.Int64
	mux.HandleFunc("/fail", func(w http.ResponseWriter, r *http.Request) {
		after, _ := strconv.ParseInt(r.URL.Query().Get("after"), 10, 64)
		if successes.Load() < after {
			successes.Add(1)
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusInternalServerError)
		fmt.Fprint(w, `{"error":"service unavailable"}`)
	})

	// /reset — closes connection immediately without a response
	mux.HandleFunc("/reset", func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		conn, _, _ := hj.Hijack()
		conn.Close()
	})

	port := os.Getenv("PORT")
	if port == "" {
		port = "9000"
	}
	log.Printf("chaos destination running on :%s", port)
	log.Fatal(http.ListenAndServe(":"+port, mux))
}
```

Point your webhook delivery system at each endpoint in turn and verify the behavior you expect: timeouts should trigger retries, 5xx should back off, 4xx should stop immediately, and connection resets should retry with a fresh connection.
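Those expectations are easier to verify when they're written down as code. Here's one shape a worker-side classification helper might take; this is a sketch of a reasonable policy, not a prescription, and the exact status-code handling is yours to decide:

```go
package delivery

import (
	"errors"
	"net"
	"net/http"
)

// Outcome is what the worker should do with one delivery attempt.
type Outcome int

const (
	Delivered Outcome = iota // 2xx: mark the event delivered
	Retry                    // timeout, transport error, 429, or 5xx: schedule a retry
	Fail                     // any other 4xx: permanent failure, do not retry
)

// Classify maps a delivery attempt's response (or transport error) to an Outcome.
func Classify(resp *http.Response, err error) Outcome {
	if err != nil {
		var netErr net.Error
		if errors.As(err, &netErr) && netErr.Timeout() {
			return Retry // destination hung past our deadline
		}
		return Retry // connection reset, DNS failure, etc.: retry on a fresh connection
	}
	switch {
	case resp.StatusCode >= 200 && resp.StatusCode < 300:
		return Delivered
	case resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500:
		return Retry
	default:
		return Fail
	}
}
```

Treating 429 and 5xx as retriable while failing fast on other 4xx is the policy the chaos endpoints above exercise; if your policy differs, adjust your expectations for each endpoint to match.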
## Network-Level Chaos with Toxiproxy
The chaos server above tests application-level failures. For network-level failures — packet loss, latency spikes, bandwidth throttling — Toxiproxy is the right tool.
Toxiproxy is a TCP proxy that sits between your delivery worker and the destination, with a REST API for injecting conditions at runtime:
```bash
# Start Toxiproxy
docker run -d -p 8474:8474 -p 9001:9001 ghcr.io/shopify/toxiproxy

# Create a proxy: localhost:9001 → your chaos destination on :9000
curl -s -X POST http://localhost:8474/proxies \
  -H 'Content-Type: application/json' \
  -d '{"name":"dest","listen":"0.0.0.0:9001","upstream":"localhost:9000"}'

# Add latency: 300ms delay, 50ms jitter
curl -s -X POST http://localhost:8474/proxies/dest/toxics \
  -H 'Content-Type: application/json' \
  -d '{"name":"latency","type":"latency","attributes":{"latency":300,"jitter":50}}'

# Simulate a bandwidth cap that kills connections (network partition)
curl -s -X POST http://localhost:8474/proxies/dest/toxics \
  -H 'Content-Type: application/json' \
  -d '{"name":"partition","type":"limit_data","attributes":{"bytes":0}}'

# Remove a toxic
curl -s -X DELETE http://localhost:8474/proxies/dest/toxics/latency
```

With Toxiproxy in place, point your delivery worker at `localhost:9001` instead of the real destination. Your worker sees an upstream with genuine network characteristics — not just a test handler returning a status code.
This is how you find bugs like "our HTTP client doesn't set a dial timeout, so a network partition hangs the goroutine for 30 minutes" or "our retry backoff has no ceiling, so a recovering destination gets hammered as it comes back online."
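Both bugs have small fixes once you know to look for them. Here's a sketch of the two guards, assuming you control the worker's HTTP client; the specific timeout and ceiling values are illustrative, not recommendations:

```go
package delivery

import (
	"net"
	"net/http"
	"time"
)

// NewClient builds the worker's HTTP client with explicit connect and
// end-to-end deadlines, so a partitioned destination fails the attempt
// instead of parking a goroutine forever.
func NewClient() *http.Client {
	return &http.Client{
		Timeout: 10 * time.Second, // hard cap on the whole request
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout: 3 * time.Second, // TCP connect must succeed quickly
			}).DialContext,
			TLSHandshakeTimeout:   3 * time.Second,
			ResponseHeaderTimeout: 10 * time.Second,
		},
	}
}

// Backoff returns the delay before a given retry attempt: exponential, but
// capped, so a recovering destination isn't hammered and delays can't grow
// without bound.
func Backoff(attempt int) time.Duration {
	const ceiling = 5 * time.Minute
	d := time.Duration(1<<uint(attempt)) * time.Second
	if d > ceiling || d <= 0 { // the shift overflows for large attempt counts
		return ceiling
	}
	return d
}
```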
## The Chaos Test Matrix
Run through this matrix when validating a new delivery system or a meaningful change to retry logic:
| Scenario | Inject via | Expected behavior | Failure signal |
|---|---|---|---|
| Destination times out | /timeout or Toxiproxy timeout toxic | Retry after timeout, exponential backoff | Worker goroutine leak |
| Destination returns 500 | Static 500 server or /flap | Retry up to max attempts, then DLQ | Events disappear with no DLQ entry |
| Destination returns 400 | Static 400 server | No retry, permanent failure recorded | 400 incorrectly triggers retry |
| Network partition | Toxiproxy bandwidth limit | Retry with fresh TCP connection | Delivery hangs indefinitely |
| Intermittent 5xx | /flap | Eventual success within retry budget | Never recovers despite destination recovery |
| Worker restart during delivery | Kill worker mid-batch | In-flight events retry after restart | Event loss or double delivery |
| Destination recovers after DLQ | /fail?after=5 then swap to /ok | Manual replay succeeds | Replay re-DLQs immediately |
| Slow destination (2s response) | /slow?ms=2000 | Delivered once within timeout | Timeout fires, duplicate delivered |
Document which behaviors your system exhibits and which you consider bugs. Not all deviations are bugs: delivering twice is sometimes acceptable, sometimes catastrophic. The point is an explicit decision, not a surprise under production load.
## Running Chaos Scenarios in CI
Add a subset of chaos tests to your integration test suite. The full matrix above is for periodic manual validation; CI should cover the most critical invariants:
- Timeout safety: the delivery worker must not leak goroutines when a destination hangs indefinitely.
- Retry on 5xx, no retry on 4xx: 500 responses must consume retry budget; 400 responses must fail permanently without retrying.
- DLQ completeness: after max retry attempts, events must be queryable in the `dead_letter` state with the full attempt history attached.
- Worker restart safety: no event should be permanently lost after an unclean worker shutdown.
The test structure is straightforward. Start the chaos destination as a fixture, configure your delivery system to target it, emit events, and then assert on final event state in your database. A clean invariant to check: send 100 events through a `/flap` destination, and after the retry cycle completes, every event should be in either `delivered` or `dead_letter`. Zero events should be stuck in `delivering` or `queued`. That single assertion catches a surprising share of real bugs.
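In Go test form, that invariant might look roughly like the following. The `chaos` and `pipeline` packages, their methods, and the state names are stand-ins for your own fixtures and schema, not a real API:

```go
package delivery_test

import (
	"fmt"
	"testing"
	"time"
	// "chaos" and "pipeline" below are hypothetical fixture packages;
	// substitute your own test helpers for starting the chaos server
	// and driving the delivery system under test.
)

// TestFlapDestinationSettles asserts that after the retry cycle drains,
// every event is in a terminal state.
func TestFlapDestinationSettles(t *testing.T) {
	dest := chaos.Start(t)                 // the chaos destination server, started as a fixture
	p := pipeline.New(t, dest.URL+"/flap") // delivery system under test, pointed at /flap

	for i := 0; i < 100; i++ {
		p.Emit(t, fmt.Sprintf("evt-%d", i))
	}
	p.WaitForRetriesToDrain(t, 2*time.Minute)

	counts := p.CountByState(t) // e.g. map[string]int keyed by event state
	if terminal := counts["delivered"] + counts["dead_letter"]; terminal != 100 {
		t.Fatalf("want 100 terminal events, got %d (states: %v)", terminal, counts)
	}
	if counts["delivering"] != 0 || counts["queued"] != 0 {
		t.Fatalf("events stuck in non-terminal states: %v", counts)
	}
}
```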
## What Chaos Testing Reveals About Your Retry Design
The most common finding from chaos testing webhook delivery systems is that retry logic is too optimistic. Systems built around the happy path tend to have:
- No retry window cap: events stay in `retry_scheduled` for days, hiding real failures from operators.
- Retry on 4xx: non-retriable errors consume retry budget and queue capacity on events that will never succeed.
- No circuit breaker: a destination returning 500 on every attempt gets hammered every retry cycle instead of being paused and given time to recover (a minimal breaker sketch follows this list).
- Goroutine leaks on timeout: HTTP clients without explicit dial timeouts accumulate hung goroutines as destinations slow down under load.
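For the circuit-breaker gap in particular, even a small per-destination breaker changes how your system behaves under sustained failure. A deliberately minimal sketch, with arbitrary threshold and cool-off values:

```go
package delivery

import (
	"sync"
	"time"
)

// breaker pauses deliveries to one destination after repeated failures and
// lets traffic resume once a cool-off has elapsed. Production breakers
// usually add a half-open trial state and metrics.
type breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time
}

const (
	failureThreshold = 5
	coolOff          = 30 * time.Second
)

// Allow reports whether the worker should attempt delivery right now.
func (b *breaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil)
}

// Record feeds the result of an attempt back into the breaker.
func (b *breaker) Record(success bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if success {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= failureThreshold {
		b.openUntil = time.Now().Add(coolOff)
		b.failures = 0
	}
}
```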
None of these are obvious from reading code. They're obvious when you run the chaos matrix and watch what actually happens.
GetHook's delivery engine includes per-destination circuit breaking, configurable retry budgets, and enforced delivery timeouts, so many of these failure modes are handled at the infrastructure layer when you route events through it. That said, your consumer still needs to handle duplicate delivery correctly — retries can and do deliver the same event more than once, and idempotency is always your responsibility.
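On the consumer side, handling duplicates usually means deduplicating on a unique event ID before doing any work. A minimal sketch, where the `X-Event-Id` header name is illustrative and a real implementation would persist seen IDs rather than hold them in memory:

```go
package consumer

import (
	"net/http"
	"sync"
)

// seen tracks processed event IDs. In production this would be a database
// table or a Redis set with a TTL, not an in-memory map.
var seen sync.Map

// handleWebhook processes each event at most once even when the provider
// retries a delivery. "X-Event-Id" is an illustrative header name; use
// whatever unique identifier your provider actually sends.
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	id := r.Header.Get("X-Event-Id")
	if id == "" {
		http.Error(w, "missing event id", http.StatusBadRequest)
		return
	}
	if _, dup := seen.LoadOrStore(id, struct{}{}); dup {
		w.WriteHeader(http.StatusOK) // duplicate: acknowledge without reprocessing
		return
	}
	// ... perform the real side effects exactly once here ...
	w.WriteHeader(http.StatusOK)
}
```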
## Starting Small
You don't need a full chaos engineering program to get value from this. Start with two tests:
- Timeout test: configure your delivery worker to point at an endpoint that never responds. Confirm the delivery times out, the event moves to retry state, and no goroutines are left hanging (a sketch of the leak check follows this list).
- 5xx exhaustion test: configure a destination that always returns 500. Run the full retry cycle to completion and confirm the event lands in DLQ with the correct attempt count and that it remains queryable for replay.
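For the timeout test, the leak check can be as simple as comparing goroutine counts before and after. `worker.Deliver` and `chaosURL` below are stand-ins for your own delivery call and chaos server address:

```go
package delivery_test

import (
	"runtime"
	"testing"
	"time"
)

// TestTimeoutDoesNotLeakGoroutines points the worker at the /timeout endpoint
// and checks that the goroutine count settles back down afterward.
func TestTimeoutDoesNotLeakGoroutines(t *testing.T) {
	before := runtime.NumGoroutine()

	for i := 0; i < 20; i++ {
		worker.Deliver(chaosURL + "/timeout") // each attempt should hit the client timeout
	}

	time.Sleep(2 * time.Second) // give timed-out attempts a moment to unwind
	after := runtime.NumGoroutine()
	if after > before+2 { // small slack for runtime background goroutines
		t.Fatalf("goroutine leak: %d goroutines before, %d after", before, after)
	}
}
```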
Those two tests will surface the majority of reliability bugs in a webhook delivery system. Add scenarios from the matrix as you find gaps in your coverage.
A webhook pipeline that you've run through a chaos matrix is one you can operate with confidence. With one you haven't tested under failure, you're discovering its behavior at the same time your customers are.
See how GetHook handles delivery failures, retries, and dead-letter queues out of the box →