reliability · chaos-engineering · webhooks · testing · infrastructure

Webhook Chaos Engineering: Fault-Injecting Your Delivery Pipeline Before Production Does

Most webhook delivery bugs only surface under failure conditions that are hard to reproduce in tests. Chaos engineering gives you a systematic way to find them before your customers do.

Jordan Okafor
Senior Backend Engineer
April 29, 2026
10 min read

Webhooks fail in interesting ways. A destination goes down mid-batch, your delivery worker times out waiting for a response that never arrives, or a DNS change routes traffic to a host that returns 200 for everything — including actual error payloads. These aren't edge cases. They're routine failure modes that production webhook infrastructure hits regularly.

The problem is that none of them surface in standard integration tests. You test the happy path: event arrives, gets signed, gets delivered, destination returns 200. Retry logic, circuit breakers, and timeout handling are code paths that only activate under failure conditions you haven't written a test for yet.

Chaos engineering is the practice of deliberately inducing those failures to find weaknesses before your users do. For webhook delivery pipelines, it's one of the highest-return reliability investments a team can make.


The Failure Modes That Matter

Before you inject failures, you need a catalog of what actually goes wrong in webhook delivery:

| Failure Mode | Root Cause | Typical Symptom |
| --- | --- | --- |
| Destination timeout | Slow consumer, long processing | Events stuck in delivering |
| Destination 5xx | Consumer crash, deploy, DB failure | Retry queue grows, eventual DLQ |
| Destination 4xx | Bad endpoint, auth mismatch | Events fail immediately, no retry |
| Network partition | Firewall rule, routing change | TCP connection hangs indefinitely |
| DNS failure | Misconfigured TTL, expired domain | Delivery fails at resolution step |
| Intermittent 5xx | Overloaded consumer, rate limit | Retry succeeds on attempt 2 or 3 |
| Flapping destination | Consumer repeatedly crashes/recovers | Retry exhausted, premature DLQ |
| Slow response | Consumer doing synchronous work | Worker thread/goroutine pool exhaustion |
| Connection reset | Load balancer restart, pod eviction | Partial delivery, ambiguous retry |
| Large payload rejection | 413 from consumer | Permanent failure, retry is wasteful |

Not every webhook system handles all of these correctly. The only way to know is to test them.


Building a Chaos Destination Server

The fastest way to inject failures is a local destination server that returns whatever failure mode you want to exercise. Here's a minimal Go server that simulates the most common failure patterns:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"strconv"
	"sync/atomic"
	"time"
)

func main() {
	mux := http.NewServeMux()

	// /ok — always succeeds
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// /timeout — hangs for 10 minutes (tests worker timeout enforcement)
	mux.HandleFunc("/timeout", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(10 * time.Minute)
	})

	// /slow?ms=<n> — delays by query param (tests slow-response handling)
	mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		ms, _ := strconv.Atoi(r.URL.Query().Get("ms"))
		if ms > 0 {
			time.Sleep(time.Duration(ms) * time.Millisecond)
		}
		w.WriteHeader(http.StatusOK)
	})

	// /flap — alternates between 200 and 500 on each request
	var flapCount atomic.Int64
	mux.HandleFunc("/flap", func(w http.ResponseWriter, r *http.Request) {
		n := flapCount.Add(1)
		if n%2 == 0 {
			w.WriteHeader(http.StatusOK)
		} else {
			w.WriteHeader(http.StatusInternalServerError)
		}
	})

	// /fail?after=<n> — 200 for first n requests, then permanent 500
	var successes atomic.Int64
	mux.HandleFunc("/fail", func(w http.ResponseWriter, r *http.Request) {
		after, _ := strconv.ParseInt(r.URL.Query().Get("after"), 10, 64)
		if successes.Load() < after {
			successes.Add(1)
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusInternalServerError)
		fmt.Fprint(w, `{"error":"service unavailable"}`)
	})

	// /reset — closes connection immediately without a response
	mux.HandleFunc("/reset", func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		conn, _, _ := hj.Hijack()
		conn.Close()
	})

	port := os.Getenv("PORT")
	if port == "" {
		port = "9000"
	}
	log.Printf("chaos destination running on :%s", port)
	log.Fatal(http.ListenAndServe(":"+port, mux))
}
```

Point your webhook delivery system at each endpoint in turn and verify the behavior you expect: timeouts should trigger retries, 5xx should back off, 4xx should stop immediately, connection resets should retry with a fresh connection.


Network-Level Chaos with Toxiproxy

The chaos server above tests application-level failures. For network-level failures — packet loss, latency spikes, bandwidth throttling — Toxiproxy is the right tool.

Toxiproxy is a TCP proxy that sits between your delivery worker and the destination, with a REST API for injecting conditions at runtime:

```bash
# Start Toxiproxy
docker run -d -p 8474:8474 -p 9001:9001 ghcr.io/shopify/toxiproxy

# Create a proxy: localhost:9001 → your chaos destination on :9000
curl -s -X POST http://localhost:8474/proxies \
  -H 'Content-Type: application/json' \
  -d '{"name":"dest","listen":"0.0.0.0:9001","upstream":"localhost:9000"}'

# Add latency: 300ms delay, 50ms jitter
curl -s -X POST http://localhost:8474/proxies/dest/toxics \
  -H 'Content-Type: application/json' \
  -d '{"name":"latency","type":"latency","attributes":{"latency":300,"jitter":50}}'

# Simulate a bandwidth cap that kills connections (network partition)
curl -s -X POST http://localhost:8474/proxies/dest/toxics \
  -H 'Content-Type: application/json' \
  -d '{"name":"partition","type":"limit_data","attributes":{"bytes":0}}'

# Remove a toxic
curl -s -X DELETE http://localhost:8474/proxies/dest/toxics/latency
```

With Toxiproxy in place, point your delivery worker at localhost:9001 instead of the real destination. Your worker sees an upstream with genuine network characteristics — not just a test handler returning a status code.

This is how you find bugs like "our HTTP client doesn't set a dial timeout, so a network partition hangs the goroutine for 30 minutes" or "our retry backoff has no ceiling, so a recovering destination gets hammered as it comes back online."


The Chaos Test Matrix

Run through this matrix when validating a new delivery system or a meaningful change to retry logic:

| Scenario | Inject via | Expected behavior | Failure signal |
| --- | --- | --- | --- |
| Destination times out | /timeout or Toxiproxy timeout toxic | Retry after timeout, exponential backoff | Worker goroutine leak |
| Destination returns 500 | Static 500 server or /flap | Retry up to max attempts, then DLQ | Events disappear with no DLQ entry |
| Destination returns 400 | Static 400 server | No retry, permanent failure recorded | 400 incorrectly triggers retry |
| Network partition | Toxiproxy bandwidth limit | Retry with fresh TCP connection | Delivery hangs indefinitely |
| Intermittent 5xx | /flap | Eventual success within retry budget | Never recovers despite destination recovery |
| Worker restart during delivery | Kill worker mid-batch | In-flight events retry after restart | Event loss or double delivery |
| Destination recovers after DLQ | /fail?after=5 then swap to /ok | Manual replay succeeds | Replay re-DLQs immediately |
| Slow destination (2s response) | /slow?ms=2000 | Delivered once within timeout | Timeout fires, duplicate delivered |

Document which behaviors your system exhibits and which you consider bugs. Not all deviations are bugs: delivering twice is sometimes acceptable, sometimes catastrophic. The point is an explicit decision, not a surprise under production load.


Running Chaos Scenarios in CI

Add a subset of chaos tests to your integration test suite. The full matrix above is for periodic manual validation; CI should cover the most critical invariants:

  • Timeout safety: the delivery worker must not leak goroutines when a destination hangs indefinitely.
  • Retry on 5xx, no retry on 4xx: 500 responses must consume retry budget; 400 responses must fail permanently without retrying.
  • DLQ completeness: after max retry attempts, events must be queryable in dead_letter state with the full attempt history attached.
  • Worker restart safety: no event should be permanently lost after an unclean worker shutdown.

The test structure is straightforward. Start the chaos destination as a fixture, configure your delivery system to target it, emit events, and then assert on final event state in your database. A clean invariant to check: send 100 events through a /flap destination, and after the retry cycle completes, every event should be in either delivered or dead_letter. Zero events should be stuck in delivering or queued. That single assertion catches a surprising share of real bugs.


What Chaos Testing Reveals About Your Retry Design

The most common finding from chaos testing webhook delivery systems is that retry logic is too optimistic. Systems built around the happy path tend to have:

  • No retry window cap: events stay in retry_scheduled for days, hiding real failures from operators.
  • Retry on 4xx: non-retriable errors consume retry budget and queue capacity on events that will never succeed.
  • No circuit breaker: a destination returning 500 on every attempt gets hammered every retry cycle instead of being paused and given time to recover.
  • Goroutine leaks on timeout: HTTP clients without explicit dial timeouts accumulate hung goroutines as destinations slow down under load.

None of these are obvious from reading code. They're obvious when you run the chaos matrix and watch what actually happens.

GetHook's delivery engine includes per-destination circuit breaking, configurable retry budgets, and enforced delivery timeouts, so many of these failure modes are handled at the infrastructure layer when you route events through it. That said, your consumer still needs to handle duplicate delivery correctly — retries can and do deliver the same event more than once, and idempotency is always your responsibility.


Starting Small

You don't need a full chaos engineering program to get value from this. Start with two tests:

  1. Timeout test: configure your delivery worker to point at an endpoint that never responds. Confirm the delivery times out, the event moves to retry state, and no goroutines are left hanging.
  2. 5xx exhaustion test: configure a destination that always returns 500. Run the full retry cycle to completion and confirm the event lands in DLQ with the correct attempt count and that it remains queryable for replay.

Those two tests will surface the majority of reliability bugs in a webhook delivery system. Add scenarios from the matrix as you find gaps in your coverage.

A webhook pipeline that you've run through a chaos matrix is one you can operate with confidence. With one you haven't, you're discovering its behavior at the same time your customers are.


See how GetHook handles delivery failures, retries, and dead-letter queues out of the box →

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.