Webhook consumers look like ordinary HTTP services, but they behave differently under load and during deployments. A standard API endpoint can safely return HTTP 503 when it isn't ready — the client retries. A webhook sender, depending on its configuration, may treat that 503 as a failed delivery, start an exponential backoff window, or — in the worst case — silently drop the event after exhausting retries.
Running a webhook consumer in Kubernetes without tuning it for these semantics will cause dropped events during rollouts. This post covers the configuration that matters: ingress setup, liveness vs. readiness probes, graceful shutdown, and rolling update strategy.
Expose Your Endpoint Correctly
There's nothing exotic about a webhook ingress — it's a standard HTTP route. But two details are easy to miss:
1. Preserve the Host header
Many providers include the destination hostname in their HMAC signature computation. If your ingress rewrites or strips the Host header, signature verification will fail for every request. With NGINX ingress, pass the original host explicitly:
```yaml
nginx.ingress.kubernetes.io/configuration-snippet: |
  proxy_set_header Host $http_host;
```
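On the consumer side, that header feeds straight into verification. Here is a minimal sketch, assuming a hypothetical provider that signs host + path + raw body with a shared secret and sends the hex digest in an X-Webhook-Signature header (the function and header name are placeholders; check your provider's docs for the real scheme). The point is the dependence on r.Host, which the ingress must pass through untouched:

```go
import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"net/http"
)

// verifySignature is a sketch for a hypothetical scheme that signs
// "<host><path><body>". Real providers differ, but the principle holds:
// if the ingress rewrites the Host header, r.Host no longer matches what
// the sender signed, and every delivery fails verification.
func verifySignature(r *http.Request, body, secret []byte) bool {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(r.Host)) // Host header as forwarded by the ingress
	mac.Write([]byte(r.URL.Path))
	mac.Write(body)
	expected := hex.EncodeToString(mac.Sum(nil))
	// Header name is a placeholder; providers use their own.
	return hmac.Equal([]byte(expected), []byte(r.Header.Get("X-Webhook-Signature")))
}
```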
2. Raise body size and read timeout limits

Batch webhooks — aggregated events from providers like Shopify or Stripe — can be several megabytes. Default NGINX limits (1m body size, 60s read timeout) will reject oversized payloads with 413 Request Entity Too Large, a failure that's easy to miss unless you're watching ingress logs. Set these on the webhook ingress specifically so you aren't relaxing limits globally:
```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "16m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "90"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "90"
```

Health Probes: The Most Misunderstood Part
Most teams copy a liveness probe from a tutorial and move on. For webhook consumers, the liveness/readiness distinction is load-bearing.
Liveness probe — answers: "is this process alive and worth keeping?" When liveness fails, Kubernetes kills and restarts the pod. A flapping liveness probe causes cascading restarts that drop in-flight events.
Readiness probe — answers: "should I route traffic to this pod right now?" When readiness fails, the pod is removed from the Service's endpoint list without being killed. Traffic stops arriving; the pod waits for its dependency to recover.
For webhook consumers, readiness is almost always the right primitive. Your consumer is "ready" when it can accept an event and durably enqueue it. If your database or internal queue is unhealthy, return 503 on /readyz — the sender retries against a healthy pod.
```go
// /healthz — is the process alive?
mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write([]byte("ok"))
})

// /readyz — can we durably accept events right now?
mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()
	if err := db.PingContext(ctx); err != nil {
		http.Error(w, "db unavailable", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write([]byte("ready"))
})
```

In your pod spec:
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```

Keep failureThreshold low on readiness — with a 5-second period and a threshold of 2, traffic diverts within roughly ten seconds of your connection pool going unhealthy, rather than after 30 seconds of 503s exhausting the sender's retry budget. Set timeoutSeconds above the /readyz handler's own 2-second dependency check so the kubelet (default timeout: 1 second) doesn't time the probe out before the handler can answer.
Graceful Shutdown
When Kubernetes terminates a pod, it sends SIGTERM and then waits terminationGracePeriodSeconds before issuing SIGKILL. The default grace period is 30 seconds, which is plenty of time to drain in-flight requests — but only if your application actually responds to SIGTERM.
Go's net/http server does nothing graceful on its own. If the process never catches SIGTERM, the Go runtime's default behavior is to exit immediately, cutting off in-flight requests; if it catches the signal but never calls Shutdown, the server keeps accepting new connections right up until it is killed. Either way, a rolling update produces connection resets mid-request.
```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:         ":8080",
		Handler:      buildRouter(),
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 60 * time.Second,
	}

	// Serve in the background so main can block waiting for the signal.
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// SIGTERM is what Kubernetes sends; SIGINT covers local Ctrl-C runs.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
	<-stop

	log.Println("shutting down — draining in-flight requests")

	// 25s is deliberately shorter than terminationGracePeriodSeconds (30s below).
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown error: %v", err)
	}
	log.Println("shutdown complete")
}
```

Set terminationGracePeriodSeconds a few seconds longer than your application's shutdown timeout to avoid races:
```yaml
spec:
  terminationGracePeriodSeconds: 30 # app drains in 25s; K8s allows 30s
```

Rolling Update Strategy
The default RollingUpdate strategy is the right choice, but the default maxUnavailable: 25% is often too aggressive for webhook consumers. With a 4-replica deployment it lets a rollout drop you to 3 pods — and if the replacement pods are still cold-starting or running migrations, effective capacity is even lower exactly when retrying senders are adding load.
Use maxUnavailable: 0 combined with maxSurge: 1. Kubernetes will spin up a new pod before removing an old one, so you never dip below your target replica count:
```yaml
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
```

The trade-off is brief overcapacity (5 pods during the rollout) and slightly slower deployments. For webhook consumers, that's the correct trade-off.
Autoscaling Webhook Workloads
Webhook traffic is bursty. CPU utilization is a lagging indicator for I/O-bound consumers — by the time CPU rises, you're already dropping requests. Scale on metrics that reflect actual queue pressure:
| Metric | Typical Threshold | Notes |
|---|---|---|
| HTTP requests per second | 200–400 req/s per pod | Best general-purpose signal |
| p95 request latency | > 400ms | Early signal of consumer backpressure |
| Pod CPU utilization | 70% | Useful only if processing is CPU-bound |
| Queue depth (custom metric) | > 500 unprocessed | Best signal if you use an internal queue |
With the NGINX ingress controller, you can expose RPS via Prometheus and feed it into an HPA using the external metrics API. For most teams, starting with CPU-based HPA and refining to RPS after a few weeks of production data is the pragmatic approach — better than tuning without real traffic patterns.
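If you do move to an RPS or latency target, the consumer has to expose those numbers first. Below is a minimal sketch using the Prometheus Go client (github.com/prometheus/client_golang); the metric names, the instrument wrapper, and the registerRoutes/handleEvent names are illustrative, so align them with whatever your metrics adapter is configured to query for the HPA:

```go
import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter behind the "requests per second" signal (rate() over this series).
	webhookRequests = promauto.NewCounter(prometheus.CounterOpts{
		Name: "webhook_requests_total",
		Help: "Webhook deliveries received.",
	})
	// Histogram behind the p95 latency signal.
	webhookLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "webhook_request_duration_seconds",
		Help:    "Time spent handling a delivery.",
		Buckets: prometheus.DefBuckets,
	})
)

// instrument wraps a handler so request rate and latency become scrapeable.
func instrument(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		webhookRequests.Inc()
		next(w, r)
		webhookLatency.Observe(time.Since(start).Seconds())
	}
}

// Wire-up: metrics endpoint plus the instrumented webhook route.
func registerRoutes(mux *http.ServeMux, handleEvent http.HandlerFunc) {
	mux.Handle("/metrics", promhttp.Handler())
	mux.HandleFunc("/webhooks/events", instrument(handleEvent))
}
```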
A Minimal Production Deployment
Putting it all together:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webhook-consumer
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: webhook-consumer
  template:
    metadata:
      labels:
        app: webhook-consumer
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: consumer
          image: your-registry/webhook-consumer:latest
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
```

One caveat on CPU limits: if your consumer is CPU-throttled during a burst, request latency rises and your readiness probe starts failing — removing healthy pods from the load balancer at the worst possible time. Monitor p95 latency against CPU throttle metrics (container_cpu_cfs_throttled_seconds_total) in your first few weeks to validate your limit settings.
Common Mistakes and Fixes
| Mistake | What Breaks | Fix |
|---|---|---|
| No readiness probe | Traffic sent to pods before DB connection is established | Add /readyz with a dependency ping |
| Liveness probe too aggressive | Cascading restarts under high load | Raise failureThreshold and periodSeconds |
| Missing SIGTERM handler | In-flight requests killed mid-write | Call http.Server.Shutdown on signal |
| maxUnavailable: 25% on small fleets | Rollouts dip below target capacity, triggering sender retries | Set maxUnavailable: 0, maxSurge: 1 |
| Default ingress body size limit | Large batch payloads rejected with 413 | Set proxy-body-size on webhook routes |
| CPU-only HPA | Scaling lags behind burst traffic | Add RPS or latency metric to HPA |
If you're routing inbound events through GetHook, delivery retries use exponential backoff with jitter — so a brief 503 during your rolling update doesn't burn through your retry budget before a healthy pod comes online. Connect your Kubernetes consumer to GetHook →