webhooks · batching · reliability · architecture · performance

Webhook Batching: Sending and Receiving Events in Bulk Without Losing Reliability

Batching webhook events reduces overhead for both producers and consumers, but it introduces new failure modes that can silently drop events. Here's how to get the throughput wins while keeping delivery guarantees intact.

Dmitri Volkov
Distributed Systems Engineer
April 18, 2026
10 min read

Most webhook systems are designed around one-event-one-request: a thing happens, a POST fires, your endpoint processes it and returns 200. This model is simple and easy to reason about. It also becomes expensive at high throughput — in terms of HTTP connection overhead, TLS handshakes, request queuing latency, and per-request processing cost on the consumer side.

Batching solves the throughput problem by grouping multiple events into a single HTTP request. But batching changes the reliability model in ways that are easy to get wrong. A batch is not an atomic unit. Individual events within a batch can succeed or fail independently, and how you handle partial failures determines whether batching helps you or introduces a new class of silent data loss.

This post covers the mechanics of webhook batching from both sides: producing batched events as a webhook platform, and consuming them as an application. We'll focus on the failure modes that trip teams up and the patterns that handle them correctly.


Why Batching Matters at Scale

At low throughput — say, under 100 events per second — per-event delivery is fine. Each HTTP request carries a modest payload, connection pools stay warm, and consumers process events fast enough that queue depth stays low.

At high throughput — thousands of events per second per destination — the economics shift:

Approach                       Connections/sec   TLS handshakes/sec                 Avg latency per event
Per-event (1 event/req)        5,000             5,000 (~500 with TLS resumption)   15–30 ms
Micro-batch (10 events/req)    500               500                                5–10 ms
Batch (100 events/req)         50                50                                 2–5 ms

The latency improvement comes from amortizing connection setup and application processing overhead across more events. If your consumer endpoint does any work that's per-request rather than per-event (parsing headers, authenticating, acquiring locks), batching multiplies the efficiency of that work.

There's also a queueing effect. When you have 10,000 events to deliver to a slow consumer (one that takes 200 ms to respond), per-event delivery creates a deep queue and head-of-line blocking. Batching reduces the number of in-flight requests and lets the delivery layer drain the queue with far fewer concurrent connections.
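
Rough numbers make the effect visible: draining those 10,000 events over 20 concurrent connections at 200 ms per response takes about (10,000 / 20) × 200 ms, roughly 100 seconds, with per-event delivery. Batched at 100 events per request, the backlog is only 100 requests; even if each batched request takes twice as long to process, the queue drains in roughly (100 / 20) × 400 ms ≈ 2 seconds. (The concurrency and timing figures here are illustrative, not benchmarks.)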


The Batch Payload Format

A well-designed batch payload wraps an array of events in a consistent envelope. Each event retains its individual identity — its own ID, type, timestamp, and data — so the consumer can process them independently:

json
{
  "batch_id": "bat_01HZ7GQX3K9N3VRPJ8TZFM0Y4W",
  "delivered_at": "2026-04-18T09:14:32Z",
  "events": [
    {
      "id": "evt_01HZ7GQ8B4NXRQ5T8HKWCF9JV2",
      "type": "order.created",
      "created_at": "2026-04-18T09:14:29Z",
      "data": { "order_id": "ord_8821", "amount_cents": 4999 }
    },
    {
      "id": "evt_01HZ7GQA1PMKN3VRPJ8TZFM1K5X",
      "type": "order.payment_captured",
      "created_at": "2026-04-18T09:14:31Z",
      "data": { "order_id": "ord_8821", "payment_id": "pay_9031" }
    }
  ]
}

The batch_id is distinct from the individual event IDs. It identifies the delivery attempt — useful for debugging and for the consumer to log at the batch level. Individual event IDs are what you use for idempotency and deduplication.

The HMAC signature for a batched request should cover the entire serialized batch body — not individual events. Your consumer verifies the signature once per request, then processes events individually.
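
As a sketch of the consumer side, assuming the signature is an HMAC-SHA256 hex digest of the raw request body carried in the Webhook-Signature header (one common convention; check what your producer actually documents), verification looks like this:

go
import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
)

// verifySignature checks an HMAC-SHA256 hex digest computed over the raw
// batch body. The header name and hex encoding are assumptions in this
// sketch; match whatever your producer documents.
func verifySignature(secret, body []byte, header string) bool {
    mac := hmac.New(sha256.New, secret)
    mac.Write(body) // hash.Hash.Write never returns an error

    received, err := hex.DecodeString(header)
    if err != nil {
        return false
    }
    // Constant-time comparison to avoid leaking digest bytes via timing.
    return hmac.Equal(mac.Sum(nil), received)
}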


Partial Failure: The Core Problem

In a per-event model, HTTP status codes map cleanly to outcomes: 200 means success, 5xx means retry the event. In a batch model, this breaks down. What status code do you return if 7 out of 10 events in a batch processed successfully and 3 failed?

There are three common approaches, each with different trade-offs:

Option 1: All-or-nothing semantics. Return 5xx if any event fails; the producer retries the entire batch. Simple to implement on the producer side, but forces the consumer to process already-succeeded events again. Only viable if every event handler is idempotent (which it should be, but "should be" and "is" are different in production).

Option 2: Per-event response body. Return 200 always, but include a structured body describing which events succeeded and which failed. The producer uses this to re-queue only the failed events.

json
{
  "batch_id": "bat_01HZ7GQX3K9N3VRPJ8TZFM0Y4W",
  "results": [
    { "event_id": "evt_01HZ7GQ8B4NXRQ5T8HKWCF9JV2", "status": "ok" },
    { "event_id": "evt_01HZ7GQA1PMKN3VRPJ8TZFM1K5X", "status": "error", "message": "downstream timeout" }
  ]
}

This is the most flexible approach and the one we recommend for any high-reliability system. It requires the producer to parse the response body and act on per-event outcomes — more implementation work, but the only way to avoid over-retrying.
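On the producer side, acting on that response means parsing the body and re-queuing only the failures. A minimal sketch, assuming the response shape above (the requeue hook is hypothetical):

go
import (
    "encoding/json"
    "fmt"
)

type EventResult struct {
    EventID string `json:"event_id"`
    Status  string `json:"status"`
    Message string `json:"message,omitempty"`
}

type BatchResponse struct {
    BatchID string        `json:"batch_id"`
    Results []EventResult `json:"results"`
}

// handleBatchResponse re-queues only the events the consumer reported as
// failed. requeue is a hypothetical hook into your delivery queue.
func handleBatchResponse(respBody []byte, requeue func(eventID string)) error {
    var resp BatchResponse
    if err := json.Unmarshal(respBody, &resp); err != nil {
        return fmt.Errorf("unparseable batch response: %w", err)
    }
    for _, r := range resp.Results {
        if r.Status != "ok" {
            requeue(r.EventID)
        }
    }
    return nil
}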

Option 3: Acknowledge-and-filter. Return 200 always. Failed events are handled by a separate consumer-side dead-letter queue, and the consumer never signals individual failures back to the producer. This works when the consumer owns the retry infrastructure and doesn't want to rely on the producer for redelivery. It trades producer-side retry logic for consumer-side DLQ complexity.
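
If you go the acknowledge-and-filter route, the consumer's processing loop changes shape: failures land in a dead-letter store instead of being reported back. A sketch under those assumptions (deadLetter is a hypothetical persistence hook):

go
import "context"

// processBatchWithDLQ acknowledges the whole batch and parks failures in a
// consumer-owned dead-letter store for later replay. deadLetter is a
// hypothetical hook, typically a database table or internal queue.
func processBatchWithDLQ(ctx context.Context, events []Event,
    process func(context.Context, Event) error,
    deadLetter func(Event, error) error) error {
    for _, ev := range events {
        if err := process(ctx, ev); err != nil {
            // Persist the event and its failure reason. If even that
            // write fails, return the error so the producer retries.
            if dlqErr := deadLetter(ev, err); dlqErr != nil {
                return dlqErr
            }
        }
    }
    return nil // batch is fully acknowledged once failures are parked
}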


Producing Batches: Micro-Batching with a Flush Window

On the producer side, batching is almost always implemented as micro-batching: events are accumulated in a buffer and flushed either when the buffer reaches a size threshold or when a time window expires — whichever comes first.

Here's a Go implementation of a basic micro-batch flusher:

go
import (
    "log"
    "sync"
    "time"
)

type BatchFlusher struct {
    maxSize     int
    flushWindow time.Duration
    deliver     func(events []Event) error

    mu     sync.Mutex
    buffer []Event
    timer  *time.Timer
}

func NewBatchFlusher(maxSize int, window time.Duration, deliver func([]Event) error) *BatchFlusher {
    f := &BatchFlusher{
        maxSize:     maxSize,
        flushWindow: window,
        deliver:     deliver,
    }
    f.mu.Lock()
    f.resetTimerLocked()
    f.mu.Unlock()
    return f
}

func (f *BatchFlusher) Add(event Event) {
    f.mu.Lock()
    defer f.mu.Unlock()

    f.buffer = append(f.buffer, event)
    if len(f.buffer) >= f.maxSize {
        f.flushLocked()
    }
}

func (f *BatchFlusher) flushLocked() {
    // Always re-arm the flush window, even when the buffer is empty.
    // Otherwise a timer that fires on an empty buffer is never reset,
    // and later events would wait until the size threshold is hit.
    defer f.resetTimerLocked()

    if len(f.buffer) == 0 {
        return
    }

    batch := make([]Event, len(f.buffer))
    copy(batch, f.buffer)
    f.buffer = f.buffer[:0]

    // Deliver in a goroutine to avoid holding the lock during I/O.
    go func() {
        if err := f.deliver(batch); err != nil {
            log.Printf("batch delivery failed: %v", err)
            // Re-queue failed events here
        }
    }()
}

// resetTimerLocked must be called with f.mu held.
func (f *BatchFlusher) resetTimerLocked() {
    if f.timer != nil {
        f.timer.Stop()
    }
    f.timer = time.AfterFunc(f.flushWindow, func() {
        f.mu.Lock()
        defer f.mu.Unlock()
        f.flushLocked()
    })
}

Two parameters control the throughput vs. latency trade-off:

  • maxSize: the maximum number of events per batch. Larger batches reduce connection overhead but increase worst-case latency for the first event in a batch.
  • flushWindow: the maximum time an event waits in the buffer before it's flushed. A 100 ms window means no event waits more than 100 ms even if the batch never fills.

For most webhook delivery workloads, a maxSize of 25–100 and a flushWindow of 50–500 ms is a reasonable starting point. Tune based on your observed event rate and consumer response time SLA.
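
Wiring it together is a few lines per destination. In this sketch, postBatch and the Event fields are illustrative stand-ins for your actual delivery code and event type:

go
flusher := NewBatchFlusher(50, 100*time.Millisecond, func(events []Event) error {
    // POST the batch envelope to the destination; a returned error
    // signals that the whole batch should be re-queued.
    return postBatch("https://consumer.example.com/webhooks", events)
})

// Callers just Add as events occur; flushes happen on size or time.
flusher.Add(Event{ID: "evt_123", Type: "order.created"})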


Consumer-Side: Processing Events Individually

Your consumer endpoint receives a batch but should process each event as if it arrived individually. The wrapper logic is straightforward:

go
import (
    "encoding/json"
    "io"
    "net/http"
)

func (h *WebhookHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    body, err := io.ReadAll(io.LimitReader(r.Body, 5<<20))
    if err != nil {
        http.Error(w, "read error", http.StatusBadRequest)
        return
    }

    // Verify HMAC over the full batch body
    if !h.verifySignature(body, r.Header.Get("Webhook-Signature")) {
        http.Error(w, "invalid signature", http.StatusUnauthorized)
        return
    }

    var batch BatchPayload
    if err := json.Unmarshal(body, &batch); err != nil {
        http.Error(w, "invalid JSON", http.StatusBadRequest)
        return
    }

    results := make([]EventResult, 0, len(batch.Events))
    for _, event := range batch.Events {
        err := h.processEvent(r.Context(), event)
        if err != nil {
            results = append(results, EventResult{
                EventID: event.ID,
                Status:  "error",
                Message: err.Error(),
            })
        } else {
            results = append(results, EventResult{
                EventID: event.ID,
                Status:  "ok",
            })
        }
    }

    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(BatchResponse{
        BatchID: batch.BatchID,
        Results: results,
    })
}

A few things to get right here:

Process events sequentially by default. The temptation is to process batch events in parallel goroutines for throughput. Resist it unless you've verified your handlers are safe for concurrent execution against the same resource. If two events in the same batch touch the same database row, parallel processing creates a race.

Idempotency is mandatory. The producer will retry events that you reported as failed, and if your response is lost on the network after you've already committed results, the producer will retry the whole batch. Every event handler must be idempotent: record the event ID with INSERT ... ON CONFLICT DO NOTHING before doing any work, and skip events you've already seen. (A sketch of this pattern follows the next point.)

Set a realistic timeout. A batch of 100 events takes longer to process than a single event, so the producer's request timeout (and any gateway timeout in front of your endpoint) must accommodate full-batch processing time. If the connection times out mid-batch, the producer can't know which events succeeded and will retry all of them, which is one more reason idempotency is non-negotiable.
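
Here's a sketch of that idempotency pattern, assuming Postgres, a processed_events table keyed by event ID, an h.db field holding a *sql.DB, and a hypothetical applyEvent method containing the business logic:

go
import "context"

// processEvent claims the event ID inside a transaction before doing any
// work. If the INSERT affects zero rows, this is a duplicate delivery.
func (h *WebhookHandler) processEvent(ctx context.Context, event Event) error {
    tx, err := h.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op after a successful Commit

    res, err := tx.ExecContext(ctx,
        `INSERT INTO processed_events (event_id) VALUES ($1)
         ON CONFLICT (event_id) DO NOTHING`, event.ID)
    if err != nil {
        return err
    }
    if n, err := res.RowsAffected(); err == nil && n == 0 {
        return nil // already processed; safely skip
    }
    if err := h.applyEvent(ctx, tx, event); err != nil {
        return err // rollback releases the claim so a retry can succeed
    }
    return tx.Commit()
}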


Ordering Guarantees Within a Batch

Event ordering within a batch is only meaningful if the producer guarantees it. A well-behaved producer sends events in a batch sorted by created_at ascending — the natural causal order. But not all producers make this guarantee, and delivery retries can interleave events from different original batches.

The safe assumption: events within a batch are not guaranteed to be causally ordered relative to events in other batches, even when batches are delivered in sequence. Design your handlers to tolerate out-of-order events, or partition events by entity ID and process each partition sequentially.
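
Here's a sketch of the partitioning approach; the entityID extractor is a hypothetical hook into your event schema:

go
import (
    "context"
    "log"
    "sync"
)

// processByEntity groups a batch by entity key and processes each entity's
// events sequentially (in batch order), while distinct entities run
// concurrently without racing on shared rows.
func processByEntity(ctx context.Context, events []Event,
    entityID func(Event) string,
    process func(context.Context, Event) error) {
    partitions := make(map[string][]Event)
    for _, ev := range events {
        key := entityID(ev)
        partitions[key] = append(partitions[key], ev)
    }

    var wg sync.WaitGroup
    for _, part := range partitions {
        wg.Add(1)
        go func(part []Event) {
            defer wg.Done()
            for _, ev := range part {
                if err := process(ctx, ev); err != nil {
                    log.Printf("event %s failed: %v", ev.ID, err)
                }
            }
        }(part)
    }
    wg.Wait()
}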


When Not to Batch

Batching is not universally beneficial. Avoid it in these cases:

Scenario                                             Reason to avoid batching
Low event volume (< 10 events/sec per destination)   Overhead of batching logic exceeds the savings
Consumers with strict per-event SLAs                 Batching adds latency for events that arrive early in the window
Non-idempotent handlers                              Partial failure + retry creates duplicate processing risk
Events with strict causal ordering requirements      Batch boundaries make ordering harder to reason about
Small payloads to fast consumers                     Per-event delivery is already near-optimal

GetHook delivers events per-destination and will add configurable batch delivery as a route-level option — useful when your destination is a high-throughput internal service that prefers bulk ingestion over individual POSTs.


Batching trades simplicity for throughput. The throughput gains are real — an order of magnitude fewer connections, lower latency under load, and more efficient consumer processing. But the reliability model is harder. You must handle partial failures correctly, design idempotent handlers, and understand the ordering guarantees your producer makes. Get those right and batching is a straightforward performance win. Get them wrong and you have a system that silently drops events when a batch partially fails and the consumer returns 200 anyway.

If you're building high-throughput webhook delivery and want a foundation that handles batching, retries, and partial-failure tracking, start with GetHook or read the delivery configuration docs.

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.