Tags: reliability, architecture, postgres, webhooks, outbox pattern

The Webhook Outbox Pattern: Surviving Application Crashes Between Processing and Delivery

Your application commits a database transaction, then crashes before enqueuing the webhook. The event is lost and your customer never finds out. The transactional outbox pattern closes this gap permanently.

Yuki Tanaka
Founding Engineer
April 20, 2026
11 min read

In most webhook-sending systems there is a window between two operations that almost nobody talks about until something goes wrong:

  1. Your application commits a database transaction (order created, payment captured, user updated).
  2. Your application enqueues a webhook delivery job.

If your process crashes, is OOM-killed, or loses network connectivity between steps 1 and 2, the transaction is durable — the order exists — but the webhook job was never written. The event is silently lost. Your customer's integration never fires. Nobody knows until the customer notices their data is wrong.

This is not a theoretical edge case. Rolling deployments, container restarts, Kubernetes pod evictions, and database failovers all trigger this window regularly in production systems. If you're sending webhooks at any meaningful volume, you will hit it.

The transactional outbox pattern closes this gap. It is the correct foundation for any webhook delivery system that claims to offer at-least-once delivery.


Why Dual-Write Fails

The instinct when you first encounter this problem is to wrap both operations in a retry loop or make the enqueue "more reliable" somehow. Both miss the root cause.

The problem is that your database and your job queue are two separate systems with independent durability guarantees. Writing to Postgres and writing to Redis (or RabbitMQ, or SQS) are two distinct I/O operations. There is no distributed transaction that spans both. Either one can fail independently.

Even if you retry the enqueue, you can only retry if your process is still running. A crash between the two writes means the retry never happens.

The fix is to eliminate the second system entirely during the critical path. Write your webhook delivery intent into the same database, inside the same transaction as your business data. Make the delivery job a row in Postgres, not a message in a separate queue.
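To make the failure window concrete, here is a toy simulation: two in-memory maps stand in for Postgres and the job queue (the names and "crash" flags are illustrative, not real APIs). The dual-write version loses the job when the process dies between writes; the outbox version never ends up half-done.

```go
package main

import "fmt"

// store is a toy stand-in for a durable system (Postgres, Redis, SQS).
type store map[string]string

// dualWrite commits business data, then "crashes" before the enqueue
// when crashAfterCommit is true — the delivery job is silently lost.
func dualWrite(db, queue store, crashAfterCommit bool) {
	db["order-1"] = "created" // write 1: durable in the database
	if crashAfterCommit {
		return // process dies here; write 2 never happens
	}
	queue["job-1"] = "deliver order.created" // write 2: a separate system
}

// outboxWrite puts both rows into the same store in one "commit": they
// are either both present or both absent, no matter when the process dies.
func outboxWrite(db store, crashBeforeCommit bool) {
	if crashBeforeCommit {
		return // nothing committed; nothing lost, nothing half-done
	}
	db["order-1"] = "created"
	db["outbox-1"] = "order.created" // same commit as the business row
}

func main() {
	db, queue := store{}, store{}
	dualWrite(db, queue, true)
	fmt.Println(len(db), len(queue)) // 1 0 — the order exists, the job is gone

	db2 := store{}
	outboxWrite(db2, false)
	fmt.Println(len(db2)) // 2 — order and delivery intent together
}
```

Real transactions are not map writes, of course; the point is only that the dual-write version has a reachable state (order present, job absent) that the outbox version structurally cannot produce.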


The Outbox Pattern

The outbox pattern works as follows:

  1. Inside your existing database transaction, write a row to an outbox table describing the webhook event to deliver.
  2. Commit the transaction. Both your business data and the outbox row are now durable together — atomically.
  3. A background relay process polls the outbox table, picks up pending rows, and delivers the webhook (or hands it off to a delivery worker).
  4. Once delivery is confirmed, the relay marks the outbox row as processed.

The business logic writes to two tables inside one transaction. The relay reads from the outbox table asynchronously. If the application crashes after step 2, the outbox row survives. The relay picks it up on the next poll.

Here is what the outbox table looks like in Postgres:

```sql
CREATE TABLE webhook_outbox (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    account_id      UUID NOT NULL,
    event_type      TEXT NOT NULL,
    source_id       UUID,
    payload         JSONB NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed_at    TIMESTAMPTZ,
    attempts        INT NOT NULL DEFAULT 0,
    last_error      TEXT
);

-- The relay only cares about unprocessed rows
CREATE INDEX webhook_outbox_pending
    ON webhook_outbox (created_at)
    WHERE processed_at IS NULL;
```
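The relay code later in this post scans pending rows into an OutboxEntry struct that is never shown. A minimal definition matching the schema above (the field names are an assumption chosen to match the column order the relay selects):

```go
package main

import (
	"fmt"
	"time"
)

// OutboxEntry mirrors the columns the relay reads from webhook_outbox.
// (Sketch: the exact struct is not shown in the original relay code,
// so this shape is assumed from the SELECT column list.)
type OutboxEntry struct {
	ID        string    // UUID of the outbox row
	AccountID string    // UUID of the owning account
	EventType string    // e.g. "order.created"
	SourceID  string    // UUID of the business row the event refers to
	Payload   []byte    // JSONB payload, delivered verbatim
	CreatedAt time.Time // commit time; used for FIFO ordering
}

func main() {
	e := OutboxEntry{
		EventType: "order.created",
		Payload:   []byte(`{"id":"o_1"}`),
	}
	fmt.Println(e.EventType, string(e.Payload))
}
```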

Writing to the Outbox Inside Your Transaction

Here is a Go example that creates an order and writes a webhook outbox row in one transaction:

```go
func (s *OrderService) CreateOrder(ctx context.Context, req CreateOrderRequest) (*Order, error) {
    tx, err := s.db.BeginTx(ctx, nil)
    if err != nil {
        return nil, fmt.Errorf("begin tx: %w", err)
    }
    defer tx.Rollback()

    // Step 1: Insert business data
    order, err := insertOrder(ctx, tx, req)
    if err != nil {
        return nil, fmt.Errorf("insert order: %w", err)
    }

    // Step 2: Write webhook intent to the outbox — same transaction
    payload, err := json.Marshal(map[string]any{
        "id":         order.ID,
        "status":     order.Status,
        "amount":     order.AmountCents,
        "created_at": order.CreatedAt,
    })
    if err != nil {
        return nil, fmt.Errorf("marshal payload: %w", err)
    }

    _, err = tx.ExecContext(ctx, `
        INSERT INTO webhook_outbox (account_id, event_type, source_id, payload)
        VALUES ($1, $2, $3, $4)
    `, req.AccountID, "order.created", req.SourceID, payload)
    if err != nil {
        return nil, fmt.Errorf("insert outbox: %w", err)
    }

    // Both rows commit together or neither does
    if err := tx.Commit(); err != nil {
        return nil, fmt.Errorf("commit: %w", err)
    }

    return order, nil
}
```

After this commit, the delivery guarantee is transferred from your application code to the database. The order and the delivery intent are either both present or both absent. Your application process can crash immediately after the commit; the outbox row will be picked up by the relay.


The Relay Process

The relay is a polling loop that reads unprocessed outbox rows and hands them to the delivery layer. It uses FOR UPDATE SKIP LOCKED to allow multiple relay instances to run without conflicts:

```go
func (r *OutboxRelay) Poll(ctx context.Context) error {
    tx, err := r.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    rows, err := tx.QueryContext(ctx, `
        SELECT id, account_id, event_type, source_id, payload
        FROM webhook_outbox
        WHERE processed_at IS NULL
          AND attempts < 5
        ORDER BY created_at
        LIMIT 50
        FOR UPDATE SKIP LOCKED
    `)
    if err != nil {
        return err
    }

    var entries []OutboxEntry
    for rows.Next() {
        var e OutboxEntry
        if err := rows.Scan(&e.ID, &e.AccountID, &e.EventType, &e.SourceID, &e.Payload); err != nil {
            rows.Close()
            return err
        }
        entries = append(entries, e)
    }
    // Close the result set before issuing further statements on the same
    // connection, and surface any error the iteration hit.
    if err := rows.Close(); err != nil {
        return err
    }
    if err := rows.Err(); err != nil {
        return err
    }

    for _, entry := range entries {
        if err := r.enqueueDelivery(ctx, tx, entry); err != nil {
            // Increment attempts, record error — do not mark processed
            _, _ = tx.ExecContext(ctx, `
                UPDATE webhook_outbox
                SET attempts = attempts + 1, last_error = $2
                WHERE id = $1
            `, entry.ID, err.Error())
            continue
        }

        _, err = tx.ExecContext(ctx, `
            UPDATE webhook_outbox
            SET processed_at = now()
            WHERE id = $1
        `, entry.ID)
        if err != nil {
            return err
        }
    }

    return tx.Commit()
}
```

The key detail: enqueueDelivery and the processed_at update happen inside the same transaction. If the relay crashes after enqueuing but before committing, the outbox row remains unprocessed. The next poll will enqueue the delivery again — producing a duplicate. This is expected and correct behavior for at-least-once delivery. Your delivery layer must handle idempotency separately.
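Handling those duplicates on the consumer side reduces to deduplicating on the outbox row's UUID. A minimal sketch, using an in-memory set as the dedup store (a real consumer would persist seen IDs, e.g. in a table with a unique constraint):

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotentHandler suppresses duplicate deliveries of the same event ID.
// At-least-once delivery upstream plus dedup here gives effectively-once
// processing.
type IdempotentHandler struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewIdempotentHandler() *IdempotentHandler {
	return &IdempotentHandler{seen: make(map[string]bool)}
}

// Handle runs fn at most once per event ID; repeats are ignored.
// Returns true if fn ran, false if this delivery was a duplicate.
func (h *IdempotentHandler) Handle(eventID string, fn func()) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.seen[eventID] {
		return false // duplicate from a relay re-enqueue; skip
	}
	h.seen[eventID] = true
	fn()
	return true
}

func main() {
	h := NewIdempotentHandler()
	count := 0
	h.Handle("evt_1", func() { count++ })
	h.Handle("evt_1", func() { count++ }) // duplicate delivery
	fmt.Println(count) // 1
}
```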


Relay Latency vs. Poll Interval

The outbox pattern introduces relay latency — the time between the transaction committing and the relay picking up the row. This is a function of your poll interval.

| Poll Interval | Median Relay Latency | Postgres Load |
| --- | --- | --- |
| 100ms | ~50ms | High (10 queries/sec per relay instance) |
| 500ms | ~250ms | Moderate (2 queries/sec per relay instance) |
| 1s | ~500ms | Low (1 query/sec per relay instance) |
| 5s | ~2.5s | Minimal |

For most webhook use cases, 500ms–1s poll interval is the right default. Sub-second relay latency is imperceptible to end users and keeps database load manageable.

If you need lower latency without aggressive polling, you can use Postgres LISTEN/NOTIFY to wake the relay immediately when a row is inserted:

```sql
-- Trigger that notifies the relay channel when a row is inserted
CREATE OR REPLACE FUNCTION notify_outbox_insert()
RETURNS TRIGGER LANGUAGE plpgsql AS $$
BEGIN
    PERFORM pg_notify('webhook_outbox', NEW.id::text);
    RETURN NEW;
END;
$$;

CREATE TRIGGER webhook_outbox_notify
    AFTER INSERT ON webhook_outbox
    FOR EACH ROW EXECUTE FUNCTION notify_outbox_insert();
```

Your relay listens on the webhook_outbox channel and wakes immediately on inserts, falling back to a 5-second poll for robustness. This gives you near-zero relay latency without the database load of aggressive polling.


Outbox vs. Change Data Capture

A common alternative to polling is Change Data Capture (CDC) via Postgres logical replication — tools like Debezium read the WAL and emit outbox rows as messages to Kafka. This eliminates polling entirely and can achieve sub-100ms latency.

CDC is worth evaluating if you're already running Kafka or need extremely low relay latency. For most teams, it adds significant operational complexity — you need logical replication enabled, a Debezium connector running, and a Kafka cluster, all for what amounts to a polling loop replacement.

The polling outbox pattern described here runs entirely on Postgres with no additional infrastructure. If you're already running Postgres as your job queue with FOR UPDATE SKIP LOCKED, the outbox relay is a natural extension of the same machinery. Reach for CDC when you've exhausted the polling approach or have specific latency requirements it can't meet.


Cleanup and Retention

Processed outbox rows should be deleted or archived on a schedule. Keeping them around indefinitely inflates the table and slows the partial index scan.

```sql
-- Delete processed rows older than 7 days (adjust retention as needed)
DELETE FROM webhook_outbox
WHERE processed_at IS NOT NULL
  AND processed_at < now() - interval '7 days';
```

Run this as a scheduled job — daily is sufficient. If you need the history for auditing, archive to a separate table or object storage before deleting.

Rows that never reach processed_at (because attempts >= 5) are your dead letters. Route these to an alert channel and review them manually. They represent events where your delivery infrastructure had a persistent failure, not just a transient one.


Where GetHook Fits

If you're building outbound webhooks on top of GetHook, the outbox pattern applies to the hop between your application database and the GetHook API. Write the outbox row inside your business transaction, and have the relay call POST /v1/outbound-events to hand off the event. From that point, GetHook owns delivery, retry, signing, and observability — your relay's only job is a reliable handoff.

This is a cleaner boundary than trying to make the GetHook API call transactional. The relay absorbs transient API failures independently of your business logic.


Summary

The transactional outbox pattern gives you:

  • Atomic write: business data and delivery intent commit together or not at all
  • Crash safety: no silent event loss from application crashes or restarts
  • No additional infrastructure: runs entirely on Postgres if you're already using it as a job queue
  • Independent delivery: the relay's failure mode doesn't affect your application's write path

The trade-off is relay latency (typically sub-second with a 500ms poll) and the requirement that your delivery layer handles duplicates. Both are manageable in practice and far preferable to silent event loss.

If you want reliable outbound webhook delivery without building the delivery infrastructure yourself, start with GetHook — your application writes events, GetHook handles everything downstream.

Get started with GetHook →
