webhooks · reliability · retry · architecture · event-design

Webhook Event TTLs: When Late Delivery Is Worse Than No Delivery

Most webhook infrastructure retries every event the same way, for the same duration. That's the wrong default. Time-sensitive events need TTLs, shorter retry windows, and consumers that validate before acting.

Camille Beaumont
Backend Architect
April 28, 2026
9 min read

A payment authorization expires in 7 minutes. An OTP is valid for 60 seconds. A price alert fires because a stock crossed a threshold that it reversed 30 seconds later.

Your webhook delivery infrastructure doesn't know any of this. It retries all three events on the same exponential backoff schedule — 30s, 2m, 10m, 1h — until they're either delivered or exhausted. When the OTP event finally reaches your consumer 3 minutes after the code expired, the delivery system marks it delivered. The event isn't delivered. It's delivered late, which is a different outcome, and in some cases a worse one than not delivering it at all.

This post covers how to model event TTLs, enforce them in the delivery layer, and configure retry policies that match the actual semantics of each event type.


The Two Event Categories

The core issue is that most webhook systems treat every event as equally durable. They're not.

| Category | Example event types | Late delivery consequence |
| --- | --- | --- |
| Durable | user.created, subscription.updated, invoice.paid | Delayed but harmless — state is still valid |
| Time-sensitive | otp.requested, payment.auth.captured, price.alert.triggered, inventory.depleted | Incorrect state — the relevant window has closed |

A user.created event delivered 20 minutes late is just a delay. Your consumer creates the user 20 minutes late. That's a problem, but not a correctness problem.

A payment.auth.captured event delivered 10 minutes late is different. The authorization window has closed. Any action your consumer takes — fulfilling an order, reserving inventory, sending a confirmation — is based on stale state. The authorization may have already expired and been refunded. Acting on it causes incorrect behavior, not just a delay.

The retry logic that helps durable events actively harms time-sensitive ones. A well-intentioned 5-attempt backoff schedule delivers an event that should have been discarded.


Modeling TTLs on the Event Envelope

The fix starts with making TTL semantics explicit in the event itself. Two approaches work in practice.

Absolute expiry timestamp — the event carries an expires_at field:

```json
{
  "id": "evt_01HX9P3KQY",
  "type": "otp.requested",
  "created_at": "2026-04-28T14:00:00Z",
  "expires_at": "2026-04-28T14:01:00Z",
  "data": {
    "user_id": "usr_abc123",
    "otp_code": "847291",
    "channel": "sms"
  }
}
```

This is the most precise approach. Each event carries its own expiry, which can vary by context — a payment authorization over SEPA Direct Debit might expire in 24 hours, while one over a card swipe might expire in 7 minutes. The delivery layer checks expires_at before each attempt and skips delivery if the window has passed.

Per-event-type TTL policy — rather than per-event, the TTL is a configuration property of the event type that your delivery infrastructure applies uniformly:

```json
{
  "event_type": "otp.requested",
  "delivery_ttl_seconds": 60,
  "max_attempts": 2
}
```

This is operationally simpler. You configure the rule once, and it applies to every event of that type. It works well when all events of a given type share the same time sensitivity — which is common for OTP and alerting events.

If your upstream provider includes expires_at in the payload (Stripe does this for payment intents), honor it. Otherwise, configure per-type TTLs based on your understanding of the business semantics.


Enforcing TTLs in the Delivery Worker

TTL metadata is useless without enforcement. Your delivery worker needs to check TTL state before attempting delivery:

```go
func (w *Worker) shouldDeliver(event Event) (bool, string) {
    // Check absolute expiry field if present
    if event.ExpiresAt != nil && time.Now().After(*event.ExpiresAt) {
        return false, "event expired before delivery"
    }

    // Check per-event-type TTL policy
    if policy, ok := w.typePolicies[event.Type]; ok && policy.DeliveryTTLSeconds > 0 {
        ttl := time.Duration(policy.DeliveryTTLSeconds) * time.Second
        if time.Since(event.CreatedAt) > ttl {
            return false, fmt.Sprintf("event exceeded type TTL of %v", ttl)
        }
    }

    return true, ""
}

func (w *Worker) processJob(ctx context.Context, job DeliveryJob) error {
    event, err := w.events.Get(ctx, job.EventID)
    if err != nil {
        return err
    }

    if ok, reason := w.shouldDeliver(event); !ok {
        return w.events.MarkExpired(ctx, event.ID, reason)
    }

    return w.deliver(ctx, job, event)
}
```

The MarkExpired transition sets status = 'expired' with a reason string. This is a distinct terminal state from dead_letter. A dead-lettered event exhausted delivery attempts. An expired event was intentionally skipped. The distinction matters for debugging and for operator dashboards — you want to know whether events aren't reaching consumers because delivery failed or because they were discarded as stale.


Retry Policy Interaction

The standard retry schedule — 30s → 2m → 10m → 1h — is designed for durable events where eventual delivery is the goal. For time-sensitive events, it's the wrong shape.

An event with a 60-second TTL gets two realistic attempts on this schedule: the initial delivery at T+0 and one retry at T+30s. The next retry, at T+2m30s, fires well after the window has closed. Your retry policy is effectively truncated by the TTL, whether you've designed it that way or not.

Better to be explicit:

| Event TTL | Recommended retry strategy |
| --- | --- |
| < 2 minutes | 1–2 attempts, 15s interval, no further retries |
| 2–15 minutes | 2–3 attempts, 30s–60s interval |
| > 30 minutes | Standard exponential backoff up to TTL boundary |
| No TTL | Standard exponential backoff with full jitter |

The principle: maximize delivery probability within the valid window, then stop. For short-TTL events, that means aggressive early retries and hard termination, not long backoff schedules that extend well past expiry.

```go
func retryPolicyForEvent(event Event, typePolicies map[string]EventTypePolicy) RetryPolicy {
    // If event has an absolute expiry, tune retry policy to the remaining window
    if event.ExpiresAt != nil {
        remaining := time.Until(*event.ExpiresAt)
        switch {
        case remaining < 2*time.Minute:
            return RetryPolicy{MaxAttempts: 2, BaseDelay: 15 * time.Second, Jitter: false}
        case remaining < 15*time.Minute:
            return RetryPolicy{MaxAttempts: 3, BaseDelay: 30 * time.Second, Jitter: true}
        }
    }

    // Fall back to per-type configuration or the global default
    if p, ok := typePolicies[event.Type]; ok {
        return p.RetryPolicy
    }
    return defaultRetryPolicy // 5 attempts, exponential backoff, full jitter
}
```

Consumer-Side Validation

TTL enforcement at the delivery layer stops most stale deliveries. But there's a gap: an event delivered just before expiry can sit in your consumer's internal processing queue past the TTL. Your consumer should validate whether the event is still actionable before taking irreversible action.

```go
func (h *OTPHandler) Handle(event WebhookEvent) error {
    var payload struct {
        UserID  string `json:"user_id"`
        OTPCode string `json:"otp_code"`
        Channel string `json:"channel"`
    }
    if err := json.Unmarshal(event.Data, &payload); err != nil {
        return err
    }

    // Validate the OTP is still valid in our own system
    if !h.otpStore.IsValid(payload.UserID, payload.OTPCode) {
        log.Printf("otp.requested delivered but OTP already expired: event_id=%s", event.ID)
        // Return nil (HTTP 200) — this is not a delivery failure
        return nil
    }

    return h.notifier.Send(payload.Channel, payload.UserID, payload.OTPCode)
}
```

The important detail: return 200 OK when you've decided not to act on a stale event. A 4xx would trigger retries of an event you already know is expired. A 200 tells the delivery layer that the event was received and handled — your consumer made the call about what to do with it. The delivery system's job is delivery, not business logic.


Document TTL Semantics for Your Consumers

If you're building a webhook-producing platform, your TTL semantics belong in your event catalog alongside payload schemas. Consumers who don't know an event is time-sensitive will treat it as durable and build handlers that act on stale state.

| Event type | TTL | Consumer guidance |
| --- | --- | --- |
| otp.requested | 60 seconds | Validate OTP validity in your own store before delivering |
| payment.auth.captured | 7 minutes | Check authorization status via API before fulfilling |
| price.alert.triggered | 5 minutes | Re-fetch current price before surfacing to end user |
| inventory.depleted | 2 minutes | Check current stock level before triggering resupply |
| user.created | None | Idempotent; safe to process at any point |
| invoice.paid | None | Idempotent; safe to process at any point |

This table should live in your webhook documentation, not just in a JIRA comment. Consumers who don't know price.alert.triggered is time-sensitive will write handlers that fire correctly in development (where there's no delivery delay) and incorrectly in production (where there is).

GetHook lets you configure per-event-type TTL policies and retry strategies per source, and surfaces expired events separately in the delivery log so you can audit how many time-sensitive events are being discarded versus delivered successfully.


Late delivery of time-sensitive events is a predictable failure mode of any webhook system that treats all events identically. Adding expires_at to your event envelope, enforcing TTLs before each delivery attempt, and tuning retry policies to match event semantics is a day's worth of work. The alternative is debugging payment double-charges, OTPs that arrive after the session expired, and price alerts that send customers on fruitless chases — all of which look like application bugs until you trace them to the delivery layer.

Configure event TTLs and per-event-type retry policies on GetHook →
