SLA · reliability · observability · infrastructure · engineering

Webhook Delivery SLAs: How to Define and Measure Them

Most teams operate webhook infrastructure without any formal delivery SLA — until a customer contract demands one. Here's how to define meaningful SLA targets, instrument your system to measure them, and know when you're violating them before your customers do.

Marcus Webb
Platform Engineer
March 24, 2026
10 min read

A customer's procurement team sends over a contract addendum: "Webhook events must be delivered within 30 seconds of occurrence, 99.5% of the time." Your engineering team looks at each other. Nobody knows what your current p99.5 delivery latency actually is.

This is the most common way webhook SLAs get introduced: reactively, during an enterprise sales process, after your infrastructure was built without any measurement in place.

This post covers how to define a webhook delivery SLA, how to instrument your system to measure compliance, and what query patterns to use to know if you're meeting it.


What Goes Into a Webhook Delivery SLA

A webhook SLA has three components:

  1. Latency target — how quickly must events be delivered after they occur?
  2. Success rate target — what percentage of events must be delivered at all (eventually)?
  3. Measurement window — over what period is compliance evaluated (per hour, per day, per month)?

These combine into a statement like: "99.5% of webhook events will be delivered within 60 seconds, measured over a rolling 30-day window."

Latency buckets to define

Don't think of latency as a single number. Define four tiers:

Tier           | Latency Target | Typical Use Case
---------------|----------------|-------------------------------------------
Real-time      | < 5 seconds    | Payment confirmations, fraud signals
Near-real-time | < 30 seconds   | Order state changes, user lifecycle events
Batch-tolerant | < 5 minutes    | Reporting events, analytics triggers
Best-effort    | < 1 hour       | Low-priority notifications

Most SaaS products fit in the near-real-time bucket. "Real-time" is expensive to guarantee because it requires your delivery path to absorb bursts with essentially no queueing delay.

What latency actually means

There are three clocks at play:

  • Event occurrence time — when the business event happened on the source system
  • Ingest time — when your webhook gateway received the raw HTTP POST
  • Delivery time — when the destination returned a 2xx response

The end-to-end SLA that customers care about is delivery_time - event_occurrence_time. In practice, you often only control delivery_time - ingest_time — the segment from ingest to delivery. Make this distinction explicit in your SLA language.
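To keep that distinction honest in your reporting, compute both segments side by side. A sketch, assuming the occurred_at, ingested_at, and first_delivered_at columns introduced in the next section:

```sql
-- Both latency segments for recently delivered events:
-- end-to-end is what customers feel; ingest-to-delivery is what you control.
SELECT
    id,
    first_delivered_at - occurred_at AS end_to_end_latency,
    first_delivered_at - ingested_at AS ingest_to_delivery_latency
FROM events
WHERE first_delivered_at IS NOT NULL
ORDER BY first_delivered_at DESC
LIMIT 100;
```

A persistent gap between the two columns points at upstream delay before your gateway — worth knowing before you sign an SLA written against event occurrence time.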


Instrumenting Your Delivery Pipeline

Before you can measure an SLA, you need timestamps at each stage. Minimum required fields on your events and delivery attempts:

sql
-- events table
ALTER TABLE events ADD COLUMN occurred_at     TIMESTAMPTZ; -- event source timestamp
ALTER TABLE events ADD COLUMN ingested_at     TIMESTAMPTZ DEFAULT NOW();
ALTER TABLE events ADD COLUMN first_delivered_at TIMESTAMPTZ;

-- delivery_attempts table
ALTER TABLE delivery_attempts ADD COLUMN attempted_at  TIMESTAMPTZ DEFAULT NOW();
ALTER TABLE delivery_attempts ADD COLUMN responded_at  TIMESTAMPTZ;
ALTER TABLE delivery_attempts ADD COLUMN outcome       TEXT; -- 'success', 'timeout', 'http_5xx', etc.
ALTER TABLE delivery_attempts ADD COLUMN response_ms   INTEGER; -- destination response time

When a delivery attempt succeeds, write back to events.first_delivered_at:

go
func (w *Worker) markDelivered(ctx context.Context, eventID uuid.UUID) error {
    _, err := w.db.ExecContext(ctx, `
        UPDATE events
        SET    status             = 'delivered',
               first_delivered_at = NOW()
        WHERE  id = $1
          AND  first_delivered_at IS NULL
    `, eventID)
    return err
}

The IS NULL guard ensures you record the timestamp of the first successful delivery, not a later replay.


The Core SLA Query

With timestamps in place, the compliance query is straightforward:

sql
-- Delivery SLA compliance: % of events delivered within 30 seconds
-- Rolling 7-day window
SELECT
    COUNT(*)                                                    AS total_events,
    COUNT(*) FILTER (
        WHERE first_delivered_at - ingested_at <= INTERVAL '30 seconds'
    )                                                           AS delivered_on_time,
    ROUND(
        100.0 * COUNT(*) FILTER (
            WHERE first_delivered_at - ingested_at <= INTERVAL '30 seconds'
        ) / NULLIF(COUNT(*), 0),
        2
    )                                                           AS compliance_pct
FROM events
WHERE
    ingested_at  >= NOW() - INTERVAL '7 days'
    AND status   IN ('delivered', 'dead_letter')
    AND direction = 'outbound';

Run this query in your monitoring system on a schedule (every 5 minutes for a live dashboard, every hour for alerting). If compliance_pct drops below your target, you have a violation in progress.

Breaking it down by account

For multi-tenant systems, per-account compliance matters:

sql
SELECT
    account_id,
    COUNT(*)                                                    AS total_events,
    ROUND(
        100.0 * COUNT(*) FILTER (
            WHERE first_delivered_at - ingested_at <= INTERVAL '30 seconds'
        ) / NULLIF(COUNT(*), 0),
        2
    )                                                           AS compliance_pct
FROM events
WHERE
    ingested_at >= NOW() - INTERVAL '24 hours'
    AND status  IN ('delivered', 'dead_letter')
GROUP BY account_id
ORDER BY compliance_pct ASC;

Accounts at the bottom of this list are your most at-risk customers. An enterprise customer whose SLA is being violated at 11pm on a Friday is a churn risk.


Percentile Latency Metrics

Compliance percentage answers "are we meeting the SLA?" but doesn't tell you how close to the edge you are. Complement it with latency percentiles:

sql
SELECT
    PERCENTILE_CONT(0.50) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (first_delivered_at - ingested_at))
    ) AS p50_seconds,
    PERCENTILE_CONT(0.95) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (first_delivered_at - ingested_at))
    ) AS p95_seconds,
    PERCENTILE_CONT(0.99) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (first_delivered_at - ingested_at))
    ) AS p99_seconds,
    PERCENTILE_CONT(0.999) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (first_delivered_at - ingested_at))
    ) AS p999_seconds
FROM events
WHERE
    ingested_at          >= NOW() - INTERVAL '24 hours'
    AND first_delivered_at IS NOT NULL;

A healthy outbound webhook system at moderate volume should look like:

Percentile | Expected Latency
-----------|--------------------------------
p50        | < 1 second
p95        | < 5 seconds
p99        | < 15 seconds
p99.9      | < 60 seconds (retry territory)

If your p99 is 45 seconds, you're one traffic spike away from an SLA breach. Track trends over time — a rising p99 that hasn't yet crossed your threshold is a leading indicator of an upcoming violation.


Alerting on SLA Risk

Don't wait for a breach to happen. Alert on leading indicators:

Alert 1: Queue depth rising

sql
SELECT COUNT(*) AS queued_events
FROM events
WHERE status = 'queued'
  AND ingested_at < NOW() - INTERVAL '60 seconds';

If events have been sitting in queued state for more than 60 seconds, your delivery workers are behind. Fire a warning before they miss the SLA window.

Alert 2: Delivery worker throughput drop

Track delivery_attempts per minute. A sudden drop in attempt rate (not a drop in event volume) means your worker process may be stuck or restarting.
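A sketch of that throughput query, using the attempted_at column from the instrumentation section:

```sql
-- Delivery attempts per minute over the last hour. A sustained drop here
-- while event volume holds steady suggests stuck or restarting workers.
SELECT
    date_trunc('minute', attempted_at) AS minute,
    COUNT(*)                           AS attempts
FROM delivery_attempts
WHERE attempted_at >= NOW() - INTERVAL '1 hour'
GROUP BY 1
ORDER BY 1;
```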

Alert 3: p99 latency crossing threshold

Alert when your 15-minute rolling p99 latency crosses 20 seconds, giving you 10 seconds of runway before a 30-second SLA is at risk.
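The rolling p99 itself is a small variation on the percentile query above; a sketch:

```sql
-- 15-minute rolling p99 delivery latency, in seconds.
-- Alert when p99_seconds exceeds 20 against a 30-second SLA target.
SELECT
    PERCENTILE_CONT(0.99) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (first_delivered_at - ingested_at))
    ) AS p99_seconds
FROM events
WHERE ingested_at >= NOW() - INTERVAL '15 minutes'
  AND first_delivered_at IS NOT NULL;
```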

The operational discipline is to treat these leading indicators as seriously as the SLA metric itself. A breach that you caught 10 minutes early can often be avoided by scaling up workers or shedding load.


What to Put in the SLA Document

When an enterprise customer asks for a written SLA, be precise about what you're committing to and what you're excluding:

Include:

  • Latency definition: delivery_time - ingested_at (not delivery_time - occurred_at)
  • Success rate definition: events that eventually deliver within N retry attempts
  • Measurement window: rolling 30-day
  • Exclusions: events that fail due to the destination returning persistent 4xx errors (those are customer-side issues, not infrastructure failures)
  • Remediation: service credits if the SLA is breached, typically 10% of monthly fee per 1% below target

Exclude explicitly:

  • Third-party provider uptime (the system that sends events to you)
  • Destination endpoint uptime (your customer's server)
  • Events that are in dead-letter queue due to customer misconfiguration

SLA documents that don't clearly define these exclusions will generate disputes. A destination that returns 503 for six hours is not a failure of your delivery infrastructure — but without explicit language, it looks like one in the metrics.


Using GetHook's Delivery Data for SLA Reporting

GetHook records ingested_at, first_delivered_at, and per-attempt timestamps for every event. You can query the events API to pull delivery latency data for a given account and time window, making it straightforward to generate a monthly SLA compliance report for enterprise customers.

The key fields available in the events API response:

json
{
  "id": "evt_01HWXYZ...",
  "status": "delivered",
  "ingested_at": "2026-03-24T09:00:00Z",
  "first_delivered_at": "2026-03-24T09:00:04Z",
  "attempts_count": 1,
  "direction": "outbound"
}

From ingested_at to first_delivered_at you get a 4-second delivery latency for this event — well within any reasonable SLA target.


Summary

Defining a webhook delivery SLA forces you to be precise about what you're actually measuring. The work has two parts: adding the right timestamps to your data model (which you should do even if you have no SLA today), and writing the queries that turn timestamps into compliance percentages and latency percentiles.

The teams that get into trouble are those who agree to an SLA in a sales conversation without any prior measurement in place. Add ingested_at and first_delivered_at to your events table now, run the compliance query weekly, and you'll always know where you stand.

Start measuring your webhook delivery performance with GetHook →

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.