A customer's procurement team sends over a contract addendum: "Webhook events must be delivered within 30 seconds of occurrence, 99.5% of the time." Your engineers look at each other. Nobody knows what your current p99.5 delivery latency actually is.
This is the most common way webhook SLAs get introduced: reactively, during an enterprise sales process, after your infrastructure was built without any measurement in place.
This post covers how to define a webhook delivery SLA, how to instrument your system to measure compliance, and what query patterns to use to know if you're meeting it.
What Goes Into a Webhook Delivery SLA
A webhook SLA has three components:
- ›Latency target — how quickly must events be delivered after they occur?
- ›Success rate target — what percentage of events must be delivered at all (eventually)?
- ›Measurement window — over what period is compliance evaluated (per hour, per day, per month)?
These combine into a statement like: "99.5% of webhook events will be delivered within 60 seconds, measured as a rolling 30-day window."
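As a rough sketch, the three components can be captured in one small Go type so that the contract wording and the compliance check come from the same values. The `SLA` type, its field names, and the `Statement` and `Met` methods are illustrative assumptions, not part of any existing API.

```go
package main

import (
	"fmt"
	"time"
)

// SLA captures the three components of a webhook delivery SLA.
// The type and its field names are hypothetical, chosen for illustration.
type SLA struct {
	LatencyTarget time.Duration // how quickly events must be delivered
	SuccessTarget float64       // required percentage of on-time events, e.g. 99.5
	Window        time.Duration // period over which compliance is evaluated
}

// Statement renders the SLA as a sentence suitable for a contract draft.
func (s SLA) Statement() string {
	days := int(s.Window.Hours() / 24)
	return fmt.Sprintf(
		"%.1f%% of webhook events will be delivered within %d seconds, measured as a rolling %d-day window.",
		s.SuccessTarget, int(s.LatencyTarget.Seconds()), days,
	)
}

// Met reports whether onTime out of total events satisfies the success target.
// An empty window is treated as compliant because there is nothing to violate.
func (s SLA) Met(onTime, total int) bool {
	if total == 0 {
		return true
	}
	return 100*float64(onTime)/float64(total) >= s.SuccessTarget
}

func main() {
	sla := SLA{LatencyTarget: 60 * time.Second, SuccessTarget: 99.5, Window: 30 * 24 * time.Hour}
	fmt.Println(sla.Statement())
	fmt.Println(sla.Met(9950, 10000)) // exactly on target
	fmt.Println(sla.Met(9949, 10000)) // one event short of target
}
```

Keeping the numbers in one place like this makes it harder for the dashboard and the contract to drift apart.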
Latency buckets to define
Don't think of latency as a single number. Define four tiers:
| Tier | Latency Target | Typical Use Case |
|---|---|---|
| Real-time | < 5 seconds | Payment confirmations, fraud signals |
| Near-real-time | < 30 seconds | Order state changes, user lifecycle events |
| Batch-tolerant | < 5 minutes | Reporting events, analytics triggers |
| Best-effort | < 1 hour | Low-priority notifications |
Most SaaS products fit in the near-real-time bucket. "Real-time" is expensive to guarantee because it requires your delivery path to keep queue depth close to zero even during bursts.
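To see which tier a measured latency would fall into, the table can be encoded as an ordered lookup. This is a minimal sketch: the `tier` type and `TierFor` function are hypothetical names, and the bounds are copied from the table above.

```go
package main

import (
	"fmt"
	"time"
)

// tier pairs a name from the latency table with its upper bound.
type tier struct {
	name  string
	limit time.Duration
}

// tiers is ordered from strictest to loosest so the first match wins.
var tiers = []tier{
	{"Real-time", 5 * time.Second},
	{"Near-real-time", 30 * time.Second},
	{"Batch-tolerant", 5 * time.Minute},
	{"Best-effort", time.Hour},
}

// TierFor returns the strictest tier whose bound the latency satisfies,
// or "Out of range" when it exceeds even the best-effort bound.
func TierFor(latency time.Duration) string {
	for _, t := range tiers {
		if latency < t.limit {
			return t.name
		}
	}
	return "Out of range"
}

func main() {
	for _, l := range []time.Duration{2 * time.Second, 12 * time.Second, 90 * time.Second, 2 * time.Hour} {
		fmt.Printf("%v -> %s\n", l, TierFor(l))
	}
}
```

Feeding your measured p99.5 into a lookup like this gives a quick read on the strictest tier you could realistically commit to.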
What latency actually means
There are three clocks at play:
- ›Event occurrence time — when the business event happened on the source system
- ›Ingest time — when your webhook gateway received the raw HTTP POST
- ›Delivery time — when the destination returned a 2xx response
The end-to-end SLA that customers care about is delivery_time - event_occurrence_time. In practice, you often only control delivery_time - ingest_time — the segment from ingest to delivery. Make this distinction explicit in your SLA language.
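The difference between the two measurements is easy to see with concrete timestamps. The sketch below assumes a hypothetical `Timestamps` struct holding the three clocks; the example values are invented for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Timestamps holds the three clocks for one event.
// The struct is hypothetical; its fields mirror the clocks listed above.
type Timestamps struct {
	OccurredAt  time.Time // when the business event happened at the source
	IngestedAt  time.Time // when the gateway received the HTTP POST
	DeliveredAt time.Time // when the destination returned a 2xx
}

// EndToEnd is the latency the customer experiences.
func (t Timestamps) EndToEnd() time.Duration { return t.DeliveredAt.Sub(t.OccurredAt) }

// Pipeline is the segment you control: ingest to delivery.
func (t Timestamps) Pipeline() time.Duration { return t.DeliveredAt.Sub(t.IngestedAt) }

// Upstream is the delay before the event reached your gateway.
func (t Timestamps) Upstream() time.Duration { return t.IngestedAt.Sub(t.OccurredAt) }

func main() {
	occurred := time.Date(2026, 3, 24, 9, 0, 0, 0, time.UTC)
	ts := Timestamps{
		OccurredAt:  occurred,
		IngestedAt:  occurred.Add(20 * time.Second), // source system was slow to send
		DeliveredAt: occurred.Add(24 * time.Second),
	}
	fmt.Println("end-to-end:", ts.EndToEnd())
	fmt.Println("pipeline:  ", ts.Pipeline())
	fmt.Println("upstream:  ", ts.Upstream())
}
```

In this example the pipeline segment is only 4 seconds, yet the customer sees 24 seconds, which is why the SLA wording needs to name the clock it starts from.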
Instrumenting Your Delivery Pipeline
Before you can measure an SLA, you need timestamps at each stage. Minimum required fields on your events and delivery attempts:
-- events table
ALTER TABLE events ADD COLUMN occurred_at TIMESTAMPTZ; -- event source timestamp
ALTER TABLE events ADD COLUMN ingested_at TIMESTAMPTZ DEFAULT NOW();
ALTER TABLE events ADD COLUMN first_delivered_at TIMESTAMPTZ;
-- delivery_attempts table
ALTER TABLE delivery_attempts ADD COLUMN attempted_at TIMESTAMPTZ DEFAULT NOW();
ALTER TABLE delivery_attempts ADD COLUMN responded_at TIMESTAMPTZ;
ALTER TABLE delivery_attempts ADD COLUMN outcome TEXT; -- 'success', 'timeout', 'http_5xx', etc.
ALTER TABLE delivery_attempts ADD COLUMN response_ms INTEGER; -- destination response time

When a delivery attempt succeeds, write back to events.first_delivered_at:
func (w *Worker) markDelivered(ctx context.Context, eventID uuid.UUID) error {
	_, err := w.db.ExecContext(ctx, `
		UPDATE events
		SET status = 'delivered',
		    first_delivered_at = NOW()
		WHERE id = $1
		  AND first_delivered_at IS NULL
	`, eventID)
	return err
}

The IS NULL guard ensures you record the timestamp of the first successful delivery, not a later replay.
The Core SLA Query
With timestamps in place, the compliance query is straightforward:
-- Delivery SLA compliance: % of events delivered within 30 seconds
-- Rolling 7-day window
SELECT
  COUNT(*) AS total_events,
  COUNT(*) FILTER (
    WHERE first_delivered_at - ingested_at <= INTERVAL '30 seconds'
  ) AS delivered_on_time,
  ROUND(
    100.0 * COUNT(*) FILTER (
      WHERE first_delivered_at - ingested_at <= INTERVAL '30 seconds'
    ) / NULLIF(COUNT(*), 0),
    2
  ) AS compliance_pct
FROM events
WHERE
  ingested_at >= NOW() - INTERVAL '7 days'
  AND status IN ('delivered', 'dead_letter')
  AND direction = 'outbound';

Run this query in your monitoring system on a schedule (every 5 minutes for a live dashboard, every hour for alerting). If compliance_pct drops below your target, you have a violation in progress.
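The same calculation can be sketched in application code, which is handy for unit-testing the compliance logic without a database. The `Event` struct and `CompliancePct` function are hypothetical; the sketch assumes dead-lettered events have no delivery timestamp and therefore count against compliance, as they do in the SQL.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a hypothetical in-memory view of a row from the events table.
// FirstDeliveredAt is nil for dead-lettered events, matching a NULL column.
type Event struct {
	IngestedAt       time.Time
	FirstDeliveredAt *time.Time
}

// CompliancePct returns the percentage of events delivered within threshold.
// Undelivered events count as misses. ok is false when there are no events,
// which mirrors the NULLIF guard against division by zero.
func CompliancePct(events []Event, threshold time.Duration) (pct float64, ok bool) {
	if len(events) == 0 {
		return 0, false
	}
	onTime := 0
	for _, e := range events {
		if e.FirstDeliveredAt != nil && e.FirstDeliveredAt.Sub(e.IngestedAt) <= threshold {
			onTime++
		}
	}
	return 100 * float64(onTime) / float64(len(events)), true
}

func main() {
	base := time.Date(2026, 3, 24, 9, 0, 0, 0, time.UTC)
	at := func(d time.Duration) *time.Time { t := base.Add(d); return &t }
	events := []Event{
		{IngestedAt: base, FirstDeliveredAt: at(3 * time.Second)},
		{IngestedAt: base, FirstDeliveredAt: at(30 * time.Second)}, // on the boundary, counts as on time
		{IngestedAt: base, FirstDeliveredAt: at(31 * time.Second)},
		{IngestedAt: base}, // dead-lettered
	}
	pct, ok := CompliancePct(events, 30*time.Second)
	fmt.Println(pct, ok)
}
```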
Breaking it down by account
For multi-tenant systems, per-account compliance matters:
SELECT
  account_id,
  COUNT(*) AS total_events,
  ROUND(
    100.0 * COUNT(*) FILTER (
      WHERE first_delivered_at - ingested_at <= INTERVAL '30 seconds'
    ) / NULLIF(COUNT(*), 0),
    2
  ) AS compliance_pct
FROM events
WHERE
  ingested_at >= NOW() - INTERVAL '24 hours'
  AND status IN ('delivered', 'dead_letter')
  AND direction = 'outbound' -- same filter as the global compliance query
GROUP BY account_id
ORDER BY compliance_pct ASC;

Accounts at the top of this list have the lowest compliance and are your most at-risk customers. An enterprise customer whose SLA is being violated at 11pm on a Friday is a churn risk.
Percentile Latency Metrics
Compliance percentage answers "are we meeting the SLA?" but doesn't tell you how close to the edge you are. Complement it with latency percentiles:
SELECT
  PERCENTILE_CONT(0.50) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM (first_delivered_at - ingested_at))
  ) AS p50_seconds,
  PERCENTILE_CONT(0.95) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM (first_delivered_at - ingested_at))
  ) AS p95_seconds,
  PERCENTILE_CONT(0.99) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM (first_delivered_at - ingested_at))
  ) AS p99_seconds,
  PERCENTILE_CONT(0.999) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM (first_delivered_at - ingested_at))
  ) AS p999_seconds
FROM events
WHERE
  ingested_at >= NOW() - INTERVAL '24 hours'
  AND first_delivered_at IS NOT NULL;

A healthy outbound webhook system at moderate volume should look like:
| Percentile | Expected Latency |
|---|---|
| p50 | < 1 second |
| p95 | < 5 seconds |
| p99 | < 15 seconds |
| p99.9 | < 60 seconds (retry territory) |
If your p99 is 45 seconds against a 60-second target, you're one traffic spike away from an SLA breach. Track trends over time: a rising p99 that hasn't yet crossed your threshold is a leading indicator of an upcoming violation.
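If latencies are already in memory (for example, pulled from an API rather than queried in SQL), the same percentiles can be computed in Go. The sketch below uses linear interpolation between the two nearest ranks, which is how PostgreSQL defines PERCENTILE_CONT; the `PercentileCont` function name and the sample latencies are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// PercentileCont returns the p-th percentile of values using linear
// interpolation between the two nearest ranks. p is clamped to [0, 1].
// It returns NaN for empty input.
func PercentileCont(values []float64, p float64) float64 {
	if len(values) == 0 {
		return math.NaN()
	}
	p = math.Max(0, math.Min(1, p))
	sorted := append([]float64(nil), values...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)
	rank := p * float64(len(sorted)-1) // zero-based fractional rank
	lo := int(math.Floor(rank))
	hi := int(math.Ceil(rank))
	frac := rank - float64(lo)
	return sorted[lo] + frac*(sorted[hi]-sorted[lo])
}

func main() {
	// Hypothetical delivery latencies in seconds.
	latencies := []float64{0.4, 0.6, 0.8, 1.2, 2.5, 4.0, 9.0, 14.0, 22.0, 48.0}
	for _, p := range []float64{0.50, 0.95, 0.99, 0.999} {
		fmt.Printf("p%g = %.2fs\n", p*100, PercentileCont(latencies, p))
	}
}
```

With small samples the high percentiles are dominated by the single slowest event, so treat p99 and p99.9 as meaningful only once the window holds enough deliveries.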
Alerting on SLA Risk
Don't wait for a breach to happen. Alert on leading indicators:
Alert 1: Queue depth rising
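SELECT COUNT(*) AS queued_events
FROM events
WHERE status = 'queued'
  AND ingested_at < NOW() - INTERVAL '60 seconds';

If events have been sitting in queued state for more than 60 seconds, your delivery workers are behind. Set the interval below your SLA latency target (60 seconds suits a 5-minute target; a 30-second target needs something closer to 10 seconds) so the warning fires before events miss the SLA window.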
Alert 2: Delivery worker throughput drop
Track delivery_attempts per minute. A sudden drop in attempt rate (not a drop in event volume) means your worker process may be stuck or restarting.
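One way to express "attempts dropped but volume did not" is to compare both rates against a trailing baseline. This is a minimal sketch: the `RateSample` type, the `WorkersStalled` function, and the `dropRatio` tuning knob are all assumptions, and the counts would come from your own metrics.

```go
package main

import "fmt"

// RateSample holds per-minute counts. Field names are illustrative.
type RateSample struct {
	Attempts float64 // delivery attempts recorded per minute
	Events   float64 // events ingested per minute
}

// WorkersStalled reports whether the attempt rate fell below dropRatio of its
// baseline while event volume stayed at or above dropRatio of its baseline.
func WorkersStalled(current, baseline RateSample, dropRatio float64) bool {
	if baseline.Attempts <= 0 || baseline.Events <= 0 {
		return false // no usable baseline to compare against
	}
	attemptsDropped := current.Attempts < dropRatio*baseline.Attempts
	volumeHeld := current.Events >= dropRatio*baseline.Events
	return attemptsDropped && volumeHeld
}

func main() {
	baseline := RateSample{Attempts: 1000, Events: 950}

	// Attempts collapsed while events kept arriving: likely a stuck worker.
	fmt.Println(WorkersStalled(RateSample{Attempts: 100, Events: 940}, baseline, 0.5))

	// Both dropped together: probably just a quiet period.
	fmt.Println(WorkersStalled(RateSample{Attempts: 100, Events: 90}, baseline, 0.5))
}
```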
Alert 3: p99 latency crossing threshold
Alert when your 15-minute rolling p99 latency crosses 20 seconds, giving you 10 seconds of runway before a 30-second SLA is at risk.
The operational discipline is to treat these leading indicators as seriously as the SLA metric itself. A breach that you caught 10 minutes early can often be avoided by scaling up workers or shedding load.
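The p99 check from Alert 3 reduces to comparing a rolling percentile against the target minus some runway. The following sketch assumes a hypothetical three-level severity scale and a `P99Alert` function; the thresholds in `main` are the ones from the example above.

```go
package main

import (
	"fmt"
	"time"
)

// AlertLevel is a hypothetical severity scale for the p99 check.
type AlertLevel string

const (
	LevelOK       AlertLevel = "ok"
	LevelWarning  AlertLevel = "warning"
	LevelCritical AlertLevel = "critical"
)

// P99Alert compares a rolling p99 against the SLA latency target.
// runway is how far below the target the warning should fire.
func P99Alert(p99, target, runway time.Duration) AlertLevel {
	switch {
	case p99 >= target:
		return LevelCritical // the latency target itself is exceeded
	case p99 >= target-runway:
		return LevelWarning // inside the runway, time to act
	default:
		return LevelOK
	}
}

func main() {
	target, runway := 30*time.Second, 10*time.Second
	for _, p99 := range []time.Duration{12 * time.Second, 22 * time.Second, 31 * time.Second} {
		fmt.Println(p99, P99Alert(p99, target, runway))
	}
}
```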
What to Put in the SLA Document
When an enterprise customer asks for a written SLA, be precise about what you're committing to and what you're excluding:
Include:
- ›Latency definition: delivery_time - ingested_at (not delivery_time - occurred_at)
- ›Success rate definition: events that eventually deliver within N retry attempts
- ›Measurement window: rolling 30-day
- ›Exclusions: events that fail due to the destination returning persistent 4xx errors (those are customer-side issues, not infrastructure failures)
- ›Remediation: service credits if the SLA is breached, typically 10% of monthly fee per 1% below target
Exclude explicitly:
- ›Third-party provider uptime (the system that sends events to you)
- ›Destination endpoint uptime (your customer's server)
- ›Events that are in dead-letter queue due to customer misconfiguration
SLA documents that don't clearly define these exclusions will generate disputes. A destination that returns 503 for six hours is not a failure of your delivery infrastructure — but without explicit language, it looks like one in the metrics.
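The remediation clause is worth turning into arithmetic before you sign it. The sketch below applies the "10% of monthly fee per 1% below target" rule from the list above under two assumptions that your contract would need to state explicitly: the shortfall is prorated rather than rounded to whole points, and the credit is capped at the full monthly fee. `ServiceCredit` is a hypothetical helper.

```go
package main

import "fmt"

// ServiceCredit returns the credit owed for one billing period.
// Assumptions: the shortfall is prorated (0.5 points below target earns half
// of creditPctPerPoint), and the credit never exceeds the monthly fee.
func ServiceCredit(monthlyFee, targetPct, actualPct, creditPctPerPoint float64) float64 {
	shortfall := targetPct - actualPct
	if shortfall <= 0 {
		return 0 // target met, nothing owed
	}
	credit := monthlyFee * shortfall * creditPctPerPoint / 100
	if credit > monthlyFee {
		return monthlyFee
	}
	return credit
}

func main() {
	fmt.Println(ServiceCredit(1000, 99.5, 98.5, 10)) // one point below target
	fmt.Println(ServiceCredit(1000, 99.5, 99.9, 10)) // target met
	fmt.Println(ServiceCredit(1000, 99.5, 79.5, 10)) // severe outage, capped at the fee
}
```

Running numbers like these against your measured compliance history shows how much revenue the clause actually puts at risk.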
Using GetHook's Delivery Data for SLA Reporting
GetHook records ingested_at, first_delivered_at, and per-attempt timestamps for every event. You can query the events API to pull delivery latency data for a given account and time window, making it straightforward to generate a monthly SLA compliance report for enterprise customers.
The key fields available in the events API response:
{
  "id": "evt_01HWXYZ...",
  "status": "delivered",
  "ingested_at": "2026-03-24T09:00:00Z",
  "first_delivered_at": "2026-03-24T09:00:04Z",
  "attempts_count": 1,
  "direction": "outbound"
}

From ingested_at to first_delivered_at you get a 4-second delivery latency for this event, well within any reasonable SLA target.
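A short sketch of how that latency could be derived from the payload in Go. The `EventRecord` struct, `ParseEvent`, and `DeliveryLatency` are hypothetical names; only the JSON field names come from the response above. The sketch also assumes that an undelivered event carries a null or absent first_delivered_at, which is not confirmed here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// EventRecord mirrors the fields shown in the events API response.
type EventRecord struct {
	ID               string     `json:"id"`
	Status           string     `json:"status"`
	IngestedAt       time.Time  `json:"ingested_at"`
	FirstDeliveredAt *time.Time `json:"first_delivered_at"` // assumed nil until the first success
	AttemptsCount    int        `json:"attempts_count"`
	Direction        string     `json:"direction"`
}

// ParseEvent decodes a single event object.
func ParseEvent(data []byte) (EventRecord, error) {
	var e EventRecord
	err := json.Unmarshal(data, &e)
	return e, err
}

// DeliveryLatency returns the ingest-to-delivery latency.
// The boolean is false when the event has not been delivered.
func (e EventRecord) DeliveryLatency() (time.Duration, bool) {
	if e.FirstDeliveredAt == nil {
		return 0, false
	}
	return e.FirstDeliveredAt.Sub(e.IngestedAt), true
}

func main() {
	payload := []byte(`{
		"id": "evt_01HWXYZ...",
		"status": "delivered",
		"ingested_at": "2026-03-24T09:00:00Z",
		"first_delivered_at": "2026-03-24T09:00:04Z",
		"attempts_count": 1,
		"direction": "outbound"
	}`)
	e, err := ParseEvent(payload)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	if latency, ok := e.DeliveryLatency(); ok {
		fmt.Println("delivery latency:", latency)
	}
}
```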
Summary
Defining a webhook delivery SLA forces you to be precise about what you're actually measuring. The work has two parts: adding the right timestamps to your data model (which you should do even if you have no SLA today), and writing the queries that turn timestamps into compliance percentages and latency percentiles.
The teams that get into trouble are those who agree to an SLA in a sales conversation without any prior measurement in place. Add ingested_at and first_delivered_at to your events table now, run the compliance query at least weekly, and you'll always know where you stand.