Your webhook integration breaks at 11:47 PM on a Saturday. A customer's order confirmation flow goes silent. By the time your on-call engineer opens their laptop, the customer has already filed a support ticket, posted to your community forum, and started evaluating a competitor.
That sequence is avoidable — not by preventing every failure, but by making the failure visible and self-explanatory before the frustration escalates. A webhook status page is how you do that.
This post covers what to show, how to compute it, and how to build the data layer that makes it accurate.
## What customers actually need to know
A status page for webhooks isn't the same as a general infrastructure status page. "API: Operational" tells a developer nothing about whether their `order.created` events are being delivered.
What they actually care about:
| Question | What it maps to |
|---|---|
| "Are my webhooks firing?" | Current delivery success rate, last 15 minutes |
| "Is this a me problem or a you problem?" | Delivery errors broken down by destination vs. system-wide |
| "How far behind is the queue?" | Oldest undelivered event age |
| "When did this start?" | Incident timeline |
| "Are other customers affected?" | Aggregate vs. per-account signal |
Most status pages answer none of these. They show a green dot and a vague "All systems operational" message — which is useless when a specific webhook type stopped firing two hours ago.
## The data model you need first
Accurate status requires pre-computed signal, not on-demand queries across your events table. At GetHook, we maintain a `delivery_health_5m` materialized view that rolls delivery attempts up into 5-minute buckets:
```sql
CREATE MATERIALIZED VIEW delivery_health_5m AS
SELECT
  date_trunc('minute', attempted_at) -
    INTERVAL '1 minute' * (EXTRACT(minute FROM attempted_at)::int % 5) AS bucket,
  COUNT(*) FILTER (WHERE outcome = 'success')       AS delivered,
  COUNT(*) FILTER (WHERE outcome = 'http_5xx')      AS dest_5xx,
  COUNT(*) FILTER (WHERE outcome = 'timeout')       AS timeouts,
  COUNT(*) FILTER (WHERE outcome = 'network_error') AS network_errors,
  COUNT(*)                                          AS total
FROM delivery_attempts
WHERE attempted_at > NOW() - INTERVAL '24 hours'
GROUP BY 1;

CREATE UNIQUE INDEX ON delivery_health_5m (bucket);
```

Refresh it on a cron:
```sql
REFRESH MATERIALIZED VIEW CONCURRENTLY delivery_health_5m;
```

`CONCURRENTLY` is key: it lets reads continue without an exclusive lock (it's also why the view needs that unique index). Refresh every 60 seconds and you have a status signal that's at most 60 seconds stale with no impact on query latency.
For the "oldest undelivered event" metric:
```sql
SELECT MIN(created_at) AS oldest_queued
FROM events
WHERE status IN ('queued', 'retry_scheduled')
  AND next_attempt_at < NOW() + INTERVAL '5 minutes';
```

If that timestamp is more than 10 minutes old, your queue is backing up.
## Computing a health status
Translate raw numbers into a signal your API can expose:
```go
type DeliveryStatus struct {
	Status          string    `json:"status"` // "operational", "degraded", "outage"
	SuccessRate     float64   `json:"success_rate_15m"`
	OldestQueuedAge int       `json:"oldest_queued_seconds"`
	UpdatedAt       time.Time `json:"updated_at"`
}

func ComputeStatus(delivered, total int64, oldestQueuedAge time.Duration) DeliveryStatus {
	rate := 1.0 // no events = healthy
	if total > 0 {
		rate = float64(delivered) / float64(total)
	}

	status := "operational"
	switch {
	case rate < 0.80 || oldestQueuedAge > 30*time.Minute:
		status = "outage"
	case rate < 0.95 || oldestQueuedAge > 10*time.Minute:
		status = "degraded"
	}

	return DeliveryStatus{
		Status:          status,
		SuccessRate:     rate,
		OldestQueuedAge: int(oldestQueuedAge.Seconds()),
		UpdatedAt:       time.Now().UTC(),
	}
}
```

The thresholds here are opinionated. A 95% success rate sounds good, but if you're sending 10,000 events an hour, 5% failures is 500 broken deliveries. Adjust based on your volume and your SLA commitments.
## What to expose publicly vs. privately
Not everything belongs on the public page.
**Public (unauthenticated):**

- System-wide delivery success rate (rolling 15 minutes)
- Queue backlog depth (operational / degraded / backed up)
- Incident history (last 30 days)
- Current active incidents with start time and updates

**Authenticated (per-account):**

- Per-destination success rate
- Their specific events in `dead_letter` status
- Retry schedule for in-flight failures
- Full delivery attempt log
The public page builds trust with prospects and reduces support volume. The per-account view is what your customer opens when they're debugging their integration — which is a different job.
## Incident tracking: the part most teams skip
A status page without incident history is just a green dot. Customers remember outages — they want to see that you do too.
At minimum, store:
```sql
CREATE TABLE incidents (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  title       TEXT NOT NULL,
  impact      TEXT NOT NULL CHECK (impact IN ('degraded', 'partial_outage', 'major_outage')),
  started_at  TIMESTAMPTZ NOT NULL,
  resolved_at TIMESTAMPTZ,
  updates     JSONB NOT NULL DEFAULT '[]'
);
```

The `updates` column holds a JSON array of `{ message, timestamp }` entries, which becomes your postmortem trail. When an incident closes, write a brief RCA in plain language: what failed, why, how long, what changed.
This sounds like overhead, but a three-sentence incident update does more for customer trust than three months of clean uptime. It signals that humans are watching and that failures are understood, not just fixed.
## The status page endpoint
Expose a single, fast, public endpoint:
```
GET /status
```

```json
{
  "delivery": {
    "status": "degraded",
    "success_rate_15m": 0.91,
    "oldest_queued_seconds": 420,
    "updated_at": "2026-03-31T11:52:00Z"
  },
  "active_incidents": [
    {
      "id": "inc_01htx...",
      "title": "Elevated timeout rate to eu-west destinations",
      "impact": "degraded",
      "started_at": "2026-03-31T11:30:00Z",
      "latest_update": "Identified network path issue, mitigation in progress"
    }
  ],
  "recent_incidents": []
}
```

Keep this endpoint fast: it should hit a cache, not run queries. Write the computed status to a `system_status` cache key (Redis, Postgres, or even an in-memory value updated by a background goroutine) and return it in under 5ms. Your frontend status page should poll this every 30 seconds.
## Dashboard integration
The status page data becomes more valuable when it lives in your product, not just a separate URL. In your customer dashboard, surface:
- A banner when the customer has events in `dead_letter` state
- A callout when their specific destination has a degraded success rate
- A link to the incident log if there's an active incident affecting their region or event type
These in-product signals reduce support tickets. Instead of "my webhooks aren't working," customers arrive at support with "I saw my eu-west destination went degraded at 11:30 — is there a known incident?" That's a solvable conversation.
## Practical thresholds by scale
| Daily volume | Refresh interval | Alert threshold | Dead letter alert |
|---|---|---|---|
| < 10K events | 5 minutes | < 90% success | Any |
| 10K–100K | 2 minutes | < 95% success | > 5 events |
| 100K–1M | 60 seconds | < 98% success | > 50 events |
| > 1M | 30 seconds | < 99% success | > 200 events |
These aren't rules — they're starting points. Your actual thresholds should come from your SLA commitments and what failure rate makes customers notice.
A webhook status page isn't about showing that you never break. It's about demonstrating that when something breaks, you know about it, you're communicating, and your customers don't have to file a ticket to find out what's happening.
If you're building webhook infrastructure and want the delivery pipeline, retry logic, and event history that power a status page like this, GetHook is available to try free — you can have an endpoint accepting and delivering events in under five minutes.