When webhook delivery is degraded, your customers discover it before your on-call does. They see retries stacking up in their own logs, downstream jobs failing, and dashboards going stale. Then they open a support ticket asking "are your webhooks down?"
A customer-facing webhook status page answers that question before they have to ask it. It also shifts the conversation from reactive damage control ("we're investigating") to proactive trust-building ("here's what we know and when we'll fix it"). This post covers what to measure, how to compute status, and what the implementation looks like end to end.
What a Webhook Status Page Should Show
A status page is not a vanity dashboard. It exists to answer three specific questions for a customer who is debugging a problem:
- Is delivery currently degraded? (current status: operational / degraded / outage)
- Has it been degraded recently? (rolling 90-day history of incidents)
- What happened during a past incident? (incident timeline with start, end, and root cause)
Secondary information that adds value without adding noise:
- P95 delivery latency for the past 24 hours (is it slower than usual?)
- Error rate as a percentage of delivery attempts (what fraction are failing?)
- Per-region status if you deliver from multiple regions
What not to include: internal metrics, queue depth, worker counts, database stats. Customers don't need to understand your architecture. They need to know whether their webhooks will arrive.
The Metrics That Drive Status
Your status computation should be built on top of delivery attempt data you're already storing. If you have a delivery_attempts table, you have everything you need.
-- Compute error rate and P95 latency over a rolling 5-minute window
SELECT
COUNT(*) FILTER (WHERE outcome IN ('http_5xx', 'timeout', 'network_error'))::float
/ NULLIF(COUNT(*), 0) AS error_rate,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95_latency_ms,
COUNT(*) AS total_attempts
FROM delivery_attempts
WHERE created_at >= NOW() - INTERVAL '5 minutes';

Map the result to a status level:
| Error Rate | P95 Latency | Status |
|---|---|---|
| < 1% | < 2,000 ms | Operational |
| 1%–5% | 2,000–5,000 ms | Degraded Performance |
| 5%–15% | 5,000–15,000 ms | Partial Outage |
| > 15% | > 15,000 ms | Major Outage |
These thresholds are starting points — calibrate them to your baseline. A platform that normally operates at 0.1% error rate and 300ms P95 should alert at much lower thresholds than one with a noisier baseline.
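If you keep these thresholds in application code, the mapping is a handful of comparisons. A minimal sketch in Go, using the starting-point thresholds above (the type and function names are illustrative, not part of any particular library):

```go
// Level mirrors the four customer-facing states shown on the status page.
type Level string

const (
	Operational   Level = "operational"
	Degraded      Level = "degraded"
	PartialOutage Level = "partial_outage"
	MajorOutage   Level = "major_outage"
)

// classify maps a rolling-window error rate (0.0–1.0) and P95 latency (ms)
// to a status level. The worse of the two signals wins.
func classify(errorRate float64, p95LatencyMs int) Level {
	switch {
	case errorRate > 0.15 || p95LatencyMs > 15000:
		return MajorOutage
	case errorRate > 0.05 || p95LatencyMs > 5000:
		return PartialOutage
	case errorRate > 0.01 || p95LatencyMs > 2000:
		return Degraded
	default:
		return Operational
	}
}
```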
Run this query on a schedule (every 60 seconds is fine) and write the result into a system_status_snapshots table:
CREATE TABLE system_status_snapshots (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
status TEXT NOT NULL, -- 'operational' | 'degraded' | 'partial_outage' | 'major_outage'
error_rate NUMERIC(6, 4),
p95_latency_ms INT,
total_attempts INT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX system_status_snapshots_time ON system_status_snapshots (created_at DESC);

Keeping a history of snapshots lets you render the 90-day incident history without storing incidents manually.
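If you'd rather keep the thresholds next to the query instead of in Go, the classification can be folded into the snapshot write itself as a SQL CASE, so the whole job is one statement on a timer. A sketch of that scheduled job, assuming Postgres through database/sql with the lib/pq driver and a DATABASE_URL environment variable (both are implementation choices, not requirements):

```go
package main

import (
	"database/sql"
	"log"
	"os"
	"time"

	_ "github.com/lib/pq" // any Postgres driver works; lib/pq is just an example
)

// snapshotSQL computes the rolling 5-minute metrics and writes one row into
// system_status_snapshots. The CASE expression encodes the same thresholds as
// the table above; an empty window (no attempts) falls through to
// 'operational' with NULL metrics.
const snapshotSQL = `
INSERT INTO system_status_snapshots (status, error_rate, p95_latency_ms, total_attempts)
SELECT
    CASE
        WHEN error_rate > 0.15 OR p95 > 15000 THEN 'major_outage'
        WHEN error_rate > 0.05 OR p95 > 5000  THEN 'partial_outage'
        WHEN error_rate > 0.01 OR p95 > 2000  THEN 'degraded'
        ELSE 'operational'
    END,
    error_rate, p95, total
FROM (
    SELECT
        COUNT(*) FILTER (WHERE outcome IN ('http_5xx', 'timeout', 'network_error'))::float
            / NULLIF(COUNT(*), 0) AS error_rate,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95,
        COUNT(*) AS total
    FROM delivery_attempts
    WHERE created_at >= NOW() - INTERVAL '5 minutes'
) metrics`

func main() {
	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}

	// One snapshot per minute, matching the suggested schedule.
	for range time.Tick(60 * time.Second) {
		if _, err := db.Exec(snapshotSQL); err != nil {
			log.Printf("status snapshot failed: %v", err)
		}
	}
}
```

Either home for the thresholds works; just pick one (Go or SQL) so the page and your alerting never drift apart.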
Incident Detection and Management
Status snapshots tell you the current state of delivery. Incidents are a higher-level concept: a continuous period of degraded or worse status, with a start time, end time, and description.
You can auto-detect incidents from status transitions:
// detectIncidentTransition reports whether this snapshot starts an incident
// (operational -> degraded or worse) or ends one (the reverse).
func detectIncidentTransition(prev, curr string) (start bool, end bool) {
degradedStates := map[string]bool{
"degraded": true,
"partial_outage": true,
"major_outage": true,
}
wasDown := degradedStates[prev]
isDown := degradedStates[curr]
return !wasDown && isDown, wasDown && !isDown
}

When a transition to degraded is detected, open an incident. When the status returns to operational, close it. Your incident table:
CREATE TABLE status_incidents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
title TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'investigating',
-- 'investigating' | 'identified' | 'monitoring' | 'resolved'
started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
resolved_at TIMESTAMPTZ,
auto_detected BOOLEAN NOT NULL DEFAULT true
);
CREATE TABLE status_incident_updates (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
incident_id UUID NOT NULL REFERENCES status_incidents(id),
status TEXT NOT NULL,
message TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Auto-detection handles most cases. But your on-call engineer needs to be able to post manual updates ("we've identified the root cause — a misconfigured CDN cache rule is rejecting valid HMAC signatures") and override the incident status. A simple internal ops UI is sufficient for this — no customer-facing write access.
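Wiring the detector into these tables takes only a couple of statements. A sketch, assuming the snapshot job calls it with the previous and current status after each run; the generic title and the resolve-the-latest-open-incident policy are placeholder choices, and db is a *sql.DB from database/sql:

```go
// onStatusTransition opens or resolves an incident when the computed status
// crosses into or out of a degraded state. detectIncidentTransition is the
// helper defined above.
func onStatusTransition(db *sql.DB, prev, curr string) error {
	started, ended := detectIncidentTransition(prev, curr)
	switch {
	case started:
		// Auto-detected incidents get a deliberately generic title; on-call
		// replaces it with something specific via the internal ops UI.
		_, err := db.Exec(
			`INSERT INTO status_incidents (title, status, auto_detected)
			 VALUES ($1, 'investigating', true)`,
			"Automatically detected delivery degradation")
		return err
	case ended:
		// Resolve the most recent open incident rather than all of them,
		// in case on-call has opened a separate manual incident.
		_, err := db.Exec(
			`UPDATE status_incidents
			 SET status = 'resolved', resolved_at = now()
			 WHERE id = (SELECT id FROM status_incidents
			             WHERE resolved_at IS NULL
			             ORDER BY started_at DESC LIMIT 1)`)
		return err
	}
	return nil
}
```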
The Public API
The status page itself should be a statically renderable page backed by a public (no-auth) API. Keep the API surface minimal:
GET /status — current status + active incident if any
GET /status/history — past 90 days of daily status summaries
GET /status/incidents/{id} — full incident timeline with updates

The current status endpoint:
{
"status": "degraded",
"updated_at": "2026-03-28T14:32:00Z",
"metrics": {
"error_rate_pct": 3.2,
"p95_latency_ms": 4100
},
"active_incident": {
"id": "inc_01HZ...",
"title": "Elevated delivery latency for US-East destinations",
"status": "investigating",
"started_at": "2026-03-28T14:25:00Z",
"latest_update": "We are investigating elevated latency affecting deliveries to US-East endpoints. Other regions are unaffected."
}
}

The history endpoint returns 90 days of daily status summaries for the uptime calendar:
{
"data": [
{ "date": "2026-03-28", "status": "degraded", "incident_count": 1, "uptime_pct": 98.7 },
{ "date": "2026-03-27", "status": "operational", "incident_count": 0, "uptime_pct": 100.0 }
]
}

Computing uptime_pct per day is straightforward from your snapshots: count the snapshots where status = 'operational' and divide by total snapshots for that day.
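That per-day rollup can come straight from the snapshots table with one GROUP BY. A sketch of the query, written as a Go constant so it can live next to the handler that serves GET /status/history (the 90-day window and rounding precision are assumptions chosen to match the response shape above):

```go
// dailyUptimeSQL rolls snapshot history up into per-day summaries for the
// uptime calendar: the share of snapshots that were 'operational', plus how
// many were degraded or worse.
const dailyUptimeSQL = `
SELECT
    created_at::date AS day,
    ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'operational') / COUNT(*), 1) AS uptime_pct,
    COUNT(*) FILTER (WHERE status <> 'operational') AS degraded_snapshots
FROM system_status_snapshots
WHERE created_at >= NOW() - INTERVAL '90 days'
GROUP BY 1
ORDER BY 1 DESC`
```

The per-day status and incident_count fields in the response come from the same data: take the worst status observed that day, and join against status_incidents for the count.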
Rendering the Status Page
You have two good options: a statically generated page (rebuilt every 60 seconds) or a server-rendered page backed by the API above.
Static generation is preferable because a status page needs to be available even when your main application is struggling. If your API server is the problem, a status page backed by that same server won't load — which is the worst possible user experience during an outage.
The standard pattern is to deploy your status page to a CDN-backed static host (Cloudflare Pages, Vercel, S3+CloudFront) that is entirely separate from your main application infrastructure. A background job rebuilds and republishes the static HTML on a 60-second interval.
# Cron job: rebuild and deploy status page every 60 seconds
* * * * * /usr/local/bin/rebuild-status-page.sh
* * * * * sleep 30 && /usr/local/bin/rebuild-status-page.sh

Running it twice per minute (offset by 30 seconds) gives you a maximum staleness of 30 seconds — good enough for incident transparency.
The rebuild script fetches from your internal status API (not the public one, to avoid rate limits), renders the HTML, and pushes it to your CDN origin.
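A minimal version of that rebuild step, sketched in Go; the internal URL, template path, and output path are placeholders, and the final publish step depends on your static host, so it is left as a comment:

```go
package main

import (
	"encoding/json"
	"html/template"
	"log"
	"net/http"
	"os"
)

func main() {
	// Fetch current status from the internal (non-rate-limited) API.
	// The URL is a placeholder for wherever your snapshot data is served.
	resp, err := http.Get("http://status-internal.local/status")
	if err != nil {
		log.Fatalf("fetch status: %v", err)
	}
	defer resp.Body.Close()

	var payload map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
		log.Fatalf("decode status: %v", err)
	}

	// Render the static page from a template checked into the repo.
	tmpl := template.Must(template.ParseFiles("status.html.tmpl"))
	out, err := os.Create("public/index.html")
	if err != nil {
		log.Fatalf("create output: %v", err)
	}
	defer out.Close()
	if err := tmpl.Execute(out, payload); err != nil {
		log.Fatalf("render: %v", err)
	}

	// Publish public/ to the CDN-backed static host here (aws s3 sync,
	// wrangler pages deploy, etc.); provider-specific, so omitted.
}
```

The cron entries above can call this binary directly in place of the shell script.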
Subscribing Customers to Updates
A status page that customers have to manually check is half as useful as one that notifies them. Give customers two subscription options:
Email notifications — opt-in mailing list for incident start, updates, and resolution. A simple webhook from your incident management system to a transactional email provider is sufficient.
RSS / Atom feed — low-effort for you, useful for customers who aggregate status feeds in monitoring tools like Better Uptime or PagerDuty's status aggregator.
<feed xmlns="http://www.w3.org/2005/Atom">
<title>GetHook System Status</title>
<link href="https://status.gethook.to/feed.xml" rel="self"/>
<updated>2026-03-28T14:32:00Z</updated>
<entry>
<title>Elevated delivery latency for US-East destinations</title>
<id>https://status.gethook.to/incidents/inc_01HZ...</id>
<updated>2026-03-28T14:32:00Z</updated>
<summary>We are investigating elevated latency affecting deliveries to US-East endpoints.</summary>
</entry>
</feed>

Do not build a real-time push mechanism (WebSockets, SSE) for status updates. Your customers don't need sub-second status updates. Email on incident start is fine; RSS covers the rest.
What You'll Get Wrong (and How to Avoid It)
| Mistake | Consequence | Fix |
|---|---|---|
| Status page hosted on same infrastructure as the product | Page goes down during an outage | Separate static host on a different CDN and domain |
| Computing status from a single 1-minute snapshot | Flapping status on transient errors | Use a 5-minute rolling average; require 3 consecutive degraded readings before declaring an incident |
| Auto-publishing vague titles like "Service Degraded" | Customers have no useful information | Queue human review for the title; auto-title is a fallback only |
| No incident update after initial detection | Customers assume you're not working on it | Commit to an update cadence (15 minutes during active incidents) |
| Showing only current status, no history | Customers can't evaluate your reliability track record | Display 90 days of history, always |
| Not alerting on-call when status degrades | Customers see the incident before your engineers do | Page on status transition, not just on error rate threshold |
Connecting to Your Webhook Gateway
If you're using a gateway like GetHook for inbound or outbound delivery, your status page metrics should be drawn from the gateway's delivery attempt data — not from your application's logs. The gateway sees 100% of delivery attempts including those that time out before your application receives them. Application-level logging misses network errors and upstream timeouts, which underrepresents actual failure rates.
GetHook's delivery_attempts table is the authoritative source for delivery outcomes. Wire your status computation query directly against it, and you'll have an accurate picture of what customers are actually experiencing.
A webhook status page is a trust artifact as much as a technical tool. Customers who can see that you take reliability seriously — that you track uptime, publish incidents transparently, and resolve them quickly — are more likely to stay and more likely to expand. Build it before you need it.
Start with GetHook to get delivery attempt data worth publishing on a status page →