webhooks · observability · customer experience · reliability · infrastructure

Building a Webhook Status Page for Your Customers

When your webhook delivery is degraded, your customers find out before you do — from their own error logs. A customer-facing webhook status page turns reactive support tickets into proactive transparency and cuts incident response time in half.

Marcus Webb
Platform Engineer
March 28, 2026
9 min read

When webhook delivery is degraded, your customers discover it before your on-call does. They see retries stacking up in their own logs, downstream jobs failing, and dashboards going stale. Then they open a support ticket asking "are your webhooks down?"

A customer-facing webhook status page answers that question before they have to ask it. It also shifts the conversation from reactive damage control ("we're investigating") to proactive trust-building ("here's what we know and when we'll fix it"). This post covers what to measure, how to compute status, and what the implementation looks like end to end.


What a Webhook Status Page Should Show

A status page is not a vanity dashboard. It exists to answer three specific questions for a customer who is debugging a problem:

  1. Is delivery currently degraded? (current status: operational / degraded / outage)
  2. Has it been degraded recently? (rolling 90-day history of incidents)
  3. What happened during a past incident? (incident timeline with start, end, and root cause)

Secondary information that adds value without adding noise:

  • P95 delivery latency for the past 24 hours (is it slower than usual?)
  • Error rate as a percentage of delivery attempts (what fraction are failing?)
  • Per-region status if you deliver from multiple regions

What not to include: internal metrics, queue depth, worker counts, database stats. Customers don't need to understand your architecture. They need to know whether their webhooks will arrive.


The Metrics That Drive Status

Your status computation should be built on top of delivery attempt data you're already storing. If you have a delivery_attempts table, you have everything you need.

sql
-- Compute error rate over a rolling 5-minute window
SELECT
    COUNT(*) FILTER (WHERE outcome IN ('http_5xx', 'timeout', 'network_error'))::float
        / NULLIF(COUNT(*), 0) AS error_rate,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95_latency_ms,
    COUNT(*) AS total_attempts
FROM delivery_attempts
WHERE created_at >= NOW() - INTERVAL '5 minutes';

Map the result to a status level:

| Error Rate | P95 Latency | Status |
|---|---|---|
| < 1% | < 2,000 ms | Operational |
| 1%–5% | 2,000–5,000 ms | Degraded Performance |
| 5%–15% | 5,000–15,000 ms | Partial Outage |
| > 15% | > 15,000 ms | Major Outage |

These thresholds are starting points — calibrate them to your baseline. A platform that normally operates at 0.1% error rate and 300ms P95 should alert at much lower thresholds than one with a noisier baseline.
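As an illustration, the mapping above can be expressed as a small function. One detail the table leaves open is how to combine the two metrics when they disagree; letting the worse of the two signals win is an assumption made here, not something the table prescribes:

```go
package main

// classify maps a rolling error rate (as a fraction, so 0.15 = 15%) and a
// P95 latency to a status level, using the illustrative thresholds from the
// table above. When the two metrics disagree, the worse signal wins — an
// assumed policy; calibrate both the thresholds and the policy to your baseline.
func classify(errorRate float64, p95ms int) string {
	level := 0
	switch {
	case errorRate > 0.15:
		level = 3
	case errorRate > 0.05:
		level = 2
	case errorRate > 0.01:
		level = 1
	}

	latencyLevel := 0
	switch {
	case p95ms > 15000:
		latencyLevel = 3
	case p95ms > 5000:
		latencyLevel = 2
	case p95ms > 2000:
		latencyLevel = 1
	}

	if latencyLevel > level {
		level = latencyLevel
	}
	return [...]string{"operational", "degraded", "partial_outage", "major_outage"}[level]
}
```
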

Run this query on a schedule (every 60 seconds is fine) and write the result into a system_status_snapshots table:

sql
CREATE TABLE system_status_snapshots (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    status         TEXT NOT NULL,  -- 'operational' | 'degraded' | 'partial_outage' | 'major_outage'
    error_rate     NUMERIC(6, 4),
    p95_latency_ms INT,
    total_attempts INT,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX system_status_snapshots_time ON system_status_snapshots (created_at DESC);

Keeping a history of snapshots lets you render the 90-day incident history without storing incidents manually.


Incident Detection and Management

Status snapshots tell you the current state of delivery. Incidents are a higher-level concept: a continuous period of degraded or worse status, with a start time, end time, and description.

You can auto-detect incidents from status transitions:

go
func detectIncidentTransition(prev, curr string) (start bool, end bool) {
    degradedStates := map[string]bool{
        "degraded":       true,
        "partial_outage": true,
        "major_outage":   true,
    }
    wasDown := degradedStates[prev]
    isDown  := degradedStates[curr]

    return !wasDown && isDown, wasDown && !isDown
}

When a transition to degraded is detected, open an incident. When the status returns to operational, close it. Your incident table:

sql
CREATE TABLE status_incidents (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title        TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'investigating',
    -- 'investigating' | 'identified' | 'monitoring' | 'resolved'
    started_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    resolved_at  TIMESTAMPTZ,
    auto_detected BOOLEAN NOT NULL DEFAULT true
);

CREATE TABLE status_incident_updates (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    incident_id UUID NOT NULL REFERENCES status_incidents(id),
    status      TEXT NOT NULL,
    message     TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

Auto-detection handles most cases. But your on-call engineer needs to be able to post manual updates ("we've identified the root cause — a misconfigured CDN cache rule is rejecting valid HMAC signatures") and override the incident status. A simple internal ops UI is sufficient for this — no customer-facing write access.


The Public API

The status page itself should be a statically renderable page backed by a public (no-auth) API. Keep the API surface minimal:

GET /status                  — current status + active incident if any
GET /status/history          — past 90 days of daily status summaries
GET /status/incidents/{id}   — full incident timeline with updates

The current status endpoint:

json
{
  "status": "degraded",
  "updated_at": "2026-03-28T14:32:00Z",
  "metrics": {
    "error_rate_pct": 3.2,
    "p95_latency_ms": 4100
  },
  "active_incident": {
    "id": "inc_01HZ...",
    "title": "Elevated delivery latency for US-East destinations",
    "status": "investigating",
    "started_at": "2026-03-28T14:25:00Z",
    "latest_update": "We are investigating elevated latency affecting deliveries to US-East endpoints. Other regions are unaffected."
  }
}

The history endpoint returns 90 days of daily status summaries for the uptime calendar:

json
{
  "data": [
    { "date": "2026-03-28", "status": "degraded", "incident_count": 1, "uptime_pct": 98.7 },
    { "date": "2026-03-27", "status": "operational", "incident_count": 0, "uptime_pct": 100.0 }
  ]
}

Computing uptime_pct per day is straightforward from your snapshots: count the snapshots where status = 'operational' and divide by total snapshots for that day.
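As a sketch of that calculation (treating an empty day as fully up is one reasonable choice, not a rule from above — you may prefer to report missing data explicitly):

```go
package main

// dailyUptime computes the uptime percentage for one day's snapshots:
// the share of snapshots whose status was 'operational'. With 60-second
// snapshots, a full day contributes 1,440 readings. An empty slice is
// treated as fully up — an assumed policy for days with no data.
func dailyUptime(statuses []string) float64 {
	if len(statuses) == 0 {
		return 100.0
	}
	operational := 0
	for _, s := range statuses {
		if s == "operational" {
			operational++
		}
	}
	return 100.0 * float64(operational) / float64(len(statuses))
}
```
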


Rendering the Status Page

You have two good options: a statically generated page (rebuilt every 60 seconds) or a server-rendered page backed by the API above.

Static generation is preferable because a status page needs to be available even when your main application is struggling. If your API server is the problem, a status page backed by that same server won't load — which is the worst possible user experience during an outage.

The standard pattern is to deploy your status page to a CDN-backed static host (Cloudflare Pages, Vercel, S3+CloudFront) that is entirely separate from your main application infrastructure. A background job rebuilds and republishes the static HTML on a 60-second interval.

bash
# Cron job: rebuild and deploy status page every 60 seconds
* * * * * /usr/local/bin/rebuild-status-page.sh
* * * * * sleep 30 && /usr/local/bin/rebuild-status-page.sh

Running it twice per minute (offset by 30 seconds) gives you a maximum staleness of 30 seconds — good enough for incident transparency.

The rebuild script fetches from your internal status API (not the public one, to avoid rate limits), renders the HTML, and pushes it to your CDN origin.


Subscribing Customers to Updates

A status page that customers have to manually check is half as useful as one that notifies them. Give customers two subscription options:

Email notifications — opt-in mailing list for incident start, updates, and resolution. A simple webhook from your incident management system to a transactional email provider is sufficient.

RSS / Atom feed — low-effort for you, useful for customers who aggregate status feeds in monitoring tools like Better Uptime or PagerDuty's status aggregator.

xml
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>GetHook System Status</title>
  <link href="https://status.gethook.to/feed.xml" rel="self"/>
  <updated>2026-03-28T14:32:00Z</updated>
  <entry>
    <title>Elevated delivery latency for US-East destinations</title>
    <id>https://status.gethook.to/incidents/inc_01HZ...</id>
    <updated>2026-03-28T14:32:00Z</updated>
    <summary>We are investigating elevated latency affecting deliveries to US-East endpoints.</summary>
  </entry>
</feed>

Do not build a real-time push mechanism (WebSockets, SSE) for status updates. Your customers don't need sub-second status updates. Email on incident start is fine; RSS covers the rest.


Common Mistakes (and How to Avoid Them)

| Mistake | Consequence | Fix |
|---|---|---|
| Status page hosted on the same infrastructure as the product | Page goes down during an outage | Separate static host on a different CDN and domain |
| Computing status from a single 1-minute snapshot | Flapping status on transient errors | Use a 5-minute rolling average; require 3 consecutive degraded readings before declaring an incident |
| Auto-publishing vague titles like "Service Degraded" | Customers have no useful information | Queue human review for the title; auto-title is a fallback only |
| No incident update after initial detection | Customers assume you're not working on it | Commit to an update cadence (15 minutes during active incidents) |
| Showing only current status, no history | Customers can't evaluate your reliability track record | Display 90 days of history, always |
| Not alerting on-call when status degrades | Customers see the incident before your engineers do | Page on status transition, not just on error rate threshold |

Connecting to Your Webhook Gateway

If you're using a gateway like GetHook for inbound or outbound delivery, your status page metrics should be drawn from the gateway's delivery attempt data — not from your application's logs. The gateway sees 100% of delivery attempts including those that time out before your application receives them. Application-level logging misses network errors and upstream timeouts, which underrepresents actual failure rates.

GetHook's delivery_attempts table is the authoritative source for delivery outcomes. Wire your status computation query directly against it, and you'll have an accurate picture of what customers are actually experiencing.


A webhook status page is a trust artifact as much as a technical tool. Customers who can see that you take reliability seriously — that you track uptime, publish incidents transparently, and resolve them quickly — are more likely to stay and more likely to expand. Build it before you need it.

Start with GetHook to get delivery attempt data worth publishing on a status page →

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.