developer experience · observability · reliability · customer trust

Building a Webhook Status Page for Your Customers

When webhook delivery breaks, your customers feel it before you do. Here's how to build a public-facing status page that turns a support crisis into a self-service moment.

Sofia Andreou
Product Manager
March 31, 2026
9 min read

Your webhook integration breaks at 11:47 PM on a Saturday. A customer's order confirmation flow goes silent. By the time your on-call engineer opens their laptop, the customer has already filed a support ticket, posted to your community forum, and started evaluating a competitor.

That sequence is avoidable — not by preventing every failure, but by making the failure visible and self-explanatory before the frustration escalates. A webhook status page is how you do that.

This post covers what to show, how to compute it, and how to build the data layer that makes it accurate.


What customers actually need to know

A status page for webhooks isn't the same as a general infrastructure status page. "API: Operational" tells a developer nothing about whether their `order.created` events are being delivered.

What they actually care about:

| Question | What it maps to |
|---|---|
| "Are my webhooks firing?" | Current delivery success rate, last 15 minutes |
| "Is this a me problem or a you problem?" | Delivery errors broken down by destination vs. system-wide |
| "How far behind is the queue?" | Oldest undelivered event age |
| "When did this start?" | Incident timeline |
| "Are other customers affected?" | Aggregate vs. per-account signal |

Most status pages answer none of these. They show a green dot and a vague "All systems operational" message — which is useless when a specific webhook type stopped firing two hours ago.


The data model you need first

Accurate status requires pre-computed signal, not on-demand queries across your events table. At GetHook, we maintain a delivery_health_5m materialized view that rolls deliveries up into 5-minute buckets:

```sql
CREATE MATERIALIZED VIEW delivery_health_5m AS
SELECT
  date_trunc('minute', attempted_at) -
    INTERVAL '1 minute' * (EXTRACT(minute FROM attempted_at)::int % 5) AS bucket,
  COUNT(*) FILTER (WHERE outcome = 'success')       AS delivered,
  COUNT(*) FILTER (WHERE outcome = 'http_5xx')      AS dest_5xx,
  COUNT(*) FILTER (WHERE outcome = 'timeout')       AS timeouts,
  COUNT(*) FILTER (WHERE outcome = 'network_error') AS network_errors,
  COUNT(*) AS total
FROM delivery_attempts
WHERE attempted_at > NOW() - INTERVAL '24 hours'
GROUP BY 1;

CREATE UNIQUE INDEX ON delivery_health_5m (bucket);
```

Refresh it on a cron:

```sql
REFRESH MATERIALIZED VIEW CONCURRENTLY delivery_health_5m;
```

CONCURRENTLY is key: it lets reads continue while the refresh runs, and it requires the unique index created above. Refresh every 60 seconds and you have a status signal that's at most 60 seconds stale with no impact on query latency.

For the "oldest undelivered event" metric:

```sql
SELECT MIN(created_at) AS oldest_queued
FROM events
WHERE status IN ('queued', 'retry_scheduled')
  AND next_attempt_at < NOW() + INTERVAL '5 minutes';
```

If that timestamp is more than 10 minutes old, your queue is backing up.


Computing a health status

Translate raw numbers into a signal your API can expose:

```go
type DeliveryStatus struct {
    Status          string    `json:"status"` // "operational", "degraded", "outage"
    SuccessRate     float64   `json:"success_rate_15m"`
    OldestQueuedAge int       `json:"oldest_queued_seconds"`
    UpdatedAt       time.Time `json:"updated_at"`
}

func ComputeStatus(delivered, total int64, oldestQueuedAge time.Duration) DeliveryStatus {
    var rate float64
    if total > 0 {
        rate = float64(delivered) / float64(total)
    } else {
        rate = 1.0 // no events = healthy
    }

    status := "operational"
    switch {
    case rate < 0.80 || oldestQueuedAge > 30*time.Minute:
        status = "outage"
    case rate < 0.95 || oldestQueuedAge > 10*time.Minute:
        status = "degraded"
    }

    return DeliveryStatus{
        Status:          status,
        SuccessRate:     rate,
        OldestQueuedAge: int(oldestQueuedAge.Seconds()),
        UpdatedAt:       time.Now().UTC(),
    }
}
```

The thresholds here are opinionated. A 95% success rate sounds good — but if you're sending 10,000 events an hour, 5% failures is 500 broken deliveries. Adjust based on your volume and your SLA commitments.


What to expose publicly vs. privately

Not everything belongs on the public page.

Public (unauthenticated):

  • System-wide delivery success rate (rolling 15 minutes)
  • Queue backlog depth (operational / degraded / backed up)
  • Incident history (last 30 days)
  • Current active incidents with start time and updates

Authenticated (per-account):

  • Per-destination success rate
  • Their specific events in dead_letter status
  • Retry schedule for in-flight failures
  • Full delivery attempt log

The public page builds trust with prospects and reduces support volume. The per-account view is what your customer opens when they're debugging their integration — which is a different job.
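The per-destination success rate for the authenticated view can come from the same delivery attempts, just grouped by destination. A sketch in Go, assuming attempts for the account are loaded into a slice (the `Attempt` type is illustrative):

```go
package main

// Attempt is one delivery attempt row, as the per-account view might load it.
type Attempt struct {
	Destination string
	Success     bool
}

// PerDestinationRate computes the success rate for each destination,
// which is what powers the authenticated per-account breakdown.
func PerDestinationRate(attempts []Attempt) map[string]float64 {
	total := map[string]int{}
	ok := map[string]int{}
	for _, a := range attempts {
		total[a.Destination]++
		if a.Success {
			ok[a.Destination]++
		}
	}
	rates := make(map[string]float64, len(total))
	for dest, n := range total {
		rates[dest] = float64(ok[dest]) / float64(n)
	}
	return rates
}
```

In practice you would push this grouping into SQL, but the shape of the answer is the same either way.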


Incident tracking: the part most teams skip

A status page without incident history is just a green dot. Customers remember outages — they want to see that you do too.

At minimum, store:

```sql
CREATE TABLE incidents (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  title        TEXT NOT NULL,
  impact       TEXT NOT NULL CHECK (impact IN ('degraded', 'partial_outage', 'major_outage')),
  started_at   TIMESTAMPTZ NOT NULL,
  resolved_at  TIMESTAMPTZ,
  updates      JSONB NOT NULL DEFAULT '[]'
);
```

The updates column holds a JSON array of { message, timestamp } entries — your postmortem trail. When an incident closes, write a brief RCA in plain language: what failed, why, how long, what changed.

This sounds like overhead, but a three-sentence incident update does more for customer trust than three months of clean uptime. It signals that humans are watching and that failures are understood, not just fixed.


The status page endpoint

Expose a single, fast, public endpoint:

`GET /status`

```json
{
  "delivery": {
    "status": "degraded",
    "success_rate_15m": 0.91,
    "oldest_queued_seconds": 420,
    "updated_at": "2026-03-31T11:52:00Z"
  },
  "active_incidents": [
    {
      "id": "inc_01htx...",
      "title": "Elevated timeout rate to eu-west destinations",
      "impact": "degraded",
      "started_at": "2026-03-31T11:30:00Z",
      "latest_update": "Identified network path issue, mitigation in progress"
    }
  ],
  "recent_incidents": []
}
```

Keep this endpoint fast — it should hit a cache, not run queries. Write the computed status to a system_status cache key (Redis, Postgres, or even an in-memory value updated by a background goroutine) and return it in under 5ms. Your frontend status page should poll this every 30 seconds.
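That cache-first shape is only a few lines in Go. A sketch using an in-memory `atomic.Value` that the background refresher overwrites each cycle (the header values are a reasonable default, not a requirement):

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// statusCache holds the latest pre-serialized status JSON. A background
// goroutine (the same one refreshing the materialized view) stores a new
// []byte here each cycle; the handler never touches the database.
var statusCache atomic.Value

// StatusHandler serves GET /status straight from memory.
func StatusHandler(w http.ResponseWriter, r *http.Request) {
	body, _ := statusCache.Load().([]byte)
	if body == nil {
		// No status computed yet (service just started).
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Header().Set("Cache-Control", "public, max-age=15") // CDN-friendly
	w.Write(body)
}
```

Because the body is pre-serialized, the handler does no encoding work per request, which is what keeps it comfortably under a 5ms budget.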


Dashboard integration

The status page data becomes more valuable when it lives in your product, not just a separate URL. In your customer dashboard, surface:

  • A banner when the customer has events in dead_letter state
  • A callout when their specific destination has a degraded success rate
  • A link to the incident log if there's an active incident affecting their region or event type

These in-product signals reduce support tickets. Instead of "my webhooks aren't working," customers arrive at support with "I saw my eu-west destination went degraded at 11:30 — is there a known incident?" That's a solvable conversation.
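It helps to pin down the precedence between those signals so the dashboard shows one thing at a time. A sketch of that decision, where the banner identifiers and inputs are illustrative assumptions, not GetHook APIs:

```go
package main

// BannerFor picks the single most urgent in-product signal to show,
// ordered by how directly it affects the customer's data.
func BannerFor(deadLetters int, degradedDests []string, activeIncident bool) string {
	switch {
	case deadLetters > 0:
		// Events have permanently failed: most urgent, show first.
		return "dead_letter_banner"
	case len(degradedDests) > 0:
		return "degraded_destination_callout"
	case activeIncident:
		return "incident_link"
	default:
		return "" // nothing to show
	}
}
```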


Practical thresholds by scale

| Daily volume | Refresh interval | Alert threshold | Dead letter alert |
|---|---|---|---|
| < 10K events | 5 minutes | < 90% success | Any |
| 10K–100K | 2 minutes | < 95% success | > 5 events |
| 100K–1M | 60 seconds | < 98% success | > 50 events |
| > 1M | 30 seconds | < 99% success | > 200 events |

These aren't rules — they're starting points. Your actual thresholds should come from your SLA commitments and what failure rate makes customers notice.
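Encoded as a lookup, the table gives new services sane defaults automatically. A sketch mirroring the rows above, where "Any" is treated as a dead-letter threshold of zero (alert on the first event):

```go
package main

// AlertThresholds returns starting-point alert settings for a given daily
// event volume, mirroring the table above. Defaults to tune, not rules.
func AlertThresholds(dailyVolume int64) (refreshSeconds int, minSuccessRate float64, deadLetterAlert int64) {
	switch {
	case dailyVolume < 10_000:
		return 300, 0.90, 0 // "Any": alert on the first dead-lettered event
	case dailyVolume < 100_000:
		return 120, 0.95, 5
	case dailyVolume < 1_000_000:
		return 60, 0.98, 50
	default:
		return 30, 0.99, 200
	}
}
```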


A webhook status page isn't about showing that you never break. It's about demonstrating that when something breaks, you know about it, you're communicating, and your customers don't have to file a ticket to find out what's happening.

If you're building webhook infrastructure and want the delivery pipeline, retry logic, and event history that power a status page like this, GetHook is available to try free — you can have an endpoint accepting and delivering events in under five minutes.

Stop losing webhook events.

GetHook gives you reliable delivery, automatic retry, and full observability — in minutes.