Most webhook delivery systems discover that a destination is unhealthy the same way: a delivery attempt fails, the retry counter increments, and eventually the event ends up in a dead-letter queue. By then, you've already wasted retry budget on a destination that was never going to accept the event. Worse, the consumer team finds out through an alert — or a customer complaint — not through your dashboard.
Reactive health scoring improves on this by building a failure history per destination and disabling endpoints that consistently fail. But reactive is still backward-looking. You're reacting to failures that have already happened.
Proactive health checks flip the model. You test destinations on a scheduled cadence — independent of live traffic — and route around problems before they affect event delivery. This post covers how to design that system: what to check, how to implement the probe, and how to wire the results into your delivery routing.
## What a Proactive Health Check Actually Tests
A health check for a webhook destination is not the same as an uptime ping. You're not just asking "is this IP reachable?" — you're validating the entire consumer stack:
| Check | What It Validates |
|---|---|
| TCP connectivity | The host is up and the port is open |
| TLS handshake | The certificate is valid and not expired |
| HTTP 200 on probe request | The application is running and responding |
| Response time | The consumer can respond within your timeout budget |
| Signature verification | The consumer correctly validates HMAC signatures |
| Idempotent re-delivery | Sending the same event twice doesn't cause errors |
The last two are specific to webhooks and often skipped. They shouldn't be. A consumer that rejects probe requests with 401 because it can't verify your probe signature is a consumer that will silently drop live events.
## The Probe Request Design
A health probe is a synthetic webhook delivery — a real HTTP POST with a valid signature, a well-formed payload, and a recognizable event type. The consumer should handle it and return 200.
```json
{
  "id": "evt_probe_01JQMR9KXV4B2P7TNWY63ZHFD",
  "type": "system.health_check",
  "api_version": "2026-01",
  "created_at": "2026-04-23T08:00:00Z",
  "livemode": false,
  "data": {
    "object": {
      "probe_id": "probe_01JQMR9KXV4B2P7TNWY63ZHFD",
      "message": "This is a synthetic health check event. No action required."
    }
  }
}
```

Key design decisions in this payload:
- `livemode: false` — this is the most important field. Consumers that correctly handle `livemode` will skip business logic for test events and return 200 without side effects. Consumers that don't check `livemode` will still return 200 (the desired outcome), though they may process the event. That's a consumer bug worth knowing about.
- `type: "system.health_check"` — a dedicated event type tells consumers exactly what this is. Document it in your event catalog. Consumers can explicitly no-op on this type.
- `id` with a `probe_` prefix — makes probes easily filterable in consumer logs.
- Full HMAC signature in the delivery headers — the probe must be signed with the destination's actual secret, or you won't catch signature misconfiguration (a construction sketch follows below).
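For a consumer to pass the signature check, the probe has to be built and signed exactly like a live event. A minimal sketch, assuming a map-based envelope and an HMAC-SHA256 signature over `timestamp.body`; the helper names and signature format are illustrative, not GetHook's actual scheme:

```go
import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// buildProbePayload assembles the synthetic event shown above.
func buildProbePayload(probeID string) []byte {
	event := map[string]any{
		"id":          "evt_" + probeID,
		"type":        "system.health_check",
		"api_version": "2026-01",
		"created_at":  time.Now().UTC().Format(time.RFC3339),
		"livemode":    false,
		"data": map[string]any{
			"object": map[string]any{
				"probe_id": probeID,
				"message":  "This is a synthetic health check event. No action required.",
			},
		},
	}
	body, _ := json.Marshal(event) // static structure; marshal cannot fail for these types
	return body
}

// signPayload computes an HMAC-SHA256 over "timestamp.body" using the
// destination's real signing secret, so probes exercise the same
// verification path as live events.
func signPayload(secret string, timestamp int64, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	fmt.Fprintf(mac, "%d.", timestamp)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}
```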
## Implementing the Probe Worker
The probe worker runs on a schedule — every 60 seconds is reasonable for destinations with active traffic; every 5 minutes for idle ones. It should be a separate execution path from your main delivery worker, not sharing the same queue.
```go
type HealthProber struct {
	destinations DestinationStore
	delivery     *Forwarder
	results      HealthResultStore
}

type ProbeResult struct {
	DestinationID  string
	ProbeID        string
	Timestamp      time.Time
	Healthy        bool
	StatusCode     int
	LatencyMs      int64
	Error          string
	TLSExpiresDays int
}

func (p *HealthProber) ProbeAll(ctx context.Context) error {
	dsts, err := p.destinations.ListActive(ctx)
	if err != nil {
		return fmt.Errorf("listing destinations: %w", err)
	}

	var wg sync.WaitGroup
	results := make(chan ProbeResult, len(dsts))
	for _, dst := range dsts {
		wg.Add(1)
		go func(d Destination) {
			defer wg.Done()
			result := p.probe(ctx, d)
			results <- result
		}(dst)
	}
	wg.Wait()
	close(results)

	for r := range results {
		if err := p.results.Store(ctx, r); err != nil {
			// log but don't fail — we still want to process remaining results
			log.Printf("storing probe result for %s: %v", r.DestinationID, err)
		}
	}
	return nil
}

func (p *HealthProber) probe(ctx context.Context, dst Destination) ProbeResult {
	probeID := newProbeID()
	payload := buildProbePayload(probeID)

	start := time.Now()
	// Deliver using the same forwarder used for live events —
	// this ensures probe results reflect real delivery conditions.
	resp, err := p.delivery.Send(ctx, dst, payload)
	latency := time.Since(start).Milliseconds()

	result := ProbeResult{
		DestinationID: dst.ID,
		ProbeID:       probeID,
		Timestamp:     start,
		LatencyMs:     latency,
	}
	if err != nil {
		result.Healthy = false
		result.Error = err.Error()
		return result
	}
	result.StatusCode = resp.StatusCode
	result.Healthy = resp.StatusCode >= 200 && resp.StatusCode < 300
	result.TLSExpiresDays = daysUntilTLSExpiry(dst.URL)
	return result
}
```

The critical line is using the same forwarder (`p.delivery.Send`) for probe delivery. If you implement probe delivery differently from live delivery, your health check results won't reflect real conditions. Network paths, connection pooling, TLS behavior, and header handling should all be identical between probes and live events.
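Running the cycle is a plain ticker loop. A sketch, with `RunProbeLoop` as an assumed entry point rather than part of the interfaces above:

```go
// RunProbeLoop drives ProbeAll on a fixed cadence until the context is
// cancelled. A production version would stagger probes and use the longer
// 5-minute interval for idle destinations, as discussed above.
func (p *HealthProber) RunProbeLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := p.ProbeAll(ctx); err != nil {
				log.Printf("probe cycle failed: %v", err)
			}
		}
	}
}
```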
## Storing and Querying Probe Results
Probe results need to support two access patterns: "is this destination healthy right now?" and "what's this destination's health trend over the last 24 hours?"
```sql
CREATE TABLE destination_health_probes (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    destination_id   UUID NOT NULL REFERENCES destinations(id) ON DELETE CASCADE,
    probe_id         TEXT NOT NULL,
    probed_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    healthy          BOOLEAN NOT NULL,
    status_code      INTEGER,
    latency_ms       INTEGER,
    tls_expires_days INTEGER,
    error_message    TEXT,
    UNIQUE (destination_id, probe_id)
);

CREATE INDEX idx_dhp_destination_time
    ON destination_health_probes (destination_id, probed_at DESC);
```

To compute current health status from recent probes:
```sql
-- Health window: last 5 probes.
-- Destination is "degraded" if >= 2 of the last 5 probes failed;
-- "down" if all 5 of the last 5 probes failed.
SELECT
    destination_id,
    COUNT(*) FILTER (WHERE healthy = false) AS failures,
    COUNT(*) AS total,
    CASE
        WHEN COUNT(*) FILTER (WHERE healthy = false) = COUNT(*) THEN 'down'
        WHEN COUNT(*) FILTER (WHERE healthy = false) >= 2 THEN 'degraded'
        ELSE 'healthy'
    END AS health_status,
    MAX(probed_at) AS last_probe_at
FROM (
    SELECT *
    FROM destination_health_probes
    WHERE destination_id = $1
    ORDER BY probed_at DESC
    LIMIT 5
) recent
GROUP BY destination_id;
```

A sliding window of five probes gives you fast detection without false positives from transient blips. Adjust the window size based on your probe interval — with 60-second probes, five probes cover five minutes.
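On the delivery side, a thin wrapper keeps that classification in one place. A sketch using `database/sql`, which treats a destination with no probe history as healthy (an assumption; you may prefer to block delivery until a first probe succeeds):

```go
// CurrentHealth classifies a destination from its last five probes,
// mirroring the SQL above. With no probe history, COUNT(*) = 0 would
// otherwise satisfy the 'down' branch, so it is handled explicitly.
func CurrentHealth(ctx context.Context, db *sql.DB, destinationID string) (string, error) {
	const q = `
		SELECT CASE
			WHEN COUNT(*) = 0 THEN 'healthy'
			WHEN COUNT(*) FILTER (WHERE healthy = false) = COUNT(*) THEN 'down'
			WHEN COUNT(*) FILTER (WHERE healthy = false) >= 2 THEN 'degraded'
			ELSE 'healthy'
		END
		FROM (
			SELECT healthy
			FROM destination_health_probes
			WHERE destination_id = $1
			ORDER BY probed_at DESC
			LIMIT 5
		) recent`
	var status string
	err := db.QueryRowContext(ctx, q, destinationID).Scan(&status)
	return status, err
}
```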
## Routing Around Unhealthy Destinations
Health check data is only useful if you act on it. There are three escalation levels you should implement:
| Status | Probe Results | Action |
|---|---|---|
| `healthy` | 0–1 failures in last 5 probes | Route live events normally |
| `degraded` | 2–4 failures in last 5 probes | Route live events; surface warning in dashboard |
| `down` | 5/5 failures in last 5 probes | Hold events in a pending queue; alert the consumer team |
The "hold" behavior for down destinations deserves careful design. You have two options:
Option A: Queue events and deliver when health returns. Events accumulate while the destination is down. When it recovers (two consecutive successful probes is a reasonable threshold), you flush the queue in order. The consumer sees no gaps — just a delayed batch. This is the right choice for business data where completeness matters more than latency.
Option B: Reject new events immediately and alert. The producer gets a 4xx response indicating the destination is unhealthy. This is appropriate for real-time alerting use cases where a stale alert delivered two hours late is worse than no alert.
Make this configurable per destination. A payment processing endpoint warrants Option A; a Slack notification endpoint might warrant Option B.
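Wiring the policy into the delivery path might look like the sketch below. `HoldPolicy`, the hold queue, and `ErrDestinationDown` are illustrative names, and the `Forwarder` is assumed to carry a DB handle and a hold queue; none of this is a prescribed API:

```go
// Hypothetical per-destination policy knob for the two options above.
type HoldPolicy int

const (
	HoldWhenDown   HoldPolicy = iota // Option A: queue, flush on recovery
	RejectWhenDown                   // Option B: fail fast and alert
)

var ErrDestinationDown = errors.New("destination unhealthy: delivery rejected")

// routeEvent consults probe-derived health before attempting delivery.
func (f *Forwarder) routeEvent(ctx context.Context, dst Destination, payload []byte) error {
	status, err := CurrentHealth(ctx, f.db, dst.ID)
	if err != nil {
		return fmt.Errorf("health lookup for %s: %w", dst.ID, err)
	}
	switch status {
	case "down":
		if dst.HoldPolicy == HoldWhenDown {
			return f.holdQueue.Enqueue(ctx, dst.ID, payload) // flushed after recovery
		}
		return ErrDestinationDown // producer gets an immediate rejection
	case "degraded":
		log.Printf("delivering to degraded destination %s", dst.ID) // also surfaced in dashboard
	}
	_, err = f.Send(ctx, dst, payload)
	return err
}
```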
## TLS Certificate Expiry Monitoring
Expired TLS certificates are a frequent and entirely preventable source of webhook delivery failures. Your probe worker already has the destination URL; checking certificate expiry adds roughly 20ms per probe and eliminates that entire class of outage.
```go
func daysUntilTLSExpiry(rawURL string) int {
	u, err := url.Parse(rawURL)
	if err != nil || u.Scheme != "https" {
		return -1
	}
	// Respect an explicit port in the URL; default to 443 otherwise.
	addr := u.Host
	if u.Port() == "" {
		addr = net.JoinHostPort(u.Hostname(), "443")
	}
	// A dial timeout keeps one hung host from stalling the probe cycle.
	conn, err := tls.DialWithDialer(&net.Dialer{Timeout: 5 * time.Second},
		"tcp", addr, &tls.Config{})
	if err != nil {
		return 0 // can't connect or cert is already invalid
	}
	defer conn.Close()

	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return 0
	}
	days := int(time.Until(certs[0].NotAfter).Hours() / 24)
	if days < 0 {
		return 0
	}
	return days
}
```

Surface TLS expiry warnings at 30 days and 7 days. At 7 days, escalate to an email alert. At expiry, the destination transitions to `down` and you should surface the specific error — "TLS certificate expired" — not just "delivery failed."
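A small classifier can turn the probe's `TLSExpiresDays` into those escalation levels (the level names are illustrative):

```go
// tlsAlertLevel maps days-until-expiry to an escalation level using the
// 30-day and 7-day thresholds above. -1 means expiry couldn't be
// determined (non-HTTPS destination), per daysUntilTLSExpiry.
func tlsAlertLevel(daysLeft int) string {
	switch {
	case daysLeft < 0:
		return "unknown"
	case daysLeft == 0:
		return "expired" // destination transitions to down
	case daysLeft <= 7:
		return "critical" // escalate to an email alert
	case daysLeft <= 30:
		return "warning" // surface in the dashboard
	default:
		return "ok"
	}
}
```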
## Surfacing Health Data to Consumers
Raw probe results belong in your internal observability stack. What belongs in your customer-facing dashboard is a simplified health summary that answers one question: "Is my endpoint ready to receive events?"
A good consumer health view shows:
- Current status: `healthy` / `degraded` / `down`
- Last probe time and result
- Average latency over the last 24 hours (p50 and p99)
- TLS certificate expiry date
- A timeline of status changes (so they can see "went degraded at 14:32, recovered at 14:41")
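One way to shape that view as an API response; the field names here are an assumption, not GetHook's actual schema:

```go
// DestinationHealth is the consumer-facing summary: everything in the
// list above, nothing from the raw probe table.
type DestinationHealth struct {
	Status        string         `json:"status"` // healthy | degraded | down
	LastProbeAt   time.Time      `json:"last_probe_at"`
	LastProbeOK   bool           `json:"last_probe_ok"`
	LatencyP50Ms  int64          `json:"latency_p50_ms"` // over the last 24 hours
	LatencyP99Ms  int64          `json:"latency_p99_ms"`
	TLSExpiresAt  *time.Time     `json:"tls_expires_at,omitempty"`
	StatusChanges []StatusChange `json:"status_changes"` // recent timeline
}

// StatusChange is one entry in the status timeline.
type StatusChange struct {
	From string    `json:"from"`
	To   string    `json:"to"`
	At   time.Time `json:"at"`
}
```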
GetHook exposes destination health status in the dashboard alongside delivery metrics. When a probe detects a degraded destination, the dashboard surfaces a banner with the failure detail before any live events are affected — giving your consumers a chance to investigate and fix the issue before it causes a gap in their event stream.
## Avoiding Probe Noise in Consumer Metrics
One operational concern: probes generate HTTP traffic that will appear in your consumers' access logs, APM dashboards, and error budgets. If you're running probes every 60 seconds across 500 destinations, that's 500 requests per minute of synthetic traffic your consumers didn't ask for.
Mitigate this with two conventions:
- Use a consistent `User-Agent` header — `GetHook-HealthProbe/1.0 (+https://docs.gethook.to/health-probes)`. Consumers can filter this in their metrics.
- Document the `system.health_check` event type in your event catalog and tell consumers to no-op on it. Add a code example showing how to skip it in one line.
```go
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	var event WebhookEvent
	if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	// Acknowledge health probes immediately without processing
	if event.Type == "system.health_check" {
		w.WriteHeader(http.StatusOK)
		return
	}
	// Normal event handling below
	processEvent(event)
	w.WriteHeader(http.StatusOK)
}
```

This pattern is safe to ship. The probe validates that the consumer endpoint is reachable and responds to authenticated requests — the two things that matter for delivery reliability. Whether the consumer processes the probe payload is irrelevant.
## When Probes Are Not Enough
Proactive probes catch infrastructure failures: the server is down, TLS is expired, the application isn't starting. What they don't catch is application-level correctness — a consumer that returns 200 but silently discards events it can't parse, or one that has a bug in its event handler that causes invisible data corruption.
For those problems, you need end-to-end integration tests running against a staging endpoint. Probes and integration tests are complementary, not substitutes. Use probes for continuous operational health; use integration tests for correctness guarantees at deploy time.
Proactive health checks are low-cost infrastructure with high-leverage outcomes. A 60-second probe interval means you detect failures within a minute, hold events automatically, and alert consumers — all before the on-call engineer gets paged. That's the difference between a 5-minute incident and a silent three-hour gap in event delivery.
If you want delivery infrastructure that monitors your consumers and routes around failures automatically, start with GetHook.