Most webhook delivery systems discover that a destination is unhealthy the same way: a delivery attempt fails, the retry counter increments, and eventually the event ends up in a dead-letter queue. By then, you've already wasted retry budget on a destination that was never going to accept the event. Worse, the consumer team finds out through an alert — or a customer complaint — not through your dashboard.
Reactive health scoring improves on this by building a failure history per destination and disabling endpoints that consistently fail. But reactive is still backward-looking. You're reacting to failures that have already happened.
Proactive health checks flip the model. You test destinations on a scheduled cadence — independent of live traffic — and route around problems before they affect event delivery. This post covers how to design that system: what to check, how to implement the probe, and how to wire the results into your delivery routing.
## What a Proactive Health Check Actually Tests
A health check for a webhook destination is not the same as an uptime ping. You're not just asking "is this IP reachable?" — you're validating the entire consumer stack:
| Check | What It Validates |
|---|---|
| TCP connectivity | The host is up and the port is open |
| TLS handshake | The certificate is valid and not expired |
| HTTP 200 on probe request | The application is running and responding |
| Response time | The consumer can respond within your timeout budget |
| Signature verification | The consumer correctly validates HMAC signatures |
| Idempotent re-delivery | Sending the same event twice doesn't cause errors |
The last two are specific to webhooks and often skipped. They shouldn't be. A consumer that rejects probe requests with 401 because it can't verify your probe signature is a consumer that will silently drop live events.
## The Probe Request Design
A health probe is a synthetic webhook delivery — a real HTTP POST with a valid signature, a well-formed payload, and a recognizable event type. The consumer should handle it and return 200.
```json
{
  "id": "evt_probe_01JQMR9KXV4B2P7TNWY63ZHFD",
  "type": "system.health_check",
  "api_version": "2026-01",
  "created_at": "2026-04-23T08:00:00Z",
  "livemode": false,
  "data": {
    "object": {
      "probe_id": "probe_01JQMR9KXV4B2P7TNWY63ZHFD",
      "message": "This is a synthetic health check event. No action required."
    }
  }
}
```

Key design decisions in this payload:
- `livemode: false` — this is the most important field. Consumers that correctly handle `livemode` will skip business logic for test events and return 200 without side effects. Consumers that don't check `livemode` will still return 200 (the desired outcome), though they may process the event. That's a consumer bug worth knowing about.
- `type: "system.health_check"` — a dedicated event type tells consumers exactly what this is. Document it in your event catalog. Consumers can explicitly no-op on this type.
- `id` with a `probe_` prefix — makes probes easily filterable in consumer logs.
- Full HMAC signature in the delivery headers — the probe must be signed with the destination's actual secret, or you won't catch signature misconfiguration (a construction sketch follows below).
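For a consumer to pass the signature check, the probe has to be built and signed exactly like a live event. A minimal sketch, assuming a map-based envelope and an HMAC-SHA256 signature over `timestamp.body`; the helper names and signature format are illustrative, not GetHook's actual scheme:

```go
import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// buildProbePayload assembles the synthetic event shown above.
func buildProbePayload(probeID string) []byte {
	event := map[string]any{
		"id":          "evt_" + probeID,
		"type":        "system.health_check",
		"api_version": "2026-01",
		"created_at":  time.Now().UTC().Format(time.RFC3339),
		"livemode":    false,
		"data": map[string]any{
			"object": map[string]any{
				"probe_id": probeID,
				"message":  "This is a synthetic health check event. No action required.",
			},
		},
	}
	body, _ := json.Marshal(event) // static structure; marshal cannot fail for these types
	return body
}

// signPayload computes an HMAC-SHA256 over "timestamp.body" using the
// destination's real signing secret, so probes exercise the same
// verification path as live events.
func signPayload(secret string, timestamp int64, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	fmt.Fprintf(mac, "%d.", timestamp)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}
```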
## Implementing the Probe Worker
The probe worker runs on a schedule — every 60 seconds is reasonable for destinations with active traffic; every 5 minutes for idle ones. It should be a separate execution path from your main delivery worker, not sharing the same queue.
```go
type HealthProber struct {
	destinations DestinationStore
	delivery     *Forwarder
	results      HealthResultStore
}

type ProbeResult struct {
	DestinationID  string
	ProbeID        string
	Timestamp      time.Time
	Healthy        bool
	StatusCode     int
	LatencyMs      int64
	Error          string
	TLSExpiresDays int
}

func (p *HealthProber) ProbeAll(ctx context.Context) error {
	dsts, err := p.destinations.ListActive(ctx)
	if err != nil {
		return fmt.Errorf("listing destinations: %w", err)
	}

	var wg sync.WaitGroup
	results := make(chan ProbeResult, len(dsts))
	for _, dst := range dsts {
		wg.Add(1)
		go func(d Destination) {
			defer wg.Done()
			result := p.probe(ctx, d)
			results <- result
		}(dst)
	}
	wg.Wait()
	close(results)

	for r := range results {
		if err := p.results.Store(ctx, r); err != nil {
			// log but don't fail — we still want to process remaining results
			log.Printf("storing probe result for %s: %v", r.DestinationID, err)
		}
	}
	return nil
}

func (p *HealthProber) probe(ctx context.Context, dst Destination) ProbeResult {
	probeID := newProbeID()
	payload := buildProbePayload(probeID)

	start := time.Now()
	// Deliver using the same forwarder used for live events —
	// this ensures probe results reflect real delivery conditions.
	resp, err := p.delivery.Send(ctx, dst, payload)
	latency := time.Since(start).Milliseconds()

	result := ProbeResult{
		DestinationID: dst.ID,
		ProbeID:       probeID,
		Timestamp:     start,
		LatencyMs:     latency,
	}
	if err != nil {
		result.Healthy = false
		result.Error = err.Error()
		return result
	}
	result.StatusCode = resp.StatusCode
	result.Healthy = resp.StatusCode >= 200 && resp.StatusCode < 300
	result.TLSExpiresDays = daysUntilTLSExpiry(dst.URL)
	return result
}
```

The critical line is using the same forwarder (`p.delivery.Send`) for probe delivery. If you implement probe delivery differently from live delivery, your health check results won't reflect real conditions. Network paths, connection pooling, TLS behavior, and header handling should all be identical between probes and live events.
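Running the cycle is a plain ticker loop. A sketch, with `RunProbeLoop` as an assumed entry point rather than part of the interfaces above:

```go
// RunProbeLoop drives ProbeAll on a fixed cadence until the context is
// cancelled. A production version would stagger probes and use the longer
// 5-minute interval for idle destinations, as discussed above.
func (p *HealthProber) RunProbeLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := p.ProbeAll(ctx); err != nil {
				log.Printf("probe cycle failed: %v", err)
			}
		}
	}
}
```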
## Storing and Querying Probe Results
Probe results need to support two access patterns: "is this destination healthy right now?" and "what's this destination's health trend over the last 24 hours?"
```sql
CREATE TABLE destination_health_probes (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    destination_id   UUID NOT NULL REFERENCES destinations(id) ON DELETE CASCADE,
    probe_id         TEXT NOT NULL,
    probed_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    healthy          BOOLEAN NOT NULL,
    status_code      INTEGER,
    latency_ms       INTEGER,
    tls_expires_days INTEGER,
    error_message    TEXT,
    UNIQUE (destination_id, probe_id)
);

CREATE INDEX idx_dhp_destination_time
    ON destination_health_probes (destination_id, probed_at DESC);
```

To compute current health status from recent probes:
```sql
-- Health window: last 5 probes.
-- Destination is "degraded" if >= 2 of the last 5 probes failed;
-- "down" if all 5 of the last 5 probes failed.
SELECT
    destination_id,
    COUNT(*) FILTER (WHERE healthy = false) AS failures,
    COUNT(*) AS total,
    CASE
        WHEN COUNT(*) FILTER (WHERE healthy = false) = COUNT(*) THEN 'down'
        WHEN COUNT(*) FILTER (WHERE healthy = false) >= 2 THEN 'degraded'
        ELSE 'healthy'
    END AS health_status,
    MAX(probed_at) AS last_probe_at
FROM (
    SELECT *
    FROM destination_health_probes
    WHERE destination_id = $1
    ORDER BY probed_at DESC
    LIMIT 5
) recent
GROUP BY destination_id;
```

A sliding window of five probes gives you fast detection without false positives from transient blips. Adjust the window size based on your probe interval — with 60-second probes, five probes cover five minutes.
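On the delivery side, a thin wrapper keeps that classification in one place. A sketch using `database/sql`, which treats a destination with no probe history as healthy (an assumption; you may prefer to block delivery until a first probe succeeds):

```go
// CurrentHealth classifies a destination from its last five probes,
// mirroring the SQL above. With no probe history, COUNT(*) = 0 would
// otherwise satisfy the 'down' branch, so it is handled explicitly.
func CurrentHealth(ctx context.Context, db *sql.DB, destinationID string) (string, error) {
	const q = `
		SELECT CASE
			WHEN COUNT(*) = 0 THEN 'healthy'
			WHEN COUNT(*) FILTER (WHERE healthy = false) = COUNT(*) THEN 'down'
			WHEN COUNT(*) FILTER (WHERE healthy = false) >= 2 THEN 'degraded'
			ELSE 'healthy'
		END
		FROM (
			SELECT healthy
			FROM destination_health_probes
			WHERE destination_id = $1
			ORDER BY probed_at DESC
			LIMIT 5
		) recent`
	var status string
	err := db.QueryRowContext(ctx, q, destinationID).Scan(&status)
	return status, err
}
```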
## Routing Around Unhealthy Destinations
Health check data is only useful if you act on it. There are three escalation levels you should implement:
| Status | Probe Results | Action |
|---|---|---|
| `healthy` | 0–1 failures in last 5 probes | Route live events normally |
| `degraded` | 2–4 failures in last 5 probes | Route live events; surface warning in dashboard |
| `down` | 5/5 failures in last 5 probes | Hold events in a pending queue; alert the consumer team |
The "hold" behavior for down destinations deserves careful design. You have two options:
Option A: Queue events and deliver when health returns. Events accumulate while the destination is down. When it recovers (two consecutive successful probes is a reasonable threshold), you flush the queue in order. The consumer sees no gaps — just a delayed batch. This is the right choice for business data where completeness matters more than latency.
Option B: Reject new events immediately and alert. The producer gets a 4xx response indicating the destination is unhealthy. This is appropriate for real-time alerting use cases where a stale alert delivered two hours late is worse than no alert.
Make this configurable per destination. A payment processing endpoint warrants Option A; a Slack notification endpoint might warrant Option B.
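Wiring the policy into the delivery path might look like the sketch below. `HoldPolicy`, the hold queue, and `ErrDestinationDown` are illustrative names, and the `Forwarder` is assumed to carry a DB handle and a hold queue; none of this is a prescribed API:

```go
// Hypothetical per-destination policy knob for the two options above.
type HoldPolicy int

const (
	HoldWhenDown   HoldPolicy = iota // Option A: queue, flush on recovery
	RejectWhenDown                   // Option B: fail fast and alert
)

var ErrDestinationDown = errors.New("destination unhealthy: delivery rejected")

// routeEvent consults probe-derived health before attempting delivery.
func (f *Forwarder) routeEvent(ctx context.Context, dst Destination, payload []byte) error {
	status, err := CurrentHealth(ctx, f.db, dst.ID)
	if err != nil {
		return fmt.Errorf("health lookup for %s: %w", dst.ID, err)
	}
	switch status {
	case "down":
		if dst.HoldPolicy == HoldWhenDown {
			return f.holdQueue.Enqueue(ctx, dst.ID, payload) // flushed after recovery
		}
		return ErrDestinationDown // producer gets an immediate rejection
	case "degraded":
		log.Printf("delivering to degraded destination %s", dst.ID) // also surfaced in dashboard
	}
	_, err = f.Send(ctx, dst, payload)
	return err
}
```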
## TLS Certificate Expiry Monitoring
Expired TLS certificates are a frequent and entirely preventable source of webhook delivery failures. Your probe worker already has the destination URL; checking certificate expiry adds roughly 20ms per probe and eliminates that entire class of outage.
```go
func daysUntilTLSExpiry(rawURL string) int {
	u, err := url.Parse(rawURL)
	if err != nil || u.Scheme != "https" {
		return -1
	}
	// Respect an explicit port in the URL; default to 443 otherwise.
	addr := u.Host
	if u.Port() == "" {
		addr = net.JoinHostPort(u.Hostname(), "443")
	}
	// A dial timeout keeps one hung host from stalling the probe cycle.
	conn, err := tls.DialWithDialer(&net.Dialer{Timeout: 5 * time.Second},
		"tcp", addr, &tls.Config{})
	if err != nil {
		return 0 // can't connect or cert is already invalid
	}
	defer conn.Close()

	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return 0
	}
	days := int(time.Until(certs[0].NotAfter).Hours() / 24)
	if days < 0 {
		return 0
	}
	return days
}
```

Surface TLS expiry warnings at 30 days and 7 days. At 7 days, escalate to an email alert. At expiry, the destination transitions to `down` and you should surface the specific error — "TLS certificate expired" — not just "delivery failed."
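A small classifier can turn the probe's `TLSExpiresDays` into those escalation levels (the level names are illustrative):

```go
// tlsAlertLevel maps days-until-expiry to an escalation level using the
// 30-day and 7-day thresholds above. -1 means expiry couldn't be
// determined (non-HTTPS destination), per daysUntilTLSExpiry.
func tlsAlertLevel(daysLeft int) string {
	switch {
	case daysLeft < 0:
		return "unknown"
	case daysLeft == 0:
		return "expired" // destination transitions to down
	case daysLeft <= 7:
		return "critical" // escalate to an email alert
	case daysLeft <= 30:
		return "warning" // surface in the dashboard
	default:
		return "ok"
	}
}
```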
## Surfacing Health Data to Consumers
Raw probe results belong in your internal observability stack. What belongs in your customer-facing dashboard is a simplified health summary that answers one question: "Is my endpoint ready to receive events?"
A good consumer health view shows:
- Current status: `healthy` / `degraded` / `down`
- Last probe time and result
- Average latency over the last 24 hours (p50 and p99)
- TLS certificate expiry date
- A timeline of status changes (so they can see "went degraded at 14:32, recovered at 14:41")
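One way to shape that view as an API response; the field names here are an assumption, not GetHook's actual schema:

```go
// DestinationHealth is the consumer-facing summary: everything in the
// list above, nothing from the raw probe table.
type DestinationHealth struct {
	Status        string         `json:"status"` // healthy | degraded | down
	LastProbeAt   time.Time      `json:"last_probe_at"`
	LastProbeOK   bool           `json:"last_probe_ok"`
	LatencyP50Ms  int64          `json:"latency_p50_ms"` // over the last 24 hours
	LatencyP99Ms  int64          `json:"latency_p99_ms"`
	TLSExpiresAt  *time.Time     `json:"tls_expires_at,omitempty"`
	StatusChanges []StatusChange `json:"status_changes"` // recent timeline
}

// StatusChange is one entry in the status timeline.
type StatusChange struct {
	From string    `json:"from"`
	To   string    `json:"to"`
	At   time.Time `json:"at"`
}
```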
GetHook exposes destination health status in the dashboard alongside delivery metrics. When a probe detects a degraded destination, the dashboard surfaces a banner with the failure detail before any live events are affected — giving your consumers a chance to investigate and fix the issue before it causes a gap in their event stream.
## Avoiding Probe Noise in Consumer Metrics
One operational concern: probes generate HTTP traffic that will appear in your consumers' access logs, APM dashboards, and error budgets. If you're running probes every 60 seconds across 500 destinations, that's 500 requests per minute of synthetic traffic your consumers didn't ask for.
Mitigate this with two conventions:
- Use a consistent `User-Agent` header — `GetHook-HealthProbe/1.0 (+https://docs.gethook.to/health-probes)`. Consumers can filter this in their metrics.
- Document the `system.health_check` event type in your event catalog and tell consumers to no-op on it. Add a code example showing how to skip it in one line.
```go
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	var event WebhookEvent
	if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	// Acknowledge health probes immediately without processing
	if event.Type == "system.health_check" {
		w.WriteHeader(http.StatusOK)
		return
	}
	// Normal event handling below
	processEvent(event)
	w.WriteHeader(http.StatusOK)
}
```

This pattern is safe to ship. The probe validates that the consumer endpoint is reachable and responds to authenticated requests — the two things that matter for delivery reliability. Whether the consumer processes the probe payload is irrelevant.
## When Probes Are Not Enough
Proactive probes catch infrastructure failures: the server is down, TLS is expired, the application isn't starting. What they don't catch is application-level correctness — a consumer that returns 200 but silently discards events it can't parse, or one that has a bug in its event handler that causes invisible data corruption.
For those problems, you need end-to-end integration tests running against a staging endpoint. Probes and integration tests are complementary, not substitutes. Use probes for continuous operational health; use integration tests for correctness guarantees at deploy time.
Proactive health checks are low-cost infrastructure with high-leverage outcomes. A 60-second probe interval means you detect failures within a minute, hold events automatically, and alert consumers — all before the on-call engineer gets paged. That's the difference between a 5-minute incident and a silent three-hour gap in event delivery.
If you want delivery infrastructure that monitors your consumers and routes around failures automatically, start with GetHook.