Every webhook delivery worker I've seen in the wild starts with a fixed timeout — typically 10 or 30 seconds — applied uniformly to every destination. That default is pragmatic when you're getting started. It becomes actively harmful once you have a diverse set of destinations with very different performance characteristics.
A fast internal service that normally responds in 80ms will be handled correctly by a 10-second timeout. But a partner integration that consistently takes 7 seconds to respond — because it's doing synchronous database lookups before acknowledging — will trip that same timeout sporadically, generating false delivery failures that land in your retry queue and eventually your dead-letter queue. Meanwhile, a timeout this generous for a fast destination means that when that destination actually goes down, you're holding an open connection for 10 seconds per attempt rather than failing fast.
The right approach is per-destination timeout policies derived from observed delivery latency. This post covers how to collect the data, how to compute adaptive thresholds, and the operational considerations involved in rolling them out.
## The Cost of a Miscalibrated Timeout
Before getting into implementation, it's worth being precise about what a wrong timeout costs you.
Too-short timeouts (false failures):
- Events that would have succeeded get marked as failures
- Retry capacity is consumed by events that would have self-resolved
- Destination owners see a failure rate that doesn't reflect their actual reliability
- Alert thresholds that should indicate real outages get drowned in noise
Too-long timeouts (slow failure detection):
- Workers hold open connections to failed destinations for the full timeout duration
- Worker pool exhaustion when many destinations fail simultaneously
- Delayed detection of destination outages — your alert fires minutes late
- Unnecessary backpressure on the entire delivery pipeline
For a delivery worker pool with a concurrency limit of 100, a 30-second timeout against a destination that just went down means the pool can sustain at most 100 / 30 ≈ 3.3 failed attempts per second before every worker is pinned waiting on a dead connection. With a 2-second timeout for that destination, the same pool sustains 50 failed attempts per second, roughly fifteen times the headroom. At 200 events/second inbound, neither setting survives a hard outage of a busy destination, but the shorter timeout clears failing attempts fifteen times faster, and that headroom matters when you're trying to shed load gracefully.
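A quick way to see the headroom difference: sustainable attempt throughput against a dead destination is just concurrency divided by timeout. A sketch of that arithmetic:

```go
// sustainableFailuresPerSec is the ceiling on attempts/sec a worker pool can
// absorb against a destination that always runs out the clock.
func sustainableFailuresPerSec(concurrency int, timeout time.Duration) float64 {
    return float64(concurrency) / timeout.Seconds()
}

// sustainableFailuresPerSec(100, 30*time.Second) ≈ 3.3
// sustainableFailuresPerSec(100, 2*time.Second)  = 50
```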
## What to Collect
Your delivery attempts table is the data source. You need the response time (wall clock from connection open to first byte, or to the connection error) for every delivery attempt:

```sql
ALTER TABLE delivery_attempts
    ADD COLUMN response_time_ms INTEGER;
```

Populate this in your delivery worker at the point you close the response body:
```go
start := time.Now()
resp, err := client.Do(req)
elapsed := time.Since(start)

// Store regardless of outcome — latency on failures is also useful signal
attempt.ResponseTimeMs = int(elapsed.Milliseconds())

if err != nil {
    attempt.Outcome = "network_error"
    // ...
    return
}
defer resp.Body.Close()
// read body, record outcome, etc.
```

You want `response_time_ms` for successes, timeouts, and network errors alike. Timeout events tell you the destination was slower than your current timeout — they're inputs to the calibration, not just noise to discard.
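To use those failure samples as calibration input, the worker has to distinguish a deadline expiry from other transport errors. A minimal sketch of that classification; `classifyDeliveryError` is an illustrative helper, not something defined earlier in this post:

```go
import (
    "context"
    "errors"
    "net"
)

// classifyDeliveryError maps a transport error to an attempt outcome.
// A deadline expiry means the destination was slower than the current
// timeout (a calibration signal); anything else is a connectivity problem.
func classifyDeliveryError(err error) string {
    if err == nil {
        return "success"
    }
    if errors.Is(err, context.DeadlineExceeded) {
        return "timeout"
    }
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return "timeout"
    }
    return "network_error"
}
```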
## Computing Per-Destination Timeout Thresholds
The threshold you want is not the average response time. Averages are distorted by a small number of very slow responses. You want a high percentile of the response time distribution — typically P95 or P99 — with a safety margin on top.
```sql
SELECT
    destination_id,
    COUNT(*) AS sample_count,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY response_time_ms) AS p50_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95_ms,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time_ms) AS p99_ms,
    MAX(response_time_ms) AS max_ms
FROM delivery_attempts
WHERE
    outcome = 'success'
    AND created_at > now() - INTERVAL '7 days'
GROUP BY destination_id
HAVING COUNT(*) >= 100  -- require a minimum sample size before acting
ORDER BY p99_ms DESC;
```

From this output, the recommended timeout for a destination is:
```
timeout = min(max_timeout, max(min_timeout, p99_ms * multiplier + buffer_ms))
```

Where:

- `p99_ms` — 99th percentile of successful delivery response times over the past 7 days
- `multiplier` — typically 1.5–2.0×; accounts for variance and slow days
- `buffer_ms` — flat buffer (e.g., 500ms) to absorb request overhead
- `min_timeout` — floor below which you never go (e.g., 1,000ms)
- `max_timeout` — ceiling above which you never go (e.g., 30,000ms)
For a destination with a P99 of 4,200ms, that computes to: min(30000, max(1000, 4200 * 1.5 + 500)) = min(30000, max(1000, 6800)) = 6,800ms. That destination gets a 6.8-second timeout instead of a uniform 10 or 30 seconds.
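In code, the computation is just a clamp. A sketch; the constant names are illustrative, not from any particular library:

```go
const (
    minTimeoutMs = 1_000
    maxTimeoutMs = 30_000
    multiplier   = 1.5
    bufferMs     = 500
)

// recommendedTimeoutMs derives a destination's timeout from its observed P99,
// clamped to the configured floor and ceiling.
func recommendedTimeoutMs(p99Ms int) int {
    t := int(float64(p99Ms)*multiplier) + bufferMs
    if t < minTimeoutMs {
        t = minTimeoutMs
    }
    if t > maxTimeoutMs {
        t = maxTimeoutMs
    }
    return t
}

// recommendedTimeoutMs(4200) == 6800, matching the worked example above.
```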
## Where to Store and Apply the Policy
You need a place to store the computed timeout per destination, separate from the delivery attempt data:
```sql
CREATE TABLE destination_timeout_policies (
    destination_id UUID PRIMARY KEY REFERENCES destinations(id),
    timeout_ms     INTEGER NOT NULL,
    computed_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    sample_count   INTEGER NOT NULL,
    p99_ms         INTEGER NOT NULL,
    method         TEXT NOT NULL DEFAULT 'adaptive'
        -- 'adaptive' | 'manual' | 'default'
);
```

Your delivery worker reads the policy at delivery time, falling back to a system default if none exists for the destination:
```go
func (w *Worker) timeoutForDestination(ctx context.Context, destID uuid.UUID) time.Duration {
    policy, err := w.store.GetTimeoutPolicy(ctx, destID)
    if err != nil || policy == nil {
        // No policy (or a read error): fall back to the system-wide default.
        return time.Duration(w.cfg.DefaultTimeoutSeconds) * time.Second
    }
    return time.Duration(policy.TimeoutMs) * time.Millisecond
}

func (w *Worker) deliver(ctx context.Context, event *Event, dest *Destination) error {
    timeout := w.timeoutForDestination(ctx, dest.ID)
    ctx, cancel := context.WithTimeout(ctx, timeout)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodPost, dest.URL, bytes.NewReader(event.Payload))
    if err != nil {
        return err
    }
    // sign, set headers, execute...
}
```

The policy read should be cheap — a single indexed lookup by `destination_id`. Cache the result in memory for a short TTL (e.g., 60 seconds) if you're worried about the database round-trip at high throughput.
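If that lookup becomes a hot path, a small in-process cache in front of the store keeps it off the database. A minimal sketch; the `policyCache` type is hypothetical, shaped to slot in front of `GetTimeoutPolicy`:

```go
type cachedPolicy struct {
    timeout   time.Duration
    fetchedAt time.Time
}

type policyCache struct {
    mu  sync.Mutex
    ttl time.Duration // e.g., 60 * time.Second
    m   map[uuid.UUID]cachedPolicy
}

func (c *policyCache) get(destID uuid.UUID) (time.Duration, bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    entry, ok := c.m[destID]
    if !ok || time.Since(entry.fetchedAt) > c.ttl {
        return 0, false // miss or stale; caller falls through to the store
    }
    return entry.timeout, true
}

func (c *policyCache) put(destID uuid.UUID, timeout time.Duration) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.m[destID] = cachedPolicy{timeout: timeout, fetchedAt: time.Now()}
}
```

A stale entry is at most one TTL behind the table, which is harmless for a value recomputed daily.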
## The Recomputation Job
Policies should be recomputed on a regular cadence. A daily background job is usually sufficient:
| Trigger | When to use |
|---|---|
| Daily scheduled job | Standard operation; picks up gradual drift in destination behavior |
| After a detected timeout spike | Immediate recomputation when false failure rate rises suddenly |
| After manual override | Operator can pin a timeout and lock it from adaptive updates |
| New destination (first 24 hours) | Use the global default; insufficient sample data for computation |
For new destinations, resist the temptation to start computing a policy immediately. With fewer than 100 samples, your P99 is not meaningful — a single slow response can inflate it wildly. The `HAVING COUNT(*) >= 100` clause in the query above enforces this: no policy is computed until the destination has demonstrated sufficient delivery volume.
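The job itself can be small. A sketch using database/sql; it assumes `thresholdQuery` is a trimmed variant of the percentile query above returning `destination_id`, `sample_count`, and `p99_ms`, and the upsert shape is illustrative:

```go
// RecomputePolicies upserts one policy row per qualifying destination,
// leaving manually pinned rows untouched (more on the method column below).
func RecomputePolicies(ctx context.Context, db *sql.DB) error {
    rows, err := db.QueryContext(ctx, thresholdQuery)
    if err != nil {
        return err
    }
    defer rows.Close()

    for rows.Next() {
        var destID string
        var sampleCount int
        var p99Ms float64
        if err := rows.Scan(&destID, &sampleCount, &p99Ms); err != nil {
            return err
        }
        if _, err := db.ExecContext(ctx, `
            INSERT INTO destination_timeout_policies
                (destination_id, timeout_ms, computed_at, sample_count, p99_ms, method)
            VALUES ($1, $2, now(), $3, $4, 'adaptive')
            ON CONFLICT (destination_id) DO UPDATE
                SET timeout_ms   = EXCLUDED.timeout_ms,
                    computed_at  = EXCLUDED.computed_at,
                    sample_count = EXCLUDED.sample_count,
                    p99_ms       = EXCLUDED.p99_ms
                WHERE destination_timeout_policies.method <> 'manual'`,
            destID, recommendedTimeoutMs(int(p99Ms)), sampleCount, int(p99Ms),
        ); err != nil {
            return err
        }
    }
    return rows.Err()
}
```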
## Handling the Transition: From Default to Adaptive
Rolling out adaptive timeouts to a destination that's been running on the default requires care. If your computed timeout is shorter than the current default, some deliveries that currently succeed will now time out — which looks like a regression.
The safe rollout pattern:
1. Compute the adaptive timeout for all qualifying destinations
2. For destinations where the new timeout is shorter than the current setting, apply gradually: set the new timeout to `max(computed, current * 0.75)` for the first week, then `computed` the following week (see the sketch after this list)
3. For destinations where the new timeout is longer (you're loosening), apply immediately — this is strictly better, since you're reducing false failures
4. Monitor the timeout rate per destination in the 24 hours after each change
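The staging rule from step 2 as a function; a sketch, with the week-one flag supplied by whatever rollout state you track:

```go
// stagedTimeoutMs applies the gradual-tightening rule: never cut more than
// 25% below the current setting in the first week; loosen immediately.
func stagedTimeoutMs(computedMs, currentMs int, firstWeek bool) int {
    if computedMs >= currentMs {
        return computedMs // loosening is strictly better; apply at once
    }
    if firstWeek {
        if floor := currentMs * 3 / 4; computedMs < floor {
            return floor
        }
    }
    return computedMs
}
```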
```sql
-- Monitor post-rollout: timeout rate by destination
SELECT
    destination_id,
    COUNT(*) FILTER (WHERE outcome = 'timeout') AS timeouts,
    COUNT(*) AS total,
    ROUND(
        COUNT(*) FILTER (WHERE outcome = 'timeout')::numeric / COUNT(*) * 100,
        1
    ) AS timeout_pct
FROM delivery_attempts
WHERE created_at > now() - INTERVAL '24 hours'
GROUP BY destination_id
HAVING COUNT(*) > 10
ORDER BY timeout_pct DESC;
```

If the timeout rate for a destination increases after shortening its policy, the computed threshold was too aggressive — either the destination's performance is more variable than the P99 suggests, or your multiplier is too low.
## Manual Overrides and the Method Column
Not every timeout should be adaptive. There are legitimate cases where you want to pin a value:
- A destination that's a known slow processor (e.g., it does synchronous PDF generation) where the SLA has been agreed upon with the destination owner
- A high-security destination where you'd rather fail fast and alert than wait for a slow response that might indicate a man-in-the-middle attack
- Internal endpoints under load tests where you want a predictable baseline
The `method` column (`'adaptive'`, `'manual'`, `'default'`) tells your recomputation job to skip manually pinned destinations. This gives operators an escape hatch without losing the benefits of adaptive policies across the rest of the fleet.
In GetHook, destination timeout configuration is exposed through the API so you can set it explicitly or let the system derive it from historical delivery data — both workflows coexist in the same PATCH /v1/destinations/{id} endpoint.
## Connecting Timeouts to Circuit Breakers
Adaptive timeouts reduce false failures, but they're not a substitute for circuit breakers. The two work at different levels:
| Mechanism | What it controls | Time scale |
|---|---|---|
| Adaptive timeout | How long to wait for a single response | Per-request (milliseconds to seconds) |
| Circuit breaker | Whether to attempt delivery at all | Per-destination (minutes to hours) |
An adaptive timeout that's well-calibrated means your circuit breaker sees a more accurate signal. When the destination genuinely degrades — going from P99 of 4 seconds to consistently timing out at 6.8 seconds — the circuit breaker opens faster because each failing attempt resolves in 6.8 seconds rather than 30 seconds. Your circuit breaker's error rate window fills with real failures instead of being diluted by the wait time of a too-generous global timeout.
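To put numbers on that: assuming a simple consecutive-failure breaker (the threshold here is illustrative), time-to-open under a hard outage is roughly the failure threshold times the timeout:

```go
// timeToOpen estimates how long a consecutive-failure circuit breaker waits
// before opening when every attempt to a dead destination runs out the clock.
func timeToOpen(failureThreshold int, timeout time.Duration) time.Duration {
    return time.Duration(failureThreshold) * timeout
}

// With a threshold of 5 consecutive failures:
//   timeToOpen(5, 30*time.Second)        = 2m30s (uniform default)
//   timeToOpen(5, 6800*time.Millisecond) = 34s   (adaptive policy)
```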
## What This Looks Like in Production
After running adaptive timeouts for a fleet of 200 destinations over 60 days, you'd typically expect:
| Metric | Before adaptive timeouts | After adaptive timeouts |
|---|---|---|
| False timeout rate (successful-if-waited) | ~1.8% | ~0.3% |
| Median configured timeout across the fleet | 10,000ms (uniform) | 3,200ms |
| Worker capacity consumed by failed attempts | ~22% | ~11% |
| Mean time to detect destination outage | ~38 seconds | ~14 seconds |
The biggest gain is usually outage detection time, not the false failure rate. When your median timeout drops from 10 seconds to 3.2 seconds, your on-call team sees the alert almost three times sooner. At 3am, that matters.
A fixed timeout is a reasonable starting point. It is not a good long-term operating mode for a webhook infrastructure serving destinations with heterogeneous performance profiles. Per-destination adaptive policies, derived from your own delivery history, cost one database table, one background job, and a small change to your delivery worker. The operational return is faster outage detection, lower false failure rates, and less retry queue pressure — all from data you were already collecting.
If you'd like to explore how GetHook manages delivery policies including per-destination timeout configuration, start here →