Most webhook delivery failures happen during deployments, not during steady-state operation. You restart your worker process to ship a bug fix, three in-flight deliveries get interrupted mid-request, their queue locks are held until expiry, and those events sit in delivering status for the next two minutes until the lock timeout reschedules them. Meanwhile, your monitoring shows a delivery latency spike, a customer opens a ticket, and you spend an hour debugging something that should have been invisible.
Graceful shutdown is the mechanism that prevents this. Done correctly, it lets a worker process receive a shutdown signal, finish the delivery attempts already in progress, release any queue locks it holds, and exit cleanly — without dropping events or leaving the queue in a corrupted state.
This post walks through the implementation: signal handling, in-flight tracking, drain logic, and the edge cases that catch teams off guard.
Why Webhook Workers Are Especially Sensitive to Abrupt Shutdown
Graceful shutdown matters for most long-running services, but webhook workers have properties that make abrupt shutdown particularly damaging.
Exclusive queue locks. Workers that use FOR UPDATE SKIP LOCKED to claim jobs hold a lock on each row for the duration of delivery. If the process is killed before the lock is released, other workers cannot claim that job until the lock expires or the database connection is closed. The connection close is instant on a clean SIGTERM, but the application-level lock timeout is usually set to minutes.
HTTP requests in flight. A delivery attempt is an outbound HTTP request to a customer's endpoint. If you kill the process mid-request, the destination may have received and processed the webhook — triggering business logic on their side — while your system never recorded the outcome. From your perspective, the delivery failed. From theirs, it succeeded. That's a duplicate or phantom event waiting to happen.
State machine consistency. Events transition through statuses: queued → delivering → delivered or retry_scheduled. Abrupt shutdown during a transition can leave an event stuck in delivering with no worker actively handling it — invisible to your retry logic until a separate cleanup process runs.
The Signal Handling Foundation
The first requirement is intercepting SIGTERM and SIGINT before the OS forces a hard kill. In Go, this looks like:
```go
func main() {
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    worker := NewDeliveryWorker(db, cfg)
    go func() {
        if err := worker.Run(ctx); err != nil && !errors.Is(err, context.Canceled) {
            log.Printf("worker exited with error: %v", err)
        }
    }()

    // Block until signal received
    <-ctx.Done()
    log.Println("shutdown signal received, draining in-flight deliveries...")

    // Give the worker time to drain, then force exit
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    if err := worker.Shutdown(shutdownCtx); err != nil {
        log.Printf("shutdown did not complete cleanly: %v", err)
        os.Exit(1)
    }
    log.Println("worker shut down cleanly")
}
```

The key here is signal.NotifyContext. When a signal arrives, ctx is cancelled. The worker's polling loop checks ctx.Done() to stop accepting new jobs. worker.Shutdown then waits for in-flight work to complete before returning.
The 30-second timeout on shutdownCtx is your safety valve. If in-flight deliveries haven't resolved by then — because a destination is slow or hanging — you exit anyway. The Kubernetes default for terminationGracePeriodSeconds is 30 seconds; keep your shutdown timeout slightly under whatever grace period your deployment uses.
Tracking In-Flight Work
To drain correctly, the worker needs to know what's in flight. A sync.WaitGroup is the standard mechanism:
```go
type DeliveryWorker struct {
    db       *sql.DB
    cfg      Config
    wg       sync.WaitGroup
    mu       sync.Mutex
    inflight map[string]context.CancelFunc // job ID → cancel func
}

func (w *DeliveryWorker) dispatch(ctx context.Context, job DeliveryJob) {
    w.wg.Add(1)

    // Detach the per-job context from the worker's ctx (context.WithoutCancel
    // requires Go 1.21+). A shutdown signal cancels ctx to stop new claims, but
    // it must not abort deliveries already in flight; those finish normally,
    // hit the delivery timeout, or are cancelled explicitly via the inflight map.
    jobCtx, cancel := context.WithTimeout(
        context.WithoutCancel(ctx),
        time.Duration(w.cfg.DeliveryTimeoutSeconds)*time.Second,
    )

    w.mu.Lock()
    w.inflight[job.ID] = cancel
    w.mu.Unlock()

    go func() {
        defer w.wg.Done()
        defer func() {
            w.mu.Lock()
            delete(w.inflight, job.ID)
            w.mu.Unlock()
            cancel()
        }()
        w.deliver(jobCtx, job)
    }()
}

func (w *DeliveryWorker) Shutdown(ctx context.Context) error {
    // Wait for in-flight deliveries to finish, or give up when the shutdown
    // deadline passes.
    done := make(chan struct{})
    go func() {
        w.wg.Wait()
        close(done)
    }()

    select {
    case <-done:
        return nil
    case <-ctx.Done():
        return fmt.Errorf("shutdown timed out with %d deliveries in flight", w.activeCount())
    }
}

func (w *DeliveryWorker) activeCount() int {
    w.mu.Lock()
    defer w.mu.Unlock()
    return len(w.inflight)
}
```

dispatch calls wg.Add(1) before launching each delivery goroutine, and the goroutine calls wg.Done() when it finishes. Shutdown blocks on wg.Wait() and returns nil only once every goroutine has exited; if the shutdown deadline passes first, it returns an error with the number of deliveries still in flight.
The inflight map is optional for basic drain logic, but it's useful if you want to cancel in-flight deliveries early when the shutdown deadline is close. Rather than waiting for slow endpoints to time out naturally, you can call their cancel functions once you're within, say, 5 seconds of the deadline.
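One way that early cancellation might look is an extended Shutdown; this is a sketch, and the 5-second buffer and the cancelRemaining helper are choices made for illustration rather than anything prescribed:

```go
// cancelRemaining aborts every delivery still in flight by calling the cancel
// funcs stored in the inflight map.
func (w *DeliveryWorker) cancelRemaining() {
    w.mu.Lock()
    defer w.mu.Unlock()
    for id, cancel := range w.inflight {
        log.Printf("cancelling in-flight delivery %s ahead of shutdown deadline", id)
        cancel()
    }
}

// Shutdown, extended: wait for the drain, but once the deadline is about
// 5 seconds away, cancel whatever is left so locks are released in time.
func (w *DeliveryWorker) Shutdown(ctx context.Context) error {
    done := make(chan struct{})
    go func() {
        w.wg.Wait()
        close(done)
    }()

    var earlyCancel <-chan time.Time
    if deadline, ok := ctx.Deadline(); ok {
        earlyCancel = time.After(time.Until(deadline) - 5*time.Second)
    }

    select {
    case <-done:
        return nil
    case <-earlyCancel:
        w.cancelRemaining()
    case <-ctx.Done():
        return fmt.Errorf("shutdown timed out with %d deliveries in flight", w.activeCount())
    }

    // Cancelled deliveries release their locks and exit; give them the rest of
    // the deadline to do so.
    select {
    case <-done:
        return nil
    case <-ctx.Done():
        return fmt.Errorf("shutdown timed out with %d deliveries in flight", w.activeCount())
    }
}
```

Because cancellation goes through the stored cancel funcs, each aborted delivery still runs its normal release path and frees its lock before the process exits.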
The Poll Loop's Role in Drain
The polling loop itself needs to stop accepting new jobs as soon as the shutdown signal arrives. The pattern is to check ctx.Done() at the top of each iteration:
```go
func (w *DeliveryWorker) Run(ctx context.Context) error {
    ticker := time.NewTicker(w.cfg.PollInterval)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            // Stop accepting new jobs; the poll loop exits here and Shutdown
            // handles the drain
            return ctx.Err()
        case <-ticker.C:
            jobs, err := w.claimJobs(ctx)
            if err != nil {
                if errors.Is(err, context.Canceled) {
                    return err
                }
                log.Printf("error claiming jobs: %v", err)
                continue
            }
            for _, job := range jobs {
                w.dispatch(ctx, job)
            }
        }
    }
}
```

When ctx is cancelled, Run returns from the select without waiting for another tick and claims no more jobs. The goroutines spawned by previous dispatch calls are still running — Shutdown handles waiting for those.
Lock Release Strategies
When a worker claims a job via FOR UPDATE SKIP LOCKED, the lock is held by the database connection. In most Postgres-backed queue implementations, the job is locked for the duration of the worker's transaction or until the worker explicitly marks it as in-progress with a separate locked_at timestamp.
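For reference, here is roughly what a locked_at-style claim query can look like, wrapped in the claimJobs method that Run calls above. The delivery_jobs table and the locked_at, locked_by, status, and scheduled_at columns match the cleanup query below; the batch size, the WorkerID config field, and the destination_url and payload columns are illustrative assumptions rather than a prescribed schema:

```go
// claimJobs grabs a batch of due jobs with FOR UPDATE SKIP LOCKED, stamps them
// with this worker's identity, and moves them to 'delivering'.
func (w *DeliveryWorker) claimJobs(ctx context.Context) ([]DeliveryJob, error) {
    rows, err := w.db.QueryContext(ctx, `
        UPDATE delivery_jobs
        SET status = 'delivering',
            locked_at = now(),
            locked_by = $1
        WHERE id IN (
            SELECT id FROM delivery_jobs
            WHERE status IN ('queued', 'retry_scheduled')
              AND scheduled_at <= now()
            ORDER BY scheduled_at
            LIMIT 10
            FOR UPDATE SKIP LOCKED
        )
        RETURNING id, destination_url, payload`,
        w.cfg.WorkerID,
    )
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var jobs []DeliveryJob
    for rows.Next() {
        var job DeliveryJob
        if err := rows.Scan(&job.ID, &job.DestinationURL, &job.Payload); err != nil {
            return nil, err
        }
        jobs = append(jobs, job)
    }
    return jobs, rows.Err()
}
```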
If you use a locked_at timestamp pattern rather than a live transaction, abrupt shutdown leaves a stale lock. The standard fix is a cleanup query that resets jobs whose locks have expired:
```sql
-- Run periodically to recover stale locks (e.g., every 60 seconds)
UPDATE delivery_jobs
SET locked_at = NULL,
    locked_by = NULL,
    status = 'queued',   -- return the job to a claimable state
    scheduled_at = now()
WHERE locked_at < now() - INTERVAL '2 minutes'
  AND status = 'delivering';
```

This is your backstop, not your primary mechanism. Graceful shutdown should release locks cleanly in the normal path. The stale lock cleanup handles the abnormal path: OOM kills, SIGKILL, infrastructure-level node failures.
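If nothing else in your stack runs scheduled SQL, a small janitor goroutine inside the worker is enough. A sketch, with the method name and the 60-second interval chosen for illustration:

```go
// runStaleLockCleanup periodically resets jobs whose locks have gone stale,
// covering crashes and SIGKILLs that never reached the graceful path.
func (w *DeliveryWorker) runStaleLockCleanup(ctx context.Context) {
    ticker := time.NewTicker(60 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            res, err := w.db.ExecContext(ctx, `
                UPDATE delivery_jobs
                SET locked_at = NULL, locked_by = NULL, status = 'queued', scheduled_at = now()
                WHERE locked_at < now() - INTERVAL '2 minutes'
                  AND status = 'delivering'`)
            if err != nil {
                log.Printf("stale lock cleanup failed: %v", err)
                continue
            }
            if n, err := res.RowsAffected(); err == nil && n > 0 {
                log.Printf("recovered %d stale delivery jobs", n)
            }
        }
    }
}
```

Start it alongside Run with the same root context; every replica can run it, since the query is idempotent.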
For clean shutdown, explicitly update job status before exiting:
```go
func (w *DeliveryWorker) deliver(ctx context.Context, job DeliveryJob) {
    resp, err := w.forwarder.Deliver(ctx, job)
    if err != nil {
        if errors.Is(err, context.Canceled) {
            // Shutdown in progress — release the lock so another worker can pick it up
            w.releaseJob(context.Background(), job.ID)
            return
        }
        w.recordFailure(ctx, job, err)
        return
    }
    w.recordSuccess(ctx, job, resp)
}
```

The context.Background() in releaseJob is intentional. By the time this code runs, ctx may already be cancelled. You need a fresh context to execute the release query — this is a pattern that bites teams who pass ctx all the way down and then wonder why their cleanup queries fail during shutdown.
What "Clean" Actually Means for Each Outcome
Graceful shutdown has different correct behaviors depending on where a delivery is in its lifecycle:
| State at shutdown | Correct behavior |
|---|---|
| Claimed, delivery not yet started | Release lock; reschedule for immediate retry by another worker |
| Delivery in progress, destination responding | Wait for response; record outcome normally |
| Delivery in progress, destination not responding | Wait up to shutdown deadline; cancel if deadline exceeded; release lock |
| Delivery complete, outcome not yet recorded | Use context.Background() to record outcome even if main ctx is cancelled |
| Outcome recorded, job not yet removed from queue | Use context.Background() to complete cleanup |
The hardest case is a delivery where the destination responded but you haven't recorded the outcome yet when shutdown is signalled. Always use a fresh context for the write path after a successful delivery. Losing the record of a successful delivery is worse than the delivery itself failing — it causes unnecessary retries to a destination that already processed the event.
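One way to make that hard to get wrong is to detach inside the recording helper itself, so no call site has to remember to swap contexts. A sketch; the delivered_at column, the *http.Response parameter, and the 5-second write timeout are assumptions of this example:

```go
// recordSuccess persists the outcome of a delivery that already reached the
// destination. It deliberately ignores the caller's context: even if shutdown
// has been signalled, this write must happen, so it gets a fresh context with
// its own short timeout.
func (w *DeliveryWorker) recordSuccess(_ context.Context, job DeliveryJob, resp *http.Response) {
    writeCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    _, err := w.db.ExecContext(writeCtx, `
        UPDATE delivery_jobs
        SET status = 'delivered', locked_at = NULL, locked_by = NULL, delivered_at = now()
        WHERE id = $1`,
        job.ID,
    )
    if err != nil {
        // The delivery happened; losing this write means a duplicate retry later.
        log.Printf("failed to record successful delivery for job %s: %v", job.ID, err)
    }
}
```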
Kubernetes Deployment Considerations
In Kubernetes, the shutdown sequence is:
- Pod receives SIGTERM
- Kubernetes removes the pod from service endpoints (traffic stops routing to it)
- terminationGracePeriodSeconds countdown begins (default: 30s)
- After the grace period, SIGKILL is sent
Set terminationGracePeriodSeconds to match your maximum expected delivery time plus a buffer. If your delivery timeout is 10 seconds and your shutdown drain is 30 seconds, set terminationGracePeriodSeconds to 45.
```yaml
spec:
  terminationGracePeriodSeconds: 45
  containers:
    - name: webhook-worker
      lifecycle:
        preStop:
          exec:
            # Give the pod a moment to stop receiving new work before SIGTERM
            command: ["/bin/sleep", "5"]
```

The preStop hook is a common addition: it introduces a brief delay before the SIGTERM so that the load balancer has time to drain connections before the process begins shutting down. For a pure worker process that doesn't serve HTTP traffic, this is less critical — but if your worker also exposes a health endpoint, the preStop sleep prevents health check failures from triggering unnecessary restarts during rolling deploys.
Testing Graceful Shutdown
Graceful shutdown logic is easy to get wrong and hard to notice until production. A minimal test:
```go
func TestWorkerGracefulShutdown(t *testing.T) {
    db := setupTestDB(t)
    worker := NewDeliveryWorker(db, testConfig())

    // Enqueue a job that takes 200ms to deliver
    enqueueSlowJob(t, db, 200*time.Millisecond)

    ctx, cancel := context.WithCancel(context.Background())
    done := make(chan error, 1)
    go func() {
        done <- worker.Run(ctx)
    }()

    // Let the worker claim the job
    time.Sleep(50 * time.Millisecond)

    // Signal shutdown
    cancel()

    shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer shutdownCancel()
    if err := worker.Shutdown(shutdownCtx); err != nil {
        t.Fatalf("shutdown failed: %v", err)
    }

    // Verify the job was delivered (not left in 'delivering' status)
    assertJobStatus(t, db, "delivered")
}
```

Run this test with -race to catch any data races in your in-flight tracking. Run it with a shortened shutdown deadline to verify the timeout path behaves correctly (jobs return to the queue rather than staying stuck in delivering).
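The timeout path deserves its own test. A sketch, assuming the basic Shutdown shown earlier (without early cancellation) and an enqueueSlowJob delay that comfortably outlasts the drain window:

```go
func TestWorkerShutdownTimeout(t *testing.T) {
    db := setupTestDB(t)
    worker := NewDeliveryWorker(db, testConfig())

    // A delivery that outlasts any reasonable drain window in this test.
    enqueueSlowJob(t, db, 2*time.Second)

    ctx, cancel := context.WithCancel(context.Background())
    go func() { _ = worker.Run(ctx) }()

    // Let the worker claim the job, then signal shutdown with a tiny deadline.
    time.Sleep(50 * time.Millisecond)
    cancel()

    shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
    defer shutdownCancel()

    if err := worker.Shutdown(shutdownCtx); err == nil {
        t.Fatal("expected Shutdown to report a timeout with a delivery still in flight")
    }

    // The job must not be stranded: either the stale lock cleanup or an early
    // cancel should eventually return it to a claimable state. What you assert
    // here depends on which of those mechanisms your worker implements.
}
```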
Graceful shutdown is one of those features that feels like operational polish until the day it prevents a production incident during a routine deploy. The implementation isn't complex — signal handling, a WaitGroup, and careful use of context — but getting the context propagation and lock release logic right requires deliberate attention.
GetHook's delivery worker implements this pattern, including stale lock cleanup for abnormal exits and per-destination concurrency limits that keep the drain window short even under load. If you want to skip the implementation and get reliable delivery out of the box, start with GetHook.