Health Checks
Relay Liveness Probe
The relay supports file-based liveness probes for Kubernetes via its --liveness-file option (the example below uses /tmp/relay-alive).
After each batch, the relay touches this file. Your probe should check both:
- the file exists
- its mtime is still fresh
A plain test -f only proves the relay started once; it does not detect a stalled relay. A portable approach is to have Kubernetes run a short Python check in the same container:
livenessProbe:
  exec:
    command:
      - python
      - -c
      - |
        import os
        import sys
        import time
        path = "/tmp/relay-alive"
        max_age_seconds = 90
        try:
            stale_for = time.time() - os.path.getmtime(path)
        except FileNotFoundError:
            sys.exit(1)
        sys.exit(0 if stale_for < max_age_seconds else 1)
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3
Choose max_age_seconds to exceed your normal healthy gap between touches. A good starting point is the largest of:
- roughly 2x --idle-time
- --broker-outage-cooldown + --send-timeout
- your worst-case healthy batch duration
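Reading the list above as three candidates (about twice --idle-time, --broker-outage-cooldown plus --send-timeout, and the worst-case healthy batch duration), the starting point can be computed as below. This is a minimal sketch: the function name and the sample values are hypothetical, and all inputs are assumed to be in seconds.

```python
def suggest_max_age_seconds(idle_time, broker_outage_cooldown, send_timeout,
                            worst_batch_seconds):
    """Return the largest of the three candidate gaps, in seconds."""
    candidates = [
        2 * idle_time,                          # roughly 2x --idle-time
        broker_outage_cooldown + send_timeout,  # cooldown plus one send timeout
        worst_batch_seconds,                    # slowest healthy batch observed
    ]
    return max(candidates)


# Hypothetical settings: 5 s idle time, 30 s cooldown, 10 s send timeout,
# and a 20 s worst-case batch.
print(suggest_max_age_seconds(5, 30, 10, 20))  # 40
```

Treat the result as a floor and set max_age_seconds somewhat above it, so that a healthy relay is not restarted during an ordinary pause.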
During broker outage cooldown, queue depth may stop dropping for one cooldown window even while
the relay process is healthy. Liveness still depends on fresh batch touches to --liveness-file,
not on instantaneous queue drain.
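With the example above, the probe fails if the file is missing or has been stale for at least 90 seconds. Kubernetes restarts the container after failureThreshold (3) consecutive failed probes, which run every periodSeconds (30).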
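To estimate how long a stalled relay can keep running before it is restarted, combine the staleness limit with the probe schedule. The sketch below is an approximation: it ignores probe execution time, initialDelaySeconds, and kubelet scheduling jitter, and the function name is illustrative only.

```python
def restart_window_seconds(max_age_seconds=90, period_seconds=30,
                           failure_threshold=3):
    """Return (earliest, latest) seconds from the last touch to a restart."""
    # Earliest case: a probe runs exactly when the file becomes stale, and
    # the remaining (threshold - 1) failures follow one period apart.
    earliest = max_age_seconds + (failure_threshold - 1) * period_seconds
    # Latest case: the file becomes stale just after a passing probe, so the
    # first failing probe comes almost a full period later.
    latest = max_age_seconds + failure_threshold * period_seconds
    return earliest, latest


print(restart_window_seconds())  # (150, 180)
```

With the example settings, a relay that stops touching the file is restarted roughly 150 to 180 seconds after its last touch.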
Queue Depth Check
Monitor queue depth via the stats command:
Or via the StatsD metric:
Health Endpoint
The package does not ship an HTTP health endpoint. If your platform requires one for load balancers, add your own view:
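One possible shape for the logic behind such a view is sketched below. relay_health is a hypothetical helper, not part of the package. It assumes the web process can read the same liveness file as the relay (for example through a shared volume), and it reuses the /tmp/relay-alive path and the 90-second limit from the probe example.

```python
import json
import os
import time


def relay_health(path="/tmp/relay-alive", max_age_seconds=90, now=None):
    """Return (http_status, json_body) based on liveness-file freshness."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except FileNotFoundError:
        # The relay has never touched the file, or the file was removed.
        return 503, json.dumps({"status": "missing"})
    if age < max_age_seconds:
        return 200, json.dumps({"status": "ok", "age_seconds": round(age, 1)})
    return 503, json.dumps({"status": "stale", "age_seconds": round(age, 1)})
```

Wrap the returned status code and body in your web framework's response type. The now parameter exists only to make the function easy to test.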