Metrics¶
The relay emits StatsD metrics for monitoring.
Configuration¶
```python
# settings.py
MONITORING_METRICS_ENABLED = True
MONITORING_STATSD_HOST = 'localhost'
MONITORING_STATSD_PORT = 9125
MONITORING_STATSD_PREFIX = 'celery_outbox'
MONITORING_STATSD_TAGS = {
    'env': 'production',
    'service': 'myapp',
}
```
Set MONITORING_METRICS_ENABLED = False to disable all emission without removing the integration code.
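The enabled flag turns every emission into a no-op rather than removing call sites. A minimal sketch of that gating pattern (the `MetricsClient` class and its in-memory `sent` list are hypothetical stand-ins for the real StatsD socket, not the library's API):

```python
class MetricsClient:
    """Hypothetical sketch: emit calls become no-ops when disabled."""

    def __init__(self, enabled: bool, prefix: str = "celery_outbox"):
        self.enabled = enabled
        self.prefix = prefix
        self.sent = []  # stand-in for the UDP socket, for illustration

    def incr(self, name: str, value: int = 1):
        if not self.enabled:
            return  # MONITORING_METRICS_ENABLED = False: silently drop
        # StatsD counter line format: <prefix>.<name>:<value>|c
        self.sent.append(f"{self.prefix}.{name}:{value}|c")
```

Because callers always go through the client, disabling metrics requires no changes to application code.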
Metric Reference¶
| Metric | Type | Tags | Description |
|---|---|---|---|
| `queue.depth` | gauge | - | Sampled live backlog: rows currently eligible for relay send or recovery (`updated_at IS NULL`, retryable by `retry_after`, or stale in-flight rows), refreshed at most once per `--queue-snapshot-refresh-seconds` window |
| `dead_letter.count` | gauge | - | Sampled dead-letter backlog, refreshed at most once per `--queue-snapshot-refresh-seconds` window |
| `oldest_pending_age_seconds` | gauge | - | Sampled age of the oldest live-backlog row, refreshed at most once per `--queue-snapshot-refresh-seconds` window |
| `batch.duration_ms` | timing | - | Time spent processing one relay batch |
| `send_latency_ms` | timing | `task_name` | Time from enqueue to publish (outbox queue latency) |
| `messages.enqueued` | counter | `task_name` | Committed outbox rows only; rolled-back transactions and excluded-task bypass do not emit it |
| `messages.published` | counter | `task_name` | Messages successfully sent to the broker |
| `messages.failed` | counter | `task_name`, `exception_type` | Send failed; the row will be retried |
| `messages.exceeded` | counter | `task_name`, `exception_type` | Retry limit exceeded; the row was dead-lettered |
Cardinality Control¶
High-cardinality task_name tags can overwhelm metrics backends. Control this with:
```python
# Option 1: Disable task_name tags entirely
CELERY_OUTBOX_DISABLE_TASK_NAME_TAGS = True

# Option 2: Allowlist specific tasks (others become "other")
CELERY_OUTBOX_MONITORED_TASKS = {'orders.tasks.process_payment', 'orders.tasks.send_email'}
```
When CELERY_OUTBOX_MONITORED_TASKS is set, only listed tasks get their actual name in tags. All other tasks are tagged as other to limit cardinality.
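The tag-resolution rules above can be restated as a small pure function (the helper name `task_name_tag` is hypothetical; the behavior mirrors the two settings described):

```python
MONITORED_TASKS = {"orders.tasks.process_payment", "orders.tasks.send_email"}

def task_name_tag(task_name, monitored=MONITORED_TASKS, disabled=False):
    """Sketch of the cardinality rules: disable wins, then allowlist, then passthrough."""
    if disabled:  # CELERY_OUTBOX_DISABLE_TASK_NAME_TAGS = True
        return None  # no task_name tag is emitted at all
    if monitored and task_name not in monitored:
        return "other"  # unlisted tasks collapse into one tag value
    return task_name
```

With this shape, the number of distinct `task_name` tag values is bounded by the allowlist size plus one, regardless of how many task types the application defines.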
Grafana Dashboard¶
Example PromQL queries (via StatsD exporter):
```promql
# Queue depth
celery_outbox_queue_depth

# Throughput (messages/sec)
rate(celery_outbox_messages_published_total[5m])

# Error rate
rate(celery_outbox_messages_failed_total[5m]) /
rate(celery_outbox_messages_published_total[5m])

# P95 batch duration
histogram_quantile(0.95, celery_outbox_batch_duration_ms_bucket)
```
StatsD Backend Compatibility
The histogram queries (e.g., histogram_quantile) require a StatsD backend that converts timing metrics to Prometheus histograms.
Datadog StatsD exporter and statsd-exporter with histogram mapping support this.
For other backends, use avg() or raw timing values instead.
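For statsd-exporter, a mapping along these lines converts the batch timing into a Prometheus histogram. This is a sketch under assumptions: the metric and bucket values are illustrative, and older exporter versions use `timer_type` instead of `observer_type`.

```yaml
# statsd_exporter mapping config (illustrative)
mappings:
  - match: "celery_outbox.batch.duration_ms"
    observer_type: histogram        # older versions: timer_type: histogram
    histogram_options:
      buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]  # seconds; the exporter scales StatsD timers to seconds
    name: "celery_outbox_batch_duration_ms"
```

Without a histogram mapping, the exporter emits a summary, and `histogram_quantile` queries will return no data.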
Alerting¶
Recommended alerts:
| Condition | Severity | Action |
|---|---|---|
| `queue.depth > 1000` for 5m | Warning | Check whether the live backlog is draining |
| `celery_outbox_oldest_pending_age_seconds > 60` for 10m | Critical | Queue latency is above SLO; inspect relay throughput and broker health |
| `increase(celery_outbox_messages_exceeded_total[10m]) > 0` | Critical | New dead letters were created; triage failures before replaying or purging |
| `rate(celery_outbox_messages_failed_total[5m]) > 0` for 10m | Warning | Check broker connectivity or task-specific send failures |
Do not treat queue.depth as a count of only updated_at IS NULL rows. It is the same live-backlog summary used by the relay selector and the celery_outbox_stats command.
These queue-wide gauges are sampled snapshots, not exact per-batch recomputations.
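The live-backlog predicate behind `queue.depth` can be sketched in pure Python. Field names (`updated_at`, `retry_after`, `in_flight`) and the staleness threshold are assumptions for illustration; the real selector runs as a database query:

```python
from datetime import datetime, timedelta, timezone

STALE_IN_FLIGHT = timedelta(minutes=5)  # hypothetical staleness threshold

def is_live_backlog(row: dict, now: datetime) -> bool:
    """Illustrative restatement of the three live-backlog conditions."""
    if row["updated_at"] is None:
        return True  # never picked up by the relay
    if row.get("retry_after") is not None and row["retry_after"] <= now:
        return True  # retry window has elapsed
    if row.get("in_flight") and now - row["updated_at"] > STALE_IN_FLIGHT:
        return True  # stale in-flight row, eligible for recovery
    return False
```

A row counts toward `queue.depth` if any one of the three conditions holds, which is why the gauge is strictly larger than a plain `updated_at IS NULL` count whenever retries or stale rows exist.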
celery_outbox_relay_breaker_open is a broker-unavailable condition, not proof that the relay process is dead. Page process-down incidents from your platform health checks or liveness file monitoring instead of from the bundled metric examples.