Metrics

The relay emits StatsD metrics for monitoring.

Configuration

# settings.py
MONITORING_METRICS_ENABLED = True
MONITORING_STATSD_HOST = 'localhost'
MONITORING_STATSD_PORT = 9125
MONITORING_STATSD_PREFIX = 'celery_outbox'
MONITORING_STATSD_TAGS = {
    'env': 'production',
    'service': 'myapp',
}

Set MONITORING_METRICS_ENABLED = False to disable all emission without removing the integration code.
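For context, a tagged StatsD datagram looks like the following. This helper is purely illustrative (the library ships its own client integration, not this function), and it assumes DogStatsD-style tag encoding; other backends (Graphite, InfluxDB) encode tags differently.

```python
def format_statsd_line(prefix, metric, value, metric_type, tags):
    """Build a DogStatsD-style line: <prefix>.<metric>:<value>|<type>|#k:v,...

    Illustrative only -- shows how MONITORING_STATSD_PREFIX and
    MONITORING_STATSD_TAGS end up in the emitted datagram.
    """
    line = f"{prefix}.{metric}:{value}|{metric_type}"
    tag_part = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    if tag_part:
        line += f"|#{tag_part}"
    return line


# With the settings above, a published-message counter would be encoded as:
line = format_statsd_line(
    "celery_outbox", "messages.published", 1, "c",
    {"env": "production", "service": "myapp"},
)
print(line)  # celery_outbox.messages.published:1|c|#env:production,service:myapp
```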

Metric Reference

| Metric | Type | Tags | Description |
| --- | --- | --- | --- |
| queue.depth | gauge | - | Sampled live backlog: rows currently eligible for relay send or recovery (updated_at IS NULL, retryable by retry_after, or stale in-flight rows), refreshed at most once per --queue-snapshot-refresh-seconds window |
| dead_letter.count | gauge | - | Sampled dead-letter backlog, refreshed at most once per --queue-snapshot-refresh-seconds window |
| oldest_pending_age_seconds | gauge | - | Sampled age of the oldest live-backlog row, refreshed at most once per --queue-snapshot-refresh-seconds window |
| batch.duration_ms | timing | - | Time to process one relay batch |
| send_latency_ms | timing | task_name | Time from enqueue to publish (outbox queue latency) |
| messages.enqueued | counter | task_name | Committed outbox rows only; rollback and excluded-task bypass do not emit it |
| messages.published | counter | task_name | Successfully sent |
| messages.failed | counter | task_name, exception_type | Failed (will retry) |
| messages.exceeded | counter | task_name, exception_type | Dead-lettered |

Cardinality Control

High-cardinality task_name tags can overwhelm metrics backends. Control this with:

# Option 1: Disable task_name tags entirely
CELERY_OUTBOX_DISABLE_TASK_NAME_TAGS = True

# Option 2: Allowlist specific tasks (others become "other")
CELERY_OUTBOX_MONITORED_TASKS = {'orders.tasks.process_payment', 'orders.tasks.send_email'}

When CELERY_OUTBOX_MONITORED_TASKS is set, only the listed tasks get their actual name in tags; every other task is tagged as "other" to limit cardinality.
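The two settings combine as sketched below. This helper mirrors the documented rules but is not the library's actual code; the function name is hypothetical.

```python
def resolve_task_tag(task_name, monitored_tasks=None, disable_tags=False):
    """Illustrative sketch of the documented cardinality rules.

    disable_tags    -- CELERY_OUTBOX_DISABLE_TASK_NAME_TAGS
    monitored_tasks -- CELERY_OUTBOX_MONITORED_TASKS (None means unset)
    """
    if disable_tags:
        return None  # no task_name tag is emitted at all
    if monitored_tasks is not None and task_name not in monitored_tasks:
        return "other"  # collapse unlisted tasks into one tag value
    return task_name


allowlist = {"orders.tasks.process_payment", "orders.tasks.send_email"}
print(resolve_task_tag("orders.tasks.process_payment", allowlist))  # actual name
print(resolve_task_tag("reports.tasks.nightly", allowlist))         # "other"
```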

Grafana Dashboard

Example PromQL queries (via StatsD exporter):

# Queue depth
celery_outbox_queue_depth

# Throughput (messages/sec)
rate(celery_outbox_messages_published_total[5m])

# Error rate
rate(celery_outbox_messages_failed_total[5m]) /
rate(celery_outbox_messages_published_total[5m])

# P95 batch duration
histogram_quantile(0.95, celery_outbox_batch_duration_ms_bucket)

StatsD Backend Compatibility

The histogram queries (e.g., histogram_quantile) require a StatsD backend that converts timing metrics into Prometheus histograms; both the Datadog StatsD exporter and the Prometheus statsd_exporter (with a histogram mapping) support this. For other backends, use avg() or raw timing values instead.
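For the statsd_exporter route, a mapping along these lines converts the batch timing into a histogram. This is a sketch: the field name observer_type applies to recent exporter releases (older ones use timer_type), the metric path depends on your MONITORING_STATSD_PREFIX, and the bucket boundaries are placeholders you should tune.

```yaml
# statsd_exporter mapping sketch: expose batch.duration_ms as a
# Prometheus histogram so histogram_quantile() works.
mappings:
  - match: "celery_outbox.batch.duration_ms"
    observer_type: histogram        # older releases: timer_type
    histogram_options:
      buckets: [0.005, 0.05, 0.25, 1, 5]   # seconds; the exporter scales StatsD ms timers
    name: "celery_outbox_batch_duration_ms"
```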

Alerting

Recommended alerts:

| Condition | Severity | Action |
| --- | --- | --- |
| queue.depth > 1000 for 5m | Warning | Check whether the live backlog is draining |
| celery_outbox_oldest_pending_age_seconds > 60 for 10m | Critical | Queue latency is above SLO; inspect relay throughput and broker health |
| increase(celery_outbox_messages_exceeded_total[10m]) > 0 | Critical | New dead letters were created; triage failures before replaying or purging |
| rate(celery_outbox_messages_failed_total[5m]) > 0 for 10m | Warning | Check broker connectivity or task-specific send failures |
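As one way to wire these up, a Prometheus alerting rule for the dead-letter condition could look like the following. This is a sketch: the metric name assumes the statsd_exporter naming used in the queries above, and the group name, labels, and annotation text are placeholders.

```yaml
groups:
  - name: celery-outbox            # placeholder group name
    rules:
      - alert: CeleryOutboxDeadLetters
        expr: increase(celery_outbox_messages_exceeded_total[10m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "New outbox dead letters in the last 10m"
          description: "Triage failures before replaying or purging."
```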

Do not treat queue.depth as a count of only updated_at IS NULL rows. It is the same live-backlog summary used by the relay selector and the celery_outbox_stats command. These queue-wide gauges are sampled snapshots, not exact per-batch recomputations.

celery_outbox_relay_breaker_open is a broker-unavailable condition, not proof that the relay process is dead. Page process-down incidents from your platform health checks or liveness file monitoring instead of from the bundled metric examples.