# Runbook
Open this page when something is wrong. Jump to the Incident playbooks section, find the playbook that matches the symptom you are seeing, read it top to bottom, and execute. This page is not meant to be read cover-to-cover.
Every playbook below has the same shape: Detect → Triage → Fix → Verify. If you cannot find a matching playbook, the closest page for ad-hoc diagnosis is Troubleshooting.
## Health signals
Use these tables as a reference while following any playbook.
### Liveness file

| Signal | Healthy | Stale |
|---|---|---|
| mtime of `--liveness-file` | within your configured freshness threshold | older → relay stalled or dead → Relay hanging |

See Health Checks for the `--liveness-file` flag details.

Size that freshness threshold to cover healthy idle gaps and broker-outage cooldown windows, not just steady-state queue drain.
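The mtime check can be scripted for ad-hoc use outside your orchestrator. A minimal sketch — the 90-second default threshold is illustrative, not a library default; size it per Health Checks:

```python
import os
import time

def liveness_age_seconds(path):
    """Seconds since the liveness file was last touched, or None if it is missing."""
    try:
        return time.time() - os.path.getmtime(path)
    except OSError:
        return None

def relay_looks_alive(path, max_age_seconds=90.0):
    """True when the file exists and is fresher than the freshness threshold."""
    age = liveness_age_seconds(path)
    return age is not None and age < max_age_seconds
```

The same logic drives the Kubernetes `livenessProbe` shown under Zero-downtime upgrade.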
### celery_outbox_stats snapshot

`python manage.py celery_outbox_stats` prints a point-in-time snapshot. It is not a replacement for metrics over time.

| Field | Meaning | Abnormal → playbook |
|---|---|---|
| `queue_depth` | live backlog: rows eligible for relay send/recovery right now | trending up → Queue growing |
| `oldest_pending_seconds` | age of the oldest live-backlog row (delivery latency) | above your SLO → Queue growing |
| `dlq_count` | rows currently stored in `celery_outbox_dead_letter` | new increases over time → Dead-letter queue growing |
| `top_failing` | current outbox task groups with the highest cumulative retry counts | one task dominating retries → Queue growing or Dead-letter queue growing |
### Metrics for graphing and alerting

StatsD names are shown with the default `MONITORING_STATSD_PREFIX = 'celery_outbox'`. Gauge and counter metrics usually export to Prometheus by replacing dots with underscores. Timer metrics depend on your exporter configuration and often appear as histogram-style series such as `_bucket`, `_sum`, and `_count`.

| StatsD metric | Prometheus | Type | Use |
|---|---|---|---|
| `celery_outbox.queue.depth` | `celery_outbox_queue_depth` | gauge | Sampled queue-wide live backlog, refreshed at most once per `--queue-snapshot-refresh-seconds`. Chart as a time series; a monotonic rise means the queue is growing. |
| `celery_outbox.oldest_pending_age_seconds` | `celery_outbox_oldest_pending_age_seconds` | gauge | Sampled queue-wide oldest live-backlog age, refreshed at most once per `--queue-snapshot-refresh-seconds`. Alert on crossing your SLO. |
| `celery_outbox.dead_letter.count` | `celery_outbox_dead_letter_count` | gauge | Sampled dead-letter backlog context, refreshed at most once per `--queue-snapshot-refresh-seconds`. Page on `increase(celery_outbox_messages_exceeded_total[10m]) > 0` instead of raw table size. |
| `celery_outbox.batch.duration_ms` | e.g. `celery_outbox_batch_duration_ms_bucket` | timing | Chart per-batch processing time. Absence of new samples means the relay has stalled. |
Full catalogue: Metrics.
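The thresholds above can be wired into Prometheus alerting rules. A sketch — the group name, 60s/5m thresholds, and annotations are placeholders to adapt, not shipped defaults:

```yaml
groups:
  - name: celery-outbox
    rules:
      - alert: CeleryOutboxDeliveryLatencyHigh
        # Suggested starting threshold from the Queue growing playbook.
        expr: celery_outbox_oldest_pending_age_seconds > 60
        for: 5m
        annotations:
          summary: "Outbox delivery latency above SLO; see the Queue growing playbook"
      - alert: CeleryOutboxNewDeadLetters
        expr: increase(celery_outbox_messages_exceeded_total[10m]) > 0
        annotations:
          summary: "New dead letters; see the Dead-letter queue growing playbook"
```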
### Log events referenced during triage

- `celery_outbox_relay_started`
- `celery_outbox_batch_processed` — absence during steady send is a stall signal
- `celery_outbox_relay_breaker_trip`
- `celery_outbox_relay_breaker_open` — relay alive, broker unavailable
- `celery_outbox_relay_iteration_failed`
- `celery_outbox_relay_shutdown_deadline_exceeded`
- `celery_outbox_send_failed`
- `celery_outbox_max_retries_exceeded`
Full catalogue: Logging Events.
### Explicit non-goals
- The library does not ship an HTTP health endpoint. File-based liveness is the only one provided. See Health Checks for a user-built Django view example if you need an HTTP probe.
- No auto-remediation. This runbook tells operators what to do; it does not run on its own.
## Incident playbooks

### Queue growing
Detect. celery_outbox_oldest_pending_age_seconds exceeds your SLO (suggested starting threshold: 60s). Secondary signal: celery_outbox_queue_depth trending up over 5-10 minutes.
Triage (cheapest first):

1. Is the relay running? Check the relay pod status and the `--liveness-file` mtime.
2. Is the broker reachable from the relay? From inside the relay container, run `celery -A <your_celery_app> inspect ping`.
3. Is a single task type dominating the pending set? Either read the `python manage.py celery_outbox_stats` output, or group the pending rows by task name from a DB shell.
4. Did the app's send rate spike? Cross-check with producer-side metrics on your service.
5. Is the broker itself under load? Check the broker admin UI (CPU, consumer count, its own queue depth).
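For the single-task-dominance check, the DB-shell variant can be sketched as follows. The table and column names (`celery_outbox_message`, `task_name`, `status`) are assumptions for illustration — check your installed schema before running:

```sql
-- Hypothetical schema; substitute your actual outbox table and columns.
SELECT task_name, COUNT(*) AS pending_rows
FROM celery_outbox_message
WHERE status = 'pending'
GROUP BY task_name
ORDER BY pending_rows DESC
LIMIT 10;
```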
Fix (by triage result):

- Relay is down → follow Relay hanging.
- Broker unreachable → follow Broker unreachable / outage cooldown active.
- One task dominates → fix the producing code, or add the task name to `CELERY_OUTBOX_EXCLUDE_TASKS` temporarily if the library is not a fit for that workload.
- Legitimate throughput → scale relay replicas and/or increase `batch_size`. See Relay Tuning.
Verify. celery_outbox_oldest_pending_age_seconds trending down; celery_outbox_queue_depth draining.
### Broker unreachable / outage cooldown active
Detect. Any of:

- `celery_outbox_relay_breaker_trip` in the relay log.
- Repeated `celery_outbox_relay_breaker_open` while queue depth is flat or rising.
- Broker ping from the relay container fails.
Triage:

- Confirm the relay is still alive. Check the `--liveness-file` mtime using a freshness threshold sized per Health Checks.
- Confirm the broker outage. From inside the relay container, run `celery -A <your_celery_app> inspect ping` or the broker's equivalent connectivity check.
- Check whether retries are climbing. Broker-outage deferral should not increment `retries` or consume retry budget.
- Look for scope. Is this one relay process, one AZ, or the whole broker fleet? The breaker is process-local, so different relay pods may trip independently.
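If `celery inspect ping` is unavailable in the container, a bare TCP probe narrows "broker unreachable" down to the network layer. A sketch — host and port are whatever your broker URL points at; this checks reachability only, not auth or protocol health:

```python
import socket

def broker_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to the broker endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```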
Fix:

- Restore broker connectivity, authentication, or network reachability.
- Leave the selected outbox rows alone. The relay already deferred them by `--broker-outage-cooldown`.
- Do not interpret one cooldown window without queue drain as a dead relay if the liveness file is still fresh.
- Once the broker recovers, the relay resumes on the next eligible batch attempt after the cooldown expires.
Verify.

- `celery_outbox_relay_breaker_open` stops repeating.
- `celery_outbox_batch_processed` resumes showing normal publish counts.
- `celery_outbox_queue_depth` and `celery_outbox_oldest_pending_age_seconds` trend down.
### Dead-letter queue growing

Detect. `increase(celery_outbox_messages_exceeded_total[10m]) > 0` or another alert on new dead letters over time. Table size alone is secondary context, not the primary paging signal.
Triage:

1. Group by `failure_reason` — is this one error class or many?
2. Group by `task_name` — is this scoped to one task or broad?
3. Time distribution of `dead_at` — is this ongoing or a past spike that has already stopped?
4. Cross-reference recent deploys, config changes, and broker incidents.

The first three are visible in the Django admin (Admin Interface) by filtering on `failure_reason`, `task_name`, and `dead_at`. Item 4 comes from deploy history, config history, and broker incident history outside the package.
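The same groupings are available from a DB shell. The table name `celery_outbox_dead_letter` and the `failure_reason`/`task_name`/`dead_at` columns come from this page; verify them against your installed migrations before running:

```sql
SELECT failure_reason, task_name, COUNT(*) AS dead_letters, MAX(dead_at) AS latest
FROM celery_outbox_dead_letter
GROUP BY failure_reason, task_name
ORDER BY dead_letters DESC;
```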
Fix (by cause):

1. Past broker outage, now recovered. Purge the old records; see Dead Letter Queue for the purge command and its full flag surface.
2. Task name not registered on workers. Roll workers forward to include the task, or revert the producer deploy.
3. Serialization errors. Fix the producing code and redeploy.
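For the past-broker-outage case, the packaged purge command documented under Dead Letter Queue is the supported path. If you need a raw-SQL fallback, a sketch — the 30-day window is an example, and the interval syntax shown is PostgreSQL's:

```sql
-- Deletes dead letters older than 30 days; count or back up the rows first.
DELETE FROM celery_outbox_dead_letter
WHERE dead_at < NOW() - INTERVAL '30 days';
```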
Replaying dead-lettered messages. Use either:

- Django admin: `CeleryOutboxDeadLetter` has a `retry_selected` bulk action.
- CLI automation: `python manage.py celery_outbox_replay_dead_letter <dead_letter_id_1> <dead_letter_id_2>`.

Both paths preserve the stored payload, schema version, and tracing/context fields and remove the replayed rows from `celery_outbox_dead_letter`. See Admin Interface.
Verify. `increase(celery_outbox_messages_exceeded_total[10m])` returns to zero; the top `failure_reason` values stop appearing in newly inserted rows.
### Relay iteration failed

Detect. `celery_outbox_relay_iteration_failed` appears in the relay log or your log-alerting stack.
Triage:

- Read `exception_type` and `exception_message` first. This event is a catch-all; the exception payload tells you whether to route to broker, DB, or config investigation.
- Correlate the previous and next relay events. If the surrounding logs show `celery_outbox_relay_breaker_trip` or `celery_outbox_relay_breaker_open`, switch to Broker unreachable / outage cooldown active.
- Check for schema/config drift. Recent deploys, unapplied migrations, or changed Celery settings are common causes because the relay loop retries instead of crashing hard.
- Check whether the failure repeats. A single iteration failure can be transient; repeated identical failures mean the relay is stuck in a retry loop.
Fix:
- Broker/auth/network error → restore connectivity and credentials, then follow Broker unreachable / outage cooldown active.
- Database/schema error → apply migrations or roll back the incompatible deploy before the relay retries the same failing code path.
- Bad package setting or broken callback → correct the setting/code and redeploy.
Verify.

- `celery_outbox_relay_iteration_failed` stops repeating for the same root cause.
- `celery_outbox_batch_processed` resumes.
- `celery_outbox_queue_depth` or `celery_outbox_oldest_pending_age_seconds` move in the expected direction again.
### Relay hanging

Detect — any of:

- Liveness probe failing (pod restart loop).
- `--liveness-file` mtime older than your configured freshness threshold.
- `celery_outbox_batch_processed` log event absent from the relay log, without matching breaker-open cooldown logs.
- `celery_outbox_queue_depth` flat but non-zero while the application is still producing.
Triage:

- Last log event and its timestamp from the relay pod — tells you where execution stalled. If the last events are `celery_outbox_relay_breaker_trip` or `celery_outbox_relay_breaker_open`, switch to Broker unreachable / outage cooldown active.
- DB lock contention — check for lock waits from a PostgreSQL or MySQL 8 shell. If `performance_schema.data_locks` is not enabled in your MySQL deployment, use your platform's lock-wait tooling instead.
- Broker send blocking — is the relay waiting on network I/O to the broker? Inspect the pod's network state (`ss -tnp`, or platform equivalent) from inside the container. A healthy relay should still bound each publish by `--send-timeout`; repeated breaker logs point to outage handling rather than a Python hang.
- Multiple-replica lock contention — see the note in Troubleshooting › Database Lock Contention.
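For the DB lock-contention step, standard catalog queries work. Sketches for PostgreSQL and MySQL 8 — these use built-in system views, but column availability varies by server version, so adapt as needed:

```sql
-- PostgreSQL: sessions currently waiting on a lock, and what they are running.
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- MySQL 8: current lock waits (requires performance_schema;
-- raw detail lives in performance_schema.data_locks).
SELECT * FROM sys.innodb_lock_waits;
```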
Fix:

- Lock contention across multiple relay replicas → reduce replica count or `batch_size`. See Relay Tuning.
- Broker-blocked with breaker logs → follow Broker unreachable / outage cooldown active.
- Python-level hang → restart the pod. If recurring, capture a stack trace next time with `py-spy dump --pid <pid>` so it can be diagnosed.
Verify. Liveness file is being touched again; celery_outbox_batch_processed log events resumed.
## Zero-downtime upgrade

### Principles

- The relay must never run against a schema it does not understand. `migrate` runs before new relay pods start.
- Migrations should be additive when possible — add columns, add tables, add indexes. Additive changes let old and new relay versions coexist during a rolling update. For destructive changes (drop column, change type, rename), use the two-release dance: the first release stops using the field, the second release removes it. Do not collapse this into a single release.
- SIGTERM must reach the relay. `SIGTERM`/`SIGINT` starts draining mode. The relay stops starting new sends after `--shutdown-timeout`, but an already-running publish is still bounded by `--send-timeout`. Whatever platform runs the relay must deliver SIGTERM and wait — not SIGKILL.
- Grace period ≥ `--shutdown-timeout` + `--send-timeout` + margin. If the orchestrator kills the relay earlier, already-selected rows recover later through stale-timeout reclaim, but operators can see spurious restarts and duplicate delivery.
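The grace-period rule as arithmetic, with illustrative flag values (not defaults) that line up with the Kubernetes example below:

```python
# Worked example of: grace period >= shutdown-timeout + send-timeout + margin.
shutdown_timeout = 30  # --shutdown-timeout: stop starting new sends after this
send_timeout = 60      # --send-timeout: bound on an already-running publish
margin = 30            # slack for pod teardown and signal delivery

termination_grace_period = shutdown_timeout + send_timeout + margin  # 120 seconds
```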
### Kubernetes worked example
This is a template, not a drop-in chart. Adapt to your Helm chart's values.
Run migrations in a pre-upgrade hook, either via an initContainer or a one-shot Job:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: myapp-migrate
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: migrate
          image: myapp:{{ .Values.image.tag }}
          command: ["python", "manage.py", "migrate", "--noinput"]
```
Relay Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-relay
spec:
  replicas: 2
  selector:            # required by apps/v1; must match template labels
    matchLabels:
      app: myapp-relay
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app: myapp-relay
    spec:
      terminationGracePeriodSeconds: 120  # ≥ shutdown_timeout + send_timeout + margin
      containers:
        - name: relay
          image: myapp:{{ .Values.image.tag }}
          command: ["python", "manage.py", "celery_outbox_relay", "--liveness-file", "/tmp/relay-alive"]
          livenessProbe:
            exec:
              command:
                - python
                - -c
                - |
                  import os
                  import sys
                  import time
                  path = "/tmp/relay-alive"
                  max_age_seconds = 90
                  try:
                      stale_for = time.time() - os.path.getmtime(path)
                  except FileNotFoundError:
                      sys.exit(1)
                  sys.exit(0 if stale_for < max_age_seconds else 1)
            initialDelaySeconds: 10
            periodSeconds: 30
            failureThreshold: 3
```
Deployment layout references: Kubernetes.
### Verification after upgrade

- `--liveness-file` mtime refreshes on every new pod.
- `celery_outbox_batch_processed` log events appear from the new pod names.
- `celery_outbox_queue_depth` and `celery_outbox_oldest_pending_age_seconds` stay within your SLO.
## Rollback

### Principles

- Rolling back code is cheap. Rolling back schema is not. `helm rollback` (or equivalent) reverts the container image and config; it does not revert the database. The library's migrations use standard reversible Django operations, so `python manage.py migrate django_celery_outbox <previous_migration>` can roll the schema back — but destructive reverses (dropping a column that now holds data) still lose rows. Verify the reverse is safe for your data before running it.
- Three scenarios, three procedures.
    - Bad image, schema is fine → standard rollback. Works because migrations are additive (see Zero-downtime upgrade).
    - Bad schema, needs reversal → run `python manage.py migrate django_celery_outbox <previous_migration>` after confirming it will not drop data you need. For non-trivial reverses (data transforms, dropping columns that have been written to), write a forward-fix migration instead — do not invent one during the incident.
    - Corruption or data loss → out of scope for this runbook. Use standard Postgres point-in-time recovery.
- Watch the DLQ during the rollback. A rollback that introduces incompatibility (e.g., workers on old code cannot deserialize tasks produced by the newer relay) shows up as DLQ growth. See Dead-letter queue growing.
### Kubernetes worked example

```shell
# List revisions
helm history <release>

# Roll back to a specific revision
helm rollback <release> <revision>
```
Verification after rollback:

- Relay image tag reverted on all relay pods.
- `celery_outbox_batch_processed` log events continue.
- `celery_outbox_dead_letter_count` does not climb.
**Schema changes are not rolled back by `helm rollback`.** `helm rollback` reverts images and config. Schema reversals are a separate, manual decision: they are possible via `manage.py migrate django_celery_outbox <previous_migration>`, but only if the reverse does not lose data you need. For non-trivial reverses, write a forward-fix migration and deploy it as a normal release instead.