
Runbook

Open this page when something is wrong. Jump to the Incident playbooks section, find the playbook that matches the symptom you are seeing, read it top to bottom, and execute. This page is not meant to be read cover-to-cover.

Every playbook below has the same shape: Detect → Triage → Fix → Verify. If you cannot find a matching playbook, the closest page for ad-hoc diagnosis is Troubleshooting.

Health signals

Use these tables as a reference while following any playbook.

Liveness file

Signal | Healthy | Stale
--- | --- | ---
mtime of --liveness-file | within your configured freshness threshold | older → relay stalled or dead → Relay hanging

See Health Checks for the --liveness-file flag details. Size that freshness threshold to cover healthy idle gaps and broker-outage cooldown windows, not just steady-state queue drain.
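For an ad-hoc check during an incident, the freshness comparison can be sketched in a few lines of Python. The function name and path are illustrative; substitute your deployment's liveness path and threshold:

```python
import os
import time


def liveness_fresh(path: str, max_age_seconds: float) -> bool:
    """True if the liveness file was touched within the freshness threshold.

    A missing file counts as stale: the relay either never started or its
    working directory was cleaned up.
    """
    try:
        age = time.time() - os.path.getmtime(path)
    except FileNotFoundError:
        return False
    return age < max_age_seconds
```

Calling liveness_fresh("/tmp/relay-alive", 90) mirrors the check that the Kubernetes liveness probe later on this page performs.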

celery_outbox_stats snapshot

python manage.py celery_outbox_stats prints a point-in-time snapshot. It is not a replacement for metrics over time.

Field | Meaning | Abnormal → playbook
--- | --- | ---
queue_depth | live backlog: rows eligible for relay send/recovery right now | trending up → Queue growing
oldest_pending_seconds | age of the oldest live-backlog row (delivery latency) | above your SLO → Queue growing
dlq_count | rows currently stored in celery_outbox_dead_letter | new increases over time → Dead-letter queue growing
top_failing | current outbox task groups with the highest cumulative retry counts | one task dominating retries → Queue growing or Dead-letter queue growing

Metrics for graphing and alerting

StatsD names are shown with the default MONITORING_STATSD_PREFIX = 'celery_outbox'. Gauge and counter metrics usually map to Prometheus names by replacing dots with underscores. Timer metrics depend on your exporter configuration and often appear as histogram-style series such as _bucket, _sum, and _count.

StatsD metric | Prometheus | Type | Use
--- | --- | --- | ---
celery_outbox.queue.depth | celery_outbox_queue_depth | gauge | Sampled queue-wide live backlog, refreshed at most once per --queue-snapshot-refresh-seconds. Chart as a time series; monotonic rise means the queue is growing.
celery_outbox.oldest_pending_age_seconds | celery_outbox_oldest_pending_age_seconds | gauge | Sampled queue-wide oldest live-backlog age, refreshed at most once per --queue-snapshot-refresh-seconds. Alert on crossing your SLO.
celery_outbox.dead_letter.count | celery_outbox_dead_letter_count | gauge | Sampled dead-letter backlog context, refreshed at most once per --queue-snapshot-refresh-seconds. Page from increase(celery_outbox_messages_exceeded_total[10m]) > 0 instead of raw table size.
celery_outbox.batch.duration_ms | e.g. celery_outbox_batch_duration_ms_bucket | timing | Chart per-batch processing time. Absence of new samples means the relay has stalled.
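The dot-to-underscore convention in the table can be expressed as a one-line helper. This is a sketch of the common statsd_exporter default mapping; your exporter's mapping rules may differ:

```python
def statsd_to_prometheus(name: str) -> str:
    """Map a dotted StatsD metric name to the underscore-separated
    Prometheus form shown in the table above."""
    return name.replace(".", "_")
```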

Full catalogue: Metrics.

Log events referenced during triage

  • celery_outbox_relay_started
  • celery_outbox_batch_processed — absence during steady send is a stall signal
  • celery_outbox_relay_breaker_trip
  • celery_outbox_relay_breaker_open — relay alive, broker unavailable
  • celery_outbox_relay_iteration_failed
  • celery_outbox_relay_shutdown_deadline_exceeded
  • celery_outbox_send_failed
  • celery_outbox_max_retries_exceeded

Full catalogue: Logging Events.

Explicit non-goals

  • The library does not ship an HTTP health endpoint. File-based liveness is the only one provided. See Health Checks for a user-built Django view example if you need an HTTP probe.
  • No auto-remediation. This runbook tells operators what to do; it does not run on its own.

Incident playbooks

Queue growing

Detect. celery_outbox_oldest_pending_age_seconds exceeds your SLO (suggested starting threshold: 60s). Secondary signal: celery_outbox_queue_depth trending up over 5-10 minutes.

Triage (cheapest first):

  1. Is the relay running? Check the relay pod status and the --liveness-file mtime.
  2. Is the broker reachable from the relay? From inside the relay container, run celery -A <your_celery_app> inspect ping.
  3. Is a single task type dominating the pending set? Either:

    python manage.py celery_outbox_stats
    

    or, from a DB shell:

    SELECT task_name, COUNT(*) FROM celery_outbox GROUP BY task_name ORDER BY 2 DESC LIMIT 10;
    
  4. Did the app's send rate spike? Cross-check with producer-side metrics on your service.

  5. Is the broker itself under load? Check the broker admin UI (CPU, consumer count, its own queue depth).

Fix (by triage result):

  • Relay is down → follow Relay hanging.
  • Broker unreachable → follow Broker unreachable / outage cooldown active.
  • One task dominates → fix the producing code, or add the task name to CELERY_OUTBOX_EXCLUDE_TASKS temporarily if the library is not a fit for that workload.
  • Legitimate throughput → scale relay replicas and/or increase batch_size. See Relay Tuning.

Verify. celery_outbox_oldest_pending_age_seconds trending down; celery_outbox_queue_depth draining.

Broker unreachable / outage cooldown active

Detect. Any of:

  • celery_outbox_relay_breaker_trip in the relay log.
  • Repeated celery_outbox_relay_breaker_open while queue depth is flat or rising.
  • Broker ping from the relay container fails.

Triage:

  1. Confirm the relay is still alive. Check the --liveness-file mtime using a freshness threshold sized per Health Checks.
  2. Confirm the broker outage. From inside the relay container, run celery -A <your_celery_app> inspect ping or the broker's equivalent connectivity check.
  3. Check whether retries are climbing. Broker-outage deferral should not increment retries or consume retry budget.
  4. Look for scope. Is this one relay process, one AZ, or the whole broker fleet? The breaker is process-local, so different relay pods may trip independently.

Fix:

  • Restore broker connectivity, authentication, or network reachability.
  • Leave the selected outbox rows alone. The relay already deferred them by --broker-outage-cooldown.
  • If the liveness file is still fresh, do not interpret a single cooldown window without queue drain as a dead relay.
  • Once the broker recovers, the relay resumes on the next eligible batch attempt after the cooldown expires.

Verify.

  • celery_outbox_relay_breaker_open stops repeating.
  • celery_outbox_batch_processed resumes showing normal publish counts.
  • celery_outbox_queue_depth and celery_outbox_oldest_pending_age_seconds trend down.

Dead-letter queue growing

Detect. increase(celery_outbox_messages_exceeded_total[10m]) > 0 or another alert on new dead letters over time. Table size alone is secondary context, not the primary paging signal.
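If you alert from raw counter samples rather than PromQL, "new dead letters over time" needs counter-reset handling, which is roughly what increase() does for you. A sketch (the function name is illustrative):

```python
def counter_increase(prev_sample: float, curr_sample: float) -> float:
    """New dead letters between two samples of a monotonic counter.

    If the current value is below the previous one, the relay process
    restarted and the counter reset to zero, so the whole current value
    counts as new growth (same convention as PromQL's increase()).
    """
    if curr_sample < prev_sample:
        return curr_sample
    return curr_sample - prev_sample
```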

Triage:

  1. Group by failure_reason — is this one error class or many?
  2. Group by task_name — is this scoped to one task or broad?
  3. Time distribution of dead_at — is this ongoing or a past spike that has already stopped?
  4. Cross-reference recent deploys, config changes, and broker incidents.

The first three are visible in the Django admin (Admin Interface) by filtering on failure_reason, task_name, and dead_at. Item 4 comes from deploy history, config history, and broker incident history outside the package.
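If you prefer a DB shell to the admin, the first three triage questions map to GROUP BY queries. A self-contained sketch against an in-memory SQLite copy of the table — the column names (task_name, failure_reason, dead_at) are taken from the admin filters above, and the sample rows are invented; run the equivalent SQL against your real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE celery_outbox_dead_letter "
    "(task_name TEXT, failure_reason TEXT, dead_at TEXT)"
)
conn.executemany(
    "INSERT INTO celery_outbox_dead_letter VALUES (?, ?, ?)",
    [
        ("app.tasks.sync_user", "SerializationError", "2024-05-01T10:00:00"),
        ("app.tasks.sync_user", "SerializationError", "2024-05-01T10:05:00"),
        ("app.tasks.send_email", "NotRegistered", "2024-05-01T11:00:00"),
    ],
)

# Triage question 1: one error class or many?
by_reason = conn.execute(
    "SELECT failure_reason, COUNT(*) AS n "
    "FROM celery_outbox_dead_letter "
    "GROUP BY failure_reason ORDER BY n DESC"
).fetchall()

# Triage question 2: scoped to one task or broad?
by_task = conn.execute(
    "SELECT task_name, COUNT(*) AS n "
    "FROM celery_outbox_dead_letter "
    "GROUP BY task_name ORDER BY n DESC"
).fetchall()
```

The same GROUP BY over dead_at, bucketed by hour or day, answers question 3.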

Fix (by cause):

  • Past broker outage, now recovered. Purge old records:

    python manage.py celery_outbox_purge_dead_letter --older-than-dead 7d
    

    See Dead Letter Queue for the full flag surface.

  • Task name not registered on workers. Roll workers forward to include the task, or revert the producer deploy.

  • Serialization errors. Fix the producing code and redeploy.

Replaying dead-lettered messages. Use either:

  • Django admin: CeleryOutboxDeadLetter has a retry_selected bulk action.
  • CLI automation: python manage.py celery_outbox_replay_dead_letter <dead_letter_id_1> <dead_letter_id_2>.

Both paths preserve the stored payload, schema version, and tracing/context fields and remove the replayed rows from celery_outbox_dead_letter. See Admin Interface.

Verify. increase(celery_outbox_messages_exceeded_total[10m]) returns to zero; the top failure_reason values stop appearing in newly inserted rows.

Relay iteration failed

Detect. celery_outbox_relay_iteration_failed appears in the relay log or your log-alerting stack.

Triage:

  1. Read exception_type and exception_message first. This event is a catch-all; the exception payload tells you whether to route to broker, DB, or config investigation.
  2. Correlate the previous and next relay events. If the surrounding logs show celery_outbox_relay_breaker_trip or celery_outbox_relay_breaker_open, switch to Broker unreachable / outage cooldown active.
  3. Check for schema/config drift. Recent deploys, unapplied migrations, or changed Celery settings are common causes because the relay loop retries instead of crashing hard.
  4. Check whether the failure repeats. A single iteration failure can be transient; repeated identical failures mean the relay is stuck in a retry loop.
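Triage step 4 can be automated against structured logs. A sketch assuming JSON log lines carrying event, exception_type, and exception_message fields — adjust the field names to match your logging configuration:

```python
import json
from collections import Counter


def repeated_failures(log_lines, threshold=3):
    """Return (exception_type, exception_message) signatures that recur at
    least `threshold` times. A repeating signature suggests a stuck retry
    loop; a single hit is more likely transient."""
    counts = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines
        if record.get("event") == "celery_outbox_relay_iteration_failed":
            signature = (record.get("exception_type"),
                         record.get("exception_message"))
            counts[signature] += 1
    return {sig: n for sig, n in counts.items() if n >= threshold}
```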

Fix:

  • Broker/auth/network error → restore connectivity and credentials, then follow Broker unreachable / outage cooldown active.
  • Database/schema error → apply migrations or roll back the incompatible deploy before the relay retries the same failing code path.
  • Bad package setting or broken callback → correct the setting/code and redeploy.

Verify.

  • celery_outbox_relay_iteration_failed stops repeating for the same root cause.
  • celery_outbox_batch_processed resumes.
  • celery_outbox_queue_depth or celery_outbox_oldest_pending_age_seconds move in the expected direction again.

Relay hanging

Detect. Any of:

  • Liveness probe failing (pod restart loop).
  • --liveness-file mtime older than your configured freshness threshold.
  • celery_outbox_batch_processed log event absent from the relay log, without matching breaker-open cooldown logs.
  • celery_outbox_queue_depth flat but non-zero while the application is still producing.

Triage:

  1. Last log event and its timestamp from the relay pod — tells you where execution stalled. If the last events are celery_outbox_relay_breaker_trip or celery_outbox_relay_breaker_open, switch to Broker unreachable / outage cooldown active.
  2. DB lock contention:

    PostgreSQL:

    SELECT * FROM pg_locks WHERE relation = 'celery_outbox'::regclass;
    

    MySQL 8:

    SELECT *
    FROM performance_schema.data_locks
    WHERE OBJECT_NAME = 'celery_outbox';
    

    If performance_schema.data_locks is not enabled in your MySQL deployment, use your platform's lock-wait tooling instead.

  3. Broker send blocking — is the relay waiting on network I/O to the broker? Inspect the pod's network state (ss -tnp, or platform equivalent) from inside the container. A healthy relay should still bound each publish by --send-timeout; repeated breaker logs point to outage handling rather than a Python hang.

  4. Multiple-replica lock contention — see the note in Troubleshooting › Database Lock Contention.

Fix:

  • Lock contention across multiple relay replicas → reduce replica count or batch_size. See Relay Tuning.
  • Broker-blocked with breaker logs → follow Broker unreachable / outage cooldown active.
  • Python-level hang → restart the pod. If recurring, capture a stack trace next time with py-spy dump --pid <pid> so it can be diagnosed.

Verify. Liveness file is being touched again; celery_outbox_batch_processed log events resumed.

Zero-downtime upgrade

Principles

  1. The relay must never run against a schema it does not understand. migrate runs before new relay pods start.
  2. Migrations should be additive when possible — add columns, add tables, add indexes. Additive changes let old and new relay versions coexist during a rolling update. For destructive changes (drop column, change type, rename), use the two-release dance: the first release stops using the field, the second release removes it. Do not collapse this into a single release.
  3. SIGTERM must reach the relay. SIGTERM/SIGINT starts draining mode. The relay stops starting new sends after --shutdown-timeout, but an already-running publish is still bounded by --send-timeout. Whatever platform runs the relay must deliver SIGTERM and wait — not SIGKILL.
  4. Grace period ≥ --shutdown-timeout + --send-timeout + margin. If the orchestrator kills the relay earlier, already-selected rows recover later through stale-timeout reclaim, but operators can see spurious restarts and duplicate delivery.
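Principle 4 is simple arithmetic; a sketch that yields the 120-second figure used in the Deployment below (the flag values and 30-second margin are illustrative):

```python
import math


def termination_grace_seconds(shutdown_timeout: float, send_timeout: float,
                              margin: float = 30.0) -> int:
    """Grace period must cover the drain window plus one in-flight publish,
    plus a safety margin, so SIGKILL never interrupts a send."""
    return math.ceil(shutdown_timeout + send_timeout + margin)


# e.g. --shutdown-timeout 60 and --send-timeout 30 with a 30 s margin
# gives terminationGracePeriodSeconds of 120.
```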

Kubernetes worked example

This is a template, not a drop-in chart. Adapt to your Helm chart's values.

Run migrations in a pre-upgrade hook, either via an initContainer or a one-shot Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: myapp-migrate
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: migrate
          image: myapp:{{ .Values.image.tag }}
          command: ["python", "manage.py", "migrate", "--noinput"]

Relay Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-relay
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      terminationGracePeriodSeconds: 120   # ≥ shutdown_timeout + send_timeout + margin
      containers:
        - name: relay
          image: myapp:{{ .Values.image.tag }}
          command: ["python", "manage.py", "celery_outbox_relay", "--liveness-file", "/tmp/relay-alive"]
          livenessProbe:
            exec:
              command:
                - python
                - -c
                - |
                  import os
                  import sys
                  import time

                  path = "/tmp/relay-alive"
                  max_age_seconds = 90

                  try:
                      stale_for = time.time() - os.path.getmtime(path)
                  except FileNotFoundError:
                      sys.exit(1)

                  sys.exit(0 if stale_for < max_age_seconds else 1)
            initialDelaySeconds: 10
            periodSeconds: 30
            failureThreshold: 3

Deployment layout references: Kubernetes.

Verification after upgrade

  • --liveness-file mtime refreshes on every new pod.
  • celery_outbox_batch_processed log events appear from the new pod names.
  • celery_outbox_queue_depth and celery_outbox_oldest_pending_age_seconds stay within your SLO.

Rollback

Principles

  1. Rolling back code is cheap. Rolling back schema is not. helm rollback (or equivalent) reverts the container image and config; it does not revert the database. The library's migrations use standard reversible Django operations, so python manage.py migrate django_celery_outbox <previous_migration> can roll the schema back — but destructive reverses (dropping a column that now holds data) still lose rows. Verify the reverse is safe for your data before running it.
  2. Three scenarios, three procedures.
    • Bad image, schema is fine → standard rollback. Works because migrations are additive (see Zero-downtime upgrade).
    • Bad schema, needs reversal → run python manage.py migrate django_celery_outbox <previous_migration> after confirming it will not drop data you need. For non-trivial reverses (data transforms, dropping columns that have been written to) write a forward-fix migration instead — do not invent one during the incident.
    • Corruption or data loss → out of scope for this runbook. Use your database's standard point-in-time recovery (Postgres PITR or the MySQL equivalent).
  3. Watch the DLQ during the rollback. A rollback that introduces incompatibility (e.g., workers on old code cannot deserialize tasks produced by the newer relay) shows up as DLQ growth. See Dead-letter queue growing.

Kubernetes worked example

# List revisions
helm history <release>

# Roll back to a specific revision
helm rollback <release> <revision>

Verification after rollback:

  • Relay image tag reverted on all relay pods.
  • celery_outbox_batch_processed log events continue.
  • celery_outbox_dead_letter_count does not climb.

Schema changes are not rolled back by helm rollback

helm rollback reverts images and config. Schema reversals are a separate, manual decision: they are possible via manage.py migrate django_celery_outbox <previous_migration>, but only if the reverse does not lose data you need. For non-trivial reverses, write a forward-fix migration and deploy it as a normal release instead.