Relay Tuning

Batch Size

Controls how many messages are processed per database round-trip.

| Scenario | Recommended batch size | Rationale |
|---|---|---|
| Low volume (<100/min) | 10-50 | Lower latency |
| Medium volume | 100-200 | Balance |
| High volume (>1000/min) | 500-1000 | Throughput |

--batch-size 500

Idle Time

How long the relay sleeps, in seconds, when the queue is empty.

| Scenario | Recommended (seconds) | Rationale |
|---|---|---|
| Real-time required | 0.1-0.5 | Sub-second latency |
| Standard | 1.0-2.0 | Balance |
| Background jobs | 5.0-10.0 | Reduce DB load |

--idle-time 1.0
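
To make the interaction between batch size and idle time concrete, here is a minimal sketch of a polling loop. It is an illustration only, not the relay's actual implementation: `run_relay_loop`, `fetch_batch`, `publish`, and `max_iterations` are hypothetical names introduced for this example.

```python
import time
from typing import Callable, List


def run_relay_loop(
    fetch_batch: Callable[[int], List[str]],
    publish: Callable[[str], None],
    batch_size: int = 100,
    idle_time: float = 1.0,
    max_iterations: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Hypothetical polling loop; returns the number of messages published."""
    published = 0
    for _ in range(max_iterations):
        # One database round-trip fetches at most batch_size messages.
        batch = fetch_batch(batch_size)
        if not batch:
            # Queue is empty: sleep for idle_time before polling again.
            sleep(idle_time)
            continue
        for message in batch:
            publish(message)
            published += 1
    return published
```

In this model a message that arrives just after an empty poll waits up to `idle_time` seconds before it is picked up, which is why lower idle times suit real-time workloads and higher ones reduce database load.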

Backoff Time

Base seconds for exponential backoff on ordinary publish failures.

Formula: delay = min(backoff_time * 2^retries + jitter, max_backoff)

With the defaults, backoff_time is 120 seconds, jitter adds up to 10% of backoff_time (at most 12 seconds), and max_backoff is 3600 seconds.

| Retries | Delay range (120s base) |
|---|---|
| 0 | 120-132s |
| 1 | 240-252s |
| 2 | 480-492s |
| 3 | 960-972s |
| 4 | 1920-1932s |

--backoff-time 120
--max-backoff 3600.0

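If you raise --max-retries above the default, later failures still cap at --max-backoff instead of growing without bound. With the default settings, the uncapped delay at retries=5 would already be 3840s, so the 3600s cap applies from that point on.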
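
The formula and delay ranges can be reproduced with a short sketch. It assumes the jitter is drawn uniformly between 0 and 10% of `backoff_time`; the exact jitter distribution is an assumption, and `retry_delay` and `delay_range` are illustrative names rather than functions exposed by the package.

```python
import random


def retry_delay(retries, backoff_time=120.0, max_backoff=3600.0, rng=random.random):
    """delay = min(backoff_time * 2**retries + jitter, max_backoff)."""
    # Assumption: jitter is uniform in [0, 10% of backoff_time).
    jitter = rng() * 0.10 * backoff_time
    return min(backoff_time * 2 ** retries + jitter, max_backoff)


def delay_range(retries, backoff_time=120.0, max_backoff=3600.0):
    """Smallest and largest possible delay for a given retry count."""
    base = backoff_time * 2 ** retries
    low = min(base, max_backoff)
    high = min(base + 0.10 * backoff_time, max_backoff)
    return low, high


# Print the delay range per retry count, including values past the cap.
for retries in range(7):
    low, high = delay_range(retries)
    print(f"retries={retries}: {low:.0f}-{high:.0f}s")
```

From retries=5 onward both ends of the range collapse to `max_backoff`, so raising the retry limit adds more attempts without lengthening the wait between them.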

Max Retries

After this many failures, the message moves to dead letter.

--max-retries 5

Broker Outage Cooldown

--broker-outage-cooldown is separate from retry backoff.

- It applies only when the relay classifies the failure as a broker outage.
- The relay defers the already-selected rows for the cooldown window.
- It does not increment retries.
- It does not consume retry budget from --max-retries.
- The breaker is process-local, so each relay process tracks its own cooldown.

--broker-outage-cooldown 30.0
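
The difference between an outage cooldown and an ordinary retry can be sketched as follows. This is a simplified model, not the package's code: `OutageBreaker`, `handle_failure`, and the `deferred_until` field are hypothetical names chosen for illustration.

```python
import time


class OutageBreaker:
    """Process-local cooldown tracker (illustrative, not the package's class)."""

    def __init__(self, cooldown=30.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock
        self.open_until = None  # no outage observed yet

    def record_outage(self):
        # Start (or restart) the cooldown window from the current time.
        self.open_until = self.clock() + self.cooldown

    def is_open(self):
        # True while the relay should hold off publishing.
        return self.open_until is not None and self.clock() < self.open_until


def handle_failure(row, is_broker_outage, breaker):
    """Apply the outage path or the ordinary retry path to one row."""
    if is_broker_outage:
        # Outage: defer the row for the cooldown window, retries untouched.
        breaker.record_outage()
        row["deferred_until"] = breaker.open_until
    else:
        # Ordinary publish failure: consumes one unit of retry budget.
        row["retries"] += 1
    return row
```

Because the breaker state lives in an ordinary object, two relay processes in this model would each hold their own instance and cool down independently.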

Stale Timeout

--stale-timeout-seconds controls when rows stamped as in-flight become reclaimable again. The default is 300 seconds.

This is recovery logic, not normal retry logic:

- It is used for rows that were selected but never completed because a relay crashed or was interrupted.
- It is how selected-but-not-yet-started rows become visible again after shutdown deadline aborts.
- It does not change retries on its own.

--stale-timeout-seconds 300
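
As a rough sketch of the reclaim rule, the check below treats a row as reclaimable once its in-flight stamp is at least `stale_timeout_seconds` old. The function name, the `in_flight_since` argument, and the inclusive boundary are assumptions made for this example; the real selector applies the equivalent condition in SQL.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_reclaimable(
    in_flight_since: Optional[datetime],
    now: datetime,
    stale_timeout_seconds: float = 300.0,
) -> bool:
    """Return True when an in-flight stamp is old enough to be reclaimed."""
    if in_flight_since is None:
        # Row was never stamped as in-flight, so stale recovery does not apply.
        return False
    # Assumption: the boundary is inclusive (age >= timeout).
    return now - in_flight_since >= timedelta(seconds=stale_timeout_seconds)


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_reclaimable(now - timedelta(seconds=120), now))  # still in flight
print(is_reclaimable(now - timedelta(seconds=600), now))  # stale, reclaimable
```

Note that the check only reads the stamp; it never touches a retry counter, which matches the point that stale recovery does not change retries on its own.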

Send Timeout and Shutdown Timeout

--send-timeout bounds a single Celery.send_task() publish attempt. --shutdown-timeout controls how long the relay may keep starting additional sends after SIGTERM or SIGINT.

Pick --shutdown-timeout to cover a healthy drain window, and size --send-timeout to the slowest broker publish you still consider healthy.
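
One way to picture how the two timeouts interact is a serial drain loop that stops starting sends once the shutdown deadline passes. This is a simplified model with hypothetical names (`drain`, `send`, `pending`), not the relay's actual shutdown code.

```python
import time


def drain(pending, send, shutdown_timeout, clock=time.monotonic):
    """Start sends until the shutdown deadline; return (sent, not_started)."""
    deadline = clock() + shutdown_timeout
    sent, not_started = [], []
    for item in pending:
        if clock() >= deadline:
            # Deadline reached: do not start any further sends.
            not_started.append(item)
            continue
        # A send that starts before the deadline is allowed to finish;
        # in the real relay each attempt is bounded by --send-timeout.
        send(item)
        sent.append(item)
    return sent, not_started
```

Under this model a send that starts just before the deadline can still run for up to one send timeout, so the worst-case exit time is roughly the shutdown timeout plus one send timeout. Rows left in `not_started` correspond to the selected-but-not-yet-started rows that the stale timeout later makes visible again.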

Publish Concurrency

--publish-concurrency enables a bounded parallel publish mode. The default is 1, which keeps the serial relay path and remains the recommended baseline.

Treat higher values as advanced tuning:

- Start with 1.
- Increase gradually only after verifying behavior against the supported live RabbitMQ smoke lane.
- Remember that only broker publish I/O runs in worker threads. Result classification, signals, metrics, and database mutation still happen on the main relay thread.

--publish-concurrency 2
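
The threading split described above can be sketched with a bounded thread pool: worker threads perform only the publish call, and the calling thread inspects each result. This is an illustration under assumed names (`publish_batch`, `publish`), not the relay's implementation, and unlike the real relay it uses a pool even when the concurrency is 1.

```python
from concurrent.futures import ThreadPoolExecutor


def publish_batch(rows, publish, publish_concurrency=1):
    """Publish rows with bounded parallelism; classify results on the caller's thread."""
    outcomes = []
    with ThreadPoolExecutor(max_workers=publish_concurrency) as pool:
        # Only the publish call itself is handed to worker threads.
        futures = [(row, pool.submit(publish, row)) for row in rows]
        for row, future in futures:
            # exception() blocks until this publish finishes, then returns
            # the raised exception or None. Classification happens here,
            # on the calling (main) thread, in submission order.
            error = future.exception()
            outcomes.append((row, "published" if error is None else "failed"))
    return outcomes
```

Keeping classification and any database writes on one thread means that code does not need to be thread-safe, at the cost of the main thread becoming the bottleneck once publishing is fast enough.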

Queue Snapshot Refresh Cadence

--queue-snapshot-refresh-seconds controls how often the relay refreshes queue-wide snapshot data used by:

- queue.depth
- dead_letter.count
- oldest_pending_age_seconds
- celery_outbox_batch_processed summary fields

These values are sampled queue-wide gauges, not exact per-batch recomputations. The default is 5.0 seconds, which keeps hot-path DB work bounded while still updating dashboards and logs frequently enough for normal operations.

--queue-snapshot-refresh-seconds 5.0
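
A sampled gauge of this kind behaves like a time-based cache: the expensive queue-wide query runs at most once per refresh window, and every batch in between reuses the last result. The sketch below models that idea; `QueueSnapshot` and `compute` are hypothetical names, not part of the package.

```python
import time


class QueueSnapshot:
    """Cache a queue-wide snapshot and recompute it at most once per window."""

    def __init__(self, compute, refresh_seconds=5.0, clock=time.monotonic):
        self.compute = compute  # callable that runs the queue-wide queries
        self.refresh_seconds = refresh_seconds
        self.clock = clock
        self._value = None
        self._taken_at = None

    def get(self):
        now = self.clock()
        expired = (
            self._taken_at is None
            or now - self._taken_at >= self.refresh_seconds
        )
        if expired:
            # Refresh window elapsed (or first call): hit the database.
            self._value = self.compute()
            self._taken_at = now
        # Otherwise reuse the sampled value without touching the database.
        return self._value
```

A shorter window makes dashboards fresher but adds queue-wide queries to the hot path; a longer one does the opposite.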

Planner Behavior

The shipped selector and dead-letter purge query shapes were verified against large synthetic PostgreSQL 15 and MySQL 8 tables during release hardening.

Observed planner behavior:

- Sparse active outbox on PostgreSQL used a BitmapOr over celery_outbox_pending_idx, celery_outbox_retry_idx, and celery_outbox_stale_idx.
- Sparse active outbox on MySQL used index_merge over the retry/stale selector indexes.
- Dead-letter destructive chunks now order by the active retention field (dead_at, pk or created_at, pk), which lets both PostgreSQL and MySQL use the matching retention index on the chunk-selection path.

When almost every row matches the filter, planners may still prefer the primary key or a sequential scan for ORDER BY ... LIMIT .... That is expected for dense-match tables and does not invalidate the sparse-backlog fast path the package is optimized for.

Monitoring Metrics

The relay emits these StatsD metrics:

| Metric | Type | Tags | Description |
|---|---|---|---|
| `queue.depth` | gauge | | Sampled live backlog gauge refreshed at most once per `--queue-snapshot-refresh-seconds` window |
| `dead_letter.count` | gauge | | Sampled dead-letter backlog gauge refreshed at most once per `--queue-snapshot-refresh-seconds` window |
| `oldest_pending_age_seconds` | gauge | | Sampled age of the oldest live-backlog row |
| `batch.duration_ms` | timing | | Batch processing time |
| `send_latency_ms` | timing | `task_name` | Time from enqueue to publish |
| `messages.published` | counter | `task_name` | Successfully sent |
| `messages.failed` | counter | `task_name`, `exception_type` | Failed (will retry) |
| `messages.exceeded` | counter | `task_name`, `exception_type` | Moved to dead letter |
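
For readers wiring these metrics into a collector, the sketch below shows how the three metric types look on the wire in the plain StatsD line format. The tag suffix uses the DogStatsD-style `|#key:value` extension, which is an assumption here: the source does not say which tag encoding or metric prefix the relay uses, and `statsd_line` is a hypothetical helper.

```python
def statsd_line(name, value, kind, tags=None):
    """Format one metric as a StatsD line; tags use the DogStatsD extension."""
    # Standard StatsD type suffixes: gauge, counter, and timing (milliseconds).
    suffix = {"gauge": "g", "counter": "c", "timing": "ms"}[kind]
    line = f"{name}:{value}|{suffix}"
    if tags:
        # Assumption: DogStatsD-style tags, sorted for stable output.
        line += "|#" + ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
    return line


print(statsd_line("queue.depth", 42, "gauge"))
print(statsd_line("batch.duration_ms", 18, "timing"))
print(statsd_line("messages.published", 1, "counter", {"task_name": "send_email"}))
```

When alerting, keep in mind that the gauges are sampled on the snapshot refresh window, while the counters and timings are emitted per message or per batch.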