Job services architecture and tuning
Resource Manager uses the Celery package to manage jobs. Job data is
stored in the Redis server (in Control Center, Infrastructure > redis)
and job queues are managed by the RabbitMQ server (Infrastructure >
RabbitMQ). The zenjobs command is a lightweight Python wrapper around
the celery command, and invoking zenjobs launches a Celery worker.
Always use the zenjobs command; the celery command on its own does not
work.
Each instance of the zenjobs service runs one job at a time. To
increase throughput, increase the number of instances. While there is no
limit to the number of instances you can run, each job needs to update
the model database (ZODB), so some level of ZODB contention is normal.
When a ZODB conflict occurs, the contending job is returned to the queue
with a delay of up to 30 seconds and marked with the Retry status.
Jobs are retried up to 5 times by default; to change the limit, set
the zodb-max-retries variable in zenjobs.conf. Typically, running too
many zenjobs instances increases contention and slows overall
throughput, but the exact number of instances that causes a slowdown
varies by deployment architecture and job types.
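
The retry behavior described above can be sketched as a small simulation. This is not the zenjobs implementation: the ConflictError class, the run_with_retries function, and the "Success" and "Failure" status names are hypothetical stand-ins, and the sketch assumes that a job fails once its retries are used up, which the text does not state. Only the limit of 5 retries, the delay of up to 30 seconds, and the Retry status come from the description.

```python
import random

MAX_RETRIES = 5          # default retry limit described above (zodb-max-retries)
MAX_DELAY_SECONDS = 30   # upper bound on the requeue delay described above


class ConflictError(Exception):
    """Hypothetical stand-in for a ZODB write conflict."""


def run_with_retries(job, max_retries=MAX_RETRIES, rng=random.random):
    """Run job(); on each conflict, record a Retry with a delay and try again.

    Returns (final_status, history), where history holds one
    ("Retry", delay_seconds) entry per simulated requeue. The delay is only
    recorded, not slept, so the sketch runs instantly.
    """
    history = []
    retries = 0
    while True:
        try:
            job()
            return "Success", history
        except ConflictError:
            if retries >= max_retries:
                # Assumption: a job that exhausts its retries is failed.
                return "Failure", history
            retries += 1
            delay = rng() * MAX_DELAY_SECONDS
            history.append(("Retry", delay))
```

With the default limit, a job that always conflicts is attempted six times in total: the first run plus five retries.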
Jobs that run too long are marked as failed and shut down, and an event
is reported to the event processing service. The zenjobs.conf file
includes two variables that determine when and how a job is shut down:
- The default value of job-soft-time-limit is 18000 seconds (5 hours). When a job exceeds this limit, Celery sends a SIGUSR1 signal to shut it down.
- The default value of job-hard-time-limit is 21600 seconds (6 hours). When a job exceeds this limit, Celery sends a SIGKILL signal to shut it down.
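
To make the two thresholds concrete, the following sketch maps a job's elapsed runtime to the signal described above. The function name signal_for_runtime is hypothetical and the code only illustrates the default values; it is not how Celery or zenjobs enforces the limits. Note that a process can catch SIGUSR1 and clean up, whereas SIGKILL cannot be caught, so with the defaults a job has one hour between the two limits to stop on its own.

```python
SOFT_LIMIT_SECONDS = 18000  # job-soft-time-limit default (5 hours)
HARD_LIMIT_SECONDS = 21600  # job-hard-time-limit default (6 hours)


def signal_for_runtime(elapsed_seconds,
                       soft=SOFT_LIMIT_SECONDS,
                       hard=HARD_LIMIT_SECONDS):
    """Return the name of the signal associated with this runtime, or None.

    A limit applies only once the runtime exceeds it, so a job sitting
    exactly at a limit is not yet signaled.
    """
    if elapsed_seconds > hard:
        return "SIGKILL"
    if elapsed_seconds > soft:
        return "SIGUSR1"
    return None
```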
The zenjobs service uses the zodb.conf file to connect to the model
database. Its defaults rarely need tuning.
To diagnose issues, you can customize logging per container with the
zenjobs_log_levels.conf file. You do not need to restart the zenjobs
service; changes take effect when the file is saved. When you restart
the service, the changes are reverted to the values stored in the
Resource Manager service definition.
The zenjobs scheduler service removes the log files of jobs that are
not in Redis, which is configured to delete jobs after 7 days. The
service uses the zenjobs_schedules.yaml file to determine when to
search for job logs to remove. The default is once per hour, at the
beginning of the hour, using a cron schedule. You can also use an
interval schedule. One instance of the zenjobs scheduler service
handles any deployment size.
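
The difference between the two schedule types can be illustrated with a short sketch. It does not reproduce the syntax of zenjobs_schedules.yaml, which is not shown here; the function names are hypothetical, and the code only assumes the usual meaning of the terms: a cron schedule fires at fixed clock times, while an interval schedule fires a fixed period after the previous run.

```python
from datetime import datetime, timedelta


def next_top_of_hour(now):
    """Next run for an hourly cron-style schedule that fires at minute 0."""
    start_of_this_hour = now.replace(minute=0, second=0, microsecond=0)
    return start_of_this_hour + timedelta(hours=1)


def next_interval_run(last_run, every):
    """Next run for an interval schedule: a fixed period after the last run."""
    return last_run + every


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 10, 17, 30)
    print(next_top_of_hour(now))                       # aligned to the clock
    print(next_interval_run(now, timedelta(hours=1)))  # relative to last run
```

Both schedules run once per hour in this example, but only the cron-style one is aligned to the beginning of the hour.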