Job services architecture and tuning

Resource Manager uses the Celery package to manage jobs. Job data is stored in the Redis server (in Control Center, Infrastructure > redis) and job queues are managed by the RabbitMQ server (Infrastructure > RabbitMQ). The zenjobs command is a lightweight Python wrapper around the celery command; invoking zenjobs launches a Celery worker. Always use the zenjobs command, because invoking the celery command directly does not work.

Each instance of the zenjobs service runs one job at a time. To increase throughput, increase the number of instances. While there is no limit to the number of instances you can run, each job needs to update the model database (ZODB), so some level of ZODB contention is normal. When a ZODB conflict occurs, the contending job is returned to the queue with a delay of up to 30 seconds and marked with the Retry status. By default, jobs are retried up to 5 times; to change the limit, set the zodb-max-retries variable in zenjobs.conf. Typically, running too many zenjobs instances increases contention and reduces overall throughput, but the exact number of instances that causes a slowdown varies by deployment architecture and job types.
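The retry behavior described above can be illustrated with a small simulation. This is only a sketch, not the zenjobs implementation: ConflictError is a stand-in class, run_with_retries is a hypothetical helper, the uniformly random delay is an assumption (the text states only an upper bound of 30 seconds), and the "Failed" outcome after the last retry is also assumed.

```python
import random

MAX_RETRIES = 5          # documented default of zodb-max-retries
MAX_RETRY_DELAY = 30.0   # documented upper bound on the requeue delay, in seconds


class ConflictError(Exception):
    """Stand-in for a ZODB conflict; not the real ZODB exception class."""


def run_with_retries(job, max_retries=MAX_RETRIES, rng=random.random):
    """Run job(); on ConflictError, record a requeue delay and try again.

    Returns (status, delays), where delays lists the simulated requeue
    delays in seconds. Nothing actually sleeps or touches a queue.
    """
    delays = []
    while True:
        try:
            job()
            return "Completed", delays
        except ConflictError:
            if len(delays) >= max_retries:
                # Assumption: a job that exhausts its retries ends as failed.
                return "Failed", delays
            # Assumption: the delay is drawn uniformly from [0, 30) seconds.
            delays.append(rng() * MAX_RETRY_DELAY)
```

With the defaults, a job that always conflicts is attempted six times in total: the initial run plus five retries.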

Jobs that run too long are marked as failed and shut down, and an event is reported to the event processing service. The zenjobs.conf file includes two variables that determine when and how a job is shut down:

  • The default value of job-soft-time-limit is 18000 seconds (5 hours). When a job exceeds this limit, Celery sends a SIGUSR1 signal to shut it down.
  • The default value of job-hard-time-limit is 21600 seconds (6 hours). When a job exceeds this limit, Celery sends a SIGKILL signal to shut it down.
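As a rough illustration of how the two limits relate, the following sketch maps a job's elapsed runtime to the signal named above. The function signal_for_runtime is hypothetical and exists only for this example; the constants are the documented defaults, and signals are represented as plain strings rather than sent to a process.

```python
SOFT_LIMIT = 18000  # job-soft-time-limit default, in seconds (5 hours)
HARD_LIMIT = 21600  # job-hard-time-limit default, in seconds (6 hours)


def signal_for_runtime(elapsed, soft=SOFT_LIMIT, hard=HARD_LIMIT):
    """Return the name of the signal sent at this runtime, or None.

    elapsed is the job's runtime in seconds. A job must exceed a limit,
    not merely reach it, before the corresponding signal applies.
    """
    if elapsed > hard:
        return "SIGKILL"
    if elapsed > soft:
        return "SIGUSR1"
    return None


# Time between the soft and hard limits with the default values.
GRACE_SECONDS = HARD_LIMIT - SOFT_LIMIT
```

With the defaults, a job that does not stop after the SIGUSR1 signal has one more hour before the SIGKILL signal is sent.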

The zenjobs service uses the zodb.conf file to connect to the model database. Its defaults rarely need tuning.

To diagnose issues, you can customize logging per container with the zenjobs_log_levels.conf file. You do not need to restart the zenjobs service; changes take effect when the file is saved. The changes do not persist, however: when you restart the service, the log levels revert to the values stored in the Resource Manager service definition.

The zenjobs scheduler service removes the log files of jobs that are no longer in Redis. Redis is configured to delete jobs after 7 days. The service uses the zenjobs_schedules.yaml file to determine when to search for job logs to remove. The default is a cron schedule that runs once per hour, at the beginning of the hour. You can also use an interval schedule. One instance of the zenjobs scheduler service handles any deployment size.
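The difference between the two schedule types can be sketched as follows. Both helper functions are hypothetical and do not reflect the format of zenjobs_schedules.yaml. The sketch assumes the usual semantics: a cron schedule is aligned to the wall clock, while an interval schedule is measured from the previous run.

```python
from datetime import datetime, timedelta


def next_top_of_hour(now):
    """Next run of an hourly cron schedule: the first minute-0 instant after now."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


def next_interval_run(last_run, interval):
    """Next run of an interval schedule: a fixed offset from the previous run."""
    return last_run + interval


start = datetime(2024, 1, 1, 10, 15)
cron_next = next_top_of_hour(start)                         # 2024-01-01 11:00
interval_next = next_interval_run(start, timedelta(hours=1))  # 2024-01-01 11:15
```

Under these assumptions, a cron schedule runs at the same clock times regardless of when the service started, while an interval schedule drifts with the time of the last run.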