Cleaning your Render Farm
Version: Deadline 9.0 and later
You may not realize it, but there are many background operations that are responsible for keeping your Deadline render farm running smoothly. Three very important ones are the Pending Job Scan, House Cleaning, and Repository Repair operations. In this blog entry, we'll be taking a peek behind the curtains at all three, and hopefully leave you with a better understanding of their role and impact on your farm.
PENDING JOB SCAN
This operation's name pretty much describes its role, as it's responsible for scanning the pending Jobs in your farm and determining which (if any) can be released for rendering. If a pending Job is dependent on one or more Jobs, it will be released if those Jobs are complete. If a pending Job is scheduled to start at a later time, it will be released if that time has come. This is the main task that this operation performs, and it's an important one if you are relying on Job dependencies or scheduled Jobs in your render pipeline.
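The release decision described above can be sketched in a few lines of Python. This is only an illustrative model, not Deadline's implementation: the `Job` class, its field names, the status strings, and the choice to require both conditions at once are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Job:
    # Hypothetical, simplified stand-in for a Deadline Job.
    job_id: str
    status: str = "Pending"
    depends_on: List[str] = field(default_factory=list)
    scheduled_start: Optional[datetime] = None


def can_release(job, jobs_by_id, now):
    """Return True if a pending Job has no unmet dependency or start time."""
    if job.status != "Pending":
        return False
    # Every Job this one depends on must exist and be complete.
    deps_done = all(
        dep in jobs_by_id and jobs_by_id[dep].status == "Completed"
        for dep in job.depends_on
    )
    # A scheduled Job is held back until its start time has come.
    time_ok = job.scheduled_start is None or job.scheduled_start <= now
    return deps_done and time_ok


def pending_job_scan(jobs, now):
    """Release every eligible pending Job and return the released IDs."""
    jobs_by_id = {job.job_id: job for job in jobs}
    released = []
    for job in jobs:
        if can_release(job, jobs_by_id, now):
            job.status = "Queued"
            released.append(job.job_id)
    return released
```

Note that in this sketch a Job released during a scan is only queued, not complete, so anything depending on it stays pending until a later scan. That is one reason a shorter scan interval can help pipelines with long dependency chains.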
A lesser-known task handled by this operation is processing Asynchronous Job Events. By default, Job events (e.g., Job submission, completion, deletion) are processed synchronously as soon as the corresponding Job operation occurs, but there might be situations where this default behavior isn't ideal. For example, if you have a custom event plugin that triggers when Jobs are suspended and resumed, you might notice that it impacts the performance of the Monitor when suspending or resuming a large number of Jobs simultaneously. By enabling this feature, those events can be queued up and processed later by the Pending Job Scan operation. Just note that when this option is enabled, the Job submission event will still be processed synchronously.
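To make the difference between the two modes concrete, here is a minimal sketch of a dispatcher that either runs an event handler immediately or queues the event for a later scan. The `JobEventDispatcher` class, its method names, and the event name strings are hypothetical and exist only for this illustration. They are not part of the Deadline API.

```python
from collections import deque


class JobEventDispatcher:
    """Toy model of synchronous vs. asynchronous Job event handling."""

    def __init__(self, handler, asynchronous=False):
        self.handler = handler            # callable(event_name, job_id)
        self.asynchronous = asynchronous  # models the Repository option
        self.queue = deque()

    def raise_event(self, event_name, job_id):
        # The submission event is always handled right away,
        # even when asynchronous processing is enabled.
        if self.asynchronous and event_name != "submitted":
            self.queue.append((event_name, job_id))
        else:
            self.handler(event_name, job_id)

    def process_queued_events(self):
        """Drain the queue; in this model the Pending Job Scan calls this."""
        processed = 0
        while self.queue:
            event_name, job_id = self.queue.popleft()
            self.handler(event_name, job_id)
            processed += 1
        return processed
```

With `asynchronous=True`, suspending a large batch of Jobs only appends to the queue, so the caller is not blocked while a slow handler runs.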
Asynchronous Job Events can be enabled in the Pending Job Scan settings in the Repository Options.
Note that there are other Pending Job Scan settings available here, but they are very similar to some of the House Cleaning and Repository Repair settings, so we'll cover them in a separate section below.
HOUSE CLEANING
This operation has many tasks, but they all fall under the common goal of keeping the Deadline Repository clean and tidy. For example, a common task is to delete or archive Jobs that have been marked for automatic cleanup. Job cleanup is configurable on a per Job basis in the Job Properties, or it can be configured globally in the Job Cleanup settings in the Repository Options.
The Job Cleanup settings also have a section for purging deleted Jobs. When a Job is deleted, it isn't immediately removed from the Deadline Repository. Instead, it exists in a hidden location until it is ready to be purged. The amount of time it sits in this hidden location is based on these Deleted Job Purging settings. Note that during this window of time, it's actually possible to undelete the Job using the Tools menu in the Deadline Monitor. However, once that window of time closes, the House Cleaning operation will purge deleted Jobs from the Deadline Repository permanently.
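The purge window can be thought of as a simple age check on each deleted Job. The sketch below assumes we know when each Job was deleted and how long the window is. The function name, the dictionary layout, and the two-day window in the usage note are hypothetical, not Deadline defaults.

```python
from datetime import datetime, timedelta


def split_deleted_jobs(deleted_at_by_job, purge_after, now):
    """Split deleted Jobs into those to purge and those still undeletable.

    deleted_at_by_job maps a Job ID to the time it was deleted.
    """
    to_purge, still_recoverable = [], []
    for job_id, deleted_at in sorted(deleted_at_by_job.items()):
        if now - deleted_at >= purge_after:
            to_purge.append(job_id)           # window closed: purge for good
        else:
            still_recoverable.append(job_id)  # can still be undeleted
    return to_purge, still_recoverable
```

For example, with `purge_after=timedelta(days=2)`, a Job deleted nine days ago would be purged on the next House Cleaning pass, while one deleted twelve hours ago could still be undeleted.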
House Cleaning can also clean up old Deadline Workers. This option is disabled by default, and can be enabled in the Worker Settings in the Repository Options. If enabled, Workers that have been Offline or Stalled for the specified number of days will automatically be purged from the Deadline Repository. This can be useful if you have a lot of temporary Workers (rental nodes, cloud nodes, etc) and you want them to be removed automatically if they haven't been used for a while.
Finally, there are House Cleaning settings that can be configured in the Repository Options.
These settings are used to control the maximum number of objects that House Cleaning can clean up during each House Cleaning session. For example, if you recently deleted 10,000 Jobs, and it is time to purge them from the Deadline Repository, it could take a while to purge all 10,000 Jobs at once. This can then delay other House Cleaning tasks as a result. By enabling the Maximum Deleted Jobs option, you can ensure that all 10,000 Jobs will eventually be purged, without impacting other House Cleaning related tasks.
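As a rough illustration of how a per-session cap spreads a large backlog over several sessions, consider the sketch below. The function name and the cap of 1,000 used in the usage note are assumptions for the example, not Deadline's actual behavior or defaults.

```python
def purge_session(pending_purge, max_deleted_jobs=None):
    """Purge at most max_deleted_jobs Jobs in one House Cleaning session.

    Returns (purged, remaining). With no cap, everything is purged at once.
    """
    if max_deleted_jobs is None:
        return list(pending_purge), []
    purged = list(pending_purge[:max_deleted_jobs])
    remaining = list(pending_purge[max_deleted_jobs:])
    return purged, remaining
```

With a backlog of 10,000 Jobs and a hypothetical cap of 1,000, ten sessions would clear the backlog, and each session would finish its purge step quickly enough to leave time for the other House Cleaning tasks.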
As mentioned earlier, the common House Cleaning settings will be covered in a section below.
REPOSITORY REPAIR
Like House Cleaning, this operation also has many tasks. The key difference is that the Repository Repair operation is responsible for detecting and fixing issues in the Deadline Repository that aren't part of Deadline's normal operation. One of these key tasks is Stalled Worker detection. A Deadline Worker is considered to be Stalled if it hasn't updated its state in the Deadline Repository for a certain period of time. This can happen if the machine that the Worker is running on loses power or its network connection. It can also happen if the Worker application crashes. Regardless of the reason, this situation will be detected by the Repository Repair operation, and more importantly, any tasks that the Worker was currently rendering will be requeued so that another Worker can pick them up.
By default, the number of minutes before a Worker is marked as Stalled is 10. This can be configured in the Worker Settings in the Repository Options. Note that Stalled application detection is also available for Pulse, Balancer, License Forwarder, and Proxy Server, and all are configurable in their respective pages in the Repository Options.
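Stalled detection boils down to comparing each Worker's last state update against a threshold. Here is a minimal sketch that uses the 10-minute default mentioned above. The `Worker` class, its fields, and the state strings are hypothetical simplifications, not Deadline's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Worker:
    # Hypothetical, simplified stand-in for a Deadline Worker record.
    name: str
    last_update: datetime
    state: str = "Rendering"
    current_tasks: List[str] = field(default_factory=list)


def detect_stalled_workers(workers, now, stalled_after=timedelta(minutes=10)):
    """Mark silent Workers as Stalled and return the task IDs to requeue."""
    requeued = []
    for worker in workers:
        if worker.state in ("Offline", "Stalled"):
            continue  # already known to be inactive
        if now - worker.last_update > stalled_after:
            worker.state = "Stalled"
            # Hand the Worker's tasks back so another Worker can take them.
            requeued.extend(worker.current_tasks)
            worker.current_tasks = []
    return requeued
```

A shorter threshold gets tasks requeued sooner after a crash, but it also raises the chance of flagging a Worker that is merely slow to report, so the default is a reasonable starting point.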
In addition to Stalled application detection, Pulse, Balancer, and License Forwarder have a redundancy system that allows a Secondary instance of the application to be promoted to Primary if the original Primary is marked as Stalled. The Repository Repair operation is responsible for handling these promotions. We actually have a blog entry on redundancy, so we suggest giving it a read if you're interested in this subject.
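The promotion step can be pictured with the small sketch below. It is a toy model under stated assumptions: the dictionary keys, the role strings, and the rule that the first healthy Secondary wins (with the stalled Primary demoted) are all hypothetical, and the real election logic may differ.

```python
def promote_if_primary_stalled(instances):
    """Return the name of the Primary after a repair pass, or None.

    Each instance is a dict with 'name', 'role', and 'stalled' keys.
    """
    primary = next((i for i in instances if i["role"] == "Primary"), None)
    if primary is not None and not primary["stalled"]:
        return primary["name"]  # healthy Primary: nothing to do
    for candidate in instances:
        if candidate["role"] == "Secondary" and not candidate["stalled"]:
            if primary is not None:
                primary["role"] = "Secondary"  # assumption: demote old Primary
            candidate["role"] = "Primary"
            return candidate["name"]
    return None  # no healthy instance available to promote
```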
The Repository Repair operation can also repair Job Tasks and Limits if they get into a bad state in the Deadline Repository. For example, if a database issue causes a Worker to lose track of the Task it's rendering, and the Worker moves on to another Task, the original Task is orphaned because it's in the Rendering state and nothing is actually rendering it. The same problem could apply to any Limits that the Worker was originally holding. The Repository Repair operation will detect these problems and fix them so that the render farm continues to operate smoothly.
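Orphaned Task detection can be sketched as a cross-check between what each Task record claims and what each Worker reports. The two dictionaries and the function names below are hypothetical simplifications used only to illustrate the idea.

```python
def find_orphaned_tasks(rendering_tasks, worker_current_task):
    """Return Tasks marked as Rendering that no Worker is actually rendering.

    rendering_tasks maps a Task ID to the Worker name recorded on the Task.
    worker_current_task maps a Worker name to the Task ID it reports.
    """
    orphaned = []
    for task_id, worker_name in sorted(rendering_tasks.items()):
        # The Task is orphaned if its Worker has moved on or is unknown.
        if worker_current_task.get(worker_name) != task_id:
            orphaned.append(task_id)
    return orphaned


def requeue_orphaned(task_states, orphaned):
    """Put orphaned Tasks back in the queue so they get picked up again."""
    for task_id in orphaned:
        task_states[task_id] = "Queued"
    return task_states
```

The same cross-check idea would apply to Limits: a Limit slot held by a Worker that no longer reports holding it could be released in the same pass.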
Finally, there are Repository Repair settings in the Repository Options for enabling the Primary Election process for the various Deadline applications that support it.
The next section will cover the general configuration settings that all three operations have in common.
GENERAL CONFIGURATION SETTINGS
As mentioned above, there are some general settings that the Pending Job Scan, House Cleaning, and Repository Repair operations have in common, which can be configured in the Repository Options.
These settings include:
- [OPERATION] Interval: How often each operation is performed. While the defaults for House Cleaning and Repository Repair are typically fine, you may want to consider reducing the Pending Job Scan interval if your render pipeline relies heavily on Job dependencies.
- Allow Workers to Perform [OPERATION] if Pulse is not Running: If Pulse is running, it will perform each operation at its respective interval. If Pulse isn't running, Workers can perform these operations in between rendering tasks. As a result, the operations will still be performed, albeit in a less predictable manner. However, this might not always be ideal. For example, if all the Workers are running in a remote location, these operations can take much longer, so you can choose to disable this feature if you don't want your Workers performing these operations.
- Run [OPERATION] in a Separate Process: By default, each operation runs in a thread belonging to the Deadline application that is performing it. As a result, all tasks performed by the operation are logged in the Deadline application's log. For larger render farms, these operation logs can be quite dense, in which case you can choose to run the operation in a separate process, and log its tasks to a separate log. In addition, you can kill the operation process if it's taking too long to finish (if this happens, the operation will pick back up where it left off when it runs again).
It is also possible to use Event Plugins to perform "cron job" style tasks whenever the House Cleaning or Repository Repair operations are performed. Check out our cron job blog entry for more information!
RUNNING OPERATIONS FROM THE MONITOR
Before we wrap up, we should mention that it's also possible to run these operations from the Tools menu in the Deadline Monitor (if you're in Super User Mode or have the required permissions). This can be useful if you want to manually run these operations between their configured intervals, or if you're testing an Event Plugin that triggers on the House Cleaning or Repository Repair events.
In addition, you can use the Monitor to see the last time that each operation was performed. In the Monitor Options, simply enable the Show [OPERATION] Updates in Status Bar setting for each operation.
After saving your Monitor options, you'll see when each operation was last performed in the status bar in your Monitor. This information can be useful for administrators who want to know when these operations are being performed.
That pretty much covers everything you need to know about the Pending Job Scan, House Cleaning, and Repository Repair operations. While you can't see them in action, at least now you know they exist, and that they're working hard to keep your render farm running smoothly!