Demystifying Pools, Groups and Limits

INTRODUCTION

Deadline takes a decentralized approach to job scheduling, which sets it apart from similar applications. Instead of relying on a Manager application that decides which job is processed by which network machine (known as a "Worker" in Deadline), Deadline uses smart Workers that poll the Repository and pick up available work based on each job's settings. This removes a significant vulnerability - if a centralized manager crashes, no work is distributed, whereas in a decentralized setup, even if 99 out of 100 Workers went offline, the last one standing would continue working on jobs if configured correctly.

Thus, understanding how to set up the various job properties that affect the scheduling is the key to getting the full power of Deadline.

In the following overview, we will take a closer look at the main Scheduling and Limiting options and show how they are applied in practice. Also, check out the related blog post "Pools, Groups and You".

UNDERSTANDING DEADLINE POOLS

The main property affecting when and whether a job will be picked up by a Worker is the Pool. When Deadline is being set up, render nodes added to the Repository are automatically placed in the "none" Pool, but they should be added to one or more user-defined Pools to better organize the render farm.

In a typical production environment, Pools are mostly used to split the render farm between several projects while making sure that if one of the projects has fewer jobs to render, its Workers are used on the other projects instead of sitting idle.

The key to understanding Pools is that the order in which the pools are listed in a Worker's Pool list DOES MATTER. This means that if a Worker is added to, say, three Pools - "Project A", "Project B" and "Project C", in that particular order, if the Worker is looking for work and finds several jobs in the Repository, it will always prefer jobs assigned to "Project A" over the other two. If there is no "Project A" job, then it will fall back to the others.

This fact lets you split the Workers in your render farm between projects without wasting resources. For example, let's assume you have 40 Workers in total registered in the Repository. An important project called "Project A" that should use half of the farm is assigned as the first (TOP PRIORITY) Pool to 20 machines. A second project called "Project B" should use at least 15 machines and "Project C" should use at least 5. So 15 machines are set to prefer "Project B", with "Project A" and "Project C" in second and third place on their Pools list, and 5 Workers get "Project C" on top, with "Project A" and "Project B" in second and third place.

Let's assume that a user submits a "Project A" job to the Farm, which is currently idle (no Workers have started rendering yet). One by one, the Workers check the Repository and find a job that has "Project A" set as the Pool name. Since there are no other jobs on the farm and all 40 Workers have "Project A" on their list, albeit in different positions, all 40 machines start rendering tasks of this new job.

Now if another user working on "Project B" submits a job to the Farm, any Workers that are done with their current task will check the Repository for more work and if "Project B" happens to be higher on their list than "Project A", they will move to render the new "Project B" job instead. Once they are done with the "Project B" job, if the "Project A" job has still not finished rendering all its tasks, the Workers will move back to helping with it. Similarly, if all "Project A" jobs are finished and there are no "Project C" jobs either, all 40 machines will now work on "Project B".

Similarly, if a user submits a job to the "Project C" pool, the 5 machines that have that Pool on top of their list will start rendering it as soon as they have finished their previous task. If the Workers preferring "Project A" and "Project B" jobs cannot find any more tasks of jobs assigned to these two Pools, they will also gladly join the rendering of the "Project C" job because that Pool is also on their list, just not as top priority.

Obviously, when there are jobs from all three projects, the distribution will be 20/15/5 according to the top Pool listed in each machine's Pools list.

As you can see, assigning multiple Pools to groups of Workers and making sure that portions of the Farm prefer one Pool over the others allows you to prioritize the resources by Project without having machines sitting idle - if no jobs are available in the higher priority Pools, the Workers will move on to jobs in the lower priority Pools.

Submitting a Job to the "none" Pool will make the Job lowest priority for all Deadline Workers. Whenever there are no other jobs assigned to real Pools to work on, Workers will consider the "none" Pool jobs.
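To make the 40-Worker example concrete, here is a minimal Python sketch of the pool-order rule. It is an illustrative model only, not Deadline's actual code or API: the function names are hypothetical, the second and third Pools of the "Project A" machines are assumed (the example does not specify them), and the "none" Pool is modeled as an implicit last entry on every Worker's list.

```python
from collections import Counter

NONE_POOL = "none"


def preferred_pool(worker_pools, pools_with_jobs):
    """Return the Pool this Worker would pull a job from, or None if no job matches."""
    # Walk the Worker's Pool list in order: earlier entries always win.
    for pool in worker_pools:
        if pool in pools_with_jobs:
            return pool
    # Modeling assumption: "none" Pool jobs are considered only as a last resort.
    if NONE_POOL in pools_with_jobs:
        return NONE_POOL
    return None


def build_example_farm():
    """Build the 40 Pool lists from the example (A-first order of B and C is assumed)."""
    a, b, c = "Project A", "Project B", "Project C"
    return [[a, b, c]] * 20 + [[b, a, c]] * 15 + [[c, a, b]] * 5


def distribution(farm, pools_with_jobs):
    """Count how many Workers end up rendering for each Pool."""
    return Counter(preferred_pool(pools, pools_with_jobs) for pools in farm)


farm = build_example_farm()
# Jobs queued for all three projects: the farm splits 20 / 15 / 5.
print(distribution(farm, {"Project A", "Project B", "Project C"}))
# Only a "Project A" job is queued: all 40 Workers join it.
print(distribution(farm, {"Project A"}))
```

Changing the order of a Worker's list in `build_example_farm` is enough to shift the split, which is exactly why the order of the Pools matters.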

RESOLVING COMPETING JOB PRIORITIES

On big projects, there could be hundreds of jobs with hundreds or thousands of Tasks submitted to the SAME Pool. When a Worker looks for work, it could find a large number of possible jobs. By default, Deadline is set up so that Workers use the Pool Priority first. As we saw in the previous topic, if there are two or more jobs assigned to different Pools, the jobs assigned to the Pool that is highest on the Worker's Pool list will be preferred.

But what if there are two or more such jobs that belong to the top Pool?

The default behaviour of Deadline is to use the Numeric Priority of the Job next. This is the value between 0 and 100 assigned to the job at submission time which defines how urgent the job is in the eyes of the user. So if there are three jobs assigned to Pool "Project A" and one of the 20 Workers from the previous example set to render "Project A" with highest priority has to decide which of the three to start working on, it will look at the Priority value. If one of the Jobs has a value of 50, another 55 and the third is 90, the last one will obviously be preferred.

What if two or all three of these jobs were submitted with the same Priority? This is more common than you might expect, since most production supervisors ask the artists to keep variations in this value to a minimum. In that case, the Submission Time of the Job is used to resolve the conflict - a job submitted earlier will be rendered first. Given that the time stamp includes date and time with seconds precision, the probability of two jobs having exactly the same Pool, Priority and Submission Time is very low.
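The default Pool, then Priority, then Submission Time order can be expressed as a single sort key. The sketch below is a simplified model with hypothetical names (`Job`, `pick_job`), not Deadline's scheduler; it only shows how the three criteria break ties in sequence.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Job:
    name: str
    pool: str
    priority: int        # 0..100, higher means more urgent
    submitted: datetime  # earlier submission wins a tie


def pick_job(worker_pools, jobs):
    """Pick the next job for a Worker using the default Pool > Priority > Date order."""
    candidates = [job for job in jobs if job.pool in worker_pools]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda job: (
            worker_pools.index(job.pool),  # 1. position of the Pool on the Worker's list
            -job.priority,                 # 2. higher numeric Priority first
            job.submitted,                 # 3. earlier Submission Time first
        ),
    )


t0 = datetime(2024, 1, 1, 9, 0, 0)
jobs = [
    Job("comp", "Project A", 50, t0),
    Job("fx", "Project A", 90, t0 + timedelta(minutes=5)),
    Job("lighting", "Project B", 100, t0),
]
# A Worker that prefers "Project A" ignores the Priority 100 job in "Project B"
# and takes the Priority 90 job from its top Pool.
print(pick_job(["Project A", "Project B"], jobs).name)
```

Reordering the tuple in the `key` function is the equivalent, in this model, of switching to one of the alternative orders mentioned below.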

While it is possible to tweak this behaviour to make Deadline resolve the priority using other combinations like Priority>Pool>Date or Date>Priority>Pool, this is not recommended and the majority of Deadline-managed farms use the default Pool>Priority>Date order.

Please see this topic for more details. 

LIMITING FACTORS

So far we have looked at the factors that control the order of job processing - the Priority Factors. In addition to the Job Pools and Priorities, there are several properties that impose limits on the jobs, or prevent Workers from picking up a job at all. Several systems were put in place to control these limits, and each has a different purpose.

GROUPS

Groups were added relatively late in the history of Deadline and are typically used to limit a job to a specific subset of Workers with either similar hardware or similar software characteristics. Unlike Pools, Groups do NOT affect the Job Scheduling; they are a limiting factor that quickly filters out all Workers except those best suited for the Job.

For example, all Workers with 16 cores / 32 GB of RAM could be listed in a group called "SimMachines", while machines with 4 cores and 8 GB could be in a group called "TheSlowOnes". When a user is submitting a heavy fluid simulation that can use all the cores and memory it can get, they could select the "SimMachines" group. This would make sure that only Workers that have the correct hardware configuration will pick up the Job.

If the Job was submitted as part of "Project A" from the previous example, and another similar simulation job was submitted as part of "Project B" to the other Pool, then in addition to resolving the scheduling via Pools and Priorities, the possible Workers will also be checked against the "SimMachines" group, and only if they are listed in it will they start working on these Jobs. As you can see, Groups give you another layer of control on top of the Pools.

Another example would be adding all Workers that have Autodesk Maya installed to a group "Maya" and all Workers that have Autodesk 3ds Max installed to a group "3dsMax". If a machine has both 3D applications installed, it can appear in both Groups. When a user is submitting a job from inside Maya, they could select the Group "Maya" to ensure that only machines that actually have that software installed will attempt to work on the Job.
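Because a Group is a pure filter, it can be modeled as a simple membership test. The following sketch uses made-up Worker names and a hypothetical helper, `workers_in_group`; it is not a Deadline API call.

```python
def workers_in_group(worker_groups, job_group):
    """Return the sorted names of Workers that belong to the Group the job asks for."""
    return sorted(
        name for name, groups in worker_groups.items() if job_group in groups
    )


# Hypothetical farm: a Worker may appear in several Groups at once.
worker_groups = {
    "render01": {"SimMachines", "Maya"},
    "render02": {"TheSlowOnes", "Maya", "3dsMax"},
    "render03": {"SimMachines", "3dsMax"},
}

print(workers_in_group(worker_groups, "SimMachines"))  # the high-spec machines
print(workers_in_group(worker_groups, "Maya"))         # machines with Maya installed
```

Note that the result carries no ordering information: which of the matching Workers actually takes the job is still decided by Pools and Priorities.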

LIMITS

The Limits (formerly referred to as "Limit Groups") are a special, stub-based system which was specifically designed to deal with managing software licenses. In short, Limits are defined to represent the available number of licenses for various software products or plugins. Whenever a Worker attempts to dequeue a Task from a Job that was assigned one or more Limits, it will have to acquire a stub from each Limit. As a result, the total number of available stubs in each Limit will be reduced by one, and when it reaches 0, the next Worker will be unable to acquire a stub and will not pick up a task until another Worker that is already using a license finishes its task and returns its stub.

For example, if you have 20 floating licenses of Maya but 40 machines with Maya installed, you want to limit the number of concurrently running Maya copies to 20. A "Maya" Limit would be set to a total number of 20 and prevent the 21st Worker from picking up a Task from a Job that was submitted with the "Maya" Limit assigned. This would prevent a license-related failure. You can exclude some Workers from this feature (for example if they have a node-locked license of the application in question), or blacklist some Workers to prevent them from acquiring a stub (for example because they don't have the correct software installed).
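The stub mechanism behaves like a counting semaphore with two exception lists. Here is a minimal sketch under stated assumptions: the `Limit` class and `acquire_all` helper are hypothetical names, and handing stubs back when one of several Limits is exhausted is a modeling choice of this sketch, not documented Deadline behaviour.

```python
class Limit:
    """Toy model of a Limit: a fixed pool of stubs plus two exception lists."""

    def __init__(self, name, total, excluded=(), denied=()):
        self.name = name
        self.total = total
        self.excluded = set(excluded)  # Workers that bypass the Limit (e.g. node-locked license)
        self.denied = set(denied)      # Workers that may never acquire a stub
        self.holders = set()           # Workers currently holding a stub

    def acquire(self, worker):
        if worker in self.denied:
            return False
        if worker in self.excluded:
            return True  # renders without consuming a stub
        if len(self.holders) >= self.total:
            return False  # all stubs are in use
        self.holders.add(worker)
        return True

    def release(self, worker):
        self.holders.discard(worker)


def acquire_all(limits, worker):
    """A Task needs a stub from every Limit on its Job; undo on partial failure (assumed)."""
    taken = []
    for limit in limits:
        if not limit.acquire(worker):
            for held in taken:
                held.release(worker)
            return False
        taken.append(limit)
    return True


maya = Limit("Maya", total=20, excluded={"nodelocked01"}, denied={"nomaya01"})
results = [maya.acquire(f"render{i:02d}") for i in range(21)]
print(results.count(True), results[-1])  # 20 Workers got a stub, the 21st did not
```

Once any of the 20 Workers calls `release`, the waiting Worker can acquire the freed stub on its next attempt.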

BLACKLISTS AND WHITELISTS

The Blacklist is a Job Property which specifies explicitly by name which Workers may not render the Job in question. Unlike Pools, Groups and Limits, the Blacklist is not created ahead of time but is assigned to the Job before or after it has been submitted. It is also useful during rendering - if a Worker appears to be causing an error or is not producing the desired output for some reason, you can easily right-click the Task in the Monitor and instruct the Job to Blacklist that machine and not allow it to touch a Task of that Job again.

The Whitelist is just a positive list of which machines are explicitly allowed to render the Job. It is useful when the number of machines that may render is much lower than the number of machines that are "bad for the job".

JOB MACHINE LIMIT

The Job Machine Limit is a very general Job Property which specifies how many Workers in total may work on a job at the same time. It does not control exactly which machines; it is a simple initial condition for a Worker to pick up a job. If the Machine Limit is reached, no more Workers will pick up the job even if they meet all other conditions. When the Machine Limit is set to 0, any Workers that match all other requirements will pick up the job.

On a Render Farm with hundreds of machines, not limiting a job at all could cause too many machines to pick up the job at the same time, and this could cause a significant hit on the network in some cases. For example, if a 3D application has to load the scene and all related external references like texture maps and caches from the network before rendering the first frame, having 100+ machines accessing these files simultaneously could be a bad idea. For this reason, the Machine Limit can also be released after a specific percentage of Task completion. So it is possible to specify that a Machine Limit of 10 would be applied to the job, but only until the task has reached, say, 5%. Once a Worker has reached that percentage, it no longer counts against the limit and another Worker will be allowed to pick up the job. This way, you can reduce the network hit and still allow hundreds of Workers to join the work, just a few at a time.
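The staggered start described above can be sketched as a counter that stops counting a Worker once its task passes the release percentage. The `MachineLimit` class and its method names are hypothetical and only model the behaviour described in this section.

```python
class MachineLimit:
    """Toy model of a Job Machine Limit with an optional release percentage."""

    def __init__(self, limit, release_at_percent=None):
        self.limit = limit                    # 0 means no limit at all
        self.release_at = release_at_percent  # None means never release early
        self.counted = set()                  # Workers currently counted against the limit

    def try_join(self, worker):
        if self.limit == 0:
            return True
        if len(self.counted) >= self.limit:
            return False
        self.counted.add(worker)
        return True

    def report_progress(self, worker, percent):
        # Past the release point the Worker keeps rendering but frees its slot.
        if self.release_at is not None and percent >= self.release_at:
            self.counted.discard(worker)

    def finish(self, worker):
        self.counted.discard(worker)


limit = MachineLimit(limit=10, release_at_percent=5)
joined = [limit.try_join(f"render{i:02d}") for i in range(11)]
print(joined.count(True))           # 10 Workers start, the 11th has to wait
limit.report_progress("render00", 5)
print(limit.try_join("render10"))   # a slot opened once render00 passed 5%
```

With the release percentage set, the number of Workers loading the scene at any moment stays at 10, while the number of Workers rendering the job can keep growing.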

TESTING THE SETTINGS

In earlier versions of Deadline, the Scheduling options and Limits were only applied by the Workers at render time. In some cases, especially on large Render Farms with hundreds of Workers and thousands of jobs, it wasn't always easy to determine how the job settings would affect the number of Workers that would be able to work on it.

Deadline 5 added a "Show Workers That Can Render The Selected Job" filter option to the upper left corner of the Workers list. When this option is checked, the rules described in this article will be applied to the available Workers and only those that would actually pick up the highlighted Job due to its Pool, Groups, Limits and Black/Whitelists will appear on the list. This makes it easier to audit the Job's and the Render Farm's settings even before the Job has actually started rendering, or to figure out why a Job hasn't started rendering yet.
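If you want to reason about such an audit outside the Monitor, the rules from this article can be combined into one check that also reports why a Worker is rejected. This is a rough sketch, not the Monitor's implementation: all names are hypothetical, an empty Worker list on the Job is assumed to mean "no restriction", and a Limit is assumed to block a Worker only when it has no free stubs.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    pools: list
    groups: set


@dataclass
class Job:
    pool: str
    group: str
    limits: list = field(default_factory=list)
    listed_workers: set = field(default_factory=set)
    is_whitelist: bool = False  # False: listed_workers is a Blacklist


def blockers(worker, job, free_stubs):
    """Return the reasons this Worker cannot pick up the job (empty list = eligible)."""
    reasons = []
    if job.pool != "none" and job.pool not in worker.pools:
        reasons.append("pool")
    if job.group not in worker.groups:
        reasons.append("group")
    if job.listed_workers:  # assumption: an empty list restricts nothing
        on_list = worker.name in job.listed_workers
        if job.is_whitelist and not on_list:
            reasons.append("whitelist")
        elif not job.is_whitelist and on_list:
            reasons.append("blacklist")
    for limit in job.limits:
        if free_stubs.get(limit, 0) <= 0:
            reasons.append("limit:" + limit)
    return reasons


def workers_that_can_render(workers, job, free_stubs):
    return [w.name for w in workers if not blockers(w, job, free_stubs)]


workers = [
    Worker("render01", ["Project A", "Project B"], {"Maya", "SimMachines"}),
    Worker("render02", ["Project B"], {"Maya"}),
    Worker("render03", ["Project A"], {"3dsMax"}),
]
job = Job("Project A", "Maya", limits=["Maya"], listed_workers={"render01"})
for w in workers:
    print(w.name, blockers(w, job, {"Maya": 3}))
```

Returning the reasons rather than a plain yes/no makes it easier to see which single setting keeps a Job from starting.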

CONCLUSION

As you can see, Deadline provides a powerful set of tools to organize your resources and prioritize your jobs in networks both small and large. You can use any combination of these properties and controls to get the most out of your hardware and software. We hope that this overview will help you unlock Deadline's potential in your daily production work!