To ensure that all research labs get their fair share of the cluster, and to account for differences in the hardware being used, we utilize Slurm's built-in job accounting and fairshare system. Every lab has a base Share of the community-wide system, which is governed by the Gratis Share purchased by the Faculty of Arts and Sciences and distributed equally to all labs. In addition, Shares that individual labs acquire by buying hardware are added to their base Share. The Fairshare score of a lab is then calculated from its Share versus the amount of the cluster it has actually used. This Fairshare score is used to assign priority to the lab's jobs relative to other users on the cluster. This keeps individual labs from monopolizing the resources, which would be unfair to labs that have not used their fair share for quite some time. Currently, we account for the fraction of the compute node used in terms of both CPU usage and memory usage using Slurm's Trackable RESources (TRES).
Fairshare is a portmanteau that pretty much expresses what it is. Essentially, fairshare is a way of ensuring that users get their appropriate portion of a system. Unfortunately, the term is also used confusingly for different parts of fairshare: the fraction of the system users are granted, the score that the system assigns to users based on their usage, and the priority that users are assigned based on their usage. For the sake of the discussion below, we will use the following terms. Share is the portion of the system users have been granted. Usage is the amount of the system users have actually used. Fairshare score is the value the system calculates based on the user's usage. Priority score is the priority assigned based on the user's fairshare score.
While Fairshare may seem complex and confusing, it is actually quite logical once you think about it. The scheduler needs some way to adjudicate who gets what resources. Different groups on the cluster have been granted different resources for various reasons. In order to serve the great variety of groups and needs on the cluster a method of fairly adjudicating job priority is required. This is the goal of Fairshare. Fairshare allows those users who have not fully used their resource grant to get higher priority for their jobs on the cluster, while making sure that those groups that have used more than their resource grant do not overuse the cluster. The cluster is a limited resource and Fairshare allows us to ensure everyone gets a fair opportunity to use it regardless of how big or small the group is.
Trackable RESources (TRES)
Slurm Trackable RESources (TRES) allow the scheduler to charge users for how much they have used different features on the cluster. This is important because usage of the cluster factors into the Fairshare calculation. These TRES charge backs vary from partition to partition. You can see what the TRES charge back is for a partition by running:
scontrol show partition <partitionname>
On Odyssey we set TRES for both CPU usage and Memory usage. For most partitions we charge back for CPUs based on the type of CPU being used. We normalize TRES to 1.0 for Intel Broadwell chips. For other chips we calculate the TRES by taking the theoretical peak Floating Point OPerations (FLOPs) for a single core of that CPU and dividing it by the theoretical peak for a single core of the Intel Broadwell chips. With this weighting we end up with the following TRES per core:
|CPU type|TRES per core|
|AMD Abu Dhabi|0.25|
|Intel Sandy/Ivy Bridge|0.5|
|Intel Broadwell|1.0|
It may seem to be a penalty to charge more for Broadwell than for Abu Dhabi, but in the end it is not. Jobs running on Broadwell cores will run roughly 4 times faster than on Abu Dhabi cores. Thus the actual charge back to the user should be about the same on a per job basis; it is just a question of picking the right resource for the job you are running.
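The per job equivalence described above can be checked with a short sketch. The function and dictionary names are hypothetical, memory charges are ignored, and the "roughly 4 times faster" figure is idealized as exactly 4.

```python
# Per-core CPU TRES weights quoted above (Broadwell is the 1.0 reference).
CORE_TRES = {
    "broadwell": 1.0,
    "sandy_ivy_bridge": 0.5,
    "abu_dhabi": 0.25,
}


def job_charge(cpu_type, cores, runtime_seconds):
    """CPU-only charge in TRES-seconds for one job (illustrative helper)."""
    return CORE_TRES[cpu_type] * cores * runtime_seconds


# The same 8-core job: one hour on Broadwell, and (by assumption) exactly
# four times longer on Abu Dhabi.
broadwell_charge = job_charge("broadwell", 8, 3600)
abu_dhabi_charge = job_charge("abu_dhabi", 8, 4 * 3600)

print(broadwell_charge, abu_dhabi_charge)  # 28800.0 28800.0
```

Under these assumptions both runs cost the same number of TRES-seconds, which is the point of weighting cores by their relative speed.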
In the case of memory we set the TRES per GB using the following formula:
NumCore*CoreTRES/TotalMem
where NumCore is the number of cores per node, CoreTRES is the TRES score for that type of core, and TotalMem is the total available memory for the node in GB. We weight memory like this because if a user uses up all the memory on a node, the scheduler cannot schedule another job on that node even if there are available cores. The opposite is also true: if all the cores are used up, the scheduler cannot schedule another job there even if there is free memory. Thus memory and CPU are exhaustible resources that impact each other. The above weighting ensures that memory costs the same as the CPUs on a given node. For instance, let's say you have a node that has 128 GB of RAM and 32 Intel Broadwell cores. In this case every 4 GB of RAM used should be equivalent to a single core being used. Thus we should charge a TRES of 1.0 for 4 GB used, or 0.25 for every GB used. In the case of an AMD Abu Dhabi node with 64 cores and 256 GB of RAM, you have the same scenario, but now the Abu Dhabi cores are worth 4 times less, thus the memory is also worth 4 times less, so it is 0.0625 for every GB used.
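As a minimal sketch, the memory formula above can be evaluated for the two example nodes. The function name is hypothetical; the node sizes and core weights are the ones quoted in the text.

```python
def mem_tres_per_gb(num_cores, core_tres, total_mem_gb):
    """Memory TRES per GB so that a full node of memory costs as much as
    a full node of cores: NumCore * CoreTRES / TotalMem."""
    return num_cores * core_tres / total_mem_gb


# 32 Broadwell cores (TRES 1.0 each) with 128 GB of RAM.
broadwell_mem = mem_tres_per_gb(32, 1.0, 128)

# 64 Abu Dhabi cores (TRES 0.25 each) with 256 GB of RAM.
abu_dhabi_mem = mem_tres_per_gb(64, 0.25, 256)

print(broadwell_mem)  # 0.25
print(abu_dhabi_mem)  # 0.0625
```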
There is one exception to the above TRES rules, and that is the requeue partitions, such as gpu_requeue. Since jobs in these partitions can be interrupted by higher priority jobs at any time, there could be a loss of computation time. This is especially true for jobs that are not able to snapshot their progress and restart from where they left off. Studies have shown that to make this type of model break even in terms of cost you need to charge back roughly half of what you normally would. So for the requeue partitions we charge a flat rate of 0.5 for CPU and 0.125 per GB for Memory. Since the requeue partitions contain all our hardware, users can get access to normally very high cost CPUs for less. Thus if a user needs to run a lot of jobs, the best way to optimize throughput and usage is to build their jobs to leverage the cheap resources in the requeue partitions. One should be aware, though, that the available cores in these partitions vary wildly depending on how active any given primary partition is.
On Odyssey each user is associated with their primary group. This lab group is what is called an Account in Slurm. Users belong to Accounts, and Accounts have Shares granted to them. These Shares determine how much of the cluster that group has been granted. When users run jobs, their usage is charged back to the Account (i.e. lab) they belong to.
Shares granted an Account come in three types that are summed together. The first type is the Gratis Share. This Gratis Share is the Share given to all labs that are part of the cluster owing to the investment that Research Computing, via the Faculty of Arts and Sciences, has made in Odyssey. This Gratis Share is calculated by summing the CPU TRES for all the nodes in the public partitions, excepting the requeue partitions, and then dividing by the total number of Accounts on Odyssey. Thus the Gratis Share roughly corresponds to the number of cores each group has been granted. Currently the Gratis Share is set to 45.
The second type of Share is Lab Share. This Share is the Share given to those Labs who have purchased hardware for their own lab. The CPU TRES from that purchased hardware is summed and added to the Gratis Share for that Lab's Account.
The third type of Share is Communal Partition Share. This Communal Partition Share is the Share given to labs who have gone in with other labs and have purchased hardware to be used in common by the group of labs (e.g. a partition for the entire department, or for a school, or a collaboration of labs). In these cases the CPU TRES is summed and then divided amongst the labs, per their discretion, and added to the Lab's Account.
Thus the total Share an Account has is simply the addition of all of these types of Share. This Share is global to the whole cluster. So whether the Lab is running on their own dedicated partitions or on the public partitions, their Share is the same. The Share is simply the portion of the entire system they have been granted, and can be moved around as needed by the Lab to any of the resources available to them on the cluster.
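The way the three types of Share add up can be sketched as follows. Only the Gratis Share of 45 comes from the text; the function names and the example lab's hardware and communal allotment are hypothetical numbers chosen for illustration.

```python
GRATIS_SHARE = 45  # current Gratis Share quoted above


def cpu_tres(nodes):
    """Sum the CPU TRES of purchased hardware.

    nodes is an iterable of (cores_per_node, tres_per_core) pairs.
    """
    return sum(cores * tres for cores, tres in nodes)


def total_share(lab_cpu_tres=0.0, communal_cpu_tres=0.0):
    """Total Share = Gratis Share + Lab Share + Communal Partition Share."""
    return GRATIS_SHARE + lab_cpu_tres + communal_cpu_tres


# Hypothetical lab: owns two 32-core Broadwell nodes (TRES 1.0 per core)
# and has been allotted 16 TRES of a communal partition.
lab_share = cpu_tres([(32, 1.0), (32, 1.0)])
example_total = total_share(lab_share, 16.0)

print(example_total)  # 125.0
```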
Probably the easiest way to walk through how a Lab's Fairshare Score is calculated is to explain what the Slurm tool sshare displays. This tool shows you all the components of your Fairshare calculation. Here is an example:
[root@holyitc01 ~]# sshare --account=test_lab -a
Account    User     RawShares   NormShares   RawUsage   EffectvUsage   FairShare
---------- -------- ----------- ------------ ---------- -------------- ----------
test_lab            244         0.001363     45566082   0.000572       0.747627
test_lab   user1    parent      0.001363     8202875    0.000572       0.747627
test_lab   user2    parent      0.001363     248820     0.000572       0.747627
test_lab   user3    parent      0.001363     163318     0.000572       0.747627
test_lab   user4    parent      0.001363     18901027   0.000572       0.747627
test_lab   user5    parent      0.001363     18050039   0.000572       0.747627
The Account we are looking at is test_lab. The first line of the sshare output shows the summary for the whole lab, while the subsequent lines show the information for each user. test_lab has been granted 244 RawShares. Each user of that lab has a RawShare of parent; this means that all the users pull from the total Share of the Account and do not have their own individual subShares of the Account Share. Thus all users in this lab have full access to the full Share of the Account.
The next column after RawShares is NormShares. NormShares is simply the Account's RawShares divided by the total number of RawShares given out to all Accounts on the cluster. Essentially NormShares is the fraction of the cluster the Account has been granted, in this case about 0.14%. Given the way we give out RawShares on Odyssey, the total number of RawShares should be equivalent to the total CPU TRES on Odyssey, so test_lab's 244 RawShares correspond to 244 Broadwell cores.
Following NormShares we have RawUsage. RawUsage is the amount of TRES-sec the Account/User has used. Thus if a user used a single Broadwell core for one second, the user's Account would be charged 1 TRES-sec in RawUsage. This RawUsage is also attenuated by the halflife that is set for the cluster, which is currently 4 weeks. Thus very recent work counts at close to full cost, work done 4 weeks ago counts at half, work done 8 weeks ago at one fourth, and so on. So RawUsage is the aggregate of the Account's past usage with this halflife weighting factor. The RawUsage for the Account is the sum of the RawUsage for each user, thus sshare is an effective way to figure out which users have contributed the most to the Account's score.
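The halflife weighting can be sketched as below. This assumes the decay is applied continuously, so that usage loses half of its weight every halflife, which is how a halflife is normally defined; the function names and the example usage records are hypothetical.

```python
HALF_LIFE_WEEKS = 4.0  # halflife quoted above


def decay_weight(age_weeks, half_life_weeks=HALF_LIFE_WEEKS):
    """Fraction of its original cost that usage of a given age still carries.

    Assumes continuous exponential decay.
    """
    return 0.5 ** (age_weeks / half_life_weeks)


def decayed_usage(records):
    """Aggregate decayed usage from (tres_seconds, age_weeks) pairs."""
    return sum(tres_sec * decay_weight(age) for tres_sec, age in records)


# Hypothetical history: 1000 TRES-sec used just now, 4 weeks ago and 8 weeks ago.
history = [(1000.0, 0.0), (1000.0, 4.0), (1000.0, 8.0)]

print(decay_weight(4.0))       # 0.5
print(decayed_usage(history))  # 1750.0
```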
The next column is EffectvUsage. EffectvUsage is the Account's RawUsage divided by the total RawUsage for the cluster. Thus EffectvUsage is the fraction of the cluster the Account has actually used. In this case, test_lab has used about 0.057% of the cluster.
Finally, we have the Fairshare score. The Fairshare score is calculated using the following formula:
f = 2^(-EffectvUsage/NormShares)
From this one can see that there are five basic regimes for this score, which are as follows:
1.0: Unused. The Account has not run any jobs recently.
1.0 > f > 0.5: Underutilization. The Account is underutilizing their granted Share. For example, when f is about 0.71 a lab has recently used about half of their Share of the resources (1:2).
0.5: Average utilization. The Account on average is using exactly as much as their granted Share.
0.5 > f > 0: Over-utilization. The Account has overused their granted Share. For example, when f=0.25 a lab has recently overutilized their Share of the resources 2:1
0: No share left. The Account has vastly overused their granted Share. If there is no contention for resources, the jobs will still start.
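A minimal sketch of the formula and the regimes above, applied to the test_lab values from the sshare example. The function names are hypothetical, and the regime labels simply mirror the list above, with scores near 0 treated as the extreme end of over-utilization.

```python
def fairshare(effective_usage, norm_shares):
    """Fairshare score f = 2^(-EffectvUsage/NormShares)."""
    return 2.0 ** (-effective_usage / norm_shares)


def regime(f):
    """Map a Fairshare score onto the regimes listed above."""
    if f >= 1.0:
        return "unused"
    if f > 0.5:
        return "underutilization"
    if f == 0.5:
        return "average utilization"
    return "over-utilization"


# test_lab values as printed (rounded) by sshare.
f_test_lab = fairshare(0.000572, 0.001363)

print(round(f_test_lab, 4))  # 0.7476
print(regime(f_test_lab))    # underutilization
print(fairshare(2.0, 1.0))   # 0.25, usage at twice the Share
```

The result is close to the 0.747627 that sshare reports; the small difference presumably comes from sshare working with unrounded inputs.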
Since the usage of the cluster varies, the scheduler does not stop Accounts from using more than their granted Share. Instead, the scheduler wants to fill idle cycles, so it will take whatever jobs it has available. Thus an Account is essentially borrowing computing resource time from the future to use now. This will continue to drive down the Account's Fairshare score, but allow jobs for the Account to still start. Eventually, another Account with a higher Fairshare score will start submitting jobs, and that lab's jobs will have a higher priority because they have not used their granted Share. Fairshare only recovers as a lab reduces its workload to allow other Accounts to run. The half-life helps to expedite this recovery.
Given this behavior of Fairshare, Accounts can also bank time for large computations that are beyond their average Share. For instance, say the Lab knows it has a large parallel run to do, or alternatively a deadline to meet. In preparation for this, the Lab can refrain from running for several weeks. This will drive up their Fairshare, as they will not have used their fraction of the cluster for that time period. This banked capacity can then be expended for a large run or series of runs. On the other hand, to continue the financial analogy, a group that has exhausted their Fairshare is in debt to the scheduler, as they have used up far more than their granted Share. Thus they have to wait for that debt to be paid off by not running, which allows their Fairshare to recover. Again, when there is no contention for resources, even jobs with low Fairshare scores will continue to start.
Now that we have discussed Fairshare, we can discuss how an individual job's priority is calculated. Job Priority is an integer that adjudicates the position of a job in the pending queue relative to other jobs. There are two components of Job Priority on Odyssey. The first is the Fairshare score multiplied by a weighting factor to turn it into an integer, in this case 20,000,000. A Fairshare of 1 would give a priority of 20,000,000, while a Fairshare of 0.5 would give a value of 10,000,000. We pick large numbers so we have resolution to break ties between Accounts that are close in Fairshare score. This Fairshare Priority evolves dynamically as the Fairshare of the Account changes over time.
The second component is Job Age. This priority accrues over time, reaching its maximum value at 7 days. As the job sits in the queue waiting to be scheduled, its priority gradually increases due to the Job Age. The maximum possible value for Job Age is 10,000,000. Thus a job that has been sitting for 3.5 days would have a Job Age Priority of 5,000,000. We set the Job Age Priority to a maximum of 10,000,000 so that a job from an Account with a Fairshare of 0 that has been pending for 7 days would have the same priority as a job that was just submitted from an Account that has a Fairshare of 0.5. Thus even jobs from Accounts that have low Fairshare will schedule eventually due to the growth in their Job Age Priority.
These two components are summed together to make up an individual Job's Priority. You can see this calculation for specific jobs by using the sprio command. In addition you can see the Pending queue of a specific partition ordered by job priority by using showq -o -p <partitionname>.
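The two-component sum described above can be sketched as follows. The weights and the 7 day maximum are the values quoted in the text; the function name, the linear growth of the age component, and the truncation to integers are assumptions made for illustration, not a description of Slurm's internals.

```python
FAIRSHARE_WEIGHT = 20_000_000  # weighting factor for the Fairshare score
AGE_WEIGHT = 10_000_000        # maximum Job Age Priority
MAX_AGE_DAYS = 7.0             # age at which the Job Age Priority saturates


def job_priority(fairshare_score, age_days):
    """Job Priority = Fairshare Priority + Job Age Priority (sketch)."""
    # Assumed: the age component grows linearly and is capped at 7 days.
    age_factor = min(age_days / MAX_AGE_DAYS, 1.0)
    return int(FAIRSHARE_WEIGHT * fairshare_score) + int(AGE_WEIGHT * age_factor)


print(job_priority(0.5, 0.0))  # 10000000, just submitted with Fairshare 0.5
print(job_priority(0.0, 7.0))  # 10000000, Fairshare 0 after 7 days pending
print(job_priority(0.0, 3.5))  # 5000000
```

The first two cases reproduce the break-even point described above: a job with no Fairshare left catches up with a freshly submitted job at Fairshare 0.5 after 7 days in the queue.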
While most users are fine with having one Account they are associated with, some users do work for multiple Accounts. Slurm does have the ability to associate users with multiple Accounts, which allows users to charge back individual jobs to individual Accounts. Contact Research Computing if you are interested in this feature.
Research Computing keeps track of historic data for usage and Fairshare score. If you wish to see how your Account's Fairshare score has evolved over time please contact Research Computing.
scalc is a calculator available on the cluster for answering various questions about fairshare. It includes a calculator for projecting a new Fairshare score based on a new RawShare, a calculator for figuring out how long it will take to restore fairshare, and a calculator for figuring out how much a set of jobs will cost in terms of cluster utilization and fairshare. If there are additional calculations that you would like to see, please contact us.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at Attribution.