by Paul Edmon June 3, 2022
A common pitfall of High Performance Computing (HPC) scheduling is cluster fragmentation. This is not unique to HPC, mind you: any system where you have a limited amount of space that you try to fill with quasi-randomly sized blocks of stuff will end up fragmented at some level (see your last game of Tetris). Within HPC, cluster fragmentation is caused by taking jobs of various shapes and sizes (e.g. cores, memory, nodes, GPUs, time) and trying to make sure that everyone gets their required allocation while also ensuring maximum utilization of the resource.
Initially everything ends up nice and neat. Large jobs are stacked together first, with smaller jobs used to fill out the gaps. Over time, though, the cluster gets more and more fragmented. Jobs exit at random times, leaving oddly shaped gaps in the schedule. You cannot put a larger job with a lot of topology requirements into such a gap, so the scheduler throws in a smaller job that will fit. Alternatively, a larger job will block off resources as it tries to assemble its allocation, causing many smaller jobs that cannot quite fit to pend. Solving this fragmentation problem is in fact the essence and art of scheduling, and the reason why schedulers are nontrivial to write.
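You can watch this churn happen with a toy model (everything here, from the node count to the first-fit placement rule, is a simplifying assumption for illustration, not how any real scheduler works): small jobs of random sizes keep arriving while running jobs exit at random, and the free cores end up scattered across nodes rather than pooled anywhere useful.

```python
import random

random.seed(0)

CORES_PER_NODE = 8
NODES = 16
free = [CORES_PER_NODE] * NODES  # free cores on each node
running = []                     # (node, cores) for each placed job

def place(cores):
    """First-fit: run the job on the first node with enough free cores.
    If nothing fits, the job pends (in this toy model, it is dropped)."""
    for i in range(NODES):
        if free[i] >= cores:
            free[i] -= cores
            running.append((i, cores))
            return True
    return False

# Churn: small jobs keep arriving while random jobs exit at random times.
for step in range(1000):
    place(random.choice([1, 2, 4]))
    if running and random.random() < 0.5:
        node, cores = running.pop(random.randrange(len(running)))
        free[node] += cores

total_free = sum(free)
largest_block = max(free)
print(f"{total_free} cores free in total, at most {largest_block} on any one node")
```

Even a crude model like this shows the signature of fragmentation: the total free capacity is far larger than the largest block a structured job could actually use.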
So why is fragmentation such a bad thing? The main issue is that it makes it very hard to schedule jobs that require a lot of structure. Say, for example, you have an 8-core job. If the job has no structure requirements, the scheduler can give it one core on a node over here, another couple of cores over there, and cobble together 8 cores' worth of compute in fairly short order. However, some jobs need those cores to be put together in a certain way. Maybe the job cannot span multiple nodes and needs all its cores on a single host. Maybe the job needs multiple nodes that are adjacent on the network fabric. If the cluster were not fragmented, finding larger structured blocks like this would be easier, and thus jobs of this style would pend for a shorter amount of time.
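That 8-core example can be made concrete with a small sketch (the free-core snapshot is invented for illustration): nine cores are free across the cluster, more than enough in total, yet the structured version of the job cannot run.

```python
CORES_PER_NODE = 8

# A fragmented snapshot: free cores per node after jobs have come and gone.
free_cores = [3, 3, 0, 3]  # 9 cores free in total, spread across three nodes

job_cores = 8

# An unstructured job can cobble its cores together from anywhere:
unstructured_fits = sum(free_cores) >= job_cores            # True

# A single-node job needs all 8 cores on one host:
single_node_fits = any(f >= job_cores for f in free_cores)  # False

print(unstructured_fits, single_node_fits)  # True False
```

The single-node job pends until enough exits happen to line up on one host, which is exactly the wait that fragmentation inflicts.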
Now if all the work for a cluster were of the same style and type, then fragmentation would not happen in the first place. It is the very fact that there is such a diversity of workloads on an HPC cluster that makes fragmentation an issue. You want to be able to serve not just the users submitting lots and lots of small jobs but also the users who want to submit very large, well-structured jobs, as well as everyone in between (which is frankly most users). Accomplishing this is part of the art of scheduler policy.
There are three general types of solutions to fragmentation. All have their advantages and disadvantages. The first is to enforce a specific style of job, typically that a job must occupy at least one complete node. This moves the atomic unit of scheduling from individual cores to individual nodes. Fragmentation still occurs, but at the level of nodes and their relation to the network topology. This is a fine solution for MPI workloads, which typically want to work in this fashion, but less so for workloads that are smaller than a node. Some codes do not scale well to a full node, or do not even need a node's full capacity. As such you will end up with nodes being underutilized. The tradeoff, then, is less fragmentation overall but wasted resources at the individual node level: resources that could have been used if the user had been allowed to declare a more accurate job size.
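The waste from rounding every job up to whole nodes is easy to quantify. A minimal sketch, with an assumed 8-core node:

```python
import math

CORES_PER_NODE = 8  # an assumed node size, for illustration

def whole_node_alloc(job_cores):
    """Round a job up to whole nodes, as a node-exclusive policy does.
    Returns (nodes allocated, cores allocated, cores left idle)."""
    nodes = math.ceil(job_cores / CORES_PER_NODE)
    allocated = nodes * CORES_PER_NODE
    return nodes, allocated, allocated - job_cores

for cores in (4, 8, 10):
    nodes, allocated, idle = whole_node_alloc(cores)
    print(f"{cores}-core job -> {nodes} node(s) allocated, {idle} cores idle")
```

A 4-core job strands half a node, and a 10-core job strands most of a second one; only jobs sized in exact node multiples waste nothing under this policy.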
The second solution is to privilege larger jobs over smaller jobs, or vice versa. This type of scheduling gives priority to jobs that take up more space, mimicking how a person would pack a van to move. The obvious downside is that the work those users are doing with the larger jobs is given priority over other work on the cluster. On certain clusters this may be the entire point: after all, if you build a giant supercomputer you want to run giant jobs that you cannot run anywhere else. However, for a generic research cluster this is not ideal. The size or style of a job is not an indicator of the importance of a user's work. Lots of small jobs are just as valid as large jobs. You want everyone to get their work done without preferring one type of job over another. At the end of the day this sort of priority shifting is a philosophical question of what your cluster is for and how it will allocate work to individual users and groups.
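In its simplest form, privileging larger jobs is just ordering the pending queue by size before packing; a toy sketch (job names and sizes are hypothetical, and real schedulers fold size into a much richer priority formula):

```python
# Hypothetical pending queue: (job name, cores requested).
pending = [("small-a", 2), ("big-mpi", 256), ("small-b", 1), ("medium", 32)]

# Privilege larger jobs: consider the biggest requests first,
# like packing the biggest furniture into the moving van first.
by_size = sorted(pending, key=lambda job: job[1], reverse=True)

print([name for name, _ in by_size])
# ['big-mpi', 'medium', 'small-a', 'small-b']
```

Flipping `reverse=True` to `False` gives the opposite policy, and either way the ordering encodes a value judgment about whose work goes first.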
The third solution is to increase job throughput and decrease the time it takes to defragment. After all, the real problem you are trying to solve is how long a job of any size pends. The average pending time will be approximately the average time it takes for a job to complete on the cluster, as jobs can only run when free spots open up. One approach is expansion; however, unlike cloud computing, HPC clusters cannot usually dynamically expand their resources to meet demand, for a variety of reasons. The other method is to make jobs exit faster. This can be done by setting a maximum time limit for any job. If you set a maximum time limit of a week for a queue, then that queue will on average turn over in a week and defragment in a week. If you set it to a day, it will do so in a day. Thus shorter runtimes solve the pending issue caused by fragmentation by subdividing the time axis of the work. The problem is that not all jobs are able to complete in a day, or even a week. Many jobs have no ability to save their progress and restart. As such you cannot impose an arbitrarily short run time, but need to pick a limit that is amenable to most of the types of jobs your cluster runs.
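The effect of the walltime cap on pending times can be illustrated with a toy queueing model (the slot count, trial count, and uniform-remaining-runtime assumption are all mine, not taken from any real scheduler): the cluster is full, a new job waits for the first running job to exit, and that wait shrinks as the cap shrinks.

```python
import random

random.seed(1)

def mean_wait(max_walltime, slots=100, trials=2000):
    """Toy model: the cluster is full, and the remaining runtime of each
    running job is uniformly random but bounded by max_walltime. A newly
    submitted one-slot job waits until the first running job exits."""
    total = 0.0
    for _ in range(trials):
        remaining = [random.uniform(0, max_walltime) for _ in range(slots)]
        total += min(remaining)  # wait for the earliest exit
    return total / trials

week = mean_wait(7 * 24)  # one-week limit, in hours
day = mean_wait(24)       # one-day limit
print(f"week limit: ~{week:.2f}h mean wait; day limit: ~{day:.2f}h mean wait")
```

Cutting the limit from a week to a day cuts the mean wait by roughly the same factor of seven, which is the subdividing-the-time-axis effect in miniature.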
As you can see from the three options above, there is no silver bullet for solving fragmentation. All have their specific hitches and tradeoffs. Thus a cluster administrator must understand the community they are supporting and craft a queue policy that best serves the interests of their users.