How do I figure out how efficient my job is?

You can see your job efficiency by comparing Elapsed, CPUTime, and NCPUS in sacct. For example:

[user@boslogin01 home]# sacct -j 1234567 -o Elapsed,CPUTime,NCPUS
Elapsed  CPUTime     NCPUS
---------- ---------- ----------
13:22:35 35-16:05:20 64
13:22:35 17-20:02:40 32
13:22:41 35-16:11:44 64
13:21:39 1-02:43:18   2

In this job you see that the user used 64 cores and their job ran for 13 hours. However their CPUTime is 35.5 hours which is close to 64*13 hours. If your code is scaling effectively CPUTime = NCPUS * Elapsed. If it is not that number will diverge. The best way to test this is to do some scaling tests. There are two styles you can do. Strong scaling is where you leave the problem size the same but increase the number of cores. If your code scales well it should take less time proportional to the number of cores you use. The other is weak scaling where the amount of work per core remains the same but you increase the number of cores, so the size of the job scales proportionally to the number of cores. Thus if your code scales in this case the run time should remain the same.

Typically most codes have a point where the scaling breaks down due to inefficiencies in the code. Thus beyond that point there is not any benefit to increasing the number of cores you throw at the problem. That's the point you want to look for.  This is most easily seen by plotting log of the number of cores vs. log of the runtime.

The other factor that is important in a scheduling environment is that the more cores you ask for the longer your job will pend for as the scheduler has to find more room for you. Thus you need to find the sweet spot where you minimize both your runtime and how long you pend in the queue for. For example it may be the case that if you asked for 32 cores your job would take a day to run but pend for 2 hours, but if you ask for 64 cores your job would take half a day to run but would pend for 2 days. Thus it would have been better to ask for 32 cores even though the job is slower.

Posted in: c. Jobs and SLURM

CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at Attribution.