c. Jobs and SLURM

How do I know what memory limit to put on my job?

Add to your job submission:

#SBATCH --mem X

where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number, the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that's much larger than needed for most jobs) and then use seff to look at how much memory your job actually used (or is currently using):

seff JOBID

where JOBID is the one you're interested in. This gives you a rough idea of what to use with --mem (set it to something a little larger than that, since you're defining a hard upper limit).
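
For example, if seff reported that a job used about 2.5 GB (an illustrative figure), a resubmission might request a little more than that:

#SBATCH --mem 3000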

For more information see here.


How can I check OS version when loading modules?

If you need to check which OS version the node you are on is running before loading modules, you can add a test in your .bashrc for the value of the environment variable FASRCSW_OS.

So, for instance, you could source a different file for CentOS 7 by adding something like the following to your .bashrc:

if [ "$FASRCSW_OS" = "centos7" ]; then
source ~/.mycentos7.rc
source ~/.mynormal.rc



How do I figure out how efficient my job is?

You can see your job efficiency by using seff. For example:

[user@boslogin01 home]$ seff 1234567
Job ID: 1234567
Cluster: odyssey
User/Group: user/user_lab
State: COMPLETED (exit code 0)
Nodes: 8
Cores per node: 64
CPU Utilized: 37-06:17:33
CPU Efficiency: 23.94% of 155-16:02:08 core-walltime
Job Wall-clock time: 07:17:49
Memory Utilized: 1.53 TB (estimated maximum)
Memory Efficiency: 100.03% of 1.53 TB (195.31 GB/node)

In this job you can see that the user requested 512 cores (8 nodes with 64 cores each) and the job ran for about 7.3 hours of wall-clock time. If the code were scaling effectively, CPU Utilized would equal NCPUS * Elapsed (wall-clock time), i.e. 512 * 7.3, or about 3,736 core-hours -- the reported core-walltime. Instead, CPU Utilized is about 894 core-hours (roughly 128 * 7), which is only about 24% of what was requested, exactly what the CPU Efficiency line reports. When a code does not scale, those numbers diverge. The best way to test this is to do some scaling tests, of which there are two styles. In strong scaling you keep the problem size the same but increase the number of cores; if your code scales well, the runtime should drop in proportion to the number of cores used. In weak scaling you keep the amount of work per core the same but increase the number of cores, so the problem size grows proportionally to the core count; if your code scales well in this case, the runtime should remain the same.

Typically most codes have a point where the scaling breaks down due to inefficiencies in the code, and beyond that point there is no benefit to increasing the number of cores you throw at the problem. That breakdown point is what you want to find, and it is most easily seen by plotting the log of the number of cores against the log of the runtime.

The other factor that is important in a scheduling environment is that the more cores you ask for, the longer your job will pend, as the scheduler has to find more room for it. You therefore need to find the sweet spot that minimizes both your runtime and your time pending in the queue. For example, a 32-core job might take a day to run but pend for only 2 hours, while the same job on 64 cores takes half a day to run but pends for 2 days; in that case it would have been better to ask for 32 cores, even though the job itself runs more slowly.
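
As a sketch of a strong-scaling test (my_job.sh and the core counts here are hypothetical), you can submit the same problem at several core counts and compare the wall-clock times seff reports:

for n in 8 16 32 64; do
    sbatch -n $n -J scale_$n my_job.sh    # same input, varying core count
done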


Will single core/thread jobs run faster on the cluster?

The cluster cores, in general, will not be any faster than the ones in your workstation; in fact, they may be slower if your workstation is relatively new. While we have a variety of chipsets available on the cluster, most of the cores are AMD and will be slower than many of the Intel chips common in modern desktops and laptops. The reason we use so many AMD chips is that they let us purchase a larger number of cores and more RAM, and that is the power of the cluster. The cluster isn't designed to run a single-core code as fast as possible, as the chips to do that are expensive; rather, you trade raw chip speed for core count, and then gain speed and efficiency via parallelism. So the cluster excels at multicore jobs (using threads or MPI ranks) and at running many single-core jobs (such as parameter sweeps or image processing). This way you leverage the parallel nature of the cluster and the 60,000 cores available.

So if you have a single job, the cluster isn't really a gain. If you have lots of jobs you need to get done, or your job is too large to fit on a single machine (due to RAM or its parallel nature), the cluster is the place to go. The cluster can also be useful for offloading work from your workstation. That way you can use your workstation cores for other tasks and offload the longer running work onto the cluster.

In addition, since the cluster cores are a different architecture from your workstation's, be aware that the code will need to be optimized differently; this is where compiler choice and compiler flags come in handy, so that you can get the most out of both sets of cores. Even then you may not get the same performance out of the cluster as your local machine. The main processor we have on the cluster is now 4 years old, and if you are using serial_requeue you could end up on anything from hardware bought today to hardware purchased 7 years ago. The natural development of processor technology alone accounts for about a factor of 2-4 in performance.


My login is slow or my batch commands are slow

Nine times out of ten, slowness at login, starting file transfers, failed SFTP sessions, or slow batch command starts is caused by unneeded module loads in your .bashrc.

We do not recommend putting multiple module loads in your .bashrc, as each and every new shell you or your jobs create will call those module loads. It is recommended that you put your module loads in your job scripts so that you are not loading unneeded modules and waiting on those module calls to complete before commencing the job. Alternatively, you can create a login script or alias containing your frequently used modules that you can run when you need them, as in the sketch below.
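
For example, a hypothetical alias in your .bashrc (the module names are placeholders for whatever you actually use) loads your usual set on demand rather than at every login:

alias mymods='module load gcc openmpi python'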

Either way, try to keep any module loads in your .bashrc down to a bare minimum, calling only those modules that you absolutely need in each and every login or job.

Additionally, as time goes on, modules change or are removed, so please ensure you remove any deprecated modules from your .bashrc or other scripts. For example, the legacy modules no longer exist: if you have a call to module load legacy (or any of the legacy modules), or if you have source new-modules.sh, your login will be delayed while the module system searches for and then times out on those non-existent modules.


How do I run applications that want X11 without graphics?

Sometimes codes, applications, or Python stacks really want X11 (i.e. a way to display graphics) to exist. In some cases the lack of X11 will cause the code to crash, making it unusable on the cluster. For applications like this we recommend using X Virtual Frame Buffer (XVFB). XVFB spoofs a fake X11 session so that the code thinks X11 is there when it isn't; anything X11-related is sent to the void, never to return. XVFB is easy to use: simply run it in front of the application whose X11 you want to quash:

xvfb-run application options
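
For instance, to run a hypothetical plotting script that insists on a display:

xvfb-run python make_plots.py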

If you actually need X11 we recommend using a virtual desktop.


Can I query SLURM programmatically?

I'm writing code to keep an eye on my jobs. How can I query SLURM programmatically?

We highly recommend that people writing meta-schedulers, or who wish to interrogate SLURM from scripts, do so using the squeue and sacct commands, and we strongly recommend that your code performs these queries once every 60 seconds or less often. These commands contact the master controller directly, the same process responsible for scheduling all work on the cluster. Polling more frequently, especially across all users on the cluster, will slow down response times and may bring scheduling to a crawl. Please don't.
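
As a minimal sketch (the job ID here is a placeholder), a polite polling loop might look like this:

JOBID=1234567
while true; do
    STATE=$(squeue -j $JOBID -h -o %T)    # -h: no header; %T: job state
    if [ -z "$STATE" ]; then
        sacct -j $JOBID --format=JobID,State,Elapsed,MaxRSS    # job has left the queue
        break
    fi
    sleep 60    # query no more than once every 60 seconds
done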

SLURM also has an API that is documented on the website of our developer partners SchedMD.com.


How do I fairly manage dual/multiple lab affiliations for work on the FASRC cluster?

We're really glad you asked us this question! How you submit your jobs determines which lab's fairshare is charged. There are two levels to this question: the first concerns filesystem rights, the second SLURM submissions.

For filesystem rights, your primary group ID should be set to your primary lab group; request a secondary group membership in Active Directory from us. If you wish to switch to the other group for file operations (for example, when creating files in smith_lab shared storage, you would want to make sure your group is set to smith_lab, not jones_lab), use the newgrp 2NDGROUPNAME command.

In SLURM, ensure that your primary group membership is set for the appropriate lab, and request a secondary group affiliation in SLURM from us. When submitting SLURM jobs, all resource usage will be charged to your primary SLURM group. If you wish to submit jobs for the other group, use --account=2NDGROUPNAME on the sbatch or srun command.
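
For example (jones_lab here is a placeholder for your secondary group's account name):

sbatch --account=jones_lab my_batch_script.sh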


How do I submit a batch job to the FASRC cluster queue with SLURM?

Step 1: Login to cluster through your Terminal window. Please see the Access and Login page for login instructions.

Step 2: Run a batch job by typing: sbatch RUNSCRIPT. Replace RUNSCRIPT with the batch script (a text file) you will use to run your code.

The batch script should contain #SBATCH comments that tell SLURM how to run the job.


#!/bin/bash
#SBATCH -n 1 #Number of cores
#SBATCH -t 5 #Runtime in minutes
#SBATCH -p serial_requeue #Partition to submit to
#SBATCH --mem-per-cpu=100 #Memory per cpu in MB (see also --mem)
#SBATCH -o hostname.out #File to which standard out will be written
#SBATCH -e hostname.err #File to which standard err will be written
#SBATCH --mail-type=END #Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=ajk@123.com #Email to which notifications will be sent

hostname #The command(s) to run


See the batch submission section of the Running Jobs page for detailed instructions and sample batch submission scripts.

Note: You must declare how much memory and how many cores your job uses; by default SLURM assumes you need only 100 MB. The script runs in the directory from which it was submitted and will load your .bashrc.


How do I submit an interactive job on the cluster?

Step 1: Log in to the cluster through your Terminal window. Please see the Access and Login page for login instructions.

Step 2: Run an interactive job by typing: srun -p interact --pty MYPROGRAM

This will open up an interactive session for you to use. If you want a bash prompt, type: srun --mem 500 -p interact --pty bash

If you need X11 forwarding type: srun --mem 500 -p interact --pty --x11=first MYPROGRAM

This will initiate an X11 tunnel to the first node on your list. --x11 has the additional options batch, first, last, and all.

See also the interactive jobs section of the Running Jobs page.


How do I view or monitor a submitted job?

Step 1: Login to the cluster through your Terminal window. Please see the Access and Login page for login instructions.

Step 2: From the command line type one of three options: smap, squeue, or showq-slurm

If you want more details about your job, from the command line type: sacct -j JOBID

You can view the runtime and memory usage for a past job by typing: sacct -j JOBID --format=JobID,JobName,MaxRSS,Elapsed, where JOBID is the numeric job ID of the past job.

See the Running Jobs page for more details on job monitoring.


My job is PENDING. How can I fix this?

How soon a job is scheduled depends on a combination of factors: the time requested, the resources requested (e.g. RAM, number of cores, etc.), the partition, and one's FairShare score.

Quick solution? The Reason column in the squeue output can give you a clue:

  • If there is no reason, the scheduler hasn't attended to your submission yet.
  • Resources means your job is waiting for an appropriate compute node to open.
  • Priority indicates your priority is lower relative to others being scheduled.

There are other Reason codes; see the SLURM squeue documentation for full details.

Your priority is partially based on your FairShare score and determines how quickly your job is scheduled relative to others on the cluster. To see your FairShare score, enter the command sshare -u RCUSERNAME. Your effective score is the value in the last column; as a rule of thumb, scores below 0.5 indicate lower priority and scores above 0.5 indicate higher priority.

In addition, you can see the status of a given partition and your position relative to other pending jobs in it by entering the command showq-slurm -p PARTITION -o. This will order the pending queue by priority, where jobs listed at the top are next to be scheduled.

For both Resources and Priority squeue Reason output codes, consider shortening the runtime or reducing the requested resources to increase the likelihood that your job will start sooner.

Please see this document for more information and this presentation for a number of troubleshooting steps.



SLURM Errors: Job Submission Limit (per user)

If you attempt to schedule more than 10,000 jobs (all inclusive, both running and pending) you will receive an error like the following:

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
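
A quick, hypothetical way to check how many jobs you currently have in the queue (running and pending combined):

squeue -u $USER -h | wc -l    # -h omits the header so only job lines are counted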

For more info, see our documentation on being a good cluster neighbor.


SLURM Errors: Device or resource busy

What's up? My SLURM output file terminates early with the following error:

"slurmstepd: error: _slurm_cgroup_destroy: problem deleting step cgroup
path /cgroup/freezer/slurm/uid_57915/job_25009017/step_batch: Device or
resource busy"

Well, usually this is a problem in which your job is trying to write to a network storage device that is busy -- probably overloaded by someone doing high amounts of I/O (input/output) where they shouldn't, usually on low throughput storage like home directories or lab disk shares.

Please contact RCHelp about this problem, giving us the jobID, the filesystem you are working on, and additional details that may be relevant. We'll use this info to track down the problem (and, perhaps, the problem user(s)).

(If you know who it is, tap them on the shoulder and show them our Odyssey Storage page.)


SLURM Errors: Job cancelled due to preemption

If you've submitted a job to the serial_requeue partition, it is more than likely that your job will be scheduled on a purchased node that is idle. If the node owner submits jobs, SLURM will kill your job and automatically requeue it. This message will appear in the STDOUT or STDERR files that you indicated with the -o and -e options, and is simply an informative message from SLURM.


SLURM Errors: Memory limit

Job <jobid> exceeded <mem> memory limit, being killed:

Your job is attempting to use more memory than you've requested for it. Either increase the amount of memory requested with --mem or --mem-per-cpu or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option, which could potentially be reduced.
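
For instance (the values here are illustrative), keep the JVM heap comfortably below the job's memory request, since the JVM needs some memory beyond the heap:

#SBATCH --mem=8000
java -Xmx6g -jar myprogram.jar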

For jobs that require truly large amounts of memory (>256 GB), you may need to use the bigmem SLURM partition. Genome and transcript assembly tools are commonly in this camp.

See this FAQ on determining how much memory your completed batch job used under SLURM.


SLURM Errors: Node Failure


This message may arise for a variety of reasons, but it indicates that the host on which your job was running can no longer be contacted by SLURM. Not a good sign. Contact RCHelp to help with this problem.


SLURM Errors: Socket timed out. What?

If the SLURM master (the process that listens for SLURM requests) is busy, you might receive the following error:

[bfreeman@holylogin02 ~]$ squeue -u bfreeman
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation

Since SLURM is scheduling about one job every second (let alone doing the calculations to place each job on one of the cluster's thousands of compute nodes), it's going to be a bit busy at times. Don't worry. Get up, stretch, pet your cat, grab a cup of coffee, and try again.


SLURM Errors: Time limit

(or you may also see 'Job step aborted' when using srun)

Either you did not specify enough time in your batch submission script, or you didn't specify the amount of time at all and SLURM assigned the default of 10 minutes. The -t option sets time in minutes, or can also take D-HH:MM form (0-12:30 for 12.5 hours). Submit your job again with a longer time window.
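
For example, to request 12.5 hours:

#SBATCH -t 0-12:30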



What is Fair-Share?

FairShare is a score that determines what priority you have in the scheduling queue for your jobs. The more jobs you run, the lower your score becomes, temporarily. A number of factors are used to determine this score -- please read this Fairshare document for more information.

To find out what your score is, enter `sshare -U` in your terminal session on the cluster to see a listing for your group (this is not your individual score, but an aggregate for your group). In general, a score of 0.5 or above means you have higher priority for scheduling.

Example of a fairly full Fairshare:

$ sshare -U
      Account    User RawShares NormShares  RawUsage EffectvUsage FairShare
------------- ------- --------- ---------- --------- ------------ ---------
jharvard2_lab   jharv    parent   0.000936    171281     0.000003  0.997620

Example of a depleted Fairshare:

$ sshare -U
      Account    User RawShares NormShares   RawUsage EffectvUsage FairShare
------------- ------- --------- ---------- ---------- ------------ ---------
 jharvard_lab   johnh    parent   0.000936  361920733     0.007145  0.005046

See also Managing FairShare for Multiple Groups if you belong to more than one lab group.

For further information, see the RC fair-share document. 


