#

Running Jobs

 

Overview: The Odyssey cluster uses Slurm to manage jobs

Slurm (aka SLURM) is a queue management system and stands for Simple Linux Utility for Resource Management. Slurm was originally developed at the Lawrence Livermore National Lab, but is now primarily developed by SchedMD. Slurm is the scheduler that currently runs some of the largest compute clusters in the world.

Slurm is similar in many ways to most other queue systems. You write a batch script then submit it to the queue manager. The queue manager then schedules your job to run on the queue (or partition in Slurm parlance) that you designate. Below we will provide an outline of how to submit jobs to Slurm, how Slurm decides when to schedule your job, and how to monitor progress.

Slurm has a number of valuable features compared to other job management systems:

  • Kill and Requeue SLURM’s ability to kill and requeue is superior to that of other systems. It waits for jobs to be cleared before scheduling the high priority job. It also does kill and requeue on memory rather than just on core count.
  • Memory Memory requests are sacrosanct in SLURM. Thus the amount of memory you request at run time is guaranteed to be there. No one can infringe on that memory space and you cannot exceed the amount of memory that you request.
  • GRES Slurm has a concept called GRES (Generic Resource) that allows for fair scheduling on GPU's and other accelerators.  This is very handy in a dynamic research environment like RC's where various different hardware technologies can be put into the scheduler.
  • Accounting Tools SLURM has a back end database which stores historical information about the cluster. This information can be queried by the users who are curious about how much resources they have used.  It is used for adjudicating job priority on the cluster.

Odyssey jobs are generally run from the command line

Once you've gone through the account setup procedure, you can login to the Odyssey system via ssh and begin using the cluster. 

Odyssey computers run the CentOS distribution of the Linux operating system and commands are run under the "bash" shell. As with most supercomputers work is done via command line, typing commands into a prompt, and not via a GUI (graphical user interface).  There are a number of Linux and bash references, cheat sheets and tutorials available on the web. RC's own training are also available.

Odyssey applications should not be run from login or NX nodes

Once you have logged in to the Odyssey system, you will be on one of a handful of access nodes (e.g. rclogin04 - see the guides linked above). These nodes are shared entry points for all users and so cannot be used to run computationally intensive software. Think of them as front-ends for your work, not the place where you do your work. 

Simple file copies, light text processing or editing, etc. are fine, but you should not run large graphical applications like Matlab, Mathematics, R, or computationally intensive command line tools. A culling program runs on these nodes that will kill any application that exceeds memory and computational limits.

Entry nodes for NoMachine remote desktops (see below) like holynx01 and rcnx01 are also to be treated like login nodes.


Software modules are enabled via HELMOD

Because of the diversity of projects currently supported by FAS, thousands of applications and libraries are supported on the Odyssey cluster. Technically, it is impossible to include all of these tools in every user's environment.

Search available modules here
(https://portal.rc.fas.harvard.edu/apps/modules)

The Research Computing and Informatics departments have developed an enhanced Linux module system, Helmod, based on the hierarchical Lmod module system from TACC. Helmod enables applications much the same way as Linux modules, but also prevents multiple versions of the same tool from being loaded at the same time and separates tools that use particular compilers or MPI libraries entirely.

To start using the Helmod system, issue the command:

source new-modules.sh

You can also add this statement to your .bashrc login file so that you'll use the new system by default. Newer accounts will find that this already in their .bashrc

A module load command enables a particular application in the environment, mainly by adding the application to your PATH variable. For example, to enable the 3.2 version of the R package:

module load R/3.2.0-fasrc01


Loading more complex modules can affect a number of environment variables including PYTHONPATH, LD_LIBRARY_PATH, PERL5LIB, etc. Modules may also load dependencies.

To determine what has been loaded in your environment, the module list command will print all loaded modules.

The module purge command will remove all currently loaded modules. This is particularly useful if you have to run incompatible software (e.g. python 2.x or python 3.x). The module unload command will remove a specific module.

Finding the modules that are appropriate for your needs can be done in a couple of different ways. The module search page allows you to browse and search the list of modules that have been deployed to Odyssey.

There are a number of command line options for module searching, including the module avail command for browsing the entire list of applications and the module-query command for keyword searching. But the online module search is much more thorough and has additional information on each module.

Though there are many modules available by default, the hierarchical Helmod system enables additional modules after loading certain key libraries such as compilers and MPI packages. The module avail command output reflects this.

 

The Helmod module-query command supports more sophisticated queries and returns additional information for modules. If you query by the name of an application or library (e.g. hdf5), you'll retrieve a consolidated report showing all of the modules grouped together for a particular application.

 


A query for a single module, however, will return details about that build including module load statements and build comments (if any exist).


For more details about the Helmod module system, check out the Software on Odyssey page, or the Helmod tagged articles.


General Slurm resources

The primary source for documentation on Slurm usage and commands can be found at the Slurm site. If you Google for Slurm questions, you'll sometimes see the Lawrence Livermore pages as the top hits, but these tend to be outdated. Use the docs at the SchedMD site, though these are always for the latest version of Slurm. A great way to get details on the Slurm commands for the version of Slurm we run is the man pages available from the Odyssey cluster. For example, if you type the following command:

man sbatch

you'll get the manual page for the sbatch command.

Though Slurm is not as common as SGE or LSF, documentation is readily available.

Summary of Slurm commands

The table below shows a summary of Slurm commands. These commands are described in more detail below along with links to the Slurm doc site.

  SLURM SLURM Example
Submit a batch serial job sbatch sbatch runscript.sh
Run a script or application interactively srun srun --pty -p test -t 10 --mem 1000 /bin/bash [script or app]
Start interactive session from a login/NX node srun srun --pty -p test -t 10 --mem 1000 /bin/bash
Kill a job scancel scancel 999999
View status of your jobs squeue squeue -u akitzmiller
Check current job by id number sacct sacct -j 999999

Slurm partitions

Partition is the term that Slurm uses for queues. Partitions can be thought of as a set of resources and parameters around their use (See also: Convenient Slurm Commands)

PartitionNodesCores per NodeCore TypesMem per Node (GB)Time LimitMax JobsMax CoresMPI Suitable?GPU Capable?
shared45632Intel1287 daysnone*none*YesNo
general13364AMD2567 daysnone*none*YesNo
serial_requeuevariesvariesAMD/Intelvaries7 daysnone*none*NoYes*
gpu_requeuevariesvariesIntelvaries7 daysnone*none*NoYes
gpu124Intel30none*none*none*NoYes (8 K20Xm)
test**832Intel1288 Hours564NoNo
bigmem664AMD512none*none*none*NoNo
unrestricted864AMD256none*none*none*Yes*No
PI/Lab nodesvariesvariesvariesvariesnonenonenonevariesvaries
* - Allocating resources may add to job scheduling time
** - formerly 'interact'
  • shared - The shared partition has a maximum run time of 7 days. Serial, parallel, and interactive jobs are permitted on this queue, and this is the most appropriate location for MPI jobs. This queue is governed by backfill and FairShare (explained below). The shared partition is populated with hardware that RC runs at the MGHPCC data center in Holyoke, MA. This partition has 456 nodes connected by a FDR InfiniBand (IB) fabric, where each node configured with 2 Intel Xeon Broadwell CPUs, 128 GB of RAM, and 250 GB of local scratch space. Each Intel CPU has 16 Cores, and 40 MB of cache. Thus, the entire system allocated to this partition has 14,592 cores and 57 TB of RAM available for use.  When submitting MPI jobs on the shared partition, it maybe advisable to use the --contiguous option for best communication performance if your code is topology sensitive. Though all of the nodes are connected by Infiniband fabric, there are multiple switches routing the MPI traffic and Slurm will by default schedule you where ever it can find space.  Thus your job may end up scattered across the cluster. The --contiguous option will ensure that the jobs are run on nodes that are adjacent to each other on the IB fabric.  Be advised that using --contiguous will make your job pend longer, so only use it if you absolutely need it.
  • general  - The general partition has a maximum run time of 7 days. Serial, parallel, and interactive jobs are permitted on this queue, and this is the most appropriate location for MPI jobs. This partition is governed by backfill and FairShare (explained below). The general partition is populated with hardware that RC runs at the MGHPCC data center in Holyoke, MA. This queue has 133 nodes connected by a FDR InfiniBand (IB) fabric, where each node configured with 4 AMD Opteron Abu Dhabi CPUs, 256 GB of RAM, and 250 GB of local scratch space. Each AMD CPU has 8 Floating Point Units (FPU), 16 Integer Cores (IC), and 16 MB of cache. Thus, the entire system allocated to this partition has 8,512 integer cores and 33 TB of RAM available for use.
  • unrestricted  - Serial and parallel (including MPI) jobs are permitted on this partition and no restriction on run time. Given this, there is no guarantee of 100% uptime. Running on this partition is done at the users own risk. Users should understand that if the queue is full it could take weeks or up to months for your job to be scheduled to run. unrestricted is made up of 8 nodes (512 integer cores) of the same configuration as above for the general partition.
  • test - (replaces old 'interact' partition) This partition is dedicated for interactive (foreground / live) work and for testing (interactively) code before submitting in batch and scaling. Small numbers (1 to 5) of serial and parallel jobs with small resource requirements (RAM/cores) are permitted on this partition; large numbers of interactive jobs or those requiring large resource requirements should really be done on another partition. This partition is made up of 8 nodes of the same configuration as above for the shared partition. This smaller, 256 core queue has a 8 hour maximum run time.
  • bigmem This partition should be used for large memory work requiring greater than 250 GB RAM per job, like genome / transcript assemblies. Jobs requesting less than 250 GB RAM are automatically rejected by the scheduler. There is no time limit for work here. MPI or low memory work is not appropriate for the this partition, and inappropriate jobs may be terminated without warning. This partition has an allocation of 7 nodes with 512 GB of RAM
  • gpu  - This 1 node partition is for individuals wishing to test GPGPU resources. One will need to include #SBATCH --gres=gpu:n where n=1-8 in your SLURM submission scripts. This 1 node has 24 cores and is equipped with 8 x NVidia Tesla K20Xm. There are also private partitions that may have more GPU resources, but to which access may be controlled by the owners. See our GPU Computing doc for more info.
  • gpu_requeue  - This partition is appropriate for gpu jobs that require small periods of time (less than 1 day). The maximum runtime for this queue is 7 days. One will need to include #SBATCH --gres=gpu:1 in your SLURM submission scripts to get access to this partition. MPI jobs are not appropriate for this partition. As this partition is made up of an assortment of gpu nodes owned by other groups in addition to the public nodes, jobs in this partition may be killed but automatically requeued if a higher priority job (e.g. the job of a node owner) comes in. Because gpu_requeue takes advantage of slack time in owned partitions, times in the PENDING state can potentially be much shorter than the shared partition. Since jobs may be killed, requeued, and run a 2nd time, ensure that the jobs are a good match for this partition. For example, jobs that append output would not be good for gpu_requeue unless the data files were zeroed out at the start to ensure output from a previous (killed) run was removed. Also, to ensure your job need not redo all its compute again, it would be advisable to have breakpoints or branching instructions to bypass parts of work that have already been completed. We do advise that you use the --open-mode=append to see the requeue status/error messages in your log files. Without this option, your log files will be reset at the start of each (requeued) run, with no obvious indication of requeue events.
  • serial_requeue  - This partition is appropriate for single core (serial) jobs or jobs that require up to 8 cores for small periods of time (less than 1 day). The maximum runtime for this queue is 7 days. MPI jobs are not appropriate for this partition.  If you do not specify a partition you will be sent to this partition by default. As this partition is made up of an assortment of nodes owned by other groups in addition to the general nodes, jobs in this partition may be killed but automatically requeued if a higher priority job (e.g. the job of a node owner) comes in. Because serial_requeue takes advantage of slack time in owned partitions, times in the PENDING state can potentially be much shorter than the shared partition.  Since jobs may be killed, requeued, and run a 2nd time, ensure that the jobs are a good match for this partition. For example, jobs that append output would not be good for serial_requeue unless the data files were zeroed out at the start to ensure output from a previous (killed) run was removed. Also, to ensure your job need not redo all its compute again, it would be advisable to have breakpoints or branching instructions to bypass parts of work that have already been completed. We do advise that you use the --open-mode=append to see the requeue status/error messages in your log files. Without this option, your log files will be reset at the start of each (requeued) run, with no obvious indication of requeue events.

Submitting batch jobs using the sbatch command

The main way to run jobs on Odyssey is by submitting a script with the sbatch command. The command to submit a job is as simple as:

sbatch runscript.sh

The commands specified in the runscript.sh file will then be run on the first available compute node that fits the resources requested in the script. sbatch returns immediately after submission; commands are not run as foreground processes and won't stop if you disconnect from Odyssey.

Tip: You can see your jobs on portal.rc.fas.harvard.edu

A typical submission script, in this case using the hostname command to get the computer name, will look like this:

#!/bin/bash
#SBATCH -n 1 # Number of cores
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 0-00:05 # Runtime in D-HH:MM
#SBATCH -p serial_requeue # Partition to submit to
#SBATCH --mem=100 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o hostname_%j.out # File to which STDOUT will be written
#SBATCH -e hostname_%j.err # File to which STDERR will be written

hostname

In general, the script is composed of 3 parts.

  • the #!/bin/bash line allows the script to be run as a bash script
  • the #SBATCH lines are technically bash comments, but they set various parameters for the SLURM scheduler
  • the command line itself.

The #SBATCH lines shown above set key parameters. N.B. It is important to keep all #SBATCH lines together and at the top of the script; no bash code or variables settings should be done until after the #SBATCH lines. The Slurm system copies many environment variables from your current session to the compute host where the script is run including PATH and your current working directory. As a result, you can specify files relative to your current location (e.g. ./project/myfiles/myfile.txt).

#SBATCH -n 1

This line sets the number of cores that you're requesting. Make sure that your tool can use multiple cores before requesting more than one. If this parameter is omitted, Slurm assumes -n 1.

#SBATCH -N 1

This line requests that the cores are all on node. Only change this to >1 if you know your code uses a message passing protocol like MPI. Slurm makes no assumptions on this parameter -- if you request more than one core (-n > 1) and your forget this parameter, your job may be scheduled across nodes; and unless your job is MPI (multinode) aware, your job will run slowly, as it is oversubscribed on the master node and wasting resources on the other(s).

#SBATCH -t 5

This line specifies the running time for the job in minutes. You can also use the convenient format D-HH:MM. If your job runs longer than the value you specify here, it will be canceled. Jobs have a maximum run time of 7 days on Odyssey, though extensions can be done. There is no fairshare penalty for over-requesting time, though it will be harder for the scheduler to backfill your job if you overestimate. NOTE! If this parameter is omitted on any partition, the your job will be given the default of 10 minutes.

#SBATCH -p serial_requeue

This line specifies the Slurm partition (AKA queue) under which the script will be run. The serial_requeue partition is good for routine jobs that can handle being occasionally stopped and restarted. PENDING times are typically short for this queue. See the partitions description above for more information.  If you don not specify this parameter you will be given serial_requeue by default.

#SBATCH --mem=100

The Odyssey cluster requires that you specify the amount of memory (in MB) that you will be using for your job. Accurate specifications allow jobs to be run with maximum efficiency on the system. There are two main options, --mem-per-cpu and --mem. The --mem option specifies the total memory pool for one or more cores, and is the recommended option to use. If you must do work across multiple compute nodes (e.g. MPI code), then you must use the --mem-per-cpu option, as this will allocate the amount specified for each of the cores you're requested, whether it is on one node or multiple nodes. If this parameter is omitted, then you are granted 100 MB by default.  Chances are good that your job will be killed as it will likely go over this amount, so one should always specify how much memory you require.

#SBATCH -o hostname_%j.out

This line specifies the file to which standard out will be appended. If a relative file name is used, it will be relative to your current working directory. The %j in the filename will be substituted by the JobID at runtime. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.out in the current directory.

#SBATCH -e hostname_%j.err

This line specifies the file to which standard error will be appended. Slurm submission and processing errors will also appear in the file. The %j in the filename will be substituted by the JobID at runtime. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.err in the current directory.

It is important to accurately request resources, especially memory

Odyssey is a large, shared system that must have an accurate idea of the resources your program(s) will use so that it can effectively schedule jobs. If insufficient memory is allocated, your program may crash (often in an unintelligible way); if too much memory is allocated, resources that could be used for other jobs will be wasted. Additionally, your "fairshare", a number used in calculating the priority of your job for scheduling purposes, can be adversely affected by over-requesting. Therefore it is important to be as accurate as possible when requesting cores (-n) and memory (--mem or --mem-per-cpu).

Many scientific computing tools can take advantage of multiple processing cores, but many cannot. A typical R script, for example will not use multiple cores. On the other hand, RStudio, a graphical console for R is a Java program that is improved substantially by using multiple cores. Or, you can use the Rmpi package and spawn "slaves" that correspond to the number of cores you've selected.

The distinction between --mem and --mem-per-cpu is important when running multi-core jobs (for single core jobs, the two are equivalent). --mem sets total memory across all cores, while --mem-per-cpu sets the value for each requested core. If you request two cores (-n 2) and 4 Gb with --mem, each core will receive 2 Gb RAM. If you specify 4 Gb with --mem-per-cpu, each core will receive 4 Gb for a total of 8 Gb.  A good distinction between the two is that --mem-per-cpu is for MPI jobs and --mem is for all other types.

Monitoring job progress with squeue and sacct

squeue and sacct are two different commands that allow you to monitor job activity in SLURM. squeue is the primary and most accurate monitoring tool. sacct gives you similar information for running jobs, and can also report on previously finished jobs, but because it accesses the SLURM database, there are some circumstances when the information is not in sync with squeue.

Running squeue without arguments will list all your currently running, pending, and completing jobs:

squeue

or for a particular job

squeue -j 9999999

If you include the -l option (for "long" output) you can get useful data, including the running state of the job.



squeue long output using username (-u) filter.

The default squeue tool in your PATH (/usr/local/bin/squeue) is a modified version developed by FAS Informatics. To reduce the load on the SLURM scheduler (RC processes 2.5 million jobs each month), this tool actually queries a centrally collected result from the 'real' squeue tool, which can be found at /usr/bin/squeue. This data is collected approximately every 30 seconds. Many, but not all, of the options from the original tool are supported. Check this using the squeue --help command.

If you need to use all of the options from the real squeue tool, simply call it directly (/usr/bin/squeue).

The current state of jobs can also be monitored via the FAS RC/Informatics portal jobs page. You will need to login with your RC credentials. This draws from the same shared data as the squeue command line tool.

The sacct command also provides details on the state of a particular job. An squeue-like report on a single job is a simple command.

sacct -j 9999999

However sacct can provide much more detail as it has access to many of the resource accounting fields that SLURM uses. For example, to get a detailed report on the memory and CPU usage for an array job (see below for details about job arrays):



Listing of job details using sacct.

Both tools provide information about the job State. This value will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, and FAILED.

PENDING Job is awaiting a slot suitable for the requested resources. Jobs with high resource demands may spend significant time PENDING.
RUNNING Job is running.
COMPLETED Job has finished and the command(s) have returned successfully (i.e. exit code 0).
CANCELLED Job has been terminated by the user or administrator using scancel.
FAILED Job finished with an exit code other than 0.

See broader queue with showq

The showq command can be used to show what the rest of the partition looks like.  Often your job is pending due to other people in the partition.  The showq command then shows you an overview of all the jobs for a specific partition.  showq is invoked by doing:

showq -o -p shared

Where -o orders the pending queue by priority, with the next job to be scheduled at the top.  -p specifies the partition that you want to look at.

Killing jobs with scancel

If for any reason, you need to kill a job that you've submitted, just use the scancel command with the job ID.

scancel 9999999

If you don't keep track of the job ID returned from sbatch, you should be able to find it with the squeue or sacct command described above.

Interactive jobs and srun

Though batch submission is the best way to take full advantage of the compute power in Odyssey, foreground, interactive jobs can also be run. These can be useful for things like:

  • Iterative data exploration at the command line
  • RAM intensive graphical applications like MATLAB or SAS.
  • Interactive "console tools" like R and iPython
  • Significant software development and compiling efforts

An interactive job differs from a batch job in two important aspects: 1) the partition to be used is the test partition (though any partition in Slurm can be used for interactive work) and, 2) jobs should be initiated with the srun command instead of sbatch. This command:

srun -p test --pty --mem 500 -t 0-06:00 /bin/bash

will start a command line shell (/bin/bash) on the test queue with 500 MB of RAM for 6 hours; 1 core on 1 node is assumed as these parameters (-n 1 -N 1) were left out. When the interactive session starts, you will notice that you are no longer on a login node, but rather one of the compute nodes dedicated to this queue. The --pty option allows the session to act like a standard terminal. In a pinch, you can also run an application directly though this is discouraged due to problems setting up bash environment variables. After loading a module for MATLAB, you can start the application with the following command:

srun -p test --pty --x11=first --mem 4000 -t 0-06:00 matlab

In this case, we've asked for more memory because of the larger MATLAB footprint. The --x11-first option allows XWindows to operate between the login and compute nodes.

The test partition requires that you actually interact with the session. If you go more than an hour without any kind of input, it will assume that you have left the session and will terminate it. If you have interactive tasks that must stretch over days, you may be able to use the GNU Screen or tmux utility to prevent the termination of a session.


Remote desktop access

As described in the Access & Login page, you can connect to the Odyssey system through NX-based remote desktops. Remote desktop access is particularly useful for heavy client applications like MATLAB, SAS, and Spyder where the performance of X11 forwarding is poor. Once you have connected via NX, though, you should start an interactive session or run batch jobs. The rcnx* and holynx* servers are just like Odyssey login nodes and cannot support direct computation.

Run an interactive session before starting your application on a login or NX node.

Storage and Scratch on Odyssey

Odyssey partitions have many owned and general purpose file systems attached for use by labs and individuals to store data long-term. These are shared filesystems and are typically located in a different datacenter from the compute nodes. As such high I/O (Input/Output) from jobs are not be the best use case for your lab storage, as lab storage is not designed for jobs that need to write large amounts of data or need quick access to storage.

For best performance while running jobs please use the regal temporary scratch storage found at /n/regal. This is a Lustre file system with 1.2 PB of storage and connected via Infiniband fabric. This temporary scratch space is available from all compute nodes.  In addition Lustre has the ability to stripe data across its servers to enhance performance.  See these handy guides for more details about how to do Lustre striping.

There are lab-based 50TB quota and a 90 day retention policy on regal scratch. Please review the scratch policy page here. If you have not moved your data after 90 days it will be deleted to make space for other users. Please use regal only for reading and writing data from the cluster. Please create a subdirectory in your lab group's folder here under /n/regal/[lab name] Please contact us if your lab does not have a Regal directory or you are unable to create a sub-directory for yourself.


Troubleshooting Jobs and Resource Usage

A number of factors, including fair-share are used for job scheduling

We use a multifactor method of job scheduling on Odyssey. Job priority is assigned by a combination of fair-share and length of time a job has been sitting in the queue.  You can find out the priority calculation for your jobs by using the sprio command, such as sprio -j JOBID.

You can find a description of how SLURM calculates Fair-share here.  Fairshare is shared on a lab basis, so usage by any member of the lab will impact the score of the whole lab as the lab is pulling from a common pool.  Fairshare has a 2 day halflife and naturally recovers if your lab does not run any jobs.  Thus it is wise to store up fairshare if you need to do significant runs, and plan your runs accordingly in order to maintain a good fairshare score.  You can learn more about your fairshare score and slurm usage by using the sshare command, such as sshare -U which shows your current score.  Contact RC if you want to get graphs of your usage and fairshare over time.

The other factor in priority is how long you have been sitting in the queue. The longer your job sits in the queue the higher its priority grows, out to a maximum of 7 days. If everyone’s priority is equal then FIFO is the scheduling method.  We have weighted the age of a job of a fully mature job to be equal to a fairshare score of 0.5.

We also have backfill turned on. This allows for jobs which are smaller to sneak in while a larger higher priority job is waiting for nodes to free up. If your job can run in the amount of time it takes for the other job to get all the nodes it needs, SLURM will schedule you to run during that period. This means knowing how long your code will run for is very important and must be declared if you wish to leverage this feature. Otherwise the scheduler will just assume you will use the maximum allowed time for the partition when you run.  The better your constrain your job in terms of CPU, Memory, and Time the easier it will be for the backfill scheduler to find you space and let your job jump ahead in the queue.

Troubleshooting common problems

A variety of problems can arise when running jobs on Odyssey. Many are related to resource mis-allocation, but there are other common problems as well

Error Likely cause
JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT You did not specify enough time in your batch submission script. The -t option sets time in minutes or can also take D-HH:MM form (0-12:30 for 12.5 hours)
Job <jobid> exceeded <mem> memory limit, being killed Your job is attempting to use more memory than you've requested for it. Either increase the amount of memory requested by --mem or --mem-per-cpu or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option. This could potentially be reduced. For jobs that require truly large amounts of memory (>256 Gb), you may need to use the bigmem SLURM partition. Genome and transcript assembly tools are commonly in this camp.
SLURM_receive_msg: Socket timed out on send/recv operation This message indicates a failure of the SLURM controller. Though there are many possible explanations, it is generally due to an overwhelming number of jobs being submitted, or, occasionally, finishing simultaneously. If you want to figure out if SLURM is working use the sdiag command. sdiag should respond quickly in these situations and give you an idea as to what the scheduler is up to.
JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE This message may arise for a variety of reasons, but it typically indicates that the host on which your job was running can no longer be contacted by SLURM.  Jobs that die from NODE_FAILURE are automatically requeued by the scheduler.

Using Threads such as OpenMP

One of the basic methods for parallelization is to use a threading library, such as OpenMP.  Slurm by default does not know what cores to assign to what process it runs, in addition for threaded applications you need to make sure that all the cores you request are on the same node.  Below is an example OpenMP script that both ensures all the cores are on the same node, and let's Slurm know which process gets the cores that you requested for threading.

#!/bin/bash
#SBATCH -N 1 # Number of Nodes
#SBATCH -c 8 # Number of threads
#SBATCH -t 0-00:30:00 # Amount of time needed DD-HH:MM:SS
#SBATCH -p shared # Partition to submit to
#SBATCH --mem-per-cpu=100 #Memory per cpu

module load intel/17.0.2-fasrc01
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -c $SLURM_CPUS_PER_TASK MYPROGRAM > output.txt 2> errors.txt

The most important aspect of the threaded script above is the -c option which tells Slurm how many threads you intend to run with.  The other important option is the -N option which tells slurm how many nodes to distribute the job over.  Without -N 1 Slurm could distribute your job over any number of nodes, leading to wasted resources and poor job performance.

Using MPI

MPI (Message Passing Interface) is a standard that supports communication between separate processes, allowing parallel programs to simulate a large common memory space. OpenMPI and MVAPICH2 are available as modules on Odyssey as well as an Intel specific library.

As described in the Helmod documentation, MPI libraries are a special class of module, called "Comp", that is compiler dependent. To load an MPI library, load the compiler first.

$ module load intel/15.0.0-fasrc01 openmpi/1.10.0-fasrc01

Once an MPI module is loaded, applications built against that library are made available. This dynamic loading mechanism prevents conflicts that can arise between compiler versions and MPI library flavors.

An example MPI script with comments is shown below:

#!/bin/bash
#SBATCH -n 128 # Number of cores
#SBATCH -t 5 # Runtime in minutes
#SBATCH -p shared # Partition to submit to
#SBATCH --mem-per-cpu=100 # Memory per cpu in MB (see also --mem)

module load intel/15.0.0-fasrc01 openmpi/1.10.0-fasrc01
module load MYPROGRAM
srun -n $SLURM_NTASKS --mpi=pmi2 MYPROGRAM > output.txt 2> errors.txt

There are a number of important aspects to an MPI SLURM job.

  • MPI jobs must be run on a partition that supports MPI interconnects.  shared, test, general, unrestricted are MPI-enabled, but serial_requeue includes non-MPI resources and should be avoided.
  • Memory should be allocated with the --mem-per-cpu option instead of --mem so that memory matches core utilization.
  • The -np option for mpirun or mpiexec (when these runners are used) should use the bash variable $SLURM_NTASKS so that the correct number of cores is passed to the MPI engine at runtime.
  • If network topology and communications overhead is a concern for your code, try using the  --contiguous option which will ensure that all the cores you get will be adjacent to each other.  Use this with caution though as it will make your job pend longer, as finding contiguous blocks of compute is difficult.  Verify that the boost in performance is worth the extra wait time in the queue.  If you do not include this option you will be given cores and what ever nodes that Slurm can find, which may be scattered across the cluster.   Depending on your code this may or may not be a concern.  Test your code in both modes to see if it is an option that is worth including if you don't know off hand.  After all it may be worth not including --continguous as the aggregate time of waiting plus runtime is significantly less than would be the case with --contiguous.  The sbatch and srun documentation have more information on various fine tuning options.
  • The application must be MPI-enabled. Applications cannot take advantage of MPI parallelization unless the source code is specifically built for it. All such applications in the Helmod module system can only be loaded if an MPI library is loaded first.

Job arrays

SLURM allows you to submit a number of "near identical" jobs simultaneously in the form of a job array. To take advantage of this, you will need a set of jobs that differ only by an "index" of some kind.

For example, say that you would like to run tophat, a splice-aware transcript-to-genome mapping tool, on 30 separate transcript files named trans1.fq, trans2.fq, trans3.fq, etc. First, construct a SLURM batch script, called tophat.sh, using special SLURM job array variables:

#!/bin/bash
#SBATCH -J tophat # A single job name for the array
#SBATCH -n 1 # Number of cores
#SBATCH -N 1 # All cores on one machine
#SBATCH -p serial_requeue # Partition
#SBATCH --mem 4000 # Memory request (4Gb)
#SBATCH -t 0-2:00 # Maximum execution time (D-HH:MM)
#SBATCH -o tophat_%A_%a.out # Standard output
#SBATCH -e tophat_%A_%a.err # Standard error

module load tophat/2.0.13-fasrc02

tophat /n/regal/informatics_public/ref/ucsc/Mus_musculus/mm10/chromFatrans"${SLURM_ARRAY_TASK_ID}".fq

Then launch the batch process using the --array option to specify the indexes.

sbatch --array=1-30 tophat.sh

In the script, two types of substitution variables are available when running job arrays. The first, %A and %a, represent the job ID and the job array index, respectively. These can be used in the sbatch parameters to generate unique names. The second, SLURM_ARRAY_TASK_ID, is a bash environment variable that contains the current array index and can be used in the script itself. In this example, 30 jobs will be submitted each with a different input file and different standard error and standard out files.

More detail can be found on the SLURM job array documentation page.

Checkpointing

Slurm does not automatically checkpoint, i.e. create files that your job can restart from.  To protect against job failure (due to code error or node failure) and to allow your job to be broken up into smaller chunks it is always advisable to checkpoint your code so it can restart from where it left off.  This is especially valuable for jobs on partitions subject to requeue, but is also just generally useful for any type of job.  Checkpointing varies from code type to code type and needs to be implemented by the user as part of their code base.

Job dependencies

Many scientific computing tasks consist of serial processing steps. A genome assembly pipeline, for example, may require sequence quality trimming, assembly, and annotation steps that must occur in series. Launching each of these jobs without manual intervention can be done by repeatedly polling the controller with squeue / sacct until the State is COMPLETED. However, it's much more efficient to let the SLURM controller handle this using the --dependency option.



Example of submitting a job with a dependency on a previous job.

When submitting a job, specify a combination of "dependency type" and job ID in the --dependency option. afterok is an example of a dependency type that will run the dependent job if the parent job completes successfully (state goes to COMPLETED). The full list of dependency types can be found on the SLURM doc site in the man page for sbatch.  It is best not to create a chain of dependencies that is greater than 2-3 levels.  Any more than that and the scheduler will become significantly slower.  Dependencies should only be used if the resource requirements between each step are significantly different, or if you need to wait for an array to complete before you run a single job that processes all the array results.  Be sure to think about whether you truly need dependencies or not.


Last updated: November 17, 2017 at 14:52 pm

CC BY-NC 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Permissions beyond the scope of this license may be available at Attribution.