Table of Contents
- 1 PREREQUISITES
- 2 Accessing the Cluster
- 2.1 Use a terminal to ssh to login.rc.fas.harvard.edu
- 2.2 Transfer any files you may need
- 2.3 Familiarize yourself with proper decorum on the cluster
- 2.4 Determine what software you'd like to load and run
- 2.5 Determine where your files will be stored
- 2.6 Run a batch job...
- 2.7 ... or an interactive job.
- 2.8 Getting further help
- 2.9 A note on requesting memory (--mem or --mem-per-cpu)
This guide will provide you with the basic information needed to get up and running on the FASRC cluster for simple command line access. If you'd like more detailed information, each section has a link to fuller documentation.
1. Get a FASRC account using the account request tool.
Before you can access the cluster you need to request a Research Computing account.
See How Do I Get a Research Computing Account for instructions if you do not yet have an account.
See the account confirmation email for instructions on setting your password and getting started.
2. Set up OpenAuth for two-factor authentication
Once you have your new FASRC account, you will need to set up our OpenAuth tool for two-factor authentication.
See the OpenAuth Guide for instructions if you have not yet set up OpenAuth.
For troubleshooting issues you might have, please see our troubleshooting page.
3. Use the FASRC VPN when connecting to storage, VDI, or other resources available only on our networks.
4. Review our introductory training
Accessing the Cluster
Use a terminal to ssh to login.rc.fas.harvard.edu
NOTE: If you did not request cluster access when signing up, you will not be able to log into the cluster or login node. See this doc for how to add cluster access.
For command line access to the cluster, connect to login.rc.fas.harvard.edu using ssh. If you are running Linux or Mac OSX, open a terminal and type
ssh USERNAME@login.rc.fas.harvard.edu, where USERNAME is the name you were assigned when you received your account. Enter the password you set up in the account request tool. When prompted for the Verification code, use the OpenAuth supplied number.
The OpenAuth application (upper right corner) displays the value to be used for the Verification code prompt.
You can also add the option
-CY if you have an X11 server installed and want graphics support (
ssh -CY USERNAME@login.rc.fas.harvard.edu). For help with X11 forwarding, start with our Access and Login page.
For Windows users, we recommend PuTTY for SSH. HUIT (Harvard IT) also provides newer versions of SecureCRT (SSH) and SecureFX (SFTP). If you are in FAS and would like to try them, go to the HUIT download page (uses HarvardKey). Older versions of these programs will not work with modern SSH.
Transfer any files you may need
If you're using a Linux-y terminal like the Mac OSX Terminal tool or a Linux xterm, you'll want to use
scp for transferring data, giving USERNAME@login.rc.fas.harvard.edu: (with nothing after the colon) as the destination.
This will transfer the data into the root of your home directory.
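As a minimal sketch of the transfer described above, the snippet below builds an scp command and prints it so it can be checked before it is run. The file name mydata.txt and the variable names are hypothetical placeholders, and USERNAME stands for your own account name.

```shell
# Hypothetical placeholders: substitute your own account name and file.
FASRC_USER="USERNAME"
LOGIN_HOST="login.rc.fas.harvard.edu"
LOCAL_FILE="mydata.txt"

# An empty path after the colon means "the remote home directory".
SCP_CMD="scp ${LOCAL_FILE} ${FASRC_USER}@${LOGIN_HOST}:"

# Print the command so it can be reviewed first.
echo "${SCP_CMD}"

# Uncomment to run the transfer. You will be prompted for your password
# and your OpenAuth verification code.
# ${SCP_CMD}
```

To copy in the other direction, swap the two arguments so that the remote path comes first and a local directory comes second.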
There are also graphical scp tools available. The Filezilla SFTP client is available cross-platform for Mac OSX, Linux, and Windows. See our SFTP file transfer using Filezilla document for more information. Windows users who prefer SCP can download WinSCP from WinSCP.net.
NOTE: If you are off campus or behind a firewall, you should first connect to the Research Computing VPN.
- See our pages on data transfer using SCP and on SFTP file transfer using Filezilla (Mac/Windows/Linux) for more details.
Familiarize yourself with proper decorum on the cluster
The FASRC cluster is a massive system of shared resources. While much effort is made to ensure that you can do your work in relative isolation, some rules must be followed to avoid interfering with other users' work.
The most important rule on the cluster is to avoid performing computations on the login nodes. Once you've logged in, you must either submit a batch processing script or start an interactive session (see below). Any significant processing (high memory requirements, long running time, etc.) that is attempted on the login nodes will be killed.
See the full list of Cluster Customs and Responsibilities.
Determine what software you'd like to load and run
An enhanced module system called Helmod is used on the cluster to control the run-time environment for individual applications. To find out what modules are available you can either look at the module list on the RC / Informatics portal, or use the
module avail command. By itself, module avail will print out the entire list of packages. To find a specific tool, use the module spider or module-query command.
Once you've determined what software you would like to use, load the module:
module load MODULENAME
where MODULENAME is the specific software you want to use. You can use
module unload MODULENAME to unload a module. To see what modules you have loaded type
module list. This is very helpful information to provide when you submit help tickets.
For errors in loading modules after the O3 upgrade, see the Modules on CentOS7 upgrade page.
For details on finding and using modules effectively, see the Software on the cluster page.
For details on running software on the cluster, including graphical applications, see the module section of the Running Jobs page.
Determine where your files will be stored
Users of the cluster are granted 100 GB of storage in their home directory. This volume has decent performance and is regularly backed up. For many, this is enough to get going. However, there are a number of other storage locations that are important to consider when running software on the FASRC cluster.
- /n/scratchlfs Scratchlfs is a large, high performance temporary Lustre filesystem. We recommend that people use this filesystem as their primary working area, as this area is highly optimized for cluster use. Use this for processing large files, but realize that files will be removed after 90 days and the volume is not backed up. Create your own folder inside the folder of your lab group. If that doesn't exist, contact RCHelp.
- /scratch When running batch jobs (see below), /scratch is a large, very fast temporary store for files created while a tool is running. Because the disks are local to the node that is performing the computation, access is very fast. However, data is only accessible from the node itself, so you cannot directly retrieve it after calculations are finished.
- Lab storage Each lab that is doing regular work on the cluster can request an initial 4 TB of group accessible storage at no charge. Like home directories, this is a good place for general storage, but it is not high performance and should not be used during I/O intensive processing.
Do NOT use your home directory or lab storage for significant computation. This degrades performance for everyone on the cluster.
For details on different types of storage and how to obtain more, see the Cluster Storage page.
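Because files in /scratch are visible only on the node that wrote them, a job that uses it needs to copy its results elsewhere before it exits. The following is a sketch of that pattern, not a prescribed workflow. RESULTS_DIR and the file names are hypothetical, the tr command stands in for a real tool, and the fallback to the system temporary directory is only there so the snippet can also be tried on a machine without /scratch.

```shell
# Use node-local /scratch when it exists and is writable; otherwise fall
# back to the system temporary directory (assumption for off-cluster use).
SCRATCH_BASE="/scratch"
if [ ! -d "${SCRATCH_BASE}" ] || [ ! -w "${SCRATCH_BASE}" ]; then
    SCRATCH_BASE="${TMPDIR:-/tmp}"
fi

# Private working directory for this job's temporary files.
WORKDIR=$(mktemp -d "${SCRATCH_BASE}/job_XXXXXX")

# Hypothetical destination for results, e.g. your folder under /n/scratchlfs.
RESULTS_DIR="${RESULTS_DIR:-$PWD}"

# Stand-in for a real tool: write an intermediate file, then a result.
echo "intermediate data" > "${WORKDIR}/tmp.dat"
tr '[:lower:]' '[:upper:]' < "${WORKDIR}/tmp.dat" > "${WORKDIR}/result.txt"

# Copy the result off the node-local disk, then clean up.
cp "${WORKDIR}/result.txt" "${RESULTS_DIR}/"
rm -rf "${WORKDIR}"
```

Inside a batch script, these lines would follow the #SBATCH directives, with the stand-in commands replaced by the actual tool.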
Run a batch job...
The cluster is managed by a batch job control system called SLURM. Tools that you want to run are embedded in a command script and the script is submitted to the job control system using an appropriate SLURM command.
For a simple example that just prints the hostname of a compute host to both standard out and standard err, create a file called
hostname.slurm with the following content:
#!/bin/bash
#SBATCH -n 1 # Number of cores requested
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 15 # Runtime in minutes
#SBATCH -p serial_requeue # Partition to submit to
#SBATCH --mem=100 # Memory per node in MB (see also --mem-per-cpu)
#SBATCH -o hostname_%j.out # Standard out goes to this file
#SBATCH -e hostname_%j.err # Standard err goes to this file
hostname
Then submit this job script to SLURM with
sbatch hostname.slurm
When command scripts are submitted, SLURM looks at the resources you've requested and waits until an acceptable compute node is available on which to run it. Once the resources are available, it runs the script as a background process (i.e. you don't need to keep your terminal open while it is running), returning the output and error streams to the locations designated by the script.
You can monitor the progress of your job using the
squeue -j JOBID command, where JOBID is the ID returned by SLURM when you submit the script. The output of this command will indicate if your job is PENDING, RUNNING, COMPLETED, FAILED, etc. If the job is completed, you can get the output from the file specified by the
-o option. If there are errors, they should appear in the file specified by the
-e option.
If you need to terminate a job, the
scancel JOBID command can be used (JOBID is the number returned when the job is submitted).
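The submit, monitor, and cancel steps described above can be strung together as in the sketch below. It assumes the hostname.slurm file from this section is in the current directory and uses sbatch's --parsable option, which prints only the job ID (followed by a semicolon and the cluster name when one is reported). The parse_jobid helper is a hypothetical name introduced for this example.

```shell
# Keep only the job ID from `sbatch --parsable` output, which is either
# "JOBID" or "JOBID;CLUSTERNAME".
parse_jobid() {
    echo "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
    # Submit the script and remember the job ID SLURM returns.
    JOBID=$(parse_jobid "$(sbatch --parsable hostname.slurm)")
    echo "Submitted job ${JOBID}"

    # Show the job's current state (PENDING, RUNNING, ...).
    squeue -j "${JOBID}"

    # Uncomment to terminate the job.
    # scancel "${JOBID}"
else
    echo "sbatch not found; run this on a cluster login node" >&2
fi
```

Once the job has finished, its output is in hostname_JOBID.out and any errors are in hostname_JOBID.err, as set by the -o and -e lines of the script.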
SLURM-managed resources are divided into partitions (known as queues in other batch processing systems). Normally, you will be using the
serial_requeue partitions, but there are others for interactive jobs (see below), large memory jobs, etc.
For more information on the partitions on the cluster, please see the SLURM partitions page.
For more information on running batch jobs, including MPI code, please see the Running Jobs page.
For a list of useful SLURM commands, please see the Convenient SLURM Commands page.
... or an interactive job.
Batch jobs are great for long-lasting computationally intensive data processing. However, many activities like one-off scripts, graphics and visualization, and exploratory analysis do not work well in a batch system, but are too resource intensive to be done on a login node. There is a special partition on the cluster called "test" that is designed for responsive, interactive shell and graphical tool usage.
You can start an interactive session using a specific flavor of the
srun command:
srun -p test --pty --mem 500 -t 0-06:00 /bin/bash
srun is like
sbatch, but it runs synchronously (i.e. it does not return until the job is finished). The example starts a job on the "test" partition, with pseudo-terminal mode on (
--pty), an allocation of 500 MB RAM (
--mem 500), and for 6 hours (
-t in D-HH:MM format). It also assumes one core on one node. The final argument is the command that you want to run. In this case you'll just get a shell prompt on a compute host. Now you can run any normal Linux commands without taking up resources on a login node. Make sure you choose a reasonable amount of memory (
--mem) for your session.
Getting further help
If you have any trouble with running jobs on the cluster, first check the comprehensive Running Jobs page and our FAQ. Then, if your questions aren't answered there, feel free to contact us at RCHelp. Tell us the job ID of the job in question. Also provide us with what script you ran, the error and output files, and where they're located as well. The output of
module list is helpful, too.
A note on requesting memory (--mem or --mem-per-cpu)
In SLURM you must declare how much memory you are using for your job using the
--mem or --mem-per-cpu command switches. By default SLURM assumes you need 100 MB. If you don't request enough, the job can be terminated, often without very useful information (error files can show segfault, file write errors, etc. that are downstream symptoms). If you request too much, it can increase your wait time (it's harder to allocate a lot of memory than a little), crowd out jobs for other users, and lower your fairshare.
You can view the runtime and memory usage for a past job with
sacct -j JOBID --format=JobID,JobName,ReqMem,MaxRSS,Elapsed
where JOBID is the numeric job ID of a past job:
JobID         JobName  ReqMem    MaxRSS    Elapsed
531306        sbatch                       00:02:03
531306.batch  batch    750000K   513564K   00:02:03
531306.0      true               916K      00:00:00
The .batch portion of the job is usually what you're looking for, but the output may vary. This job had a maximum memory footprint of about 500 MB and took a little over two minutes to run.
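To turn a MaxRSS figure like the one above into a value for --mem, which is given in MB, the arithmetic can be sketched as follows. The function names are hypothetical, only values with a K suffix are handled, and the 20% headroom is an assumed margin for illustration rather than an official recommendation.

```shell
# Convert a MaxRSS value reported in kilobytes (e.g. 513564K) to whole
# megabytes, rounding up. Only the K suffix is handled in this sketch.
maxrss_to_mb() {
    kb=${1%K}
    echo $(( (kb + 1023) / 1024 ))
}

# Suggest a --mem value: peak usage plus 20% headroom (assumed margin).
suggest_mem_mb() {
    mb=$(maxrss_to_mb "$1")
    echo $(( mb * 120 / 100 ))
}

maxrss_to_mb 513564K     # prints 502
suggest_mem_mb 513564K   # prints 602
```

For the job shown above, this suggests a request of roughly 600 MB in place of the 750000K that was originally requested.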
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at Attribution.