
FAQ

a. Login and Authentication (12)

My login is slow or my batch commands are slow

Nine times out of ten, slowness at login, slow file-transfer or batch command starts, and failed SFTP sessions are caused by unneeded module loads in your .bashrc.

We do not recommend putting multiple module loads in your .bashrc, since every new shell that you or your jobs create will run those module loads. Instead, put your module loads in your job scripts so that you are not loading unneeded modules and waiting on those module calls to complete before the job starts. Alternately, you can create a login script or alias containing your frequently used modules that you run only when you need them.
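For example, here is a minimal sketch (the module names are placeholders; substitute whatever you actually use):

# In a job script: load only the modules this particular job needs,
# just before the work that uses them.
module load gcc openmpi

# Or, in ~/.bashrc: define an alias you invoke on demand instead of
# loading modules in every new shell.
alias mymods='module load gcc openmpi'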

Either way, try to keep any module loads in your .bashrc down to a bare minimum, calling only those modules that you absolutely need in each and every login or job.

Additionally, as time goes on modules change or are removed. Please ensure you remove any deprecated modules from your .bashrc or other scripts. For example, the legacy modules no longer exist. So if you have a call to 'module load legacy' and any of the legacy modules, your login will be delayed as the module system searches for and then times out on those non-existent modules.

My alternate shell (csh, tcsh, etc.) doesn’t work right

Having a non-standard default shell will cause problems and does not allow us to set global environmental defaults for everyone. As of Odyssey3 we will no longer change the default shell on any account or support the use of alternate shells as default login shell.

Users who do not have bash as their default login shell will need to change back to bash. Users can, of course, still launch an alternate shell once logged in.

SSH key error, DNS spoofing message

Whenever nodes are updated (for instance, the May 2018 upgrade to CentOS 7), a significant change to them is likely to change the SSH key fingerprint. Because your computer has already stored the old fingerprint locally, you will receive a key mismatch error such as "WARNING: POSSIBLE DNS SPOOFING DETECTED!" and "The RSA host key for login.rc.fas.harvard.edu has changed".

Mac/Linux

To fix this, you will need to remove the key in question from your computer's local known_hosts file. If you are on a Mac or Linux, you can use the following command from a terminal window on your computer.

ssh-keygen -R login.rc.fas.harvard.edu

If the error was for a specific node, replace 'login.rc.fas.harvard.edu' with the full name of that host.
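For example (the hostname below is illustrative; substitute the node named in your error message):

ssh-keygen -R rclogin04.rc.fas.harvard.edu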

You can now log into the node and will receive a fresh prompt to store the new SSH key.

Note that ssh assumes the username on your local machine matches your Odyssey account username. If this is not the case, you will have to specify your username explicitly, such as: ssh jharvard@login.rc.fas.harvard.edu

Please note that there are several nodes behind the 'login.rc.fas.harvard.edu' hostname, so you may receive more errors like the above. Answering yes will allow you to continue.

Alternately, if the Odyssey cluster is the only host you connect to over SSH, you may find it easiest to simply remove the known_hosts file and let it be re-created at your next login. Mac and Linux users can do so from a terminal on their computer with the following command:

rm ~/.ssh/known_hosts

 

Windows/PuTTY

PuTTY may prompt you to update the key in place, or it may require updating a registry entry to correct this. If the latter, you will need to remove the stored host keys from the registry:

  1. Open 'regedit.exe': search for it, or press "Windows Key + R", type "regedit", and hit Enter; alternatively, try opening C:\Windows\System32\regedt32.exe
  2. Find HKEY_CURRENT_USER\Software\[your username here]\PuTTY\SshHostKeys
  3. Remove all keys, or find and delete the individual key you need to remove
  4. Restart your computer; the changes won't take effect until after a restart.

SFTP exits after a few seconds

When connecting via an SFTP client like FileZilla, if you experience a short delay and then disconnection, the issue is most likely caused by your .bashrc.

During SFTP connections, your .bashrc will be evaluated just as if you were logging in via SSH. If you've added anything to your .bashrc that attempts to echo to the terminal/standard out, this will cause your SFTP client to hang and then disconnect.

You can either remove the statement in your .bashrc that is writing output (an echo statement, a call to an app or module that sends a message to standard out, etc.), or you can wrap the offending statement in a clause that first checks whether this is an interactive login, like so:

if [ "$SSH_TTY" ]
  then
     echo "SFTP connections won't evaluate the things inside this clause."
     echo "Only real login sessions will."
  fi

What happens to my account when I leave/graduate?

Your RC account is only valid while sponsored by an eligible faculty member. Once you leave the university, your account is subject to closure.

In order for a departing user to keep an open RC account, FAS RC will need a record of the sponsor's direct approval.

If you need to continue access after leaving, please have your PI/sponsor contact us (via our Portal or by emailing rchelp@rc.fas.harvard.edu from their university email account). Please understand that a forwarded email is not sufficient, as this could easily be forged by someone attempting to game the system.

How do I get a Research Computing account?

Before You Sign Up

If you are unsure whether you qualify for an RC account, please see Qualifications and Affiliations. More information on using the signup tool can be found here.

For more information on the billing process for labs and their members, please see: Billing FAQ.

Please Note: You may have only one RC account. If you need to add cluster access or membership in a different/additional lab group, please submit a help ticket. Please do not sign up for a second account. This is unnecessary and against our account policies.

The Process

To request an account to access resources operated by Research Computing (Odyssey Cluster, Storage, Software Downloads, Workstation access, Instrument sign-up, etc.), please proceed to the Account Request Tool.

PLEASE NOTE: Do not select FACULTY as your job type if you do not have a faculty appointment. If you are a researcher with additional rights (fellowship, PI-like rights, funding, etc.), please select STAFF or POSTDOC. If you are faculty from another university, please choose EXTERNAL. Faculty accounts are intended only for those holding an active Harvard Associate Professor or higher appointment.

Once you've submitted the request, the process is:

If You Selected: Internal/Using Harvard Key to verify your information and qualifications:

  1. The request is on hold while the PI is asked to approve or reject it.
  2. Once approved, the account is finalized and set up.
  3. Once finalized, you receive an automated email confirmation with your new account information and instructions for setting the password.

If You Selected: External/Not using Harvard Key to verify your information and qualifications:

  1. The request goes to RC personnel to check that it is complete and meets affiliation requirements.
  2. Once approved by RC, an email is sent to your PI to approve/reject the request.
  3. The request is on hold while the PI is asked to approve or reject it.
  4. Once approved, we finalize the account on our side (during business hours).
  5. Once finalized, you receive an automated email confirmation with your new account information and instructions for setting the password.

You can then proceed to set up your OpenAuth token and get connected to Odyssey. The turnaround time is directly related to the PI/Sponsor's approval of the account. External accounts are reviewed by RC staff during business hours and are generally vetted and sent on to the PI/Sponsor for approval within one business day.

NOTE! If you request "Odyssey Cluster Use" (the ability to run jobs on the cluster), you are required to complete the online Introduction to Odyssey course within 45 days of your account being issued.

Can I share an account? – Account Security Policies

The sharing of passwords or login credentials is not allowed under RC and Harvard information security policies. Please bear in mind that this policy also protects the end-user. Sharing credentials removes plausible deniability for the account holder in case of account misuse. Accounts which are in violation of this policy may be disabled or otherwise limited.

If you find that you need to share resources among multiple individuals, please contact us and we will be happy to assist you with finding a safe and secure way to do so.

How do I login to Odyssey?

Step 0: Ensure that you've requested an account, your PI has approved the account request, and that you've received an Account Approved notice from RC.

Step 1: Launch the OpenAuth application. For instructions on how to install and launch OpenAuth please see here.

Step 2: Launch a Terminal application.

Step 3: Using your Terminal application, connect through login.rc.fas.harvard.edu using ssh. If you are running Linux or Mac OSX it is as simple as running: ssh USERNAME@login.rc.fas.harvard.edu

USERNAME is the name you were assigned when you received your Research Computing account. (Add -Y if you have an X11 server installed and desire graphics support.) If you are on Windows, download PuTTY or your favorite ssh software and connect to login.rc.fas.harvard.edu.

You will be asked for your Research Computing password and OpenAuth Verification Code upon connecting. The hostname login.rc.fas.harvard.edu is a round-robin to some of our hosts named rclogin##.rc.fas.harvard.edu, so that is what you will see in your shell prompt once connected.

Note: In certain instances you will need to be logged on to the Research Computing VPN to access Odyssey. Please see the VPN setup page for instructions on how to logon to the Research Computing VPN.

For more details on access to the Odyssey cluster see the Access and Login page.

How do I reset my Research Computing account password?

Please click here to reset your Research Computing account password using your email address.

This will send an email to you with a one-time use link to set a new password.

Please note: Your username is not your email address. Your email address is used here only for password resets and to contact you.

How do I unlock my locked Research Computing account?

If your account is locked, it will automatically unlock after 15 minutes.

If your account is 'unavailable' or disabled, you will need to contact us so we can look at why and resolve any pending problems or questions regarding your account. If your account has not been used in a very long time (6 months or more), it may be disabled as a security caution.

How do I install and launch OpenAuth?

Please click here to set-up OpenAuth.

The site will prompt you for your Harvard FAS Research Computing username and password. If you don’t yet have an account, you can request one here. Since the site uses email verification to authenticate you, you must also have a valid email address on record with Research Computing. All OpenAuth tokens are software-based, and you will choose whether to use a smart phone or java desktop app to generate your verification codes. Java 1.6 is required for the desktop app.

You must close your browser in order to log out of the site when you're done. Once you have logged out, launch OpenAuth like any other application. You will need to use OpenAuth when accessing the Research Computing VPN and Odyssey cluster.

b. Filesystems and Authorization (8)

Where is ftp?

Modern secure transfer protocols like SFTP and SCP secure data during transit and should be used when moving files from one place to another. However, you may still need to use plain, unsecured FTP to download data sets or other files from remote locations while logged into Odyssey.

While we do not offer the largely outmoded 'ftp' program on the cluster, we do offer the feature-rich and largely command-compatible 'lftp'. From any login or compute node, type 'man lftp' to see its usage and options.

How do I access my Odyssey home directory from my laptop?

Odyssey home directories are available through SAMBA and so can be mounted as a network drive on Mac, Windows, and Linux computers. See the Access and Login page for specific instructions on how to mount the directory.

If you do not need a persistent connection to your home directory, you can also transfer files using SFTP.

How do I check how much space I’ve used?

The standard Linux tool du shows how much disk space is being used by individual files and directories. For example, first cd to your home directory:

cd ~

then run:

du -x --max-depth 1 .

This will print how much space is used by each directory in your current working directory, plus a total at the end. It may take a few minutes if you have a lot of files, so be patient.
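If you prefer a human-readable listing sorted by size, a variation such as the following also works (a sketch assuming GNU du and sort):

du -xh --max-depth=1 ~ | sort -h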

Note: With the legacy home directories, the command df showed your personal quota details; this does not work with our current configuration of home directories on the Isilon filesystem.

How much space do I have in my home directory?

You are given 100 GB in your home directory. This is twice as much as with the legacy home directories. This size limit is referred to as your quota.

Sorry, but we cannot increase this allotment. Please use disk shares associated with your lab or one of our scratch filesystems if you require more space.

Please see our Storage document for more information.

I accidentally deleted my data, how do I get it back?

Your home directory has periodic snapshots taken. These snapshots are of your home directory files from various recent points in time. They are in a hidden directory named .snapshot, within every other directory in your home directory. The command ls -a will not show these, but you can ls .snapshot directly, and cd .snapshot to go into the directory.

In the .snapshot folder you will see "hourly", "daily", and "monthly" folders with the date of the snapshots. Traverse (cd) to the snapshot folder corresponding to the point in time you wish to restore data from. From there you can simply copy the relevant files back into your home folder using your favorite file copy tool (rsync, cp, etc.).
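For example, a minimal sketch (the snapshot name and filename below are illustrative; run ls ~/.snapshot to see what actually exists):

cp ~/.snapshot/hourly.2018-05-22_1700/results.csv ~/results.csv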

Lab directory backups are for disaster recovery only, as they are handled separately and do not have snapshot capabilities. As such, we cannot recover accidental file deletions. Please contact RC Help if you have any questions.

Please also see our Storage document for more info.

Why are all my files executable?

You may notice that the x (execute) bit is set on all your files:

[username@rclogin01 ~]# ls -l myfile.txt
-rwxr--r-- 1 username groupname 3029 Aug 20 03:10 myfile.txt

Furthermore, chmod does not remove it:

[username@rclogin01 ~]# chmod u-x myfile.txt
[username@rclogin01 ~]# ls -l myfile.txt
-rwxr--r-- 1 username groupname 3029 Aug 20 03:10 myfile.txt

This is a feature, a result of the storage system doing mixed Unix-style and Windows-style permissions. If this is causing a problem for you, please contact rchelp@fas.harvard.edu.

Why does my UMASK not work?

You may also notice that your UMASK environment variable does not work as expected:

[username@rclogin01 ~]# umask 002
[username@rclogin01 ~]# touch newfile.txt
[username@rclogin01 ~]# ls -l newfile.txt
-rwx------ 1 username groupname 3029 Aug 20 03:10 newfile.txt

Normally, the outcome would be -rw-rw-r--. If this is causing a problem for you, please contact rchelp@fas.harvard.edu.

Is my home directory available as a network filesystem share?

Yes, your cluster home directory is available as a network filesystem share to which you can directly connect your own desktop or laptop. The technical protocol for this is called CIFS or Samba, so you will often hear us refer to it in that way. On Windows, this is also referred to as mapping a network drive, and on a Mac it is called connecting to a server.

In all cases, you need your RC username, password, server name, and path. Please see the Mounting Storage document for detailed information.

d. Software (7)

I need access to Gaussian

Please contact us if you require Gaussian access. It is controlled on a case-by-case basis and requires membership in a security group.

To see all available versions of Gaussian, visit the Portal Module Search and Search for 'gaussian'.

Where is ftp?

Modern secure transfer protocols like SFTP and SCP secure data during transit and should be used when moving files from one place to another. However, you may still need to use plain, unsecured FTP to download data sets or other files from remote locations while logged into Odyssey.

While we do not offer the largely outmoded 'ftp' program on the cluster, we do offer the feature-rich and largely command-compatible 'lftp'. From any login or compute node, type 'man lftp' to see its usage and options.

How do I load a module or software on Odyssey?

Step 1: Login to Odyssey through your Terminal window. Please see here for login instructions.

Step 2: Load a module/software by typing: module load MODULENAME. Replace MODULENAME with the specific software you want to use. A complete listing of modules can be found on the module list page. Only the modules that begin with centos6/ are supported on the current cluster.

To see what modules you have loaded type: module list

To unload a module type: module unload MODULENAME

Details can be found in the modules section of the Running Jobs page.

FileZilla: I have to enter my OpenAuth code every 30 seconds

If you are using Filezilla to transfer files to Odyssey, and you are prompted frequently (like every 30 seconds!) to enter your RCUsername and/or OpenAuth token code, then most likely you did not configure FileZilla according to our instructions.

Please see this document on how to avoid the OpenAuth challenge frustration while transferring files to and from Odyssey.

Git/Github: 403 Forbidden while accessing https://github.com…

If you issue a git push to a cloned repository, you might receive the following error:

error: The requested URL returned error: 403 Forbidden while accessing https://github.com/yourusername/planets.git/info/refs
fatal: HTTP request failed

Authorization to GitHub repositories on Odyssey can be a little tricky. Please follow our instructions at https://rc.fas.harvard.edu/resources/documentation/software/git-and-github-on-odyssey/

How do I run a Matlab script on Odyssey?

To run a Matlab script (with no graphical interface component) on the Odyssey cluster, log in using your preferred terminal application, then activate the application by loading the module.

module load matlab
 
To load a particular version of Matlab, see: Module Search 'matlab'

Then, assuming your script is named calc.m, either run it through an interactive session

srun --pty --mem 1000 -p test matlab -nojvm -nodisplay -nosplash < calc.m

or use the matlab command in a batch script

#!/bin/bash
#SBATCH -o calc.out
#SBATCH -e calc.err
#SBATCH -p serial_requeue
#SBATCH -n 1
#SBATCH --mem 1000
#SBATCH -t 1000

matlab -nojvm -nodisplay -nosplash < calc.m

Make sure that `calc.m` finishes with an `exit` command. Otherwise, the process will hang waiting for further input.

Perl modules: Can’t locate XX.pm in @INC

Perl modules have been developed over the past 15 to 20 years, and the installation method has changed significantly. Unfortunately, you might run into a program that needs to install a really old Perl module, and its installation is just not behaving properly under the new installation methods. You might see something like the following:

[bfreeman@rclogin12 PfamScan]$ ./pfam_scan.pl --help
Can't locate Data/Printer.pm in @INC (@INC contains: /n/sw/fasrcsw/apps/Core/perl-modules.....

The remedy can be rather simple:
1. Follow our new lmod - Perl instructions here on setting up your home directory for installing Perl modules 'locally'.

Note that the export PERL5LIB command must include both $LOCALPERL and $LOCALPERL/lib/perl5 (its subdirectory), as some installation routines honor one and some the other; a short sketch follows at the end of this list.

2. Sometimes, you might need to install the module manually. Try both the Makefile.PL build and the Build.PL build if one or the other doesn't work.

3. In CPAN, you can do this manual install method without the hassle of the download process:

cpan
look Data::Printer

This latter command will download the module and unpack it for you, and leave you at the shell, where you can try either the Makefile.PL or Build.PL build process.
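As mentioned in step 1 above, here is a minimal sketch of the PERL5LIB setup (the install prefix is illustrative; the linked lmod - Perl instructions are the authoritative reference):

# In ~/.bashrc or a setup script. $LOCALPERL is wherever you install
# Perl modules locally -- the path below is only an example.
export LOCALPERL=$HOME/perl5
export PERL5LIB=$LOCALPERL:$LOCALPERL/lib/perl5:$PERL5LIB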
 
 
c. Jobs and SLURM (20)

How do I know what memory limit to put on my job?

Add to your job submission:

#SBATCH --mem X

where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:

sacct -o MaxRSS -j JOBID

where JOBID is the one you're interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you're defining a hard upper limit).
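As a rough, made-up illustration:

# Suppose sacct -o MaxRSS -j JOBID reports 3174400 (KB).
# 3174400 / 1024 is roughly 3100 MB, so request a little more than that:
#SBATCH --mem 3500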

For more information see here.

How can I check OS version when loading modules?

If you need to check what OS version the node you are on is running before loading modules, you can add a test in your .bashrc for the value of the environment variable FASRCSW_OS.

So, for instance, you could source a different file for CentOS 7 by adding something like the following to your .bashrc:

if [ "$FASRCSW_OS" = "centos7" ]; then
    source ~/.mycentos7.rc
else
    source ~/.mynormal.rc
fi

 

How do I figure out how efficient my job is?

You can see your job efficiency by comparing Elapsed, CPUTime, and NCPUS in sacct. For example:

[user@rclogin01 home]# sacct -j 1234567 -o Elapsed,CPUTime,NCPUS
   Elapsed      CPUTime      NCPUS
---------- ------------ ----------
  13:22:35  35-16:05:20         64
  13:22:35  17-20:02:40         32
  13:22:41  35-16:11:44         64
  13:21:39   1-02:43:18          2

In this job you can see that the user requested 64 cores and the job ran for about 13.4 hours (Elapsed). The CPUTime is 35 days and 16 hours, roughly 856 hours, which is close to 64 * 13.4 hours. If your code is scaling effectively, CPUTime = NCPUS * Elapsed; if it is not, those numbers will diverge. The best way to test this is to do some scaling tests. There are two styles you can do. Strong scaling is where you leave the problem size the same but increase the number of cores: if your code scales well, the run time should drop in proportion to the number of cores you use. The other is weak scaling, where the amount of work per core remains the same but you increase the number of cores, so the total problem size grows proportionally to the number of cores: if your code scales well in this case, the run time should remain the same.

Typically most codes have a point where the scaling breaks down due to inefficiencies in the code. Thus beyond that point there is not any benefit to increasing the number of cores you throw at the problem. That's the point you want to look for.  This is most easily seen by plotting log of the number of cores vs. log of the runtime.

The other factor that is important in a scheduling environment is that the more cores you ask for, the longer your job will pend, as the scheduler has to find more room for you. You therefore need to find the sweet spot that minimizes both your runtime and your time pending in the queue. For example, it may be that a 32-core job would take a day to run but pend for 2 hours, while a 64-core job would take half a day to run but pend for 2 days. In that case it would have been better to ask for 32 cores, even though the job itself runs slower.

Will single core/thread jobs run faster on the cluster?

The cluster cores, in general, will not be any faster than the ones in your workstation; in fact, they may be slower if your workstation is relatively new. While we have a variety of chipsets available on the cluster, most of the cores are AMD and will be slower than many Intel chips, which are most common in modern desktops and laptops. The reason we use so many AMD chips is that we could purchase a larger number of cores and RAM this way. This is the power of the cluster. The cluster isn't designed to run single-core code as fast as possible, as the chips to do that are expensive. Rather, you trade off raw chip speed for core count, and then gain speed and efficiency via parallelism. So the cluster excels at multicore jobs (using threads or MPI ranks) or at running many single-core jobs (such as parameter sweeps or image processing). This way you leverage the parallel nature of the cluster and the 60,000 cores available.

So if you have a single job, the cluster isn't really a gain. If you have lots of jobs you need to get done, or your job is too large to fit on a single machine (due to RAM or its parallel nature), the cluster is the place to go. The cluster can also be useful for offloading work from your workstation. That way you can use your workstation cores for other tasks and offload the longer-running work onto the cluster.

In addition, since the cluster cores are a different architecture from your workstation's, be aware that the code will need to be optimized differently. This is where compiler choice and compiler flags can come in handy, so that you can get the most out of both sets of cores. Even then you may not get the same performance out of the cluster as your local machine. The main processor we have on the cluster is now 4 years old, and if you are using serial_requeue you could end up on anything from hardware bought today to machines purchased 7 years ago. The natural development of processor technology alone accounts for about a factor of 2-4 in performance over that span.

My login is slow or my batch commands are slow

Nine times out of ten, slowness at login, slow file-transfer or batch command starts, and failed SFTP sessions are caused by unneeded module loads in your .bashrc.

We do not recommend putting multiple module loads in your .bashrc, since every new shell that you or your jobs create will run those module loads. Instead, put your module loads in your job scripts so that you are not loading unneeded modules and waiting on those module calls to complete before the job starts. Alternately, you can create a login script or alias containing your frequently used modules that you run only when you need them.

Either way, try to keep any module loads in your .bashrc down to a bare minimum, calling only those modules that you absolutely need in each and every login or job.

Additionally, as time goes on modules change or are removed. Please ensure you remove any deprecated modules from your .bashrc or other scripts. For example, the legacy modules no longer exist. So if you have a call to 'module load legacy' and any of the legacy modules, your login will be delayed as the module system searches for and then times out on those non-existent modules.

Can I query SLURM programmatically?

I'm writing code to keep an eye on my jobs. How can I query SLURM programmatically?

We highly recommend that people writing meta-schedulers, or anyone who wishes to interrogate SLURM from scripts, do so using the squeue and sacct commands. We strongly recommend that your code perform these queries no more than once every 60 seconds. These commands contact the master controller directly, the same process responsible for scheduling all work on the cluster; polling more frequently, especially across all users on the cluster, will slow down response times and may bring scheduling to a crawl. Please don't.
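For instance, a minimal sketch of a polite polling loop (JOBID is a placeholder for your own job number):

#!/bin/bash
# Check the state of one job no more than once per minute.
while true; do
    squeue -j JOBID -h -o "%T"    # -h: no header, %T: job state
    sleep 60
done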

SLURM also has an API that is documented on the website of our developer partners SchedMD.com.

How do I fairly manage dual/multiple lab affiliations for work on Odyssey?

We're really glad you asked us this question! How you submit your jobs determines which lab's fairshare and billing are selected. There are two levels to this question, the first concerning filesystem rights and the second SLURM submissions.

For filesystem rights, your primary group ID should be set to your primary lab group; request a secondary group membership in Active Directory from us. If you wish to switch to the other group for work (for example, when creating files in smith_lab shared storage, you would want to make sure your group is set to smith_lab, not jones_lab), use the newgrp 2NDGROUPNAME command.

In SLURM, ensure that your primary group membership is set for the appropriate lab, and request a secondary group affiliation in SLURM from us. When submitting SLURM jobs, all resource usage will be charged to your primary SLURM group. If you wish to submit jobs for the other group, use the --account=2NDGROUPNAME option on the sbatch or srun command.
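For example, a minimal sketch using the hypothetical groups above (smith_lab as the secondary group; myjob.sh is a placeholder for your own batch script):

# Switch your active group before creating files in smith_lab storage
newgrp smith_lab

# Charge a job's usage to the secondary SLURM account
sbatch --account=smith_lab myjob.sh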

How do I submit a batch job to the Odyssey queue SLURM?

Step 1: Login to Odyssey through your Terminal window. Please see the Access and Login page for login instructions.

Step 2: Run a batch job by typing: sbatch RUNSCRIPT. Replace RUNSCRIPT with the batch script (a text file) you will use to run your code.

The batch script should contain #SBATCH comments that tell SLURM how to run the job.

#!/bin/bash
#
#SBATCH -n 1 # Number of cores
#SBATCH -N 1 # Number of nodes for the cores
#SBATCH -t 0-00:05 # Runtime in D-HH:MM format
#SBATCH -p serial_requeue # Partition to submit to
#SBATCH --mem=100 # Memory pool for all CPUs
#SBATCH -o hostname.out # File to which standard out will be written
#SBATCH -e hostname.err # File to which standard err will be written
#SBATCH --mail-type=END # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=ajk@123.com #Email to which notifications will be sent
 
hostname

See the batch submission section of the Running Jobs page for detailed instructions and sample batch submission scripts.

How do I submit an interactive job to the Odyssey queue SLURM?

Step 1: Login to Odyssey through your Terminal window. Please see here for login instructions.

Step 2: Run an interactive job by typing: srun -p test --pty MYPROGRAM

This will open up an interactive run for you to use.  If you want a bash prompt, type: srun --mem 500 -p test --pty bash

If you need X11 forwarding type: srun --mem 500 -p test --pty --x11=first MYPROGRAM

This will initiate an X11 tunnel to the first node on your list. --x11 has additional options of batch, first, last, and all.

See also the interactive jobs section of the Running Jobs page.

How do I view or monitor a submitted job?

Step 1: Login to Odyssey through your Terminal window. Please see the Access and Login page for login instructions.

Step 2: From the command line type one of three options: smap, squeue, or showq-slurm

If you want more details about your job, from the command line type: sacct -j JOBID

You can view the runtime and memory usage for a past job by typing: sacct -j JOBID --format=JobID,JobName,MaxRSS,Elapsed, where JobID is the numeric job ID of a past job.

See the Running Jobs page for more details on job monitoring.

My job is PENDING. How can I fix this?

How soon a job is scheduled is due to a combination of factors: the time requested, the resources requested (e.g. RAM, # of cores, etc), the partition, and one's FairShare score.

Quick solution? The Reason column in the squeue output can give you a clue:

  • If there is no reason, the scheduler hasn't attended to your submission yet.
  • Resources means your job is waiting for an appropriate compute node to open.
  • Priority indicates your priority is lower relative to others being scheduled.

There are other Reason codes; see the SLURM squeue documentation for full details.

Your priority is partially based on your FairShare score and determines how quickly your job is scheduled relative to others on the cluster. To see your FairShare score, enter the command sshare -u RCUSERNAME. Your effective score is the value in the last column, and, as a rule of thumb, can be assessed as lower priority ≤ 0.5 ≤ higher priority.

In addition, you can see the status of a given partition and your position relative to other pending jobs in it by entering the command showq-slurm -p PARTITION -o. This will order the pending queue by priority, where jobs listed at the top are next to be scheduled.

For both Resources and Priority squeue Reason output codes, consider shortening the runtime or reducing the requested resources to increase the likelihood that your job will start sooner.

Please see this document for more information and this presentation for a number of troubleshooting steps.

SLURM Errors: Job Submission Limit (per user)

If you attempt to schedule more than 7,600 jobs (all inclusive, both running and pending) you will receive an error like the following:

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

For more info about being a good cluster neighbor, see:

SLURM Errors: Device or resource busy

What's up? My SLURM output file terminates early with the following error:

"slurmstepd: error: _slurm_cgroup_destroy: problem deleting step cgroup
path /cgroup/freezer/slurm/uid_57915/job_25009017/step_batch: Device or
resource busy"

Well, usually this is a problem in which your job is trying to write to a network storage device that is busy -- probably overloaded by someone doing high amounts of I/O (input/output) where they shouldn't, usually on low throughput storage like home directories or lab disk shares.

Please contact RCHelp about this problem, giving us the jobID, the filesystem you are working on, and additional details that may be relevant. We'll use this info to track down the problem (and, perhaps, the problem user(s)).

(If you know who it is, tap them on the shoulder and show them our Odyssey Storage page.)
 
 

SLURM errors: Job cancelled due to preemption

If you've submitted a job to the serial_requeue partition, it is more than likely that your job will be scheduled on a purchased node that is idle. If the node owner submits jobs, SLURM will kill your job and automatically requeue it. This message will appear in the STDOUT or STDERR file you indicated with the -o or -e options. It is simply an informative message from SLURM.

SLURM Errors: Memory limit

Job <jobid> exceeded <mem> memory limit, being killed:

Your job is attempting to use more memory than you've requested for it. Either increase the amount of memory requested with --mem or --mem-per-cpu or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option, which could potentially be reduced.
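As a minimal sketch for a Java job (the class name and sizes are illustrative), keep the JVM heap comfortably below the SLURM allocation:

#SBATCH --mem 4500        # ask SLURM for a bit more than the Java heap
java -Xmx4g MyAnalysis    # cap the heap at 4 GB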

For jobs that require truly large amounts of memory (>256 GB), you may need to use the bigmem SLURM partition. Genome and transcript assembly tools are commonly in this camp.

See this FAQ on determining how much memory your completed batch job used under SLURM.

SLURM Errors: Node Failure

JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE:

This message may arise for a variety of reasons, but it indicates that the host on which your job was running can no longer be contacted by SLURM. Not a good sign. Contact RCHelp to help with this problem.

SLURM errors: Socket timed out. What?

If the SLURM master (the process that listens for SLURM requests) is busy, you might receive the following error:

[bfreeman@rclogin12 ~]$ squeue -u bfreeman
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation

Since SLURM is scheduling 1 job every 2 seconds (let alone doing the calculations to schedule this job on 1 of approximately 1000 compute nodes), it's going to be a bit busy at times. Don't worry. Get up, stretch, pet your cat, grab a cup of coffee, and try again.
 
 

SLURM Errors: Time limit

JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT:
(or you may also see 'Job step aborted' when using srun)

Either you did not specify enough time in your batch submission script, or you didn't specify the amount of time and SLURM assigned the default time of 10 minutes. The -t option sets time in minutes or can also take D-HH:MM form (0-12:30 for 12.5 hours). Submit your job again with a longer time window.


What is Fair-Share?

FairShare is a score that determines what priority you have in the scheduling queue for your jobs. The more jobs you run, the lower your score becomes, temporarily. A number of factors are used to determine this score -- please read this Fairshare document for more information.

To find out what your score is, enter `sshare -u USERNAME` in your Odyssey terminal session. In general, a score of 0.5 or above means you have higher priority for scheduling.

See also: Managing FairShare for Multiple Groups if you belong to more than one lab group

For further information, see the RC fair-share document. See also our Billing FAQ for information on how usage is charged.

CentOS7 (1)

CentOS 7 transition FAQ

With the upgrade of Odyssey2 to Odyssey3 and CentOS 7, there are a few things you may need to know in order to resume your work on the cluster. We've assembled an FAQ of common questions/issues.

        • SSH key or 'DNS spoofing' errors when connecting to login or other nodes
          WARNING: POSSIBLE DNS SPOOFING DETECTED! and/or The RSA host key for login.rc.fas.harvard.edu has changed error messages.

          After an update of nodes, such as the May 2018 upgrade to CentOS 7, the SSH key fingerprint of a node may change. This will, in turn, cause an error when you next try to log into that node as your locally stored key will no longer match. SSH uses this as a way to spot man-in-the-middle attacks or DNS spoofing. However, when a key changes for a valid reason, this does mean you need to clear out the one on your computer in order to be able to re-connect.
          See this FAQ page for further instructions: SSH key error, DNS spoofing message

        • Modules in your .bashrc no longer work or give errors on login
          If you have edited your .bashrc file to include module loads at login, you may find that some CentOS 6 modules will not be found or may not work on CentOS 7. You will need to edit your .bashrc and comment out or remove any such lines going forward. If you can no longer log in because of something in your .bashrc, contact us and we can rename your .bashrc and copy in a default version for you.
          If you'd like to start from scratch, a default .bashrc contains the following:

          # .bashrc

          # Source global definitions
          if [ -f /etc/bashrc ]; then
          . /etc/bashrc
          fi

          # User specific aliases and functions below

        • Can I remove source new-modules.sh from my .bashrc ?
          Yes. It no longer serves a purpose and can be removed. It's not absolutely necessary for you to remove it, but it's recommended if you're comfortable with editing your .bashrc file as it makes for a cleaner login.
        • CentOS 7 modules and searching
          The Portal is the best place to search for modules, and provides far more information than module spider will. By default, the portal will now search for CentOS 7 modules (an option to switch to searching old modules is also available):
          Portal Module Search

          Our Software On Odyssey Intro has also been updated for the CentOS 7 transition and is worth re-viewing.

          As always, you have the ability to install your own software in many cases (especially true for R, Python, Anaconda).
          Please see our Installing Software Yourself page.

          For more information and information on legacy CentOS 6 modules, see the Running Jobs page.

        • Singularity
          New to Odyssey3 is the concept and implementation of containerized environments for jobs. We have standardized on Singularity for this and it is available now on Odyssey3. For more detailed information and instructions, see:

          Please Note: You cannot run Singularity on a login node (much like you should not run jobs or heavy software there). You will need to run it from an interactive session. More details in Singularity on Odyssey

        • If using NX/NoMachine or a CentOS 6 node, don't cross-submit jobs to CentOS 7 partitions/queues
          Due to some environmental differences between CentOS 6 and CentOS 7, if you are using an NX/NoMachine node or your lab has access to any nodes running CentOS 6 [which we are happy to upgrade for you], do not launch jobs for partitions/queues running CentOS 7 directly from that node, or you may experience odd job failures (e.g. failures from basic commands which have moved to a different location on CentOS 7). And vice-versa: if you use ATLAS or another partition which still has CentOS 6 nodes, don't launch jobs from CentOS 7 nodes; use that partition's dedicated nodes.

          After you log in to the CentOS 6 node, or establish an NX session on a CentOS 6 NX server (holynx01, rcnx01, etc.), start an interactive session on the test partition. Once you are in the interactive session, run source centos7-modules.sh; that will enable the CentOS 7 specific modules and rectify any lingering issues with your run environment. Once that is complete, you can submit jobs as normal. Users should not submit jobs to Slurm directly from the NX nodes or CentOS 6 nodes, as Slurm will map the current CentOS 6 environment into the submitted job, which can create problems for the job. Instead, jobs should always be submitted from interactive sessions.

          Alternately, you can, from a terminal in your NoMachine session, ssh -CY login.rc.fas.harvard.edu (with X11 forwarding) to a login node. This will have largely the same effect.

        • What about old modules?
          While it is true that some older modules will work on CentOS 7 without recompiling or rebuilding, we do not offer any guarantees on that front. You are free to try, but we have no plans to try to go back and make changes to old modules or to support their use. Use at your own risk.

        • My alternate shell (csh, tcsh, etc.) doesn't work right
          Having a non-standard default shell will cause problems and does not allow us to set global environmental defaults for everyone. We will no longer change the default shell on any account or support the use of alternate shells as default login shell. Users who do not have bash as their default login shell will need to change back to bash. Users can, of course, still launch an alternate shell once logged in.

Still need help transitioning to CentOS 7?

You can always visit our regular Office Hours which run from roughly late August until the end of May/start of June.
See our Office Hours calendar.

Or send us a help ticket via our Portal (preferred) or by emailing rchelp@rc.fas.harvard.edu.

 
