Monthly Maintenance, July 10th 2017 7am-11am

Greetings from FAS Research Computing. We hope your summer is going well so far. 

Our next monthly maintenance is Monday, July 10th from 7am to 11am.  

Along with the usual login and NX node reboots, we are making a couple of other important improvements to the Odyssey environment:

  • Regal Quotas - Implementing a 50 TB limit per group to help avoid full scratch issues
  • Slurm Upgrade to 17.02.4 - IMPORTANT: Please note that job IDs will roll over (starting in the 24 million range)


REGAL: We will be instituting a 50 TB quota for labs on Regal.

As you are aware, our Regal scratch file system has been running near capacity for several months.  Well over 300 labs have space on Regal, which can fill up even a petabyte-scale system rather quickly.  To make sure that Regal is usable for all labs and users, we will be instituting a 50 TB quota for all labs on Regal.  We have opted to do this rather than move to a 60-day retention period, although if Regal continues to be full we will be forced to do so in order to preserve usability.

In addition, as a reminder, Regal is temporary scratch space.  The data on Regal is not backed up and has no disaster recovery copy.  It is assumed to be transient in nature.  Any essential data should be copied and stored on a persistent filesystem.  Additionally, please be aware that using tactics to circumvent retention is a violation of policy and will result in administrative action (https://www.rc.fas.harvard.edu/policy-scratch/).  While retention is typically run monthly, we will run retention more frequently if Regal is near capacity.  We want Regal to continue to be highly available for jobs and other high input/output (I/O) work.

SLURM: We will be upgrading to Slurm 17.02.4 during this maintenance. Job IDs will roll over as a result.

This is a major version upgrade, and thus there are numerous changes and bug fixes that are part of this release.  During the Slurm upgrade all jobs will be paused and all partitions will be set to down.  Full release notes are shown below.  Most of these changes will not impact user interaction with Slurm, but rather introduce new features.

The major user-facing change is that the maximum JobID is being reduced from 4 billion to 67 million to enable cluster federation (multiple Slurm masters).  As we are well beyond that number, JobIDs will roll over after this update and new jobs will start over (update: actual rollover reset to the 24000000 range).  Jobs that are running or pending with JobIDs higher than 67 million will continue to run with their current JobID and will show up under that ID in the Slurm database.  Given our current rate of about 2 million jobs a month, and assuming rollovers to zero, JobIDs will roll over approximately every 33 months.
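
For anyone who wants to check the rollover estimate, here is a rough Python sketch of the arithmetic.  It uses only the rounded figures quoted above (a 67 million maximum JobID and about 2 million jobs a month); the helper name is purely illustrative and is not part of Slurm.

```python
# Rounded figures from this announcement; both are approximations.
MAX_JOB_ID = 67_000_000       # new maximum JobID
JOBS_PER_MONTH = 2_000_000    # current job submission rate


def months_until_rollover(start_id: int = 0) -> float:
    """Months until JobIDs wrap, counting up from start_id (illustrative helper)."""
    return (MAX_JOB_ID - start_id) / JOBS_PER_MONTH


# A full cycle, assuming JobIDs restart at zero.
print(months_until_rollover())
# The first cycle after this upgrade, assuming JobIDs start at 24,000,000.
print(months_until_rollover(24_000_000))
```

Under these assumptions a full cycle from zero works out to about 33 months, while the first cycle, starting in the 24 million range, would be shorter.  The actual interval will vary with the real job rate.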

We hope you have an excellent 4th of July and a great summer!
FAS Research Computing


