2023 MGHPCC downtime, and major OS and software changes

NOTE: This is a live draft document. This post will be updated several times before the June downtime.  Please note date and time changes.

The annual multi-day power downtime at MGHPCC is approaching.  This will take place June 5th-8th, with a return to full service at 9AM on June 9th.

During this downtime we will be making major updates to the cluster's operating system and public partition time limits, as well as significant changes to our software offerings and procedures. This will affect all cluster users.

OS Update - Rocky Linux

Currently the cluster nodes (Cannon and FASSE), and indeed most of our infrastructure, are built on CentOS 7, the non-commercial version of Red Hat Enterprise Linux. CentOS is being discontinued by Red Hat and new development ceased at the end of 2021. As a result, we are moving to Rocky Linux 8, which was created by the same people who began CentOS and which much of the HPC community is also getting behind. Given its wide adoption at other HPC sites, we feel confident Rocky Linux is the right choice, as it makes us part of a large community where we can both find common support and contribute back.

Rocky Linux will be a major update with significant changes, in the same way that our move from CentOS 6 to CentOS 7 was.  As such, there will be issues that affect some pipelines, software, or codes. Additionally, we will necessarily be revamping our software offerings and, at the same time, giving end-users more control over their own software with new build tools such as Spack.

Public Partition Time Limits on Cannon

During the summer of 2022 we analyzed job run times on the Cannon cluster with the goal of reassessing our existing partition time limits (which are 7 days). A reduced time limit has many benefits, such as reduced cluster fragmentation, lower wait times, and shorter times to drain nodes for service.  The analysis found that over 95% of jobs complete within 3 days on all the public partitions excluding unrestricted.

We will be changing all the public partitions on Cannon, excluding unrestricted, to a 3 day time limit. To accommodate the 5% of jobs longer than 3 days, we will be adding an intermediate partition for jobs that need to run between 3 and 14 days, with unrestricted handling the rest. Also, for new partitions owned by specific groups we will be instituting a default 3 day time limit. Existing partitions are not impacted, and groups may change this default to suit their needs.
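
As a rough illustration of how the new limits sort jobs, the sketch below maps an expected run time to one of the three tiers described above. The tier labels ("public", "intermediate", "unrestricted") and the function name are placeholders for this example, not actual partition names, and treating a job of exactly 3 or exactly 14 days as fitting the lower tier is an assumption.

```python
from datetime import timedelta

# Limits taken from the announcement above. The tier labels returned below
# are illustrative placeholders, not real partition names.
PUBLIC_LIMIT = timedelta(days=3)
INTERMEDIATE_LIMIT = timedelta(days=14)


def partition_tier(runtime: timedelta) -> str:
    """Return the tier whose time limit can hold a job of the given run time."""
    if runtime <= timedelta(0):
        raise ValueError("run time must be positive")
    # Assumption: a job of exactly 3 days still fits the 3 day limit.
    if runtime <= PUBLIC_LIMIT:
        return "public"
    # Assumption: a job of exactly 14 days still fits the intermediate tier.
    if runtime <= INTERMEDIATE_LIMIT:
        return "intermediate"
    return "unrestricted"


if __name__ == "__main__":
    for days in (1, 3, 5, 14, 20):
        print(days, "days:", partition_tier(timedelta(days=days)))
```

Jobs that currently request close to the full 7 days are the ones most likely to need a different partition after the change.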

Software Changes

FASRC will reduce the number of precompiled software packages that it hosts, distilling this down to the necessities of compilers, any commercial packages, apps needed for VDI, etc. 

We will reduce our dependence on Lmod modules and provide end-users with more and expanded options for building and deploying the software they need. In particular, we will incorporate Spack so that users can build the software and versions they need themselves. Those who came to FASRC from other HPC sites may recognize this as the norm at many sites these days.

More information and documentation will follow, and links will be added here as these changes progress.

Training and Consultation

Starting in March, FASRC will, in addition to our regular New User training and other upcoming classes, be offering training sessions on:

  • From CentOS7 to Rocky 8: How the new operating system will affect FASRC clusters
  • Installing and using software on the FASRC cluster

Additionally, we will be providing opportunities for labs and other groups to meet with us to discuss their workflows in light of the upgrade and changes. (TBD)

FASRC will have a Rocky Linux test environment available very soon where groups can begin to test their software, codes, jobs, etc.