# 2025 Compute OS Upgrade

It is time for our annual operating system (OS) upgrade for our two clusters, Cannon and FASSE. Unlike previous years, we have separated the upgrade from our annual power outage. This was to make it easier for FASRC staff to deal with any unforeseen issues coming out of the power outage and to simplify the power-on process. In addition, we wanted to experiment with a new method of doing the upgrade live. This rolling upgrade will allow us to keep the cluster up and operational while we move the compute nodes to a newer operating system.

We do operating system upgrades on an annual basis for several reasons. The first is to pick up security patches: new threats come along all the time, and we need to stay on top of them to protect you and your research. The second is bug fixes: software is complicated, and new bugs are found all the time which can lead to system instability. The third is performance: over time, new methods of doing things can lead to incremental improvements in efficiency and speed. The fourth is new features: new ideas and ways of doing things allow for more cutting-edge research to be done.

This year's upgrade will be a point upgrade from Rocky 8.9 (current OS) to Rocky 8.10 (new OS). By and large, point upgrades are minor upgrades that fix various bugs but do not substantially change the organization or structure of the OS. Your codes should still work as expected after the upgrade, with no recompilation necessary.

That said, a few changes of note are worth highlighting:

kernel: We will be upgrading to kernel 4.18.0-553.44.1.el8_10. Several people have asked us to upgrade to kernel 5+ to enable their workflows. Unfortunately, we cannot move to that kernel without a major OS upgrade; kernel 5 will become available when we upgrade to Rocky 9.
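
If your workflow depends on a specific kernel, you can check what a given node is running yourself. A minimal check from a shell or job script looks like the following (the version in the comment is simply the target listed above):

```bash
# Print the kernel release of the node you are currently on
uname -r
# Upgraded nodes should report 4.18.0-553.44.1.el8_10
```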

DOCA: Nvidia is consolidating its networking offerings after its purchase of Mellanox. What was formerly the Mellanox OpenFabrics Enterprise Distribution (OFED), which is used to run the Infiniband fabric, has been merged into Nvidia's Datacenter On a Chip Architecture (DOCA). As a result, we will be switching from OFED to DOCA for Infiniband support.
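
The switch from OFED to DOCA should be transparent to user codes, but if you want to confirm that a node's Infiniband interfaces look healthy, the standard InfiniBand diagnostic tools still apply. A quick sketch, assuming ibstat and ibv_devinfo are installed as in typical OFED/DOCA setups:

```bash
# Show the state and rate of each InfiniBand port on this node
ibstat

# Lower-level view of the adapter and its firmware via the verbs library
ibv_devinfo
```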

UCX: As part of the DOCA upgrade, we will also get a new version of UCX (Unified Communication X), which is the interface between your code and the Infiniband stack. The new version will be 1.18.0.
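
If you want to confirm which UCX your environment is picking up, the ucx_info utility that ships with UCX reports its version (what you see will depend on which modules you have loaded):

```bash
# Print the UCX version and build configuration in the current environment
ucx_info -v
# After the upgrade, the system UCX should report 1.18.0
```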

CUDA: The version of CUDA (Compute Unified Device Architecture) supported by our current GPU drivers is getting pretty long in the tooth. This upgrade will bring us to the latest Nvidia driver, which supports CUDA 12.9. We upgrade our GPU drivers annually, as driver upgrades are involved and require a system reboot.
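
On a GPU node, nvidia-smi shows both the installed driver version and the highest CUDA version that driver supports, which is an easy way to confirm you are on the upgraded driver:

```bash
# The nvidia-smi header lists the driver version and its supported CUDA
# version ("CUDA Version: 12.9" on upgraded nodes)
nvidia-smi

# Or query just the driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```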

The upgrade process itself will be phased over four days, from July 7 to July 11. Each phase will start at 9am and run for 24 hours. No jobs will be canceled, as we have placed maintenance reservations to block off the nodes for servicing. If we finish our work early we will return the nodes to service, but users should plan for the outage to last the full duration. You will want to check the status page and use Slurm commands such as lsload and sinfo to see the current state of the nodes.
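
For example, the following commands (with <PARTITIONNAME> as a placeholder for a partition you actually use) will show current node states, the maintenance reservations, and when your pending jobs are expected to start:

```bash
# Node states (idle, alloc, maint, drain, ...) for a partition you submit to
sinfo -p <PARTITIONNAME>

# List reservations, including the maintenance reservations for each phase
scontrol show reservation

# Estimated start times for your pending jobs
squeue -u $USER --start
```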

Below is a table outlining each phase:

| Phase | Window | Impacted Nodes |
|-------|--------|----------------|
| Phase 1 | July 7th 9am - July 8th 9am | Login Nodes (Cannon and FASSE); All of FASSE Compute |
| Phase 2 | July 8th 9am - July 9th 9am | Cannon Compute Nodes: holy2c09, holy2c11, holy7c02, holy7c04, holy7c06, holy7c08, holy7c09, holy8a28, holy8a29, holy8a30, holy8a31, holy8a32 |
| Phase 3 | July 9th 9am - July 10th 9am | Cannon Compute Nodes: holy7c10, holy7c12, holy7c16, holy7c18, holy7c20, holy7c22, holy7c24, holy8a14, holy8a24, holy8a25, holy8a26, holy8a27, holygpu2c07, holygpu2c09, holygpu2c11, holygpu7c09, holygpu7c13, holygpu7c26, holygpu8a11, holygpu8a12, holygpu8a13, holygpu8a15 |
| Phase 4 | July 10th 9am - July 11th 9am | Cannon Compute Nodes: holygpu8a16, holygpu8a17, holygpu8a18, holygpu8a19, holygpu8a22, holygpu8a24, holygpu8a25, holygpu8a26, holygpu8a27, holygpu8a29, holygpu8a30, holygpu8a31, holyolveczkygpu01 |

We have done our best to spread out the servicing so that none of the main production partitions are completely closed on Cannon. For FASSE, the cluster is small enough that it made sense to do it all at once during our normal maintenance window. That said, specific PI partitions may be closed off by this work. To find out if your partition is impacted, run `scontrol show partition <PARTITIONNAME>` and look at the Nodes field.
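
For example (again using <PARTITIONNAME> as a placeholder), you can list a partition's nodes and expand the compressed hostlist to compare it against the table above:

```bash
# Show the full partition definition; the Nodes= field lists its members
scontrol show partition <PARTITIONNAME>

# Expand a compressed node list from that output into individual hostnames,
# e.g. the Phase 3 nodes holy8a24 through holy8a27
scontrol show hostnames 'holy8a[24-27]'
```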

One final note of warning: since this is a rolling upgrade, the cluster will be in a split state during the work, with some nodes on the old OS and some on the new OS. We have done testing on our end and have not seen any issues with this, but we cannot test all the possible codes that are used on the cluster. If your code is sensitive to versions, it may crash or produce errant results. You should always validate your results, but be especially careful during this window of work. If your code crashes during this window, we recommend trying again after the upgrade work is complete and the cluster is back on a single consistent OS version. If it still has problems after the upgrade, please contact us at rchelp@rc.fas.harvard.edu so that we can assist you.
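
If you want to be able to tell after the fact which OS a job ran on, one lightweight option (a sketch only; the SBATCH directives shown are generic placeholders, keep whatever your job already uses) is to log the node and OS version at the top of your job script:

```bash
#!/bin/bash
#SBATCH -J myjob            # placeholder job name
#SBATCH -t 01:00:00         # placeholder time limit

# Record where the job landed and what it was running, so a crash can later
# be matched to an old-OS (Rocky 8.9) or new-OS (Rocky 8.10) node
echo "Node:   $(hostname)"
echo "OS:     $(cat /etc/rocky-release)"
echo "Kernel: $(uname -r)"

# ... your actual workload follows ...
```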