It is time for our annual operating system (OS) upgrade for our two clusters, Cannon and FASSE. Unlike previous years, we have separated the upgrade from our annual power outage. This makes it easier for FASRC staff to deal with any unforeseen issues coming out of the power outage and simplifies the power-on process. In addition, we wanted to experiment with a new method of doing the upgrade live. This rolling upgrade will allow us to keep the cluster up and operational while we move the compute nodes to a newer operating system.
We do operating system upgrades on an annual basis for several reasons. The first is to pick up security patches: new threats come along all the time, and we need to stay on top of them to protect you and your research. The second is bug fixes: software is complicated, and new bugs are found all the time, which can lead to system instability. The third is performance: over time, new methods of doing things lead to incremental improvements in efficiency and speed. The fourth is new features: new ideas and ways of doing things allow more cutting-edge research to be done.
This year's upgrade will be a point upgrade from Rocky 8.9 (current OS) to Rocky 8.10 (new OS). By and large, point upgrades are minor upgrades that fix various bugs but do not make substantial changes to the OS organization or structure. Your codes should still work as expected after the upgrade, with no recompilation necessary.
That said, a few changes of note are worth highlighting:
kernel: We will be upgrading to kernel 4.18.0-553.44.1.el8_10. Several people have asked us to upgrade to kernel 5+ to enable their workflows. Unfortunately, we cannot move to that kernel without a major OS upgrade. Kernel 5 will become available when we upgrade to Rocky 9.
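Once a node has been upgraded, a quick sanity check of the running kernel is possible from any shell. This is a minimal sketch; the expected output reflects the version quoted above:

```bash
# Print the running kernel version; after the upgrade this should report
# 4.18.0-553.44.1.el8_10 (plus an architecture suffix such as .x86_64).
uname -r
```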
DOCA: Nvidia is consolidating its networking offerings after its purchase of Mellanox. What was formerly the Mellanox OpenFabrics Enterprise Distribution (OFED), which is used to run the InfiniBand fabric, has been merged into Nvidia's Data Center-on-a-Chip Architecture (DOCA). Due to this, we will be switching from OFED to DOCA for InfiniBand support.
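The switch should be transparent to jobs, but if you want to confirm that a node's InfiniBand interface is healthy afterward, `ibstat` from the standard infiniband-diags tools is one way to check. This is a minimal sketch, not an FASRC-specific procedure:

```bash
# List InfiniBand ports and their state; a healthy fabric port reports
# "State: Active" and "Physical state: LinkUp".
ibstat
```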
UCX: As part of the DOCA upgrade we will be getting a new version of UCX (Unified Communication X), which is the interface between your code and the InfiniBand stack. The new version will be 1.18.0.
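If your MPI stack links against UCX and you want to see which library version it picks up, the `ucx_info` utility that ships with UCX reports it. A minimal sketch:

```bash
# Print the UCX library version; after the upgrade this should report 1.18.0.
ucx_info -v
```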
CUDA: The version of CUDA (Compute Unified Device Architecture) supported by our current GPU drivers is getting long in the tooth. This upgrade will bring us to the latest Nvidia driver, which supports CUDA 12.9. We upgrade our GPU drivers annually, as driver upgrades are involved and require a system reboot.
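After the upgrade, you can confirm the driver and the highest CUDA version it supports from any GPU node. A minimal sketch:

```bash
# The header of nvidia-smi reports both the driver version and the highest
# CUDA version that driver supports (expected to show CUDA Version: 12.9).
nvidia-smi
```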
The upgrade process itself will be phased over four days, from July 7 to July 11. Each phase will start at 9am and run for 24 hours. No jobs will be canceled, as we have placed maintenance reservations to block off the nodes for servicing. If we finish our work early we will return the nodes to service, but users should plan for the outage to last the full duration. You will want to check the status page and use Slurm commands such as lsload and sinfo to see the current state.
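For example, `sinfo` can show which nodes in a given partition are available during a phase. This is a minimal sketch; the partition name is only a placeholder:

```bash
# Show node states for one partition (replace "gpu" with your partition);
# nodes held for the upgrade will show states such as drain, maint, or resv.
sinfo -p gpu

# List any down or drained nodes along with the reason recorded by the admins.
sinfo -R
```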
Below is a table outlining each phase:
| | Phase 1: July 7th 9am - July 8th 9am | Phase 2: July 8th 9am - July 9th 9am | Phase 3: July 9th 9am - July 10th 9am | Phase 4: July 10th 9am - July 11th 9am |
| --- | --- | --- | --- | --- |
| Impacted Nodes | Login Nodes (Cannon and FASSE), All of FASSE Compute | Cannon Compute Nodes: holy2c09, holy2c11, holy7c02, holy7c04, holy7c06, holy7c08, holy7c09, holy8a28, holy8a29, holy8a30, holy8a31, holy8a32 | Cannon Compute Nodes: holy7c10, holy7c12, holy7c16, holy7c18, holy7c20, holy7c22, holy7c24, holy8a14, holy8a24, holy8a25, holy8a26, holy8a27, holygpu2c07, holygpu2c09, holygpu2c11, holygpu7c09, holygpu7c13, holygpu7c26, holygpu8a11, holygpu8a12, holygpu8a13, holygpu8a15 | Cannon Compute Nodes: holygpu8a16, holygpu8a17, holygpu8a18, holygpu8a19, holygpu8a22, holygpu8a24, holygpu8a25, holygpu8a26, holygpu8a27, holygpu8a29, holygpu8a30, holygpu8a31, holyolveczkygpu01 |
We have done our best to spread out the servicing such that none of the main production partitions are completely closed on Cannon. For FASSE, the cluster is small enough that it made sense to do it all at once during our normal maintenance window. That said, specific PI partitions may be closed off by this work. To find out if your partition is impacted, run scontrol show partition <PARTITIONNAME> and look at the Nodes field.
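As a concrete sketch (the partition name is only a placeholder), you can compare the Nodes entry against the node lists in the table above:

```bash
# Show the partition definition; the Nodes= field lists its members in
# compressed form, e.g. Nodes=holy7c[02-09] (output is illustrative only).
scontrol show partition <PARTITIONNAME>
```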
One final note of warning. Since this is a rolling upgrade, during the work the cluster will be in a split state, with some nodes on the old OS and some on the new OS. We have done testing on our end and have not seen any issues with this. That said, we cannot test all the possible codes that are used on the cluster. If your code is sensitive to OS or library versions, it may crash or produce errant results. You should always validate your code's results, but be especially careful during this window of work. If your code crashes during this window, we recommend trying again after the upgrade work completes and the cluster is back on a single, consistent OS version. If it still has problems after the upgrade, please contact us at rchelp@rc.fas.harvard.edu so that we can assist you.
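If you suspect a job landed on a mix of old and new nodes, one quick check is to print the OS release on each node the job sees. This is a minimal sketch using standard files present on Rocky Linux:

```bash
# Print which Rocky release this node is running; during the rolling upgrade
# you may see 8.9 on some nodes and 8.10 on others.
cat /etc/rocky-release

# Inside a multi-node Slurm job, run the same check on every allocated node.
srun --ntasks-per-node=1 bash -c 'echo "$(hostname): $(cat /etc/rocky-release)"'
```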