# 2024 MGHPCC power downtime May 21-24, 2024

POWER-UP COMPLETE 6:20PM. See the status page.

 

The yearly power downtime at our Holyoke data center, MGHPCC, has been scheduled. This year the power outage will take place on a Wednesday/Thursday the week before the Memorial Day long weekend (Mon. 5/27/24). Ordinarily this happens during a Monday/Tuesday or Tuesday/Wednesday timeframe, but due to scheduling conflicts with other universities that also use MGHPCC, the target days have been moved later in the week. Given the long weekend ahead, this is not ideal, but as we are just one of the schools involved in MGHPCC, we must adapt our schedule.

Because of this, we will need to adjust our own maintenance schedule, as shown below, and will do the bulk of our work (work we cannot do during normal operations) on Tuesday/Wednesday before the power goes down. It is important that we return to service before the long weekend begins. We do not expect any issues with power-up, but please be aware that should any severe unforeseen problems keep us from normal operation by Friday 5pm, we will determine a course of action and communicate it as soon as possible.

Regular FASRC monthly maintenance will not be held for May and June.

[Graphic: downtime schedule for May 21-24, detailed in the list below]

        • Tuesday May 21st - Cluster shutdown begins at 9am ET (Eastern Time)
          • Any running jobs will be cancelled and will need to be resubmitted after the outage. All pending jobs will remain in the queue.
          • The scheduler (Slurm) will be stopped. Compute and GPU nodes will be unavailable
          • Login and OoD nodes will be unavailable
          • Storage in Holyoke will be unavailable. Storage in Boston may be unavailable at times
        • Wednesday May 22nd - All day - Maintenance ahead of the power shutdown later that day
        • Thursday May 23rd - All day - Power is out at MGHPCC
        • Friday May 24th - We begin power-up of systems at 9am.
          • This process takes several hours.
          • Expected return to full service by 5pm
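
As a rough illustration of the schedule above, the sketch below checks whether a job with a given time limit would finish before the 9am ET shutdown on May 21st, since running jobs are cancelled at that point. The function name and the example start times are hypothetical and only for illustration; times are treated as plain Eastern Time values with no time zone handling.

```python
from datetime import datetime, timedelta

# Shutdown start taken from the schedule above: Tuesday May 21st, 9am ET.
# Assumption: all times are naive datetimes already expressed in Eastern Time.
SHUTDOWN_START = datetime(2024, 5, 21, 9, 0)


def finishes_before_shutdown(start: datetime, time_limit: timedelta) -> bool:
    """Return True if a job starting at `start` with the given time limit
    would end at or before the shutdown start (hypothetical helper)."""
    return start + time_limit <= SHUTDOWN_START


if __name__ == "__main__":
    start = datetime(2024, 5, 20, 12, 0)  # example: Monday May 20th, noon ET
    # A 12-hour limit ends at midnight, before the shutdown.
    print(finishes_before_shutdown(start, timedelta(hours=12)))
    # A 24-hour limit would run past 9am on May 21st and be cancelled.
    print(finishes_before_shutdown(start, timedelta(days=1)))
```

A job that fails this check would be cancelled at shutdown and would need to be resubmitted after the outage.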

Updates will be posted on our status page: https://status.rc.fas.harvard.edu/
Note that you can subscribe to receive updates as they happen. On the status page, click Get Updates.

MAJOR TASK OVERVIEW

  • OS upgrade to Rocky 8.9 - Point upgrade, no code rebuilds will be required. Switch from system OFED to Mellanox OFED on nodes for improved performance
  • InfiniBand (network) upgrades
  • BIOS updates (various)
  • Storage firmware updates
  • Network maintenance
  • Decommission old nodes (targets contacted)
  • Additional minor one-off updates and maintenance (cable swap, reboots, etc.)

Notices sent: 4/19, 5/6, 5/13 "May 21-24 Annual MGHPCC/Holyoke data center power downtime"