NOTE: Shutdown of systems prior to power-off begins Monday 6/10/19 6PM
Each year our primary data center, MGHPCC (Holyoke), performs a full power shutdown for electrical maintenance. This requires us to power down all FASRC systems at MGHPCC starting the evening before. It also gives us a window for maintenance that would otherwise require shutting off various resources during normal operations. Note that this power event will terminate all running jobs, as power to the entire facility will be out.
- June 10th - Evening Before 6PM: All running jobs will be terminated and we will begin powering down all devices at MGHPCC/Holyoke.
- June 11th - Day Of: Power will be out the entire day as MGHPCC performs their work.
- June 12th - Following Day: We will perform yearly maintenance tasks after power-up begins and expect to be back to normal operations by approximately 8PM.
- No Office Hours 6/11 (HCSPH) or 6/12 (Main Campus)
WHAT IS AFFECTED
- Resources in Holyoke will be affected for the duration of the event. This includes the compute cluster, scheduler, scratchlfs, storage, and other devices housed at MGHPCC/Holyoke.
- Resources in Boston and Cambridge, including storage, will also be affected during yearly maintenance work. Please plan accordingly as all resources will be affected at some point during the event.
- Software modules: See below. (If you run jobs, please read!)
- NO office hours on Wednesday 6/12
- The help ticket system will be updated on 6/11 and will be down periodically. Please see: https://status.rc.fas.harvard.edu on 6/11
SOFTWARE MODULES - !! IF YOU SUBMIT JOBS, PLEASE READ !!
For best interoperability of EasyBuild based modules with existing software modules, please use complete module names and versions to make sure the correct software modules are loaded in your user environment.
For example, use the complete name and version:
module load intel/17.0.4-fasrc01
If you instead use only the short name:
module load intel
it will load intel from EasyBuild space and break your workflow.
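As a rough aid for auditing job scripts before the downtime, the sketch below lists `module load` arguments that carry no version. The function name `find_unversioned_modules` is hypothetical, and it assumes a fully specified module is one that contains a `/` (as in intel/17.0.4-fasrc01). It only reads the text of a script and never calls `module` itself.

```shell
# Hypothetical helper (not a FASRC tool): print every `module load`
# argument in a job script that has no "/version" part.
# Assumption: a name containing "/" (e.g. intel/17.0.4-fasrc01) is complete.
find_unversioned_modules() {
    # $1: path to the job script to inspect
    # 1. keep only lines that start with `module load`
    # 2. strip trailing shell comments
    # 3. print each argument after `load` that contains no "/"
    grep -E '^[[:space:]]*module[[:space:]]+load[[:space:]]' "$1" |
        sed -E 's/#.*$//' |
        awk '{ for (i = 3; i <= NF; i++) if (index($i, "/") == 0) print $i }'
}
```

For example, running `find_unversioned_modules my_job.sh` (a hypothetical script name) prints `intel` for a script that contains `module load intel`, and prints nothing for a script that only loads intel/17.0.4-fasrc01.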
COMPUTE OS UPDATE
During this event, once basic power is available to us, we will also be upgrading all compute nodes to the latest CentOS. This is a minor version update, and we expect no impact after the upgrade.
- LNET router rebuild and Lustre upgrade - Lustre (LFS) filesystems affected across the board
- InfiniBand (fiber networking) updates
- OS update on all compute nodes
- Add EasyBuild modules to default module path
- CUDA driver upgrade, DGX1 firmware updates
- Physical move of several servers - Transparent to users once complete
- Firewall upgrade - Network affected, transparent to users once complete
- Re-cabling of various storage systems
Reminder: During the downtime, all jobs still running will be terminated on the evening of 6/10/19. As power will be out at MGHPCC, jobs cannot be paused. They must be stopped before we begin power-down.
We will notify the community via our email lists when we are back to normal operations. You can also check back here or on our Status Page.