
MGHPCC data center shutdown 6/11/19 & 6/12/19 – begins 6/10/19 6PM

Tuesday June 11, 2019
Wednesday June 12, 2019
 

NOTE: Shutdown of systems prior to power-off begins Monday 6/10/19 6PM

Each year our primary data center, MGHPCC (Holyoke), performs a full power shutdown for electrical maintenance. This requires us to power down all FASRC systems at MGHPCC starting the evening before. It also gives us a window for maintenance that would otherwise require us to shut off various resources during normal operations. Note that this power event will terminate all running jobs, because power to the entire facility will be out.

Timeline showing Jun 10th 6PM shutdown begins, through June 11th, then back up June 12th by 8PM

SCHEDULE

  • June 10th - Evening Before 6PM: All running jobs will be terminated and we will begin powering down all devices at MGHPCC/Holyoke. 
  • June 11th - Day Of: Power will be out the entire day as MGHPCC performs their work.
  • June 12th - Following Day: We will perform yearly maintenance tasks after power-up begins and expect to be back to normal operations by approximately 8PM.
  • No Office Hours 6/11 (HCSPH) or 6/12 (Main Campus)

WHAT IS AFFECTED

  • Resources in Holyoke will be affected for the duration of the event. This includes the compute cluster, scheduler, scratchlfs, storage, and other devices housed at MGHPCC/Holyoke.
  • Resources in Boston and Cambridge, including storage, will also be affected during yearly maintenance work. Please plan accordingly as all resources will be affected at some point during the event.
  • Software modules: See below. (If you run jobs, please read!)
  • NO office hours on Wednesday 6/12
  • The help ticket system will be updated 6/11 and will be down periodically. Please see https://status.rc.fas.harvard.edu on 6/11.

SOFTWARE MODULES - !! IF YOU SUBMIT JOBS, PLEASE READ !!

After June 12th, EasyBuild will be added to all user environments. This requires your attention as your job scripts may fail if module calls do not use the full name.

For best interoperability of EasyBuild-based modules with existing software modules, please use complete module names and versions so that the correct software modules are loaded in your user environment.

Example: module load intel/17.0.4-fasrc01 
If you are currently using module load intel (with no version), it will load intel from the EasyBuild space and break your workflow.
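If you have many job scripts, a quick scan can help locate module calls that do not use the full name. The following is only a minimal sketch, not an FASRC tool: the function name unversioned_modules is hypothetical, and it assumes that a complete module name always contains a "/" separating name and version, as in the example above.

```python
import re

# Matches "module load <names...>" at the start of a line.
# Hypothetical helper for illustration; not part of any FASRC tooling.
MODULE_LOAD = re.compile(r"^\s*module\s+load\s+(.+)$")


def unversioned_modules(script_text):
    """Return (line_number, module_name) pairs for module loads lacking a version."""
    findings = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # drop shell comments (and #SBATCH lines)
        match = MODULE_LOAD.match(code)
        if not match:
            continue
        for name in match.group(1).split():
            if name.startswith("-"):
                continue  # skip command-line options
            if "/" not in name:  # assumption: full names look like name/version
                findings.append((lineno, name))
    return findings


example = """#!/bin/bash
#SBATCH -n 1
module load intel
module load intel/17.0.4-fasrc01
"""
print(unversioned_modules(example))  # [(3, 'intel')]
```

Any line it reports is a candidate for replacing the short name with the complete name and version.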
 

COMPUTE OS UPDATE

During this event, once basic power is available to us, we will also be upgrading all compute nodes to the latest CentOS. This is a minor version update. No impact is expected after the upgrade.

OTHER TASKS

  • LNET router rebuild and Lustre upgrade - Lustre (LFS) filesystems affected across the board
  • InfiniBand (fiber networking) updates
  • OS update on all compute nodes
  • Add EasyBuild modules to default module path
  • CUDA drivers upgrade, DGX1 firmware updates
  • Physical move of several servers - Transparent to users once complete
  • Firewall upgrade - Network affected, transparent to users once complete
  • Re-cabling of various storage systems

 

Reminder: All jobs still running on the evening of 6/10/19 will be terminated. Because power will be out at MGHPCC, jobs cannot be paused; they must be stopped before we begin power-down.
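To judge whether a job can complete before power-down, you can compare its start time plus its time limit against the 6PM cutoff on 6/10/19. The sketch below is illustrative only: the function names are hypothetical, times are assumed to be local, and it does not account for time a job spends waiting in the queue.

```python
from datetime import datetime, timedelta

# Shutdown start from the schedule above: Monday 6/10/19 at 6PM (local time assumed).
SHUTDOWN_START = datetime(2019, 6, 10, 18, 0)


def finishes_before_shutdown(start, walltime, shutdown=SHUTDOWN_START):
    """True if a job starting at `start` with time limit `walltime` ends by shutdown."""
    return start + walltime <= shutdown


def max_walltime(start, shutdown=SHUTDOWN_START):
    """Longest time limit that still ends before shutdown (zero if already past)."""
    return max(shutdown - start, timedelta(0))


start = datetime(2019, 6, 10, 9, 0)  # example: job starts 9AM on 6/10
print(finishes_before_shutdown(start, timedelta(hours=12)))  # False
print(max_walltime(start))  # 9:00:00
```

A job whose time limit exceeds the value from max_walltime would still be running at 6PM and would be terminated.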

We will notify the community via our email lists when we are back to normal operations. You can also check back here or on our Status Page.
