Winter Maintenance Dec. 3rd, 2019 (all day)

Tuesday December 3, 2019 7:00 AM to 12:27 AM

Post-maintenance wrap-up

All tasks have been completed. The scheduler has re-opened.

Changes of note:

  • The global variable $SCRATCH is now available and will always point to the preferred cluster scratch filesystem. Instead of hard-coding the path in your scripts, you can use this variable plus the path to your folder. Example: $SCRATCH/jharvard_lab/jharvard
    If the underlying scratch filesystem changes, the new filesystem will contain the same folders, so this should always point to a valid directory.
  • Home directories have moved to a new server. Data was synced between the two servers and re-synced overnight. Please note that if you wrote any files to your home directory in the early hours of Dec 3rd, they may have been written after the sync ended. If you notice any missing data, please let us know and we can re-run a sync on your directory.
  • SLURM has been upgraded. This should resolve a couple of recent bugs some of you have run into, including srun not honoring --mem-per-cpu.
  • All seasfs0x servers are upgraded and running new Samba protocols
  • The seas_dgx1 partition now has a maximum time limit of 1 day. Please use the gpu partition for jobs requiring more time. The seas_dgx1 partition may not be available by 5pm due to OS rebuilds, but will be available later this evening.
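
As an illustration of the $SCRATCH item above, here is a minimal sketch of how a script might build its working directory from the variable rather than from a hard-coded path. The helper name scratch_path is hypothetical and not something provided on the cluster, and jharvard_lab/jharvard are the placeholder folder names from the example above; substitute your own lab and user folders.

```shell
#!/bin/bash
# Sketch only: build a working directory under $SCRATCH instead of a hard-coded path.
# scratch_path is a hypothetical helper, not a cluster-provided command.

scratch_path() {
    # $1 = lab folder, $2 = user folder (placeholders such as jharvard_lab and jharvard)
    # ${SCRATCH:?...} stops with an error message if SCRATCH is unset or empty,
    # so the script never falls back to a path like /jharvard_lab/jharvard.
    printf '%s/%s/%s\n' "${SCRATCH:?SCRATCH is not set}" "$1" "$2"
}

# Example use inside a job script:
#   workdir=$(scratch_path jharvard_lab jharvard)
#   cd "$workdir" || exit 1
```

Because the path is resolved each time the script runs, the same script should keep working if the preferred scratch filesystem changes.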

Just a reminder that we have a Summer power downtime at MGHPCC every May/June (June this year) and this Winter downtime in December before the holiday break. Our next basic monthly maintenance will take place the morning of January 6th 7am-11am.

This is an all day maintenance event and involves all data centers.

FASRC's Winter maintenance will take place on Tuesday December 3rd, 2019, starting at 7am with an estimated end time of 5pm. This is an all-day event that involves all data centers and nearly every service. Please plan accordingly.

All running jobs will be terminated by the 7am start of maintenance.
Any pending jobs will remain pending until the scheduler resumes at the end of maintenance.
Home directories, login servers, and the help ticket system will be unavailable all day.

List of maintenance tasks:

  • User home directories - We will cut over to a new home directory server. This will be transparent to users once complete, but will require several hours for final sync.
  • Login nodes will be offline due to home directories mounts moving (home directories are necessary for login)
  • ALL customer/lab VMs will be rebooted
  • SLURM and Slurm master server upgrade - Scheduler will be offline until all work is completed.
  • Lustre filesystem upgrades - Affects numerous LFS storage shares during the day
  • Add $SCRATCH global variable - See details at https://www.rc.fas.harvard.edu/policy-scratch/
  • HDR Infiniband updates - Affects numerous network connections in data centers
  • Authentication and Domain Controller updates and changes
  • Upgrade seasfs01, 02, 03 servers and upgrade networking on SeasDGX
  • seas_dgx1 partition max time limit reduced from 3 days to 1 day - Use the gpu partition for longer jobs
  • New module search added to Portal
  • Ticket system database move - Ticket system will be unavailable most of the day

Please follow our Status Page on the day of the downtime for updates and progress.
