Monthly Maintenance, November 5th, 2018 7am-11am

Monday November 5, 2018 7:00 AM to 11:00 AM

Monthly maintenance will occur on November 5th, 2018, from 7am to 11am.

Please be aware that running jobs will be paused during much of the work listed below.

    • Slurm upgrade to 17.11.12
    • Login and NX nodes will be rebooted
    • FAS-owned GPU consolidation: new fas_gpu queue for FAS users
    • LNet routers upgraded to the latest Lustre release (jobs will be paused)
    • rclic1 license server updates: new license checkouts will be affected (including Geneious, Lumerical, MATLAB, etc.)
    • VPN maintenance (tentative)

On the day of the maintenance, see our status page for current status: https://status.rc.fas.harvard.edu

GPU resources owned by the FAS will be consolidated into a new queue called fas_gpu, which will be available to FAS users after the Nov. 5th maintenance. The queue will contain 64 nodes with 2x K80s, 16 nodes with 2x K20s, and 1 node with 8x K20s. The maximum job time for this queue is the standard 7 days.
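
As a rough illustration of the queue's capacity and its 7-day limit, here is a minimal Python sketch. It tallies GPU cards from the node counts above (note that each K80 board contains two GPU devices, so the scheduler may report twice as many K80 GPUs as cards). It also builds an sbatch command line for fas_gpu. The helper names are hypothetical, and requesting GPUs with the generic `--gres=gpu:N` syntax is an assumption; check the cluster documentation for the exact form.

```python
from datetime import timedelta

# Node inventory for the fas_gpu queue as stated in the announcement:
# (node count, GPU cards per node, GPU model)
FAS_GPU_NODES = [
    (64, 2, "K80"),
    (16, 2, "K20"),
    (1, 8, "K20"),
]

# Standard maximum job time for the queue.
MAX_WALLTIME = timedelta(days=7)


def gpu_totals(nodes):
    """Return {model: total GPU cards} summed over all node groups."""
    totals = {}
    for count, per_node, model in nodes:
        totals[model] = totals.get(model, 0) + count * per_node
    return totals


def slurm_time(walltime):
    """Format a timedelta in Slurm's D-HH:MM:SS time format."""
    total = int(walltime.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


def fas_gpu_sbatch_args(script, gpus, walltime):
    """Build a hypothetical sbatch command line targeting fas_gpu.

    Assumes GPUs are requested with the generic --gres=gpu:N syntax;
    the exact GPU request form on this cluster may differ.
    """
    if walltime > MAX_WALLTIME:
        raise ValueError(f"{walltime} exceeds the 7-day limit for fas_gpu")
    return [
        "sbatch",
        "--partition=fas_gpu",
        f"--gres=gpu:{gpus}",
        f"--time={slurm_time(walltime)}",
        script,
    ]


if __name__ == "__main__":
    print(gpu_totals(FAS_GPU_NODES))  # {'K80': 128, 'K20': 40}
    print(fas_gpu_sbatch_args("train.sh", 2, timedelta(days=2, hours=6)))
```

Rejecting over-long requests before calling sbatch simply mirrors the stated 7-day limit. The scheduler itself remains the authority on what it accepts.
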

We will hold an all-day maintenance event on December 17th, 2018 to cover maintenance that cannot be performed during the annual MGHPCC shutdown. This is primarily targeted at our Boston data center, but because many systems are interdependent, it will necessarily have some effect cluster-wide. We will provide specific times and task details in the coming weeks. We expect this to become a yearly event, giving us time to maintain systems that are powered on 24/7 and, unlike MGHPCC systems, do not benefit from a forced annual downtime.

The Regal scratch filesystem has served us well since 2013, processing billions of jobs and numerous petabytes of data. But as recent issues involving object identifier errors and aging hardware have shown, it is time to retire it and implement a new fast scratch system. To that end, we are planning a Regal replacement over the coming months. Drawing on lessons learned from Regal and from other technologies we have since evaluated, we will design, and later build, that replacement. We thank you for your patience and want to assure you that there is light at the end of the tunnel. We look forward to presenting the next step in fast scratch storage to replace this venerable but aging filesystem as soon as possible.

Please note: Regal scratch retention policy runs happen regularly, not just during maintenance. Please plan accordingly.
