THIS MAINTENANCE IS COMPLETE AS OF 2/3/2020 11am
Monthly maintenance will occur on Monday, February 3rd from 7am to 11am.
- GENERAL MAINTENANCE
- SCRATCH UPDATE AND NEW SYSTEM
- HOLYLFS USAGE - PLEASE DO NOT RUN JOBS ON HOLYLFS
GENERAL MAINTENANCE
- Login and VDI nodes will be rebooted
- New FASRC-built scratch system comes online (see description below)
- User documentation will move to https://docs.rc.fas.harvard.edu - better predictive search, document rating, and a cleaner look
- Status page will be upgraded - better maintenance and incident reporting, ability to subscribe (coming soon). Note that older history will be lost
- Network DHCP and NPS updates - Should be transparent to end-users
- NCF: ncf_holy queue decommissioned, please use ncf queue (on new hardware)
- Network scratch will be unmounted from Boston nodes (jobs there should use local scratch)
- /n/scratchlfs will be unmounted and shut down - ensure no needed data remains (this is not scratchlfs02)
- HUCE: huce_cascade queue added, huce_amd partitions decommissioned.
SCRATCH UPDATE AND NEW SYSTEM
- The original appliance /n/scratchlfs (and /n/scratchssdlfs) will be shut down during this maintenance
- /n/scratchlfs02 remains online until March 2nd, when it will be decommissioned
- The new FASRC-built scratch will be available for use February 3rd and $SCRATCH will point to /n/holyscratch01
- During the next two weeks, we will take nodes in and out of service to upgrade InfiniBand networking.
We recognize and understand the frustration that the scratch situation has caused for many users. We continue to give this issue the highest priority. Our intention was to move to an appliance-based vendor solution with DDN, as the open-source filesystem development group Whamcloud moved from Intel to DDN. The original DDN appliance, scratchlfs, had severe performance issues which could not be resolved by their engineering team. The vendor then deployed a new system, scratchlfs02, and fixed the performance issue; however, the stability of this storage in our environment did not improve. We believe that this is due to a mismatch between the increased size of the Cannon cluster and the compute/memory capacity of the storage back-end servers.
As such, we have settled on Plan B and built our own scratch, which includes 6 servers and storage controllers. This adds higher overall throughput and capacity. We are now running more real-world 'chaos' testing via fs-drift.py, over multiple days and from roughly 1,000-2,000 threads, and feel confident that the new design will keep individual storage servers from being overwhelmed. This system is named holyscratch01, and should be online February 3rd at the end of maintenance. The $SCRATCH environment variable will be updated to point to /n/holyscratch01.
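Scripts that hard-code an older scratch path can instead read the location from $SCRATCH, so they follow the variable when it changes. The following is a minimal Python sketch of that idea, not an official FASRC tool; the function name `scratch_dir` and the `my_lab`/`my_user` subdirectories are hypothetical placeholders.

```python
import os
from pathlib import Path


def scratch_dir(*parts, env=None):
    """Return a path under the scratch root named by $SCRATCH.

    `parts` are subdirectory names chosen by the caller (hypothetical here).
    Raises RuntimeError if SCRATCH is unset, rather than guessing a path.
    """
    env = os.environ if env is None else env
    root = env.get("SCRATCH")
    if not root:
        raise RuntimeError("SCRATCH is not set in this environment")
    return Path(root).joinpath(*parts)


if __name__ == "__main__":
    # Example only: "my_lab" and "my_user" are placeholder directory names.
    print(scratch_dir("my_lab", "my_user", env={"SCRATCH": "/n/holyscratch01"}))
```

Failing loudly when the variable is missing is a deliberate choice in this sketch: it avoids silently writing job output to an unintended location.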
In addition, there have been a number of HDR InfiniBand issues, which we are also trying to rectify with new firmware that was released this month. To update the firmware on HDR-connected compute nodes, we will take groups of these nodes down at a time, continuing through next week.
Fast scratch for such a large and varied cluster is a challenge in general, and we appreciate your understanding and patience as we work through this issue. We will continue to add new storage that is robust and dependable, and we will work to regain your trust in FASRC.
HOLYLFS USAGE - PLEASE DO NOT RUN JOBS ON HOLYLFS
In recent months we’ve been having stability issues with holylfs. This is because the new Cannon cluster is able to overwhelm holylfs. We ask that users on holylfs not run production jobs on holylfs and only use it for long-term storage. We will replace holylfs with a faster system later this year, but this does not change its use case. Please use a scratch filesystem for production jobs. If you need assistance with re-architecting your workflow, please contact us.
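One common way to keep production I/O off long-term storage is to stage inputs to scratch, run there, and copy only the results back. The sketch below illustrates that pattern in Python under stated assumptions: `run_staged` and the `compute` callable are hypothetical names, and `compute` is assumed to write its outputs into an `out` subdirectory of the working directory. Treat it as a starting point, not a recommended FASRC workflow.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(input_file, scratch_root, results_dir, compute):
    """Copy input to scratch, run compute() there, copy outputs back.

    compute(workdir, staged_input) is a hypothetical callable that writes
    its output files into workdir / "out".
    """
    scratch_root = Path(scratch_root)
    scratch_root.mkdir(parents=True, exist_ok=True)
    # Unique per-job working directory on scratch.
    workdir = Path(tempfile.mkdtemp(prefix="job_", dir=scratch_root))
    try:
        # Stage the input once; all heavy I/O then happens on scratch.
        staged = Path(shutil.copy2(input_file, workdir))
        out = workdir / "out"
        out.mkdir()
        compute(workdir, staged)
        # Copy only the final results back to long-term storage.
        results_dir = Path(results_dir)
        results_dir.mkdir(parents=True, exist_ok=True)
        return [
            Path(shutil.copy2(f, results_dir))
            for f in sorted(out.iterdir())
            if f.is_file()
        ]
    finally:
        # Clean up the scratch working directory whether or not compute() failed.
        shutil.rmtree(workdir, ignore_errors=True)
```

In a real job, `scratch_root` would typically be derived from $SCRATCH and `input_file` would live on the long-term filesystem; both are left as parameters here so the sketch makes no assumptions about site-specific paths.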
Thanks for your patience and understanding.
FAS Research Computing