System Updates Archive
-
Grace Maintenance Update
Thursday, February 6, 2020 - 5:00pm
We have had to extend the Grace maintenance period due to issues with the upgraded network configuration. YCRC staff have been working continuously to investigate and resolve these issues, and we are planning to make the cluster available again by midday tomorrow, Friday, February 7. The Loomis storage will remain unavailable on Farnam until that time.
-
Grace Maintenance Extended
Wednesday, February 5, 2020 - 5:00pm
Due to issues with the upgraded network configuration and ongoing discussions with vendors, work on the Grace cluster is continuing and we need to extend the maintenance period. A further email notification will be sent when the maintenance has been completed and the cluster is available. The Loomis storage will remain unavailable on Farnam.
-
Scheduled Maintenance on Grace
Dear Grace and Farnam Users,
As a reminder, scheduled maintenance will be performed on Grace beginning Monday, February 3, 2020, at 8:00 am. Maintenance is expected to be completed by the end of day, Wednesday, February 5, 2020.
During this time, logins will be disabled and connections via Globus will be unavailable. The Loomis storage will not be available on the Farnam cluster. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed, and the clusters are available.
As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on February 3, 2020). If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail”. (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.) Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.
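For example, to check why a job is pending and, if it will finish sooner than the limit you originally requested, to resubmit it with a shorter wallclock limit, you could run something like the following (the script name batch.sh is only a placeholder):
squeue -u $USER                   # jobs held for the maintenance show ReqNodeNotAvail in the reason column
sbatch --time=06:00:00 batch.sh   # request only the time the job actually needs, e.g. 6 hours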
During this maintenance the directory hierarchy will be “flattened”. This is part of a larger campaign to standardize the clusters and make it easier for everyone to know where data are. The current paths are of the form
/gpfs/loomis/[project or scratch60 or home.grace]/[metagroup]/[group]/[netid]
and, after the maintenance, will be
/gpfs/loomis/home.grace/[netid]
/gpfs/loomis/[project or scratch60]/[group]/[netid]
Symlinks (shortcuts) will be left in place to make the old paths work, but it is recommended that you change any paths in your scripts to the new form as soon as possible after the maintenance (a sketch of how to locate such paths appears at the end of this notice). During the next maintenance window in August, the symlinks will be removed.
Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.
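If you are unsure which of your scripts still reference the old metagroup-style paths, a simple search along the following lines can help you find them so you can update them by hand (the directory ~/my_scripts is only an example):
grep -rn '/gpfs/loomis/' ~/my_scripts   # list lines referencing Loomis paths; update any that still include the [metagroup] component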
-
Scheduled Maintenance on Milgram
Milgram Users,
The Milgram cluster will be unavailable due to scheduled maintenance until the end of day, Thursday, December 12th, 2019. A communication will be sent to users when the cluster is available.
Please visit the status page on research.computing.yale.edu for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.
-
Data Center Maintenance - All Clusters
Dear Farnam, Grace, and Ruddle Users,
All HPC clusters (including storage) will be unavailable starting at 4:00pm, Friday, December 6, 2019. We expect that Farnam, Grace, and Ruddle will be returned to service by the end of the day on Tuesday, December 10, 2019.
During this time, logins to the clusters will be disabled and connections via Globus will be unavailable. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed, and each cluster is available.
If you also have an account on Milgram, you will receive a second communication as the schedule for that cluster is different. Please visit the status page on research.computing.yale.edu for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.
-
All Clusters Unavailable
Monday, November 25, 2019 - 9:00am to 11:30am
All clusters are unavailable due to an ongoing networking issue on the Science Network. We are working with ITS to resolve the problem as quickly as possible. We apologize for the inconvenience.
-
Scheduled Maintenance on Ruddle
Dear Ruddle Users,
As a reminder, scheduled maintenance will be performed on Ruddle beginning Tuesday, November 5, 2019, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, November 7, 2019. Please note that this maintenance begins and ends one day later than previously stated on the YCRC website.
During this time, logins will be disabled, running jobs will be terminated, and connections via Globus will be unavailable. The GPFS storage, including all YCGA sequencing data, will not be available. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed, and the cluster is available.
Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.
Sincerely,
Paul Gluhosky
-
Scheduled Maintenance on Grace
Dear Grace, Omega and Farnam Users,
Out-of-cycle scheduled maintenance will be performed on Grace and Omega beginning Monday, November 4, 2019, at 8:00 am. Maintenance is expected to be completed by the end of the day Monday.
This maintenance is required to prepare the InfiniBand network for the upcoming deployment of additional common and PI-purchased nodes. As such, there are no changes that impact users and no new functionality is being introduced. We plan, however, to bring online 174 new common nodes and 76 PI-purchased nodes soon after the maintenance window, once necessary validation testing has been completed.
During this time, logins will be disabled and connections via Globus will be unavailable. The Loomis storage will not be available on the Farnam cluster, but the cluster itself will remain available. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed, and the clusters are available.
As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on November 4, 2019). If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail”. (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.) Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.
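If a job is already queued with a time limit that crosses into the maintenance window, one option is to lower its time limit so the scheduler can fit it in before 8:00 am. This is only a sketch: the job ID below is a placeholder, and Slurm generally lets users lower, but not raise, a job’s limit:
scontrol update JobId=12345 TimeLimit=04:00:00   # shorten the pending job's wallclock limit to 4 hours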
Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.
-
Farnam Maintenance Extended
UPDATE: Farnam maintenance will be extended until Friday morning, September 13, 2019.
Dear Farnam Users,
We will be performing preventative maintenance to ensure stable operation of the cluster. During this time, logins will be disabled and Farnam storage will be unavailable on all clusters. An email notification will be sent when the maintenance has been completed and the cluster is available. We have engaged the storage vendor to investigate performance issues during the maintenance. Depending on what they discover, it is possible that the return to service will be delayed. If that happens, we will inform you.
Please note that any job whose requested time limit implies it will still be running when the maintenance begins at 8:00 am on 9/9 will wait with a reason that starts with “ReqNodeNotAvail”. These jobs will be eligible to run after maintenance completes if left in the queue. You can run the command htnm (short for hours_to_next_maintenance) to get the number of hours until the next maintenance period, which can aid in submitting jobs that will run before maintenance begins.
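As a rough sketch (assuming htnm simply prints the number of hours, which may differ on your system), you could compare its output against the wallclock time you plan to request; myjob.sh below is a placeholder:
htnm                              # e.g. prints 36, meaning roughly 36 hours until the maintenance begins
sbatch --time=24:00:00 myjob.sh   # safe to submit, since a 24-hour limit fits before the maintenance window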
Summary of major maintenance changes
- Decommission compute nodes with Sandy Bridge (Intel Xeon E5-2670) and Bulldozer (AMD Opteron 6276) CPUs.
- Bring online Cascade Lake (Intel Xeon Gold 6240) compute nodes. Please see the compute nodes table on the Farnam page for details.
- Move Project and Scratch directories to make them consistent across clusters. We will create symlinks to maintain compatibility with old paths.
Example:
/gpfs/ysm/project/netid and /gpfs/ysm/scratch60/netid
will soon be
/gpfs/ysm/project/group/netid and /gpfs/ysm/scratch60/group/netid
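After the maintenance, you can confirm that an old path is now a compatibility symlink and see where it points, for example (substitute your own netid):
ls -ld /gpfs/ysm/project/<netid>   # should show a symlink pointing to the new group-based location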
Please visit the status page on research.computing.yale.edu for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.