System Updates Archive
-
ALL CLUSTERS AND STORAGE ARE OPERATIONAL
Tuesday, June 24, 2025 - 12:00pm
Please contact research.computing@yale.edu with questions or to report any issues.
-
Grace, Milgram, Misha & McCleary Currently Unavailable For Scheduled Maintenance.
Monday, June 9, 2025 - 8:00am
Scheduled maintenance on the Grace, Milgram, Misha & McCleary clusters has begun. Maintenance is expected to be complete on 6/12/2025.
YCRC-managed clusters will be down for planned maintenance as follows:
- Grace, Milgram and Misha and their associated storage will be down for maintenance from Monday, June 9, 2025 through Thursday, June 12, 2025.
- McCleary and its associated storage will be down for maintenance from Tuesday, June 10, 2025 through Thursday, June 12, 2025.
-
Bouchet Currently Unavailable For Scheduled Maintenance.
Monday, June 2, 2025 - 12:30
Scheduled maintenance on the Bouchet cluster has begun. Maintenance is expected to be complete on 6/5/2025.
All YCRC-managed clusters will be down for planned maintenance in early June as follows:
- Bouchet and its associated storage will be down for maintenance from Monday, June 2, 2025 through Thursday, June 5, 2025.
- Grace, Milgram and Misha and their associated storage will be down for maintenance from Monday, June 9, 2025 through Thursday, June 12, 2025.
- McCleary and its associated storage will be down for maintenance from Tuesday, June 10, 2025 through Thursday, June 12, 2025.
Please note that the proposed timeframes are conservative; we recognize the disruptive impact and will do our best to streamline the work and shorten the downtime.
This represents the first instance of a new approach to YCRC system maintenance. Until now, each cluster has had two full-downtime maintenance periods per year, each lasting three days. With the new approach, each cluster will still be updated twice a year. However, only one of these two annual maintenance periods will be a full downtime. The other will involve rolling updates to a live cluster. We are working toward a system in which the annual full-downtime maintenance will take place on all YCRC-managed clusters simultaneously. This year, however, Bouchet maintenance will take place first, to coincide with the annual data-center downtime at Massachusetts Green High Performance Computing Center (MGHPCC), and then the remaining clusters, which are located at West Campus, will undergo maintenance the following week. Approximately six months after this full downtime, the clusters will be patched with minor updates on a rolling basis with minimal disruption.
The new approach has several advantages. By consolidating the major cluster updates, YCRC will be able to focus on the preparation for and execution of those major updates once a year instead of nearly once a month, freeing more time for supporting researchers. Performing maintenance on all clusters within a data center simultaneously facilitates maintenance on subsystems that affect multiple clusters. All clusters will be kept on the same major version of the image throughout the year, making for a more consistent and easier-to-support environment. For any given cluster, the number of days per year of planned total downtime will be somewhat reduced. Also, for each cluster, the number of planned total-downtime periods will be reduced from two to one. There will be a second period each year of rolling updates, but these will entail limited disruption.
-
Milgram Scheduled Maintenance Feb 4
Tuesday, February 4, 2025 - 8:00am
Dear Milgram Users,
Maintenance will be performed on the Milgram cluster starting on February 4, 2025, at 8:00am.
Due to the limited updates needed on Milgram at this time, the maintenance will not be a full 3-day downtime but will rather have limited disruptions during one business day. The Milgram cluster and storage will remain online with brief interruptions throughout the maintenance period. Compute jobs will not run from the beginning of maintenance until after the compute nodes have been upgraded and returned to service, and certain services will be unavailable for short periods during the maintenance window.
Maintenance will be performed on the following sets of nodes. Each set will be down briefly and then returned to service.
Login nodes (there are two nodes but only one will be down at a time)
Open OnDemand (Web portal)
XNAT
CIFS server
Globus
Transfer node (milgram.ycrc.yale.edu)
All compute nodes
As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of maintenance (8:00 am on February 4, 2025). If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.” (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.) Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.
An email notification will be sent when the maintenance is completed.
Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have any questions, comments, or concerns, please contact us at hpc@yale.edu.
-
Grace Scheduled Maintenance Dec 3-4
Dear Grace Users,
Scheduled maintenance will be performed on the Grace cluster on December 3-4, 2024, starting at 8:00am.
Due to the limited updates needed on Grace at this time, the upcoming December maintenance will not be a full 3-day downtime but will rather have limited disruptions. The Grace cluster and storage will remain online and available throughout the maintenance period and there will be no disruption to running or pending batch jobs. However, certain services will be unavailable for short periods during the maintenance window. There will be reduced availability of compute nodes at times, so users might experience temporarily increased wait times.
Maintenance will be performed on sets of nodes, in the following order. Each set will be down briefly and then returned to service.
Tuesday, December 3:
Login nodes (there are two nodes but only one will be down at a time)
Globus
Transfer node (transfer-grace.ycrc.yale.edu)
Half of the commons nodes
Wednesday, December 4:
The remaining commons nodes
All PI nodes
Note to groups with PI nodes:
As the maintenance window approaches, the Slurm scheduler will not start any job submitted to a PI partition if the job’s requested wallclock time extends past the start of the downtime for PI nodes (8:00 am on December 4, 2024). If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.” (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.) Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.
The Message of the Day (MOTD) will be updated throughout the maintenance period to report the current status. An email notification will be sent when the maintenance is completed.
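For example, squeue can list just your pending jobs along with the scheduler's reason for holding them. The format string below is one reasonable choice, and the job ID, partition, and job name in the sample output are purely illustrative:
# Show your pending jobs and why they are pending
squeue -u $USER -t PENDING -o "%.10i %.12P %.20j %.12l %.20r"
# Jobs held for the maintenance window report the reason ReqNodeNotAvail, e.g.:
#      JOBID    PARTITION                 NAME   TIME_LIMIT               REASON
#   12345678       pi_lab          my_analysis   3-00:00:00      ReqNodeNotAvail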
Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have any questions, comments, or concerns, please contact us at hpc@yale.edu.
-
McCleary Scheduled Maintenance
Scheduled maintenance will be performed on the McCleary cluster, starting on Tuesday, October 15, 2024, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, October 17, 2024.
During the maintenance, logins to the cluster will be disabled. We ask that you save your work, close interactive applications, and log off the system prior to the start of the maintenance. An email notification will be sent when the maintenance has been completed and the cluster is available.
As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on October 15, 2024). You can run the command “htnm” (short for “hours_to_next_maintenance”) to determine the number of hours until the next maintenance period, which can aid in submitting jobs that will run before maintenance begins. If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.” (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.) Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order. All running jobs will be terminated at the start of the maintenance period. Please plan accordingly.
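For instance, you could check the remaining time and then request a wallclock limit that ends before maintenance begins. Here, htnm is the YCRC wrapper described above and --time is the standard sbatch option; the script name and the 20-hour limit are placeholders:
# Hours until the next maintenance window (YCRC-provided wrapper)
htnm
# Submit with a time limit short enough to finish before 8:00 am on October 15
# (my_job.sh and the 20-hour limit are illustrative)
sbatch --time=20:00:00 my_job.sh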
Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.
-
Milgram Scheduled Maintenance
Dear Milgram Users,
Scheduled maintenance will be performed on the Milgram cluster starting on Tuesday, August 20, 2024, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, August 22, 2024.
During the maintenance, logins to the cluster will be disabled. We ask that you save your work, close interactive applications, and log off the system prior to the start of the maintenance. An email notification will be sent when the maintenance has been completed and the cluster is available.
Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.
-
7/10/2024 11:40am RESOLVED: Grace fully operational after power issue
7/10/24 - 11:40am - An electrical supply issue at the HPC Data Center that started midday on 7/8/2024 brought down Grace nodes with names beginning with r801 through r806. Repairs are now complete and normal operation has been restored. However, any jobs running on those nodes at the start of the outage were terminated. Please check the status of your HPC jobs.
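One way to review your jobs from that period is with sacct, the standard Slurm accounting command. The start date below matches the beginning of the outage, and the field list is simply a reasonable selection:
# List your jobs since the power issue began (7/8/2024) with their final states
sacct -X --starttime=2024-07-08 --format=JobID,JobName,Partition,State,ExitCode,End
# Jobs that were running on the affected nodes will typically show FAILED or NODE_FAIL;
# resubmit those if needed.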
-
Milgram Globus Maintenance 6/10 @ 9am
Milgram Globus services will be unavailable Monday, 6/10/24, beginning at 9am, while an upgrade is performed. The service is expected to be available by the end of the day.
If you have any questions or concerns, please contact hpc@yale.edu.
-
Grace Scheduled Maintenance
Dear Grace Users,
Scheduled maintenance will be performed on the Grace cluster starting on Tuesday, June 4, 2024, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, June 6, 2024.
During this time, logins will be disabled, running jobs will be terminated, and connections via Globus will be unavailable. We ask that you save your work, close interactive applications, and log off the system prior to the start of the maintenance. An email notification will be sent when the maintenance has been completed and the cluster is available.
Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.