Status and Maintenance

Scheduled Maintenance Currently in Progress

Infrastructure Status details as of June 17, 2026, 2 pm:

Grace, McCleary - the clusters and storage are back online and operational as of 2 pm on Wednesday, 6/17.  McCleary nodes with names beginning with r813 are down while the vendor continues to work on cooling issues.

Bouchet, Hopper - the clusters and storage are undergoing maintenance and are unavailable during a mandatory shutdown of the MGHPCC facility for electrical work. Expected return to service: end of day tomorrow, Thursday, 6/18.

Misha, Milgram - the clusters and storage are back online and fully operational as of 4 pm on Tuesday, 6/16

Computing Systems Status

Updated 6/15/2026:  Due to a cooling issue in one of the McCleary racks at West Campus, the nodes with names beginning with r813 (except for r813u29n09 and r813u29n11) are currently down. A faulty liquid Cooling Distribution Unit (CDU) caused equipment in the rack to overheat and shut down. We are working with the CDU vendor to resolve the issue.

This issue impacts the following McCleary scheduler partitions:

  • 26 of the 33 nodes in the day partition
  • 14 of the 16 nodes in the week partition
  • 1 of the 20 nodes in the gpu partition
  • 1 of the 3 nodes in the pi_gerstein_gpu partition
  • 2 of the 10 nodes in the pi_jetz partition
  • 4 of the 4 nodes in the pi_ohern partition
  • 2 of the 2 nodes in the pi_sestan partition
  • 4 of the 4 nodes in the pi_tsang partition
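
As a rough illustration of what remains available, the counts in this list can be tabulated with a short Python sketch. The dictionary and function names are made up for this example, and the numbers are simply copied from the list; they go stale as soon as nodes return to service.

```python
# (nodes down, total nodes) per McCleary partition, copied from the notice.
# These are a snapshot, not live scheduler data.
AFFECTED = {
    "day": (26, 33),
    "week": (14, 16),
    "gpu": (1, 20),
    "pi_gerstein_gpu": (1, 3),
    "pi_jetz": (2, 10),
    "pi_ohern": (4, 4),
    "pi_sestan": (2, 2),
    "pi_tsang": (4, 4),
}


def remaining_nodes(partition: str) -> int:
    """Number of nodes in the partition that are not in the affected rack."""
    down, total = AFFECTED[partition]
    return total - down


def fully_down() -> list[str]:
    """Partitions in which every node is affected."""
    return [name for name, (down, total) in AFFECTED.items() if down == total]


if __name__ == "__main__":
    for name, (down, total) in AFFECTED.items():
        print(f"{name}: {total - down} of {total} nodes unaffected")
```

By this count, the pi_ohern, pi_sestan, and pi_tsang partitions have no unaffected nodes, while day and week keep 7 and 2 nodes respectively.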

Slurm jobs can be submitted as usual; however, any jobs that would start on the affected nodes won’t start until the cooling issue is resolved.
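
If you keep a list of node names (for example, from your own job records), a minimal Python sketch like the following can flag which ones fall in the affected rack. It encodes only the rule stated in this notice: names beginning with r813, except r813u29n09 and r813u29n11. The function name and the sample node names in the usage lines are hypothetical.

```python
# The two r813 nodes that the notice says are NOT part of the outage.
STILL_UP = {"r813u29n09", "r813u29n11"}


def is_affected(node: str) -> bool:
    """True if the node name matches the affected rack per the notice."""
    return node.startswith("r813") and node not in STILL_UP


if __name__ == "__main__":
    # Sample names; only the two exceptions are taken from the notice,
    # the others are invented for illustration.
    for node in ["r813u29n09", "r813u05n01", "r900u01n01"]:
        print(node, "affected" if is_affected(node) else "not affected")
```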

The vendor will be onsite Wednesday, 6/17/2026 to resolve the issue.  Their work is expected to be completed by the end of the day.

We realize this affects your work and apologize for the inconvenience.  If you have any questions, comments, or concerns, please contact us at research.computing@yale.edu.

Resolved past issues:

6/4/2026 13:00: Hopper experienced a system issue that impacted availability. We engaged the storage vendor to resolve the issue. Hopper availability was restored by 15:30.

6/1/2026 8am-3pm:  Reduced capacity on McCleary. To address a cooling issue in one of the McCleary racks at West Campus, we temporarily took down the McCleary nodes with names beginning with r813 (except for r813u29n09 and r813u29n11).  The work took place from 8am to 3pm on Monday, June 1, 2026.

5/11/2026:  There was a power drop at the West Campus Data Center on the morning of 5/11/2026.  This impacted running jobs on McCleary, Grace, Misha and Milgram.  Please check your jobs to see how they were impacted.

2/13/2026: Bouchet scheduler issues have been resolved. All systems are operational. 

Cluster Maintenance

To perform critical updates and minimize downtime, regular maintenance will be performed on each cluster on a rotating schedule. During maintenance, logins will be disabled, jobs will not run, and cluster storage may be unavailable. Communication will be sent to users four weeks and one week before the maintenance period and in case of any changes.

Starting in the Spring of 2025, YCRC updated our approach to system maintenance.  Until then, each cluster had two full-downtime maintenance periods per year, each lasting three days.  With the new approach, each cluster is updated twice a year.  However, only one of these two annual maintenance periods is a full downtime.  The other involves rolling updates to a live cluster.  We are working toward a system in which annual full-downtime maintenance is performed on all YCRC-managed clusters simultaneously.  Approximately six months after this full downtime, the clusters will be patched with minor updates on a rolling basis with minimal disruption. 

The new approach has several advantages.  By consolidating the major cluster updates, YCRC is able to focus on the preparation for and execution of those major updates once a year, instead of nearly once a month, freeing more time for supporting researchers.  Simultaneously performing maintenance on all clusters within a data center enables maintenance of subsystems that affect multiple clusters.  All clusters are kept on the same major version of the image throughout the year, resulting in a more consistent, easier-to-support environment.  For any given cluster, the number of days per year of planned total downtime is reduced.  Also, for each cluster, the number of planned total-downtime periods is reduced from two to one.  There is a second period of rolling updates each year, but these will entail limited disruption.

If you have any questions, comments, or concerns, please contact us at hpc@yale.edu.

Maintenance Schedule

Upcoming:

Please note that the proposed timeframes are estimates; we recognize the disruptive impact and will do our best to streamline and shorten the maintenance activities to minimize it.

All YCRC-managed clusters will be down for planned maintenance in mid-June as follows:

  • Bouchet and Hopper, along with their associated storage, will be down for maintenance from 2:30 pm Monday, June 15, 2026, through the end of the day on Thursday, June 18, 2026.

  • Grace, McCleary, Milgram, and Misha, along with their associated storage, will be down for maintenance from 8:00 am on Tuesday, June 16, 2026, through the end of the day on Wednesday, June 17, 2026.
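
For scripts that need to avoid scheduling work during these windows, the two periods above can be written down as a small Python lookup. This is only a sketch under assumptions: times are treated as local wall-clock time without time zone handling, "end of the day" is taken to mean 23:59:59, and the names WINDOWS and in_maintenance are made up for this example. The windows themselves are estimates and may change.

```python
from datetime import datetime

# Planned windows copied from the schedule above (naive local times).
# "End of the day" is assumed to mean 23:59:59.
WINDOWS = {
    ("Bouchet", "Hopper"): (
        datetime(2026, 6, 15, 14, 30),
        datetime(2026, 6, 18, 23, 59, 59),
    ),
    ("Grace", "McCleary", "Milgram", "Misha"): (
        datetime(2026, 6, 16, 8, 0),
        datetime(2026, 6, 17, 23, 59, 59),
    ),
}


def in_maintenance(cluster: str, when: datetime) -> bool:
    """True if `when` falls inside the planned window for `cluster`."""
    for clusters, (start, end) in WINDOWS.items():
        if cluster in clusters:
            return start <= when <= end
    raise KeyError(f"unknown cluster: {cluster}")


if __name__ == "__main__":
    print(in_maintenance("Hopper", datetime(2026, 6, 15, 15, 0)))
    print(in_maintenance("Grace", datetime(2026, 6, 15, 15, 0)))
```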


Past Scheduled Maintenance:

  • Hopper - March 10, 2026
  • Grace, Bouchet, Milgram, Misha, McCleary - December 17-18, 2025
  • Bouchet and its associated storage - June 2 - June 5, 2025
  • Grace, Milgram, and Misha and their associated storage - June 9 - June 12, 2025
  • McCleary and its associated storage - June 10 - June 12, 2025