Status and Maintenance | Yale Center for Research Computing

Computing Systems Status

7/9/2026: Reduced capacity on McCleary. Due to a cooling issue in one of the McCleary racks at West Campus, McCleary nodes with names beginning with r813 have again shut down. The outage is caused by a faulty Cooling Distribution Unit (CDU), which allowed equipment in the rack to overheat and shut down. We are working with the CDU vendor to resolve the issue as quickly as possible.

This issue impacts the following McCleary scheduler partitions:

26 of the 33 nodes in the day partition
14 of the 16 nodes in the week partition
3 of the 20 nodes in the gpu partition
1 of the 3 nodes in the pi_gerstein_gpu partition
2 of the 10 nodes in the pi_jetz partition
4 of the 4 nodes in the pi_ohern partition
2 of the 2 nodes in the pi_sestan partition
4 of the 4 nodes in the pi_tsang partition
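
For planning purposes, the counts above can be turned into the number of nodes per partition that this outage leaves untouched. The following is a minimal Python sketch, not a YCRC tool: the names AFFECTED, remaining_nodes, and fully_down are illustrative, and the figures are copied from the list above. It does not account for nodes that may be unavailable for other reasons.

```python
# (affected, total) node counts per partition, copied from the list above.
# The dictionary and function names are illustrative assumptions.
AFFECTED = {
    "day": (26, 33),
    "week": (14, 16),
    "gpu": (3, 20),
    "pi_gerstein_gpu": (1, 3),
    "pi_jetz": (2, 10),
    "pi_ohern": (4, 4),
    "pi_sestan": (2, 2),
    "pi_tsang": (4, 4),
}


def remaining_nodes(partition):
    """Return how many nodes in a partition are not affected by this outage."""
    affected, total = AFFECTED[partition]
    return total - affected


def fully_down():
    """Return the partitions in which every node is affected, sorted by name."""
    return sorted(
        name for name, (affected, total) in AFFECTED.items() if affected == total
    )


if __name__ == "__main__":
    for name, (affected, total) in AFFECTED.items():
        print(f"{name}: {total - affected} of {total} nodes unaffected")
    print("Fully affected partitions:", ", ".join(fully_down()))
```

Partitions reported as fully affected cannot start new jobs until the cooling issue is resolved; the others continue to run with reduced capacity.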

Slurm jobs can be submitted as usual; however, any jobs that would start on the affected nodes will not start until the cooling issue is resolved.

We realize this affects your work and apologize for the inconvenience. If you have any questions, comments, or concerns, please contact us at research.computing@yale.edu.

Resolved past issues:

6/15/2026 - 6/18/2026: Scheduled all-cluster maintenance complete.

6/4/2026 13:00: Hopper experienced a system issue that impacted availability. We engaged the storage vendor to resolve the issue. Hopper availability was restored by 15:30.

5/11/2026: There was a power drop at the West Campus Data Center on the morning of 5/11/2026. This impacted running jobs on McCleary, Grace, Misha and Milgram. Please check your jobs to see how they were impacted.

2/13/2026: Bouchet scheduler issues have been resolved. All systems are operational.

Cluster Maintenance

To perform critical updates and minimize downtime, regular maintenance will be performed on each cluster on a rotating schedule. During maintenance, logins will be disabled, jobs will not run, and cluster storage may be unavailable. Communication will be sent to users four weeks and one week before the maintenance period and in case of any changes.

Starting in the Spring of 2025, YCRC updated our approach to system maintenance. Until then, each cluster had two full-downtime maintenance periods per year, each lasting three days. With the new approach, each cluster is updated twice a year. However, only one of these two annual maintenance periods is a full downtime. The other involves rolling updates to a live cluster. We are working toward a system in which annual full-downtime maintenance is performed on all YCRC-managed clusters simultaneously. Approximately six months after this full downtime, the clusters will be patched with minor updates on a rolling basis with minimal disruption.

The new approach has several advantages. By consolidating the major cluster updates, YCRC is able to focus on preparing for and executing those updates once a year instead of nearly once a month, freeing more time for supporting researchers. Simultaneously performing maintenance on all clusters within a data center enables maintenance of subsystems that affect multiple clusters. All clusters are kept on the same major version of the image throughout the year, resulting in a more consistent, easier-to-support environment. For any given cluster, the number of days per year of planned total downtime is reduced, and the number of planned total-downtime periods is reduced from two to one. There is a second period each year of rolling updates, but these will entail limited disruption.

If you have any questions, comments, or concerns, please contact us at hpc@yale.edu.

Maintenance Schedule

Upcoming:

Please note that the proposed timeframes are estimates; we recognize the disruptive impact of maintenance and will do our best to streamline and shorten the maintenance activities to minimize it.

All YCRC-managed clusters will be down for planned maintenance in mid-June as follows:

Bouchet and Hopper, along with their associated storage, will be down for maintenance from 2:30 pm Monday, June 15, 2026, through the end of the day on Thursday, June 18, 2026.
Grace, McCleary, Milgram, and Misha, along with their associated storage, will be down for maintenance from 8:00 am on Tuesday, June 16, 2026, through the end of the day on Wednesday, June 17, 2026.

Past Scheduled Maintenance:

Hopper - March 10, 2026
Grace, Bouchet, Milgram, Misha, McCleary - December 17-18, 2025
Bouchet and its associated storage - June 2 - June 5, 2025
Grace, Milgram and Misha and their associated storage - June 9 - June 12, 2025
McCleary and its associated storage - June 10 - June 12, 2025
