Computing Systems Status
Updated 6/15/2026: Due to a cooling issue in one of the McCleary racks at West Campus, the nodes with names beginning with r813 (except for r813u29n09 and r813u29n11) are currently down. The outage is caused by a faulty Liquid Cooling Distribution Unit (CDU), which caused equipment in the rack to overheat and shut down. We are working with the CDU vendor to resolve the issue.
This issue impacts the following McCleary scheduler partitions:
- 26 of the 33 nodes in the day partition
- 14 of the 16 nodes in the week partition
- 1 of the 20 nodes in the gpu partition
- 1 of the 3 nodes in the pi_gerstein_gpu partition
- 2 of the 10 nodes in the pi_jetz partition
- 4 of the 4 nodes in the pi_ohern partition
- 2 of the 2 nodes in the pi_sestan partition
- 4 of the 4 nodes in the pi_tsang partition
Slurm jobs can be submitted as usual; however, any jobs that would run on the affected nodes will not start until the cooling issue is resolved.
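To check whether a particular node falls under this outage, the naming rule stated in this notice (names beginning with r813, except r813u29n09 and r813u29n11) can be sketched in Python. This is only an illustrative helper, not an official tool: the function name `is_affected` is hypothetical, and the sample node names other than the two listed exceptions are made up for the example.

```python
# Rule taken from this notice: every node whose name begins with "r813"
# is down, except the two nodes explicitly listed as exceptions.
AFFECTED_PREFIX = "r813"
UNAFFECTED_EXCEPTIONS = {"r813u29n09", "r813u29n11"}


def is_affected(node_name: str) -> bool:
    """Return True if node_name falls under the current r813 outage."""
    name = node_name.strip().lower()
    return name.startswith(AFFECTED_PREFIX) and name not in UNAFFECTED_EXCEPTIONS


if __name__ == "__main__":
    # "r813u29n10" and "r900u01n01" are invented names used only for illustration.
    for node in ("r813u29n09", "r813u29n10", "r900u01n01"):
        status = "affected" if is_affected(node) else "not affected"
        print(f"{node}: {status}")
```

The check only encodes the rule as written on 6/15/2026; it does not query the scheduler, so the actual node state reported by Slurm remains the authoritative source.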
The vendor will be onsite on Wednesday, 6/17/2026, to resolve the issue. The work is expected to be completed by the end of the day.
We realize this impacts your work and apologize for the inconvenience. If you have any questions, comments, or concerns, please contact us at research.computing@yale.edu.
Resolved past issues:
6/4/2026 13:00: Hopper experienced a system issue that impacted availability. We engaged the storage vendor to resolve the issue. Hopper availability was restored by 15:30.
6/1/2026 8am-3pm: Reduced Capacity on McCleary: To address a cooling issue in one of the McCleary racks at West Campus, we temporarily took down the McCleary nodes that have names beginning with r813 (except for r813u29n09 and r813u29n11). The work took place from 8am to 3pm on Monday, June 1, 2026.
5/11/2026: There was a power drop at the West Campus Data Center on the morning of 5/11/2026. This impacted running jobs on McCleary, Grace, Misha, and Milgram. Please check your jobs to see how they were impacted.
2/13/2026: Bouchet scheduler issues have been resolved. All systems are operational.