Starting in spring 2025, YCRC updated our approach to system maintenance. Until then, each cluster had two full-downtime maintenance periods per year, each lasting three days. With the new approach, each cluster is still updated twice a year, but only one of the two annual maintenance periods is a full downtime. The other consists of rolling updates to a live cluster. We are working toward a system in which the annual full-downtime maintenance is performed on all YCRC-managed clusters simultaneously. Approximately six months after this full downtime, the clusters will be patched with minor updates on a rolling basis with minimal disruption.
The new approach has several advantages. By consolidating the major cluster updates, YCRC can focus on preparing for and executing those updates once a year instead of nearly once a month, freeing more time for supporting researchers. Performing maintenance simultaneously on all clusters within a data center enables maintenance of subsystems that affect multiple clusters. All clusters are kept on the same major version of the image throughout the year, resulting in a more consistent, easier-to-support environment. For any given cluster, the number of days per year of planned total downtime is reduced, and the number of planned total-downtime periods drops from two to one. The second maintenance period each year consists of rolling updates, which will entail limited disruption.
If you have any questions, comments, or concerns, please contact us at hpc@yale.edu.