System Updates Archive
-
9/27/2022: One-day maintenance will affect some Grace nodes and all Milgram compute nodes
To perform maintenance on the electrical supply that powers part of the HPC Data Center at West Campus, in preparation for adding additional hardware, some compute nodes will be unavailable starting on Tuesday, September 27, 2022, at 8:00 am. Maintenance is expected to be completed by the end of the day, after which the nodes will be re-enabled.
The impacted nodes are all compute nodes on Milgram and those with a node name starting with “p08” on Grace. This affects the following commons and PI partitions, though in some cases not all nodes in a partition are affected:
Milgram: all compute nodes

Grace:
  bigmem: 3 nodes (5 nodes unaffected)
  day: 66 nodes (233 nodes unaffected)
  gpu: 4 nodes with V100 GPUs and 5 nodes with RTX 2080 Ti GPUs (22 nodes with a100, k80, p100, rtx5000 GPUs unaffected)
  gpu_devel: 1 node
  mpi: 88 nodes (44 nodes unaffected)
  transfer: 2 nodes
  week: 17 nodes (8 nodes unaffected)
  pi_balou: 9 nodes (44 nodes unaffected)
  pi_berry: 1 node
  pi_econ_io: 6 nodes
  pi_econ_lp: 5 nodes (8 nodes unaffected)
  pi_esi: 36 nodes
  pi_gelernter: 1 node (1 node unaffected)
  pi_hodgson: 1 node
  pi_howard: 1 node
  pi_jorgensen: 3 nodes
  pi_levine: 20 nodes
  pi_lora: 4 nodes
  pi_manohar: 4 nodes (11 nodes unaffected)
  pi_ohern: 2 nodes (20 nodes unaffected)
  pi_polimanti: 2 nodes

The system will automatically start using the nodes again once they are available. An email notification will be sent when the maintenance has been completed and the nodes are available.
As the maintenance window approaches, the Slurm scheduler will not start any job on the impacted nodes if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on September 27, 2022). If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.” (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.)
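For example, a minimal way to check why a job is pending and to resubmit it with a time limit that fits before the window (the 4-hour limit and the script name my_job.sh below are only illustrative):

    # List your pending jobs along with the scheduler's reason (e.g. ReqNodeNotAvail)
    squeue -u $USER -t PENDING -o "%.18i %.9P %.8T %R"

    # Request only the wallclock time the job actually needs, e.g. 4 hours,
    # so it can finish before 8:00 am on September 27
    sbatch --time=04:00:00 my_job.sh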
-
Tuesday, September 6: Yale network issues impacting Ruddle logins.
Tuesday, September 6, 2022 - 9:00am
Yale network issues are currently preventing logins via ssh to Ruddle. ITS does not have an ETA for resolution at this time. We apologize for the inconvenience.
-
Resolved - Milgram Network Interruption
09:00 9/1/2022 - RESOLVED: ITS has fixed the issue with the VPN. If you are still having issues connecting to Milgram, please disconnect from the VPN, then reconnect to the VPN, and then try logging in to Milgram again.
15:10 8/31/2022 - YCRC staff are currently working on identifying and resolving an issue that is affecting access to the Milgram cluster.
-
8/17/2023: Gibbs filesystem performance issue impacting McCleary
Thursday, August 17, 2023 - 7:30am
We are currently experiencing a performance issue with the Gibbs filesystem. This issue is impacting the McCleary cluster. We are working with the vendor to resolve the issue as quickly as possible, and we will post updates as we learn more.
-
Performance issues on the Palmer filesystem
Tuesday, August 9, 2022 - 12:00am
We are aware of performance issues on the Palmer filesystem. We are working with the vendor to resolve the issues as quickly as possible. We apologize for the inconvenience.
-
Grace Scheduled Maintenance
Tuesday, August 2, 2022 - 8:00am to Thursday, August 4, 2022 - 5:00pm
Scheduled maintenance will be performed on Grace beginning Tuesday, August 2, 2022, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, August 4, 2022.
During this time, logins will be disabled and connections via Globus will be unavailable. The Loomis storage will not be available on the Farnam cluster. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the cluster is available.
-
7/22/2022: RESOLVED: Power outage at West Campus impacted Grace and Milgram clusters.
A large portion of Grace and all nodes on Milgram went offline on the morning of Friday, 7/22/2022, due to a power issue in the data center. The system has been restored; however, many jobs died due to the issue. Please check on the status of any jobs that may have been running at that time.
-
Grace and Milgram Clusters Unavailable
Friday, July 22, 2022 - 9:00am
A power issue at the West Campus data center has impacted the Grace and Milgram clusters. Facilities, ITS, and YCRC are working to restore service.
-
COMPLETED: Planned Maintenance: Cluster access interruption
Thursday, July 14, 2022 - 7:30am
Yale ITS will perform maintenance that will interrupt the connection between Yale’s Science Network (where the HPC clusters are hosted) and the Campus Network on Thursday, July 14, 2022, at 7:30am. The interruption should last for about 30 seconds. During this brief interruption, new attempts to log in to the clusters will fail, and existing connections may be dropped. Existing interactive jobs may be impacted.
Before the maintenance, please save your work, quit interactive applications, exit any interactive jobs, and log off of the clusters. Batch jobs should not be impacted.
-
Milgram Scheduled Maintenance
Dear Milgram Users,
We will perform scheduled maintenance on Milgram starting on Tuesday, June 7, 2022, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, June 9, 2022.
During this time, logins to the cluster will be disabled. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the cluster is available.
As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on June 7, 2022). You can run the command “htnm” (short for “hours_to_next_maintenance”) to get the number of hours until the next maintenance period, which can aid in submitting jobs that will run before maintenance begins.

If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.” (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.)

Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order. All running jobs will be terminated at the start of the maintenance period. Please plan accordingly.
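For example, you could check the remaining time and size your request to fit before the window (the 24-hour limit and the script name my_job.sh below are only illustrative):

    # Print the number of hours until the next maintenance period
    htnm

    # If, say, more than 24 hours remain, a job requesting 24 hours can still start
    sbatch --time=24:00:00 my_job.sh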
Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.
Sincerely,
Paul Gluhosky