System Updates Archive

  • Milgram Scheduled Maintenance

    Dear Milgram Users,

    We will perform scheduled maintenance on Milgram starting on Tuesday, June 7, 2022, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, June 9, 2022.

    During this time, logins to the cluster will be disabled. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the cluster is available.

    As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on June 7, 2022). You can run the command “htnm” (short for “hours_to_next_maintenance”) to get the number of hours until the next maintenance period, which can help you submit jobs that will finish before maintenance begins. If you run squeue, held jobs will show as pending with the reason “ReqNodeNotAvail.” (If your job can actually complete in less time than you requested, you may be able to avoid the hold by requesting an appropriate time limit with “-t” or “--time”.)

    Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order. All running jobs will be terminated at the start of the maintenance period. Please plan accordingly.
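    The number reported by “htnm” can also be reasoned about directly. This is an illustrative sketch only, not the actual htnm implementation (the real command knows the cluster’s schedule; here the maintenance start time is written in explicitly, using GNU date):

```shell
# Illustrative only: compute whole hours until a maintenance start time.
# Both timestamps are hard-coded here; on the cluster, "htnm" does this for you.
maintenance_start="2022-06-07 08:00"
now="2022-06-05 08:00"                     # substitute the current time

start_s=$(date -d "$maintenance_start" +%s)   # seconds since epoch
now_s=$(date -d "$now" +%s)

echo $(( (start_s - now_s) / 3600 ))          # → 48
```

    A job whose requested “--time” fits within the hours remaining can still start; a longer request is held with reason “ReqNodeNotAvail” until after the window.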

    Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

    Sincerely,

    Paul Gluhosky

  • Grace Login Issue 5/18/2022 Update 2:05pm - Resolved

    5/18/2022 - 2:05pm - YCRC staff and our storage vendor have identified and resolved the issue that was causing login failures. Users are encouraged to check the status of their jobs.

    5/18/2022 - 1:30pm - YCRC staff are currently investigating an issue affecting Grace login and compute nodes. Updates will be posted as more information becomes available.

  • 5/14/2022 5:25am - Power disruption affected all clusters. Please check the status of your jobs.

    05/14/2022: Due to a power outage at West Campus at about 5:25am, most compute nodes rebooted. Please check the status of your jobs.

  • Scheduled Maintenance on Ruddle

    Dear Ruddle Users,

    As a reminder, scheduled maintenance will be performed on Ruddle beginning Tuesday, May 3, 2022, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, May 5, 2022. 

    During this time, logins will be disabled, running jobs will be terminated, and connections via Globus will be unavailable. The GPFS storage, including all YCGA sequencing data, will not be available. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the cluster is available.

    Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

    Sincerely,

    Paul Gluhosky

  • Scheduled Maintenance on Farnam

    Dear Farnam Users,

    As a reminder, we will perform scheduled maintenance on Farnam starting on Tuesday, April 5, 2022, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, April 7, 2022.

    During this time, logins will be disabled and connections via Globus will be unavailable. Farnam storage (/gpfs/ysm and /gpfs/slayman) will remain available on the Grace cluster. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the cluster is available.

    As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on April 5, 2022). You can run the command “htnm” (short for “hours_to_next_maintenance”) to get the number of hours until the next maintenance period, which can help you submit jobs that will finish before maintenance begins. If you run squeue, held jobs will show as pending with the reason “ReqNodeNotAvail.” (If your job can actually complete in less time than you requested, you may be able to avoid the hold by requesting an appropriate time limit with “-t” or “--time”.) Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.
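    For example, a batch script that requests a time limit short enough to start before the window might carry directives like these (the job name, resource request, and commands are placeholders; the relevant directive is “--time”):

```shell
#!/bin/bash
#SBATCH --job-name=pre_maintenance_run   # placeholder name
#SBATCH --time=24:00:00                  # must not extend past 8:00 am April 5
#SBATCH --ntasks=1                       # placeholder resource request

# ... your actual commands here ...
```

    If the requested time limit crosses into the maintenance window, squeue will show the job pending with reason “ReqNodeNotAvail” until maintenance ends.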

    Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

    Sincerely,

    Paul Gluhosky

  • Grace Maintenance Extended

    Thursday, February 3, 2022 - 5:00pm

    Due to ongoing issues with the Loomis storage system and discussions with vendors to resolve them, we are continuing to work on the Grace cluster at this time. The maintenance period is, therefore, being extended. A further email notification will be sent when the maintenance has been completed and the cluster is fully available or if there is a significant change in the status. The Loomis storage will remain unavailable on Farnam.

    We recognize the impact that this has on your work and we are working hard to resolve the problems as quickly as possible. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

  • Grace Scheduled Maintenance

    Tuesday, February 1, 2022 - 8:00am to Thursday, February 3, 2022 - 5:00pm

    Scheduled maintenance will be performed on Grace beginning Tuesday, February 1, 2022, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, February 3, 2022. 

    In addition to the normal maintenance activities, we are working to address the Loomis performance issue.

    During this time, logins will be disabled and connections via Globus will be unavailable. The Loomis storage will not be available on the Farnam cluster. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the cluster is available.

  • Performance Issues on Grace Storage (Loomis)

    Friday, January 28, 2022 - 5:00pm

    Over the past week, we have experienced performance problems with the Loomis storage system (which holds the application tree, home directories, scratch, and most project directories for Grace). At first, this only resulted in intermittent slow logins. On Tuesday afternoon, however, the situation became more severe, and we were forced to reboot the entire storage system. Many jobs running at that time failed, so please check the status of your jobs. At that point, we disabled the Slurm partitions on Grace to prevent further job failures.

    We have been working closely with the vendor to resolve the issue. We have taken measures that have significantly improved the situation and have re-opened most of the partitions on Grace. There are still intermittent issues, however. The scavenge partition is down to reduce load and will remain down until after next week’s maintenance. As we continue to monitor the situation and work with the vendor, we may take additional actions, such as disabling more partitions. We may also make additional changes during next week’s Grace maintenance, depending on recommendations from the vendor.

    We realize that this may be impacting your work, and we appreciate your patience. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

  • Grace Update: Partitions now open

    Thursday, January 27, 2022 - 12:30pm

    We have made significant progress with the storage issue and re-opened all partitions on Grace. Please let us know if you see any issues. Jobs that were running on Tuesday afternoon (01/25/2022) likely failed. Please check the status of your jobs.

  • Loomis Storage is Unavailable, Grace job partitions disabled

    Tuesday, January 25, 2022 - 3:00pm

    We are currently addressing an issue with the Loomis storage system. All job partitions on Grace are disabled while we work to resolve the issue.
