System Updates Archive

  • Performance Issues on Grace Storage (Loomis)

    Friday, January 28, 2022 - 5:00pm

    Over the past week, we have experienced performance problems with the Loomis storage system, which holds the application tree, home directories, scratch space, and most project directories for Grace. At first, this only caused intermittent slow logins. On Tuesday afternoon, however, the situation became more severe, and we were forced to reboot the entire storage system. Many jobs running at that time failed, so please check the status of your jobs. We also disabled the Slurm partitions on Grace at that point to prevent further job failures.

    We have been working closely with the vendor to resolve the issue. The measures we have taken so far have significantly improved the situation, and we have re-opened most of the partitions on Grace. Intermittent issues remain, however. The scavenge partition has been taken down to reduce load and will stay down until after next week’s maintenance. As we continue to monitor the situation and work with the vendor, we may take additional actions, such as disabling more partitions, and we may make further changes during next week’s Grace maintenance depending on the vendor’s recommendations.

    We realize that this may be impacting your work, and we appreciate your patience. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

  • Grace Update: Partitions now open

    Thursday, January 27, 2022 - 12:30pm

    We have made significant progress with the storage issue and re-opened all partitions on Grace.  Please let us know if you see any issues.  Jobs that were running on Tuesday afternoon (01/25/2022) likely failed.  Please check the status of your jobs.
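
    One quick way to check is to query Slurm’s accounting records with sacct. The sketch below is a minimal example; the date range and the list of states are illustrative, so adjust them to your own situation:

        # List your own jobs (allocations only, not individual steps) that
        # ended in a failed state around the time of the storage reboot.
        sacct -u $USER -X \
              -S 2022-01-25 -E 2022-01-27 \
              --state=FAILED,NODE_FAIL,TIMEOUT,CANCELLED \
              -o JobID,JobName%30,State,Start,End,Elapsed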

  • Loomis Storage is Unavailable, Grace job partitions disabled

    Tuesday, January 25, 2022 - 3:00pm

    We are currently addressing an issue with the Loomis storage system. All job partitions on Grace are disabled while we work to resolve the issue.

  • Intermittent issues with Grace Cluster Responsiveness

    Friday, January 21, 2022 - 2:00pm

    YCRC staff are investigating intermittent issues with the responsiveness of the Grace cluster, including logins that take a long time to complete.  We are working to resolve the problem as quickly as possible.

  • Milgram Scheduled Maintenance

    Dear Milgram Users,

    Scheduled maintenance will begin Tuesday, December 7, 2021, at 8:00 am. We expect that the cluster will return to service by the end of the day on Thursday, December 9, 2021. During this time, logins to the cluster will be disabled. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the cluster is available.

    As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on Tuesday, December 7, 2021). If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.” If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time.” Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order. All running jobs will be terminated at the start of the maintenance period. Please plan accordingly.
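
    For example, the following illustrates how to check whether a pending job is being held for the maintenance window and how to request a shorter time limit at submission; the script name and the 12-hour limit are placeholders, so adjust them to your own job:

        # Show your pending jobs along with the scheduler's reason;
        # jobs held for the maintenance window report ReqNodeNotAvail.
        squeue -u $USER -t PENDING -o "%.18i %.30j %.8T %r"

        # Request a wall time short enough to finish before the maintenance
        # starts, e.g. 12 hours.
        sbatch --time=12:00:00 my_job.sh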

    Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

    Sincerely,

    Paul Gluhosky

  • Ruddle Scheduled Maintenance

    Dear Ruddle Users,

    Scheduled maintenance will begin Tuesday, November 2, 2021, at 8:00 am. We expect that the cluster will return to service by the end of the day on Thursday, November 4, 2021. During this time, logins to the cluster will be disabled. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the cluster is available.

    As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on Tuesday, November 2, 2021). If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.” If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time.” Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order. All running jobs will be terminated at the start of the maintenance period. Please plan accordingly.

    Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

    Sincerely,

    Paul Gluhosky

  • Ruddle Issues

    12:00 - 10/5/21 - A network problem is impacting access to storage from Ruddle.  This may prevent logins.  We have temporarily disabled Slurm so that new jobs will not start during the outage.  We are working with the vendor to resolve the issue as quickly as possible.

    09:40 - 10/5/21 - YCRC staff have resolved the network interruption on Ruddle.

    09:08 - 10/5/21 - Ruddle has experienced a network interruption that may prevent users from connecting to the cluster. YCRC staff are currently working on resolving this issue.

  • Farnam Maintenance

    Dear Farnam Users,

    As a reminder, we will perform scheduled maintenance on Farnam starting on Tuesday, October 5, 2021, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, October 7, 2021.

    During this time, logins will be disabled and connections via Globus will be unavailable. Farnam storage (/gpfs/ysm and /gpfs/slayman) will remain available on the Grace cluster. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the cluster is available.

    As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on October 5, 2021). You can run the command “htnm” (short for “hours_to_next_maintenance”) to get the number of hours until the next maintenance period, which can aid in submitting jobs that will run before maintenance begins. If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.” (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.) Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.
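
    As a rough sketch, and assuming htnm prints a plain number of hours as described above, you could use its output when choosing a time limit at submission (the job script name below is a placeholder):

        # Print the number of hours until the next maintenance window.
        htnm

        # If htnm reports, say, 30 hours, submit with a limit that fits
        # inside that window, e.g. 24 hours:
        sbatch --time=24:00:00 my_job.sh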

    Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

  • Resolved: Ruddle Storage Issue

    RESOLVED - 9/14/2021 - 10:45 - Ruddle’s storage service has been resumed by YCRC staff. Users are encouraged to check their job status.

    9/14/2021 - 10:30 - YCRC staff are currently working to resume storage service on the Ruddle cluster. While this issue is being resolved, users will be unable to log in.

  • Milgram Unavailable

    Monday, August 9, 2021 - 2:00pm

    We are currently experiencing an issue with logins to Milgram. We are working to restore access as soon as possible. We apologize for the inconvenience.

    Update 2:45pm - The networking issue has been resolved and access has been restored to the cluster.
