Performance Issues on Grace Storage (Loomis)

Friday, January 28, 2022 - 5:00pm

Over the past week, we have experienced performance problems with the Loomis storage system, which holds the application tree, home directories, scratch, and most project directories for Grace. At first, this only resulted in intermittent slow logins. On Tuesday afternoon, however, the situation became more severe, and we were forced to reboot the entire storage system. Many jobs running at that time failed, so please check the status of your jobs. At that point, we disabled the Slurm partitions on Grace to prevent further job failures.

We have been working closely with the vendor to resolve the issue. The measures we have taken have significantly improved the situation, and we have re-opened most of the partitions on Grace. However, there are still intermittent issues. The scavenge partition is down to reduce load and will remain down until after next week’s maintenance.

As we continue to monitor the situation and work with the vendor, we may take additional actions, such as disabling more partitions. We may also make changes during next week’s Grace maintenance, depending on recommendations from the vendor. We realize that this may be impacting your work, and we appreciate your patience. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.