Omega Issues

Thursday, September 3, 2015 - 10:00pm

During file system maintenance on Omega last week, the HPC team discovered file corruption on both the home and scratch partitions. We anticipate that the damaged files on the home partition will be restored from backup tapes, so the HPC team and the hardware vendor have focused their efforts on recovering as many scratch files as possible. We believe that we can recover a significant number of files, but there will still be data loss. To date, the diligent efforts of the HPC team have recovered roughly 25% of the corrupted files, and we believe it will be worthwhile to continue the recovery effort throughout the holiday weekend.

Our current plan is to begin a series of confidence tests on Tuesday to ensure that the file system is healthy. Once the tests are complete, we will bring Omega back online for regular use. Because we do not know exactly how long the recovery and testing processes will take, we cannot provide a firm estimate of when Omega will return to regular service. Our aim is to recover as much data as reasonably possible and to ensure that future usage will be entirely safe. When we return Omega to service, we will provide users with lists of files that were damaged and recovered, as well as lists of files that we believe were lost. Please note that all logins will remain disabled until Omega is returned to service.

We are very mindful of the serious impact this issue may have on your research, and we apologize for any inconvenience it has caused.

We will update this status page as new information becomes available, so please check back for updates. As always, if you have any questions, concerns, or comments, please email hpc@yale.edu.