System Updates Archive

  • Omega Network Issue

    Thursday, September 17, 2015 - 1:30pm

    We are currently experiencing intermittent networking issues on Omega.  Roughly 500 nodes are offline and many end users are unable to log in.  The HPC team is investigating and working hard to bring the system back to regular service as soon as possible.  Logins are currently enabled, but they may be disabled as we troubleshoot and fix the issue.

    We sincerely apologize for this inconvenience, especially in light of recent events.

    Please refer to this page for the latest updates and email hpc@yale.edu with any questions, concerns or comments.

    Yale Center for Research Computing

  • Omega is Back in Service

    Tuesday, September 15, 2015 - 3:00pm

    Omega is now back in service.  Files have been restored to /home from tape backup and a significant number of files (under 1 MB in size) have been recovered from /scratch. For instructions on discovering which files have been impacted by corruption, and guidance regarding the potential for recovery of additional files that may be of particular importance, please enter the following command after logging into Omega:

    cat /scratch/maint-notes-sept-2015/README

    We are keenly aware of the impact that this disruption may have on your research, and we sincerely apologize for the inconvenience. We will provide all possible assistance as you recover from this unfortunate event. Please email hpc@yale.edu to request help, or with any questions, concerns or comments.

  • Omega Update

    Monday, September 14, 2015 - 10:30pm

    As of today, September 14, the HPC team has completed restoration of files to /home from tape backup and has recovered a significant number of files (under 1 MB in size) from /scratch. The team has successfully run confidence tests and benchmarks on both the storage system and the InfiniBand network, with entirely satisfactory results. However, we are continuing to seek additional assurances from the Lustre file system development team and our vendors to confirm that we have taken all reasonable steps to ensure that the storage system is in the best possible condition.  At this point, unless the Lustre developers or our vendors inform us of additional recommended actions, we plan to return Omega to service by tomorrow afternoon. At that time, we will provide users with both detailed information about files affected by the storage corruption, and guidance regarding the potential for recovery of additional files that may be of particular importance. 

    We are keenly aware of the impact that this disruption may have on your research, and we sincerely apologize for the inconvenience. We will provide all possible assistance as you recover from this unfortunate event.  Please email hpc@yale.edu to request help, or with any questions, concerns or comments.

    Please refer to this page for the latest updates and please note that logins will be disabled until Omega is brought back into service.

  • Omega Update

    Sunday, September 13, 2015 - 7:00pm

    As of today, September 13, the HPC team has been able to restore files to /home from tape backup and has recovered a significant number of files (under 1 MB in size) from /scratch. Confidence tests run against the storage have been positive.  We have run into some network instability that we need to correct before allowing user logins.  Networking tests are currently underway and will continue throughout this evening.  We will send an additional status communication tomorrow, Monday, September 14, with the results of these networking confidence tests, which will indicate whether Omega is stable and ready to be brought into service.

    We are deeply sympathetic to the impact of this issue on your research, and we sincerely apologize for the inconvenience. Once Omega is back in normal operation, we will provide any assistance we can to help you recover from this unfortunate disruption.

  • Omega Issues - Update

    Friday, September 11, 2015 - 3:30pm

    As of today, September 11, the team is continuing to restore files to /home from tape backup, and to recover as many small files (under 1 MB in size) as possible from /scratch.  We will continue these processes until Sunday, September 13, when we will suspend them (even if they are incomplete) and begin file system confidence tests, so that we can bring Omega back into normal service Monday morning.   We will continue to restore files from the /home backup tapes next week, if necessary, even after Omega is back in service. (These files will be placed in a separate disk area to avoid disruption.) We may also be able to recover additional corrupted files from /scratch (including some files larger than 1 MB) after Omega returns to normal service, but we will need to coordinate with individual principal investigators to do so.

    We believe that the confidence tests on Omega will proceed without issue on Sunday, but their successful completion will be necessary before we can return Omega to normal service.  When the tests are complete, we will send out a communication indicating Omega’s status and readiness for service Monday morning.  At that time, we will provide users with lists of files that were damaged, including the status of each one (i.e., recovered, potentially recoverable, or lost).  Logins will remain disabled until Omega is returned to normal service.

    We are deeply sympathetic to the impact of this issue on your research, and we sincerely apologize for the inconvenience. Once Omega is back in normal operation, we will provide any assistance we can to help you recover from this unfortunate disruption.

    To check for the latest updates, please see the status page on the Yale Center for Research Computing website.  If you have any questions, concerns or comments, please contact us at hpc@yale.edu.

  • Omega Issues - Update

    Tuesday, September 8, 2015 - 12:30pm

    File recovery processes on Omega proceeded throughout the holiday weekend and are continuing. Updates, as they develop, will continue to be posted here.  Logins remain disabled.  When Omega is returned to service, we will provide users with lists of files that were damaged and recovered, as well as lists of files we believe were lost.

    We are very mindful of the serious impact this may have on your research, and we apologize for any inconvenience this issue has caused.   As always, if you have any questions, concerns or comments, please email hpc@yale.edu.

  • Omega Issues

    Thursday, September 3, 2015 - 10:00pm

    During file system maintenance on Omega last week, the HPC team discovered file corruption on both the home and scratch partitions. We anticipate that the damaged files on the home partition will be restored from backup tapes, so the HPC team and the hardware vendor have focused their efforts on recovering as many scratch files as possible. We believe that we can recover a significant number of files, but there will still be data loss. To date, the diligent efforts of the HPC team have recovered roughly 25% of the corrupt files, and we believe that it will be fruitful to continue the recovery effort throughout the holiday weekend. 

    Our current plan is to start a number of confidence tests on Tuesday to ensure that the file system is healthy. After the tests are complete, we will then bring Omega back online for regular usage.  Since we do not know exactly how long the recovery and testing processes will take, we are not able to provide a firm estimate of when Omega will be returned to regular service.  Our aim is to recover as much data as reasonably possible and to ensure that future usage will be entirely safe.  When we return Omega to service, we will provide users with lists of files that were damaged and recovered, as well as lists of files that we believe were lost.  Please note that until Omega is returned to service, all logins will be disabled.

    We are very mindful of the serious impact this may have on your research, and we apologize for any inconvenience this issue has caused.

    We will be updating this status page with current information, so please check for updates.  As always, if you have any questions, concerns or comments, please email hpc@yale.edu.

  • Power Outage Impacting Louise Cluster

    Thursday, August 20, 2015 - 2:12am

    A brief power outage on August 20th, 2015 at approximately 2:12 am caused 100 nodes on Louise to crash and reboot.  Any job running on Louise at that time may have been impacted, and we recommend that these jobs be checked.  The HPC team will continue to monitor the situation, and updates will be posted to this page as needed.

    We apologize for any inconvenience this may have caused.  As always, if you have any questions or concerns, please don’t hesitate to contact us at hpc@yale.edu.

  • Omega Extended Maintenance 8/24/15 to 8/31/15

    Monday, August 24, 2015 - 8:00am

    An extended maintenance window is scheduled for the Omega Cluster, which includes both the Geology and Geophysics (GEO) and High Energy Physics (HEP) clusters.  Maintenance will begin at 8:00 am on Monday, August 24th.  All systems will be available for use again by the morning of Monday, August 31st.

  • Emergency Maintenance - Grace Cluster and GPFS Filesystem

    Tuesday, August 4, 2015 - 8:00am

    Over the past several weeks we have experienced several outages of our GPFS filesystem, which primarily serves the Grace cluster and is also mounted on other clusters.  In an effort to remedy these issues, we are conducting an emergency outage of the Grace cluster and the GPFS filesystem (/gpfs) on all clusters.  Maintenance began on Tuesday, August 4th at 8:00 am.  We now expect maintenance to be complete by the morning of Saturday, August 8th, but this is subject to change.  Please refer back to this page for updates.

    We apologize for the late notice and for any inconvenience this may cause.  As always, if you have any questions, concerns or comments, please email hpc@yale.edu.
