System Updates Archive
-
Omega Issues - Update
Friday, September 11, 2015 - 3:30pm
As of today, September 11, the team is continuing to restore files to /home from tape backup, and to recover as many small files (under 1 MB in size) as possible from /scratch. We will continue these processes until Sunday, September 13, when we will suspend them (even if they are incomplete) and begin file system confidence tests, so that we can bring Omega back into normal service Monday morning. We will continue to restore files from the /home backup tapes next week, if necessary, even after Omega is back in service. (These files will be placed in a separate disk area to avoid disruption.) We may also be able to recover additional corrupted files from /scratch (including some files larger than 1 MB) after Omega returns to normal service, but we will need to coordinate with individual principal investigators to do so.
We believe that the confidence tests on Omega will proceed without issue on Sunday, but their successful completion will be necessary before we can return Omega to normal service. When the tests are complete, we will send out a communication indicating Omega’s status and readiness for service Monday morning. At that time, we will provide users with lists of files that were damaged, including the status of each one (i.e., recovered, potentially recoverable, or lost). Logins will remain disabled until Omega is returned to normal service.
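Once those lists are available, a short script can help you pick out the entries that fall under your own directories. The sketch below is only an illustration: it assumes a hypothetical plain-text list with one "<path> <status>" entry per line, which may not match the format of the lists we ultimately distribute, so adjust the parsing accordingly.

# check_damaged.py -- list entries from a damaged-file report that fall under your own
# /home or /scratch directories. The one-"<path> <status>"-per-line format is an assumption,
# not necessarily the format of the lists that will be distributed.
import sys

def my_damaged_files(report_path, netid):
    prefixes = ("/home/{0}/".format(netid), "/scratch/{0}/".format(netid))  # adjust to your paths
    with open(report_path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) < 2:
                continue                        # skip blank or malformed lines
            path, status = parts[0], parts[-1]
            if path.startswith(prefixes):
                yield path, status

if __name__ == "__main__":
    report, netid = sys.argv[1], sys.argv[2]
    for path, status in my_damaged_files(report, netid):
        print("{0:<24} {1}".format(status, path))

Running "python check_damaged.py damaged_files.txt nd123" (with nd123 standing in for your netid and damaged_files.txt for the distributed list) would print only the entries under /home/nd123 and /scratch/nd123.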
We are deeply sympathetic to the impact of this issue on your research, and we sincerely apologize for the inconvenience. Once Omega is back in normal operation, we will provide any assistance we can to help you recover from this unfortunate disruption.
To check for the latest updates, please see the status page on the Yale Center for Research Computing website. If you have any questions, concerns or comments, please contact us at hpc@yale.edu.
-
Omega Issues - Update
Tuesday, September 8, 2015 - 12:30pm
File recovery processes on Omega proceeded throughout the holiday weekend and are continuing. Updates, as they develop, will continue to be posted here. Logins remain disabled. When Omega is returned to service, we will provide users with lists of files that were damaged and recovered, as well as lists of files we believe were lost.
We are very mindful of the serious impact this may have on your research, and we apologize for any inconvenience this issue has caused. As always, if you have any questions, concerns or comments, please email hpc@yale.edu.
-
Omega Issues
Thursday, September 3, 2015 - 10:00pm
During file system maintenance on Omega last week, the HPC team discovered file corruption on both the home and scratch partitions. We anticipate that the damaged files on the home partition will be restored from backup tapes, so the HPC team and the hardware vendor have focused their efforts on recovering as many scratch files as possible. We believe that we can recover a significant number of files, but there will still be data loss. To date, the diligent efforts of the HPC team have recovered roughly 25% of the corrupt files, and we believe that it will be fruitful to continue the recovery effort throughout the holiday weekend.
Our current plan is to start a number of confidence tests on Tuesday to ensure that the file system is healthy. After the tests are complete, we will bring Omega back online for regular usage. Since we do not know exactly how long the recovery and testing processes will take, we are not able to provide a firm estimate of when Omega will be returned to regular service. Our aim is to recover as much data as reasonably possible and to ensure that future usage will be entirely safe. When we return Omega to service, we will provide users with lists of files that were damaged and recovered, as well as lists of files that we believe were lost. Please note that until Omega is returned to service, all logins will be disabled.
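For context, a confidence test of this kind typically amounts to writing known data to the file system, forcing it to disk, reading it back, and verifying that nothing changed. The sketch below illustrates the idea only; it is not the actual test suite we will run, and the /scratch test path is a placeholder.

# confidence_check.py -- illustrative write/read/verify round trip of the kind a file system
# confidence test might perform. This is a sketch of the concept, not the actual tests we run.
import hashlib
import os

def round_trip(path, size_mb=64):
    """Write random data, force it to disk, read it back, and compare checksums."""
    payload = os.urandom(size_mb * 1024 * 1024)
    written = hashlib.sha256(payload).hexdigest()
    with open(path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())                   # make sure the data reaches the file system
    with open(path, "rb") as fh:
        read_back = hashlib.sha256(fh.read()).hexdigest()
    os.remove(path)
    return written == read_back

if __name__ == "__main__":
    ok = round_trip("/scratch/confidence_test.bin")   # hypothetical test location
    print("round trip OK" if ok else "MISMATCH: data read back differs from data written")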
We are very mindful of the serious impact this may have on your research, and we apologize for any inconvenience this issue has caused.
We will be updating this status page with current information, so please check for updates. As always, if you have any questions, concerns or comments, please email hpc@yale.edu.
-
Power Outage Impacting Louise Cluster
Thursday, August 20, 2015 - 2:12am
A brief power outage on August 20th, 2015 at approximately 2:12 am resulted in 100 nodes on Louise crashing and rebooting. Any job running on Louise at that time may have been impacted, and we recommend that those jobs be checked. The HPC team will continue to monitor the situation, and updates will be posted to this page as needed.
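One rough way to check a job is to look for output files that stopped updating right around the outage and have not been touched since; such files often belong to runs that died when their node rebooted. The sketch below illustrates that heuristic; the six-hour window and the directory layout are assumptions, and you should always confirm by inspecting the files themselves.

# flag_stalled_jobs.py -- rough check for job output files that stopped updating around the
# August 20, 2015 ~2:12 am outage. The window size and directory layout are assumptions.
import os
import sys
from datetime import datetime, timedelta

OUTAGE = datetime(2015, 8, 20, 2, 12)                 # approximate time of the power blip

def stalled_outputs(job_dir, window_hours=6):
    """Yield files last modified in the hours just before the outage and not touched since."""
    earliest = OUTAGE - timedelta(hours=window_hours)
    for root, _dirs, files in os.walk(job_dir):
        for name in files:
            path = os.path.join(root, name)
            mtime = datetime.fromtimestamp(os.path.getmtime(path))
            if earliest <= mtime <= OUTAGE:
                yield path, mtime

if __name__ == "__main__":
    for path, mtime in stalled_outputs(sys.argv[1]):
        print("{0}  {1}".format(mtime.strftime("%Y-%m-%d %H:%M"), path))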
We apologize for any inconvenience this may have caused. As always, if you have any questions or concerns, please don’t hesitate to contact us at hpc@yale.edu.
-
Omega Extended Maintenance 8/24/15 to 8/31/15
Monday, August 24, 2015 - 8:00am
An extended maintenance window is scheduled for the Omega cluster, which includes both the Geology and Geophysics (GEO) and High Energy Physics (HEP) clusters. Maintenance will begin at 8 am on Monday, August 24th. All systems will be available for use again by the morning of Monday, August 31st.
-
Emergency Maintenance - Grace Cluster and GPFS Filesystem
Tuesday, August 4, 2015 - 8:00am
Over the past several weeks we have experienced a number of outages of our GPFS filesystem, which primarily serves the Grace cluster and which is also mounted on other clusters. To remedy the issues we have been experiencing with GPFS, we are conducting an emergency outage of the Grace cluster and the GPFS filesystem (/gpfs) on all clusters. Maintenance began on Tuesday, August 4th at 8:00 am. We now expect maintenance to be complete by the morning of Saturday, August 8th, but this is subject to change. Please refer back to this page for updates.
We apologize for the late notice and for any inconvenience this may cause. As always, if you have any questions, concerns or comments, please email hpc@yale.edu.
-
Omega Unscheduled Outage
Thursday, July 16, 2015 - 5:15am to 7:30am
At approximately 5:15 am today, a circuit breaker tripped, cutting power to about half of the nodes on the Omega cluster. Power was restored to those nodes at approximately 7:30 am.
The root cause remains under investigation by Data Center Engineering.
The following jobs were impacted:
compute-45-2 Down cpu 0:16 load jobname=flvc36mrun2.pbs user=mjr92 q=gputest compute-45-3 Down cpu 0:16 load jobname=flvc36mrun2.pbs user=mjr92 q=gputest compute-45-4 Down cpu 0:16 load jobname=flvc36mrun2.pbs user=mjr92 q=gputest compute-46-1 Down cpu 0:12 load jobname=flvc36mrun2.pbs user=mjr92 q=gputest compute-46-2 Down cpu 0:12 load jobname=flvc36mrun2.pbs user=mjr92 q=gputest compute-46-4 Down cpu 0:12 load jobnum=4803721 jobname=...Se-3_b3lypecp user=wd89 q=esi compute-46-5 Down cpu 0:12 load jobnum=4803705 jobname=2_CdSe-3_pw91ecp user=wd89 q=esi compute-46-8 Down cpu 0:12 load jobnum=4803703 jobname=2_CdSe-3_m06lecp user=wd89 q=esi compute-47-1 Down cpu 0:12 load jobname=2_CdSe-3_m06lecp user=wd89 q=esi compute-47-3 Down cpu 0:12 load jobnum=4803700 jobname=3_CdSe-3_pw91ecp user=wd89 q=esi compute-47-4 Down cpu 0:12 load jobname=3_CdSe-3_pw91ecp user=wd89 q=esi compute-47-7 Down cpu 0:12 load jobname=3_CdSe-3_pw91ecp user=wd89 q=esi compute-48-2 Down cpu 0:12 load jobnum=4803491 jobname=20_41_TS_qm user=ma583 q=esi compute-48-4 Down cpu 0:12 load jobname=20_41_TS_qm user=ma583 q=esi compute-48-5 Down cpu 0:12 load jobname=20_41_TS_qm user=ma583 q=esi compute-48-7 Down cpu 0:12 load jobname=20_41_TS_qm user=ma583 q=esi compute-49-3 Down cpu 0:12 load jobnum=4803027 jobname=..._Thiophene.sh user=br287 q=esi compute-49-6 Down cpu 0:12 load jobname=CdSe270-2 user=wd89 q=esi compute-49-8 Down cpu 0:12 load jobnum=4803091 jobname=..._sec_D61A____ user=ma583 q=esi compute-50-2 Down cpu 0:12 load jobname=..._sec_D61A____ user=ma583 q=esi compute-50-3 Down cpu 0:12 load jobname=..._sec_D61A____ user=ma583 q=esi compute-50-4 Down cpu 0:12 load jobname=..._sec_D61A____ user=ma583 q=esi compute-50-6 Down cpu 0:12 load jobnum=4803581 jobname=...e_Opt_BS1.pbs user=ky254 q=esi compute-50-7 Down cpu 0:12 load jobnum=4803581 jobname=...e_Opt_BS1.pbs user=ky254 q=esi compute-50-8 Down cpu 0:12 load jobnum=4803189 jobname=..._Part_Opt.pbs user=ky254 q=esi compute-51-1 Down cpu 0:12 load jobnum=4797311 jobname=CdSe270 user=wd89 q=esi compute-51-3 Down cpu 0:12 load jobnum=4803326 jobname=...qc_restart.sh user=br287 q=esi compute-51-8 Down cpu 0:12 load jobnum=4797430 jobname=CdSe270-2 user=wd89 q=esi compute-52-1 Down cpu 0:12 load jobnum=4797430 jobname=CdSe270-2 user=wd89 q=esi compute-52-3 Down cpu 0:12 load jobname=CdSe270-2 user=wd89 q=esi compute-53-1 Down cpu 0:12 load jobnum=4797430 jobname=CdSe270-2 user=wd89 q=esi compute-53-2 Down cpu 0:12 load jobnum=4800703 jobname=...y2_opt_xqc.sh user=br287 q=esi compute-53-3 Down cpu 0:12 load jobname=...y2_opt_xqc.sh user=br287 q=esi compute-53-4 Down cpu 0:12 load jobnum=4803305 jobname=...A__imp_41____ user=ma583 q=esi compute-53-5 Down cpu 0:12 load jobnum=4800641 jobname=corr user=jh943 q=esi compute-53-6 Down cpu 0:12 load jobnum=4797430 jobname=CdSe270-2 user=wd89 q=esi compute-53-7 Down cpu 0:12 load jobnum=4800639 jobname=corr user=jh943 q=esi compute-53-8 Down cpu 0:12 load jobname=corr user=jh943 q=esi compute-37-2 Down cpu 0:8 load jobnum=4802733 jobname=...12-part52.txt user=fk65 q=fas_normal compute-37-3 Down cpu 0:8 load jobnum=4802539 jobname=...0.160-1850-dp user=pd283 q=fas_normal compute-37-4 Down cpu 0:8 load jobnum=4802504 jobname=opp.2x10x1.1 user=md599 q=fas_normal compute-37-9 Down cpu 0:8 load jobnum=4801872 jobname=...hfb_dlnz_long user=cng8 q=fas_very_long compute-37-10 Down cpu 0:8 load jobnum=4788165 jobname=...cavity_009.sh user=awc24 q=fas_very_long qstat: Unknown Job Id Error 4799449.rocks.omega.hpc.yale.internal 
compute-37-14 Down cpu 0:8 load jobnum=4802534 jobname=...0.270-1920-dp user=pd283 q=fas_normal compute-37-15 Down cpu 0:8 load jobnum=4802345 jobname=SiLTO.12si.0.5O user=ak688 q=fas_normal compute-38-6 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-38-7 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-38-8 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-38-9 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-38-10 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-38-11 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-38-13 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-38-14 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-38-15 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-44-3 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-44-4 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-44-8 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-44-10 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-44-12 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-44-13 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-44-14 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal compute-44-16 Down cpu 0:8 load jobname=opp.2x12x1.2 user=md599 q=fas_normal qstat: Unknown Job Id Error 4799449.rocks.omega.hpc.yale.internal compute-15-7 Down cpu 0:8 load jobnum=4802287 jobname=...tion_31111_17 user=jmm357 q=fas_normal qstat: Unknown Job Id Error 4799449.rocks.omega.hpc.yale.internal qstat: Unknown Job Id Error 4799449.rocks.omega.hpc.yale.internal qstat: Unknown Job Id Error 4799449.rocks.omega.hpc.yale.internal qstat: Unknown Job Id Error 4799449.rocks.omega.hpc.yale.internal qstat: Unknown Job Id Error 4799449.rocks.omega.hpc.yale.internal compute-25-2 Down cpu 0:8 load jobname=...0.160-1850-dp user=pd283 q=fas_normal compute-25-7 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-25-8 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-25-11 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-25-12 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-25-13 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-26-7 Down cpu 0:8 load jobname=7-14-15-aspect19 user=krv8 q=fas_normal compute-26-8 Down cpu 0:8 load jobnum=4802318 jobname=7-14-15-aspect19 user=krv8 q=fas_normal compute-26-9 Down cpu 0:8 load jobnum=4802345 jobname=SiLTO.12si.0.5O user=ak688 q=fas_normal compute-26-15 Down cpu 0:8 load jobnum=4802235 jobname=H.2x2.neg.PTO user=ak688 q=fas_normal compute-27-1 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-27-2 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-27-3 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-27-4 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-27-5 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-27-6 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-27-8 Down cpu 0:8 load jobname=openX_1.0_1 user=mw564 q=fas_long compute-27-9 Down cpu 0:8 load jobname=openX_1.0_1 user=mw564 q=fas_long compute-27-11 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-27-12 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal 
compute-27-14 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-27-15 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-27-16 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-28-3 Down cpu 0:8 load jobname=7-14-15-aspect18 user=krv8 q=fas_normal compute-28-5 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-28-6 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-28-8 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-28-10 Down cpu 0:8 load jobnum=4802235 jobname=H.2x2.neg.PTO user=ak688 q=fas_normal compute-28-15 Down cpu 0:8 load jobname=openX_0.35_2 user=mw564 q=fas_long compute-29-1 Down cpu 0:8 load jobnum=4800438 jobname=openX_0.35_2 user=mw564 q=fas_long compute-29-7 Down cpu 0:8 load jobname=7-14-15-aspect9 user=krv8 q=fas_normal compute-29-9 Down cpu 0:8 load jobname=7-14-15-aspect9 user=krv8 q=fas_normal compute-29-10 Down cpu 0:8 load jobname=7-14-15-aspect9 user=krv8 q=fas_normal compute-29-12 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-29-15 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-29-16 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-30-6 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-30-7 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-30-10 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-30-11 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-30-15 Down cpu 0:8 load jobname=7-14-15-aspect9 user=krv8 q=fas_normal compute-30-16 Down cpu 0:8 load jobname=7-14-15-aspect9 user=krv8 q=fas_normal compute-31-1 Down cpu 0:8 load jobnum=4802235 jobname=H.2x2.neg.PTO user=ak688 q=fas_normal compute-31-2 Down cpu 0:8 load jobname=H.2x2.neg.PTO user=ak688 q=fas_normal compute-31-4 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-31-5 Down cpu 0:8 load jobnum=4802379 jobname=SiSBTOLT.5mlO.2 user=ak688 q=fas_normal compute-31-7 Down cpu 0:8 load jobnum=4802379 jobname=SiSBTOLT.5mlO.2 user=ak688 q=fas_normal compute-31-9 Down cpu 0:8 load jobname=SiSBTOLT.5mlO.2 user=ak688 q=fas_normal compute-31-10 Down cpu 0:8 load jobname=SiSBTOLT.5mlO.2 user=ak688 q=fas_normal compute-31-11 Down cpu 0:8 load jobnum=4803002 jobname=...symP_CBS-APNO user=vaccaro q=fas_normal compute-31-12 Down cpu 0:8 load jobnum=4803001 jobname=...symP_CBS-APNO user=vaccaro q=fas_normal compute-31-14 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-31-15 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-32-4 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-32-5 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-32-6 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-32-10 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-32-11 Down cpu 0:8 load jobnum=4803083 jobname=plrs user=olz3 q=fas_very_long compute-32-12 Down cpu 0:8 load jobname=plrs user=olz3 q=fas_very_long compute-32-13 Down cpu 0:8 load jobname=plrs user=olz3 q=fas_very_long compute-32-15 Down cpu 0:8 load jobnum=4802235 jobname=H.2x2.neg.PTO user=ak688 q=fas_normal compute-32-16 Down cpu 0:8 load jobnum=4802235 jobname=H.2x2.neg.PTO user=ak688 q=fas_normal compute-33-3 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-33-4 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-33-6 Down cpu 0:8 load 
jobname=SimpleQueue user=sd566 q=fas_normal compute-33-7 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-33-9 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-33-10 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-33-12 Down cpu 0:8 load jobname=SimpleQueue user=sd566 q=fas_normal compute-33-13 Down cpu 0:8 load jobnum=4802777 jobname=...12-part92.txt user=fk65 q=fas_normal compute-33-14 Down cpu 0:8 load jobnum=4802318 jobname=7-14-15-aspect19 user=krv8 q=fas_normal compute-33-15 Down cpu 0:8 load jobnum=4802318 jobname=7-14-15-aspect19 user=krv8 q=fas_normal compute-33-16 Down cpu 0:8 load jobnum=4802776 jobname=...12-part91.txt user=fk65 q=fas_normal compute-34-1 Down cpu 0:8 load jobnum=4800448 jobname=openX_1.0_4 user=mw564 q=fas_long compute-34-2 Down cpu 0:8 load jobnum=4800448 jobname=openX_1.0_4 user=mw564 q=fas_long compute-34-3 Down cpu 0:8 load jobnum=4800448 jobname=openX_1.0_4 user=mw564 q=fas_long compute-34-5 Down cpu 0:8 load jobname=7-14-15-aspect18 user=krv8 q=fas_normal compute-34-8 Down cpu 0:8 load jobnum=4802534 jobname=...0.270-1920-dp user=pd283 q=fas_normal compute-34-9 Down cpu 0:8 load jobnum=4802534 jobname=...0.270-1920-dp user=pd283 q=fas_normal compute-34-12 Down cpu 0:8 load jobnum=4780106 jobname=...sd-apVDZ_freq user=vaccaro q=fas_very_long compute-39-13 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-40-4 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-40-5 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-40-6 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-40-8 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-40-10 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-40-13 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-41-6 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-41-7 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-41-8 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-41-10 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-41-11 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-41-15 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-41-16 Down cpu 0:8 load jobnum=4802043 jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-42-1 Down cpu 0:8 load jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-42-2 Down cpu 0:8 load jobname=...0_CSF_256_512 user=jbb83 q=astro_prod compute-42-4 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-42-5 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-42-8 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-42-9 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-42-10 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-42-12 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-42-13 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-42-15 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-43-1 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod 
compute-43-2 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-43-4 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-43-5 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-43-6 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-43-8 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-43-11 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-43-13 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-43-14 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod compute-43-15 Down cpu 0:8 load jobname=L500_CSF_SFoff user=etl28 q=astro_prod
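The listing above is raw scheduler output, so a small script can be handy for pulling out only the entries that mention your netid. The sketch below assumes the whitespace-separated "user=<netid>" fields shown above and treats each "compute-XX-Y" node name as the start of a new entry.

# grep_impacted.py -- scan a saved copy of the impacted-node listing for entries that mention
# a given user. Assumes the "user=<netid>" fields and node names shown in the listing above.
import re
import sys

def entries_for_user(listing_path, netid):
    with open(listing_path) as fh:
        text = fh.read()
    # Each entry begins with a node name like "compute-45-2"; grab everything up to the next one.
    entries = re.findall(r"compute-\d+-\d+.*?(?=compute-\d+-\d+|\Z)", text, re.DOTALL)
    for entry in entries:
        if "user={0}".format(netid) in entry:
            yield " ".join(entry.split())       # collapse whitespace so each entry prints on one line

if __name__ == "__main__":
    for entry in entries_for_user(sys.argv[1], sys.argv[2]):
        print(entry)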
-
GPFS Issues
Saturday, August 1, 2015 - 12:00am
We have recently experienced several outages of our GPFS filesystem, which primarily serves the Grace cluster but is also mounted on other clusters. We understand these outages have been particularly disruptive for some, and we apologize for any inconvenience this has caused.
To remedy the issue, the HPC team addressed networking problems during a recent Grace outage. The GPFS filesystem currently appears stable and is exported to Louise and BDN. The plan is to mount GPFS on Omega as well, but additional testing will be needed and a timeline has yet to be defined. In the meantime, if you notice any continued issues, please report them to hpc@yale.edu with the time and a description of the problem.
As always, if you have any questions, concerns or comments, please contact us at hpc@yale.edu.
Aug 1, 2015:
The HPC team recently repaired the network configurations that had caused several unexpected GPFS file system outages. The GPFS file system resides on Grace, but is also currently mounted on Louise and BulldogN. The plan is to have GPFS mounted on Omega, but additional testing is needed.
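If you are unsure whether /gpfs is visible from a particular login or compute node, one quick check is to look for it in the node's mount table. The sketch below reads /proc/mounts, which is Linux-specific; checking for the /gpfs path (rather than a particular filesystem type) is an assumption about how the mount appears.

# check_gpfs_mount.py -- report whether /gpfs appears in this node's mount table.
# Reading /proc/mounts is Linux-specific; the /gpfs mount point is taken from the note above.
def gpfs_mounts(mount_point="/gpfs"):
    """Yield (path, fstype) for mount table entries at or below the given mount point."""
    with open("/proc/mounts") as fh:
        for line in fh:
            _device, path, fstype = line.split()[:3]
            if path == mount_point or path.startswith(mount_point + "/"):
                yield path, fstype

if __name__ == "__main__":
    mounts = list(gpfs_mounts())
    if mounts:
        for path, fstype in mounts:
            print("{0} is mounted (type {1})".format(path, fstype))
    else:
        print("/gpfs does not appear to be mounted on this node")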
-
Lustre on Omega Briefly Suspended
Tuesday, July 7, 2015 - 12:30pm
The Lustre filesystem on Omega was briefly suspended. Jobs were not killed, but they were likely stalled for roughly one hour, from 12:30 to 1:30 pm. We recommend that jobs running during this window be checked.
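One way to check a job from that window is to scan its log and output files for I/O related error messages. The sketch below is a rough example; the *.log/*.out/*.err patterns and the error strings are assumptions and will vary by application.

# scan_io_errors.py -- scan job log/output files for common I/O error strings that could
# indicate a job was hurt by the brief Lustre suspension. The file patterns and error strings
# are assumptions; adjust them for your application.
import glob
import sys

ERROR_STRINGS = ("Input/output error", "Stale file handle", "Connection timed out")

def suspect_files(job_dir):
    """Yield log/output files under job_dir that contain a known I/O error string."""
    for pattern in ("*.log", "*.out", "*.err"):
        for path in glob.glob("{0}/**/{1}".format(job_dir, pattern), recursive=True):
            try:
                with open(path, errors="replace") as fh:
                    if any(s in line for line in fh for s in ERROR_STRINGS):
                        yield path
            except OSError:
                continue                        # unreadable file; skip it

if __name__ == "__main__":
    for path in suspect_files(sys.argv[1]):
        print(path)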
-
Brief Grace Cluster Outage
Monday, June 29, 2015 - 6:30pm
The Grace cluster experienced brief outages today, June 29, at 1:20 pm and again at 6:30 pm. Any scheduled job running during these times was likely impacted and will need to be restarted.
The HPC team is working with the vendor to understand the root cause, and updates will be posted here.
We apologize for any inconvenience this may have caused. As always, if you have any questions, concerns or comments, please contact us at hpc@yale.edu.