System Updates Archive

  • Grace file system offline - fixed.

    Tuesday, January 6, 2015 - 4:30pm to 5:00pm

    The Grace file system went offline at 4:30pm. We are currently looking into it; the ETA is 30 minutes. Running jobs were lost due to the outage. Please check your jobs.

  • Scratch2 quota problems on Louise

    Tuesday, December 30, 2014 - 8:28pm to Wednesday, January 21, 2015 - 5:00pm

    Update 21-Jan-2015: The quota reporting issue has been resolved.

    Update 12-Jan-2015 12:08 EST: The downtime is complete. Space usage is being recalculated and must finish before quotas are re-enabled.

    Update 12-Jan-2015 10:49 EST: The following filesystems will be unmounted and then re-mounted at 12:00 EST today:

    • /scratch
    • /scratch2
    • /home2
    • /data2
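
    Once the remount completes, a quick check of the affected mounts will confirm they are back, for example:

    $ df -h /scratch /scratch2 /home2 /data2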


    Update 08-Jan-2015 12:51 EST: The quota remediation steps will be executed at 9:00am Fri 09-Jan-2015. Please note that this work may affect running jobs. More details to follow.

    Update 08-Jan-2015 10:47 EST: An action plan has been formulated by Hitachi Support. The plan is being reviewed and will be executed soon.

    Update 03-Jan-2015 19:56 EST: The quotas have been temporarily removed until the next filesystem check step is complete. You should not experience any false “Disk quota exceeded” messages in the near term.

    Update: The quota on Virtual Volume “/scratch2” was increased to 220TB in an attempt to compensate for the quota mis-reporting; this is nearly double the size of the underlying filesystem. In the interim, Hitachi engineers had us execute a “checkfs” on the filesystem, which by our estimates should complete in ~24 hours. Unfortunately, the quota limit was reached and we cannot increase it until the checkfs completes. At this point, we will likely delete all quotas on /scratch2 until the issue is resolved.

    Hitachi has two potential fixes for the quota mis-reporting issue. One of them will require unmounting the underlying filesystem. We will report back as soon as we know if that course of action is recommended.


    We are currently battling a quota issue on Louise’s Hitachi storage. For several users, /scratch2 is reporting much higher usage than actual. If you experience disk quota issues in /scratch2 while under quota, please contact hpc@yale.edu and we will temporarily increase your quota so processing can continue.
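
    To see whether you are affected, compare what the quota system reports against your actual on-disk usage (a sketch; the per-user directory layout on /scratch2 is an assumption):

    $ quota -s                  # usage as reported by the quota system
    $ du -sh /scratch2/$USER    # actual on-disk usage of your directory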


    We are actively working with Hitachi to resolve this issue.

  • Louise maintenance December 16th - 8:30am - Completed

    Tuesday, December 16, 2014 - 8:30am to 9:30am

    We will take a brief downtime on Louise in order to resynchronize the quotas on the storage system. The downtime itself should last less than an hour. Unfortunately, because home directories are involved, we will have to take the login nodes offline and stop all running jobs. The downtime will take place 16-Dec-2014 at 8:30am.

  • Network Maintenance

    Sunday, November 23, 2014 - 12:30am to 1:30am

    In response to the recent power outage at the 300 George St. data center, Yale ITS’s Data Network Operations team is making changes to make the network more resilient. This requires a maintenance window from 5:30-6:30am on Sunday, November 23. All login nodes will be inaccessible during this time. Interactive jobs will be lost; all other jobs will continue to run without interruption.

  • Omega Scratch online

    Saturday, November 15, 2014 - 7:02pm
  • Omega Scratch unresponsive. Saturday Nov 15.

    Saturday, November 15, 2014 - 1:08pm to 2:02pm

    The Omega scratch file system became unresponsive this afternoon. We are looking into the issue. The file system should be available in the next 45 minutes. Please check your jobs' output.

  • Planned Network Maintenance - brief outage 11/23/2014

    Sunday, November 23, 2014 - 12:30am to 1:30am

    In response to the recent power outage at the 300 George St. data center, Yale ITS’s Data Network Operations team is making changes to make the network more resilient. This requires a downtime from 5:30-6:30am on Sunday, November 23. All login nodes will be inaccessible during this time. The job schedulers on the clusters will not be affected; only connectivity to the login nodes will be interrupted.

  • New HPC Cluster Grace

    Thursday, May 29, 2014 - 5:39am

    The High Performance Computing team is pleased to announce the upcoming availability of a new HPC cluster, Grace. As final testing concludes, we would like to share more information about Grace, which is named after computer scientist and United States Navy Rear Admiral Grace Murray Hopper, who received her Ph.D. in Mathematics from Yale in 1934.

    Grace is an IBM System x High Performance Computing Cluster installed at Yale’s West Campus Data Center. The cluster consists of 72 compute nodes, each with 20 cores and 128GB of RAM (1,440 cores in total). The processors are Intel Xeon E5-2660 v2 processors running at 2.2GHz. All nodes run RHEL 6.4. Attached storage is in the form of 1PB of GPFS (General Parallel File System). The cluster nodes are connected internally via FDR InfiniBand.

    The expected general availability date is Monday June 2nd at 12:00pm.

    All users with accounts on the BulldogJ, BulldogK and Omega clusters will be provisioned an account on Grace. Like Omega, Grace will support only SSH key authentication. For those with accounts on Omega, we will copy the public key(s) installed on Omega to Grace. Other users will receive an email notification containing instructions for providing a public SSH key.
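
    For users who need to supply a key, generating a pair takes one command (a sketch using common defaults):

    $ ssh-keygen -t rsa -b 4096    # writes ~/.ssh/id_rsa (private) and ~/.ssh/id_rsa.pub (public)
    $ cat ~/.ssh/id_rsa.pub        # send only the public key; never share the private key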

    Please be aware that the scheduling, managing, monitoring and reporting of cluster workloads on Grace will be handled differently from other clusters. Instead of Moab, Grace will run the IBM Platform LSF (or simply, LSF) scheduler. Simplified documentation will be made available shortly by the HPC team, and comprehensive documentation will be added to the HPC website.
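
    As a preview, where Moab users typically submit with qsub or msub, a minimal LSF submission looks like the following (a sketch; the resource values and program name are illustrative, not Grace-specific settings):

    $ cat job.sh
    #!/bin/bash
    #BSUB -J myjob           # job name
    #BSUB -n 4               # number of cores
    #BSUB -W 2:00            # wall-clock limit, hh:mm
    #BSUB -o myjob.%J.out    # output file (%J expands to the job ID)
    ./my_program
    $ bsub < job.sh          # bsub reads the #BSUB directives from stdin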

    Please consider the following guidelines when deciding where to run jobs:
    Run on Grace if:

    1. your jobs do not depend heavily on MPI
    2. your jobs require many cores and/or lots of memory on a single node
    3. your jobs are embarrassingly parallel (e.g. they use SimpleQueue; see the sketch below)
    4. your jobs can share a node with other jobs
    5. your jobs tend to use many small-to-medium files

    Run on Omega if:

    1. you depend on MPI to run in parallel on large numbers of cores/nodes
    2. you need high-performance (parallel) I/O, particularly with large files
    3. you have special node/queue privileges on Omega
    4. your jobs require exclusive node access
    5. you need GPUs

    In other cases, you may wish to select a cluster based on the observed cluster load or according to which one has the proper software for your work.
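
    As an illustration of the embarrassingly parallel case in item 3 above, tools like SimpleQueue work from a plain task file of independent commands, one per line (the file and program names here are hypothetical):

    $ cat tasks.txt
    ./analyze --input sample01.dat --output results01.txt
    ./analyze --input sample02.dat --output results02.txt
    ./analyze --input sample03.dat --output results03.txt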

    As previously communicated, steps are being taken to decommission the BulldogJ and BulldogK clusters. If you are affected by this, please begin identifying data that needs to be retained and moved elsewhere. It is strongly recommended that unnecessary data be deleted at this time, prior to migration. It should be expected that it may take up to 10 hours to transfer 1TB of data to a new location, though actual transfer times will depend on numerous factors. Data transfers must be completed by July 11th.
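
    For the transfers themselves, a resumable copy with rsync is one reasonable approach (a sketch; the hostname and paths are placeholders). For scale, 10 hours per TB corresponds to roughly 28MB/s of sustained throughput.

    $ rsync -avP /path/on/bulldogj/ netid@grace.hpc.yale.edu:/destination/path/
      # -a preserves permissions and times, -v is verbose,
      # -P shows progress and lets an interrupted copy resume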

    If you are currently using the Omega cluster, please consider using Grace instead, based on the guidelines above.

    If you have any questions or concerns about this exciting new offering, please contact hpc@yale.edu.

  • ulimits placed on Omega login node

    Wednesday, June 4, 2014 - 2:40pm

    The following ulimits have been placed on the Omega login node:

    • processes reniced to 10
    • resident memory per process limited to 1GB
    • total virtual memory limited to 2GB
    • 500 processes per user
    $ ulimit -u    # maximum number of user processes
    500

    $ ulimit -v    # virtual memory limit, in KB (2097152 KB = 2GB)
    2097152

    $ ulimit -m    # resident memory limit, in KB (1048576 KB = 1GB)
    1048576

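    Limits like these are commonly enforced through pam_limits; a hypothetical /etc/security/limits.conf fragment matching the values above might look like this (the actual mechanism used on the login node is not specified here):

    # hypothetical limits.conf entries; memory values are in KB
    *    hard    nproc       500        # max processes per user
    *    hard    as          2097152    # address space (virtual memory), 2GB
    *    hard    rss         1048576    # resident set size per process, 1GB
    *    -       priority    10         # new processes start at nice 10
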
  • Bulldog J/K Decommission

    Thursday, June 5, 2014 - 3:34pm

    The High Performance Computing team is pleased to announce the upcoming availability of a new HPC cluster, Grace. Final acceptance testing of the cluster is in progress, and the process of decommissioning the BulldogJ and BulldogK clusters has begun. Over the next two months, data and jobs will need to be migrated to other clusters.

    What do I need to know?

    The schedule of events to support the decommissioning is as follows:

    • May 27: Job submission to the fas_very_long queue will be suspended.
    • June 16: Job submission to all other queues will be suspended.
    • June 27: All compute nodes will be turned off by this date; any active jobs will be cancelled.
    • July 11: All data must be moved off BulldogJ and BulldogK by this date.

    What should you do now?

    • Plan ahead for the suspension of the job queues on the dates listed.
    • Begin identifying data that needs to be retained and moved elsewhere. It is strongly recommended that unnecessary data be deleted at this time, prior to migration. It should be expected that 1TB of data will take approximately 10 hours to transfer to a new location; actual times depend on network traffic in and out of the clusters, as well as your local network connection.

    Assistance will be available to any researchers who need help identifying a new location for their data and jobs, or with moving the data. If you have any questions or concerns about this work or the schedule, please contact hpc@yale.edu.

    Thank you in advance for your cooperation.
