System Updates Archive

  • Scheduled Maintenance on Grace, Milgram and McCleary.

    The Milgram and Grace clusters will be unavailable from 2pm on June 19 until 10am on June 21, and McCleary will have limited availability during that time due to electrical work being performed in the HPC Data Center.

    For Grace and Milgram:  During this time, logins will be disabled, running jobs will be terminated, and connections via Globus will be unavailable. We ask that you save your work, close interactive applications, and log off the system prior to the start of the maintenance.

    For McCleary:  During this time, the cluster will be available, but approximately 60% of the nodes in the day partition, all nodes in the week partition, some GPU and large memory nodes, and all YCGA nodes will not be available.
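
    If you are unsure whether the resources you use are affected while the work is underway, Slurm's standard sinfo command can summarize the state of the partitions named above. A minimal sketch (run from a McCleary login node):

        # Summarize node availability in the day and week partitions
        sinfo --summarize -p day,week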

    If you have questions or would like to provide feedback, please contact us at hpc@yale.edu.

  • Issues Connecting to Milgram Cluster

    6/5/23 1:30pm - YCRC is aware of issues impacting some new logins to the Milgram cluster.  Milgram itself is up, and Slurm jobs are not impacted.  However, some new connections into Milgram may fail.  YCRC is working with ITS to resolve the issue as quickly as possible.  If you experience issues, please contact us at hpc@yale.edu.

  • McCleary Scheduled Maintenance 5/30, 9am - 1pm

    The McCleary cluster will be unavailable from 9am-1pm on Tuesday May 30 while maintenance is performed on the YCGA storage.

  • 5/24/2023 11:30am RESOLVED: Ruddle, Farnam, McCleary fully operational after power outage.

    UPDATE - 11:30 5/24/23 - A power disruption at the HPC Data Center on 5/24/2023 that lasted from about 9:45-10:45am impacted the Ruddle, Farnam, and McCleary clusters.  All nodes on Ruddle and Farnam and some nodes on McCleary rebooted.  Power and normal operation have been restored.  However, any jobs running on those nodes at the start of the outage died.  Please check the status of your HPC jobs; one way to do so is sketched at the end of this entry.

    UPDATE - 10:40 5/24/23 - Power has been restored for the Farnam, Ruddle and McCleary clusters. YCRC is currently working to resume normal operation of impacted equipment.

    09:48 5/24/23 - A power disruption at the HPC Data Center has impacted the Farnam, Ruddle and McCleary clusters. YCRC staff, along with electricians, are working on restoring normal operation. Updates will be posted as they become available.
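
    For anyone reviewing their jobs after the outage, the sketch below shows one way to list jobs that were in the system around the disruption, using Slurm's standard sacct accounting command; the time window matches the updates above and should be adjusted to your situation:

        # One line per job (-X) for your user that was in the system between
        # roughly 9:00am and 11:00am on 5/24/23, with its final state
        sacct -X -u $USER -S 2023-05-24T09:00 -E 2023-05-24T11:00 \
              -o JobID,JobName,Partition,State,Start,End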

  • 4/27/2023 10:30am: RESOLVED: Ruddle cluster now online

    Due to a power outage at West Campus, the Ruddle cluster was down on Thursday morning from about 9am through 10:30am.   Power has been restored, and the cluster is available again.  Jobs that were running at the start of the outage failed.  Please check the status of any jobs that may have been running during that time.

  • GLOBUS: Maintenance Downtime for Database Upgrades - Saturday, March 11, 2023

    Saturday, March 11, 2023 - 11:00am to 1:00pm Eastern Time

    Globus services will be unavailable for up to two hours beginning at 10 am US Central on Saturday, March 11, 2023.  During this planned downtime, all Globus services will be unavailable. Please see https://www.globus.org/blog/maintenance-downtime-database-upgrades-saturday-march-11-2023 for more information.

  • Milgram Scheduled Maintenance 2/7/23 - 2/9/23

    Dear Milgram Users,
     
    We will perform scheduled maintenance on Milgram starting on Tuesday, February 7, 2023, at 8:00 am.  Maintenance is expected to be completed by the end of day, Thursday, February 9, 2023.

    During this time, logins to the cluster will be disabled. We ask that you save your work, close interactive applications, and log off the system prior to the start of the maintenance.  An email notification will be sent when the maintenance has been completed and the cluster is available.

    As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on February 7, 2023).  You can run the command “htnm” (short for “hours_to_next_maintenance”) to get the number of hours until the next maintenance period, which can aid in submitting jobs that will run before maintenance begins.  If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.” (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.)  Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.  All running jobs will be terminated at the start of the maintenance period.  Please plan accordingly.
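
    For example, you can check how many hours remain and request a wallclock limit that fits within them; this is a minimal sketch, and my_job.sh is a placeholder for your own batch script:

        # Hours remaining until the maintenance window begins
        htnm

        # Request a 10-hour limit so the job can start and finish before
        # 8:00 am on February 7 (only if your job truly fits in 10 hours)
        sbatch --time=10:00:00 my_job.sh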

    Please visit the status page at research.computing.yale.edu/system-status for the latest updates.  If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

     

  • Resolved - 12/26/22 Palmer Performance Issues

    12/27/22 - 11:59am - The Palmer storage vendor has verified that the Grace performance issue has been resolved. Users are encouraged to check any jobs they had running on 12/26/22.

    12/26/22 - 9:45pm - YCRC staff and the Palmer storage vendor are currently working on resolving performance issues that may affect Grace users.

  • UPDATE - Scheduled Gibbs Maintenance Complete - December 7, 2022

    UPDATE - 12/7/2022 - 12:30pm -

    Dear Farnam and Ruddle Users,

    The scheduled maintenance on the Gibbs storage system has been completed. Gibbs is now available on Farnam and Ruddle. Gibbs remains unavailable on Grace due to the continuation of the scheduled Grace maintenance, which is expected to be completed by the end of the day on Thursday.

    Update 12/6/2022 - Maintenance on Gibbs has continued into the evening, and updates will be posted until the storage is available to users.

    Dear Ruddle and Farnam Users,

    The Gibbs storage system will be unavailable starting on Tuesday, December 6, 2022, at 8:00 am so that a critical firmware fix can be applied. Maintenance is expected to be completed by the end of the day.

    During the maintenance, the Ruddle and Farnam clusters will remain available for logins, but filesets on Gibbs, including YCGA sequencer data stored there, will not be accessible. Globus will remain available for file transfers that do not access Gibbs. Any process that tries to access storage on Gibbs during the outage will fail. Other jobs will proceed unaffected.
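
    If you are unsure whether a directory your jobs rely on is hosted on Gibbs, a quick filesystem check before the window begins can tell you; this is a minimal sketch, with the path shown as a placeholder for one of your own:

        # Show which filesystem holds a given directory; directories on the
        # Gibbs system will be unavailable during the maintenance
        df -P /path/to/your/project/directory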

    An email notification will be sent when the maintenance has been completed. Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

  • Grace Scheduled Maintenance - Dec 6th-Dec 8th

    Dear Grace Users,
     
    As a reminder, we will perform scheduled maintenance on Grace starting on Tuesday, December 6, 2022, at 8:00 am.  Maintenance is expected to be completed by the end of day, Thursday, December 8, 2022. 
     
    During this time, logins will be disabled, running jobs will be terminated, and connections via Globus will be unavailable.  We ask that you save your work, close interactive applications, and log off the system prior to the start of the maintenance.  An email notification will be sent when the maintenance has been completed and the cluster is available.
     
    As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on December 6, 2022).  You can run the command “htnm” (short for “hours_to_next_maintenance”) to get the number of hours until the next maintenance period, which can aid in submitting jobs that will run before maintenance begins.  If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.”  (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.)  Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.
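
    As a quick check, the htnm helper described above together with standard squeue options will show how much time remains and which of your pending jobs are being held for the window; a minimal sketch:

        # Hours until the next maintenance period
        htnm

        # Your pending jobs with the scheduler's reason; jobs held for the
        # maintenance window show ReqNodeNotAvail
        squeue -u $USER -t PENDING -o "%.10i %.20j %.10l %.20r"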
     
    Loomis Decommission
     
    The Loomis GPFS filesystem will be retired and unmounted from Grace and Farnam during the December maintenance. All data except for a few remaining private filesets has already been transferred to other systems (e.g., current software, home, and scratch to Palmer, and project to Gibbs). The remaining private filesets are being transferred to Gibbs in advance of the maintenance, and their owners should have received communications accordingly. The only potential user impact of the retirement is on anyone using the older, deprecated software trees (see below); otherwise, the retirement should have no user impact. Please reach out if you have any concerns or believe you are still using data located on Loomis.
     
    Decommission of Old, Deprecated Software Trees
     
    As part of the Loomis Decommission, we will not be migrating the old software trees located at /gpfs/loomis/apps/hpc, /gpfs/loomis/apps/hpc.rhel6 and /gpfs/loomis/apps/hpc.rhel7. The deprecated modules can be identified as being prefixed with “Apps/”, “GPU/”, “Libs/” or “MPI/” rather than beginning with the software name. If you are using software modules in one of the old trees, please find an alternative in the current supported tree or notify us ASAP so we can continue to support your work.
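
    One way to check whether you are still loading modules from the deprecated trees is to look for the prefixes listed above among your currently loaded modules; a minimal sketch using the standard module command:

        # Print loaded modules one per line and flag any from the old trees
        module -t list 2>&1 | grep -E '^(Apps|GPU|Libs|MPI)/'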
     
    Migration of Current Software Tree and Upcoming Upgrade to Red Hat 8
     
    We have already migrated the software located in /gpfs/loomis/apps/avx to Palmer at /vast/palmer/apps/grace.avx. To continue to support this software without interruption, we are maintaining a symlink at /gpfs/loomis/apps/avx to the new location on Palmer, so software will continue to appear as if it is on Loomis even after the maintenance, despite being hosted on Palmer. When the operating system on Grace is upgraded to Red Hat 8 in 2023, a new unified software tree will be created that will be shared with the upcoming McCleary cluster and the aforementioned symlink will be removed. More information on that upgrade will come next year.
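
    If you would like to confirm that your existing scripts will keep working, you can check that the old path resolves to the Palmer location given above; a minimal sketch:

        # Resolve the old Loomis software path; it should point to Palmer
        readlink -f /gpfs/loomis/apps/avx
        # expected target: /vast/palmer/apps/grace.avx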
     
    Please visit the status page at research.computing.yale.edu/system-status for the latest updates.  If you have questions, comments, or concerns, please contact us at hpc@yale.edu.
     
    Sincerely,
     
    Paul Gluhosky
     
