System Updates Archive

  • Scheduled Maintenance on Ruddle

    Dear Ruddle Users,
     
    As a reminder, scheduled maintenance will be performed on Ruddle beginning Tuesday, November 5, 2019, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, November 7, 2019. Please note that this maintenance begins and ends one day later than previously stated on the YCRC website.
     
    During this time, logins will be disabled, running jobs will be terminated, and connections via Globus will be unavailable. The GPFS storage, including all YCGA sequencing data, will not be available. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the cluster is available.
     
    Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.
     
    Sincerely,
     
    Paul Gluhosky
     
  • Scheduled Maintenance on Grace

    Dear Grace, Omega and Farnam Users,

    Out-of-cycle scheduled maintenance will be performed on Grace and Omega beginning Monday, November 4, 2019, at 8:00 am. Maintenance is expected to be completed by the end of the day Monday. 

    This maintenance is required to prepare the InfiniBand network for the upcoming deployment of additional common and PI-purchased nodes. As such, there are no changes that impact users and no new functionality is being introduced. We do, however, plan to bring 174 new common nodes and 76 PI-purchased nodes online soon after the maintenance window, once the necessary validation testing has been completed.

    During this time, logins will be disabled and connections via Globus will be unavailable. The Loomis storage will not be available on the Farnam cluster, but the cluster itself will remain available. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the clusters are available.

    As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on November 4, 2019). If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail”. (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.) Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.
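
    For example, you might check which of your jobs are being held and resubmit with a time limit that finishes before the window; the script name below is a placeholder:

    squeue -u $USER                    # jobs held for the maintenance show the reason “ReqNodeNotAvail”
    sbatch --time=04:00:00 my_job.sh   # request only the wallclock time the job actually needs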

    Please visit the status page at research.computing.yale.edu/system-status for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

  • Scheduled Maintenance on Ruddle

  • Farnam Maintenance Extended

    UPDATE: Farnam maintenance will be extended until Friday morning, September 13, 2019.

    Dear Farnam Users,

    We will be performing preventative maintenance to ensure stable operation of the cluster. During this time, logins will be disabled and Farnam storage will be unavailable on all clusters. An email notification will be sent when the maintenance has been completed and the cluster is available. We have engaged the storage vendor to investigate performance issues during the maintenance. Depending on what they discover, it is possible that the return to service will be delayed. If that happens, we will inform you.

    Please note that any job whose requested time limit implies that it will still be running when the maintenance begins at 8:00 am on 9/9 will wait with a reason that starts with “ReqNodeNotAvail”. These jobs will be eligible to run after maintenance completes if left in the queue. You can run the command htnm (short for hours_to_next_maintenance) to get the number of hours until the next maintenance period, which can help you submit jobs that will finish before maintenance begins.
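
    For example (the script name and time limit below are placeholders):

    htnm                               # prints the number of hours until the maintenance period begins
    sbatch --time=12:00:00 my_job.sh   # request a wallclock limit shorter than the hours reported above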
     

    Summary of major maintenance changes

    • Decommission compute nodes with Sandy Bridge (Intel Xeon E5-2670) and Bulldozer (AMD Opteron 6276) CPUs.
    • Bring online Cascade Lake (Intel Xeon Gold 6240) compute nodes. Please see the compute nodes table on the Farnam page for details.
    • Move Project and Scratch directories to make them consistent across clusters. We will create symlinks to maintain compatibility with old paths; a quick check is sketched after the example below.

      Example:
      /gpfs/ysm/project/netid will become /gpfs/ysm/project/group/netid
      /gpfs/ysm/scratch60/netid will become /gpfs/ysm/scratch60/group/netid
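
      After the move, a quick check of the compatibility symlinks might look like the following (replace netid and group with your own netID and group name):

      ls -ld /gpfs/ysm/project/netid        # the old path should appear as a symlink
      readlink -f /gpfs/ysm/project/netid   # should resolve to /gpfs/ysm/project/group/netid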

    Please visit the status page on research.computing.yale.edu for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

  • Farnam Scheduled Maintenance

    Reminder: Farnam Scheduled Maintenance

    Dear Farnam Users,

    We will perform scheduled maintenance on Farnam starting Monday, September 9, 2019, at 8:00 am through the end of the day on Wednesday, September 11, 2019. We will be performing preventative maintenance to ensure stable operation of the cluster. During this time, logins will be disabled and Farnam storage will be unavailable on all clusters. An email notification will be sent when the maintenance has been completed and the cluster is available. We have engaged the storage vendor to investigate performance issues during the maintenance. Depending on what they discover, it is possible that the return to service will be delayed. If that happens, we will inform you.

    Please note that any job whose requested time limit implies that it will still be running when the maintenance begins at 8:00 am on 9/9 will wait with a reason that starts with “ReqNodeNotAvail”. These jobs will be eligible to run after maintenance completes if left in the queue. You can run the command htnm (short for hours_to_next_maintenance) to get the number of hours until the next maintenance period, which can help you submit jobs that will finish before maintenance begins.
     

    Summary of major maintenance changes

    • Decommission compute nodes with Sandy Bridge (Intel Xeon E5-2670) and Bulldozer (AMD Opteron 6276) CPUs.
    • Bring online Cascade Lake (Intel Xeon Gold 6240) compute nodes. Please see the compute nodes table on the Farnam page for details.
    • Move Project and Scratch directories to make them consistent across clusters. We will create symlinks to maintain compatibility with old paths.

      Example:
      /gpfs/ysm/project/netid will become /gpfs/ysm/project/group/netid
      /gpfs/ysm/scratch60/netid will become /gpfs/ysm/scratch60/group/netid

    Please visit the status page on research.computing.yale.edu for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

  • /gpfs/ysm and /gpfs/slayman are unavailable from Grace

    Friday, September 6, 2019 - 2:30pm

    To resolve the filesystem issues, and in anticipation of the upcoming Farnam maintenance on September 9, the filesystems /gpfs/ysm and /gpfs/slayman are temporarily unavailable on Grace. We expect to remount those filesystems following the Farnam maintenance. If you need to move data from /gpfs/ysm to /gpfs/loomis, log in to Farnam (which still mounts /gpfs/ysm, /gpfs/slayman, and /gpfs/loomis) to initiate the transfer.
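
    For example, a transfer initiated from a Farnam login node might look like the following (both paths are illustrative; substitute your own directories):

    rsync -av /gpfs/ysm/project/netid/mydata/ /gpfs/loomis/project/netid/mydata/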

    If you had jobs running on Grace that were accessing /gpfs/ysm, they will have failed due to the unmounting process. We apologize for the inconvenience.

  • Login and Degraded Filesystem Issues on Grace and Farnam

    Friday, September 6, 2019 - 12:00pm

    Users of the Grace and Farnam clusters may be experiencing slow response time when logging in and executing interactive commands. The support team is investigating and engaging with the storage vendor to resolve the problem. Further information will be posted here when available.

    Update 2:30pm: The login issue has been resolved.

  • Scheduled Maintenance on Grace

    8/26 - 08:00 AM 

    Dear Grace, Omega and Farnam Users,
     

    Scheduled maintenance will be performed on Grace and Omega beginning Monday, August 26, 2019, at 8:00 am. Maintenance is expected to be completed by the end of day, Wednesday, August 28, 2019. During this time, logins will be disabled and connections via Globus will be unavailable. The Loomis storage will not be available on the Farnam cluster. We ask that you log off the system prior to the start of the maintenance, after saving your work and closing any interactive applications. An email notification will be sent when the maintenance has been completed and the clusters are available.

    Aside from the system and security updates we perform during maintenance, we want you to know about the following changes.
     
    Module Changes on Grace: The software available via the modules system on Grace is being upgraded to be more consistent with our other clusters. During the maintenance, we will change the default module list to a new module collection. We encourage you to look at the new collection today if you have not already done so. To try the new collection, run the following on the login node:

    source /apps/bin/try_new_modules.sh

    Then you can run “module avail” to see the list of available software in the new collection. To return to the old collection, simply log out and log back into the cluster. The old installations will remain, but all new software will be installed into the new collection. More information about this transition is available on our website at http://docs.ycrc.yale.edu/clusters-at-yale/applications/new-modules-grace.
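
    A typical session in the new collection might look like the following (the module name is hypothetical; substitute one from the “module avail” listing):

    module avail         # list the software available in the new collection
    module load GCC      # hypothetical module name; load whichever module you need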
     
    Grace Login Node IP Addresses: During the maintenance, the IP addresses of Grace’s two login nodes will change to 10.181.0.11 and 10.181.0.14. If part of your workflow involves directly accessing or whitelisting these IP addresses, you will need to update them. Most users will not be impacted by this change.
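
    For example, anything that connects by IP address rather than hostname, such as a direct SSH command or a firewall rule, will need the new addresses (the username below is a placeholder):

    ssh netid@10.181.0.11    # connect directly to a Grace login node at its new address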

    As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on August 26, 2019). If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail”. (If your job can actually be completed in less time than you requested, you may be able to avoid this by making sure that you request the appropriate time limit using “-t” or “--time”.) Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.

    Please visit the status page at research.computing.yale.edu for the latest updates. If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

  • UPDATED 8/14: Farnam Filesystem Performance

    UPDATE - 8/14/2019 - 3:30 pm - YCRC staff continue to work with our storage vendor to monitor Farnam’s performance. Please notify hpc@yale.edu if you experience any issues. 

  • Small number of node failures

    Tuesday, July 23, 2019 - 1:00pm

    Due to thunderstorms, we experienced power issues that caused some HPC cluster compute nodes to shut down, resulting in possible job failures. The impact appears to have been limited to a small number of nodes on Grace, Ruddle, Farnam, and Omega.
