System Updates Archive

  • Grace Scheduled Maintenance Dec 3-4

    Dear Grace Users,

    Scheduled maintenance will be performed on the Grace cluster on December 3-4, 2024, starting at 8:00 am.
     
    Because only limited updates are needed on Grace at this time, the upcoming December maintenance will not be a full three-day downtime; disruptions will be limited.  The Grace cluster and storage will remain online and available throughout the maintenance period, and there will be no disruption to running or pending batch jobs.  However, certain services will be unavailable for short periods during the maintenance window.  Compute node availability will be reduced at times, so users may experience temporarily increased wait times.

    Maintenance will be performed on sets of nodes, in the following order.  Each set will be down briefly and then returned to service.

    Tuesday December 3:

        Login nodes (there are two nodes but only one will be down at a time)
        Globus
        Transfer node (transfer-grace.ycrc.yale.edu)
        Half of the commons nodes

    Wednesday December 4:

        The remaining commons nodes
        All PI nodes

    Note to groups with PI nodes:
    As the maintenance window approaches, the Slurm scheduler will not start any job submitted to a PI partition if the job’s requested wallclock time extends past the start of the downtime for PI nodes (8:00 am on December 4, 2024).  If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.”  (If your job can actually be completed in less time than you requested, you may be able to avoid this by requesting an appropriate time limit using “-t” or “--time”.)  Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.
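
    For example, you can check why a job is pending and resubmit it with a wallclock limit short enough to finish before 8:00 am on December 4.  This is a minimal sketch; the partition name and script name are placeholders, so substitute your own:

        # Show your pending jobs and their reasons (look for ReqNodeNotAvail)
        squeue -u $USER -t PENDING -o "%.10i %.12P %.20j %.12r"

        # Request only the time the job actually needs, e.g. 6 hours
        sbatch --partition=pi_example --time=06:00:00 my_job.sh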

    The Message of the Day (MOTD) will be updated throughout the maintenance period to report the current status.  An email notification will be sent when the maintenance is completed.

    Please visit the status page at research.computing.yale.edu/system-status for the latest updates.  If you have any questions, comments, or concerns, please contact us at hpc@yale.edu.

     

  • McCleary Scheduled Maintenance

    Scheduled maintenance will be performed on the McCleary cluster, starting on Tuesday, October 15, 2024, at 8:00 am. Maintenance is expected to be completed by the end of day, Thursday, October 17, 2024.

    During the maintenance, logins to the cluster will be disabled.  We ask that you save your work, close interactive applications, and log off the system prior to the start of the maintenance.  An email notification will be sent when the maintenance has been completed and the cluster is available.

    As the maintenance window approaches, the Slurm scheduler will not start any job if the job’s requested wallclock time extends past the start of the maintenance period (8:00 am on October 15, 2024).  You can run the command “htnm” (short for “hours_to_next_maintenance”) to determine the number of hours until the next maintenance period, which can aid in submitting jobs that will run before maintenance begins.  If you run squeue, such jobs will show as pending jobs with the reason “ReqNodeNotAvail.”  (If your job can actually be completed in less time than you requested, you may be able to avoid this by requesting an appropriate time limit using “-t” or “--time”.)  Held jobs will automatically return to active status after the maintenance period, at which time they will run in normal priority order.  All running jobs will be terminated at the start of the maintenance period.  Please plan accordingly.
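
    For example, the output of “htnm” can be used to choose a time limit that ends before the maintenance window.  This is a minimal sketch, assuming “htnm” prints a whole number of hours; the script name is a placeholder:

        # Hours remaining until the next maintenance period
        HOURS=$(htnm)

        # Request slightly less than that so the job can finish before maintenance begins
        sbatch --time="$((HOURS - 1)):00:00" my_job.sh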

    Please visit the status page at research.computing.yale.edu/system-status for the latest updates.  If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

     

  • Milgram Scheduled Maintenance

    Dear Milgram Users,
     

    Scheduled maintenance will be performed on the Milgram cluster starting on Tuesday, August 20, 2024, at 8:00 am.  Maintenance is expected to be completed by the end of day, Thursday, August 22, 2024.

    During the maintenance, logins to the cluster will be disabled.  We ask that you save your work, close interactive applications, and log off the system prior to the start of the maintenance.  An email notification will be sent when the maintenance has been completed and the cluster is available.

    Please visit the status page at research.computing.yale.edu/system-status for the latest updates.  If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

     

  • 7/10/2024 11:40am RESOLVED: Grace fully operational after power issue

    7/10/24 - 11:40am - An electrical supply issue at the HPC Data Center that started midday on 7/8/2024 brought down Grace nodes with names beginning with r801 through r806.  Repairs are now complete and normal operation has been restored.  However, any jobs running on those nodes at the start of the outage were terminated.  Please check the status of your HPC jobs.

  • Milgram Globus Maintenance 6/10 @ 9am

    Milgram Globus services will be unavailable Monday, 6/10/24, beginning at 9am, while an upgrade is performed. The service is expected to be available by the end of the day.

    If you have any questions or concerns, please contact hpc@yale.edu.

  • Grace Scheduled Maintenance

    Dear Grace Users,
     
    Scheduled maintenance will be performed on the Grace cluster starting on Tuesday, June 4, 2024, at 8:00 am.  Maintenance is expected to be completed by the end of day, Thursday, June 6, 2024.

    During this time, logins will be disabled, running jobs will be terminated, and connections via Globus will be unavailable.  We ask that you save your work, close interactive applications, and log off the system prior to the start of the maintenance.  An email notification will be sent when the maintenance has been completed and the cluster is available.
     
    Please visit the status page at research.computing.yale.edu/system-status for the latest updates.  If you have questions, comments, or concerns, please contact us at hpc@yale.edu.

  • Issue with Milgram Connections on VPN

    3/28/24 - ITS is currently working to resolve an issue that prevents connections to Milgram over the access.its.yale.edu VPN.  Until the issue is resolved, users are asked to point their VPN clients to vpn3.its.yale.edu, vpn5.its.yale.edu, or vpn6.its.yale.edu.

    Please contact hpc@yale.edu if you have any questions.

  • Scheduled Maintenance on Milgram

    Dear Milgram Users,
     
    We write to remind you that scheduled maintenance will be performed on the Milgram cluster starting on Tuesday, February 6, 2024, at 8:00 am.  Maintenance is expected to be completed by the end of day, Thursday, February 8, 2024.
     
    During the maintenance, logins to the cluster will be disabled.  We ask that you save your work, close interactive applications, and log off the system prior to the start of the maintenance.  An email notification will be sent when the maintenance has been completed and the cluster is available.
     
    Upgrade to Red Hat 8
    Milgram’s current operating system, Red Hat Enterprise Linux (RHEL) 7, will reach its official end of life in 2024 and will no longer receive security patches from the developer.  Therefore, Milgram will be upgraded to RHEL 8 during this maintenance.
     
    Changes to Interactive Partitions and Jobs
    We are making two changes to interactive jobs during the upcoming maintenance.  
     
    The ‘interactive’ and ‘psych_interactive’ partitions will be renamed to ‘devel’ and ‘psych_devel’, respectively, to bring Milgram into alignment with the other clusters.  This change has already been made on the other clusters in recognition that interactive-style jobs (such as OnDemand and ‘salloc’ jobs) are commonly run outside of the ‘interactive’ partition.  Please adjust your workflows accordingly after the maintenance, as shown in the example below.
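     
    For example, a batch script that previously targeted the ‘interactive’ partition would only need its partition directive updated.  This is a minimal sketch; the job name, time limit, and command are placeholders:

        #!/bin/bash
        # Placeholder job name and command; only the partition line changes
        #SBATCH --job-name=example_job
        # Before the maintenance this would have been: #SBATCH --partition=interactive
        #SBATCH --partition=devel
        #SBATCH --time=01:00:00

        ./example_program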
     
    Additionally, all users will be limited to 4 interactive app instances (of any type) at one time.  Additional instances will be rejected until you delete older open instances.  For OnDemand jobs, closing the window does not terminate the interactive app job.  To terminate the job, click the “Delete” button in your “My Interactive Apps” page in the web portal.
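     
    If you prefer the command line to the web portal, standard Slurm commands can also be used to find and terminate an interactive app job.  This is a minimal sketch; replace the job ID with one taken from the squeue output:

        # List all of your jobs, including OnDemand interactive apps
        squeue -u $USER

        # Cancel a specific job by its job ID
        scancel 12345678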
     
    Please visit the status page at research.computing.yale.edu/system-status for the latest updates.  If you have questions, comments, or concerns, please contact us at hpc@yale.edu.
     
    Sincerely,
     
    Paul Gluhosky
     
  • Problem with Grace cluster

    Thursday, January 25, 2024 - 3:20pm
    1/25 3:20pm The Grace cluster is currently experiencing a problem with home directories that is causing logins to fail and causing any processes or jobs that require home directory access to hang.  We are aware of the issue and are working with the storage vendor to restore access as quickly as possible.  We will post further updates here as they become available.  We apologize for the inconvenience.
  • Gibbs Available

    Wednesday, January 3, 2024 - 4:37pm
    1/3 4:37pm The Gibbs filesystem (pi and project storage) is now back online and accessible.  We understand the disruptions this has caused and sincerely apologize for any inconvenience.
     
    Gibbs is now functioning normally. However, we are still actively investigating the root cause of the outage to prevent similar incidents from occurring in the future.
