System Updates Archive

  • Grace file system offline - fixed.

    Tuesday, January 6, 2015 - 4:30pm to 5:00pm

    The Grace file system went offline at 4:30pm. We are currently looking into it; the ETA is 30 minutes. Running jobs were lost due to the outage. Please check your jobs.

  • Scratch2 quota problems on Louise

    Tuesday, December 30, 2014 - 8:28pm to Wednesday, January 21, 2015 - 5:00pm

    Update 21-Jan-2015: The quota reporting issue has been resolved.

    Update 12-Jan-2015 12:08 EST: The downtime is complete. Space usage is being recalculated and must finish before quotas are re-enabled.

    Update 12-Jan-2015 10:49 EST: The following filesystems will be unmounted and then re-mounted at 12:00 EST today:

    • /scratch
    • /scratch2
    • /home2
    • /data2
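
    Once the remount completes, a quick check of the affected mounts will confirm they are back, for example:

    $ df -h /scratch /scratch2 /home2 /data2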


    Update 08-Jan-2015 12:51 EST: The quota remediation steps will be executed at 9:00am Fri 09-Jan-2015. Please note that this work may affect running jobs. More details to follow.

    Update 08-Jan-2015 10:47 EST: An action plan has been formulated by Hitachi Support. The plan is being reviewed and will be executed soon.

    Update 03-Jan-2015 19:56 EST: The quotas have been temporarily removed until the next filesystem check step is complete. You should not experience any false “Disk quota exceeded” messages in the near term.

    Update: The quota on Virtual Volume “/scratch2” was increased to 220TB in an attempt to compensate for the quota mis-reporting; this is nearly double the size of the underlying filesystem. In the interim, Hitachi engineers had us execute a “checkfs” on the filesystem, which by our estimates should complete in ~24 hours. Unfortunately, the quota limit was reached and we cannot increase it until the checkfs completes. At this point, we will likely delete all quotas on /scratch2 until the issue is resolved.

    Hitachi has two potential fixes for the quota mis-reporting issue. One of them will require unmounting the underlying filesystem. We will report back as soon as we know if that course of action is recommended.


    We are currently battling a quota issue on Louise’s Hitachi storage. For several users, /scratch2 is reporting much higher usage than actual. If you experience disk quota issues in /scratch2 while under quota, please contact hpc@yale.edu and we will temporarily increase your quota so processing can continue.
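
    To see whether you are affected, compare what the quota system reports against your actual on-disk usage (a sketch; the per-user directory layout on /scratch2 is an assumption):

    $ quota -s                  # usage as reported by the quota system
    $ du -sh /scratch2/$USER    # actual on-disk usage of your directory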


    We are actively working with Hitachi to resolve this issue.

  • Louise maintenance December 16th - 8:30am - Completed

    Tuesday, December 16, 2014 - 8:30am to 9:30am

    We will take a brief downtime on Louise in order to resynchronize the quotas on the storage system. The downtime itself should last less than an hour. Unfortunately, because home directories are involved, we will have to take the login nodes offline and stop all running jobs. The downtime will take place 16-Dec-2014 at 8:30am.

  • Network Maintenance

    Sunday, November 23, 2014 - 12:30am to 1:30am

    In response to the recent power outage at the 300 George St. data center, Yale ITS’s Data Network Operations team is making changes to make the network more resilient. This requires a maintenance window from 5:30-6:30am on Sunday, November 23. All login nodes will be inaccessible during this time. Interactive jobs will be lost; all other jobs will continue to run without interruption.

  • Omega Scratch online

    Saturday, November 15, 2014 - 7:02pm
  • Omega Scratch unresponsive. Saturday Nov 15.

    Saturday, November 15, 2014 - 1:08pm to 2:02pm

    The Omega scratch file system became unresponsive this afternoon. We are looking into the issue. The file system should be available in the next 45 minutes. Please check your jobs' output.

  • Planned Network Maintenance - brief outage 11/23/2014

    Sunday, November 23, 2014 - 12:30am to 1:30am

    In response to the recent power outage at the 300 George St. data center, Yale ITS’s Data Network Operations team is making changes to make the network more resilient. This requires a downtime from 5:30-6:30am on Sunday, November 23. All login nodes will be inaccessible during this time. The job schedulers on the clusters will not be affected; only connectivity to the login nodes will be interrupted.

  • New HPC Cluster Grace

    Thursday, May 29, 2014 - 5:39am

    The High Performance Computing team is pleased to announce the upcoming availability of a new HPC cluster, Grace. As final testing concludes, we would like to share more information about Grace, which is named after computer scientist and United States Navy Rear Admiral Grace Murray Hopper, who received her Ph.D. in Mathematics from Yale in 1934.

    Grace is an IBM System x High Performance Computing Cluster installed at Yale’s West Campus Data Center. The cluster consists of 72 compute nodes, each with 20 cores and 128GB of RAM (1,440 cores in total). The processors are Intel Xeon E5-2660 v2 processors running at 2.2GHz. All nodes run RHEL 6.4. Attached storage is in the form of 1PB of GPFS (General Parallel File System). The cluster nodes are connected internally via FDR InfiniBand.

    The expected general availability date is Monday June 2nd at 12:00pm.

    All users with accounts on the BulldogJ, BulldogK and Omega clusters will be provisioned an account on Grace. Like Omega, Grace will support only SSH key authentication. For those with accounts on Omega, we will copy the public key(s) installed on Omega to Grace. Other users will receive an email notification containing instructions for providing a public SSH key.
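
    For users who need to supply a key, generating a pair takes one command (a sketch using common defaults):

    $ ssh-keygen -t rsa -b 4096    # writes ~/.ssh/id_rsa (private) and ~/.ssh/id_rsa.pub (public)
    $ cat ~/.ssh/id_rsa.pub        # send only the public key; never share the private key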

    Please be aware that the scheduling, managing, monitoring and reporting of cluster workloads on Grace will be handled differently from other clusters. Instead of Moab, Grace will run the IBM Platform LSF (or simply, LSF) scheduler. Simplified documentation will be made available shortly by the HPC team, and comprehensive documentation will be added to the HPC website.
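
    As a preview, where Moab users typically submit with qsub or msub, a minimal LSF submission looks like the following (a sketch; the resource values and program name are illustrative, not Grace-specific settings):

    $ cat job.sh
    #!/bin/bash
    #BSUB -J myjob           # job name
    #BSUB -n 4               # number of cores
    #BSUB -W 2:00            # wall-clock limit, hh:mm
    #BSUB -o myjob.%J.out    # output file (%J expands to the job ID)
    ./my_program
    $ bsub < job.sh          # bsub reads the #BSUB directives from stdin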

    Please consider the following guidelines when deciding where to run jobs:
    Run on Grace if:

    1. your jobs do not depend heavily on MPI
    2. your jobs require many cores and/or lots of memory on a single node
    3. your jobs are embarrassingly parallel (e.g. they use SimpleQueue; see the sketch below)
    4. your jobs can share a node with other jobs
    5. your jobs tend to use many small-to-medium files

    Run on Omega if:

    1. you depend on MPI to run in parallel on large numbers of cores/nodes
    2. you need high-performance (parallel) I/O, particularly with large files
    3. you have special node/queue privileges on Omega
    4. your jobs require exclusive node access
    5. you need GPUs

    In other cases, you may wish to select a cluster based on the observed cluster load or according to which one has the proper software for your work.
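
    As an illustration of the embarrassingly parallel case in item 3 above, tools like SimpleQueue work from a plain task file of independent commands, one per line (the file and program names here are hypothetical):

    $ cat tasks.txt
    ./analyze --input sample01.dat --output results01.txt
    ./analyze --input sample02.dat --output results02.txt
    ./analyze --input sample03.dat --output results03.txt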

    As previously communicated, steps are being taken to decommission the BulldogJ and BulldogK clusters. If you are affected by this, please begin identifying data that needs to be retained and moved elsewhere. It is strongly recommended that unnecessary data be deleted at this time, prior to migration. It should be expected that it may take up to 10 hours to transfer 1TB of data to a new location, though actual transfer times will depend on numerous factors. Data transfers must be completed by July 11th.
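
    For the transfers themselves, a resumable copy with rsync is one reasonable approach (a sketch; the hostname and paths are placeholders). For scale, 10 hours per TB corresponds to roughly 28MB/s of sustained throughput.

    $ rsync -avP /path/on/bulldogj/ netid@grace.hpc.yale.edu:/destination/path/
      # -a preserves permissions and times, -v is verbose,
      # -P shows progress and lets an interrupted copy resume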

    If you are currently using the Omega cluster, please consider using Grace instead, based on the guidelines above.

    If you have any questions or concerns about this exciting new offering, please contact hpc@yale.edu.

  • ulimits placed on Omega login node

    Wednesday, June 4, 2014 - 2:40pm

    The following ulimits have been placed on the Omega login node:

    • processes reniced to 10
    • resident memory per process limited to 1GB
    • total virtual memory limited to 2GB
    • 500 processes per user
    $ ulimit -u    # maximum number of user processes
    500

    $ ulimit -v    # virtual memory limit, in KB (2097152 KB = 2GB)
    2097152

    $ ulimit -m    # resident memory limit, in KB (1048576 KB = 1GB)
    1048576

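    Limits like these are commonly enforced through pam_limits; a hypothetical /etc/security/limits.conf fragment matching the values above might look like this (the actual mechanism used on the login node is not specified here):

    # hypothetical limits.conf entries; memory values are in KB
    *    hard    nproc       500        # max processes per user
    *    hard    as          2097152    # address space (virtual memory), 2GB
    *    hard    rss         1048576    # resident set size per process, 1GB
    *    -       priority    10         # new processes start at nice 10
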
  • Bulldog J/K Decommission

    Thursday, June 5, 2014 - 3:34pm

    The High Performance Computing team is pleased to announce the upcoming availability of a new HPC cluster, Grace. Final acceptance testing of the cluster is in progress, and the process of decommissioning the BulldogJ and BulldogK clusters has begun. Over the next two months, data and jobs will need to be migrated to other clusters.

    What do I need to know?

    The schedule of events to support the decommissioning is as follows:

    • May 27: Job submission to the fas_very_long queue will be suspended.
    • June 16: Job submission to all other queues will be suspended.
    • June 27: All compute nodes will be turned off by this date; any active jobs will be cancelled.
    • July 11: All data must be moved off BulldogJ and BulldogK by this date.

    What should you do now?

    • Plan ahead for the suspension of the job queues on the dates listed.
    • Begin identifying data that needs to be retained and moved elsewhere. It is strongly recommended that unnecessary data be deleted at this time, prior to migration. It should be expected that 1TB of data will take approximately 10 hours to transfer to a new location; actual times depend on network traffic in and out of the clusters, as well as your local network connection.

    Assistance will be available to any researchers who need help identifying a new location for their data and jobs, or with moving the data. If you have any questions or concerns about this work or the schedule, please contact hpc@yale.edu.

    Thank you in advance for your cooperation.
