Grace

About

[Image: Grace Hopper and UNIVAC]

Grace is a shared-use resource for the Faculty of Arts and Sciences (FAS). The cluster is named for the computer scientist and United States Navy Rear Admiral Grace Murray Hopper, who received her Ph.D. in Mathematics from Yale in 1934.

Logging in

If you are a first-time user, make sure to read the pertinent sections of our user guide about using SSH. Once you have submitted a copy of your public key to us, you should be able to ssh to grace.hpc.yale.edu. As with the other Yale clusters, there are two login nodes; you will be randomly placed on one of them.
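
For example, assuming your login name is your Yale NetID (netid below is a placeholder; substitute your own), you would connect with:

ssh netid@grace.hpc.yale.edu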

Partitions

Grace now uses a different scheduler from the old Grace, called Slurm, so you will need to translate any existing LSF submission scripts to Slurm. See our Slurm documentation for more information on using the scheduler. As on the old Grace, all nodes are shared: multiple jobs will run on a single node if each requests fewer than the total number of cores on that node. A sample submission script is shown after the table below.

name         max cores/job   max walltime/job   description
interactive  4               6 hours            compiling/debugging/testing programs
day          640             24 hours
week         100             7 days
gpu          20              24 hours
bigmem       40              24 hours           1.5 TB of RAM per node
scavenge     20              24 hours           1 node limit per job
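
As a rough sketch of a translated submission script (the job name, module name, and executable are placeholders; adjust the partition, core count, and walltime to your job and the limits above):

#!/bin/bash
#SBATCH --partition=day          # one of the partitions listed above
#SBATCH --job-name=my_analysis   # placeholder job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4        # request 4 cores; the node may be shared with other jobs
#SBATCH --time=12:00:00          # must fit within the partition's walltime limit (24 hours for day)

module load MyApp                # placeholder module name
./my_program                     # placeholder executable

Submit the script with sbatch, for example: sbatch job.sh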

Scavenge Partition

The scavenge partition is a new partition available on Grace. It allows you to run jobs outside of your normal fairshare restrictions and to make use of idle cores that are not otherwise available via the public partitions. However, note that any job on the scavenge partition is subject to preemption if the node it is using is required by a job on that node's normal private partition. This means your job will be killed immediately, so only run jobs on the scavenge partition that either checkpoint well or can otherwise be restarted with minimal loss of progress. For this reason, keep in mind that not all jobs are a good fit for the scavenge partition, such as jobs with a long start-up time or jobs that go a long time between checkpoints.
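
A minimal sketch of a scavenge job, assuming your application can resume from its own checkpoints (the executable and its flag are placeholders, and whether Slurm requeues a preempted job depends on the cluster's configuration):

#!/bin/bash
#SBATCH --partition=scavenge
#SBATCH --requeue                        # ask Slurm to put the job back in the queue if it is preempted
#SBATCH --cpus-per-task=8
#SBATCH --time=24:00:00

# The application itself is responsible for resuming from its most recent checkpoint
./my_program --resume-from-checkpoint    # placeholder executable and flag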

Software

Grace uses the modules system for managing software and its dependencies. See our documentation on modules here.
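
For example (the module name below is only an illustration; use module avail to see what is actually installed):

module avail              # list all available modules
module avail python       # search for modules whose names match "python"
module load Python        # load a module; the exact name and version will vary
module list               # show the modules currently loaded in your session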

Compute Hardware

Node Type                  Processor (--constraint tag)         Speed    Cores  RAM
IBM NeXtScale nx360 M4     Intel Xeon E5-2660 v2 (ivybridge)    2.20GHz  20     128G
Lenovo NeXtScale nx360 M5  Intel Xeon E5-2660 v3 (haswell)      2.60GHz  20     128G
Lenovo NeXtScale nx360 M5  Intel Xeon E5-2660 v4 (broadwell)    2.00GHz  28     256G
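
To restrict a job to one of these node types, pass the constraint tag from the table to Slurm, for example:

#SBATCH --constraint=haswell    # run only on the Haswell (E5-2660 v3) nodes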

GPUs

Grace has 6 nodes in the gpu partition, each with 2 NVIDIA Tesla K80 cards; each card contains 2 GPUs, for a total of 4 GPUs per node. See our GPU guide for instructions on requesting GPUs for your job.
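
As a sketch, a batch script requesting GPUs might include directives like the following (the exact gres specification depends on the cluster's Slurm configuration, and the module and executable names are placeholders; see the GPU guide for the authoritative syntax):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2            # request 2 of a node's 4 GPUs; exact gres syntax may differ
#SBATCH --time=12:00:00

module load CUDA                # placeholder module name
./my_gpu_program                # placeholder executable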

Storage

File System: 2 PB of GPFS storage via FDR InfiniBand

By default, each group has a 300 GB quota on home storage and a 1 TB quota on project storage. A group's usage can be monitored using the groupquota.sh script available at:

/gpfs/apps/bin/groupquota.sh

Storage Tiers

Each PI group is provided with storage space for research data on the HPC clusters. The storage is separated into three tiers: home, project, and temporary scratch.

Home

Home storage is designed for reliability, rather than performance. Do not use this space for routine computation. Use this space to store your scripts, notes, etc. Home storage is backed up daily.

Project

In general, project storage is intended to be the primary storage location for HPC research data in active use. Project storage is not backed up.

60-Day Scratch (scratch60)

This temporary storage should typically give you the best performance. Files older than 60 days are deleted automatically. This space is not backed up, and you may be asked to delete files less than 60 days old if the space fills up.

Other Storage Options

If these quotas do not accommodate your group's needs, contact us at hpc@yale.edu.

You can also mount Storage@Yale (S@Y), a service offered by Yale ITS to University members. Note that an S@Y share mounted on a cluster will not be available for mounting elsewhere. To request that S@Y be mounted on the clusters, fill out our S@Y Request Form.