Grace is a shared-use resource for the Faculty of Arts and Sciences (FAS). The cluster is named for the computer scientist and United States Navy Rear Admiral Grace Murray Hopper, who received her Ph.D. in Mathematics from Yale in 1934.
Migration from Omega to Grace
If you are moving to Grace because of Omega's upcoming decommissioning, there are a few things to know. The key differences between Omega and Grace are:
- Grace uses the Slurm scheduler and you will need to translate your submission scripts accordingly.
- All the queues (called "partitions" in Slurm) on Grace are shared: unless you request exclusive node access when submitting a job, multiple jobs from different users may run on a single node. This also means that jobs that don't use shared-memory parallelism will often start sooner, since they can fit on cores scattered across a large number of nodes.
- If you used SimpleQueue on Omega, please look at our documentation for the improved Dead Simple Queue for Slurm.
- Grace has a "scavenge" queue that allows you to run on unused private resources if the public queues are very oversubscribed. See below for details.
- In addition to your "home" and "scratch" (called "project" on Grace) storage spaces, there is an additional storage space called "scratch60". See below for details.
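When translating submission scripts from Omega's scheduler to Slurm, the directives map fairly directly. Below is a sketch of a minimal Slurm batch script; the job name, module, and executable are placeholders, not real Grace defaults:

```shell
#!/bin/bash
# Sketch of a Slurm batch script for Grace. Names below are placeholders.
#SBATCH --job-name=my_analysis     # was: #PBS -N my_analysis
#SBATCH --partition=day            # was: #PBS -q <queue>
#SBATCH --ntasks=1                 # was: #PBS -l nodes=1:ppn=1
#SBATCH --time=02:00:00            # was: #PBS -l walltime=02:00:00
#SBATCH --mem-per-cpu=5G           # memory per requested core

module load MyApp                  # placeholder module name
./my_program input.dat             # placeholder executable
```

Submit the script with "sbatch myscript.sh"; "squeue -u $USER" shows your pending and running jobs (roughly equivalent to "qstat -u $USER" on Omega).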
Cleaning Out Omega Data
All Omega files are now stored solely on the Loomis GPFS system. For groups that have migrated their workloads entirely to Grace or Farnam, their Omega data is now available from Grace and Farnam for copying and clean-up until December 2018. See Cleaning Out Omega Data for instructions on retrieving your data.
If you are a first-time user, make sure to read the pertinent links from our user guide about using ssh. Once you have submitted a copy of your public key to us, you should be able to ssh to grace.hpc.yale.edu. As with the other Yale clusters, there are two login nodes; you will be randomly placed on one of them.
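For example (replace "netid" with your own Yale NetID; this assumes your public key is already on file):

```shell
# Connect to one of the Grace login nodes; "netid" is a placeholder
ssh netid@grace.hpc.yale.edu
```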
Grace uses the Slurm job scheduler. Unless users request exclusive node access when submitting jobs, multiple jobs from different users may run on a single node. To facilitate this, the scheduler will strictly enforce memory limits to ensure that all jobs have access to the memory requested for them. To see more details about how jobs are scheduled see our Job Scheduling documentation.
All partitions on Grace have a default walltime limit of 1 hour. Use the -t HH:MM:SS flag to request additional time, up to the limits listed below. Similarly, partitions have a default memory limit of 5GB per requested core. If you run into insufficient-memory errors, use the --mem-per-cpu flag to increase your job's memory limit. See our Slurm documentation for more details on requesting computing resources and submitting jobs.
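For example, to raise both limits when launching an interactive session (the partition choice and values here are illustrative):

```shell
# Request 12 hours of walltime and 10GB of memory per core
# for an interactive shell; partition and values are illustrative
srun --partition=day -t 12:00:00 --mem-per-cpu=10G --pty bash
```

The same -t and --mem-per-cpu flags work as #SBATCH directives in a batch script.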
| name | max resources/user* | max resources/group** | max walltime/job | nodes |
|---|---|---|---|---|
| interactive*** | 4 c | | 6 hours | 2 |
| day | 640 c | 900 c | 24 hours | 57, 34, 72 (by node type) |
| week | 100 c | 250 c | 7 days | 48, 6 (by node type) |
| gpu | 6 n | | 24 hours | 6 (2xK80), 6 (1xP100) |
| bigmem | 40 c | | 24 hours | 2 |
| scavenge | 6400 c | | 24 hours | all |
* "c" = cores or core equivalents (5GB of memory = 1 core), "n" = nodes
** if your group hits this limit, you will see jobs pending with the reason "MaxCpuPerAccount"
*** interactive jobs; jobs to compile/debug/test programs; etc. (limited to one batch or interactive job at a time per user)
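Since 5GB of memory counts as one core equivalent, a job's charge against these limits is roughly the larger of its core count and its memory in 5GB units. A small sketch (core_equiv is a hypothetical helper, not a Grace command):

```shell
# Estimate core equivalents charged: max(cores, ceil(mem_GB / 5)).
# core_equiv is a hypothetical helper, not a real Grace utility.
core_equiv() {
    local cores=$1 mem_gb=$2
    local mem_cores=$(( (mem_gb + 4) / 5 ))   # ceil(mem_gb / 5)
    (( mem_cores > cores )) && echo "$mem_cores" || echo "$cores"
}

core_equiv 4 40    # 4 cores but 40GB of memory -> charged as 8
core_equiv 10 10   # memory is not the binding limit  -> charged as 10
```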
See the "Hardware" section below for the flag to request specific node types.
| name | max walltime/job | nodes |
|---|---|---|
| pi_anticevic_gpu | 100 days | 8 (2xK80) |
| pi_manohar | 180 days | 8 (1xP100), 2 |
| pi_owen_miller | 28 days | 5 |
| pi_poland | 28 days | 10 |
A scavenge partition is available on Grace. It allows you to run jobs outside of your normal fairshare restriction and makes use of any unutilized cores that may be available in any partition on the cluster. However, any job running in the scavenge partition is subject to preemption if any node in use by the job is required for a job in the node's normal partition. This means that your job may be killed without advance notice, so you should only run jobs in the scavenge partition that either have good checkpoint capabilities or that can be restarted with minimal loss of progress. For this reason, keep in mind that some jobs are not good fits for the scavenge partition, such as jobs with long startup times or jobs that run a long time between checkpoint operations.
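A common pattern for preemption-tolerant work is to submit with automatic requeueing, so a preempted job simply returns to the queue and resumes from its last checkpoint. A sketch (the launcher script below is a placeholder for your own checkpoint-aware wrapper):

```shell
#!/bin/bash
# Sketch of a scavenge job that tolerates preemption via requeueing.
#SBATCH --partition=scavenge
#SBATCH --time=24:00:00
#SBATCH --requeue                  # return the job to the queue if preempted

# restart_or_start.sh is a placeholder: it should detect an existing
# checkpoint and resume from it, or start fresh if none exists.
./restart_or_start.sh
```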
Grace uses the modules system for managing software and its dependencies. See our documentation on modules here.
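Typical module commands look like the following; the module name "Python" is illustrative, and actual module names on Grace may differ:

```shell
# Search for modules matching a name ("Python" is illustrative)
module avail Python
# Load one into your environment
module load Python
# Show what is currently loaded
module list
```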
| Node Type | Processor (--constraint tag*) | Speed | Cores | Available RAM |
|---|---|---|---|---|
| IBM NeXtScale nx360 M4 | Intel Xeon E5-2660 v2 (ivybridge) | 2.20GHz | 20 | 120GB |
| Lenovo NeXtScale nx360 M5 | Intel Xeon E5-2660 v3 (haswell) | 2.60GHz | 20 | 120GB |
| Lenovo NeXtScale nx360 M5 | Intel Xeon E5-2660 v4 (broadwell) | 2.00GHz | 28 | 250GB |
*For more info on how to use constraint, please see the Slurm Documentation.
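For example, to restrict a job to one node type using the tags in the table above:

```shell
# Run only on Haswell nodes (tag from the hardware table)
sbatch --constraint=haswell myscript.sh

# or equivalently, inside the batch script itself:
#SBATCH --constraint=haswell
```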
Grace has 12 nodes in the gpu partition. Six have 2 Nvidia Tesla K80s each (each K80 has 2 GPUs, for a total of 4 GPUs per node). Six have one Nvidia Tesla P100 each. See our GPU guide for instructions on requesting GPUs for your job.
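A request might look like the sketch below, using Slurm's standard generic-resource syntax; whether Grace's GPU resource is named plain "gpu" (or typed by model) is an assumption, so confirm the exact spec in the GPU guide:

```shell
# Sketch: request one node in the gpu partition with 2 GPUs.
# The gres name "gpu" follows common Slurm convention -- verify
# the exact resource name in the GPU guide before using.
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
```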
File System: 2 PB of GPFS storage via FDR InfiniBand
Each PI group is provided with storage space for research data on the HPC clusters, separated into three tiers: home, project, and 60-day scratch. By default, each group has a 300 GB quota on home and a 1TB quota on project. You can monitor your group's usage by running the groupquota script on the cluster.
Please note: the only storage backed up on every cluster is /home
Home storage is a small amount of space to store your scripts, notes, final products (e.g. figures), etc. Home storage is backed up daily.
In general, project storage is intended to be the primary storage location for HPC research data in active use.
60-Day Scratch (scratch60)
Use this space to keep intermediate files that can be regenerated/reconstituted if necessary. Files older than 60 days will be deleted automatically. This space is not backed up, and you may be asked to delete files younger than 60 days old if this space fills up.
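To see which of your files are at risk of automatic deletion, you can list anything last modified more than 60 days ago; the scratch60 path below is an assumption, so substitute your group's actual directory:

```shell
# List scratch60 files last modified more than 60 days ago
# (candidates for automatic deletion). The path is an assumption;
# use your group's actual scratch60 directory.
find "$HOME/scratch60" -type f -mtime +60
```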
Other Storage Options
If you or your group finds these quotas don't accommodate your needs, please see the off-cluster research data storage options.
Contact us at email@example.com about purchasing cluster storage.