Yale Center for Research Computing (YCRC) HPC Policies
July 19, 2019
Table of Contents
- Access and Accounts
- System Administration
- Compute Resources
- Data & Security Considerations on YCRC Resources
- Resource Scheduling and Jobs
- Research Support and User Assistance
- Account Expiration
- Hardware Acquisition and Lifecycle
- Facilities, Equipment, and Other Resources
The Yale Center for Research Computing (YCRC) is a component of the Provost’s Office, reporting jointly to the Vice Provost for Research, the Deputy Dean for Academic & Scientific Affairs at the Yale School of Medicine, and the university’s Chief Information Officer. It advances research at Yale by administering a sustainable state-of-the-art computational infrastructure, providing technology services, and facilitating an interdisciplinary approach to the development and application of advanced computing and data processing technology throughout the research community.
The YCRC operates high performance computing (HPC) resources in support of faculty research in a wide range of disciplines across the university. Upon request, any Yale faculty member with a primary or secondary appointment in a unit of the Faculty of Arts and Sciences or the Yale School of Medicine will be provided with a Principal Investigator (PI) group account on the YCRC’s HPC resources. Other university faculty or staff may apply for such an account, subject to additional administrative approval. Once a group account has been created, and with approval of the PI, members of a PI’s research group may request individual accounts under the group account. The PI is ultimately responsible for all use of YCRC resources by members of the PI’s group.
Because of its focus on computational research conducted by faculty, research staff, postdocs, and graduate students, the YCRC provides only limited access to undergraduate students, most often when they are part of a PI’s research group or enrolled in a class that makes use of YCRC resources. In unusual circumstances, undergraduate students may request use of YCRC resources for independent research under the oversight of a faculty sponsor/advisor. Such requests will be considered on a case-by-case basis, and approval will be at the discretion of the YCRC.
The YCRC encourages appropriate use of its resources for classes. Instructors planning use of YCRC resources in their classes must consult with the YCRC and obtain prior approval. Accounts created for class use will be disabled and removed once the class is over.
Policies regarding account creation and access to HPC resources are subject to change.
The YCRC’s systems administration group administers all HPC systems, including machines (known as HPC clusters), storage facilities, and related datacenter networking. Since these systems are intended to support research applications and environments, they are designed and operated to achieve maximum levels of performance, rather than high availability or other possible goals. Therefore, the systems administration group’s primary responsibility is to maximize system stability and uptime while maintaining high performance. To that end, all systems are subject to regular maintenance periods as listed on the System Status Page on the YCRC website. During these maintenance periods and, rarely, at other times, some or all of the YCRC HPC resources may not be available for use.
All users may have access to compute resources on one or more YCRC clusters, subject to reasonable limits and YCRC discretion. Users are expected to do their best to use resources efficiently and to release idle resources. User accounts are personal to individual users and may not be shared. Under no circumstances may any user make use of another user’s account. Users are expected to follow standard security practices to ensure the safety and security of their accounts and data. (See the Data & Security section below.)
All users have access to limited amounts of storage in home, project, and short-term scratch directories, free of charge. Quotas are used to limit the amount of storage and the number of files per user and/or group. Users are expected to do their best to delete files they no longer need. (Files in short-term scratch directories are purged automatically after 60 days.) Faculty PIs are responsible for all storage usage by members of their groups. Upon request and subject to YCRC discretion, additional storage may be provided, which may or may not incur a cost.
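As a minimal sketch of how a user might spot short-term scratch files that are approaching the 60-day purge, the Python function below lists files whose last modification is older than a chosen number of days. It assumes that file age is measured by modification time, which this policy does not state. The function name and the 50-day threshold in the usage note are hypothetical.

```python
import os
import time


def files_older_than(root, days, now=None):
    """Return sorted paths under `root` last modified more than `days` days ago.

    Assumption: age is measured by modification time (mtime). The policy
    does not say which timestamp the automatic purge actually uses.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds per day
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    old.append(path)
            except OSError:
                # The file vanished or cannot be examined; skip it.
                continue
    return sorted(old)
```

For example, calling files_older_than on a scratch directory with days=50 would report files with roughly ten days left under the assumption above, leaving time to copy or delete them.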
Users are not permitted to use or store high risk, sensitive, or regulated data of any sort (e.g., HIPAA data or PHI) at any time on any YCRC facilities other than those specifically designated for such data.
Security of YCRC facilities and all data stored on them is extremely important, and the YCRC takes several steps to help provide a secure computing environment, including:
- Operating the clusters in a secure datacenter, with restricted and logged access;
- Using firewalls to allow access only from Yale computers or the Yale VPN, and restricting that access to only login and data transfer servers;
- Requiring ssh key pairs (not passwords) for ssh authentication;
- Keeping operating systems up to date, and regularly applying security patches.
Users also bear individual responsibility for the security of YCRC clusters and their data and are expected to follow standard security practices to ensure the safety and security of their accounts and data. This includes:
- Using strong passphrases on ssh keys;
- Setting permissions appropriately on data files and directories;
- Never sharing private keys, passphrases, or other login information;
- Following the terms of any Data Use Agreement that covers the data.
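The practice of setting permissions appropriately can be illustrated with a short Python sketch that checks whether a file or directory is accessible to anyone other than its owner and, if so, restricts it. This is one possible approach under the assumption of standard POSIX permission bits; the function names are hypothetical, and whether owner-only access is appropriate depends on the data and on any group-sharing arrangements.

```python
import os
import stat


def is_private(path):
    """Return True if `path` grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def make_private(path):
    """Keep only the owner's read/write/execute bits on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & stat.S_IRWXU)
```

A file with mode 0o644, for instance, would become 0o600 after make_private, since the group and other read bits are cleared.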
Further information regarding Yale cybersecurity and data classifications can be found at http://cybersecurity.yale.edu.
Only files in users’ home directories are backed up, and then only for a short time (currently approximately 30 days). No other user files are backed up at all. Backups are stored locally, so major events affecting the HPC data center could destroy both the primary and backup copies of user files. Users should maintain their own copies of critical files at other locations. YCRC cannot guarantee the safety of files stored on HPC resources.
The YCRC’s HPC resources are shared by many users. The YCRC uses a workload management system (Slurm) to implement and enforce policies that aim to provide each PI group with fair, but limited, access to the HPC clusters. Users may not run computationally intensive jobs on the login nodes. Instead, users must submit such jobs to Slurm, specifying the amount of resources to be allocated for the jobs. Jobs running for longer than one week are discouraged. Jobs exceeding their requested resource amounts will be terminated by Slurm with little or no warning. In order to avoid loss of data if jobs terminate unexpectedly, users are strongly encouraged to checkpoint running jobs at regular intervals.
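Checkpointing at regular intervals can take many forms, depending on the application. As a minimal sketch, assuming the job's progress can be captured in a small JSON-serializable dictionary, the Python code below saves state atomically every fixed number of steps and resumes from the last saved state on restart. The function names, the file format, and the toy workload are all hypothetical.

```python
import json
import os


def save_checkpoint(path, state):
    """Write `state` to `path` via a temporary file and an atomic rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers never see a half-written checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, every=100):
    """Toy job: sum the integers below `total_steps`, checkpointing as it goes."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the job is terminated partway through and resubmitted with the same checkpoint path, run picks up from the last saved step, so at most `every` steps of work are repeated.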
The YCRC includes a number of research support staff who can help users with a variety of tasks, including education and training, software installation, and cluster usage. The YCRC offers a growing number of classes and workshops on a variety of topics relevant to research computing. New users are particularly encouraged to attend one of the introductory training workshops to learn about the HPC clusters and become familiar with the YCRC’s standard operating procedures.
The YCRC procures, installs, and maintains a number of standard software tools and applications intended for use on YCRC facilities, including its HPC clusters. Among these are compilers and languages (e.g., Python, C, C++, Fortran), parallel computing tools (e.g., MPI, parallel debuggers), application systems (e.g., R, Matlab, Mathematica), and libraries (e.g., Intel Math Kernel Library, NAG, GNU Scientific Library, FFTW). Users requiring additional software for use on YCRC facilities are encouraged to install their own copies, though the YCRC’s research support staff is available to assist as needed. For customizable systems such as Python and R, the YCRC has set up procedures to enable users to easily install their own modules, libraries, or packages.
The YCRC has set up its own “ticketing system” to help manage and address inquiries, requests, and troubleshooting related to the HPC clusters. Users may contact YCRC staff by sending email to firstname.lastname@example.org. While the time required to resolve particular issues may vary widely, users may expect an initial communication from a YCRC staff member within a reasonable time (often within one business day). Users may also obtain assistance by visiting the YCRC to meet with the research support staff, either on a first-come-first-served basis during the open office hours posted on the YCRC website, or by appointment at other times.
Accounts will expire when users leave the university or the accounts have been inactive for one (1) year. Faculty PIs are ultimately responsible for ensuring that all files are properly managed or removed when accounts expire. However, the YCRC reserves the right to delete expired accounts and all files associated with them, if necessary. When possible, users should arrange to transfer file ownership before leaving the university.
At least once per year, the YCRC will offer PIs an opportunity to purchase dedicated HPC compute and storage resources for their groups. Often, such opportunities may be coordinated with the YCRC’s regular refresh and upgrade cycles for its compute and storage hardware infrastructure, but, subject to YCRC approval, PIs may request hardware purchases at other times, as well. At its discretion, the YCRC may restrict the types and quantities of dedicated HPC resources purchased for PI groups and may require that purchased resources be compatible with the YCRC’s datacenter and network infrastructure and with applicable policies.
All HPC resources will be purchased from vendors of the YCRC’s choosing and will be purchased with warranties acceptable to the YCRC (currently for 5 years). HPC resources will have lifetimes consistent with their warranties, commencing upon delivery, after which the resources may be decommissioned, or the lifetimes may be extended, at the YCRC’s discretion.
Each PI group may access its dedicated compute resources using a private Slurm partition restricted to use by group members. The YCRC reserves the right to allow other users to make use of any idle dedicated compute resources by submitting jobs to a Slurm scavenge partition. Any such scavenge job shall be subject to nearly immediate termination should a member of the owning PI group request access (via its private partition) to the compute resources being used by the scavenge job.
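Because scavenge jobs may be terminated on short notice, a job can be written to save its progress when it is asked to stop. The Python sketch below assumes that the job's processes receive a SIGTERM signal shortly before being forcibly ended; whether that happens, and how much time is allowed, depends on scheduler configuration that this policy does not specify. The class name, function name, and save callback are hypothetical.

```python
import signal


class StopFlag:
    """Callable signal handler that only records that a stop was requested."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        self.requested = True


def work(steps, flag, save):
    """Run up to `steps` iterations; on a stop request, call `save(i)` and exit early."""
    done = 0
    for i in range(steps):
        if flag.requested:
            save(i)  # hypothetical callback that persists progress at step i
            break
        done = i + 1  # stand-in for one unit of real work
    return done


if __name__ == "__main__":
    flag = StopFlag()
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, flag)
    work(1000, flag, lambda i: print("stopping at step", i))
```

Keeping the handler to a single flag assignment and doing the actual saving in the main loop avoids writing files from inside a signal handler.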
Please see the YCRC website for further details.
The Yale Center for Research Computing (YCRC) operates five primary high performance computing (HPC) clusters located in an HPC data center at Yale’s West Campus facility in West Haven, approximately 7 miles from the main Yale campus.
Faculty of Arts & Sciences Research: For non-biomedical research in science, engineering, and other fields, the YCRC currently runs the Omega and Grace HPC clusters, totaling over 1,200 nodes and nearly 13,000 cores. These two clusters are traditional Linux clusters, containing dual-processor nodes that have 8-28 cores per node and 4-9 GB of memory per core. A small number of nodes have Nvidia GPUs and/or additional memory per core. The clusters use InfiniBand networks for computational work. Omega and Grace share ~2.6 Petabytes of usable storage deployed in a high-performance GPFS parallel file system. Yale acquired the Omega and Grace clusters primarily using internal funds, except for a modest number of cluster nodes that were funded by individual faculty grants or gifts. Limited access to these two clusters is available free of charge to all faculty, students, and research staff in the sciences, social sciences, engineering, and other fields for non-disease-specific scientific research. Portions of the clusters have been dedicated to specific research groups in astrophysics, climate science, high-energy physics, energy sciences, engineering, life sciences, and other fields. In general, these clusters operate at 80% or more of their theoretical capacities.
Biomedical Research: For biomedical research, the YCRC currently runs the Farnam and Ruddle HPC clusters, totaling nearly 500 nodes and approximately 9,000 cores. These clusters primarily support NIH-funded biomedical research, including disease-specific research. Most of the nodes on both clusters contain dual 10-core processors and up to 128 GB of memory, and they are connected via 10 Gbps Ethernet to a 40 Gbps network backbone. A few nodes on both clusters have 32 cores and 1.5-2.0 Terabytes of memory, and a small number of nodes on Farnam contain Nvidia K80, 1080Ti, P100, or Titan V GPUs. Together, Farnam and Ruddle provide over 5 Petabytes of usable disk storage. These clusters were acquired using a combination of NIH grants, grants from private foundations, and internal university funds. Access to Farnam is provided free of charge for biomedical and life science researchers at Yale, primarily in the Yale School of Medicine. Several specific research groups have secured special priority access by contributing funds to the cluster purchases. The Ruddle cluster is restricted to research related to genome sequencing performed at the Yale Center for Genome Analysis (YCGA). Over the past several years, the compute and storage resources on these two clusters (and two predecessor clusters) have been nearly fully subscribed by existing users, and the clusters typically operate at over 80% of their theoretical maximum capacities.
Psychology Research: The YCRC operates the Milgram HPC cluster to support computational research in neuroscience in the Department of Psychology. This cluster, which is HIPAA-aligned to permit use for projects involving regulated identifiable data, comprises 72 nodes totaling 1,920 cores. Most nodes have 256 GB of memory. Data storage is provided on a 1.2 Petabyte high-performance GPFS parallel file system. Use of the Milgram cluster is restricted to specific research groups in the Department of Psychology.
Storage Facilities: In addition to storage facilities associated with the HPC clusters or specific departmental or laboratory facilities, Yale operates two large research storage facilities connected to a high-speed 100 Gbps Science Network described below.
Active Storage: The active research file storage system provides more than 2.5 PB of mirrored storage that is optimized to provide groups (research labs, departments, schools) and individual researchers the ability to store and use large quantities of data on the HPC clusters or any Windows, Mac or Linux computer connected to the Yale network.
Archive Storage: Yale operates a mirrored tape library designed to provide long-term archival data storage. Users stage data through archive nodes attached to the Science Network, and the system automatically writes the data to a tape library at West Campus and replicates it to a second tape library located in downtown New Haven.
Network Infrastructure: Yale operates a high-performance network infrastructure in support of research computing and HPC. A 100-Gbps Science Network physically connects the HPC datacenter and the two research storage facilities. It also provides a direct connection to Internet2 through a non-firewalled Science DMZ, currently running at 10 Gbps, but scalable up to 100 Gbps should that be required in the future. Virtual LAN technology is used to segregate access and applications on the network, and 10-Gbps connections are provided via the VLANs to a number of individual laboratory and departmental storage and server facilities.
Yale has a separate main campus network based on a 10-Gbps backbone, with most buildings connected via 1-Gbps local networks. The main campus network is currently used for all university purposes other than high-speed data transfer through the Science Network VLANs. It also provides commodity Internet connectivity via multiple 1-Gbps connections from several commercial vendors. Yale connects to Internet2 via the Connecticut Education Network (CEN), which connects, in turn, to the Northern Crossroads GigaPop (NOX). In addition to the direct connection from the Science DMZ, Yale provides one firewalled 10-Gbps connection from the main campus network.
File-sharing Infrastructure: Yale has a site license for the Globus file transfer and sharing software. The YCRC supports a number of endpoints on the HPC clusters. Other Yale departments and laboratories may provide endpoints supported by departmental or IT staff. All endpoints are connected to Yale’s Science Network and Science DMZ to facilitate high-speed data transfer and sharing among Yale, national supercomputing facilities, and other universities with whom Yale researchers collaborate world-wide.