Yale Center for Research Computing (YCRC) HPC Policies
September 25, 2020
Table of Contents
- Access and Accounts
- Systems Administration
- Compute & Storage Resources
- Data & Security Considerations on YCRC Resources
- Resource Scheduling and Jobs
- Research Support and User Assistance
- Account Expiration
- Hardware Acquisition and Lifecycle
- Facilities, Equipment, and Other Resources
The Yale Center for Research Computing (YCRC) is a computational core facility under the Office of the Provost created to support the advanced computing needs of the research community. The YCRC is staffed by a team of research scientists, application specialists, and systems administrators with expertise in supporting high performance computing (HPC) and computationally dependent disciplines.
The YCRC advances research at Yale by administering a state-of-the-art cyberinfrastructure, providing sustainable research services, and facilitating an interdisciplinary approach to the development and application of advanced computing and data processing technology throughout the research community.
The YCRC operates high performance computing (HPC) resources in support of faculty research in a wide range of disciplines across the university. Upon request, the YCRC will provide a principal investigator group account on the YCRC’s HPC resources to any member of the ladder faculty appointed at the level of assistant professor or above, or any member of the research faculty appointed at the level of Research Scientist/Scholar or above. Other university faculty or staff may apply for such an account, subject to additional administrative approval. Once a group account has been created, and with approval of the PI, members of a PI’s research group may request individual user accounts under the group account. The PI is ultimately responsible for all use of YCRC resources by members of the PI’s group.
Because of its focus on computational research conducted by faculty, research staff, postdocs, and graduate students, the YCRC provides only limited access to undergraduate students, most often when they are part of a PI’s research group or enrolled in a class that makes use of YCRC resources. In unusual circumstances, undergraduate students may request use of YCRC resources for independent research under the oversight of a faculty sponsor/advisor. Such requests will be considered on a case-by-case basis, and approval will be at the discretion of the YCRC.
The YCRC encourages appropriate use of its resources for classes. Instructors planning to use YCRC resources in their classes must consult with the YCRC and obtain approval at least 60 days before the start of the class. Accounts created for class use will be disabled and removed once the class is over.
Policies regarding account creation and access to HPC resources are subject to change.
The YCRC’s engineering group administers all HPC systems, including machines (known as HPC clusters), storage facilities, and related datacenter networking. Because these systems are intended to support research applications and environments, they are designed and operated to achieve maximum performance, rather than high availability or other possible goals. Accordingly, the engineering group’s primary responsibility is to maintain high performance while keeping the systems as stable and available as practical. To that end, all systems are subject to regular maintenance periods as listed on the System Status Page on the YCRC website. During these maintenance periods and, rarely, at other times, some or all of the YCRC HPC resources may not be available for use.
In general, the YCRC will provide each PI group with access to compute and storage resources on one YCRC cluster, subject to reasonable limits and YCRC discretion. In some circumstances, the YCRC may provide a PI group or some of its members with access to additional resources as appropriate for their computational work. (Such additional access may be temporary.) Users are expected to do their best to use resources efficiently and to release idle resources.
All users will have access to limited amounts of storage in home, project, and short-term scratch directories, free of charge. Quotas are used to limit the amount of storage and the number of files per user and/or PI group. Users are expected to do their best to delete files they no longer need. (Files in short-term scratch directories are purged automatically after 60 days.) PIs are responsible for all storage usage by members of their groups. Upon request and subject to YCRC discretion, additional storage may be provided, which may or may not incur a cost.
Users are not permitted to use or store high or medium risk, sensitive, or regulated data of any sort (e.g., HIPAA data, PHI, or PII) at any time on any YCRC facilities other than those explicitly designated for such data. Regardless of the sensitivity/risk classification of the data, users are not permitted to use or store, on any YCRC facility, data that are covered by a data use agreement (DUA) unless the DUA has been approved by the Office of Sponsored Projects, and the YCRC has been informed of and agreed to meet all applicable computing-related requirements of the DUA, including, but not limited to, requirements for data encryption, access control, auditing, and special actions to be taken upon removal of the data.
Security of YCRC facilities and all data stored on them is extremely important, and the YCRC takes several steps to help provide a secure computing environment, including:
- Operating the clusters in a secure datacenter, with restricted and logged access;
- Using firewalls to allow access only from Yale computers or the Yale VPN, and restricting that access to only login and data transfer servers;
- Requiring ssh key pairs (not passwords) for ssh authentication;
- Keeping operating systems up to date, and regularly applying security patches.
In addition, users bear individual responsibility for the security of YCRC clusters and their data. Accordingly, users are expected to follow standard security practices to ensure the safety and security of their accounts and data. This includes:
- Using strong passphrases on ssh keys;
- Setting permissions appropriately on data files and directories;
- Never sharing private keys, passphrases, or other login information;
- Following the terms of any Data Use Agreement that covers the data.
Further information regarding Yale cybersecurity and data classifications can be found at http://cybersecurity.yale.edu.
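As an illustrative sketch of the practices above (file locations are examples only, and the mechanism for registering a public key varies; consult the YCRC documentation), a passphrase-protected ssh key pair can be generated and its permissions tightened as follows:

```shell
# Create a passphrase-protected ed25519 key pair. In interactive use,
# ssh-keygen prompts for the passphrase; -N is used here only so the
# example runs non-interactively. The key path is illustrative.
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
ssh-keygen -t ed25519 -N "example-passphrase" -f "$HOME/.ssh/ycrc_example" -q
chmod 600 "$HOME/.ssh/ycrc_example"   # private key: readable by you alone
# Share only the .pub file when registering the key; the private key and
# its passphrase are never shared with anyone.
```

The same principle applies to data: directories holding research files should have group and other permissions set no wider than necessary (e.g., `chmod o-rwx` on a project directory).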
Except as described on the YCRC website for specific clusters, only files in users’ home directories are backed up, and then only for a short time (currently approximately 30 days on most clusters). No other user files are backed up at all. Backups are stored locally, so major events affecting the HPC data center could destroy both the primary and backup copies of user files. Users should maintain their own copies of critical files at other locations. YCRC cannot guarantee the safety of files stored on HPC resources.
The YCRC’s HPC resources are shared by many users. The YCRC uses a workload management system (Slurm) to implement and enforce policies that aim to provide each PI group with fair, but limited, access to the HPC clusters. Users may not run computationally intensive jobs on the login nodes. Instead, users must submit such jobs to Slurm, specifying the amount of resources to be allocated for the jobs. Jobs running for longer than one week are discouraged. Jobs exceeding their requested resource amounts will be terminated by Slurm with little or no warning. In order to avoid loss of data if jobs terminate unexpectedly, users are strongly encouraged to checkpoint running jobs at regular intervals.
The YCRC includes a number of research support staff who can help users with a variety of tasks, including education and training, software installation, and cluster usage. The YCRC offers a growing number of classes and workshops on a variety of topics relevant to research computing. New users are particularly encouraged to attend one of the introductory training workshops to learn about the HPC clusters and become familiar with the YCRC’s standard operating procedures.
The YCRC procures, installs, and maintains a number of standard software tools and applications intended for use on YCRC facilities, including its HPC clusters. Among these are compilers and languages (e.g., Python, C, C++, Fortran), parallel computing tools (e.g., MPI, parallel debuggers), application systems (e.g., R, Matlab, Mathematica), and libraries (e.g., Intel Math Kernel Library, NAG, GNU Scientific Library, FFTW). Users requiring additional software for use on YCRC facilities are encouraged to install their own copies, though the YCRC’s research support staff is available to assist as needed. For customizable systems such as Python and R, the YCRC has set up procedures to enable users to easily install their own modules, libraries, or packages.
The YCRC has set up its own “ticketing system” to help manage and address inquiries, requests, and troubleshooting related to the HPC clusters. Users may contact YCRC staff by sending email to email@example.com. While the time required to resolve particular issues may vary widely, users may expect an initial communication from a YCRC staff member within a reasonable time (often within one business day). Users may also obtain assistance by visiting the YCRC to meet with the research support staff, either on a first-come-first-served basis during the open office hours posted on the YCRC website, or by appointment at other times.
The YCRC audits cluster accounts annually on November 1, and all accounts not associated with a valid netid will expire at that time. In addition, it is essential for the YCRC to be able to contact all account holders for security and communication purposes. Therefore, logins will be disabled throughout the year for accounts that are found not to be associated with a valid email address. When accounts expire, PIs are ultimately responsible for ensuring that all files are properly managed or removed. However, the YCRC reserves the right to delete expired accounts and all files associated with them, if necessary. When possible, users should arrange to transfer file ownership before leaving the group(s) with which they have been working.
At least once per year, the YCRC will offer PIs an opportunity to purchase dedicated HPC compute and storage resources for their groups. Often, such opportunities may be coordinated with the YCRC’s regular refresh and upgrade cycles for its compute and storage hardware infrastructure, but, subject to YCRC approval, PIs may request hardware purchases at other times, as well. At its discretion, the YCRC may restrict the types and quantities of dedicated HPC resources purchased for PI groups and may require that purchased resources be compatible with the YCRC’s datacenter and network infrastructure and with applicable policies.
All HPC resources will be purchased from vendors of the YCRC’s choosing and will be purchased with warranties acceptable to the YCRC (currently for 5 years). HPC resources will have lifetimes consistent with their warranties, commencing upon delivery, after which the resources may be decommissioned, or the lifetimes may be extended, at the YCRC’s discretion.
Each PI group may access its dedicated compute resources using a private Slurm partition restricted to use by group members. The YCRC reserves the right to allow other users to make use of any idle dedicated compute resources by submitting jobs to a Slurm scavenge partition. Any such scavenge job shall be subject to nearly immediate termination should a member of the owning PI group request access (via its private partition) to the compute resources being used by the scavenge job.
Please see the YCRC website for further details.
The Yale Center for Research Computing (YCRC) supports and provides access to four high performance computing (HPC) clusters located in an HPC data center at Yale’s West Campus facility in West Haven, Connecticut. The clusters are divided by research area: non-biomedical research in the Faculty of Arts & Sciences, biomedical research, and psychology/HIPAA-regulated research. Within these divisions, any researcher can access the portion of each cluster designated for shared usage free of charge. Research groups may also purchase additional dedicated resources, and many have done so.
The Grace HPC cluster is intended for non-biomedical research in natural science, social science, engineering, and other fields. It comprises over 835 nodes (servers) containing nearly 23,000 total CPU cores. This cluster uses Linux as its operating system and Slurm for scheduling. Each node on the cluster has 20-36 CPU cores with 4-9 GB of memory per core. Some nodes have Nvidia GPUs or additional memory per core. Grace’s network infrastructure is high bandwidth, low latency InfiniBand. Grace has ~2.6 Petabytes of usable storage in its high-performance GPFS parallel file system. The cluster was acquired using a combination of internal university funds and individual faculty grants. Grace generally operates at over 80% of its theoretical capacity.
The Farnam and Ruddle HPC clusters are intended for use in biomedical research. Access to the Ruddle cluster is restricted to researchers engaged in genome sequencing at the Yale Center for Genome Analysis (YCGA). Together these clusters total nearly 400 nodes (servers) with 8,000 total CPU cores. These clusters use Linux as their operating system and Slurm for scheduling. Each node on these clusters has 20-36 CPU cores with 5-9 GB of memory per core. Some nodes on Farnam have Nvidia GPUs and/or additional memory per core. Farnam and Ruddle use Ethernet for networking. Together, Farnam and Ruddle provide over 4 Petabytes of usable storage deployed across several high-performance GPFS parallel file systems. These clusters were acquired using a combination of NIH grants, grants from private foundations, and internal university funds.
The Milgram HPC cluster is HIPAA-aligned and supports computational research involving regulated identifiable data. At present, use of the Milgram cluster is restricted to specific research groups in the Department of Psychology performing neuroscience research. It comprises 65 nodes (servers) with 1,764 total CPU cores. Each node on the cluster has 20-36 CPU cores with 5-9 GB of memory per core. Milgram has 1.2 Petabytes of usable high-performance storage in its GPFS parallel file system. During the Spring of 2020, the YCRC plans to expand both the storage and computational capability of this cluster.
In addition to storage facilities associated with the HPC clusters or specific departmental or laboratory facilities, Yale operates two large research storage facilities connected to a high-speed 100 Gbps Science Network. For research data in active use, the Storage@Yale file storage system provides more than 2.5 PB of mirrored storage that is optimized to provide research labs, departments, schools, and individual researchers the ability to store and use large quantities of data on the HPC clusters or any Windows, Mac or Linux computer connected to the Yale network. For archive storage, Yale operates a mirrored tape library. Researchers can stage data to and from HPC storage through archive nodes attached to the Science Network, and the system automatically writes the data to the archive storage.
Throughout its campus, Yale operates a high-performance network infrastructure in support of research computing and HPC. The 100-Gbps Science Network connects the HPC datacenter and the main campus. It also provides a direct connection to the Internet2 through a non-firewalled Science DMZ, currently running at 10 Gbps, but scalable up to 100 Gbps should that be required in the future. Virtual LAN technology is used to segregate access and applications on the network, and 10-Gbps connections are provided via the VLANs to a number of individual laboratory and departmental storage and server facilities.
To facilitate high-speed data transfers, the YCRC holds a site license for the Globus file transfer and sharing software. The YCRC supports a number of Globus endpoints for the HPC clusters. Other Yale departments and laboratories can create endpoints supported by departmental or IT staff. All endpoints are connected to Yale’s Science Network and Science DMZ, enabling high-speed data transfers among Yale, national supercomputing facilities, and other universities with whom Yale researchers collaborate world-wide.
Yale has a separate main campus network based on a 10-Gbps backbone, with most buildings connected via 1-Gbps local networks. The main campus network is currently used for all university purposes other than high-speed data transfer through the Science Network VLANs. It also provides commodity Internet connectivity via multiple 1-Gbps connections from several commercial vendors. In addition to the direct connection from the Science DMZ, Yale provides one firewalled 10-Gbps connection to Internet2 from the main campus network. Over the next several years, Yale will upgrade its main campus network, incorporating both a faster backbone and software defined networking technology.