Yale Center for Research Computing
High Performance Computing Cluster Usage Guidelines
The Yale Center for Research Computing (YCRC) provides shared access to a number of Linux-based High Performance Computing (HPC) clusters in support of Yale’s scientific research and formal undergraduate and graduate education. The YCRC’s missions are:
- To support research and teaching at Yale by providing broad access to computational and storage resources larger than those available in individual labs, but smaller than those available through National HPC infrastructure such as XSEDE or Department of Energy facilities.
- To deploy a high quality and well-maintained set of HPC resources for research and instructional use by individual faculty, their students, post docs, research staff, and external collaborators.
- To provide the training, technical assistance, and administrative support necessary to enable easy and efficient use of the YCRC’s resources.
Access: Access to the clusters is open to all science, social science, and engineering faculty in the Faculty of Arts and Sciences, the Yale School of Medicine, and Yale School of Public Health, as well as to their students and research staff. With approval, access may be granted to other Yale faculty, students, and research staff, and to external collaborators. Upon request, access may be provided for instructional use that is independent of research activities.
Policies: The clusters are operated by the YCRC following policies approved by the Deputy Provost for Research, the Deputy Dean for Academic and Scientific Affairs in the Yale School of Medicine, and committees appointed by them. At the present time, the YCRC does not charge for ordinary, shared use of the clusters or their primary attached storage systems, and such use is generally on a first-come, first-served basis. Computational and storage usage by individuals and faculty research groups is monitored and limited appropriately to ensure that all individuals have equitable access. All policies regarding HPC usage are reviewed regularly and adjusted as necessary by YCRC management, cognizant committees, the Deputy Provost, and the Deputy Dean. Users should consult the YCRC website for more complete, detailed, and up-to-date information about the clusters, policies for their use, and how to access and use them.
- Usage and Accounting: All research usage is monitored and consolidated under accounts for PI groups (typically associated with individual ladder faculty members). Post docs, students, research staff, external collaborators, and all other categories of research users must be associated with a PI group account, and their usage is assessed against that account. No user may be associated with more than one PI group except by special permission.
- External Collaborators: Cluster use by external collaborators is permitted upon request by a PI with suitable justification. All external use is subject to periodic review.
- Usage Allocations: Computational usage is measured and tracked in terms of Service Units (SUs). One SU is equivalent to one core-hour of computing (one hour of elapsed time consuming a single compute core and/or a corresponding portion of the memory on a single node). The number of SUs accumulated per hour may depend on usage of special resources and/or the job priority selected when a job is submitted.
Depending on need and resource availability, PI groups may not necessarily be granted access to all clusters. Resources on each cluster may include both publicly available nodes and storage available to all users, and nodes and storage that are reserved for exclusive or high-priority access by specific PI groups.
The normal quarterly SU allocation for each PI group using publicly available resources on a given cluster is 5% of the total SUs available on those resources in that quarter. Additional usage is permitted at lower priority if sufficient resources are available. Additional access at normal priority may be granted at the discretion of the YCRC. Allocations are reset at the start of each calendar quarter. YCRC management reserves the right to restrict any group’s or individual’s access to any or all clusters to ensure fair access to all.
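The SU and allocation arithmetic above can be sketched as follows. This is an illustrative sketch only: the core counts, job sizes, and rate multiplier are invented numbers, not actual YCRC figures or policy.

```python
# Hypothetical sketch of Service Unit (SU) accounting as defined above:
# 1 SU = 1 core-hour. All figures below are invented for illustration.

def job_sus(cores: int, elapsed_hours: float, rate: float = 1.0) -> float:
    """SUs charged for one job. `rate` > 1.0 models special resources or a
    higher-priority multiplier (an assumption, not stated YCRC policy)."""
    return cores * elapsed_hours * rate

# A 16-core job that runs for 12 hours at the normal rate:
print(job_sus(16, 12))  # 192.0 SUs

# Quarterly allocation: 5% of the total SUs available on a cluster's public
# nodes. E.g., a hypothetical 1,000 public cores running 24/7 for 90 days:
total_sus = 1000 * 24 * 90
allocation = 0.05 * total_sus
print(allocation)  # 108000.0 SUs per PI group for the quarter
```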
- Storage and Quotas: During calendar year 2016, most clusters will migrate to a three-tier storage architecture, consisting of the Home, Project, and Temporary Scratch tiers described below.
Clusters that have not migrated to the three-tier architecture provide the Home and Project tiers only. (In such cases, the Project tier may be known as the Scratch tier.)
No charge is made for use of storage up to limits imposed by standard storage quotas. Initial quotas on home and project storage are set at levels suitable for the majority of PI groups. Modest no-cost permanent quota increases may be granted upon request with appropriate justification. In addition, PIs may arrange to purchase, at their expense, additional permanent storage in the Project tier. The YCRC also has a limited ability to provide short-term temporary quota increases for the Project tier. Please see the YCRC website for additional information.
In general, both project storage and temporary scratch storage are designed to provide high levels of I/O performance, but temporary scratch storage, which is local to each cluster, may have somewhat more desirable characteristics for the types of applications frequently run on a particular cluster. Users should not rely upon Home storage for high performance I/O operations.
Home storage is backed up daily, but neither project storage nor temporary scratch storage is backed up. Users should not rely upon project storage or scratch storage for long-term storage of important data. It is each PI’s responsibility to ensure that important files created in project or temporary scratch storage are preserved by copying them to alternative storage facilities intended for that purpose (e.g., Storage@Yale). For details, please read the pages on the individual clusters and storage on the YCRC website.
Home Storage: (Cluster-specific, per group; permanent) Each PI group is provided a limited amount of cluster-specific home storage (initially 100 GB) for scripts, programs, and other small files. Home storage may be located on a remote storage device, so it may not support high-performance I/O operations. Home storage is backed up daily.
Project Storage: (Cluster-independent; per group; permanent) Each PI group is provided a reasonably large amount of cluster-independent high performance project storage for program input data, outputs, private applications, etc. Each group’s initial quota for project storage will be 1 TB. In general, project storage is intended to be the primary storage location for HPC research data. Project storage resides on a high-performance parallel file system (currently GPFS) that is mounted on all clusters, making it easy for users to move computational work among the clusters. Project storage is not backed up.
Temporary Scratch Storage: (Cluster-specific; per group; time-limited) Each cluster has several hundred terabytes of high performance shared temporary scratch storage intended for short-term files (up to 60 days), such as in-job temporary files, files that need to be kept only until final job outputs have been validated, etc. Temporary scratch storage resides on a high performance parallel file system local to each cluster, and hardware details may vary among the clusters. PI groups will have large quotas (~30 TB) on a space-available basis. Files in temporary scratch storage will be deleted after 60 days, and users may be asked to delete files sooner than that if the storage space fills up too quickly.
Scheduling Policies: Each HPC cluster uses a job queuing/scheduling system to manage user jobs. (Several different systems are used currently, but all clusters will migrate to a standardized system during 2016.) Procedurally, users ssh to a cluster login node, from which they may submit jobs to that cluster’s scheduler. In general, jobs using publicly available resources are scheduled on a first-come, first-served basis, possibly subject to adjustment based on the priority selected by the user when submitting a job. In addition, the scheduler may make a “fair-share” adjustment so that jobs from groups with high recent usage will have slightly lower priority, while jobs from groups with low recent usage will have somewhat higher priority, all else being equal. On each cluster there may be limits on the number of jobs that any one user or group may run concurrently, as well as on the aggregate number of cores that a group may be using at any one time. In some cases, PI groups may have purchased exclusive or high-priority access to a portion of a cluster, in which case those groups may have access to one or more special queues subject to rules determined by those groups.
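The “fair-share” adjustment described above can be sketched in miniature. The formula and weight below are assumptions chosen for illustration; they are not the actual algorithm used by any YCRC scheduler.

```python
# Hedged sketch of a fair-share priority adjustment: all else being equal,
# groups with high recent usage get slightly lower priority, and groups
# with low recent usage get somewhat higher priority. The linear formula
# and the weight of 0.1 are invented for illustration only.

def fair_share_priority(base_priority: float,
                        group_recent_usage: float,
                        total_recent_usage: float,
                        weight: float = 0.1) -> float:
    """Reduce a job's base priority in proportion to its group's share of
    recent usage on the cluster."""
    if total_recent_usage <= 0:
        return base_priority  # no recent usage anywhere: no adjustment
    share = group_recent_usage / total_recent_usage
    return base_priority * (1.0 - weight * share)

# A heavy group (60% of recent usage) vs. a light group (5%):
print(fair_share_priority(100.0, 60.0, 100.0))  # 94.0
print(fair_share_priority(100.0, 5.0, 100.0))   # 99.5
```

Real schedulers typically use decayed usage histories and more elaborate priority factors, but the direction of the adjustment is the same.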
User Support and Assistance: The YCRC staff is responsible for installing and supporting the HPC hardware, system software, and a number of widely used software tools, libraries, and applications (e.g., commercial compilers, MPI, numerical/mathematical libraries, Matlab®, R, Gaussian®, Mathematica®, etc.). In addition, the YCRC staff provides user training and assists users with cluster usage and installation of less widely used software packages. To facilitate support activities and maintain a high level of service quality, the YCRC makes use of a support software system to track requests and problem reports. Users are strongly encouraged to make requests or report problems to YCRC staff by email (firstname.lastname@example.org) in order to ensure that such requests or reports are properly entered into the support software system.
Additional Considerations: The HPC clusters are shared facilities used by hundreds of users, so the YCRC takes several steps to ensure safe and equitable access for all users.
Regulated Data: No regulated data may be stored or used on any YCRC HPC cluster other than the Milgram cluster, which is currently available exclusively to a limited set of PI groups. Such data include, but are not limited to, so-called “3-Lock” data such as electronic Protected Health Information (ePHI), HIPAA-regulated data, or other data subject to governmental regulations or private data use agreements. For information on these types of data, see http://www.yale.edu/its/secure-computing/data/levels/index.html.
Accounts and Authentication: On any cluster, each user is permitted only a single account, usually corresponding to the user’s Yale NetID. Security is a major concern on the HPC clusters. Access to a cluster requires use of ssh from a Yale IP address to a cluster login node using suitable authentication. (For access from an off-campus location, a VPN connection and/or two-factor authentication is required.) At the present time, authentication requires use of an ssh public-private key pair, except on the Louise cluster where the user’s NetID password may be used. In the future, Yale may implement alternative authentication procedures as appropriate. Details are available on the YCRC website.
Maintenance Periods: The YCRC strives to operate its clusters on a 7x24 basis, except for regularly scheduled maintenance periods (approximately twice per year for up to 5 days). The YCRC publishes a regular maintenance schedule and notifies users well in advance of scheduled maintenance periods. Users are responsible for managing their workloads accordingly. In the rare event of an emergency maintenance period, users may receive little or no notice.
Login Nodes: The login nodes on the clusters are shared by all users and are intended only for lightweight activities such as editing, reviewing program input and output files, file transfers, and job submission and monitoring. Users are not permitted to run programs (including interactive programs like Matlab®, R, or Mathematica®) on the login nodes, and the system administrators reserve the right to terminate without prior notice any user session found to be consuming excessive resources on a login node. For security reasons, the system software on the login nodes may be updated frequently and become inconsistent with the system software installed on the compute nodes, so it is strongly recommended that all compilations and other program building activities be performed on a compute node. (Special queues are available to provide rapid access to compute nodes for program development activities.)
Licensed Software: Yale has licensed a number of commercial software products for use on the clusters, and those are available to all users, subject to limits on the number of simultaneous users. In addition, a number of individual PI groups have licensed commercial software products for use only by their group members. Users are responsible for ensuring that they abide by all license terms and requirements of the software that they use on the HPC clusters.
Every effort will be made to provide for uninterrupted completion of all jobs that start to run. However, in rare instances, unforeseen circumstances may cause job failures or require suspending or even killing jobs without warning. Users are strongly urged to make use of checkpointing or other standard programming practices to allow for this possibility.
Every effort will be made to ensure that data stored on the clusters are not lost or damaged. However, only home directories will be backed up (daily in most cases), and any data that have not yet been backed up are subject to loss due to hardware failures or other circumstances. It is each PI’s responsibility to ensure that important files created in project or temporary scratch storage are preserved by copying them to alternative storage facilities intended for that purpose (e.g., Storage@Yale).
- Acknowledgements: All publications and presentations reporting on research benefiting from the use of any of the YCRC HPC clusters must include an appropriate acknowledgement of support from the Yale Center for Research Computing, in addition to any other support acknowledgements that may be appropriate to specific research programs. Suitable acknowledgements for the YCRC clusters depend on the particular cluster(s) used, as shown here:
Louise Cluster: “This work was supported by the HPC facilities operated by, and the staffs of, the Yale Center for Research Computing and Yale’s W.M. Keck Biotechnology Laboratory, as well as NIH grants RR19895 and RR029676-01, which helped fund the cluster.”
BulldogN Cluster: “This work was supported by the HPC facilities operated by, and the staffs of, the Yale Center for Research Computing and the Yale Center for Genome Analysis.”
Ruddle Cluster: “This work was supported by the HPC facilities operated by, and the staffs of, the Yale Center for Research Computing and the Yale Center for Genome Analysis, as well as NIH grant 1S10OD018521-01, which helped fund the cluster.”
All other YCRC clusters: “This work was supported by the HPC facilities operated by, and the staff of, the Yale Center for Research Computing.”
In addition, users are expected to provide the YCRC with an electronic copy of all publications benefiting from cluster use, along with suitable bibliographic citation information. This information is required so that the YCRC can make complete reports to its sponsors and submit grant proposals for new equipment. Publication information and questions about appropriate acknowledgement should be sent to the YCRC by email to email@example.com.