High-Performance Computing System Administrator

The Yale Center for Research Computing (YCRC) seeks a High-Performance Computing System Administrator to join the center’s Engineering team and provide hardware and software administration for a growing number of high-performance computing (HPC) clusters used in faculty research.

The YCRC provides support that spans the Yale School of Medicine and Faculty of Arts & Sciences and encompasses Yale’s HPC clusters, multiple petabytes of high-performance storage, and technologies for computational science and the analysis, sharing, and management of large-scale research data.

The successful candidate will support the infrastructure behind all of the above, including hardware, system and resource-management software, networking, storage, monitoring, and security measures. This is a highly collaborative effort, so frequent interaction with other system administrators, research-support staff, management, vendors, and researchers is a regular part of the role. The successful candidate will also participate in designing, recommending, and vetting architectures, specifications, and configurations of new systems.

The Yale Center for Research Computing is a component of the Provost’s Office and is governed jointly by the Vice Provost for Research, the Deputy Dean(s) for Research at the Yale School of Medicine, and the Chief Information Officer of the University.

Responsibilities Specific to This Role:

  1. Configure and support HPC clusters.
  2. Install, administer, and maintain hardware, system software, networking, accounts, and security measures.
  3. Deploy and support data storage and backup.
  4. Diagnose and correct system issues, whether they affect correct operation or performance.
  5. Restore system integrity as quickly as possible after an outage to minimize downtime.
  6. Manage end-user accounts.
  7. Triage and solve user-submitted tickets, especially when they relate to the infrastructure.
  8. Track resource usage using monitoring and queuing software.
  9. Develop and maintain documentation for team members and end users.
  10. Research developments in HPC architecture and new technologies, processes, and methodologies.
  11. Patch system firmware and software as needed.
  12. Determine specifications for new systems, in collaboration with the team, and tailor these to meet business needs.
  13. Conduct training and user education.
  14. Perform other duties as assigned.

For questions, email Eric Peskin (eric.peskin@yale.edu).