Designs, deploys, configures, and administers CPU/GPU HPC clusters, including management and compute nodes, storage infrastructure, interconnects such as InfiniBand, and physical infrastructure in the datacenter and related systems.
Monitors, configures, maintains, and tunes GPU nodes for optimal performance and utilization following state-of-the-art practices for the required workloads.
Assists with the development and implementation of monitoring and observability tools and infrastructure, including the collection and aggregation of metrics and the development of dashboards.
Develops, maintains, and enforces security procedures and system documentation for operational and compliance purposes.
Tunes, secures, and maintains the HPC job scheduling environment, including fair-sharing, accounting, and policy enforcement.
Assists with the implementation and maintenance of secure and reliable backup, archival, disaster-recovery, and restore capabilities for systems and research data.
Performs vulnerability scanning, patch management, and system and firmware updates across the infrastructure.
Maintains complex systems and network administration functions. Works with moderate guidance to administer simple systems and assists in the administration of larger systems.
Plans and installs necessary patches and upgrades for servers and their associated storage, network, communications, and peripheral sub-systems. Installs and maintains an appropriate level of intrusion detection, monitoring, and auditing software as required.
Tracks compliance and maintains documentation for hardware, software, and service inventories for management reports.
Performs other related work as needed.
Requirements
Education:
Minimum requirements include a college or university degree in a related field.
Work Experience:
Minimum requirements include knowledge and skills developed through 5-7 years of work experience in a related job discipline.
Certifications:
---
Linux system administration experience in a large, distributed computing environment.
Demonstrated experience and knowledge of system security and best practices.
Technical Skills and Knowledge:
Knowledge of Linux administration, preferably RHEL/Rocky.
Administration of GPU infrastructure, including tuning, driver updates, and performance monitoring.
Solid skills in scripting with Python or Bash.
Installing, configuring, and managing job schedulers, such as Slurm, Torque, PBS, and LSF.
Automation tools, such as Ansible, Puppet, Chef, and Salt.
Provisioning tools, including xCAT, Confluent, and Warewulf.
Implementing monitoring tools, such as CheckMK, Zabbix, Nagios, Prometheus, and Grafana.
Working, documenting and enforcin
Benefits
Vision insurance
Additional Information
Department
Provost Research Computing Center
About the Department
The University of Chicago Research Computing Center (RCC), a unit in the Office of Research, provides high-end research computing resources to researchers at the University of Chicago. It is dedicated to enabling research by providing access to centrally managed High-Performance Computing (HPC), storage, and visualization resources. These resources include hardware, software, high-level scientific and technical user support, and the education and training required to help researchers make full use of modern HPC technology and local and national supercomputing resources. The Office of Research oversees the conduct of sponsored research, research program development, and contract management functions.
Job Summary
The University of Chicago is seeking a highly qualified HPC Systems Security Engineer to join the HPC Systems and Operations team that builds and manages RCC's HPC infrastructure. The individual in this position will be involved in the operation, maintenance, security, and compliance of large-scale complex HPC systems primarily used for research.
This position designs automated, scalable, and rapidly deployable solutions for infrastructure development and server configuration. Works independently to install, configure, and maintain operating systems. Uses best practices and systems knowledge to set up monitoring and alerting for systems, utility software, and firewalls. Guides maintenance for production servers, including Windows and Linux servers.
This is a hybrid position requiring at least 3 days working onsite.