Design, build, and optimize large-scale distributed GPU compute clusters
Identify and resolve performance bottlenecks in GPU workloads across compute, storage, and networking layers
Collaborate with research and development teams to profile, benchmark, and fine-tune GPU-based workloads
Automate system deployment, monitoring, and troubleshooting across thousands of nodes
Collaborate with research and engineering teams to support evolving workloads
Own critical infrastructure projects, from concept to implementation and support
Test and deploy new hardware and software, and partner with vendors to resolve complex issues
Requirements
5+ years of experience in large-scale Linux systems engineering in HPC, AI, or distributed infrastructure roles
Extensive experience in Linux system installation, performance tuning, and troubleshooting
Expertise in troubleshooting distributed GPU workloads
Deep knowledge of GPU optimization and performance
Proficiency in Python scripting and automation frameworks
CUDA or C/C++ experience is a plus
Experience with NVIDIA technologies beyond CUDA, such as NCCL, GPUDirect RDMA, and NVLink
Familiarity with configuration management tools (e.g. Salt, Ansible, Puppet, Chef)
Comfortable diagnosing complex system issues at the hardware, OS, and network levels
Strong communication and organizational skills; able to collaborate across diverse technical teams
Thrives in fast-paced environments and is excited by high-impact work
Culture
This fund brings a scientific approach to trading financial products. They've built one of the world's most sophisticated computing environments for research and development, and their researchers are at the forefront of innovation in the world of algorithmic trading.
Seem like something you might be interested in? The goal is to find the best people and bring them together to do great work in a place where everyone is valued. They're proud of their diverse staff; with offices all over the globe they benefit from varied and unique perspectives.
This is an equal opportunity employer, so whoever you are, they'd love to get to know you.
Whilst we carefully review every application for every role, the high volume we receive means it is not possible to respond to those who have not been successful.
Contact
If this sounds like you, or you'd like more information, please get in touch:
George Hutchinson-Binks
(+44)
in/george-hutchinson-binks-a62a69252
Additional Information
One of the world's top algorithmic trading firms, our client is looking for GPU Systems Engineers to help scale and evolve their exceptionally sophisticated HPC/AI research environment.
Joining the Research and Development team, you will collaborate with experts responsible for the compute, storage, operating systems, and automation tools that enable trading and research to run 24/7 across the globe. They design, grow, and operate infrastructure at large scale, including triple-digit petabyte-scale storage and massive CPU and GPU clusters in globally distributed data centers. As such, this is a high-impact role with broad scope, from HPC/AI cluster design and performance tuning to troubleshooting and automation across thousands of nodes.