Software Engineer - GPU reliability
Responsibilities
This role offers a unique opportunity to make a significant impact on a critical part of our existing and growing infrastructure. Your responsibilities may vary day to day, but will include:
- Building and maintaining tools and software features to automate systems engineering workflows related to GPU management, monitoring, metrics collection, maintenance, and network configuration
- Troubleshooting software and hardware bugs on a fleet of GPU devices, including application, network, operating system, and/or kernel issues
- Working across HRT's engineering teams to tune workloads and processes to use GPUs more efficiently
- Analyzing GPU job statistics to identify trends and areas for improvement
Requirements
Required:
- BS and/or MS in computer science or a related field
- 2+ years of relevant experience, including programming in Python and managing GPUs
- Experience using automation to solve problems and improve process efficiency
- Experience working with, troubleshooting, tuning, and deploying various types of GPU hardware
- Strong grasp of computer science fundamentals and software design patterns
- Solid understanding of Linux/UNIX operating systems
- Familiarity with open-source software
- Ability to debug and analyze problems quickly
- Skilled at balancing multiple tasks while maintaining meticulous attention to detail
- Ability to operate effectively as a team player and also work independently
- Ability to learn at a fast pace and apply new skills effectively
Preferred:
- Understanding of the Debian operating system
- Familiarity with systems configuration management and monitoring technologies
- Familiarity with continuous integration and continuous deployment tools and processes
- Understanding of networking protocols
The estimated base salary range for this position is 200,000 to 300,000 USD per year (or local equivalent). The base pay offered may vary depending on multiple individualized factors, including location, job-related knowledge, skills, and experience.
Culture
Hudson River Trading (HRT) brings a scientific approach to trading financial products. We have built one of the world's most sophisticated computing environments for research and development. Our researchers are at the forefront of innovation in the world of algorithmic trading.
Additional Information
Hudson River Trading (HRT) is seeking a Software Engineer focused on GPU reliability to join our Systems Development team. The Systems Development team builds and maintains the platform that all Systems teams share to provision, monitor, and manage HRT's server and network infrastructure. In this role, your main focus will be to develop tools in Python to analyze the performance of GPU hardware and build creative solutions to improve the observability, reliability, and efficiency of the fleet. You'll work closely with other engineering teams to deeply understand research and trading workflows and ensure that GPU infrastructure is utilized optimally. Strong Python skills and development experience are required, along with Unix experience and a background in managing GPU hardware at scale.