
Software Engineer - GPU reliability

Wehrtyou · New York, NY
Full-time · On-site · 2w ago
Linux · Observability · Python

Responsibilities

This role offers a unique opportunity to make a significant impact on a critical part of our existing and growing infrastructure. Your responsibilities may vary from day to day but will include:

  • Building and maintaining tools and software features to automate systems engineering workflows related to GPU management, monitoring, metrics collection, maintenance, and network configuration
  • Troubleshooting software and hardware bugs on a fleet of GPU devices, including application, network, operating system, and/or kernel issues
  • Working across HRT's engineering teams to tune workloads and processes to use GPUs more efficiently
  • Analyzing GPU job statistics to identify trends and areas for improvement

Requirements

Required:
  • BS and/or MS in computer science or a related field
  • 2+ years of relevant experience, including programming in Python and managing GPUs
  • Experience using automation to solve problems and improve process efficiency
  • Experience working with, troubleshooting, tuning, and deploying various types of GPU hardware
  • Strong grasp of computer science fundamentals and software design patterns
  • Solid understanding of Linux/UNIX operating systems
  • Familiarity with open-source software
  • Ability to debug and analyze problems quickly
  • Skilled at balancing multiple tasks while maintaining meticulous attention to detail
  • Ability to work effectively both as part of a team and independently
  • Ability to learn at a fast pace and apply new skills effectively
Preferred:
  • Understanding of the Debian operating system
  • Familiarity with systems configuration management and monitoring technologies
  • Familiarity with continuous integration and continuous deployment tools and processes
  • Understanding of networking protocols

The estimated base salary range for this position is 200,000 to 300,000 USD per year (or local equivalent). The base pay offered may vary depending on multiple individualized factors, including location, job-related knowledge, skills, and experience.

Culture

Hudson River Trading (HRT) brings a scientific approach to trading financial products. We have built one of the world's most sophisticated computing environments for research and development. Our researchers are at the forefront of innovation in the world of algorithmic trading.

Benefits

  • Dental insurance
  • Vision insurance
  • Paid time off
  • Performance bonus
  • Parental leave

Additional Information

Hudson River Trading (HRT) is seeking a Software Engineer focused on GPU reliability to join our Systems Development team. The Systems Development team builds and maintains the platform that all Systems teams share to provision, monitor, and manage HRT's server and network infrastructure. In this role, your main focus will be developing Python tools to analyze the performance of GPU hardware and building creative solutions that improve the observability, reliability, and efficiency of the fleet. You'll work closely with other engineering teams to understand research and trading workflows in depth and ensure that GPU infrastructure is used optimally. Strong Python skills and development experience are required, along with Unix experience and a background in managing GPU hardware at scale.
