AI Hardware Systems Manager, Annapurna Labs, Trainium Machine Learning Fleet Operations
About the role
The MLA Fleet Operations team was formed to maintain an exceptionally high quality bar for our fleet of advanced machine learning accelerators and server products. We perfect the customer experience by developing scalable software for rapid incident response and data visualization, and by diving deep into hardware issues as they arise.
Requirements
- Bachelor's degree in computer science, electrical engineering, or related field
- 2+ years of engineering team management experience
- Proficiency in the Python scripting language
- Experience with general troubleshooting/debugging of hardware
- Experience designing, building, operating, and managing large-scale distributed systems or web services
- 7+ years of experience in systems engineering, platform engineering, SRE, or hardware operations
- Experience in automating, deploying, and supporting large-scale infrastructure
- Experience in server technologies such as thermal, mechanical, power, and signal integrity
- Experience working cross-functionally across several teams both technical and non-technical
- Experienc
Additional Information
Annapurna Labs designs silicon and software that accelerates innovation. Customers choose us to create cloud solutions that solve challenges that were unimaginable a short time ago, even yesterday. Our custom chips, accelerators, and software stacks enable us to take on technical challenges that have never been seen before, and deliver results that help our customers change the world. At Annapurna Labs we are at the forefront of hardware/software co-design, not just in Amazon Web Services (AWS) but across the industry.
The Machine Learning Acceleration Fleet Operations Team is looking for a technical leader to manage a team of 5-10 engineers and own operations across multiple ML server platforms spanning tens of thousands of hosts globally. We are seeking a manager who combines strong technical depth in hardware systems and software development with proven people leadership. You will build and grow a high-performing team, set technical direction for fleet-scale automation and tooling, and drive operational excellence across some of the most advanced server hardware in existence. You will define your team's 6-12 month roadmap, influence org-level priorities, and represent fleet operations in VP-level reviews. You are as comfortable debugging a complex hardware failure as you are coaching an engineer through a career development conversation.
Our team has end-to-end ownership of some of the most advanced server hardware in the world. We drive technical debug efforts and write massive-scale autonomous software to monitor, optimize, and remediate machine learning hardware. Come define how we operate the future of ML infrastructure.
Key job responsibilities
- Build, hire, mentor, and grow a team of platform development engineers responsible for ML fleet operations across multiple accelerator platforms
- Define team roadmap and technical strategy for fleet health, automation, and data infrastructure, balancing near-term operational demands against long-term engineering investments
- Drive operational excellence by establishing metrics, SLAs, and processes that maximize platform sellability and customer experience
- Partner with hardware engineering, software engineering, and product teams to prioritize debug efforts and translate fleet learnings into permanent design fixes
- Own escalation paths for critical fleet incidents and lead cross-functional war rooms to resolution
- Influence org-level priorities by surfacing fleet-wide patterns and advocating for systemic improvements across the ML hardware portfolio
- Raise the bar on team software practices, ensuring automation is maintainable, tested, documented, and reusable at scale
- Represent fleet operations in executive reviews, providing data-driven narratives on platform health and roadmap
A day in the life
As a Manager on the MLA Fleet Operations team, you set the direction for how your team keeps the world's most advanced ML accelerators healthy at scale. You start each day with your people: holding 1:1s, coaching engineers through ambiguous technical problems, removing blockers, and ensuring the team is focused on the highest-impact work. From there, you review fleet health with the team, understanding which issues are trending, which investigations need unblocking, and where to allocate engineering effort for maximum customer impact. You partner with hardware design teams to advocate for fleet-informed design changes and with service teams to align on deployment schedules. You balance long-term automation investments against near-term operational demands, and you represent your team's work to senior leadership with clear data and crisp narratives. When critical incidents arise, you lead the response: marshaling the right people, driving root cause, and ensuring corrective actions land.