Sr Manager, AI Systems Quality & Reliability, Annapurna AI Servers and Systems

External
Full-time, On-site, 1d ago
AWS, Machine Learning

About the role

Annapurna Labs is a wholly owned subsidiary of AWS, focused on developing custom silicon and servers including the Nitro, Graviton, and Trainium families of processors. Machine Learning Annapurna (MLA) functions as a vertically integrated team including software, firmware, hardware, and silicon design in a single organization. We are the Trainium Servers and Systems organization under MLA focused on Hardware Development, Software Development, Fleet Ops Systems, and Manufacturing, Quality, and Reliability. This position leads the Quality and Reliability Engineering function within the Manufacturing, Quality and Reliability team.

Requirements

  • Experience in root cause analysis and error correction, identifying changes to procedures and systems to implement long-term fixes and avoid repeating issues
  • Bachelor's degree in Reliability Engineering, Electrical Engineering, Mechanical Engineering, Materials Science, Physics, or related field
  • 10+ years of reliability or quality engineering experience with server compute platforms, semiconductor packaging, or high-volume electronics manufacturing
  • 5+ years of people management experience leading reliability, quality, or hardware engineering teams
  • Experience establishing quality management systems and reliability programs across multiple manufacturing vendors or sites
  • Experience leading teams across multiple locations in complex manufacturing/production environments
  • Experience working in a fast-paced, rapidly changing operations environment
  • Master's Degree or PhD in Reliability Engineering, Materials Science, or related field
  • Experience with liquid cooling reliability (cold plate, TIM, coolant loop failure modes)
  • Experience with advanced semiconductor packaging reliability (large-die BGA, warpage, solder joint fatigue)
  • Demonstrated ability to establish vendor quality standards and drive compliance across ODM/CM partners
  • Experience with reliability prediction methodologies (Weibull analysis, acceleration models, DFMEA)
  • Working knowledge of manufacturing quality tools (SPC, FMEA, 8D, DOE)
  • Strong executive communication skills - ability to translate technical reliability risk into business impact for senior leadership
  • Meets/exceeds Amazon's leadership principles requirements for this role
  • Meets/exceeds Amazon's functional/technical depth and complexity for this role

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Additional Information

AWS Annapurna Labs is seeking a Senior Manager of Quality & Reliability Engineering to lead the QnR function within the Trainium Manufacturing, Quality and Reliability organization. You will own quality and reliability outcomes for all Trainium AI server products - from component qualification through fleet performance - leading an engineering team across multiple concurrent chip and system generations. This role defines reliability strategy for liquid-cooled and air-cooled platforms at rapidly scaling volumes, builds quality systems across a multi-supplier global manufacturing base, drives fleet failure investigations to root cause, and establishes the reliability characterization capabilities required for next-generation technologies.

Key job responsibilities

  • Lead and grow a QnR engineering team, hiring, developing, and retaining top reliability and quality engineering talent.
  • Set technical direction for component qualification, reliability testing (HALT, HTOL, thermal cycling, QRV), DFMEA, and vendor quality standards across all Trainium programs.
  • Own quality and reliability outcomes end-to-end - from DFM input during design through fleet reliability performance.
  • Drive component specific manufacturing process quality improvements in partnership with Manufacturing Engineering, establishing incoming quality requirements and process controls at all supplier sites.
  • Build and maintain the reliability prediction and monitoring infrastructure - ensuring fleet performance is tracked against predictions, degradation trends are identified early, and corrective actions are data-driven.
  • Establish systematic failure analysis processes that connect field failures back to manufacturing history, supplier data, and component-level root cause for rapid containment.
  • Scale qualification processes to keep pace with multi-supplier, multi-generation production - including automation of qualification workflows and standardization of test methodologies across vendors.

