Principal Software Engineer, At-Scale Reliability and Fleet Intelligence - CSP Engagements
Responsibilities
- Drive reliability work streams with CSP engineering teams - ensuring shared understanding of MTBI measurement methodology, failure classification, and health monitoring architecture
- Gather and synthesize CSP fleet reliability data - identify failure patterns that appear across multiple customers and champion improvements back into NVIDIA's firmware, driver, and hardware teams
- Define consistent MTBI measurement methodology that works across different CSP monitoring environments and operational practices
- Conduct fleet-scale failure pattern analysis using statistical methods (Pareto, survival analysis, Weibull) to classify failures as systemic, environmental, or configuration-specific
- Drive fleet health monitoring integration architecture - ensure NVIDIA's health agents, telemetry, and reporting align with CSP operational workflows and automation
- Define burn-in reliability test environment and cluster certification criteria in collaboration with quality teams, validating with customers that criteria are meaningful
- Collaborate with CSPs to ensure reliability-related integration work (health monitoring deployment, telemetry pipeline, alerting configuration) is complete ahead of at-scale launch
- Develop predictive failure models using fleet telemetry and validate their effectiveness in customer environments
What we need to see:
- 15+ years of experience in systems software at datacenter scale, or reliability engineering with focus on at-scale challenges.
- BS or MS in Computer Science, Electrical Engineering, Statistics, or related field (or equivalent experience)
- Deep expertise in multi-NUMA, rack-scale system software and firmware
- Statistical failure analysis methods: MTBF/MTBI calculation, Pareto analysis, root cause classification
- Experience with fleet-level telemetry and observability systems: time-series databases, anomaly detection, health scoring, event correlation
- Understanding of hardware failure modes in large-scale GPU/accelerator deployments - ability to classify and prioritize across compute, interconnect, memory, power, and thermal domains
- Experience defining or operating burn-in, stress testing, or certification frameworks for complex hardware systems
- Familiarity with predictive maintenance or anomaly detection approaches applied to fleet health data
- Customer obsession - genuine passion for understanding fleet reliability challenges at scale and translating them into actionable engineering priorities
- Strong communication - ability to present statistical reliability findings to both deep technical audiences and executive leadership
- Demonstrated success driving cross-functional improvements across hardware, firmware, and software teams without direct authority
Ways to stand out from the crowd:
- Experience in fleet reliability or hardware health at a leading CSP or hyperscaler
- Familiarity with NVIDIA GPU error taxonomy (Xid errors, NVLink error counters, thermal events, CPER records)
- Experience building health scoring or predictive failure models for accelerator or HPC infrastructure
- Background in defining MTBI/MTBF measurement standards or certification programs for complex multi-component systems
- Understanding of how reliability data flows from device firmware through telemetry pipelines to fleet-level dashboards and automated remediation
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD. You will also be eligible for equity and
Additional Information
We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for fleet-scale reliability, working directly with the engineering teams of key CSP / hyperscale customers to ensure NVIDIA platforms achieve target MTBI (Mean Time Between Interruptions) in production. In this role, you will augment NVIDIA's internal software/firmware and quality teams with a dedicated CSP-facing focus. You will drive work streams with CSP engineering teams to build a shared understanding of reliability software/firmware architecture and methodology, incorporate their fleet telemetry and failure data into NVIDIA's improvement priorities, and validate that reliability improvements measured in the lab translate to real customer environments. Your cross-CSP visibility lets you separate systemic architectural gaps from environmental or configuration-specific issues, a distinction that no single customer engagement could make alone.