
Principal Software Engineer, Rack-Scale System Software - CSP Engagements

External
NVIDIA · Santa Clara, CA
Full-time · On-site · Today
API Design · Deep Learning · Documentation · Leadership · Observability

Responsibilities

  • Drive technical work streams with CSP engineering teams on rack-scale system software - ensuring they deeply understand fabric management, NVSwitch behavior, error handling and recovery policies, health telemetry APIs, and SW/FW-controlled recovery operations
  • Capture and synthesize CSP engineering feedback on rack-scale system software - health monitoring APIs, SW-driven serviceability workflows, firmware update orchestration, and error recovery behavior - and champion that feedback in NVIDIA's architecture decisions
  • Collaborate with multi-functional teams to ensure customer operational requirements are reflected in system software and firmware development
  • Identify cross-CSP patterns in rack-scale SW/FW issues, error handling behavior, and system configuration practices - drive documentation, tooling, and test strategy improvements as a result
  • Collaborate with execution teams on shift-left strategy - ensuring customer-side SW/FW integration work is identified early and completed ahead of hardware availability
  • Make critical technical decisions on rack-scale system SW/FW tradeoffs and mitigate execution risks through early engagement with CSP engineering teams

What we need to see

  • 15+ years of experience in system software, platform firmware, or large-scale distributed systems engineering
  • BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience)
  • Deep understanding of rack-scale system software challenges: multi-component coordination, error propagation, health monitoring, and serviceability / reliability
  • Experience with fabric management software, cluster management, or system-level orchestration frameworks
  • Familiarity with firmware architectures and update lifecycle management (multi-component update sequencing, rollback, recovery)
  • Understanding of error handling and recovery design patterns in distributed systems - fault isolation, retry policies, graceful degradation
  • Experience with health monitoring and telemetry systems: health scoring, event correlation, API design for fleet-level observability
  • Understanding of GPU or accelerator system software (drivers, device management, power management) is a strong plus
  • Customer obsession - genuine passion for understanding how CSPs operate sophisticated systems at fleet scale and simplifying their experience
  • Proven success providing technical leadership across organizational boundaries and influencing system software design without direct authority
  • Strong communication - ability to translate complex system software architecture into actionable mentorship for customer engineering teams

Ways to stand out from the crowd

  • Experience with NVIDIA NVSwitch, NVOS, or GPU fabric management software
  • Background in system software for large-scale clusters at a hyperscaler (cluster management, fleet orchestration, health platforms)
  • Experience crafting error handling and recovery frameworks for multi-component systems (hundreds or thousands of coordinating devices)
  • Familiarity with GPU or accelerator fleet operations - driver lifecycle, firmware rollout strategies, health-based scheduling
  • Understanding of how system software decisions impact serviceability, availability, and operational cost at fleet scale

Additional Information

We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for rack-scale system SW/FW, working with CSP engineering teams to ensure they can deploy, monitor, and operate these systems reliably at fleet scale. In this role, you will provide dedicated CSP-facing technical leadership alongside NVIDIA's cross-functional rack-scale system SW/FW engineering teams. Your focus is on the system-level software that manages, monitors, and recovers the rack as a whole - fabric management, GPU/NVSwitch error handling and recovery, health telemetry APIs, firmware update orchestration, and SW-driven serviceability. You will drive work streams with CSP engineering teams to build shared understanding of the architecture, incorporate their operational feedback, and ensure integration readiness.


Interested in this role?

Apply on the company's website.