Additional Information
The Software Development Engineer II will design, build, and maintain cloud-based provisioning workflows for NVIDIA GB200/GB300 UltraServers, orchestrating complex multi-asset systems from infrastructure handoff to production delivery. This role requires expertise in AWS services, system architecture, and cross-functional collaboration with Manufacturing, Operations, and Program Management teams to deliver AI/ML infrastructure.
Key job responsibilities
The Software Development Engineer II (SDE II) on the EC2 UltraServer Delivery team is responsible for delivering production-ready GB200 and GB300 UltraServers to customers by orchestrating complex multi-asset provisioning workflows. The core responsibilities are as follows:
System Design & Architecture
* Design and architect solutions that span Manufacturing, Operations, and Program Management
* Work in environments where the technology strategy is defined but the solution design is not
* Build solutions that are stable, logical, testable, and efficient, making trade-off decisions independently
* Investigate and develop design concepts to frame solution sets at an application and product level
Software Development
* Build cloud-based solutions using AWS native services for scaling infrastructure frameworks
* Write high-quality, maintainable code with proper testing and code reviews
* Develop and maintain the Multi-Asset Provisioning Service workflows for GB200 and GB300 UltraServer hosts
* Implement automation for hardware testing and cable validation processes
* Create observable systems with appropriate metrics and alarming
Operational Excellence
* Execute and monitor UltraServer provisioning workflows
* Troubleshoot workflow failures and coordinate with downstream teams
* Focus on operational excellence by identifying problems and proposing solutions that improve manufacturing software
Hardware & Software Integration
* Work with hardware and software integrations specific to GPU clusters and AI/ML training systems
* Manage network partition configurations for multi-node GPU clusters
* Handle firmware validation and consistency checks across asset groups
Team Collaboration
* Collaborate with customers and stakeholders to convert business needs into technical designs
* Participate in code reviews and technical assessments
A day in the life
This is a hands-on position in which you will own the work end to end: requirements gathering, designs, design reviews, implementations, code reviews, incremental feature launches, operations, mentoring, and continuous improvement.