Senior Systems Engineer
About the role
We are building next-generation AI infrastructure from the ground up. Our mission is to deliver highly performant, reliable, and scalable GPU clusters purpose-built for large-scale AI training and inference. As a startup, we operate with urgency, ownership, and a bias toward action. We are assembling the foundational infrastructure that will power frontier AI workloads, and we're looking for engineers who want to build it from zero to scale.
We are hiring a Senior Deployment Engineer to lead hands-on bringup of GPU clusters across our data center environments. You will own the execution of node, rack, and network deployment, ensuring clusters are validated, performant, and production-ready.
This role is deeply technical and execution-focused. You will be in the details: cabling racks, validating firmware, tuning fabrics, and debugging performance, while helping us build repeatable processes as we scale.
Responsibilities
Cluster Deployment & Bringup
- Execute end-to-end bringup of GPU nodes and racks from installation to production readiness.
- Validate BIOS/BMC/firmware configurations and GPU health.
- Perform rack-level integration including power, cabling, and airflow validation.
- Bring up and validate high-speed network fabrics (InfiniBand, RoCE, 100-400G Ethernet).
Network & Performance Validation
- Configure and validate leaf/spine network connectivity.
- Run cluster-wide burn-in and stress testing.
- Validate GPU-to-GPU and node-to-node performance (NCCL, RDMA, GPUDirect).
- Troubleshoot hardware, firmware, and fabric-level issues.
Automation & Process
- Contribute to automation for provisioning and cluster validation.
- Improve deployment playbooks and documentation.
- Identify reliability issues early and drive corrective actions.
- Help turn ad hoc deployments into repeatable systems.
Cross-Functional Collaboration
- Work closely with networking, systems software, and data center teams.
- Coordinate with hardware vendors to resolve bringup issues.
- Support rapid capacity expansion as we scale.
Requirements
Required
- 5-8+ years in infrastructure engineering, hardware deployment, or data center operations.
- Hands-on experience deploying GPU servers (HGX/DGX or similar platforms).
- Experience with high-speed networking (InfiniBand, RoCE, Ethernet fabrics).
- Strong Linux systems knowledge.
- Experience troubleshooting distributed systems performance issues.
- Comfortable working onsite in data center environments as needed.
Strongly Preferred
- Experience in AI/ML infrastructure or HPC environments.
- Familiarity with NCCL, CUDA, RDMA.
- Automation experience (Python, Ansible, Terraform, Bash).
- Experience in high-density power and cooling environments.
What Success Looks Like
- Clusters are brought online quickly and correctly.
- Performance baselines meet or exceed expectations.
- Deployment processes become faster and more reliable over time.
- You help build the foundation for scaled infrastructure growth.
For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.
Additional Information
Senior Deployment Engineer - GPU Infrastructure Bringup
Location: United States (Travel Required)
Team: Infrastructure
Reports to: Head of Infrastructure