L2 Datacenter Support Engineer
Responsibilities
- Troubleshoot and maintain InfiniBand fabrics, including performance tuning, link issues, and topology validation.
- Act as the escalation point for L1 for complex infrastructure and hardware issues.
- Own and maintain accurate infrastructure modeling, IPAM, and source-of-truth data in NetBox.
- Own InfiniBand fabric management and advanced troubleshooting, using UFM for configuration, monitoring, and optimization of high-performance interconnects.
- Diagnose and resolve issues across GPU servers, networking, storage, and Kubernetes platforms.
- Perform deep hardware and system-level diagnostics (GPUs, PCIe, NICs, firmware, etc.).
- Support Kubernetes platform stability (node health, networking, scheduling issues).
- Contribute to automation of provisioning and operational workflows.
- Lead incident response, root cause analysis (RCA), and post-incident improvements.
- Collaborate with vendors and internal engineering teams on complex issues.
- Support infrastructure upgrades, firmware management, and capacity expansion.
Required Skills & Experience
- 3-6+ years of experience in infrastructure operations, datacenter engineering, or cloud platforms.
- Strong Linux systems expertise.
- Hands-on experience with bare metal provisioning systems and lifecycle management.
- Strong experience with InfiniBand networking (troubleshooting, performance, fabric management using UFM).
- Experience with IPAM/DCIM tools such as NetBox, and with Ethernet network configuration and validation using Verity.
- Solid understanding of datacenter networking, storage, and hardware architecture.
- Working knowledge of Kubernetes in production environments.
- Strong troubleshooting skills across hardware and distributed systems.
Requirements
- Experience with NVIDIA GPU platforms and accelerated computing infrastructure.
- Familiarity with automation tools (Terraform, Ansible, etc.).
- Exposure to OpenStack (optional).
- Experience with observability stacks (Prometheus, Grafana, ELK).
Success in this role
- Rapid resolution of complex infrastructure and networking issues.
- High reliability and performance of InfiniBand and GPU infrastructure.
- Scalable and efficient bare metal provisioning processes.
- Strong contribution to automation and operational excellence.
- Trusted escalation point and technical leader within the team.
Additional Information
We are looking for an experienced L2 Engineer to operate and support high-performance AI infrastructure platforms, including NVIDIA GPU clusters, InfiniBand fabrics, and Kubernetes-based IaaS environments. This role focuses on deep infrastructure expertise, ensuring performance, scalability, and reliability of the platform layer that powers AI workloads - without being responsible for the workloads themselves. You will play a key role in bare metal lifecycle management, advanced InfiniBand troubleshooting, and platform stability, working closely with engineering teams to operate cutting-edge infrastructure at scale.
Worked at Mirantis? Share your experience