Senior AI Infrastructure & Networking Engineer
Responsibilities
- AI Fabric Architecture & Deployment: Design, build, and optimize high-throughput, ultra-low-latency East-West compute networks using NVIDIA Spectrum-X Ethernet platforms (Spectrum-4 ASICs) and/or NVIDIA Quantum-X800 InfiniBand switching.
- Performance Tuning for Lossless Networking: Configure and fine-tune critical Layer 2/3 lossless transport mechanisms, including Remote Direct Memory Access over Converged Ethernet (RoCE v2), Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and Data Center Quantized Congestion Notification (DCQCN).
- Rail-Optimized Topologies: Implement and maintain non-blocking, multi-plane, full fat-tree network topologies mapped to 8-GPU server architectures to maximize collective communication performance via NCCL (NVIDIA Collective Communications Library).
- SmartNIC & DPU Management: Deploy and manage high-speed compute network interfaces, including ConnectX-8 SuperNICs (800 Gb/s) and BlueField-3 DPUs for isolated infrastructure management, storage acceleration, and multi-tenant security.
- Full-Stack Orchestration & Automation: Drive infrastructure-as-code deployments using Ansible and Terraform. Initialize and monitor the NVIDIA Network Operator within core Kubernetes orchestration layers.
- Telemetry & Validation: Utilize deep network telemetry tools such as NVIDIA NetQ and "What Just Happened" (WJH) to stream real-time switch diagnostics. Conduct line-rate cluster benchmarking using ib_write_bw and ib_write_lat to eliminate physical layer bottlenecks.
Required Technical Skills & Qualifications
- Education: Bachelor's or Master's degree in Computer Science, Network Engineering, Systems Engineering, or a related technical discipline.
- AI Networking Expertise: Proven track record of configuring RoCE v2, adaptive routing, and traffic optimization specifically for machine learning/HPC workloads.
- Hardware Familiarity: Deep understanding of high-density scale-up and scale-out systems (NVIDIA HGX/DGX architectures, PCIe switching, OSFP/QSFP112 optical and copper assemblies).
- Software & Cluster Management: Experience with cluster deployment suites like NVIDIA Mission Control, Base Command Manager, Run:ai, or similar enterprise MLOps frameworks.
- Routing Protocols: Strong proficiency with advanced datacenter networking protocols, particularly eBGP IPv6 unnumbered underlays and EVPN/VXLAN overlays for multi-tenant isolation.
- Cabling & Layer 1 Validation: Experience managing complex structured fiber trunking (MPO-12/MPO-24 APC) and executing layer-1 diagnostics (ibdiagnet, iblinkinfo).
Preferred Certifications
- NVIDIA Certified Professional - AI Networking (NCP-AIN) (Highly Preferred)
- NVIDIA Certified Expert - Cloud End-to-End Fabric (NCE-CEF)
- Advanced networking tracks from major vendors (e.g., CCIE, JNCIE, or Nokia Service Routing Architect) combined with proven data center fabric experience.
Additional Information
We are seeking an expert Senior AI Infrastructure & Networking Engineer to lead the architecture, deployment, and optimization of our next-generation AI Factory. In this role, you will be responsible for building and scaling high-density GPU supercomputing clusters (up to 512+ nodes) featuring NVIDIA Blackwell Ultra B300 systems. You will bridge the gap between heavy physical infrastructure (liquid cooling and busbar power) and advanced logical fabrics, ensuring predictable, line-rate, lossless transport for massive generative AI training and reasoning workloads.
Company: GENESIS NETWORKS PTE LTD