Senior Staff Data Center Operations Engineer, GPU Hardware Architecture
Responsibilities
- Engineering Education & Design Support: Provide deep-dive technical guidance to the Data Center Engineering team on upcoming silicon (e.g., NVIDIA Blackwell/Rubin, AMD MI350/400). Ensure future facility designs for power, cooling, and rack-spacing are ready for 2000W+ per-chip densities.
- Operational Tooling & SOPs: Build the "Operational Blueprint" for the field. Create precision SOPs for high-stakes GPU repairs (e.g., baseboard swaps, manifold maintenance) and develop diagnostic tooling that allows Site Ops to identify NVLink flapping, PCIe degradations, or thermal throttling.
- Advanced Troubleshooting & RCA: Act as the Tier-3 escalation point for the most complex hardware failures in the production environment. Lead Root Cause Analysis (RCA) on systemic issues that span the boundary between hardware and facility environmental factors.
- Silicon Roadmap Authority: Maintain a 24-month forward-looking view of NVIDIA and AMD architectures. Educate internal stakeholders on how transitions in HBM4, interconnect speeds, and liquid-cooling will impact Crusoe's physical infrastructure.
- Vendor & VAR Technical Lead: Support the technical relationship with OEMs and VARs. Audit their hardware builds, review their technical bulletins, and ensure their hardware roadmaps align with Crusoe's operational and engineering standards.
Technical Requirements
- Silicon & Fabric Mastery: Expert-level knowledge of NVIDIA (Hopper/Blackwell/Rubin) and AMD (Instinct) architectures. Mastery of the physical and logical layers of NVLink, NVSwitch, and InfiniBand.
- Infrastructure Bridge-Building: Ability to translate "Silicon Data Sheets" into "Mechanical Engineering Requirements." You can explain how a GPU's specific heat-load profile affects CDU sizing and secondary loop design.
- Data-Driven Diagnostics: Proficient in Python, Go, or Bash to bui
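To illustrate the kind of translation from a silicon data sheet to a mechanical requirement described above, here is a minimal sketch that estimates the coolant flow a liquid-cooled rack needs from its heat load, using the standard relation Q = m_dot * cp * delta_T. It is not Crusoe's method. The function name, the 72-GPU rack size, the 10 K coolant temperature rise, and the plain-water fluid properties are all assumptions chosen for illustration; real secondary loops often run glycol mixes with a lower specific heat, which raises the required flow.

```python
def coolant_flow_lpm(heat_load_w, delta_t_k,
                     cp_j_per_kg_k=4186.0, density_kg_per_l=1.0):
    """Estimate the volumetric coolant flow (L/min) needed to remove heat_load_w.

    Assumes all heat goes into the liquid loop and that the coolant behaves
    like plain water (default cp and density); both are simplifications.
    """
    if heat_load_w < 0:
        raise ValueError("heat_load_w must be non-negative")
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    # Q = m_dot * cp * delta_T  ->  m_dot = Q / (cp * delta_T)
    mass_flow_kg_per_s = heat_load_w / (cp_j_per_kg_k * delta_t_k)
    # Convert mass flow (kg/s) to volumetric flow (L/min).
    return mass_flow_kg_per_s / density_kg_per_l * 60.0


# Hypothetical rack: 72 GPUs at 2000 W per chip (the per-chip density named
# in the posting), with an assumed 10 K rise across the rack.
rack_heat_w = 72 * 2000
rack_flow_lpm = coolant_flow_lpm(rack_heat_w, delta_t_k=10.0)
print(f"Rack heat load: {rack_heat_w / 1000:.0f} kW")
print(f"Estimated coolant flow: {rack_flow_lpm:.1f} L/min")
```

Halving the allowed temperature rise doubles the required flow, which is why a GPU's heat-load profile feeds directly into CDU sizing and secondary loop design.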
Additional Information
Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack - from electrons to tokens - to power the world's most ambitious AI workloads.
When you join Crusoe, you join a team that is building the future, faster. We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that - with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.
We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved - people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.
If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.
The Mission
Crusoe is building the world's most climate-aligned AI infrastructure. As we scale toward unprecedented power densities and liquid-cooled architectures, the gap between "Data Center Design" and "Silicon Reality" must be bridged. We are seeking a Senior Staff Data Center Operations Engineer, GPU Hardware Architecture to be the definitive technical authority on GPU platforms within the Data Center Engineering and Operations organization.
Your mission is twofold: act as the primary technical consultant to our Data Center Engineering team to ensure future facilities are built for next-gen silicon, and provide the Operations team with the specialized tooling, SOPs, and predictive strategies needed to maintain peak cluster health.
The Strategic Bridge
- For DC Engineering: You are the internal consultant. You translate upcoming GPU power/thermal roadmaps (NVIDIA/AMD) into design requirements for our next-generation facilities.
- For Site Operations: You are the "Technical Enabler." You develop the diagnostic tools and technical SOPs that enable field technicians to resolve complex GPU issues with surgical accuracy.
- For Sourcing: You are the "Technical Strategist." You define the technical sparing requirements and site-level inventory needs based on hardware failure telemetry.