Infrastructure Engineer (GPU & Compute)

Lightning AI · New York, NY

Full-time · Remote · 3w ago

Cross-functional Collaboration · Linux · Observability · Python · PyTorch

About the role

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in. We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

Responsibilities

Systems, Image & Validation Infrastructure
Own and evolve systems for image management, deployment, and validation across bare-metal infrastructure
Run and maintain test clusters used for system validation, diagnostics, and bring-up
Validate firmware, drivers, and OS images across compute and GPU-enabled systems
Support hardware qualification efforts for next-generation platforms

GPU Diagnostics & Performance
Own GPU diagnostics and validation workflows across large-scale infrastructure
Diagnose and resolve complex issues across GPUs, drivers, OS, and hardware layers
Analyze system and GPU performance using tools such as NVIDIA DCGM
Identify failure patterns and drive improvements in system stability and validation coverage

Automation & Tooling
Build and maintain automation for provisioning, validation, and system bring-up
Develop Python-based tools and workflows to improve efficiency and reduce manual operational overhead
Improve the reliability, repeatability, and scalability of image pipelines and validation systems

Systems & Operations
Manage and operate Linux-based systems in production and validation environments
Manage virtualization technology
Support bare-metal provisioning workflows, including PXE and image-based systems
Interface with hardware management systems (e.g., IPMI, Redfish) for monitoring and debugging

Cross-Functional Collaboration
Partner with Infrastructure, Hardware, and Data Center teams on system bring-up and validation
Collaborate with platform and ML teams to ensure systems meet workload requirements
Contribute to best practices for provisioning, diagnostics, and lifecycle management of infrastructure

Requirements

Lightning AI is seeking a GPU & Compute Infrastructure Engineer to join our Infrastructure Engineering team.
We're flexible on location for this team. This role can work hybrid out of one of our US-based hubs (Seattle, NYC, or SF) or fully remote within the U.S., with occasional company and team offsites. We are not able to provide visa sponsorship for this position at this time.

Required Qualifications
5+ years of experience in infrastructure engineering, systems engineering, or related roles
Strong Linux systems experience in production environments
Hands-on experience with GPU-enabled systems and tools such as NVIDIA DCGM
Familiarity with bare-metal provisioning and system bring-up workflows
Proficiency in Python or similar scripting/programming languages for automation
Ability to debug complex issues across hardware, OS, GPUs, and system software

Ideal Experience
Experience with high-performance interconnects (e.g., InfiniBand, NVLink)
Experience with PXE boot environments, LiveCD systems, or image-based provisioning workflows
Experience with hardware management interfaces such as iDRAC, IPMI, or Redfish
Data center operations experience, including working with physical hardware
Experience supporting AI/ML or HPC workloads at scale
Experience with GPU validation frameworks or large-scale hardware qualification processes

Benefits

The anticipated annual base sa…
Vision insurance
Remote work options
Flexible schedule
Equity / stock options
Performance bonus
