Infrastructure Engineer (GPU & Compute)

Lightning AI · New York, NY
Full-time · Remote · 3w ago
Cross-functional Collaboration · Less · Linux · Observability · Python · PyTorch

About the role

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in. We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

Responsibilities

Systems, Image & Validation Infrastructure
  • Own and evolve systems for image management, deployment, and validation across bare-metal infrastructure
  • Run and maintain test clusters used for system validation, diagnostics, and bring-up
  • Validate firmware, drivers, and OS images across compute and GPU-enabled systems
  • Support hardware qualification efforts for next-generation platforms

GPU Diagnostics & Performance
  • Own GPU diagnostics and validation workflows across large-scale infrastructure
  • Diagnose and resolve complex issues across GPUs, drivers, OS, and hardware layers
  • Analyze system and GPU performance using tools such as NVIDIA DCGM
  • Identify failure patterns and drive improvements in system stability and validation coverage

Automation & Tooling
  • Build and maintain automation for provisioning, validation, and system bring-up
  • Develop Python-based tools and workflows to improve efficiency and reduce manual operational overhead
  • Improve the reliability, repeatability, and scalability of image pipelines and validation systems

Systems & Operations
  • Manage and operate Linux-based systems in production and validation environments
  • Manage virtualization technology
  • Support bare-metal provisioning workflows, including PXE and image-based systems
  • Interface with hardware management systems (e.g., IPMI, Redfish) for monitoring and debugging

Cross-Functional Collaboration
  • Partner with Infrastructure, Hardware, and Data Center teams on system bring-up and validation
  • Collaborate with platform and ML teams to ensure systems meet workload requirements
  • Contribute to best practices for provisioning, diagnostics, and lifecycle management of infrastructure

Requirements

Lightning AI is seeking a GPU & Compute Infrastructure Engineer to join our Infrastructure Engineering team.

We're flexible on location for this team. This role can work hybrid out of one of our US-based hubs (Seattle, NYC, or SF) or fully remote within the U.S., with occasional company and team offsites. We are not able to provide visa sponsorship for this position at this time.

Required Qualifications
  • 5+ years of experience in infrastructure engineering, systems engineering, or related roles
  • Strong Linux systems experience in production environments
  • Hands-on experience with GPU-enabled systems and tools such as NVIDIA DCGM
  • Familiarity with bare-metal provisioning and system bring-up workflows
  • Proficiency in Python or similar scripting/programming languages for automation
  • Ability to debug complex issues across hardware, OS, GPUs, and system software

Ideal Experience
  • Experience with high-performance interconnects (e.g., InfiniBand, NVLink)
  • Experience with PXE boot environments, LiveCD systems, or image-based provisioning workflows
  • Experience with hardware management interfaces such as iDRAC, IPMI, or Redfish
  • Data center operations experience, including working with physical hardware
  • Experience supporting AI/ML or HPC workloads at scale
  • Experience with GPU validation frameworks or large-scale hardware qualification processes

Benefits

The anticipated annual base sa…

  • Vision insurance
  • Remote work options
  • Flexible schedule
  • Equity / stock options
  • Performance bonus
