Staff Cluster Infrastructure Engineer
About the role
Atoms is building the machines that power the next era of progress. Over the last decade, software has transformed the digital world. But the physical world, where food is made, minerals are mined, goods are moved, and industries are run, remains far less intelligent, far less efficient, and far more constrained. We're changing that.
Atoms builds Physical AI - real-world robots for the industries that move civilization forward, starting with food, mining, and transport. Our systems are designed to understand, predict, and control the real world with precision, turning complex physical operations into something more reliable, more scalable, and more productive.
This work requires more than robotics. It requires deep integration across hardware, software, AI, operations, manufacturing, and real estate. We don't just build machines in a lab. We deploy them into real environments, operate them, learn from them, and improve them until they work at scale.
We are roboticists, engineers, operators, and builders. We believe the next great technology companies will transform not only information but also the physical systems that shape everyday life. If you want to work on hard problems with real-world impact, join us.
Responsibilities
- We're seeking a Cluster Infrastructure Engineer to join our founding team and own the GPU compute fabric that trains our foundation models: optimizing the machines we have today, automating how we manage them, and laying the groundwork to scale as we grow.
- Manage and automate our GPU training clusters, including provisioning, bootstrapping, and lifecycle management.
- Automate bare-metal bring-up so new machines come online quickly and reliably as we add capacity.
- Build software abstractions that present a clean, unified interface to our training and simulation workloads.
- Work at the hardware/software boundary, where speed and reliability are critical, continuously raising the bar for automation and uptime.
- Run day-to-day operations: diagnose and resolve issues quickly when systems are under pressure.
- Design our infrastructure to scale smoothly as we grow from a smaller cluster of machines toward a larger fleet.
Requirements
- 6+ years of experience operating GPU compute on Kubernetes (or similar orchestration), with the judgment to scale it as demand grows.
- Strong programming and scripting skills in Python, Go, or similar.
- Familiarity with Infrastructure-as-Code tools such as Terraform or CloudFormation.
- Comfort with bare-metal Linux environments, GPU hardware, and networking.
- A bias toward automation, reliability, and operating critical systems well.
Why join us
What else you need to know
- The base salary range for this role is $224,000 - $284,000 per year.
- Actual compensation will be determined on an individual basis and may vary depending on experience, skills, and qualifications.
- Base salary is just one part of your total rewards package. You may also be eligible for equity awards and an annual performance-based bonus.
- Benefits Summary (USA Full-Time Exempt Employees):
- Medical, Dental, Vision, Disability, and Life Insurance
- Flexible Spending Account / Health Savings Account Options
- 401(k)
- Equity
- Sick Time, Unlimited Flexible Time Off, and Paid Holidays
- Paid Parental Leave
- Pre-Tax Commuter Benefit Plan
- Team lunch in our SoMa office every Tuesday and Thursday
- Benefits are subject to change at the company's discretion.
- Atoms accepts applications on an ongoing basis.
- Ready to join us as we serve those who serve others?