Staff Machine Learning Infrastructure Engineer

Atoms · San Francisco, CA

$224K–$280K/yr · Full-time · On-site · Today

Cross-functional Collaboration · Integration Testing · Java · Kubernetes · Machine Learning

About the role

Atoms is building the machines that power the next era of progress. Over the last decade, software has transformed the digital world. But the physical world, where food is made, minerals are mined, goods are moved, and industries are run, remains far less intelligent, far less efficient, and far more constrained. We're changing that.

Atoms builds Physical AI: real-world robots for the industries that move civilization forward, starting with food, mining, and transport. Our systems are designed to understand, predict, and control the real world with precision, turning complex physical operations into something more reliable, more scalable, and more productive.

This work requires more than robotics. It requires deep integration across hardware, software, AI, operations, manufacturing, and real estate. We don't just build machines in a lab. We deploy them into real environments, operate them, learn from them, and improve them until they work at scale.

We are roboticists, engineers, operators, and builders. We believe the next great technology companies will transform not only information but also the physical systems that shape everyday life. If you want to work on hard problems with real-world impact, join us.

Responsibilities

Training Infrastructure: Design, implement, and scale repeatable machine learning infrastructure utilizing Kubernetes to support large-scale distributed GPU training of novel neural networks.
Distributed Computing & Orchestration: Leverage distributed compute frameworks to efficiently manage and execute a high volume of complex ML training jobs concurrently across large GPU clusters.
Experiment Tracking & MLOps: Integrate advanced model management and experiment tracking tools to provide researchers with deep observability into training metrics and run performance.
Data Engineering Pipelines: Build and optimize high-throughput data ingestion pipelines to seamlessly stream petabyte-scale multi-sensor vehicle logs into training environments.
Validation at Scale: Architect robust infrastructure for autonomous model validation and continuous integration testing, ensuring new vehicle policy releases are entirely regression-free.
Cross-Functional Collaboration: Partner closely with core robotics engineers and machine learning researchers to eliminate workflow bottlenecks and accelerate the deploy-to-vehicle lifecycle.

Requirements

8+ years of professional software engineering experience.
Strong backend systems programming skills with proficiency in Go, Python, Java, or a similar language (familiarity with Rust is a plus).
Proficiency with Kubernetes for container orchestration and building cloud-agnostic environments from scratch.
Experience implementing distributed ML compute frameworks (e.g., Ray) to coordinate large pools of GPUs for heavy, multi-node workloads.
Hands-on experience building MLOps pipelines, metadata tracking architectures, and model registries using platforms like MLflow.
Prior experience managing high-throughput data pipelines using modern distributed data engines to feed data-hungry neural network architectures.

Why join us

What else you need to know

The base salary range for this role is $224,000 - $280,000 per year.
Actual compensation will be det
