Machine Learning Infrastructure Engineer
Full-time, On-site
Machine Learning, Python, PyTorch, Robotics
About the role
At Mind Robotics, we're building generalized physical AI: robotic systems capable of dexterous, adaptive, and reasoning-intensive work in real-world industrial environments. Our ability to iterate quickly on large-scale models depends on world-class ML infrastructure. We're looking for a Machine Learning Infrastructure Engineer to build the core systems that enable fast, reliable, and scalable model training, powering everything from experimentation to production deployment.
Responsibilities
- Design and implement scalable systems for training large ML models
- Enable efficient workflows for data ingestion, training, and iteration
- Develop and optimize distributed training systems across hundreds of GPUs
- Implement strategies for parallelization, sharding, and efficient compute utilization
- Improve training efficiency through techniques such as attention optimizations, kernel fusion, and memory management
- Partner closely with modeling teams to accelerate iteration speed and reduce training costs
- Build internal tools for experiment tracking, monitoring, and debugging
- Implement systems for tracking training performance, failures, and resource utilization
- Debug and resolve bottlenecks across the training stack
- Provide lightweight infrastructure support for deploying and running models in production environments
- Optimize inference performance and reliability where needed
- Support core cloud infrastructure needs for training workloads (without heavy DevOps overhead)
- Manage compute resources efficiently across training jobs
Requirements
- Strong experience building infrastructure for large-scale ML training
- Deep understanding of how modern LLM/VLM systems are trained and scaled
- Proven experience setting up and scaling distributed training across hundreds of GPUs
- Strong understanding of parallelization strategies (data, model, pipeline parallelism)
- Strong proficiency in Python programming
- Expert-level proficiency in PyTorch and/or JAX
- Strong understanding of techniques like attention optimization, kernel fusion, and efficient memory usage
- Experience supporting inference systems in production
- Familiarity with robotics or embodied AI workloads
- Experience building tools for experiment management and researcher productivity