AI Infrastructure Operations Engineer
About the role
We are seeking a highly skilled and experienced AI Infrastructure Operations Engineer to manage and operate our cutting-edge machine learning compute clusters. In this role you will work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power. You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives. This role requires a deep understanding of Linux-based systems and containerization technologies, along with experience monitoring and troubleshooting complex distributed systems. The ideal candidate is a dependable, proactive problem-solver with expertise in large-scale compute infrastructure and a commitment to customer success.
Responsibilities
- Manage and operate multiple advanced AI compute infrastructure clusters.
- Monitor and oversee cluster health, proactively identifying and resolving potential issues.
- Maximize compute capacity through optimization and efficient resource allocation.
- Deploy, configure, and debug container-based services using Docker.
- Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed.
- Handle engineering escalations and collaborate with other teams to resolve complex technical challenges.
- Contribute to the development and improvement of our monitoring and support processes.
- Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies.
Skills And Requirements
- 6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing.
- Strong proficiency in Python scripting for automation and system administration.
- Deep understanding of Linux-based compute systems and command-line tools.
- Extensive knowledge of Docker containers and of container orchestration and workload scheduling platforms such as Kubernetes (k8s) and Slurm.
- Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner.
- Experience with monitoring and alerting systems.
- Proven track record of owning challenges and driving them to completion.
- Excellent communication and collaboration skills.
- Ability to work effectively in a fast-paced environment.
- Willingness to participate in a 24/7 on-call rotation.
Preferred Skills And Requirements
- Experience operating large-scale GPU clusters.
- Knowledge of networking technologies such as Ethernet, RoCE, and TCP/IP.
- Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure).
- Familiarity with machine learning frameworks and tools.
- Experience with cross-functional team projects.
Location
- SF Bay Area.
- Toronto, Canada.
- Bangalore, India.
Why Join Cerebras
- Build a breakthrough AI platform beyond the constraints of the GPU.
- Publish and open source your cutting-edge AI research.
- Work on one of the fastest AI supercomputers in the world.
- Enjoy job stability with startup vitality.
- Enjoy a simple, non-corporate work culture that respects individual beliefs.
Read our blog: Five Reasons to Join Cerebras in 2026.
Apply today and become part of the forefront of groundbreaking advancements in AI!
Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe i
Additional Information
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra high-speed inference. Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.