Staff Technical Program Manager - Cluster Orchestration & Applied Training
About the role
CoreWeave is seeking a Staff Technical Program Manager to lead complex, cross-functional programs across Cluster Orchestration and Applied Training within our AI/ML Platform Services organization. In this role, you will partner with engineering, product, infrastructure, and research-adjacent teams to improve both how workloads run on the cluster and how users interact with the training platform built on top of it. That includes driving programs across orchestration systems such as Slurm-on-Kubernetes (SUNK), Kueue, and workflow integrations, while also helping scale the environments, tooling, and operational mechanisms that make training and evaluation workflows easier to use. This is a highly cross-functional role for a TPM who combines strong technical depth, excellent execution instincts, and the ability to bring structure and clarity to fast-moving infrastructure and AI platform initiatives.
Responsibilities
- Drive end-to-end program execution for cluster orchestration initiatives spanning workload scheduling, self-service provisioning, upgrade and migration flows, and platform integrations.
- Lead cross-functional programs that improve how AI training, evaluation, RL, and mixed workloads run across CoreWeave clusters.
- Partner with engineering and product leaders to define roadmap priorities and deliver measurable improvements in utilization, reliability, scalability, observability, and user experience.
- Drive delivery for applied training initiatives across pre-training, fine-tuning, reinforcement learning, sandbox environments, and evaluation systems.
- Coordinate dependencies across platform engineering, infrastructure, product, customer-facing teams, and ecosystem partners to ensure successful launches and clear operational ownership.
- Build program mechanisms for release readiness, rollout planning, risk management, stakeholder communication, and post-launch review.
- Establish success metrics, dashboards, and operating cadences to improve cluster efficiency, workload startup performance, time-to-research, and adoption of new platform capabilities.
- Create clarity across ambiguous technical programs by aligning stakeholders, surfacing tradeoffs early, and driving decisions to resolution.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- 8+ years of technical program management experience in cloud infrastructure, distributed systems, or AI/ML platforms.
- Experience leading large-scale cross-functional programs involving scheduling systems, cluster infrastructure, or ML platform capabilities.
- Strong technical fluency in Kubernetes, Slurm or comparable schedulers, distributed systems, and AI training workflows.
- Demonstrated ability to define program metrics and deliver measurable outcomes in performance, reliability, scale, or operational maturity.
- Excellent communication skills, with experience influencing engineering, product, and executive stakeholders.
Preferred
- Experience with orchestration and scheduling technologies such as Kubernetes, Slurm, Kueue, Ray, or similar systems.
- Familiarity with modern AI training and evaluation workflows, including pre-training, supervised fine-tuning, reinforcement learning, and experiment or sandbox environments.
- Understanding of GPU infrastructure, cluster capacity planning, multi-tenant execution, and distributed training tradeoffs.
- Experience building launch processes, release governance, dependency management, and operational review mechanisms in fast-scaling environments.
- Familiarity with AI developer and research tooling such as W&B, SkyPilot, or adjacent ecosystem platforms.
Wondering if you're a good fit? We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams, even if you aren't a 100% skill or experience match.
Why CoreWeave?
At CoreWeave, we work hard, have fun, and move fast! We're in an exciting stage of hyper-growth that you will not want to miss out on. We're not afraid of a litt
Additional Information
CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com.