Senior / Staff ML Engineer
About the role
Apple Batch is a fully managed platform within the Apple Data Platform that supports large-scale batch and ML workloads across Apple data centers and AWS/GCP. It orchestrates containerized workloads such as Spark, Ray, and LLM batch inference using YuniKorn/Kueue for advanced multi-cluster scheduling. The platform delivers org/team quota management, automatic node repair, end-to-end observability, strong security, and granular cost reporting.
As part of the Apple Batch team, you will have a meaningful role in designing, developing, and deploying high-performance systems that power large-scale batch processing and ML workloads daily. We are building critical infrastructure that provides scalable batch execution, intelligent Kubernetes-native job scheduling, multi-tenant resource management, and efficient workload orchestration for ML training, inference, and data processing workloads across multi-cloud and on-premises environments.
We are looking for a strong, enthusiastic engineer with deep expertise in Kubernetes scheduling and distributed systems. You will have significant individual responsibility and influence over critical platform services. You are someone with ideas and a real passion for building infrastructure that improves reliability, efficiency, and simplicity at Apple scale.
Responsibilities
Design, build, and deploy highly reliable, large-scale distributed systems for batch processing and ML infrastructure across public clouds and Apple data centers using Go, Java, or Python
Architect and operate Kubernetes-native scheduling systems such as Kueue and YuniKorn, building custom operators and CRDs to manage complex ML and data workloads
Implement advanced scheduling strategies including gang scheduling, topology-aware routing, bin-packing, and fair-share queuing to maximize GPU efficiency and hardware utilization
Build and manage secure, multi-tenant Kubernetes environments with strict resource isolation, quota governance, and priority-based preemption
Drive end-to-end observability, monitoring, and incident response practices to ensure high availability and fault tolerance of production systems
Collaborate with ML researchers, data engineers, SRE, and product teams to integrate scheduling solutions into Apple's broader AI and data platform ecosystem
Contribute to platform adoption by guiding internal customers, gathering requirements, and delivering impactful platform capabilities