Senior Engineering Manager, Kernel and Virt
Responsibilities
- Team Leadership & Development: Recruit, mentor, and coach engineers on the team, fostering a culture of ownership, technical excellence, and continuous improvement.
- Execution & Delivery: Own the team's project execution, translating high-level business goals into clear technical roadmaps, measurable milestones, and successful, on-time delivery.
- Cross-Functional Partnership: Collaborate with Product Management, other engineering teams, and key stakeholders to align priorities, manage dependencies, and communicate progress and risks.
- Operational Health: Ensure the production health, stability, and on-call rotation of all services owned by the Inference Orchestration team.
- Strategic Architecture & Planning: Define the technical roadmap and oversee the architecture of high-throughput scheduling systems for massive Kubernetes clusters (1,000+ nodes, 10,000+ pods), focusing on scalability techniques like multi-scheduler architectures and batch dispatching.
- Maximize GPU Utilization: Eliminate GPU waste in multi-tenant environments by implementing fractional GPU allocation, leveraging mechanisms like KAI-Scheduler's Reservation Pods or hard-isolation tools like HAMi, and configuring time-based fairshare scheduling to balance over-quota pool access.
- Orchestrate Complex Inference: Implement and manage disaggregated AI inference pipelines using frameworks like NVIDIA Grove, coordinating multicomponent deployments (e.g., prefill leaders, decode workers, KV routers) with multilevel autoscaling and explicit startup ordering.
- Optimize Placement & Topology: Deploy topology-aware scheduling to align pod placement with physical hardware dimensions, such as NVLink connections, PCIe lanes, and NUMA nodes, minimizing communication latency for multi-GPU operations.
- Platform Performance & Reliability: Drive initiatives to enhance overall cluster performance, including optimizing scheduling latency, API server load, and implementing fault tolerance mechanisms like Checkpoint/Restore for long-running AI training jobs.
- Manage AI Storage & Fault Tolerance: Orchestrate efficient model weight distribution using OCI Image Volumes and implement Checkpoint/Restore capabilities (via CRIU and NVIDIA cuda-checkpoint) for long-running training fault recovery.
- Security and Isolation: Define and enforce security best practices for AI workloads, ensuring multi-layered isolation environments and agent sandboxes are deployed to safely execute untrusted code (e.g., using Kata Containers, gVisor, or microVMs).
Requirements
- Engineering Leadership Experience: Proven track record of managing and growing high-performing engineering teams, preferably within a distributed systems or infrastructure domain.
- Kubernetes and AI Infrastructure Domain Knowledge: Deep expertise in Kubernetes at scale and a strong foundational understanding of the core challenges in AI workload orchestration, scheduling, and resource management.
- Hardware-Aware Optimization: Strategic knowledge of GPU architectures (NVIDIA and/or AMD), interconnects (like NVLink), and hardware topology and their direct impact on AI training and inference performance.
- Resource and Cost Management: Experience in balancing performance against cost, applying principles like Dominant Resource Fairness (DRF), and directing strategies for maximizing cluster efficiency.
- Systems Engineering & Security: Familiarity with concepts in container runtime internals, system isolation, and security contexts to manage risk in shared infrastructure.
- AI/ML Serving Architectures: Strong understanding of modern LLM serving architectures, disaggregation patterns, and common serving engines (e.g., vLLM, Triton, SGLang).
- Observability and SLOs: Expertise in defining, tracking, and operationalizing deep infrastructure and inference metrics (e.g., TTFT, TPOT) to drive performance improvements and meet service level objectives.
- Compensation Range: $200,800 - $251,000
- *This is a hybrid role
- JR: 2026-7649
- Why You'll Like Working for DigitalOcean
- We innovate with purpose. You'l
Benefits
Additional Information
Dive in and do the best work of your career at DigitalOcean. Journey alongside a strong community of top talent who are relentless in their drive to build the simplest scalable cloud. If you have a growth mindset, naturally like to think big and bold, and are energized by the fast-paced environment of a true industry disruptor, you'll find your place here. We value winning together while learning, having fun, and making a profound difference for the dreamers and builders in the world.
We are seeking a Senior Engineering Manager to lead our Inference Orchestration team, driving the strategy, execution, and scaling of our Kubernetes-based AI infrastructure. You will be responsible for balancing business needs with technical excellence, ensuring high throughput, optimal GPU utilization, and robust fault tolerance for our next-generation disaggregated inference, fine-tuning, and training workloads.