Principal Engineer - Perf and Benchmarking
ExternalPrepare for this interview
EliteAI-generated questions, company research, and talking points tailored to this role
Responsibilities
- Tooling & Automation - Build CI/CD pipelines and K8s controllers/operators to schedule benchmarks at scale; integrate with observability stacks (Prometheus, Grafana, OpenTelemetry) and results warehouses. Implement supply-chain integrity for benchmark artifacts (SBOMs, Cosign signatures).
Requirements
- 10+ years building distributed systems or HPC/cloud services, with deep expertise on large-scale ML training or similar high-performance workloads.
- Proven track record of architecting or building planet-scale data systems (e.g., telemetry platforms, observability stacks, cloud data warehouses, large-scale OLAP engines).
- Deep understanding of GPU performance (CUDA, NCCL, RDMA, NVLink/PCIe, memory bandwidth), model-server stacks (Triton, vLLM, TensorRT-LLM, TorchServe), and distributed training frameworks (PyTorch FSDP/DeepSpeed/Megatron-LM).
- Proficient with Kubernetes and ML control planes; familiarity with SUNK, Kueue, and Kubeflow in production environments.
- Excellent communicator able to interface with executives, customers, auditors, and OSS communities.
- Experience with time-series databases, log-structured merge trees (LSM), or custom storage engine development.
- Experience running MLPerf submissions (Inference and/or Training) or equivalent audited benchmarks at scale.
- Contributions to MLPerf, Triton, vLLM, PyTorch, KServe, or similar OSS projects.
- Experience benchmarking multi-region fleets and large clusters (thousands of GPUs).
- Publications/talks on ML performance, latency engineering, or large-scale benchmarking methodology.
Benefits
Additional Information
CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com . About this role We're looking for a Principal Engineer to be the technical lead of CoreWeave's Benchmarking & Performance team. You will be responsible for our planet-scale performance data warehouse: Ingesting, storing, transforming and analyzing performance events in all the data centers across our global infrastructure. You will also be an integral part of achieving industry-leading end-to-end performance benchmarking publications: If MLPerf (Training & Inference), Working closely with NVIDIA (Megatron-LM, TensorRT-LLM & DGX cloud) and the open-source community (llm-d, vLLM and all popular ML frameworks) speak to you, come help us demonstrate CoreWeave's performance reliability leadership in the field.
Your Match
How well this role fits your profile.
Company Intel
What employees say
Worked at CoreWeave? Share your experience