Principal Engineer - Perf and Benchmarking

External

Coreweave · Sunnyvale, CA

$206K–$333K/yrFull-timeOn-site2w ago

CI/CDGrafanaKubernetesLeadershipLLMsObservability

Cover Letter Connect

Prepare for this interview

Elite

AI-generated questions, company research, and talking points tailored to this role

Responsibilities

Tooling & Automation - Build CI/CD pipelines and K8s controllers/operators to schedule benchmarks at scale; integrate with observability stacks (Prometheus, Grafana, OpenTelemetry) and results warehouses. Implement supply-chain integrity for benchmark artifacts (SBOMs, Cosign signatures).

Requirements

10+ years building distributed systems or HPC/cloud services, with deep expertise on large-scale ML training or similar high-performance workloads.
Proven track record of architecting or building planet-scale data systems (e.g., telemetry platforms, observability stacks, cloud data warehouses, large-scale OLAP engines).
Deep understanding of GPU performance (CUDA, NCCL, RDMA, NVLink/PCIe, memory bandwidth), model-server stacks (Triton, vLLM, TensorRT-LLM, TorchServe), and distributed training frameworks (PyTorch FSDP/DeepSpeed/Megatron-LM).
Proficient with Kubernetes and ML control planes; familiarity with SUNK, Kueue, and Kubeflow in production environments.
Excellent communicator able to interface with executives, customers, auditors, and OSS communities.
Experience with time-series databases, log-structured merge trees (LSM), or custom storage engine development.
Experience running MLPerf submissions (Inference and/or Training) or equivalent audited benchmarks at scale.
Contributions to MLPerf, Triton, vLLM, PyTorch, KServe, or similar OSS projects.
Experience benchmarking multi-region fleets and large clusters (thousands of GPUs).
Publications/talks on ML performance, latency engineering, or large-scale benchmarking methodology.

Benefits

The range we've posted represents the typical compensation range for this role. To determine actual compensation, we review the market ratVision insuranceEquity / stock optionsPerformance bonus

Additional Information

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com . About this role We're looking for a Principal Engineer to be the technical lead of CoreWeave's Benchmarking & Performance team. You will be responsible for our planet-scale performance data warehouse: Ingesting, storing, transforming and analyzing performance events in all the data centers across our global infrastructure. You will also be an integral part of achieving industry-leading end-to-end performance benchmarking publications: If MLPerf (Training & Inference), Working closely with NVIDIA (Megatron-LM, TensorRT-LLM & DGX cloud) and the open-source community (llm-d, vLLM and all popular ML frameworks) speak to you, come help us demonstrate CoreWeave's performance reliability leadership in the field.

Your Match

How well this role fits your profile.

Company Intel

What employees say

Worked at CoreWeave? Share your experience

Interested in this role?

Apply on the company's website.

Cover Letter Connect