Senior Researcher

CoreWeave · London, UK

Full-time · On-site · 1w ago

Feature Engineering · Forecasting · Leadership · Machine Learning · Observability · Python

Responsibilities

Research Leadership & Strategy
Contribute meaningfully to Monolith and CoreWeave's research direction by identifying high-leverage problems in GPU infrastructure analytics, cluster reliability, workload performance, scheduling, and utilisation.
Originate novel research directions for turning raw infrastructure telemetry into actionable intelligence, rather than simply applying standard machine learning or data science techniques.
Evaluate emerging methods across statistical modelling, machine learning, observability, optimisation, simulation, reinforcement learning, anomaly detection, and autonomous diagnostics, providing well-grounded technical judgement on which approaches are most likely to create real-world impact.
Champion rigour, reproducibility, and scientific integrity across research outputs, experiments, prototypes, and production validation.
Help establish a research foundation for understanding how large-scale GPU systems behave, why workloads underperform, where bottlenecks emerge, and how reliability can be improved proactively.
Technical Depth & Execution
Lead the design and development of sophisticated statistical, machine learning, and optimisation systems for large-scale GPU infrastructure telemetry, including compute, networking, storage, workload, and distributed systems data.
Develop advanced models and methodologies to optimise GPU utilisation, workload scheduling, infrastructure efficiency, and system reliability.
Build models and methods for anomaly detection, failure prediction, distributed straggler detection, degraded workload identification, bottleneck diagnosis, and agentic root cause analysis.
Design experiments, analyse large-scale system telemetry, and prototype predictive and optimisation algorithms that directly inform production systems.
Drive technical decisions on difficult modelling problems involving noisy time-series data, high-dimensional telemetry, causal inference, uncertainty, robustness, generalisation, and out-of-distribution behaviour.
Explore simulation, digital-twin, reinforcement learning, and adaptive scheduling approaches where they can improve understanding or optimisation of GPU clusters and distributed training environments.
Take end-to-end ownership of research work from problem framing and exploratory analysis through prototype development, validation, and collaboration with engineering teams on production deployment.
Maintain deep personal technical expertise; remain a hands-on contributor in Python and modern scientific computing / machine learning tooling.
Organisational Influence & Collaboration
Serve as a strong technical voice within the research organisation, helping shape how Monolith approaches complex infrastructure intelligence problems.
Work closely with Fleet, Infrastructure, AI Pl

Additional Information

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com.

We're proud to be a Living Wage accredited Employer.

Role Overview

We are looking for a Senior Researcher to join Monolith's Research team, now part of CoreWeave. This is a high-impact, high-ownership role for a researcher who combines deep technical expertise in machine learning, statistical modelling, optimisation, and large-scale systems data with the ability to take complex, ambiguous problems from first principles through to production.

The Monolith Data Science team is building a layered reliability and intelligence platform that shifts CoreWeave from reactive troubleshooting to proactive reliability engineering. The platform spans telemetry ingestion, feature engineering, anomaly detection, failure prediction, distributed straggler detection, performance modelling, workload optimisation, and agentic root cause analysis.

You will work closely with Fleet, Infrastructure, AI Platform, engineering, product, and client-facing teams to improve cluster reliability, increase effective utilisation, reduce MTTR, protect uptime, and turn large-scale GPU infrastructure telemetry into measurable operational and commercial impact.

This is not a traditional data science role focused on dashboards, business metrics, or standard forecasting. The role sits at the intersection of applied research, GPU infrastructure, high-performance computing, distributed systems, reliability engineering, telemetry, optimisation, and Physical AI. It demands rigorous scientific thinking, strong execution, and comfort working in a high-ambiguity environment where the right problem framing is often as important as the final model.
