Operations Engineer, Fleet Reliability

External
CoreWeave · New York, NY
$83K–$110K/yr · Full-time · On-site · 2w ago
Bash · Documentation · Fiber · Grafana · Kubernetes · Linux


Responsibilities

  • Configure and maintain large-scale high-performance supercomputing clusters running state-of-the-art GPUs
  • Troubleshoot hardware and software issues; escalate and coordinate as needed with data center, network, hardware and platform teams to drive resolution
  • Monitor and analyze system performance and take appropriate remediation actions for cloud health
  • Approach your work with flexibility and optimism, anticipating shifting business and technical priorities
  • Create and maintain documentation of team processes, knowledge and best practices for system management
  • Think critically about your day-to-day work and work collaboratively to improve team processes and efficiency
  • Participate in on-call rotations, which include after-hours and weekend work

Requirements

  • Strong understanding of Linux system administration and internals
  • Ability to troubleshoot hardware and software issues and perform system maintenance tasks consistently and reliably
  • Software development or scripting languages (Bash, Python, PowerShell, etc.)
  • 2+ years of experience troubleshooting or administering data center or on-prem infrastructure (servers, storage, network or a mix)
  • Grafana, Prometheus, PromQL queries or similar observability platforms
  • Data center environments including server racks, HVAC systems, fiber trays
  • Kubernetes administration
  • HPC - administering GPU-related workloads
  • Bachelor's degree in a related field or equivalent experience

Wondering if you're a good fit? We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams - even if you aren't a 100% skill or experience match.

Why CoreWeave?

  • Be Curious at Your Core
  • Act Like an Owner
  • Empower Employees
  • Deliver Best-in-Class Client Experiences
  • Achieve More Together

Benefits

The range we've posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate, which can include a variety of factors. These include qualifications, experience, interview performance, and location.

In addition to a competitive salary, we offer a variety of benefits to support your needs. The benefits below reflect our US-based offerings; for roles in other locations, benefit...

  • Health insurance
  • Vision insurance
  • Remote work options
  • Equity / stock options
  • Performance bonus

Additional Information

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com.


Interested in this role?

Apply on the company's website.
