Operations Engineer, Fleet Reliability
External | $83K–$110K/yr | Full-time | On-site | 2w ago
Bash, Documentation, Fiber, Grafana, Kubernetes, Linux
Responsibilities
- Configure and maintain large-scale high-performance supercomputing clusters running state-of-the-art GPUs
- Troubleshoot hardware and software issues; escalate and coordinate as needed with data center, network, hardware and platform teams to drive resolution
- Monitor and analyze system performance and take appropriate remediation actions for cloud health
- Approach your work with flexibility and optimism, anticipating shifting business and technical priorities
- Create and maintain documentation of team processes, knowledge and best practices for system management
- Think critically about your day-to-day work and work collaboratively to improve team processes and efficiency
- Participate in on-call rotations, which include after-hours and weekend work
Requirements
- Strong understanding of Linux system administration and internals
- Ability to troubleshoot hardware and software issues and perform system maintenance tasks consistently and reliably
- Software development or scripting languages (Bash, Python, PowerShell, etc.)
- 2+ years of experience troubleshooting or administering data center or on-prem infrastructure (servers, storage, network, or a mix)
- Grafana, Prometheus, PromQL queries, or similar observability platforms
- Data center environments including server racks, HVAC systems, fiber trays
- Kubernetes administration
- HPC - administering GPU-related workloads
- Bachelor's degree in a related field or equivalent experience
Wondering if you're a good fit? We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams, even if you aren't a 100% skill or experience match.
Why CoreWeave?
- Be Curious at Your Core
- Act Like an Owner
- Empower Employees
- Deliver Best-in-Class Client Experiences
- Achieve More Together
Benefits
The range we've posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. These include qualifications, experience, interview performance, and location.
In addition to a competitive salary, we offer a variety of benefits to support your needs. The benefits below reflect our US-based offerings; for roles in other locations, benefit...
- Health insurance
- Vision insurance
- Remote work options
- Equity / stock options
- Performance bonus
Additional Information
CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com.