Software Engineer, Kubernetes

CoreWeave · Livingston, NJ
Full-time · On-site · 1w ago
Bash · Datadog · Excel · Grafana · Incident Response · Kubernetes

About the role

  • Build, operate, and scale Kubernetes-based production infrastructure that delivers CoreWeave's products with high reliability and performance.
  • Develop automation, tooling, and infrastructure as code in Go and other infrastructure-focused languages to enable zero-touch operations, rapid recovery, and seamless deployments.
  • Design, implement, and maintain monitoring, alerting, and observability solutions, leveraging the Grafana ecosystem and related tools, to proactively identify and resolve production issues.
  • Drive incident response efforts, participate in on-call rotations, and lead root cause analysis to prevent recurrence and improve incident handling processes.
  • Partner with internal and cross-functional teams to ensure platform capabilities meet rigorous operational requirements and customer SLAs.
  • Engineer for resiliency, implementing best practices for redundancy, fault tolerance, and disaster recovery across complex distributed systems.
  • Advocate for security, reliability, and performance improvements throughout the stack, continuously seeking opportunities to strengthen operational standards.
  • Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimize AI workload performance and resource utilization at scale.

Requirements

  • Bring 3+ years of experience in production engineering, SRE, or large-scale infrastructure/platform roles.
  • Are knowledgeable in Kubernetes administration, container orchestration, and microservices architectures, with a bias for automating every aspect of operations.
  • Have a proven track record managing high-uptime, customer-facing systems in a fast-moving environment, with experience delivering measurable improvements in reliability and performance.
  • Possess experience in monitoring, observability, and incident management using tools like Prometheus, Grafana, Datadog, Splunk, Loki, or VictoriaMetrics.
  • Demonstrate knowledge in infrastructure-focused programming, especially in Go and Bash, and hold a deep understanding of Linux systems.
  • Excel at troubleshooting complex production issues, from system failures to performance bottlenecks, and approach problems methodically with strong analytical skills.
  • Communicate clearly across technical and non-technical stakeholders, proactively sharing knowledge and advocating for operational best practices.
  • Are passionate about building systems that are not just functional, but robust, self-healing, and easy to operate at scale.
  • Take pride in driving continuous improvement and helping set high standards for operational excellence and team culture.

What Success Looks Like

  • You deliver stable, robust, and highly available systems that consistently meet or exceed uptime and performance targets.
  • You champion initiatives that drive automation, reduce operational toil, and increase the efficiency of incident response.
  • You actively contribute to a blameless culture of learning, mentoring others in operational best practices and production engineering principles.
  • You help CoreWeave maintain industry leadership through flawless execution in supporting demanding, AI-powered workloads at scale.

Wondering if you're a good fit? We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams, even if you aren't a 100% skill or experience match.

Why CoreWeave?

  • Be Curious at Your Core
  • Act Like an Owner
  • Empower Employees
  • Deliver Best-in-Class Client Experiences
  • Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems. As we get set for takeoff, the organization's growth o

Additional Information

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com.

