Senior Site Reliability Engineer
About the role
Collaborative autonomy is how self-tasking teams of machines will solve hard human problems, and HavocAI is an unquestioned leader in collaborative autonomy. We set the standard for autonomous surface vessels for a wide range of defense and commercial maritime missions. Success requires us to grow quickly, and we're looking for teammates who are passionate about solving hard problems, about pushing the envelope, and about preventing conflict and saving lives. Ambition is welcome to apply within.
HavocAI is seeking a Senior Site Reliability Engineer with 7+ years of experience designing, operating, and scaling highly reliable distributed systems. In this role, you will serve as a key technical leader within the Cloud Platform team, responsible for ensuring the availability, performance, and resilience of mission-critical services supporting autonomy, simulation, and data-intensive workloads.
You will work closely with Cloud Platform, DevOps, Data Engineering, and Autonomy teams to establish reliability standards, improve operational maturity, and build systems that scale safely under real-world conditions. The ideal candidate is deeply technical, calm under pressure, and experienced in owning reliability outcomes end to end.
Responsibilities
- Reliability Engineering & Architecture
  - Design and evolve reliability architecture for distributed and cloud-hosted systems
  - Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
  - Partner with platform and application teams to design systems for reliability, scalability, and operability
  - Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines
  - Establish reliability patterns that support autonomy, simulation, and mission-critical cloud workloads
- Operations & Incident Management
  - Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
  - Conduct root cause analysis for complex production incidents and drive long-term corrective actions
  - Improve operational readiness through runbooks, automation, resilience testing, and production-readiness reviews
  - Reduce operational toil through tooling, automation, and process improvements
  - Help build a culture of ownership, accountability, and continuous improvement across production systems
- Observability & Performance
  - Design, implement, and maintain observability systems for metrics, logging, tracing, alerting, and service health
  - Ensure services and data pipelines are observable, debuggable, and performant in production
  - Drive performance analysis and tuning across infrastructure, application, and service layers
  - Improve alert quality, reduce noise, and ensure operational signals are actionable
  - Partner with engineering teams to define meaningful reliability and performance metrics
- Automation & Platform Collaboration
  - Build automation to improve system reliability, deployment safety, and recovery processes
  - Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns
  - Support and improve Kubernetes-based environments and containerized workloads
  - Contribute to infrastructure-as-code practices and platform automation
  - Help define operational standards for cloud infrastructure, deployment workflows, and production services
- Security & Resilience
  - Collaborate with security teams to ensure secure and resilient system design
  - Participate in disaster recovery planning, backup strategy, and resilience testing
  - Maintain strong operational practices around access control, secrets management, change management, and production access
  - Support secure operations for systems that may serve defense, autonomy, or mission-sensitive use cases
Requirements
- 7+ years of experience in SRE, infrastructure engineering, systems engineering, or related roles
- Strong experience operating large-scale distributed production systems
- Deep understanding of Linux systems, networking, cloud infrastructure, and distributed systems fundamentals
- Hands-on experience with Kubernetes and container orchestration
- Programming or scripting experience in Go, Python, or similar languages
- Experience designing and operating observability systems for production environments
- Proven ability to lead incident response and drive reliability improvements
- Strong communication skills and ability to collaborate across engineering teams
- Ability to operate calmly and effectively under pressure
- Must be a U.S. Citizen and eligible to obtain a U.S. Government security clearance if required
- Experience supporting autonomy, robotics, simulation, real-time systems, or data-intensive platforms
- Familiarity with AWS and large-scale cloud infrastructure
- Experience with chaos engineering, fault injection, or resilience testing
- Knowledge of CI/CD systems and progressive delivery practices
- Experience working in high-reliability, safety-critical, d