Staff Site Reliability Engineer
Responsibilities
Reliability & Operations
- Design, build, and operate large-scale cloud infrastructure and production services.
- Participate in an on-call rotation supporting highly available customer-facing systems.
- Lead incident response efforts and drive post-incident reviews focused on systemic improvements.
- Define, measure, and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
- Partner with engineering teams to improve service availability, scalability, performance, and resilience.
- Continuously improve observability through metrics, logging, tracing, dashboards, and alerting.
Engineering & Automation
- Develop software, automation, and infrastructure using Go, Python, Terraform, and related technologies.
- Eliminate operational toil through automation, tooling, and platform engineering.
- Improve deployment safety and operational workflows through CI/CD and GitOps practices.
- Collaborate on modernizing existing workloads and aligning them with evolving platform capabilities.
- Build self-service platforms, operational guardrails, and automation that improve developer velocity while maintaining reliability and security.
Technical Leadership
- Lead complex reliability initiatives spanning multiple engineering teams.
- Guide engineers in adopting operational best practices and reliability engineering principles.
- Mentor engineers through technical collaboration, design reviews, incident analysis, and knowledge sharing.
- Influence architecture and operational decisions through data-driven recommendations and engineering expertise.
- Drive projects from conception through production rollout and long-term operational ownership.
Innovation
- Explore and apply AI-assisted engineering techniques to improve operational efficiency, incident response, troubleshooting, and automation.
- Identify opportunities to leverage emerging technologies to reduce toil and improve engineering productivity.
Our Tech Stack
- Infrastructure/Orchestration: Kubernetes (EKS/GKE), Terraform, Helm, Git, ArgoCD, GitOps
- Programming: Golang, Python
- Observability: Datadog, Splunk
- Data Stores: PostgreSQL, Redis, OpenSearch
Requirements
Technical Excellence
- Strong experience operating large-scale production services in AWS and/or GCP.
- Deep expertise with Kubernetes in production environments.
- Experience troubleshooting Kubernetes networking, storage, scheduling, scaling, and workload lifecycle issues.
- Extensive experience with Infrastructure as Code technologies such as Terraform and Helm.
- Strong software engineering skills in Golang and/or Python.
- Experience building automation and internal engineering platforms.
- Experience operating and troubleshooting distributed data platforms such as PostgreSQL, Redis, Open
Benefits
Additional Information
Secure Every Identity, from AI to Human
Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.
Get to know Okta
Okta is The World's Identity Company. We free everyone to safely use any technology, anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transforms how people move through the digital world, putting Identity at the heart of business security and growth.
At Okta, we celebrate a variety of perspectives and experiences. We are not looking for someone who checks every single box; we're looking for lifelong learners and people who can make us better with their unique experiences. Join our team! We're building a world where Identity belongs to you.
The Engineering Opportunity
We are looking for an experienced Staff Site Reliability Engineer to join Okta's Emerging Products Group (EPG). Our mission is to build highly reliable, scalable, and secure cloud services that our customers can trust. We embrace an automation-first mindset and continuously invest in platform engineering, observability, and operational excellence to enable our engineering teams to move quickly and safely.
This role is ideal for an engineer who enjoys solving complex technical challenges at scale, building automation, and improving the reliability of production systems. You will serve as a technical leader within the EPG SRE organization, partnering closely with software engineers, architects, and product teams to design, build, and operate world-class cloud services.
The ideal candidate exemplifies the philosophy of "if you have to do it more than once, automate it" and possesses a strong passion for continuous improvement, operational excellence, and software engineering.