Lead Engineer Site Reliability
Requirements
- Required:
- 6-10 years of experience in Site Reliability Engineering (or equivalent), with demonstrated technical leadership
- Proven ability to lead technical teams and drive complex projects to completion
- Expert-level knowledge of AWS, with experience designing large-scale, multi-region architectures
- Deep Kubernetes expertise, including advanced features, security, and production-scale operations
- Mastery of Infrastructure as Code using Terraform, with experience building shared platforms and frameworks
- Strong software engineering background with production experience in Python and/or Go
- Extensive experience with observability platforms (Datadog, Splunk) and implementing monitoring at scale
- Deep understanding of CI/CD principles and experience implementing enterprise-grade pipelines
- Proven track record leading major incidents and conducting effective postmortems
- Strong communication skills with ability to explain complex technical concepts to diverse audiences
- Experience mentoring engineers and building technical capabilities in teams
- Preferred:
- Previous technical leadership roles (Lead, Staff, or similar) in SRE or Operational Excellence
- Financial services industry experience with understanding of regulatory requirements
- Expert knowledge of complianc
Benefits
Additional Information
Our vision for the future is based on the idea that transforming financial lives starts by giving our people the freedom to transform their own. We have a flexible work environment and fluid career paths. We not only encourage but celebrate internal mobility. We also recognize the importance of purpose, well-being, and work-life balance. Within Empower and our communities, we work hard to create a welcoming and inclusive environment, and our associates dedicate thousands of hours to volunteering for causes that matter most to them. Chart your own path and grow your career while helping more customers achieve financial freedom. Empower Yourself.
As a Lead Site Reliability Engineer at Empower, you'll combine deep technical expertise with team leadership to drive reliability across our financial services platform. You'll lead other SREs in solving complex operational challenges, establish technical standards, and serve as a key advisor to engineering leadership on infrastructure strategy and reliability initiatives.
ESSENTIAL FUNCTIONS:
Technical Leadership & Strategy:
- Lead cross-functional reliability initiatives spanning multiple value streams, coordinating efforts across teams
- Define and evolve SRE best practices, tools, and methodologies for the organization
- Architect enterprise-scale infrastructure solutions that balance reliability, cost, performance, and security
- Establish Service Level Objectives (SLOs) and error budgets for critical services, using them to drive prioritization decisions
- Lead major incident response as incident commander, coordinating resolution across multiple teams
- Drive strategic improvements to observability, identifying gaps and implementing solutions at scale
- Design and implement disaster recovery plans for critical financial services infrastructure
- Evaluate and introduce new technologies and practices that improve team effectiveness
Operational Excellence:
- Lead the design of foundational infrastructure patterns using Terraform, creating reusable modules adopted across teams
- Architect multi-region, highly available AWS infrastructure supporting millions of daily transactions
- Design and implement sophisticated Kubernetes patterns, including multi-tenancy, security policies, and advanced scheduling
- Build comprehensive observability strategies using Datadog and Splunk, establishing standards for metrics, logging, and tracing
- Establish CI/CD standards and patterns, implementing pipeline-as-code and progressive delivery at scale
- Lead initiatives to implement chaos engineering practices and systematic reliability testing
- Drive FinOps initiatives, optimizing cloud spend while maintaining reliability targets
Team Leadership & Development:
- Lead a functional team of SREs (without direct reports) on projects and operational initiatives
- Mentor Senior, Intermediate, and Entry-level SREs, accelerating their technical growth
- Conduct design reviews and architecture discussions, providing expert guidance
- Lead training sessions on SRE practices, new technologies, and operational procedures
- Coordinate on-call schedules and drive improvements to reduce on-call burden
- Facilitate postmortems for high-severity incidents, ensuring organizational learning occurs
Collaboration & Influence:
- Partner with Engineering Managers and Directors to align SRE work with business priorities
- Collaborate with Security teams on implementing zero-trust architecture and compliance controls
- Work with Product teams to balance feature velocity with reliability requirements
- Influence architectural decisions across the engineering organization
- Represent SRE in cross-functional initiatives and planning discussions
- Evangelize SRE culture and practices across Empower