Staff Site Reliability Engineer
Responsibilities
- Reliability strategy and standards
  - Define and evolve reliability standards across critical services, including SLIs, SLOs, error budgets, production readiness, observability, incident response, and resilience patterns.
  - Establish a reliability operating model that clarifies service ownership, operational expectations, and decision-making around reliability tradeoffs for product engineering teams.
  - Use AI-assisted analysis to interpret reliability trends, detect weak operational signals, highlight capacity risks using pattern recognition, and generate actionable reliability scorecards for teams, clearly delineating where AI automates data gathering and insight generation.
- AI-first incident response and operational workflows
  - Overhaul key stages of the incident lifecycle to achieve faster detection, sharper triage, richer context retrieval, clearer communication, and stronger follow-through.
  - Command high-severity incidents as Incident Commander and reinforce the systems, tools, and practices that simplify incident management.
  - Design and implement workflows in which AI assists with alert correlation, signal enrichment, root-cause exploration, runbook retrieval, postmortem drafting, and corrective-action tracking.
  - Ensure AI-assisted incident workflows remain reviewable, auditable, and safe by requiring human verification at all critical steps and maintaining clear operational ownership with humans accountable for final decisions.
- On-call quality and toil reduction
  - Elevate on-call quality by silencing noisy alerts, automating repetitive investigations, and enabling responders to rapidly digest service context.
  - Build tools that gather context from systems like Datadog, CloudWatch, incident.io, Slack, runbooks, deployment history, and service metadata.
  - Transition teams from reactive paging to proactive reliability enhancement.
- Architecture and resilience
  - Steer service designs for graceful degradation, failure isolation, robust capacity planning, and operational safety throughout EarnIn's AWS environment.
  - Apply production data, incident learnings, and AI
Additional Information
About EarnIn
As one of the pioneers of earned wage access, EarnIn is passionate about building products that deliver real-time financial flexibility for people with the unique needs of living paycheck to paycheck. Our community members access their earnings as they earn them, with options to spend, save, and grow their money without mandatory fees, interest rates, or credit checks. We're fortunate to have an incredibly experienced leadership team, world-class funding partners like A16Z, Matrix Partners, DST, and Ribbit Capital, and a very healthy core business with a tremendous runway. We're growing fast and are excited to continue bringing world-class talent onboard to help shape the next chapter of our growth journey.
WHY this role exists
EarnIn's products must deliver speed, reliability, resilience, and trust to community members who depend on them. As EarnIn grows, we cannot rely on heroics, tribal knowledge, manual investigation, or isolated SRE expertise. We must embed reliability practices that scale across product engineering teams, enhance customer experience, and enable rapid shipping without increasing operational risk.
This role exists to lead EarnIn's next stage of reliability maturity: an AI-first operating model that uses AI to actively detect, investigate, respond to, learn from, and prevent production issues. As a Staff Site Reliability Engineer, you will guide technical direction for reliability across critical services, relying on AI-assisted workflows as key tools to reduce toil, speed incident response, improve production readiness, and enhance the operational quality of the engineering organization.
The base salary range for this full-time position is $252,000-$308,000, plus equity and benefits. Our salary ranges are determined by role, level, and location. This is a hybrid position in Mountain View (Headquarters) and will require in-office work 2 days a week.
HOW you will create impact
Act as a Staff-level technical leader: define standards, architect solutions, mentor engineers, influence cross-team efforts, and construct reusable systems and practices that multiply your impact.
You will embed AI-first thinking into reliability practices, leveraging AI to streamline alert triage, accelerate incident investigation, automate runbooks, retrieve operational knowledge, enhance postmortem quality, track corrective actions, quantify reliability with scorecards, detect capacity risks, and analyze architectural risks.
You will maintain human ownership and engineering judgment at the center of operations. AI aids engineers by speeding context gathering, clarifying reasoning, and reducing repetition, but it does not replace accountability.
Collaborate with SRE, product engineering, infrastructure, security, and leadership teams to embed reliability, making it easy to adopt and impossible to ignore.