Senior Site Reliability Engineer

EarnIn · Mexico City, Mexico

Full-time · Remote · 3w ago

Datadog · Documentation · Incident Response · Leadership · Machine Learning · Observability

Responsibilities

Reliable system design
Engineer and refine systems focusing on resilience, graceful degradation, capacity, and understanding failure modes.
Collaborate with engineering teams to surface and address reliability risks during design, implementation, launch, and operation.
Transform services to be simpler to debug, easier to operate, and more predictable under failure.
SLOs, observability, and production signals
Define and measure SLIs and SLOs that reflect real customer experience.
Use observability tooling such as Datadog, CloudWatch, and APM, together with logs, metrics, and traces, to create signal-rich, noise-light operational visibility.
Elevate alerting quality so pages drive action, reach the right people, and warrant human intervention.
Incident lifecycle improvement
Direct and optimize incident response practices from detection and triage to communication, resolution, postmortems, and follow-up.
Extract incident learnings to implement lasting technical and process improvements.
Guide teams to reduce repeated incidents and cultivate a quieter on-call environment.
Operational tooling and AI-assisted leverage
Develop or refine tooling that eliminates toil, accelerates root-cause analysis, and streamlines infrastructure-as-code workflows.
Apply AI-assisted development and operational workflows responsibly to hasten investigations, enhance documentation, evolve runbooks, and automate repetitive engineering tasks.
Help teams adopt practical AI-assisted workflows where they measurably improve quality, speed, or operational clarity.
Mentorship and engineering enablement
Coach engineers in reliability practices, observability, incident response, and production ownership.
Write documentation and runbooks that reduce silos and make operational knowledge easier to use.
Articulate reliability tradeoffs persuasively to both technical and non-technical partners.

Requirements

Bachelor's or master's degree in Computer Science or equivalent industry experience.
4+ years of experience in SRE, Software Engineering, Infrastructure Engineering, or a related role.
Hands-on coding experience in Python, Go, or similar languages.
Experience designing, operating, and improving distributed systems in production.
Strong understanding of SLIs, SLOs, error budgets, MTTR, incident response, and how to use reliability data to drive decisions.
Strong observability and debugging skills using logs, metrics, traces, dashboards, and production signals.
Experience improving alert quality, runbooks, incident processes, and follow-through after production issues.
Ability to lead reliability initiatives across teams and mentor engineers toward better operational practices.

Benefits

Health insurance

Additional Information

About EarnIn
As one of the first pioneers of earned wage access, our passion at EarnIn is building products that deliver real-time financial flexibility for those with the unique needs of living paycheck to paycheck. Our community members access their earnings as they earn them, with options to spend, save, and grow their money without mandatory fees, interest rates, or credit checks. We're fortunate to have an incredibly experienced leadership team, combined with world-class funding partners like A16Z, Matrix Partners, DST, Ribbit Capital, and a very healthy core business with a tremendous runway. We're growing fast and are excited to continue bringing world-class talent onboard to help shape the next chapter of our growth journey.

WHY this role exists
EarnIn's community members rely on our products to perform consistently, respond promptly, and instill trust. Reliability goes beyond infrastructure; it shapes the customer experience. Product teams must deploy rapidly, but they must also develop systems that are observable, resilient, easy to operate, and safe to update. This role exists to elevate the reliability of EarnIn's production systems while empowering engineering teams to advance swiftly with assurance. As a Senior Site Reliability Engineer, you will spearhead reliability enhancements that fortify services, streamline incident management, and foster sustainable on-call practices.

HOW you will create impact
Act as a senior technical owner for reliability initiatives. Collaborate across systems, teams, and failure modes to strengthen how EarnIn designs, observes, deploys, and manages production services. You will combine software engineering fundamentals with reliability thinking. Rather than just responding to incidents, you will apply lessons learned to improve systems, alerts, runbooks, and ownership, reducing repeat failures.
Leverage AI-assisted engineering practices, such as machine learning monitoring tools and anomaly detection systems, to minimize operational toil, accelerate investigations, refine infrastructure workflows, and enable teams to analyze production behavior more effectively. Mentor engineers and coach product teams to embed reliability practices that clarify, streamline, and safeguard their services.
