Sr. Site Reliability Engineer

St-labs · New York, NY
$160K–$250K/yr · Full-time · On-site · Posted 3mo ago
AWS · Azure · CI/CD · Design Systems · Docker · GCP

About the role

We're looking for a Senior Site Reliability Engineer (SRE) to own the reliability, performance, and scalability of our AI-native platform. You'll operate at the intersection of software engineering and infrastructure, building systems that keep our platform highly available, observable, and resilient in production. This is a hands-on engineering role where you'll write production code (primarily in Python) while also owning on-call operations and incident response.

Responsibilities

  • Reliability & Production Ownership
      • Own the availability, latency, and performance of critical production systems
      • Participate in and improve a 24/7 on-call rotation, responding to incidents and driving resolution
      • Lead incident response, root cause analysis (RCA), and postmortems
      • Design systems that fail gracefully and recover automatically
  • Automation & Engineering (Python-heavy)
      • Write production-grade Python code to:
          • Automate infrastructure workflows
          • Build internal reliability tools
          • Improve deployment, rollback, and recovery systems
      • Eliminate manual operational work through automation and self-healing systems
  • Observability & Monitoring
      • Design and implement:
          • Metrics, logging, tracing
          • Alerting systems (reduce noise, improve signal)
      • Build dashboards and tooling to give real-time visibility into system health
  • Infrastructure & Scalability
      • Operate and improve systems running on:
          • Cloud platforms (AWS/GCP/Azure)
          • Containers (Docker, Kubernetes)
      • Scale systems to handle enterprise workloads and high-throughput traffic
      • Improve deployment pipelines, CI/CD, and infrastructure-as-code
  • Reliability Engineering & Resilience
      • Define and enforce:
          • SLAs / SLOs / error budgets
      • Conduct:
          • Load testing
          • Chaos testing
      • Build resilient systems that can tolerate failure
  • Collaboration
      • Partner with product and backend engineers to:
          • Improve system reliability
          • Embed observability into services
      • Help teams design production-ready systems from day one

Requirements

  • Core Requirements
      • Strong software engineering background (not just ops)
      • Proficiency in Python (required) for building tools and services
      • Experience operating production systems at scale
  • Infrastructure & Systems
      • Experience with:
          • Kubernetes / Docker
          • Cloud platforms (AWS/GCP/Azure)
          • Distributed systems
  • Reliability & Operations
      • On-call rotations and incident response
      • Monitoring tools (Grafana, Prometheus, etc.)
      • Debugging production issues under pressure
      • AI/ML systems or data pipelines
      • Event-driven architectures
      • High-availability systems
Benefits

  • Build foundational product features for an AI-first enterprise platform
  • The opportunity to take ownership of critical systems that scale to millions of users
  • A culture that values craftsmanship, autonomy, and technical excellence
  • Competitive compensation, equity, and benefits package
  • Work from our Flatiron District, Manhattan office, where you'll be side-by-side with the founding team in a supportive, collaborative setting. Our team works on-site five days a week, growing and building together, and the location is easy to reach with plenty of public transportation options.
  • Health insurance
  • Equity / stock options

Additional Information

Standard Template Labs is an AI-native startup reimagining the future of IT Service and Configuration Management. Backed by leading investors, we're leveraging AI to transform how enterprises manage and engage with their IT ecosystems.

