Senior Site Reliability Engineer

RapidSOS · New York (remote) or Boston (remote)

Full-time · Remote · 2mo ago

AWS · CI/CD · DNS · IAM · Kafka · Kubernetes

Responsibilities

Design for system resilience: Strengthen reliability through proactive design decisions, including safer deployment patterns, failover strategies, and redundancy approaches that improve system behavior under stress.
Build observability into system behavior: Proactively instrument services with structured logging, metrics, and alerting so systems are easier to understand and debug. The focus is on creating clear signals from production behavior before issues escalate.
Own incidents from signal to resolution: Take ownership of production issues from first signal through resolution, including investigation across infrastructure and application layers, root cause identification, and implementation of fixes that restore stability and strengthen system behavior long term.
Work across the stack without a permission slip: You'll work across infrastructure-as-code, container orchestration, CI/CD pipelines, and service-level application code. When issues come up, you don't wait for a handoff; you take ownership directly and drive the issue through to resolution.
What we're looking for in our ideal candidate:
5+ years of professional engineering experience with deep expertise in Python
Real cloud infrastructure experience with AWS: networking, managed databases, cost implications of traffic routing decisions, IAM, DNS-based routing and failover
Hands-on Kubernetes experience with containerized workloads in production across EKS, ECS, or Fargate: you can read events, understand resource limits, know when to drain vs. delete a node, and understand the tradeoffs between orchestration models
Strong understanding of distributed systems and how they fail, including resource exhaustion, replication lag, queue backpressure, and other common failure modes
Experience operating high-throughput messaging systems (RabbitMQ, Kafka, AWS SNS / SQS, etc.) and the infrastructure around them, including infrastructure-as-code (e.g., Terraform) and CI/CD pipelines, with an emphasis on improving reliability and scalability
Experience building or improving observability through logging, metrics, and alerting
Demonstrable experience in using AI to safely and securely enhance velocity, improv

Additional Information

In the time it takes you to read this job description, RapidSOS will have handled ~1,380 emergencies. At RapidSOS, we are committed to using technology to build a safer, stronger future and working together to save lives. We're in an exciting phase of growth, welcoming new members from across the globe to our mission-driven, ambitious, and inclusive team. Our work is founded on our values of elevating purpose, inventing tomorrow, delivering with urgency, serving with integrity, and winning together, all of which support a company culture where people can innovate, collaborate, grow, and, above all, make an impact.

RapidSOS is the leading public safety AI company that unlocks mission-critical intelligence for first responders and security teams, enabling faster, smarter and more accurate emergency response. Real-time data from the world's largest safety network of 700M+ devices, 200+ global enterprises, and 23,000+ federal, state and local agencies fuels the RapidSOS HARMONY AI engine that delivers this intelligence to those who need it most. Learn more at www.RapidSOS.com.

What this role is about:

Are you excited to work on systems where reliability directly impacts real-world outcomes? At RapidSOS, we build technology that powers emergency response, ensuring critical data gets to the right place at the right time. When these systems degrade or fail, the impact is real, and reliability isn't a background function. It's fundamental to how our product shows up in critical moments.

We're seeking a Senior Site Reliability Engineer to own the performance and stability of services that operate at scale in real-world, high-stakes environments. You'll work across infrastructure-as-code, container orchestration, CI/CD pipelines, and service-level application code, identifying and resolving issues at their root cause while proactively shaping how systems are built to improve reliability from the start.
You'll go beyond surface-level fixes, digging into everything from service behavior in Kubernetes to application-level decisions that impact performance, cost, and reliability. You'll collaborate closely with engineering teams to improve how our systems are built, observed, and operated. Along the way, you'll help shape how we approach reliability as a discipline: closing visibility gaps, improving resilience, and ensuring our platform performs when it matters most.

Interested in this role?

Apply on the company's website.
