Senior/Staff Site Reliability Engineer

Sage49 · New York, NY

Full-time · On-site · 1mo ago

Capacity Planning · Datadog · DNS · Grafana · Incident Response · IoT

About the role

Sage is on a mission to improve care and quality of life for older adults, starting with those residing in senior living facilities. Falls are the leading cause of injury-related death among adults over 65. And yet, fall prevention and emergency response systems for older adults are archaic and ineffective. At Sage we've built a more modern way of understanding when older adults need help, including methods for residents to alert caregivers when in need of help, and corresponding software for caregivers to triage response. Our company mission is to create a product that our client counterparts love, and this role is a key part of that objective.

Sage is a small, tight team of ambitious, multi-disciplinary entrepreneurs. We are a software-enabled, mission-driven company, and are focused only on the problems that are central to achieving that mission. At Sage, we work hard and fast but also know that to build a truly important company, we need to treat our work as a marathon, and not a sprint. The journey matters.

About this Role

Sage provides life-saving functionality that improves the lives of our older population. This role is critical to ensure Sage can live up to its mission to be a 24x7, highly available platform for elder care. As a Site Reliability Engineer, you'll partner with engineering teams across the organization to achieve four 9s of uptime for our platform.

Responsibilities

Design and evolve highly reliable system architectures, ensuring high availability, fault tolerance, and scalability across Sage's production infrastructure.
Lead complex incident response efforts, coordinating across engineering teams to quickly diagnose and resolve production issues while driving thorough post-incident reviews and long-term reliability improvements.
Define and implement organization-wide observability practices, including metrics, logging, tracing, and actionable alerting to ensure strong visibility into system health.
Establish and maintain reliability standards, including defining SLIs, SLOs, and error budgets, and partnering with engineering teams to integrate these practices into the software development lifecycle.
Drive automation and infrastructure improvements that reduce operational toil and improve the efficiency and reliability of deployments, monitoring, and operational workflows.
Partner with engineering teams on system design and architecture reviews, ensuring reliability, scalability, and operational best practices are considered early in the development process.
Evolve Sage's cloud infrastructure, including networking, compute, storage, and security practices to support scalable and resilient systems.
Operate and improve critical data infrastructure, ensuring high availability, performance, backup strategies, and disaster recovery processes for production databases.
Lead capacity planning and auto-scaling efforts, ensuring infrastructure and systems scale effectively as product usage grows.
Build internal tooling and platforms that improve the developer experience, simplify debugging, and enable safer and more reliable deployments.

Requirements

7-12+ years of experience in software engineering, infrastructure engineering, or site reliability engineering, operating large-scale distributed systems in production.
Experience operating and supporting edge or device-based systems, including managing connectivity, observability, remote updates, and reliability for distributed hardware deployments such as IoT or field devices.
Strong networking fundamentals, including experience debugging distributed system issues across load balancers, DNS, TLS, and VPC networking within platforms like Amazon Virtual Private Cloud or similar cloud networking environments.
Experience operating and scaling production databases, including performance tuning, replication, backup/recovery strategies, and high availability for systems such as PostgreSQL, MySQL, or distributed databases.
Deep expertise in cloud infrastructure, such as Amazon Web Services or Google Cloud Platform.
Strong experience designing and operating highly available systems, including strategies for redundancy, failover, disaster recovery, and capacity planning.
Expertise in containerization and orchestration, particularly with Kubernetes and modern container platforms.
Advanced observability and monitoring skills, using tools such as Datadog, Prometheus or Grafana.
Strong programming ability in languages commonly used for infrastructure and reliability engineering (e.g., Go, Python, or Java), with experience building internal tooling and automation.
Deep knowledge of infrastructure-as-code practices, including tools like Terraform or Pulumi.
Proven experience leading reliability initiatives, such as defining SLOs/SLIs, improving incident response processes, and driving post-incident reviews.
Ability to influence engineering teams across the organization, guiding best practices for reliability, scalabilit

Benefits

Health insurance
Remote work options
