Manager, Site Reliability Engineering

Aya Healthcare · Remote

Full-time · Remote · 1w ago

Capacity Planning · Chaos Engineering · Datadog · DevSecOps · HIPAA · Incident Response

About the role

We're an $8+ billion, rapidly growing workforce solutions provider in the healthcare industry. We deliver tech-enabled services that help healthcare organizations meet and manage their contingent labor needs. We build and manage tech-enabled marketplaces for national and local healthcare talent and deliver contingent labor management solutions through our proprietary software platform. At Aya, we're obsessed with creating exceptional experiences for our clients, clinicians, and employees. In fact, we put employee satisfaction above all else. Our team members are responsible for incomparable customer experience and we know that happy employees are critical to maintaining happy clients. We foster an entrepreneurial, high-energy, low-bureaucracy culture and value innovative thinking and creative problem-solving. We embrace diversity in thought and backgrounds unified by a commitment to high achievement. When you join Aya, you'll be surrounded by teammates who care about you as an individual and leaders who will help you grow both personally and professionally.

Responsibilities

Lead and grow the SRE team

Lead, mentor, and grow a team of high-performing Site Reliability Engineers across hiring, performance management, career development, and on-call rotation health.
Set the operating cadence for the team - standups, incident reviews, SLO/error-budget reviews, post-incident learning, and capacity planning.
Build a culture of blameless learning, technical depth, customer empathy, and disciplined ownership.
Partner closely with DevSecOps, Security Engineering, DRE, Incident & Change Management, and product engineering leadership to remove cross-team friction.

Drive reliability, performance, and availability

Own the reliability strategy for customer-facing products and internal platforms - defining SLOs, SLIs, and error budgets in partnership with product and engineering leadership, and operationalizing them in the release process.
Lead major incident response as senior incident commander for severity-1 events; institutionalize blameless post-incident reviews and ensure systemic fixes ship.
Champion proactive reliability - chaos engineering, game days, failure-mode analysis, capacity and load testing - well before incidents force the conversation.
Manage software release support and 24/7 on-call escalation rotations across the platform surface area, with humane on-call load and clear escalation paths.

Operational intelligence and AI-native operations

Build the AIOps practice - anomaly detection, predictive alerting, intelligent correlation, and automated triage - to drive measurable reductions in MTTD and MTTR.
Operationalize AI-assisted workflows for incident summarization, runbook generation, log and trace analysis, change risk scoring, and post-incident narrative drafting.
Pilot and scale agentic remediation where appropriate, with strict guardrails, audit trails, and human-in-the-loop controls suitable for a HIPAA-regulated environment.
Evolve the observability platform (Datadog metrics, logs, traces, RUM, synthetics, CI Visibility) so engineering teams can operate their own services with confidence and clear ownership.

Platform efficiency and stakeholder trust

Treat reliability as a product with a roadmap, measurable outcomes, and an executive-credible narrative - not as overhead.
Drive platform unit economics by partnering with FinOps and platform leadership on cost-to-serve, right-sizing, capacity efficiency, and waste elimination.
Communicate outcomes to executive, product, and customer-facing stakeholders in plain language tied to clinician and client experience.
Uphold HIPAA, PHI, and security obligations across every reliability decision, change, and tool selection.

Required Qualifications

10+ years in a combination of Site Reliability Engineering, DevOps, Platform Engineering, or related production-operations roles.
4+ years of direct people management experience - hiring, performance management, career development, and running remote on-call teams.
Demonstrated ownership of reliability outcomes for customer-facing SaaS at meaningful scale - defining and operationalizing SLOs/SLIs/er

Benefits

Health insurance · Remote work options

Additional Information

Join Aya Healthcare, winner of multiple Top Workplace awards! We're looking for a highly experienced Manager, Site Reliability Engineering to lead the team behind one of healthcare's most relied-on workforce platforms. In this leadership role, you'll guide and grow a team of engineers driving product and platform reliability - ensuring an exceptional experience for the clinicians, clients, and internal teams who depend on us every day. You'll shape our reliability architecture, lead complex operational initiatives, and drive the adoption of AI-native operations (AIOps) and automation to eliminate toil and advance performance - owning measurable business outcomes across uptime, customer trust, and platform efficiency, and leading with the radical ownership Aya expects of every leader.
