Engineering Manager, Site Reliability Engineering (SRE)

External

Athenahealth · Chennai, India

Full-time · Hybrid · Today

Agile · Ansible · Bash · Documentation · Grafana · Incident Response

Benefits

Health insurance · Vision insurance

Additional Information

Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.

Position Summary:

We are seeking an Engineering Manager, Site Reliability Engineering (SRE), who is a hands-on technical people leader to lead the Service Operations Site Reliability Engineering team in Chennai within the Cloud Infrastructure Engineering (CIE) division. This role is responsible for driving reliability, observability, automation, and operational readiness across systems supporting Service Operations. The ideal candidate brings deep expertise in Linux infrastructure, observability platforms, infrastructure automation, incident management, and engineering leadership. This individual will partner closely with global engineering and operations teams to reduce toil, improve service reliability, and deliver scalable, resilient solutions that support athenahealth's mission of providing

About the Team:

The Service Operations Site Reliability Engineering team is part of the Network Operations Center (NOC) organization and sits within the Cloud Infrastructure Engineering (CIE) division. The team is responsible for delivering highly available SaaS infrastructure, operational tooling, observability solutions, and automation capabilities that support Service Operations and Cloud Infrastructure teams. Working closely with R&D and Infrastructure stakeholders across India and the United States, the team focuses on improving operational excellence through automation, standardized onboarding, actionable monitoring, and continuous reduction of operational toil.

Essential Job Responsibilities:

Lead, coach, mentor, and develop a team of Site Reliability and Infrastructure Engineers based in India.
Remain technically hands-on by reviewing designs, guiding implementation efforts, troubleshooting complex issues, and contributing to technical solutions when required.
Own team delivery across infrastructure management, observability, service onboarding, alerting, automation, and operational readiness initiatives.
Drive observability strategy across metrics, logs, traces, synthetic monitoring, health checks, dashboards, and actionable alerting frameworks.
Manage provisioning and lifecycle management of physical and virtual Linux systems using tools such as Puppet, Ansible, Terraform, and related automation platforms.
Partner with engineering teams operating within SaaS, hybrid cloud, Kubernetes, and Amazon EKS environments to ensure complete monitoring and operational coverage.
Identify, measure, and reduce operational toil through automation, self-service capabilities, documentation, and scalable operational processes.
Lead Agile delivery practices including sprint planning, backlog prioritization, stakeholder communication, and continuous improvement activities.

Additional Job Responsibilities:

Build and enhance monitoring integrations across platforms including New Relic, Prometheus, Alertmanager, OpenSearch, Grafana, Icinga, Unified Assurance, and related technologies.
Establish Infrastructure-as-Code (IaC), Configuration-as-Code, Monitoring-as-Code, and Alerting-as-Code standards and practices.
Improve alert quality by ensuring alerts contain actionable context, ownership, severity levels, routing information, and runbook references.
Partner with NOC and Service Operations teams to standardize service onboarding, escalation management, operational handoffs, and response workflows.
Manage hiring, onboarding, performance management, feedback, career development, and technical growth of direct reports.
Participate in incident response activities, escalation reviews, post-incident analysis, and on-call planning processes.
Develop and report operational metrics including alert quality, automation coverage, service health, onboarding throughput, toil reduction, and reliability improvements.
Ensure operational excellence through comprehensive documentation, SOPs, runbooks, architecture diagrams, and support procedures while collaborating effectively with global teams.

Expected Education & Experience:

Bachelor's degree in Computer Science, Information Technology, Engineering, or a related technical discipline; equivalent experience will also be considered.
10+ years of experience in Infrastructure Engineering, Site Reliability Engineering, Systems Engineering, Platform Engineering, or Technical Operations.
2+ years of experience managing or formally leading technical engineering teams.
Strong hands-on experience administering, provisioning, and operating Linux systems in large-scale production environments.
Proven experience with observability platforms, monitoring, logging, tracing, dashboarding, alerting, and synthetic monitoring solutions.
Experience with Infrastructure-as-Code and configuration management tools such as Terraform, Puppet, Ansible, Chef, or similar technologies, along with scripting in Python, Go, Bash, Ruby, Java, or related languages.
Experience supporting SaaS, hybrid cl

Interested in this role?

Apply on the company's website.