Lead SRE - Observability
About the role
The Observability Engineering team builds and operates the telemetry, monitoring, and reliability platforms that support athenahealth's cloud infrastructure and engineering organizations. The team processes large volumes of logs, metrics, traces, and events that help teams develop, troubleshoot, and operate highly available healthcare applications. The team works closely with Cloud Engineering & Operations and R&D to improve observability, operational efficiency, and platform reliability through scalable infrastructure and automation-first engineering practices.

Essential Job Responsibilities:

Observability platform engineering
Build and operate scalable observability and telemetry platforms that process logs, metrics, traces, and events across production environments.
Support monitoring, alerting, and instrumentation strategies that improve service visibility and operational insight.
Partner with engineering teams to strengthen telemetry collection and overall observability.

Infrastructure and automation
Design resilient, automated infrastructure and platform services that improve reliability, scalability, and efficiency.
Develop Infrastructure as Code and automation solutions that reduce toil and improve consistency.
Lead technical initiatives from architecture through implementation with attention to performance, reliability, security, and maintainability.

Production support and incident response
Troubleshoot complex production issues involving distributed systems, Linux infrastructure, networking, cloud services, and telemetry pipelines.
Participate in incident response and on-call processes.
Help drive operational excellence, root cause analysis, and continuous improvement.

Technical leadership and mentoring
Mentor engineers on SRE best practices, observability strategy, and scalable systems design.
Contribute to long-term platform strategy and reliability improvements.
Influence technical decisions across engineering organizations.
Expected Education & Experience:
7+ years of experience operating and engineering large-scale production infrastructure and distributed systems.
Strong expertise in Linux systems engineering, cloud infrastructure, and SRE practices.
Proven experience designing and operating observability and telemetry platforms.
Hands-on experience with tools and technologies such as OpenSearch/Elasticsearch, Kafka, Prometheus, Grafana, Vector, Fluentd, OpenTelemetry, ClickHouse, or similar.
Experience building Infrastructure as Code solutions using Terraform, CloudFormation, or equivalent tooling.
Strong automation and software engineering skills using Python, Golang, or Bash.
Experience troubleshooting large-scale distributed systems in production with a focus on availability, performance, scalability, and resiliency.
Experience operating services in cloud-native environments, including AWS and containerized platforms.
Strong understanding of monitoring strategy, telemetry pipelines, incident response, root cause analysis, and operational excellence.
Ability to communicate effectively across engineering organizations and influence technical decision-making.

Preferred Experience:
Experience operating high-scale telemetry or analytics platforms with large ingestion volumes.
Experience with Kubernetes, Docker, CI/CD systems, and modern platform engineering practices.
Strong networking and troubleshooting experience using tools such as tcpdump and Wireshark.
Experience leading cross-functional engineering efforts and mentoring within SRE or infrastructure organizations.
Familiarity with healthcare technology or other highly regulated production environments.

Expected Compensation
$143,000 - $243,000
The base salary range shown reflects the full range for this role from minimum to maximum.
At athenahealth, base pay depends on multiple factors, including job-related experience, relevant knowledge and skills, how your qualifications compare to others in similar roles, and geographical market rates. Bas