Sr Site Reliability Engineer

Global Healthcare Exchange (GHX) · Hyderabad, India
Full-time · On-site · 1mo ago
AWS · Capacity Planning · Chaos Engineering · Datadog · DevSecOps · Docker

Responsibilities

Execution & CoE Alignment

  • Implement the SRE frameworks, best practices, and playbooks provided by the SRE Center of Excellence (CoE).
  • Act as a hands-on engineer, contributing to observability, reliability, and incident response initiatives.
  • Partner with senior SREs and leadership to maintain consistency in monitoring and incident processes.
  • Contribute to automation projects that improve reliability and reduce manual work.

Observability & Monitoring

  • Build and maintain monitoring solutions with New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, and Graylog.
  • Create and refine dashboards, metrics, and alerts for proactive anomaly detection.
  • Extend observability coverage across infrastructure, applications, APIs, and databases.

Reliability Engineering & Automation

  • Implement SLIs, SLOs, SLAs, and error budgets in partnership with product and platform teams.
  • Contribute to reducing MTTD and MTTR through improved instrumentation and automation.
  • Participate in capacity planning, resiliency testing, and scaling reviews.
  • Support chaos engineering and reliability validation activities.

Incident & Problem Management

  • Participate in incident response, including on-call rotations for 24x7 coverage.
  • Assist with root cause analysis (RCA) and implement corrective actions.
  • Ensure alignment with ITSM processes for incident, problem, and change management.
  • Contribute to playbooks and runbooks to strengthen on-call readiness.

Collaboration & Knowledge Sharing

  • Collaborate with Engineering, Product, Security, Cloud, and DevSecOps teams to embed reliability practices.
  • Provide input on instrumentation, monitoring hooks, and operational readiness for services.
  • Work with DBAs and platform teams on database observability and performance optimization.
  • Share knowledge within the SRE team and adopt practices from Staff and Principal SREs.

Qualifications & Experience

Required

  • 7+ years in SRE, Operations, or Infrastructure Engineering.
  • Strong hands-on experience with monitoring and observability platforms.
  • Experience with tools such as New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, and Graylog.
  • Proven experience in incident response, troubleshooting production issues, and improving MTTR/MTTD.
  • Good knowledge of SLIs, SLOs, SLAs, and error budgets.
  • Hands-on experience with AWS services (EC2, ECS, EKS, networking, Auto Scaling groups).
  • Proficiency in containers & Kubernetes (Docker, EKS).
  • Scripting/programming in Python, Go, or shell scripting.
  • Understanding of networking, distributed systems, and high-availability architectures.
  • Exposure to ITIL/ITSM processes.

Preferred

  • Experience in SaaS or healthcare environments.
  • Knowledge of databases (MongoDB, Elasticsearch, SQL Server, Oracle).
  • Familiarity with chaos engineering and resiliency testing.
  • Certifications: AWS Solutions Architect / DevOps Engineer, CKA/CKAD.

GHX: It's the way you do business in healthcare

Global Healthcare Exchange (GHX) enables better patient care and billions in savings for the healthcare community by maximizing the automation, efficiency, and accuracy of business processes.

Benefits

  • Health insurance
  • Vision insurance

Additional Information

Site Reliability Engineer (SRE): Position Summary

The Site Reliability Engineer (SRE) will be a hands-on contributor within the Site Reliability Engineering Center of Excellence (CoE), responsible for building monitoring and observability solutions, troubleshooting production issues, and participating in 24x7 on-call operations. This role focuses on the execution of reliability practices, implementing observability tooling, improving MTTR/MTTD through automation, and ensuring production systems are resilient, observable, and performant.

The SRE will collaborate closely with Principal and Senior Staff SREs, adopting best practices and frameworks defined by the CoE while directly contributing to enterprise reliability goals. This role reports to the Sr. Manager, SRE.


Interested in this role?

Apply on the company's website.
