Senior Site Reliability Engineer
About the role
As a Senior Site Reliability Engineer, you work close to both the code and production environments. You design and build solutions that make our platform measurable, predictable, and resilient. You ensure a high level of user satisfaction through preventative maintenance, effective troubleshooting, and the rapid resolution of complex production issues. Ensuring robust performance in a high-stakes environment is a key responsibility of this role.
Rather than "running" systems, you engineer reliability into them. You help define and implement SLIs and SLOs, build meaningful monitoring and alerting, and automate away operational risk. You collaborate with product and platform teams to ensure reliability is treated as a core software quality.
This is not a classic DevOps or operations role. We are looking for an experienced engineer who is comfortable changing application code, designing observability as part of system architecture, and driving long-term improvements in how we build and operate software.
You will work primarily from our Pune office, while collaborating closely with the rest of the SRE team in our Dutch office. In Pune, you will also act as the first point of contact for production-related issues within our engineering organization. You'll join a growing Site Reliability Engineering team that is still evolving, offering significant opportunity to influence technical direction, standards, and ways of working.
What will you do?
- Engineer reliability into a large-scale Azure SaaS platform
- Design, implement, and continuously improve monitoring, alerting, and observability solutions
- Define and improve SLIs, SLOs, and error budgets together with engineering teams
- Build automation to reduce operational risk and eliminate manual toil
- Analyse incidents end-to-end and translate learnings into structural improvements
- Perform deep debugging and optimization of production issues across application code, services, and infrastructure
- Improve how teams use metrics, logs, and traces to understand system behaviour
- Collaborate closely with software engineers, platform engineers, and support teams
- Contribute to incident response when needed, with a strong focus on learning and prevention
- Support deployment strategies and execution
- Provide advanced technical support to help resolve user issues
Your skills & experience
- 5+ years of experience as a Site Reliability Engineer
- Strong experience with monitoring, alerting, and observability in production environments
- Experience with Datadog, Grafana, Log Analytics, and/or Prometheus
- Proven ability to design and work with SLIs, SLOs, and reliability metrics
- Hands-on coding experience in production environments (preferably C#/.NET, but not required)
- Experience building automation to improve system reliability and reduce toil
- Experience working with Microsoft Azure (preferred) or another major public cloud provider such as AWS or GCP
- Comfortable working with live production systems and customer data
- Understanding of performance optimization techniques
- Excellent communication skills, including direct interaction with users
- Strong cross-functional collaboration skills