Senior Site Reliability Engineer
About the role
We're hiring a Senior Site Reliability Engineer to lead reliability strategy and drive AI-powered automation at scale. This role involves owning complex systems, shaping architecture, and influencing cross-functional teams.
You'll:
- Define and evolve SLOs, SLIs, and resilience patterns
- Build AI-driven automation for detection, remediation, and forecasting
- Lead cloud infrastructure and Kubernetes platforms
- Drive incident response and operational excellence
- Mentor engineers and influence org-wide reliability practices
About You
- 8+ years of hands-on experience in Site Reliability Engineering, DevOps, or large-scale production operations.
- Advanced expertise in AWS, including architecture design across services such as EC2, EKS, VPC, IAM, RDS, S3, and CloudWatch.
- Deep experience with Infrastructure-as-Code using Terraform, including complex modules, state management, and governance.
- Strong programming and automation skills using Python and Shell; experience building production-grade automation systems.
- Expert-level Linux systems knowledge, including performance tuning, security hardening, and deep troubleshooting.
- Proven experience operating distributed systems and data streaming platforms such as Kafka in high-throughput environments.
- Demonstrated ability to work independently on complex, ambiguous problems with broad organizational impact.
- Proven technical leadership experience driving large, cross-team reliability or infrastructure initiatives, including setting technical direction, influencing design decisions, and mentoring engineers to deliver measurable outcomes at scale.
AI & Automation Expertise
- Practical experience designing or implementing AI/ML-driven automation in operations, reliability, or platform engineering.
- Experience integrating AI capabilities into monitoring, alerting, incident response, or workflow automation systems.
- Strong understanding of how AI can be safely and effectively applied in production environments.
Requirements
- Experience with advanced observability platforms (Prometheus, Grafana, ELK, or similar) enhanced with AI-driven insights.
- Familiarity with predictive analytics, anomaly detection, or AIOps platforms.
- Experience influencing architectural decisions at a platform or product level.
- Prior experience operating in a 24/7, global, high-availability SaaS environment.
Additional Information
About Zuora
At Zuora, we help businesses grow smarter and adapt faster. Our platform powers modern business models - from subscriptions and usage-based pricing to AI-driven and outcome-based offerings - helping companies launch new products, automate complex billing, and unlock predictable, recurring revenue. We've led the Subscription Economy for more than a decade. Now we're evolving again by building the definitive platform for quote to cash and helping companies monetize their products and services with an adaptable, AI-ready foundation.