Senior Site Reliability Engineer
About the role
As a Senior Site Reliability Engineer, you will be a key technical leader driving the design and optimization of our Kubernetes-based infrastructure and CI/CD systems. You will also own the infrastructure behind our AI tooling - building MCP servers and defining safe, auditable AI access patterns for production systems. You'll work hands-on with engineering teams to accelerate delivery, ensure production reliability, and embed best practices for automation, observability, and resilience.

Design, build, and scale Kubernetes infrastructure for secure, multi-tenant, high-availability applications.
Build and operate AI tooling infrastructure - stand up MCP servers and establish secure, governed AI access and guardrails for production systems.
Optimize and maintain CI/CD pipelines, improving reliability, speed, and rollback safety.
Implement progressive delivery strategies such as blue/green and canary deployments.
Advance Infrastructure as Code with Terraform, Helm, and Argo CD, defining reusable patterns for the org.
Operate and optimize streaming and analytics infrastructure: Kafka, Flink, and ClickHouse.
Build automated testing into the CI/CD lifecycle.
Improve system observability - define SLOs, alerts, and dashboards.
Lead incident response and postmortems, focusing on root cause and durable fixes.
Mentor engineers across teams on Kubernetes, CI/CD, and cloud infrastructure.

Required Qualifications:
6+ years in SRE, DevOps, or Infrastructure roles, with significant production Kubernetes experience.
Hands-on experience integrating AI/LLM tooling into engineering or operational workflows (e.g., MCP servers, AI agents acting on infrastructure), and a clear grasp of the security and governance considerations of giving AI access to production.
Proven success building CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI, or similar).
Strong with Kubernetes internals and managed services like EKS, GKE, or AKS.
Expertise with Infrastructure as Code (Terraform, Helm, Pulumi) and GitOps.
Proficient in Python, Bash, or Go.
Knowledge of observability tooling (Prometheus, Grafana, Datadog, OpenTelemetry).
Production experience with Kafka, Flink, and ClickHouse.
Strong communication and cross-team collaboration skills.