Senior Site Reliability Engineer I

Electronic Arts · Hyderabad, India

Full-time · On-site · 2w ago

Python · Go · Swift · AWS · Azure · GCP

Requirements

Education: Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
Experience: 12-15 years of total IT experience, including at least 8 years in SRE, DevOps, or large-scale systems engineering.
Technical Expertise: Strong proficiency in Linux/Unix system administration and internals.
Proven experience in cloud platforms - AWS, Azure, or GCP.
Advanced scripting and automation skills using Python, Go, PowerShell, or Bash.
Hands-on experience with containerization and orchestration technologies (Docker, Kubernetes) and expertise in service meshes such as Istio.
Deep understanding of monitoring and observability stacks (Prometheus, Grafana, ELK, Datadog, Splunk, Zabbix, Nagios).
Expertise in configuration management and IaC tools (Ansible, Terraform, Chef, Puppet).
Strong knowledge of networking, load balancing, database

Additional Information

We are seeking an accomplished Senior Site Reliability Engineer (SRE) with 12-15 years of experience to lead the reliability, scalability, and performance engineering of our critical infrastructure and production systems. As a Senior SRE, you will play a strategic and technical leadership role - driving reliability practices, mentoring SRE teams, and influencing the adoption of automation, observability, and resilience engineering across the organization. You will act as a technical thought leader and hands-on engineer, collaborating with infrastructure, application, and operations teams to build, automate, and scale reliable systems that support global business operations. This role requires deep expertise in cloud platforms, automation, monitoring, incident management, and system design for large-scale distributed environments.

Roles & Responsibilities

1. Reliability Engineering & Automation

Architect, implement, and manage resilient, scalable, and highly available infrastructure systems.
Lead initiatives to automate manual operations, deployment, and monitoring processes to improve reliability and reduce toil.
Drive the creation of observability solutions and dashboards to proactively detect and remediate potential issues.

2. Incident & Problem Management

Lead critical incident response, ensuring swift mitigation and clear communication to stakeholders.
Conduct detailed root cause analysis (RCA) and drive permanent corrective actions to prevent recurrence.
Implement and mature incident management frameworks, including runbooks, playbooks, and post-incident reviews.

3. Infrastructure Operations & Performance Optimization

Oversee system performance, capacity planning, and scalability of infrastructure across hybrid and cloud environments (AWS, Azure, GCP).
Optimize system resource utilization, latency, and reliability through performance tuning and automation.
Work closely with architecture and platform teams to accommodate growth, change, and modernization initiatives.

4. Leadership & Mentorship

Provide technical leadership and mentorship to SRE teams and cross-functional engineering groups.
Promote an SRE culture across teams - championing principles of reliability, automation, observability, and continuous improvement.
Drive collaboration between development, QA, DevOps, and release teams to embed reliability into the software development lifecycle (SDLC).

5. Service Level Management

Define, track, and continuously improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Apply the Four Golden Signals of SRE monitoring - Latency, Traffic, Errors, and Saturation - to guide system health and performance strategies.

6. Documentation & Knowledge Sharing

Establish and maintain comprehensive documentation of systems, operational procedures, and best practices.
Facilitate learning through technical sessions, blameless postmortems, and cross-team knowledge sharing.

7. Strategic Technology & Continuous Improvement

Contribute to defining the long-term SRE strategy, tooling roadmap, and automation frameworks.
Evaluate and adopt emerging technologies, tools, and methodologies to enhance system reliability and efficiency.
Partner with business and technical leaders to ensure alignment of SRE objectives with organizational goals.

8. Security & Compliance

Collaborate with security and compliance teams to ensure infrastructure, systems, and operations meet organizational and regulatory standards.
Implement secure configuration baselines, vulnerability remediation, and access control policies.
Integrate security practices into CI/CD pipelines to ensure DevSecOps alignment.

9. Strategic Leadership & Stakeholder Management

Partner with executive and business stakeholders to align SRE initiatives with enterprise objectives and risk frameworks.
Provide data-driven insights on reliability, capacity, and operational performance to influence strategic decision-making.
Represent SRE functions in technical governance forums, audits, and architecture reviews to drive reliability-focused outcomes.
