Staff Site Reliability Engineer

Sony Interactive Entertainment · United States, Canada

Full-time · On-site · 3w ago

AWS · Capacity Planning · CI/CD · Datadog · DNS · Docker

Responsibilities

Review and influence service architecture and system design to improve resiliency, fault tolerance, and scalability. Establish and promote best practices across engineering teams.
Own and evolve Infrastructure-as-Code for managed services (AWS, GCP). Design and build scalable, reusable modules and automation that standardize provisioning, configuration, and operations.
Develop and improve production-grade automation and tooling that deliver measurable gains in operational processes, such as reduced manual toil and lower MTTx.
Increase observability across the platform by implementing robust monitoring, logging, and tracing patterns. Build actionable dashboards and define meaningful alerting strategies that reduce MTTD and MTTR while minimizing noise.
Leverage service telemetry and historical data to anticipate capacity needs, detect anomalous behavior, and proactively prevent incidents. Develop data-informed approaches to performance optimization and reliability engineering.
Lead performance and capacity planning initiatives. Apply cloud-native patterns (e.g., auto-scaling, spot capacity, container orchestration with EKS) to optimize cost, performance, and availability at scale.
Contribute code to shared repositories and platform components improving reliability, scalability, and maintainability.
Collaborate across SIE with a variety of engineering, product, security and PMO teams to drive reliability improvements and ensure consistent operational standards across PlayStation services.
Contribute to and help evolve reliability engineering practices, including SLIs/SLOs, error budgets, and operational readiness standards.
Provide rotational on-call support, including incident detection, triage, and resolution for production systems, with a focus on continuous improvement of system reliability.
Lead post-incident reviews, producing clear root cause analyses and driving follow-through on corrective and preventative actions across teams.

Requirements

BS degree in Computer Science, Engineering, or related technical subject area.
7+ years of hands-on AWS experience integrating, developing, and managing applications.
10+ years of relevant SRE or operational work experience supporting a high-volume and/or critical production software environment.
10+ years of hands-on software engineering or systems engineering experience (Java and/or React services).
5+ years of experience with building automation into daily operational processes through one or more programming languages (preferably Python or Go).
Hands-on experience using modern AI engineering technologies, including LLMs, MCP-based integrations, and agentic workflow patterns, to improve SRE operations.
Strong experience in configuring, tuning, and automating operational responsibilities for AWS managed data services, including RDS, DynamoDB, and ElastiCache.
Experience with monitoring and log management tools (e.g., Datadog, CloudWatch, Grafana, Splunk).
Experience with container technologies and orchestration (e.g., Docker, Kubernetes, EKS).
Hands-on experience in triaging and tuning Java cloud applications with integration into AWS.
Solid understanding of AWS networking systems and protocols (e.g., ALB, Route 53, API Gateway, TCP/IP, HTTP/HTTPS, DNS).
Experience with developing or supporting Continuous Integration and Continuous Delivery/Deployment (CI/CD) pipelines.
Excellent leadership presence and strong verbal and written communication skills.
At SIE, we consider several factors when setting each role's base pay range

Benefits

Vision insurance

Additional Information

Why Sony Interactive Entertainment?

Sony Interactive Entertainment isn't just the Best Place to Play - it's also the Best Place to Work. Sony Interactive Entertainment (SIE) is the company behind the PlayStation brand. As a subsidiary of Sony Group Corporation, we're part of a proud legacy of innovation and excellence. SIE is a dynamic technology company, delivering cutting-edge hardware and network services to more than 100 million people, and an entertainment leader, home to some of the most beloved and recognizable intellectual properties (IP) in the world. Our role at SIE is to create and nurture the experiences under the PlayStation brand, a name synonymous with entertainment excellence and creativity.

Staff Site Reliability Engineer
San Diego, CA

As a key leader of the Commerce - Technical Operations team, you will help drive the availability and enablement for PlayStation Store, Catalog, Entitlement, Pricing and Device Management platforms. You will partner closely with product engineering teams to deliver innovative player features and elevate operational excellence for millions of players worldwide.
