Senior Software Engineer - SRE
About the role
The Platform Infrastructure team ensures that all Roku systems run smoothly. These systems support over 100M users and billions in transaction revenue per year. We are a group of highly skilled infrastructure and software engineers who help build and operate systems at internet scale, including Platform (Kubernetes, Istio, Envoy, operators, and more) and Observability (OSS/CNCF-supported observability projects). We engage with multiple teams to achieve company-impacting results.
We are seeking a talented and experienced SRE (Site Reliability Engineering) Senior Software Engineer to join our dynamic team. The ideal candidate will have a strong background in SRE practices, cloud infrastructure management, and automation. If you have a consistent track record of architecting and building large-scale systems, enjoy solving intriguing system challenges at internet scale, are innovative at heart, and balance learning, organizing, and building with a drive to make an impact, this role might be a great fit for you!
Responsibilities
- Design & Infrastructure
- Contribute to postmortem culture by facilitating comprehensive, blameless post-incident reviews that identify root causes, contributing factors, and actionable remediation items. Track incident trends to identify systemic issues and prioritize reliability improvements.
- Implement chaos engineering practices to proactively identify failure modes, validate system resilience, and build confidence in recovery procedures. Conduct game days and disaster recovery exercises.
- SRE Process & Principles Implementation
- Manage Error Budgets as a mechanism for making data-driven decisions about feature velocity vs. reliability. Track, report, and enforce error budget policies, facilitating conversations between engineering and product teams about risk tolerance and release decisions.
- Reliability Engineering & Infrastructure
- Reduce toil through automation by identifying repetitive operational work and systematically eliminating it through infrastructure-as-code, automation frameworks, and intelligent tooling. Measure and track toil reduction efforts, aiming to keep toil below 50% of team time.
- Implement capacity planning processes that ensure systems have adequate headroom to meet SLOs during peak traffic, unexpected load spikes, and degraded states. Develop predictive models and automated scaling mechanisms.
- Observability, Monitoring & Reporting
- Build comprehensive observability systems that provide deep visibility into service health, performance, and user experience. Implement monitoring strategies based on the Four Golden Signals (latency, traffic, errors, saturation) and USE/RED methodologies.
- Create SRE dashboards and reporting mechanisms that provide real-time visibility into SLO compliance, error budget consumption, and system reliability metrics. Develop executive-level reporting on reliability trends, incident impact, and improvement initiatives.
- Establish alerting strategies that are actionable, symptom-based, and aligned with SLOs. Reduce alert fatigue by tuning thresholds and eliminating noise while ensuring critical issues trigger appropriate responses.
- Collaboration and Leadership
- Partner with development teams to implement reliability from the design phase using SRE principles. Conduct design reviews focused on failure modes, scalability, observability, and operational concerns. Guide teams in building services that meet SLO requirements.
- Collaborate through code reviews and design reviews, ensuring infrastructure-as-code, automation scripts, and reliability improvements follow best practices, are well-documented, and maintain high-quality standards.
- Manage project priorities using error budgets as a decision-making framework. Leverage agile methodologies while
Additional Information
Teamwork makes the stream work.
Roku is changing how the world watches TV
Roku is the #1 TV streaming platform in the U.S., Canada, and Mexico, and we've set our sights on powering every television in the world. Roku pioneered streaming to the TV. Our mission is to be the TV streaming platform that connects the entire TV ecosystem. We connect consumers to the content they love, enable content publishers to build and monetize large audiences, and provide advertisers unique capabilities to engage consumers.
From your first day at Roku, you'll make a valuable - and valued - contribution. We're a fast-growing public company where no one is a bystander. We offer you the opportunity to delight millions of TV streamers around the world while gaining meaningful experience across a variety of disciplines.