Senior Database Reliability Engineer


Crunchyroll · San Francisco, CA

Full-time · On-site · Posted 1d ago

AWS · Capacity Planning · CI/CD · CloudFormation · Compliance · Datadog

About the role

The Database Operations Engineering team is dedicated to ensuring the reliability, scalability, and performance of our data infrastructure. We focus on standardizing and implementing monitoring and alerting across all datastores to track key metrics such as errors, latency, and throughput, and to ensure critical systems are covered. Our team leads horizontal efforts to keep databases up to date, implements Infrastructure as Code (IaC) for high availability and performance, and automates key processes to…

Additional Information

About Crunchyroll

Founded by fans, Crunchyroll delivers the art and culture of anime to a passionate community. We super-serve over 100 million anime and manga fans across 200+ countries and territories, and help them connect with the stories and characters they crave. Whether that experience is online or in person, streaming video, theatrical, games, merchandise, events and more, it's powered by the anime content we all love. Join our team, and help us shape the future of anime!

Crunchyroll is growing and changing, presenting unique challenges and opportunities to support millions of anime fans around the world. The Database Operations Engineering team provides a seamless infrastructure foundation to our internal stakeholders, ensuring an exceptional experience for all Crunchyroll fans.

As a Senior Database Reliability Engineer, you will be primarily responsible for operating, improving, and maintaining the reliability and operational excellence of our data infrastructure. Your core focus will be to introduce robust best practices for database production support, strengthen our global on-call rotation, and design and build reusable, database-specific Infrastructure as Code (IaC) components to ensure high availability, scalability, and 100% automation.

Key Areas of Responsibility

Database Operational Excellence & Production Support: Drive, stabilize, and own 24x7 database production support operations, processes, and incident remediation. Track database alerts responsibly, establish clear operational procedures, and bring infrastructure alerts to rapid closure.

Database Infrastructure as Code (IaC) & Automation: Architect, implement, and maintain reusable, database-specific IaC components and configurations using frameworks such as Terraform, CloudFormation, or Pulumi. Standardize configurations across multiple datastores to enable automated infrastructure deployment, sizing, and posture management.

Core Configuration Management: Proactively enable and standardize mission-critical database attributes and configurations by default, including automated backups, failover strategies, timeouts, and lifecycle policies.

On-Call & Platform Reliability: Strengthen and actively participate in the database on-call rotation, identifying SLAs, system vulnerabilities, and operational gaps to eliminate Single Points of Failure (SPOFs).

Database SRE & Site Operations: Manage large-scale data infrastructure; execute cluster management, capacity planning, data governance, and compliance reviews; and handle complex datastore migrations (such as MariaDB to Aurora/DynamoDB) and major version upgrades safely during low-traffic, non-US hours.

Collaborative Growth & Development: Work alongside a seasoned team of database engineering specialists, drawing on the team's existing senior architectural depth, to systematically scale platform features while following a continuous learning roadmap that deepens your expertise in native AWS database services (RDS/Aurora) and complex SQL tuning.

About You

We get excited about candidates, like you, because you possess:

Bachelor's degree in Computer Science, Information Technology, or a related field.

8+ years of experience in database operations, site reliability engineering (SRE), or a related role with a heavy focus on data platforms and core operational infrastructure.

Strong proficiency in automation and IaC frameworks, with extensive hands-on experience building database IaC.

Proven track record in database production support and operations, with deep practical experience managing robust, highly available 24x7 runtime systems (prior experience handling database production support at large scale is highly valued).

Extensive experience with the AWS cloud platform and hands-on implementation of CI/CD pipelines and DatabaseOps workflows.

Proficiency in monitoring and observability tools (e.g., Datadog, CloudWatch, DevOps Guru, Database Performance Insights) to track metrics, latency, throughput, and system errors.

Strong understanding of low-level system performance metrics (such as disk I/O saturation) and experience identifying and eliminating operational bottlenecks.

Familiarity with managing large-scale database structures across both SQL and NoSQL systems.

Strong problem-solving skills, an ownership mentality, proactive communication skills, and the ability to document clear incident response playbooks and operational requirements.