Data Reliability Engineer
Responsibilities
- Own the reliability and stability of production data pipelines and data platform services.
- Define, improve, and enforce data SLAs/SLOs for batch and streaming products, including freshness, latency, and completeness.
- Diagnose and resolve data pipeline failures, delays, and data quality issues in production environments.
- Investigate issues across distributed data systems, including Spark/EMR workloads, ingestion pipelines, and warehouse performance.
- Lead or support incident response, including triage, mitigation, and long-term resolution.
- Perform root cause analysis and implement durable fixes to prevent recurrence.
- Design and enhance monitoring, alerting, and observability for data systems.
- Develop automation and tooling to reduce operational toil and improve system resilience.
- Contribute to disaster recovery and resiliency planning, including backup validation and recovery workflows.
- Partner with engineering teams to improve pipeline design, reliability, and operational readiness.
- Create and maintain runbooks, Standard Operating Procedures, and operational documentation.
- Participate in occasional off-hours support for production data systems when required.
Requirements
- Bachelor's degree in Computer Science, Information Systems, Data Science, or a related field.
- 5+ years of experience in data engineering or analytics platform roles, including 3+ years operating in a production cloud data warehouse environment such as Redshift or Snowflake.
- 3+ years of experience building AWS data pipelines and supporting them through production, including exposure to real-world failures and operational challenges.
- 3+ years of experience working with production data platforms in AWS environments, with a focus on anomaly detection, reconciliation, and end-to-end validation.
- 3+ years of experience with Python and SQL in real data systems.
- Hands-on experience troubleshooting distributed data processing systems such as Spark/EMR, Redshift, and streaming systems.
- Proven ability to debug and resolve production issues in data pipelines and data platforms.
- Experience with AWS data services such as EMR, Redshift, DynamoDB, S3, or similar.
- Proven ability to handle production incidents and perform root cause analysis.
- Strong problem-solving mindset and ability to work through ambiguous production issues.
What will set you apart
- Experience handling real-world data issues such as pipeline delays or failures.
- Experience with data backfills and reprocessing.
- Experience with late-arriving data or incomplete datasets.
- Experience improving observability and alerting specifically for data systems.
- Experience influencing or guiding data pipeline reliability and operational practices.
- Exposure to streaming or event-driven systems such as Kafka, Kinesis, and CDC patterns.
- Experience with disaster recovery, backup validation, and resiliency testing.
- Strong communication during incidents with both technical and non-technical stakeholders.
- Prior FinOps or capacity-planning ownership for data platforms.
- Familiarity with BI semantic layers and contract enforcement at consumption, including Looker, Power BI, or Tableau.
This job operates in a professional office environment.
Additional Information
The Data Reliability Engineer will own the reliability, stability, and operational excellence of an AWS-based data platform. This role will operate, troubleshoot, and improve production data systems to ensure data pipelines and analytics platforms are resilient, performant, and meet business-critical SLAs. The Data Reliability Engineer will work closely with data and platform engineering teams to diagnose issues, resolve production incidents, and improve design and operational practices across the data ecosystem.
Our vision for the future is based on the idea that transforming financial lives starts by giving our people the freedom to transform their own. We have a flexible work environment and fluid career paths. We not only encourage but celebrate internal mobility. We also recognize the importance of purpose, well-being, and work-life balance. Within Empower and our communities, we work hard to create a welcoming and inclusive environment, and our associates dedicate thousands of hours to volunteering for causes that matter most to them. Chart your own path and grow your career while helping more customers achieve financial freedom. Empower Yourself.
***Applicants must be authorized to work for any employer in the U.S. We are unable to sponsor or take over sponsorship of an employment visa at this time, including CPT/OPT.***