Sr. Site Reliability Engineer - SRE
External, Full-time, Remote, 4d ago
AWS, CI/CD, Datadog, DNS, GitHub, GitHub Actions
Responsibilities
- Drive Operational Excellence: Design, implement, and maintain highly available, scalable, and resilient systems that deliver exceptional customer experience.
- System Design & Architecture: Provide SRE expertise in system design reviews, influencing architectural decisions to build reliability, observability, and scalability into our services from the ground up.
- Knowledge Sharing & Mentorship: Document processes, build runbooks, and share your expertise with both the SRE team and broader engineering organization. Help foster an SRE culture of shared ownership and continuous learning.
Requirements
Core SRE Capabilities
- Demonstrated experience operating and improving production systems at scale in an SRE, Production Engineering, or Platform Engineering role.
- Proven ability to rapidly build accurate mental models of complex distributed systems across infrastructure, applications, networking, identity, and observability domains.
- Strong troubleshooting skills with a methodical, evidence-driven approach to incident response and root cause analysis.
- Experience defining, implementing, and using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to guide reliability decisions.
- Excellent written and verbal communication skills, with the ability to explain complex technical issues clearly to both technical and non-technical audiences.
Technical Domains
Experience across several of the following areas:
- Kubernetes platforms, including Amazon EKS, and service mesh technologies such as Istio.
- Cloud infrastructure and services within AWS.
- Identity and access management systems, including Auth0 and AWS IAM.
- Networking fundamentals, including DNS, load balancing, routing, TLS, and connectivity troubleshooting.
- GitOps workflows and infrastructure automation using tools such as Flux and Terraform.
- Observability platforms and practices, including metrics, logs, traces, alerting, dashboards, and synthetic monitoring.
- CI/CD systems and engineering workflows.
- Application logging and distributed system debugging.
Engineering Mindset
A strong SRE:
- Prioritizes service stability and customer impact during incidents.
- Slows down under pressure, gathers facts, and communicates clearly.
- Reduces operational complexity through automation and simplification.
- Identifies and eliminates toil through self-service tooling and process improvement.
- Demonstrates strong scripting and automation instincts.
- Brings a systems-thinking approach to problem-solving.
- Balances short-term remediation with long-term reliability improvements.
Software Engineering for Reliability
- Demonstrated ability to build and maintain automation, tooling, and self-service capabilities using one or more programming or scripting languages s
Additional Information
We are expanding our Site Reliability Engineering (SRE) team and seeking a highly skilled and passionate Senior SRE to join us. As a member of our growing SRE function, you will play a critical role in ensuring the reliability, scalability, and performance of our mission-critical services that power our customer experience. This is an exciting opportunity to shape our SRE practices, drive automation, and significantly impact our product's operational excellence.