Senior Site Reliability Engineer

OutSystems · San Francisco Bay Area

Full-time · Remote · Today

AWS · CloudFormation · Compliance · GitHub · Hadoop · Incident Response

Requirements

BS/MS in Computer Science or Equivalent
6+ years of experience in Site Reliability Engineering, managing infrastructure and services at scale
History of end-to-end project delivery

Additional Information

There are NO limits to your career: come shape the future and be part of a truly unique global culture at OutSystems!

Hybrid Onsite in Menlo Park, CA

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals of SRE are to create scalable and highly reliable systems. Our SREs ensure our production systems' reliability, performance, and scalability while enabling rapid development and deployment of new features and services. SREs at OutSystems work closely with development teams, acting as an extension of the team, in adopting the reliability tenets with the shared goal of meeting Service Level Objectives (SLOs) and thus delivering a smooth and frictionless Customer Experience.

Site Reliability Engineer Role

As an SRE at OutSystems, here are your key responsibilities and duties:

Lead and onboard services and teams to the reliability tenets
Establish and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs)
Design and implement scalable, reliable, and secure infrastructure, while ensuring cloud-native best practices
Collaborate with software development teams to ensure systems are resilient (observable, fault-tolerant, recoverable, scalable) and performant
Implement monitoring, alerting, logging, and tracing solutions to detect and respond to incidents
Lead incident response efforts, ensuring quick resolution and minimal downtime, and conduct RCA/post-mortems
Automate every operational task, with a special focus on fast incident detection and recovery
Program in Python, supported by Gen AI tooling, to accelerate development of mission-critical automation and tools
Foster a culture of continuous improvement and knowledge sharing
Communicate effectively with stakeholders, providing updates on system reliability and performance
Participate in on-call rotation to provide 24/7 support for production systems

Site Reliability Engineering Performance Indicators

The main KPIs that aid in understanding the impact and success of the SRE function at OutSystems are:

SLA and Service Level Objective (SLO) compliance
SLO Coverage and Detection Ratio
MTTA - Mean time to acknowledge
MTTR - Mean time to resolve

Qualifications and Skills

The qualifications listed for this role illustrate the desired profile for a Site Reliability Engineer. Nevertheless, the selection of candidates will always vary depending on specific knowledge of the field and prior experience.
