SRE Developer

Aylo · Montréal, Qc, Canada

Full-time · On-site · Posted 1 month ago (30+ days old, may be filled)

Bash · Caching · CI/CD · Compliance · Confluence · Docker

Responsibilities

Own the reliability, availability, and performance of production systems in a containerized, microservices-based environment
Monitor system health using Grafana dashboards, alerts, and observability tools; proactively identify and resolve issues
Manage and operate Kubernetes clusters (via Rancher), including deployments, scaling, and troubleshooting
Lead and participate in incident management using OpsGenie, including on-call rotations, escalations, and post-incident reviews
Troubleshoot issues across application, infrastructure, messaging, database, and container layers
Build and maintain automation scripts and tools using Bash, Go, and/or Python to improve operational efficiency
Support and optimize CI/CD pipelines using GitLab, ensuring smooth deployment and release processes
Collaborate with development teams to improve application reliability, performance, and observability
Work with databases and data systems (MySQL, Redis) for performance monitoring and issue resolution
Support distributed messaging systems such as Kafka and RabbitMQ
Contribute to and maintain operational documentation, runbooks, and knowledge bases using Jira and Confluence
Perform root cause analysis (RCA) and implement preventative measures
Ensure systems operate in alignment with security, compliance, and data privacy standards
Leverage AI-powered engineering tools to accelerate troubleshooting, documentation, and workflows
What you need to be successful:

Requirements

3+ years of experience in Site Reliability Engineering, DevOps, Production Support, or Systems Engineering
Bachelor's degree in computer science or related field
Hands-on experience with Grafana, Kubernetes and Docker
Experience with OpsGenie for incident management and on-call coordination
Strong experience with GitLab/Git, including CI/CD pipelines and release processes
Proficiency with Atlassian tools (Jira, Confluence) for tracking and documentation
Solid knowledge of MySQL
Experience with Kafka and/or RabbitMQ
Familiarity with Redis for caching and performance optimization
Working knowledge of Temporal or similar workflow orchestration tools
Strong scripting skills in Bash
Proficiency in Go and/or Python for automation and tooling
Familiarity with PHP applications (Symfony, Laravel) for production support
Proven ability to troubleshoot complex systems across multiple layers
Excellent documentation habits (runbooks, playbooks, system diagrams)
Knowledge of FTC data protection principles
Understanding of NIST frameworks and security best practices
Familiarity with GDPR requirements (data handling, logging, retention, privacy)
As an equal opportunity employer, we celebrate diversity and are committed to creating an inclusive environment for all employees
In this role you may be exposed to adult content

Benefits

Health insurance
Remote work options

Additional Information

Established in 2004, we are a tech pioneer offering world-class adult entertainment and games on some of the internet's safest and most popular platforms. With the support of an international team of dynamic and collaborative innovators, we are on a mission to enable safe user experiences and empower our communities by celebrating diversity, inclusion, and expression, all while maintaining robust trust-and-safety protocols.

We embrace the best of both worlds! Local talent can thrive in our collaborative office space with the flexibility of a hybrid work environment, while remote team members play an integral role in shaping our dynamic culture from afar. We have offices in Montreal (Quebec), Austin (Texas) and Nicosia (Cyprus).

*A select number of positions require full-time in-office attendance.*

We are seeking a highly skilled Site Reliability Engineer (SRE) to support and enhance the reliability, scalability, and performance of our production systems. In this role, you will play a key part in incident response, root cause analysis, and continuous improvement of operational processes while leveraging cutting-edge tooling and AI-assisted solutions.
