
Cloud Reliability Engineer

Infios · Remote
Full-time · Remote · 2w ago
Ansible · AWS · Azure · Bash · Chaos Engineering · CI/CD

Responsibilities

  • Cloud Infrastructure Operations
      • Operate, maintain, and improve cloud infrastructure in AWS, Azure, or GCP environments.
      • Manage and optimize Kubernetes clusters, including deployment, scaling, patching, and upgrades.
      • Ensure system availability, scalability, and performance through proactive monitoring and optimization.
      • Maintain infrastructure-as-code (IaC) for consistent and repeatable deployments.
  • Automation & Continuous Improvement
      • Identify opportunities for operational automation to eliminate manual processes ("reduce toil").
      • Build and maintain automated pipelines for deployments, configuration, and remediation.
      • Develop self-healing mechanisms to automatically detect and resolve common service issues.
      • Participate in continuous improvement initiatives around reliability, performance, and efficiency.
  • Reliability Engineering
      • Implement SRE principles: define and track SLIs, SLOs, and error budgets.
      • Perform incident analysis and postmortems to identify root causes and prevent recurrence.
      • Design proactive monitoring, alerting, and observability dashboards (Dynatrace, DataDog).
      • Collaborate with DevOps and development teams to build reliable, observable, and resilient systems.
  • CI/CD and Release Operations
      • Manage and optimize CI/CD pipelines to ensure reliable and consistent delivery.
      • Support deployment strategies (blue/green, canary, rolling) to reduce downtime risk.
      • Collaborate with Product and DevOps teams on release readiness and rollback automation.
  • Incident Response & Troubleshooting
      • Monitor, troubleshoot, and resolve infrastructure and application issues.
      • Respond to production incidents and ensure rapid mitigation and resolution.
      • Troubleshoot complex cloud, container, and networking issues across distributed systems.
      • Drive a culture of proactive monitoring, data-driven analysis, and preventive action.

Required Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • 5+ years of experience in Cloud Engineering, DevOps, or Site Reliability roles.
  • Hands-on experience with cloud platforms (OCI, AWS, Azure, or GCP).
  • Strong knowledge of Kubernetes deployment, management, and troubleshooting.
  • Solid understanding of observability and monitoring (e.g., Dynatrace, DataDog) and incident management platforms.
  • Proficiency in scripting and automation (e.g., Python, Bash, Terraform, Ansible).
  • Strong troubleshooting and analytical skills across infrastructure and applications.
  • Experience with incident response, RCA, and postmortem processes.
  • A mindset of continuous improvement, reliability, and self-healing automation.
  • Understanding of SRE principles, SLAs/SLOs/SLIs, and chaos engineering practices.

Preferred Skills

  • Experience conducting resilience assessments and recovery drills.
  • Familiarity with ServiceNow and Dynatrace or other observability and ITSM tools.
  • Experience with chaos engineering or resiliency testing frameworks.
  • Background in networking, load balancing, and performance tuning.
  • Strong communication and stakeholder management skills.

Soft Skills & Mindset

  • Strong collaboration skills; comfortable working with developers, ops, and management.
  • Clear communicator; able to translate technical issues into business impact.
  • Self-starter with a problem-solving and automation-first mentality.
  • Resilient under pressure; thrives in a dynamic, fast-paced environment.
  • Passionate about operational excellence and continuous learning.

Key Success Metrics

  • SLA/SLO compliance for critical services
  • Reduction in MTTR (Mean Time to Recover)
  • Increase in automated incident resolution rates
  • Reduction in customer-impacting incidents
  • Frequency and outcomes of resilience testing exercises
  • Service uptime / availability

Why join us?

We believe the future is better when supply chains work better.

We are an equal-opportunity employer and committed to inclusion in the workplace.

At Infios, we believe that inclusion is a fundamental cornerstone of our success. We are committed to creating a safe and

Benefits

Vision insurance

Additional Information

If you are looking for a meaningful career where people work and act with passion, rethink the status quo, and always strive to find the best solution, you have come to the right place. We develop future technologies to relentlessly make supply chains better. We are a leader in supply chain software solutions, helping organizations streamline operations, reduce costs, and improve efficiency.



Interested in this role?

Apply on the company's website.
