Cloud Reliability Engineer
Responsibilities
- Reliability Engineering
- Design, implement, and maintain reliability practices for cloud infrastructure and platform services.
- Define and monitor service-level objectives (SLOs), service-level indicators (SLIs), and operational metrics.
- Identify reliability risks and implement solutions that improve availability, scalability, and resilience.
- Drive continuous improvement initiatives focused on operational excellence and system stability.
- Monitoring, Observability & Performance
- Design and maintain monitoring, logging, alerting, and observability solutions across AWS environments.
- Develop dashboards and reporting that provide visibility into platform health and performance.
- Analyze system behavior, identify bottlenecks, and implement performance improvements.
- Establish proactive monitoring practices that detect issues before they impact customers.
- Incident Response & Operational Excellence
- Participate in incident response, troubleshooting, and root cause analysis activities.
- Lead post-incident reviews and identify corrective actions to prevent recurrence.
- Improve operational processes, runbooks, and recovery procedures.
- Support disaster recovery and business continuity initiatives.
- AWS Platform Reliability
- Support the reliability and operational health of large-scale AWS environments utilizing AWS Organizations, Control Tower, and Identity Center.
- Partner with cloud engineering teams to improve platform architecture, resiliency, and operational consistency.
- Assist in maintaining secure, scalable, and highly available cloud services.
- Automation & Infrastructure as Code
- Develop automation that reduces operational toil and improves system reliability.
- Support infrastructure-as-code solutions using Terraform, CloudFormation, and related technologies.
- Automate operational workflows, monitoring, remediation, and recovery activities.
- Contribute to CI/CD pipelines and deployment automation initiatives.
- Media & Digital Platform Reliability
- Support the reliability of streaming platforms, content delivery systems, media workflows, APIs, and customer-facing applications.
- Collaborate with engineering teams to improve application reliability and operational readiness.
- Assist in capacity planning and scaling efforts for high-traffic events and media workloads.
- Collaboration & Continuous Improvement
- Partner with cloud, networking, security, and application teams to identify and address operational risks.
- Promote reliability engineering best practices throughout the organization.
- Contribute to documentation, standards, and operational procedures.
- Evaluate emerging technologies and recommend improvements to platform reliability and observability.
Qualifications
- Bachelor's degree in Computer Science, Engineering, Information Systems, or equivalent practical experience.
- 3-7 years of experience in Site Reliability Engineering, Cloud Engineering, DevOps, Infrastructure Engineering, or related roles.
- Strong hands-on experience with AWS cloud services and enterprise-scale AWS environments.
- Experience with:
- Monitoring and observability platforms
- Incident management and root cause analysis
- Operational troubleshooting and performance tuning
- AWS Organizations, Control Tower, and Identity Center
- Experience with Infrastructure as Code:
- Terraform
- CloudFormation
- Experience with CI/CD platforms and deployment automation.
- Experience with scripting and automation using Python, PowerShell, Bash, or similar languages.
- Strong understanding of AWS networking, resiliency, and cloud architecture concepts.
- Experience with logging, metrics, tracing, and alerting technologies.
- Strong troubleshooting, communication, and collaboration skills.
- Additional Information
- Location: New York City, NY or Englewood Cliffs, NJ (Hybrid, 3 days onsite)
- Employees ba
Additional Information
The Cloud Reliability Engineer is responsible for ensuring the availability, performance, scalability, and operational excellence of VERSANT's cloud platforms and services. This role works closely with cloud engineering, application development, networking, security, and operations teams to build and maintain highly reliable systems across a large multi-account AWS environment. The engineer will leverage automation, observability, and reliability engineering practices to improve platform resilience, reduce operational risk, and enhance the customer experience.
As a leading media company, VERSANT operates digital products, streaming platforms, content delivery systems, and media workflows that demand high levels of uptime and performance. The Cloud Reliability Engineer will help ensure these services remain resilient, scalable, and operationally mature.
The ideal candidate has strong experience with AWS, monitoring and observability platforms, incident management, automation, infrastructure as code, and operational best practices. Experience with AWS Organizations, Control Tower, Identity Center, Terraform, and modern cloud operations tooling is highly desirable.