Cloud Reliability Engineer

External

Versant3 · Englewood Cliffs, NJ

Full-time · On-site · 1d ago

AWS · Bash · Capacity Planning · CI/CD · CloudFormation · Documentation

Responsibilities

Reliability Engineering
Design, implement, and maintain reliability practices for cloud infrastructure and platform services.
Define and monitor service-level objectives (SLOs), service-level indicators (SLIs), and operational metrics.
Identify reliability risks and implement solutions that improve availability, scalability, and resilience.
Drive continuous improvement initiatives focused on operational excellence and system stability.
Monitoring, Observability & Performance
Design and maintain monitoring, logging, alerting, and observability solutions across AWS environments.
Develop dashboards and reporting that provide visibility into platform health and performance.
Analyze system behavior, identify bottlenecks, and implement performance improvements.
Establish proactive monitoring practices that detect issues before they impact customers.
Incident Response & Operational Excellence
Participate in incident response, troubleshooting, and root cause analysis activities.
Lead post-incident reviews and identify corrective actions to prevent recurrence.
Improve operational processes, runbooks, and recovery procedures.
Support disaster recovery and business continuity initiatives.
AWS Platform Reliability
Support the reliability and operational health of large-scale AWS environments utilizing AWS Organizations, Control Tower, and Identity Center.
Partner with cloud engineering teams to improve platform architecture, resiliency, and operational consistency.
Assist in maintaining secure, scalable, and highly available cloud services.
Automation & Infrastructure as Code
Develop automation that reduces operational toil and improves system reliability.
Support infrastructure-as-code solutions using Terraform, CloudFormation, and related technologies.
Automate operational workflows, monitoring, remediation, and recovery activities.
Contribute to CI/CD pipelines and deployment automation initiatives.
Media & Digital Platform Reliability
Support the reliability of streaming platforms, content delivery systems, media workflows, APIs, and customer-facing applications.
Collaborate with engineering teams to improve application reliability and operational readiness.
Assist in capacity planning and scaling efforts for high-traffic events and media workloads.
Collaboration & Continuous Improvement
Partner with cloud, networking, security, and application teams to identify and address operational risks.
Promote reliability engineering best practices throughout the organization.
Contribute to documentation, standards, and operational procedures.
Evaluate emerging technologies and recommend improvements to platform reliability and observability.
Qualifications
Bachelor's degree in Computer Science, Engineering, Information Systems, or equivalent practical experience.
3-7 years of experience in Site Reliability Engineering, Cloud Engineering, DevOps, Infrastructure Engineering, or related roles.
Strong hands-on experience with AWS cloud services and enterprise-scale AWS environments.
Experience with:
Monitoring and observability platforms
Incident management and root cause analysis
Operational troubleshooting and performance tuning
AWS Organizations, Control Tower, and Identity Center
Experience with Infrastructure as Code:
Terraform
CloudFormation
Experience with CI/CD platforms and deployment automation.
Experience with scripting and automation using Python, PowerShell, Bash, or similar languages.
Strong understanding of AWS networking, resiliency, and cloud architecture concepts.
Experience with logging, metrics, tracing, and alerting technologies.
Strong troubleshooting, communication, and collaboration skills.
Additional Information
Location: New York City, NY or Englewood Cliffs, NJ (hybrid, 3 days onsite)
Employees ba

Benefits

Health insurance

Additional Information

The Cloud Reliability Engineer is responsible for ensuring the availability, performance, scalability, and operational excellence of VERSANT's cloud platforms and services. This role works closely with cloud engineering, application development, networking, security, and operations teams to build and maintain highly reliable systems across a large multi-account AWS environment. The engineer will leverage automation, observability, and reliability engineering practices to improve platform resilience, reduce operational risk, and enhance the customer experience.

As a leading media company, VERSANT operates digital products, streaming platforms, content delivery systems, and media workflows that demand high levels of uptime and performance. The Cloud Reliability Engineer will help ensure these services remain resilient, scalable, and operationally mature.

The ideal candidate has strong experience with AWS, monitoring and observability platforms, incident management, automation, infrastructure as code, and operational best practices. Experience with AWS Organizations, Control Tower, Identity Center, Terraform, and modern cloud operations tooling is highly desirable.
