Sr SRE/DevOps Engineer

Madisonreedcolorbar · Hq

Full-time · On-site · 2d ago

CI/CD · Incident Response · Observability · SAFe · Site Reliability Engineering

Responsibilities

Infrastructure Provisioning & Automation
Design, provision, and manage cloud infrastructure for AI-powered services, agents, orchestration systems, and supporting platforms.
Automate environment setup and configuration across development, staging, and production environments.
Build reusable infrastructure-as-code patterns that improve consistency, security, scalability, and maintainability.
Partner with engineering teams to ensure production systems are resilient, observable, performant, and cost-efficient.
Participate in on-call support, incident response, root cause analysis, and continuous reliability improvement.

CI/CD & Deployment Engineering
Build, maintain, and optimize CI/CD pipelines for services, agents, orchestration layers, and supporting infrastructure.
Implement automated testing, validation, security, and reliability gates within deployment workflows.
Design safe deployment patterns including blue/green deployments, canary releases, feature flags, and automated rollback mechanisms.
Integrate health checks, service readiness checks, and reliability signals into release processes.
Improve deployment speed and confidence while reducing production risk.

AI Platform Operations & Deployment Governance
Package, version, deploy, and manage AI models, agent services, and orchestration components across environments.
Support safe rollout, rollback, refresh, and retirement workflows for AI-powered services.
Monitor AI service performance across latency, throughput, availability, cost, quality, and business-critical reliability signals.
Implement operational controls for AI systems, including version tracking, environment promotion, access management, and change governance.
Partner with data, engineering, product, and support teams to ensure AI systems are production-ready and operationally accountable.

Telemetry, Observability & Data Pipelines
Design and operate scalable telemetry pipelines for logs, metrics, traces, model events, agent interactions, and operational signals.
Enable structured observability for AI services and orchestration systems to support real-time monitoring, alerting, and diagnostics.
Build dashboards, alerts, and reporting that provide actionable insight into system health, performance, reliability, and cost.
Improve incident detection, triage, and resolution through high-quality telemetry and operational data.
Support data-driven reliability practices, including SLOs, error budgets, service health reviews, and post-incident analysis.

AIOps Platform Integration
Implement intelligent monitoring, alert correlation, anomaly detection, and automated incident response capabilities.
Integrate AIOps tools and workflows into existing DevOps, SRE, and engineering operations.
Build automation that reduces manual operational work and improves mean time to detect and resolve issues.
Identify opportunities to use AI and automation to improve platform reliability, observability, supportability, and operational efficiency.

Production Reliability & SRE Excellence
Define and maintain reliability standards for AI-powered production systems.
Establish and track service-level indicators, service-level objectives, and operational readiness requirements.
Lead reliability reviews, production readiness assessments, and infrastructure risk assessments.
Drive improvements in system resilience, scalability, security, performance, and cost optimization.
Champion SRE best practices across engineering teams.

Requirements

Required Experience

Benefits

Health insurance
Vision insurance

Additional Information

Role Description

Madison Reed is seeking a hands-on Senior SRE / AI Platform DevOps Engineer to build, operate, and scale the infrastructure behind our AI-powered services, agents, and orchestration platforms. This role sits at the intersection of site reliability engineering, cloud infrastructure, DevOps automation, observability, and AI operations. You will own the systems and practices that ensure our AI-enabled services are reliable, secure, scalable, cost-effective, and production-ready.

The ideal candidate is infrastructure-first and operationally minded, with deep experience in cloud environments, CI/CD, production monitoring, incident response, and automation. You will help operationalize AI systems by building reliable deployment workflows, telemetry pipelines, monitoring frameworks, and governance processes for models, agents, and orchestration services. This is a highly hands-on engineering role for someone who enjoys building resilient platforms, reducing operational risk, improving deployment velocity, and making advanced technology dependable in real-world production environments.

The base range for this position is $170k to $175k. At Madison Reed, we aim to pay competitively. Factors that may affect starting pay within this range include geography/market, skills, education, experience, and other qualifications of the successful candidate. This role must be based in the United States.
