Senior/Staff DevOps Engineer

Ethos · Worldwide

Full-time · Remote · 1mo ago

Capacity Planning · CI/CD · Compliance · Documentation · Helm · Incident Response

About the role

You'll lead the deployment and operationalization of our SaaS products across Commercial Cloud, government networks, and bespoke/air-gapped customer environments. As a Senior engineer, you'll own end-to-end infrastructure delivery, elevate DevOps practices, and collaborate closely with Software and Product. As a Staff engineer, you'll additionally shape platform engineering strategy, set technical direction for distributed systems at scale, and influence design patterns that enable AI workloads and complex data pipelines. You'll treat AI tooling as core to your daily workflow (for IaC, pipelines, incident response, and toil reduction) and help shape the agentic operations patterns and AI workloads our platform runs. If you love solving hard deployment problems, care deeply about security and reliability, can scale modern cloud platforms with rigor, and embrace AI-augmented operations as the way forward, this role is for you.

Responsibilities

Design & Operate the Platform: Architect, implement, and run secure, scalable, multi-tenant infrastructure (infra as code, immutable artifacts, GitOps).
AI-Augmented Operations & Platform Work: Use AI coding and agentic tools (Claude Code, Cursor, Copilot, MCP-based ops agents) for IaC authoring, pipeline development, log/trace analysis, postmortem drafting, and toil reduction; build and improve agentic workflows for the team.
CI/CD & Release Engineering: Build and harden pipelines (build, test, scan, sign, promote, deploy) for multi-environment delivery, including disconnected/air-gapped workflows.
Observability & Reliability: Establish SLOs; instrument systems for metrics/logs/traces; drive incident response and postmortems; reduce MTTR and change failure rate.
Security & Compliance by Design: Integrate supply-chain security (SBOMs, signing, provenance), secrets management, and baseline hardening (CIS/STIG-aligned).
Cost & Performance: Optimize infrastructure spend and performance (capacity planning, autoscaling, right-sizing, storage/egress strategies).
Technical Leadership: Lead design reviews, author RFCs, mentor engineers, and raise the quality bar for platform changes.
Gov/Constrained Deployments: Support IL-4/IL-5-aligned patterns, RMF documentation support, and offline artifact promotion processes where needed.
(Staff) Strategy & Standards: Define platform roadmaps, establish consistent deployment and infrastructure patterns, and guide cross-team adoption of best practices.

Measures of Success (First 6-12 Months)

Availability & Reliability: Meet or exceed service SLOs; reduce MTTR by ≥30%.
Delivery Velocity: Increase deployment frequency by ≥2× while keeping change failure rate ≤15%.
Pipeline Efficiency: Cut CI pipeline duration by ≥25% and reduce flaky tests significantly.
Security Posture: Achieve ≥95% pass rate for supply-chain/security gates (image signing, SBOM scans, vulnerability thresholds); reduce MTTR (mean time to remediate) for high-severity CVEs to ≤14 days.
Cost & Drift: Deliver ≥15% infra cost savings without performance regressions; keep infra drift near zero via GitOps and policy as code.
Gov/Offline Readiness: Stand up an artifact promotion flow (build → scan → sign → export) suitable for disconnected deployments with documented runbooks.

30/60/90 Day Plan

First 30 Days - Map & Baseline
Deep-dive on current cloud topology, CI/CD, observability, security controls, and on-call.
Inventory build and runtime artifacts; document deployment environments and promotion paths.
Baseline reliability and delivery metrics (SLOs, MTTR, deploy frequency, CFR, pipeline timing).
Establish and prove the effectiveness of your personal workflow with AI tooling.
60 Days - Design & Deliver
Harden CI/CD: add SBOM generation, signing (e.g., Cosign/Sigstore), and policy gates.
Implement or refine infrastructure modules (Terraform) and Helm/Kustomize charts with GitOps flows.
Establish service SLOs and golden signals; wire alerts and dashboards for top services.
Pilot artifact export/import flow for air-gapped/disconnected deployments; write runbooks.
90 Days - Scale & Standardize
Standardize CI/CD pipelines and infrastructure modules across existing services.
Migrate priority services to hardened delivery paths; deprecate legacy workflows.
Land cost/performance wins (e.g., autoscaling policies, instance/stor

Additional Information

About Ethos

Ethos is on a mission to bridge the human readiness gap by transforming how training is developed, consumed, and aligned with strategic business outcomes. As a well-funded Series A startup ($40M+ raised), we're a trusted partner to 150+ enterprise customers across the U.S. military, life sciences, manufacturing, supply chain, and professional sports. We're expanding our engineering team to deliver a best-in-class learning platform: smarter, faster, and more optimized. We've gone all-in on AI tooling in our development process, and we're adopting and building on the best new practices for creating software in this era.
