Forward Deployed Site Reliability Engineer

Twenty · Fort Meade, MD

Full-time · On-site · 1mo ago

AWS · Capacity Planning · Compliance · DevSecOps · Docker · Grafana

About the role

America is under sustained cyber attack. Our adversaries infiltrate our networks, steal our IP, and degrade the digital infrastructure that modern life runs on. They've learned, correctly, that those attacks rarely produce consequences. Twenty was founded to change that, by making our adversaries think twice before they attack us. Our vision is American and allied primacy in cyberspace: a future where they cannot contest us, deterrence is assured, and the free world remains secure.

Founded in 2024, Twenty Technologies (www.twenty.io) industrializes offensive cyber operations for the U.S. and its allies. Headquartered in Arlington, Virginia, Twenty has raised $38M from Caffeinated Capital, General Catalyst, and In-Q-Tel.

Role Summary

You'll be our eyes, ears, and hands on the ground at a government customer site, ensuring the reliability and performance of Twenty's mission-critical platform running in a restricted, air-gapped AWS environment. This role sits at the intersection of deep technical ownership and customer-facing engineering: you'll define how we measure reliability, lead incident response in a constrained environment, and serve as the primary technical link between what's happening on-site and the engineering team back in Arlington.

You'll work closely with the DevSecOps engineer to ensure the platform operates within government security and compliance requirements, and with product engineers to translate operational reality into actionable feedback. You'll report directly to the VP of Engineering. If you thrive operating with autonomy in high-stakes environments and find satisfaction in making complex systems provably reliable, this role is for you.

Responsibilities

Reliability Engineering
Define, track, and report on SLIs and SLOs for platform services running in the customer environment.
Use error budgets to drive reliability conversations with the Arlington engineering team, translating operational data into prioritized engineering work.
Identify and eliminate toil: build automation for repetitive operational tasks within the constraints of the secure environment.
Conduct post-incident reviews, own root cause analysis, and drive durable fixes in partnership with the engineering team.
Observability & Incident Response
Own the observability posture for the on-site deployment - dashboards, alerting thresholds, and log pipelines using the LGTM stack (Grafana, Loki, Tempo, Mimir).
Lead incident response on-site: triage, containment, coordination with Arlington, and customer communication.
Maintain and continuously improve runbooks for operational procedures and emergency response protocols.
Serve as the on-call anchor for the customer environment, with clear escalation paths to the engineering team.
Deployment & Infrastructure Operations
Work with the customer deployment team to get Twenty's platform stood up and updated within the restricted environment.
Manage containerized services (Docker, Docker Compose) across the deployment lifecycle - configuration, updates, rollbacks.
Apply and validate Terraform-based infrastructure changes within the enclave, in coordination with the DevSecOps (DSO) engineer who owns IaC policy and guardrails.
Perform capacity planning and flag scaling requirements to the Arlington team before they become incidents.
Customer Liaison & Engineering Feedback
Serve as the primary technical interface between the government customer and Twenty's engineering team - translating operational requirements, constraints, and issues in both directions.
Represent the operational environment accurately in engineering discussions: what the team in Arlington can't see, you make visible.
Partner with the DevSecOps engineer on compliance, logging, and audit requirements specific to the customer environment.
Provide technical guidance and support to customer stakeholders on system behavior and troubleshooting procedures.

Requirements

You own reliability outcomes, not just uptime dashboards - you define what "healthy" means and hold the system to it.
You're as comfortable writing a runbook as you are deep in a production incident with limited tooling and no safety net.
You operate well with minimal remote support - ambiguity doesn't paralyze you, and you know when to escalate versus when to solve it yourself.
You build trust naturally with external stakeholders, including government customers, and can translate complex technical situations into plain language under pressure.
You treat toil as a bug: if you're doing something manually more than twice, you automate it.
You communicate with precision - your incident reports and runbooks are read by people who weren't in the room, and they need to be right.
You understand that in a restricted environment, you are the feedback loop - and you take that responsibility seriously.
5+ years of professional experience in site reliability engineering, production operations, or a closely related infrastructure role.
Prov

Benefits

Health insurance · Vision insurance · Remote work options

Interested in this role?

Apply on the company's website.
