
Senior Site Reliability Engineer (Arlington, VA)

Onebrief · Northern Virginia (DC Metro)
Full-time · On-site · 2mo ago
Ansible · AWS · Compliance · Grafana · Incident Response · Kubernetes



About the role

We are hiring a Site Reliability Engineer to join our Infrastructure & Security team. You'll work closely with fellow SREs, security, and customer success. You will be the first line of support for our mission-critical deployments and responsible for ensuring best-in-class service quality and issue resolution. You will work in both on-premises DoD environments and AWS cloud environments. Your lessons from the field will shape how our team works, from policy to implementation. In addition to working on-site with the customer, you will contribute directly to solutions that increase the stability, performance, and security of our deployments and improve the overall experience of deploying and managing Onebrief on premises.

About You

You care deeply about reliability and treat it as a core feature of any application or platform, with a bias toward "reliability over novelty." You think about infrastructure and operability as products to be automated, well-documented, and continuously improved, and you aim to leave systems easier to operate than you found them. You are equally comfortable leading a post-incident review or diving into a kubectl shell to triage a complex production issue. You don't just fix problems; you translate constraints and failure modes into clear, automated guardrails and scalable, resilient architecture. For you, robust monitoring, actionable alerting, and insightful runbooks are core parts of the engineering process, not afterthoughts. You mentor others, fostering a culture of blameless postmortems and proactive reliability. You collaborate naturally with application and platform teams, helping them move quickly but safely by building the tools, processes, and observability that make "fast recovery" a reality.

Responsibilities

You'll own the reliability, scalability, and security of the production application and/or platform. You will do this by:

  • Leading Incident Response: Act as the incident responder, and potentially incident commander, during critical incidents, and lead blameless post-mortems / After Action Reviews (AARs) that identify true root causes and drive automated, long-term solutions to prevent recurrence.
  • Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation. You will partner with other teams to share best practices for air-gapped environments and support their readiness for production.

What We Look For

  • An active Top Secret clearance
  • 5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus.
  • Proven partner to DevOps/Platform and application teams; collaborates well across functions and shares context openly.
  • A deep understanding of incid

Benefits

Remote work options

Additional Information

About Onebrief

Onebrief is collaboration and AI-powered workflow software designed specifically for military staffs. By transforming this work, Onebrief makes the staff as a whole superhuman - meaning faster, smarter, and more efficient. We take ownership, seek excellence, and play to win with the seriousness and camaraderie of an Olympic team. Onebrief operates as an all-remote company, though many of our employees work alongside our customers at military commands around the world.

Founded in 2019 by a group of experienced planners, today, Onebrief's team spans veterans from all forces and global organizations, and technologists from leading-edge software companies. We've raised $320M+ from top-tier investors, including Battery Ventures, General Catalyst, Sapphire Ventures, Insight Partners, and Human Capital, and today, Onebrief is valued at $2.15B. With this continued growth, Onebrief is able to make an impact where it matters most.

Security Clearance, Location, and Onsite Notice: This role requires regularly working on-site at customer locations in Arlington, VA. If you are not currently within commuting distance, you must be willing to relocate (note that Onebrief will provide relocation assistance). Active Top Secret Clearance required with the ability to obtain SCI eligibility.


