Senior Reliability Engineer

Barbaricum · Washington, DC

Full-time · On-site · 6d ago

Ansible · AWS · Azure · Chaos Engineering · Documentation · Incident Response

Responsibilities

Monitor and maintain system reliability, availability, and performance across on-premises, cloud, and hybrid IT environments supporting MC&FP mission requirements.
Implement proactive performance monitoring, automated alerting, incident response workflows, and resilience engineering practices to reduce downtime and improve operational visibility.
Develop, maintain, and improve scalable automated infrastructure solutions that support reliable system operations and repeatable service delivery.
Implement rollback strategies, recovery approaches, and chaos engineering practices to validate resilience, reduce operational risk, and improve system stability.
Analyze usage patterns, capacity trends, and performance indicators to support dynamic scaling, resource optimization, and system improvement decisions.
Develop and maintain real-time operational dashboards, reports, and metrics that enable rapid decision-making, leadership awareness, and system optimization.
Respond to and resolve system outages, impairments, and service disruptions while coordinating with technical teams to minimize mission impact.
Conduct post-incident reviews to identify root causes, document lessons learned, and implement preventative measures that reduce recurrence.
Collaborate with software developers, cloud engineers, cybersecurity personnel, and operations teams to improve services, reliability patterns, deployment practices, and operational standards.
Create and maintain system documentation, configuration standards, operational runbooks, monitoring procedures, and service reliability guidance.
Automate common operations tasks to reduce manual workloads, improve consistency, and increase system efficiency.
Implement security best practices across operational activities, infrastructure automation, monitoring, incident response, and system administration functions.

Required Skills

Expert knowledge of site reliability engineering practices, system monitoring, incident management, automation, performance tuning, and operational resilience.
Strong understanding of Windows and Linux administration, infrastructure operations, system configuration, service management, and troubleshooting practices.
Experience with automation platforms and configuration management tools such as Ansible, Puppet, Chef, or similar technologies.
Proficiency with scripting languages such as Python, Shell, PowerShell, or similar tools used to automate operational and infrastructure tasks.
Knowledge of cloud services and infrastructure across AWS, Microsoft Azure, Google Cloud, or comparable cloud environments.
Strong understanding of network troubleshooting, configuration, connectivity analysis, system dependencies, and performance bottleneck identification.
Ability to design, interpret, and maintain dashboards, alerts, metrics, logs, and operational reporting that support service health and decision-making.
Ability to conduct root cause analysis, post-incident reviews, and corrective action planning in complex technical environments.
Strong problem-solving skills and the ability to work under pressure dur

Benefits

Health insurance

Additional Information

Barbaricum is a rapidly growing government contractor providing leading-edge support to federal customers, with a particular focus on Defense and National Security mission sets. We leverage more than 17 years of support to stakeholders across the federal government, with established and growing capabilities across Intelligence, Analytics, Engineering, Mission Support, and Communications disciplines. Founded in 2008, our mission is to transform the way our customers approach constantly changing and complex problem sets by bringing to bear the latest in technology and the highest caliber of talent.

Headquartered in Washington, DC's historic Dupont Circle neighborhood, Barbaricum also has a corporate presence in Tampa, FL, Bedford, IN, and Dayton, OH, with team members across the United States and around the world. As a leader in our space, we partner with firms in the private sector, academic institutions, and industry associations with a goal of continually building our expertise and capabilities for the benefit of our employees and the customers we support. Through all of this, we have built a vibrant corporate culture diverse in expertise and perspectives with a focus on collaboration and innovation. Our teams are at the frontier of the Nation's most complex and rewarding challenges. Join our team.

Barbaricum is seeking an experienced Senior Site Reliability Engineer to support the reliability, availability, automation, and operational performance of IT and cloud systems under the Military Community and Family Policy (MC&FP) Outreach and Digital Enterprise Services (MODES) contract. You will help ensure MC&FP systems are reliable, scalable, resilient, and efficiently managed through proactive monitoring, automated incident response, performance optimization, and operational dashboards that support rapid decision-making