Vice President, Head of Infrastructure Resiliency

Assetmark · Charlotte, NC
Full-time · Hybrid · 2w ago
Budget Management · CI/CD · Incident Response · Leadership · Move · Observability

Responsibilities

  • Production Operations & Reliability Transformation
      • Own 24/7 production operations for mission-critical systems, including incident management, batch processing, and environment stability
      • Lead the transformation of production operations from manual, reactive processes to automated, engineering-driven systems
      • Establish an engineering-first mandate to eliminate manual toil and operational overhead
      • Drive systematic improvements in reliability, scalability, and operational efficiency
  • Reliability Engineering & Error Budget Management
      • Define and operationalize Service Level Indicators (SLIs) and Service Level Objectives (SLOs) across all critical systems
      • Establish and govern Error Budgets to balance product velocity with platform stability
      • Drive measurable reduction in operational toil through automation and engineering solutions
      • Embed reliability targets into planning and decision-making across teams
      • Apply Site Reliability Engineering (SRE) principles to quantify and manage reliability
  • Observability & Resilience Engineering
      • Build full-stack observability (metrics, logs, traces) to improve detection and diagnosis of issues
      • Evolve monitoring into deep observability with actionable alerting and reduced alert fatigue
      • Establish resilience testing practices (e.g., game days, fault injection)
      • Drive automated incident response and self-healing systems
      • Institutionalize blameless post-mortems focused on systemic improvement
      • Leverage SRE practices for incident learning and continuous improvement
  • Platform Engineering & Infrastructure
      • Ensure all infrastructure is managed via Infrastructure as Code (IaC) for consistency, scalability, and recovery
      • Own reliability and operational integrity of CI/CD pipelines, including automated release gating
      • Build self-service platforms and tooling that enable engineering teams to deploy and operate services safely
      • Modernize batch processing and environment management through automation and engineering rigor
  • Shared Reliability Ownership with Engineering
      • Establish shared accountability for reliability between Platform, SRE, and Software Engineering teams
      • Partner with Engineering to co-deliver reliability improvements and conduct joint post-incident reviews
      • Influence engineering practices including production readiness, safe deployments, and observability standards
      • Ensure reliability is embedded early in the software development lifecycle
  • Ecosystem & Vendor Reliability
      • Define and enforce reliability standards for third-party vendors and platform dependencies
      • Establish SLIs/SLOs for external services and manage vendor performance accordingly
      • Map and govern system dependencies to prevent cascading failures
  • How Success Is Measured
      • Sustained improvement in platform reliability as measured by SLO attainment
      • High availability and resiliency of client-facing systems
      • Reduction in operational toil and manual intervention across teams
      • Increased deployment velocity without degradation of reliability
      • Adoption of Infrastructure as Code and self-service platform capabilities
      • Reduction in incident frequency and improved detection (MTTD) and recovery (MTTR)
      • Demonstrated transformation from manual operations to engineering-led reliability

Requirements

  • Engineering-First Mindset & Technical Depth
      • Strong background in Software Engineering or Systems Engineering; you lead reliability through code, not process alone
      • Deep expertise in distributed systems, failure modes, and large-scale platform architecture
      • Passionate about observability, SLOs, and data-driven reliability management
  • Proven Leadership Across Operations and Engineering
      • Experience owning production operations for mission-critical systems
      • Track record of transforming manual, operations-heavy environments into automated, engineering-led pl

Additional Information

Job Description: As the Head of Platform Resiliency & Operations, you are accountable for operating and engineering the reliability, scalability, and resilience of AssetMark's platform. This role owns production operations today (including environments, batch processing, incident response, and day-to-day platform management), which are currently operationally intensive. Your mandate is to transform this reality by driving an engineering-first approach to production management and infrastructure. You will lead a fundamental shift from reactive, manual operations to proactive, automated, and engineered reliability, while continuing to deliver a high-quality, always-on platform for our clients.

This role has a twofold mandate:

  • Deliver on our client commitment by operating a high-availability, high-resiliency platform where reliability is a defining feature of the product
  • Enable high-velocity product development by building systems, tooling, and practices that allow Product & Engineering to move fast without compromising stability

We can only consider candidates for this position who are able to accommodate a hybrid work schedule and are close to our Charlotte, NC office.


Interested in this role?

Apply on the company's website.
