Vice President, Head of Infrastructure Resiliency

External

AssetMark · Charlotte, NC

Full-time · Hybrid · 2w ago

Budget Management · CI/CD · Incident Response · Leadership · Move · Observability

Responsibilities

Production Operations & Reliability Transformation
Own 24/7 production operations for mission-critical systems, including incident management, batch processing, and environment stability
Lead the transformation of production operations from manual, reactive processes to automated, engineering-driven systems
Establish an engineering-first mandate to eliminate manual toil and operational overhead
Drive systematic improvements in reliability, scalability, and operational efficiency
Reliability Engineering & Error Budget Management
Define and operationalize Service Level Indicators (SLIs) and Service Level Objectives (SLOs) across all critical systems
Establish and govern Error Budgets to balance product velocity with platform stability
Drive measurable reduction in operational toil through automation and engineering solutions
Embed reliability targets into planning and decision-making across teams
Apply Site Reliability Engineering (SRE) principles to quantify and manage reliability
Observability & Resilience Engineering
Build full-stack observability (metrics, logs, traces) to improve detection and diagnosis of issues
Evolve monitoring into deep observability with actionable alerting and reduced alert fatigue
Establish resilience testing practices (e.g., game days, fault injection)
Drive automated incident response and self-healing systems
Institutionalize blameless post-mortems focused on systemic improvement
Leverage SRE practices for incident learning and continuous improvement
Platform Engineering & Infrastructure
Ensure all infrastructure is managed via Infrastructure as Code (IaC) for consistency, scalability, and recovery
Own reliability and operational integrity of CI/CD pipelines, including automated release gating
Build self-service platforms and tooling that enable engineering teams to deploy and operate services safely
Modernize batch processing and environment management through automation and engineering rigor
Shared Reliability Ownership with Engineering
Establish shared accountability for reliability between Platform, SRE, and Software Engineering teams
Partner with Engineering to co-deliver reliability improvements and conduct joint post-incident reviews
Influence engineering practices including production readiness, safe deployments, and observability standards
Ensure reliability is embedded early in the software development lifecycle
Ecosystem & Vendor Reliability
Define and enforce reliability standards for third-party vendors and platform dependencies
Establish SLIs/SLOs for external services and manage vendor performance accordingly
Map and govern system dependencies to prevent cascading failures
How Success Is Measured
Sustained improvement in platform reliability as measured by SLO attainment
High availability and resiliency of client-facing systems
Reduction in operational toil and manual intervention across teams
Increased deployment velocity without degradation of reliability
Adoption of Infrastructure as Code and self-service platform capabilities
Reduction in incident frequency and improved detection (MTTD) and recovery (MTTR)
Demonstrated transformation from manual operations to engineering-led reliability

Requirements

Engineering-First Mindset & Technical Depth
Strong background in Software Engineering or Systems Engineering; you lead reliability through code, not process alone
Deep expertise in distributed systems, failure modes, and large-scale platform architecture
Passionate about observability, SLOs, and data-driven reliability management
Proven Leadership Across Operations and Engineering
Experience owning production operations for mission-critical systems
Track record of transforming manual, operations-heavy environments into automated, engineering-led pl

Additional Information

Job Description: As the Head of Platform Resiliency & Operations, you are accountable for operating and engineering the reliability, scalability, and resilience of AssetMark's platform. This role owns production operations today (including environments, batch processing, incident response, and day-to-day platform management), which are currently operationally intensive. Your mandate is to transform this reality by driving an engineering-first approach to production management and infrastructure. You will lead a fundamental shift from reactive, manual operations to proactive, automated, and engineered reliability, while continuing to deliver a high-quality, always-on platform for our clients.

This role has a twofold mandate:
Deliver on our client commitment by operating a high-availability, high-resiliency platform where reliability is a defining feature of the product
Enable high-velocity product development by building systems, tooling, and practices that allow Product & Engineering to move fast without compromising stability

We can only consider candidates for this position who are able to accommodate a hybrid work schedule and are close to our Charlotte, NC office.
