Design, implement, and maintain infrastructure-level disaster recovery processes for compute, storage, network, identity, virtualization, and cloud platforms.
Develop and validate recovery runbooks, sequencing plans, failover procedures, and dependency maps.
Lead and execute technical DR tests, including component-level restores, partial failovers, and full integrated recovery exercises.
Identify and document Tier 0/Tier 1 dependencies across identity, DNS, networking, hypervisors, storage arrays, backup platforms, and application stacks.
Collaborate with platform owners to define, enforce, and maintain realistic RTO/RPO targets; validate backup recoverability and integrity.
Coordinate engineers across infrastructure, cloud, network, security, backup, and application teams during tests and real incidents; drive remediation of gaps/technical debt.
Serve as the primary point of contact for Internal Audit and external assessors for DR scope; own audit evidence and compliance documentation.
What You Will Need to Be Successful:
Bachelor's degree in an IT-related field (e.g., Information Technology, Computer Science, Cybersecurity, Information Systems) or equivalent practical experience.
Five years of experience in enterprise IT (infrastructure, SRE, operations, security), including direct DR or large-scale incident recovery work and experience running DR tests or recovery events.
Understanding of backup and recovery principles; ability to evaluate RTO/RPO considerations; proficiency in mapping application and infrastructure dependencies; foundational knowledge of identity management, DNS, networking, and storage systems.
Strong technical documentation and execution skills; ability to translate technical risk into business impact; confident facilitation of cross-functional exercises.
Requirements
Preferred: experience in cyber recovery or ransomware response; familiarity with regulatory and audit-driven DR; certifications such as ITIL or CBCP.
Other Key Skills:
Strong understanding of cyber recovery concepts, including clean-room restores, immutable backups, air-gapped architectures, and ransomware recovery.
Strong technical knowledge of compute (VMware, Hyper-V, cloud compute), storage systems, SAN/NAS, replication technologies, and data protection platforms.
Strong understanding of identity and access (AD, Azure AD), DNS, networking fundamentals, and how these control planes impact recovery.
Skilled at coordinating large groups of engineers across multiple domains; proficiency in dependency mapping and failover/failback procedures.