
Director, IT Resilience and Modernization Lead

FWD · Taikoo Shing (Group Office), Hong Kong
Full-time · On-site · Today
Leadership · Observability


Benefits

Health insurance, Vision insurance, Equity / stock options

Additional Information

About FWD Group

FWD Group (1828.HK) is a pan-Asian life and health insurance business that serves approximately 40 million customers across 10 markets, including BRI Life in Indonesia. FWD's customer-led and tech-enabled approach aims to deliver innovative propositions, easy-to-understand products and a simpler insurance experience. Established in 2013, the company operates in some of the fastest-growing insurance markets in the world with a vision of changing the way people feel about insurance. FWD Group is listed on the main board of the Hong Kong Stock Exchange under the stock code 1828. For more information, please visit www.fwd.com.

Purpose

- Own the Group-wide strategy, policy, and outcomes for IT resilience, service reliability, and platform modernization across FWD Group's infrastructure, cloud platforms, and core system applications.
- Set reliability objectives, govern error-budget policy, and hold decision rights over production change risk for NB and customer-facing services.
- Chair the Group Resilience Council and drive a federated SRE operating model across Group Office and Business Units (BUs).
- Strategize, design, enforce, and govern the Group resilience standard: all systems must be highly available and have both a DR plan and a failover plan.
- Lead modernization of legacy platforms and production services by defining target-state architectures, resilience patterns, upgrade roadmaps, and remediation priorities to improve availability, scalability, security, and maintainability.
- IT SRE advises teams on how to fix the issues identified in P1/P2 RCAs. It drives troubleshooting and analysis to pinpoint where in the code fixes are needed, and it oversees the entire troubleshooting process.
- Drive modernization through observability, automation, SRE practices, and engineering enablement, ensuring incident learnings translate into platform hardening, architectural simplification, and faster, safer delivery.
- Act as a multi-SME / generalist team, with SMEs across different areas (Security, Network, Cloud, Application, Infrastructure, etc.).
- IT SRE manages and owns the new P1/P2 escalation protocol.

Key accountabilities

- Modernization
  - Define and drive the modernization roadmap for core system applications, including lifecycle management, upgrade strategy, technical debt reduction, platform simplification, and resilience-by-design requirements.
  - Lead modernization reviews for core systems to assess architecture fitness, recoverability, scalability, supportability, and security, and translate incident learnings into prioritized remediation and refactoring plans.
  - Establish modernization guardrails for core application estates covering observability, automation, release engineering, resilience patterns, decommission planning, and adoption of cloud-native or platform-standard capabilities where appropriate.
- Enterprise governance & policy: Develop and own Group standards covering SRE, resilience, modernization guardrails, error budgets, on-call and incident command frameworks, and production change controls; enforce release gates based on reliability and modernization risk.
- Reduce MTTR and incident recurrence; scale SLO coverage to ≥ 90% of critical services.
- Drive reliability-by-design reviews for NB and customer-facing system changes, preventing recurrence through architectural guardrails and automated release gates.
- Platform ownership: Act as product owner for the Observability & AIOps platforms and drive modernization enablers, including telemetry standards, engineering guardrails, automated release controls, and reusable patterns for cloud-native adoption.
- Resilience: Approve DR tiers (RTO/RPO), lead chaos/DR exercises, and ensure cyber-resilience alignment with Security.
- Financials: Own the SRE platform budget, FinOps targets, and vendor SLA outcomes.
- Org & talent: Build a global SRE leadership bench, run the SRE Academy, and operate a follow-the-sun model with healthy on-call practices.
- Stakeholder engagement: Prepare and present executive reporting to GMT on production reliability risk and major incidents.
- Provide leadership and decision-making across Group IT, local BU IT teams, local BU users, and the Group Digital & Data team by setting troubleshooting direction and approach.
- Provide thought leadership in root cause analysis and develop long-term prevention measures.
- Governance framework: Develop and publish Group SRE standards, including SLO/SLI definitions, error budgets, release gates, on-call health, and post-incident review policies. Track and monitor their execution across Group IT, local BUs, Group Digital & Data, and other relevant stakeholders.
- Be closely involved in technical leadership decision-making and collaborate with other key technology leaders within the Group and local BUs to set the governance framework for enhancing SRE and production reliability.
- Embedded Chapters: Create embedded SRE cha

