Incident Management Lead
About the role
Incidents are inevitable. How fast you detect them, how quickly you act, and whether the organization actually learns from them - that is what separates payments companies that scale from ones that spiral.

Forward processes payments for thousands of merchants across dozens of partner platforms. When something breaks - a submission failure blocks merchants mid-onboarding, a processing outage hits a partner's book, a compliance flag freezes accounts at scale - the impact lands on real businesses in real time. The question is not whether incidents will happen. It is whether Forward detects them in minutes or hours, resolves them with coordination or chaos, and fixes the root cause or patches the symptom. The Incident Management Lead owns that answer.

This is a modern role for a modern problem. You will build an AI-assisted incident intelligence layer that gives Forward signal before issues become incidents, run coordinated response when they do, and drive the post-incident work that makes the organization genuinely more resilient - not just less embarrassed. You will own the closed loop between reactive resolution and proactive prevention: the governance function that ensures Forward gets faster and smarter after every incident rather than repeating the same failures.

This role sits at the intersection of Engineering, GTM, Support, and Operations, and is directly accountable to the CRCO. When things go wrong at Forward, you are the named owner - before, during, and after.
Responsibilities
Build the Detection Layer
- Design and operate a proactive monitoring and alerting infrastructure: SLO burn-rate alerting, synthetic health checks, deployment risk scoring, and real-time anomaly detection across submission, processing, and compliance pipelines.
- Build and maintain AI-assisted signal intelligence: use AIOps platforms (PagerDuty, Incident.io, or equivalent) to correlate alerts, suppress noise, and surface high-confidence incident precursors before they manifest as partner escalations.
- Own the governance loop over Support: review support ticket themes and escalation patterns on a weekly cadence to identify systemic issues before they cross into incident territory.
- Establish alert-to-noise discipline: define what a true signal looks like for each incident type, tune alerting thresholds, and drive Alert-to-Noise Ratio above 80% - the team acts on signals, not volume.
- Build and maintain the runbook library: pre-written, AI-augmented playbooks for the most common incident classes - submission failures, processing outages, ACH return spikes, TM system failures, compliance freezes - so the first 15 minutes of every incident are not spent figuring out who does what.
Run Incident Response
- Serve as the named incident owner when an incident is declared - responsible for coordinating Engineering, Support, GTM, and Operations from detection through resolution.
- Declare incidents using a consistent severity framework (Sev-1 through Sev-3) with defined, documented SLAs for each tier.
- Drive MTTD (Mean Time to Detect) and MTTA (Mean Time to Acknowledge) toward P1 targets: detection under 5 minutes, acknowledgment under 15 minutes for Sev-1 and Sev-2.
- Manage communications during incidents on a defined cadence: internal stakeholder updates, partner-facing status, and merchant-level communications where required - proactive, not reactive.
- Classify merchant and partner impact in real time: GPV-at-risk, number of affected merchants, partner SLA implications, and any regulatory reporting obligations under DORA or card network rules.
- Use AI-assisted investigation tooling to compress diagnosis time: automated root cause hypothesis generation, timeline reconstruction, and runbook suggestion reduce the first 30 minutes of investigation to seconds.
Drive Post-Incident Learning
- Own the post-incident review (PIR) for every Sev-1 and Sev-2: complete root cause analysis, contributing factor mapping, timeline reconstruction, and remediation item ownership - delivered within 48 hours.
- Track remediation commitments to closure - architectural fixes, tooling gaps, process changes, and partner education. Not just documented: done. Verified. Closed.
- Produce partner-facing incident summaries for high-impact events: clear, factual, and accountable. Where DORA or card network reporting obligations apply, own those submissions on deadline.
- Build and maintain the incident knowledge base: a searchable, AI-indexed record of every incident, RCA, and remediation action that the full team can learn from and reference.
- Track Incident Recurrence Rate as a primary quality signal. A repeated incident is a failed post-incident review.
Quantify Partner and Merchant Impact
- Develop and maintain a GPV-at-risk classification framework: when an incident fires, the team knows immediately which partners and merchants are affected, what volume is at risk per hour, and what the SLA clock looks like.
- Build per-partner SLA attainme
Benefits