SRE Architect, AI-Powered Reliability

WEX · Portland, ME

Full-time · On-site · 2w ago

Capacity Planning · Chaos Engineering · Leadership · Load Testing · Move · Observability

Benefits

Health insurance

Additional Information

About the Team & Role

WEX operates across multiple lines of business (Mobility, Benefits, and Travel), serving enterprise customers globally with payment and technology solutions that demand uncompromising reliability. These are mission-critical systems handling high-volume financial transactions where availability, transactional integrity, and low latency are non-negotiable.

Our SRE practice is in its early stages, and the decisions made now will define how we build, operate, and continuously improve reliable systems for years to come. This person will define and enforce the reliability standards, operational practices, and architectural guardrails that every line of business at WEX must meet, and will use AI as a primary tool to establish, scale, and continuously improve those standards faster than traditional approaches alone can achieve.

This is not a role embedded in a single business unit. It sits at the center of WEX engineering with a mandate that spans all LOBs. You will set the bar, and you will hold it, working with engineering leadership, platform teams, and LOB architects to make reliability a consistent, measurable, and continuously improving property of every system we operate.

How you'll make an impact

Enterprise Standards & Governance

Define, publish, and enforce enterprise-wide SRE best practices and operational standards covering observability, incident management, resilience, capacity planning, and reliability architecture, applicable across all WEX lines of business.
Define and lead WEX's AI-Powered Reliability Engineering strategy, driving adoption of SRE agents across the software lifecycle, from design and development through deployment and operations, to improve reliability, automation, and operational efficiency.
Architect and oversee the implementation of mission-critical systems, ensuring that reliability, availability, and transactional integrity requirements are designed in from the start, not bolted on after the fact.
Establish and govern SLO, SLI, and error budget frameworks across LOBs, partnering with engineering leadership to align reliability targets with business and commercial expectations.
Own the production readiness review process, defining the criteria every service must meet before going live and driving accountability for remediation when gaps are found.
Serve as the primary technical advisor to engineering leadership across WEX on matters of reliability, resilience architecture, and operational excellence.

Observability

Define the enterprise observability standard (what good looks like for metrics, distributed tracing, structured logging, and alerting) and hold all LOBs accountable to it.
Use AI-powered tooling to move beyond static dashboards: deploy intelligent anomaly detection, dynamic baselining, and automated signal correlation to reduce noise and surface actionable signals at scale.
Drive instrumentation practices that give engineering teams genuine insight into the health of high-availability, low-latency systems, including real-time payment flows and transaction pipelines where latency and consistency are critical.
Lead the evaluation and adoption of AI-assisted observability platforms that reason across telemetry sources to accelerate detection and diagnosis.

Incident Management

Establish the enterprise incident management framework: severity definitions, response playbooks, escalation paths, on-call standards, and cross-LOB communication protocols.
Integrate AI into the full incident lifecycle: intelligent triage and automated runbook suggestions at detection, real-time signal correlation during active incidents, and AI-assisted timeline and impact summaries at resolution.
Reduce cognitive burden on on-call engineers through tooling that surfaces relevant context, prior incidents, and likely remediation paths automatically during high-pressure situations.
Define, track, and report on incident metrics (MTTD, MTTR, recurrence rate) across all LOBs, using trends to drive systemic improvement rather than one-off fixes.

Resilience Engineering & Self-Healing Systems

Lead cross-functional initiatives to enhance system resilience and performance across WEX, advocating for circuit breakers, bulkheads, graceful degradation, retry strategies, and fault isolation as enterprise standards.
Design self-healing and auto-recovery mechanisms that allow systems to detect, respond to, and recover from common failure modes without human intervention, reducing toil and improving mean time to recovery.
Build and operate chaos engineering programs appropriate for WEX's financial systems, running controlled failure experiments that expose resilience gaps safely and systematically before they manifest as production incidents.
Use AI to proactively identify resilience risks: analyze production telemetry, deployment signals, and dependency graphs to surface systems most likely to fail under stress before incidents occur.

Capacity Planning & Load Testing

Develo
