Support Engineer, AWS Incident Response
About the role
As a Support Engineer on AIR's Seattle team, you'll be on the front line of AWS incident response. You'll lead high-severity calls, triage complex failures across distributed systems, coordinate resolver teams, and drive incidents to mitigation while millions of customers depend on the outcome. Between incidents, you'll obsess over metrics and detection analysis, building dashboards and mechanisms that surface problems before customers notice. You will drive operational improvements that make the incident management ecosystem faster and more accurate.

This isn't a role where you watch dashboards and robotically follow runbooks. You'll deep-dive the largest, most complex technical environment in the world. You'll develop expertise across AWS services, networking, and infrastructure. You'll own operational processes end-to-end and use data to find the next leap in how we protect the cloud. If interested, you'll also have the opportunity to grow your development skills by taking on coding projects matched to your ability level.

This role includes participation in an on-call rotation, including some weekends and holidays.

Key job responsibilities

Incident Response
Lead high-severity incident response calls. Triage, coordinate resolvers across AWS service teams, communicate clearly under pressure, and drive incidents to mitigation. Manage escalations and ensure accurate documentation throughout.

Operational Excellence and Detection
Own and run operational health reviews. Build and maintain dashboards, metrics, and monitoring that surface trends before they become incidents. Obsess over detection accuracy and speed. Detect patterns across events and drive proactive mechanisms to prevent recurrence.

Metrics and Analysis
Deep-dive operational data to identify systemic issues, measure response effectiveness, and prioritize improvements. Use metrics to tell the story of what's working, what's degrading, and where the next risk is hiding.

Process and Tooling Improvement
Identify gaps in operational processes, documentation, and tooling. Build or improve mechanisms that reduce time-to-detection and time-to-mitigation. Use data to prioritize where effort has the highest impact.

Automation and Generative AI
Leverage scripting, generative AI, and automation to accelerate incident response, improve detection, and reduce toil. Identify opportunities where AI can augment human judgment during incidents or surface insights from operational data at scale.

Driving Continuous Improvement
Ensure each incident makes AWS stronger. Work with service teams to ensure learnings from incidents drive corrective actions and that follow-through happens. Close the loop between what broke and what gets fixed.