Senior Director, AI Operations (AI/LLM Production Systems)

Aecom2 · Dallas, TX
Full-time · On-site · 1mo ago · 30+ days old, may be filled
AWS · Azure · Datadog · Grafana · LangChain · Leadership

Responsibilities

  • Define and scale the enterprise AI Operations practice, including operating model, standards, and governance
  • Establish production readiness and operability standards across AI/LLM and agentic systems
  • Own production reliability, including SLAs/SLOs, incident management, and support models
  • Implement observability and monitoring for AI systems (latency, drift, behavior, failures, cost)
  • Ensure clear ownership, escalation paths, and accountability across production AI systems
  • Build controls for agent behavior, model usage, and operational risk
  • Drive performance, reliability, and cost optimization across AI workloads
  • Lead operational reviews and reporting, providing visibility into system health, risks, and trends
  • Identify systemic issues and drive continuous improvement across AI systems and processes
  • Partner with Engineering, Product, and Platform teams to ensure production readiness and alignment

Requirements

  • Bachelor's degree plus extensive experience in SRE, MLOps, production operations, or platform engineering, including 6 years of leadership experience, or a demonstrated equivalent combination of experience and/or education
  • Experience operating AI/ML/LLM systems in production (serving real users at scale) with clear ownership and accountability
  • Background in SRE, MLOps, or distributed systems, with depth in reliability and operational excellence
  • Strong understanding of AI production failure modes (e.g., drift, hallucinations, orchestration issues, cost inefficiencies)
  • Experience building and scaling observability, monitoring, and telemetry systems (e.g., OpenTelemetry, Datadog, Prometheus, Grafana)
  • Proven track record defining SLAs/SLOs, incident management, and operational frameworks for complex systems
  • Experience leading cross-functional efforts across engineering, platform, and product teams
  • Ability to operate at both strategic and hands-on levels, setting direction while driving execution
  • Experience with LLM platforms or frameworks (e.g., Azure AI, AWS Bedrock, LangChain)
  • Experience with agentic systems, RAG pipelines, or orchestration frameworks
  • Background in ITIL or service management, applied to modern distributed systems
  • Familiarity with Responsible AI and governance frameworks
  • Relocation assistance is not available for this position
  • Sponsorship for US work authorization is not available for this position, now or in the future.

About AECOM

What makes AECOM a great place to work

You will be part of a global team that champions your growth and career ambitions. Work on groundbreaking projects - both in your local community and on a global scale - that are transforming our industry and shaping the future. With cutting-edge techn

Benefits

  • Health insurance
  • Dental insurance
  • Vision insurance
  • Flexible schedule
  • Equity / stock options

Additional Information

We're defining how AI runs in production across the enterprise. As AI adoption scales, the challenge shifts from building models to operating them reliably. This role owns how AI/LLM and agentic systems are run, supported, and governed, ensuring they are reliable, observable, cost-efficient, and continuously improving in real-world environments.

You will lead the development of the enterprise AI Operations practice, establishing the standards, operating model, and visibility required to support AI at scale. This includes defining how systems are monitored, how incidents are managed, how risks are controlled, and how performance is continuously improved.

Working closely with Engineering, AI Platform, Product, and Delivery teams, you will ensure all production AI systems meet clear operational standards and that leadership has consistent visibility into system health, performance, and risk.

This is a hands-on, senior leadership role with end-to-end accountability for how AI systems perform in production.

This position will offer flexibility for hybrid work schedules to include both in-office presence and telecommute/virtual work, to be based from either Houston or Dallas, TX.