Sr Machine Learning/AI Engineer / MLOps Engineer
About the role
Sitting at the critical intersection of ML engineering, platform engineering, and observability, the Peakon MLOps Engineer serves as the central operational link between development and production. You will work closely with ML engineers, backend engineers, and the central Agent Forge / ML Runtime platform teams to ensure our agents run reliably for customers by providing clear runbooks, robust observability, on-call coverage, and predictable incident response. Your primary objective is to enable ML engineers to safely ship changes and understand their impact by providing streamlined tooling and operational support.
The Peakon MLOps Engineer is responsible for operating, hardening, and continuously improving the production infrastructure that powers the Peakon Agent, AI Features, and related ML workloads. This ownership spans deployments, monitoring, and on-call workflows. Key operational functions include managing the entire deployment lifecycle for the Peakon Agent and other AI Features and helping drive operational excellence across the ML platform. The role also involves building and maintaining tooling to surface evaluation data, supporting performance testing and load simulations, and collaborating on the automation of essential security upgrades for ML dependencies.

About You

Basic Qualifications Summary
Minimum 8 years of relevant industry experience.
Bachelor's, Master's, or PhD in Computer Science, Data Science, Statistics, Mathematics, Engineering, or equivalent practical experience.
Proven track record as an MLOps or ML-savvy SRE/Platform Engineer supporting production-grade LLM and agentic systems.
Proficient in Python and frameworks like LangChain and LlamaIndex.
Hands-on experience operating containerized services (Docker, Kubernetes) using Git-based workflows (GitOps, GitHub Actions).
Solid understanding of modern ML stacks (platforms, feature stores, registries, messaging layers) or deep platform engineering background.
Demonstrated ability to own production infrastructure end-to-end, managing monitoring, incident response, rollbacks, and continuous reliability/uptime improvements.
Deep understanding of the model development lifecycle, specifically regarding model monitoring, regression tracking, and automated evaluation using tools like LangSmith.
Strong communication and collaboration skills under pressure, acting as a bridge between ML engineers, backend teams, and central platform/security specialists.

Other Qualifications Summary
Solid knowledge of data science principles and ML algorithms applied directly to LLMs, RAG, and autonomous decision-making agents.
Experience leading model-building processes, including advanced fine-tuning, alignment techniques, prompt engineering, and simulations of agent behaviors.
Strong understanding of software development principles coupled with demonstrated proficiency in System Design and Architectural Governance to deploy, scale, and maintain high-availability ML models.
Expertise in threat modeling and security for ML/agent systems to enforce strict behavioral guardrails.
Proven experience navigating highly regulated enterprise environments to ensure data auditability, clear ownership boundaries, and strict compliance.
Track record of strong technical decision quality and leadership, with the ability to coordinate cross-functional initiatives, translate complex business needs into resilient implementations, and mentor team members while bridging the gap between ML engineering and platform teams.

Workday Pay Transparency Statement (For EU Locations Only)
Listed below is the base salary range applicable to this position. Workday pay ranges (and the precise pay offered to the successful candidate) are based on a number of objective criteria such as relevant