Director, Reinforcement Learning & Agentic Post-Training

Blue Yonder · Paris, France

Full-time · On-site · Today

LLMs · Machine Learning · Mentoring · Observability · Python · PyTorch

Responsibilities

Lead the technical strategy for reinforcement learning, post-training, and tool-using LLM agents within the AI Studio.
Build and manage a team of machine learning engineers working on agent training, RL environments, reward modeling, evaluation, data generation, and training infrastructure.
Design environments where LLM agents learn to operate Blue Yonder software through APIs, tools, workflows, simulations, and human feedback.
Develop training and evaluation systems for multi-step supply chain workflows across planning, warehouse management, transportation, commerce, and network operations.
Define what "good" looks like for operational agents: correct tool use, constraint adherence, business outcome quality, latency, cost, robustness, escalation behavior, and human trust.
Build reward models, verifiers, preference pipelines, automated graders, and evaluation harnesses for agent behavior.
Create evaluation frameworks that measure real agent performance, including tool-call correctness, workflow completion, recovery from bad state, long-horizon reliability, and failure modes.
Partner with product, engineering, architecture, and domain experts to turn real supply chain workflows into trainable agent environments.
Guide model improvement across supervised fine-tuning, preference optimization, reinforcement learning from human or AI feedback, rejection sampling, synthetic data generation, and policy optimisation.
Make practical technical tradeoffs between model capability, inference cost, latency, reliability, product timelines, and operational safety.
Establish engineering standards for experiment tracking, reproducibility, observability, rollout safety, and production monitoring.
Document what works and what fails so the team compounds learning over time.

Requirements

We want to talk if you:
Have led a team that shipped LLMs into production after post-training them with reinforcement learning, SFT, DPO, RLHF/RLAIF, or similar methods.
Have led a team to train models to use tools, call APIs, interact with software environments, or complete multi-step tasks.
Have a strong machine learning engineering background and can credibly lead engineers because you have built systems like this yourself.
Have managed or technically led high-performing ML engineering teams focused on reinforcement learning.
Are highly proficient in Python and PyTorch.
Understand modern LLM post-training workflows, including supervised fine-tuning, preference data, reward modeling, policy optimisation, evaluation, and deployment.
Have hands-on experience with reinforcement learning methods such as reward shaping, PPO-style optimisation, GRPO, offline RL, policy evaluation, rejection sampling, or environment design.
Know how to evaluate open-ended agent behaviour beyond static benchmark scores.
Can reason about production constraints: latency, inference cost, safety, observability, rollback, and reliability.
Can balance frontier-orient

Additional Information

About The AI Studio

The AI Studio's mission is to find the fastest possible path to an autonomous supply chain. We build AI agents, learning systems, model training pipelines, evaluations, simulations, and decision-making systems for some of the hardest problems in global supply chain. The work spans LLMs, reinforcement learning, agentic workflows, software automation, optimization, and production engineering. In short, we are having a lot of fun.

Your Mission

We are looking for a deeply technical Director of Reinforcement Learning & Agentic Post-Training to lead how Blue Yonder trains LLM-based agents to operate supply chain software. This role sits at the center of our Model Training Factory, built with NVIDIA, where we develop specialized AI agents for the autonomous supply chain. These agents must reason over supply chain state, use tools, interact with Blue Yonder workflows, execute multi-step operational tasks, and improve through feedback, evaluation, and reinforcement learning.

Tool use is not a side feature here. Our agents must learn to work inside real enterprise software: querying state, proposing actions, invoking APIs, respecting constraints, handling exceptions, escalating uncertainty, and collaborating with human operators. The challenge is not simply making a model sound knowledgeable about supply chain. The challenge is training models that can reliably act.

We are looking for someone who has personally gone through the hard parts: post-training LLMs, designing tool-use environments, building reward models or verifiers, creating evaluations that catch real failures, shipping reinforced models into production, and leading strong machine learning engineers through that process. This is not a pure research management role, and it is not a project management role. You should be comfortable setting strategy, writing and reviewing technical designs, mentoring senior engineers, challenging weak assumptions, and staying close enough to the work to know whether the system is actually learning.

