Distinguished Data Scientist - Quality & LLM Judging Systems in Conversational Commerce

Walmart · San Jose, CA

Full-time · On-site · Posted 10mo ago

Classification · LLMs · Machine Learning · NLP · Prompt Engineering

About the role

Walmart's Next Gen Commerce team is shaping the future of conversational shopping by building intelligent agents that not only respond, but reason, recommend, and proactively assist customers. As a Distinguished Data Scientist for Quality & LLM Judging Systems in Conversational Commerce, you will serve as the key IC partner to the Director of Data Science for this space. You will lead the technical vision and model development for cutting-edge evaluation methodologies to measure and improve the quality of AI-powered conversations and tool outputs. You'll help define how we evaluate our agents and their dependent tools using a combination of human-labeled benchmarks, LLM-as-a-judge systems, and scalable automated pipelines. You'll design prompts, validate agreement with human judgment, and develop LLM distillation strategies to replicate high-quality judgment cost-effectively.

This is a high-impact, hands-on technical role requiring deep expertise in LLM prompting, evaluation frameworks, and structured experimentation. You will work closely with modeling, product, and platform teams to ensure that measurement drives improvement, and that the agent's behaviors align with quality, safety, and relevance at every step.

Responsibilities

Design evaluation pipelines for conversational agents and their tool outputs using LLM-as-a-judge, human annotation, and hybrid methods
Develop high-quality prompts for structured evaluation tasks and iterate based on inter-rater reliability with human judges
Develop novel techniques to assess non-textual or subjective outputs (such as recommendations, summaries, and agent-driven actions) where standard metrics fall short
Guide the modeling team to distill or fine-tune smaller LLMs to act as scalable evaluation proxies
Work with engineering partners to integrate evaluation hooks into model training, validation, and production workflows
Conduct in-depth failure mode analysis and define actionable quality signals that inform model and production iteration
Uphold statistical rigor in metric design, validation, and experimental analysis to ensure reliable and interpretable results
Foster a culture of principled measurement and trustworthy AI throughout the organization

Requirements

7+ years of experience in data science or machine learning, preferably in evaluation, NLP, or conversational AI
Hands-on experience with large language models, including prompt engineering, response grading, and structured generation tasks
Familiarity with both human annotation workflows and automated evaluation strategies using LLMs
Deep understanding of metric design, evaluation reliability, and statistical validity
Strong software engineering fundamentals and ability to own end-to-end pipelines
Excellent communication skills and the ability to influence without authority across functions
Graduate degree (M.S./Ph.D.) in Computer Science, Machine Learning, NLP, or a related field
Experience with conversational AI, summarization, retrieval-augmented generation, or recommendation evaluation
Knowledge of model distillation, LoRA, instruction tuning, or parameter-efficient adaptation techniques
Familiarity with evaluating open-ended outputs where ground truth is subjective or contextual
Publications, patents, or open-source contributions in LLM evaluation or applied AI

Why Join Us?

This is a rare opportunity to shape the science behind how intelligent agents are judged, literally. Your work will directly define what "quality" means in conversational commerce and enable AI systems that are not only functional but truly helpful, engaging, and aligned with human expectations.
You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws, where applicable.
For information about PTO, see https://one.walmart.com/notices.
Eligibility require
