Distinguished, Data Scientist - Quality & LLM Judging Systems in Conversational Commerce

Walmart · San Jose, CA
Full-time · On-site · 10mo ago
Classification · LLMs · Machine Learning · NLP · Prompt Engineering

About the role

Walmart's Next Gen Commerce team is shaping the future of conversational shopping by building intelligent agents that not only respond, but reason, recommend, and proactively assist customers. As a Distinguished Data Scientist for Quality & LLM Judging Systems in Conversational Commerce, you will serve as the key IC partner to the Director of Data Science for this space. You will lead the technical vision and model development for cutting-edge evaluation methodologies to measure and improve the quality of AI-powered conversations and tool outputs. You'll help define how we evaluate our agents and their dependent tools using a combination of human-labeled benchmarks, LLM-as-a-judge systems, and scalable automated pipelines. You'll design prompts, validate agreement with human judgment, and develop LLM distillation strategies to replicate high-quality judgment cost-effectively. This is a high-impact, hands-on technical role requiring deep expertise in LLM prompting, evaluation frameworks, and structured experimentation. You will work closely with modeling, product, and platform teams to ensure that measurement drives improvement, and that the agent's behaviors align with quality, safety, and relevance at every step.

Responsibilities

  • Design evaluation pipelines for conversational agents and their tool outputs using LLM-as-a-judge, human annotation, and hybrid methods
  • Develop high-quality prompts for structured evaluation tasks and iterate based on inter-rater reliability with human judges
  • Develop novel techniques to assess non-textual or subjective outputs (such as recommendations, summaries, and agent-driven actions) where standard metrics fall short
  • Guide the modeling team to distill or fine-tune smaller LLMs to act as scalable evaluation proxies
  • Work with engineering partners to integrate evaluation hooks into model training, validation, and production workflows
  • Conduct in-depth failure mode analysis and define actionable quality signals that inform model and production iteration
  • Uphold statistical rigor in metric design, validation, and experimental analysis to ensure reliable and interpretable results
  • Foster a culture of principled measurement and trustworthy AI throughout the organization

Requirements

  • 7+ years of experience in data science or machine learning, preferably in evaluation, NLP, or conversational AI
  • Hands-on experience with large language models, including prompt engineering, response grading, and structured generation tasks
  • Familiarity with both human annotation workflows and automated evaluation strategies using LLMs
  • Deep understanding of metric design, evaluation reliability, and statistical validity
  • Strong software engineering fundamentals and ability to own end-to-end pipelines
  • Excellent communication skills and the ability to influence without authority across functions
  • Graduate degree (M.S./Ph.D.) in Computer Science, Machine Learning, NLP, or a related field
  • Experience with conversational AI, summarization, retrieval-augmented generation, or recommendation evaluation
  • Knowledge of model distillation, LoRA, instruction tuning, or parameter-efficient adaptation techniques
  • Familiarity with evaluating open-ended outputs where ground truth is subjective or contextual
  • Publications, patents, or open-source contributions in LLM evaluation or applied AI

Why Join Us?

This is a rare opportunity to shape the science behind how intelligent agents are judged, literally. Your work will directly define what "quality" means in conversational commerce and enable AI systems that are not only functional but truly helpful, engaging, and aligned with human expectations.

You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws, where applicable.

For information about PTO, see https://one.walmart.com/notices.

Eligibility require
