Research Scientist, LLM Evaluation & Post-Training

Centific · Remote
Full-time · Remote · 2w ago
Cross-functional Collaboration · Documentation · Leadership · LLMs · RAG · SAFe

About the role

Company: Centific
Location: Palo Alto, CA or Seattle, WA (Hybrid/Remote)
Type: Full-time

Role Overview

As a Research Scientist, LLM Evaluation & Post-Training, you will be at the frontier of how evaluation design, measurement strategy, and feedback signals drive model improvement across Centific's AI platform products. This is a high-impact individual contributor and collaborative research role that sits at the intersection of applied ML research, enterprise AI product development, and customer-facing scientific consulting. You will lead research programs that define next-generation evaluation-driven post-training workflows, develop rigorous benchmark frameworks, and partner directly with leading AI organizations to deliver credible, actionable model improvement insights. This role offers the opportunity to shape Centific's internal research agenda, build reusable scientific assets, and publish at top-tier venues.

Responsibilities

  • Research Agenda & Experimentation: Define and execute a rigorous research agenda focused on LLM evaluation and post-training, with emphasis on evaluation-driven model improvement. Design experiments to study how evaluation methodologies impact fine-tuning and post-training outcomes.
  • Evaluation Framework Development: Develop and validate comprehensive evaluation frameworks for LLM and multimodal systems, covering benchmark and task design, scoring methods, judge/model-assisted evaluation, human evaluation protocols, and robustness/stress testing.
  • Advanced Evaluation Research: Lead research on frontier evaluation domains including long-context, cross-modal, and dynamic multi-turn evaluations. Study effectiveness and limitations of existing techniques and propose improved methodologies with clear validity and scalability tradeoffs.
  • Model Behavior Analysis: Analyze model behavior and failure patterns; generate actionable recommendations for model improvement and evaluation redesign. Translate findings into practical improvements for customer solutions and Centific's internal platforms.
  • Cross-Functional Collaboration: Partner with Language Data Scientists to integrate human-in-the-loop and synthetic data/evaluation strategies, and with AI/ML Research Engineers to translate research methods into scalable evaluation and post-training pipelines.
  • Customer Engagement: Engage with customer technical stakeholders at leading AI organizations to understand evaluation goals, review methodologies, and provide expert scientific recommendations. Serve as a credible technical peer to research and engineering leaders.
  • Knowledge & IP Creation: Contribute to internal benchmark datasets, reusable evaluation frameworks, and research assets. Produce high-quality technical documentation, internal research reports, and client-facing materials explaining methods, results, assumptions, and limitations.
  • Thought Leadership: Contribute to Centific's position as a leader in LLM evaluation and post-training through publications, conference presentations, and open-source contributions.

Core Technical Competencies

You will provide technical depth and leadership across the following domains:

Evaluation Science & Benchmarking

  • Expert-level benchmark dataset and test suite design for language and multimodal models
  • Deep understanding of metric design, scoring reliability, and measurement validity
  • Experience with human evaluation methods and quality assurance (rubric design, inter-rater reliability, adjudication frameworks)

LLM & Post-Training Methods

  • Strong understanding of post-training techniques (SFT, RLHF, RLAIF, DPO, PPO, GRPO) and how training objectives interact with evaluation outcomes
  • Ability to reason about model behavior, failure modes, and performance tradeoffs across tasks and domains

Benefits

Remote work options

Additional Information

About Centific

Centific is a frontier AI data foundry that curates diverse, high-quality data, using our purpose-built technology platforms to empower the Magnificent Seven and our enterprise clients with safe, scalable AI deployment. Our team includes more than 150 PhDs and data scientists, along with more than 4,000 AI practitioners and engineers. We harness the power of an integrated solution ecosystem, comprising industry-leading partnerships and 1.8 million vertical domain experts in more than 230 markets, to create contextual, multilingual, pre-trained datasets; fine-tuned, industry-specific LLMs; and RAG pipelines supported by vector databases. Our zero-distance innovation™ solutions for GenAI can reduce GenAI costs by up to 80% and bring solutions to market 50% faster.

Our mission is to bridge the gap between AI creators and industry leaders by bringing best practices in GenAI to unicorn innovators and enterprise customers. We aim to help these organizations unlock significant business value by deploying GenAI at scale, helping to ensure they stay at the forefront of technological advancement and maintain a competitive edge in their respective markets.

