Applied AI Evaluation Scientist
About the role
We're looking for an Applied AI Evaluation Scientist - someone who sits at the intersection of data science, information retrieval, machine learning, and product thinking. This person will own the quality and trustworthiness of our AI/ML systems by designing, building, and running rigorous evaluation frameworks. The primary focus will be on our Agentic Retrieval-Augmented Generation (RAG) pipelines - optimizing how we chunk, embed, retrieve, rank, and generate - but the role extends to evaluating other AI/ML systems across the company. The ideal candidate has the judgment to know what's worth evaluating and what isn't, and the statistical grounding to make sure the evaluations they do run are sound, realistic, and actionable. Balancing resource capacity and velocity is key: knowing what to measure, and how to measure it, to drive improvements for our customers is paramount. You will work closely with Product and Engineering. Your code doesn't need to be production-hardened, but it must achieve its intended outcomes - think research-quality Python, clear notebooks, and reproducible experiments, not bulletproof microservices.
Responsibilities
- Agentic RAG Pipeline Evaluation & Optimization (Primary Focus)
- Design and curate evaluation datasets for retrieval quality - including synthetically generated query-answer-context pairs, adversarial test cases, and gold sets drawn from real user queries.
- Measure retrieval quality using metrics like Recall@k, Precision@k, MRR, and NDCG@k. Know when each metric matters and when it doesn't for a given use case.
- Evaluate and optimize chunking strategies - run grid searches over chunk size, overlap, and segmentation methods. Understand how chunking decisions cascade into retrieval and generation quality.
- Assess embedding and re-ranking strategies - benchmark embedding models, evaluate re-rankers, and measure the downstream impact on generation quality.
- Evaluate generation quality in context - measure faithfulness, relevance, hallucination rates, and omissions using a combination of code-based checks, LLM-as-judge, and targeted human review.
- Attribute failures across the pipeline - determine whether a bad answer is caused by poor data cleanliness/normalization, bad chunking, faulty retrieval, a generation error, or an interaction between components. Build diagnostic tooling to isolate root causes.
- Broader AI/ML Evaluation
- Conduct systematic error analysis on AI/ML system outputs - read traces, identify failure modes through open and axial coding, and build structured failure taxonomies.
- Design and validate LLM-as-Judge evaluators where appropriate - write judge prompts, split data into train/dev/test sets, iteratively refine, and measure TPR/TNR against human-labeled ground truth.
- Estimate true success rates using imperfect judges - apply bias-correction techniques (e.g., Rogan-Gladen) and bootstrap confidence intervals to provide statistically grounded performance estimates.
- Build and maintain golden datasets for CI regression testing of AI pipelines.
- Prioritize ruthlessly - assess which failure modes are worth investing evaluation effort into versus which can be fixed by clarifying a prompt or adjusting a tool description.
- Collaboration & Data Review
- Partner with Product to understand what "good" looks like for specific use cases and translate qualitative product requirements into measurable evaluation criteria.
- Partner with Engineering to instrument pipelines for observability, design trace logging, and integrate evaluation checks into CI/CD workflows.
- Design and build li
Benefits
Additional Information
Location: Remote (U.S.)
Team: AIML Quality - reporting into Engineering leadership
Level: Senior (IC)
About Jump
Jump's mission is to empower financial advisors, firms, and clients to thrive in the age of AI. We automate meeting prep, note-taking, compliance documentation, CRM updates, client recaps, and follow-up tasks - allowing advisors to process meetings in minutes, not hours. Since launching in January 2024, Jump has grown to 30,000+ users at firms ranging from solo practitioners to enterprise RIAs and independent broker-dealers, including partnerships with LPL Financial, Sanctuary Wealth, Osaic, and others. Jump is a Series A company, having raised $30M in venture capital from Battery Ventures (lead), Citi Ventures, Sorenson Capital, and Pelion Venture Partners. Our team of 100+ includes leaders from Google, Stripe, JP Morgan, Snowflake, Fidelity, BILL, Apple, Harvard, Stanford, and other top companies and schools.
Our team values: Velocity - World Class - Direct and Kind with No Drama