Architect our evaluation platform from first principles - the observability, scoring, golden datasets, verification agents, and CI/CD integration that define standards of quality. You'll work shoulder-to-shoulder with backend engineers, product managers, and a growing bench of subject matter experts, including practicing CFPs, CPAs, and tax planners, to translate fiduciary-grade requirements into automated quality signals.
Responsibilities
Design and build Hazel's evals platform end-to-end - online scoring, offline benchmarks, regression suites, LLM-as-judge pipelines, and human-in-the-loop review workflows across every Hazel surface.
Build production observability and monitoring for AI quality: hallucination rates, factual accuracy, refusal behavior, latency, cost, and domain-specific quality signals across tax planning, financial planning, investment analysis, and operational AI workflows.
Architect data curation pipelines that turn real advisor interactions into evaluation datasets - with rigorous sampling strategies, labeling protocols, dataset versioning, and the privacy and consent controls required for regulated finance.
Build and steward Hazel's golden datasets in close partnership with SMEs and a network of practicing advisors, CFPs, and tax professionals - translating their tacit expertise into precise, measurable eval criteria.
Develop LLM verification agents that catch hallucinations, computational errors, and compliance violations before they ever reach an advisor or client.
Integrate evals into our deployment pipeline so that every prompt change, model swap, harness modification, or RAG pipeline tweak runs against regression and acceptance criteria before shipping - making evals a first-class deployment gate, not a quarterly audit.
Partner with the team building Hazel's model-agnostic orchestration harness to evaluate cross-model and cross-provider performance, surface tradeoffs, and inform routing decisions across Anthropic, OpenAI, and self-hosted models.
Define quality SLOs for each Hazel surface and build alerting that catches regressions in production before our customers do - especially for high-stakes flows like tax and financial planning.
Establish Hazel's eval methodology as a defensible competitive advantage - infrastructure good enough that model upgrades from frontier labs become accelerants for us, not threats.
What you bring:
8+ years of engineering experience, with at least 2 years focused on evaluation infrastructure, model quality, fine-tuning, or ML platform work for production systems.
Deep familiarity with evaluation and scoring methodologies for modern AI systems - RAG evaluation, document processing, fine-tuned model assessment, agentic and tool-use system evaluation, LLM-as-judge frameworks, and human evaluation protocols.
Experience designing and curating golden datasets - sampling strategies, inter-rater agreement, dataset versioning, and managing the long tail of edge cases.
Comfort working across the stack - data engineering (SQL, dbt, warehouses), backend integration (APIs, async pipelines, queues), and observability tooling.
Strong communication skills. You can translate fuzzy domain requirements from advisors a
Benefits
Health insurance
Additional Information
About Altruist
Altruist is transforming the multi-trillion-dollar wealth management industry by building an AI platform for wealth professionals. We partner with financial advisors nationwide, empowering them to grow, optimize time and resources, and deliver superior outcomes for their clients.
We're looking for exceptional talent to help us achieve our mission of making financial advice better, more affordable, and accessible to all. If you're passionate about challenging the status quo and want to do the most important work of your life, we'd love to meet you!
But first, our values
Kindness - Kindness doesn't just equal niceness. We listen to understand. We embrace and encourage healthy debate and diverse perspectives. We approach conflict openly, honestly, and respectfully.
Brilliance - Humility is the skill we're most proud of, and possessing a growth mindset is always top of mind. We take ownership of everything we touch, regularly using our unique superpowers to reach a common goal as a team. We succeed and fail as one.
Grit - When challenges arise, we stay laser-focused on achieving our mission and finding a way forward, even when it's hard. We are nimble and maintain a sense of urgency, swiftly adapting to change and overcoming obstacles.
About Hazel:
Hazel.ai is building the AI engine for wealth management that unlocks 10x growth, efficiency and value for financial advisors and their clients in a regulated industry. Since its launch last September, Hazel has organically and rapidly grown its user base.
Hazel is a part of Altruist's broader mission to make financial advice better, more affordable, and accessible to all.
This role is hybrid, with four in-office days per week at our San Francisco FiDi location.