Director, Evaluations

LawZero · Montreal, Canada
Full-time · On-site · 4w ago
Leadership · LLMs · Machine Learning · SAFe

Responsibilities

  • Define LawZero's evaluations strategy and roadmap, prioritising what needs to be measured and when, in close coordination with both research and product teams.
  • Build up the Evaluations Team during your first 3-6 months, scaling to roughly 8-10 people across research, engineering, dataset and benchmark design, and red‑teaming.
  • Operate the team independently of the main research and product streams in order to avoid conflicts of interest, including designing novel benchmarks that can be applied apples‑to‑apples to evaluate both the Scientist AI and frontier LLMs.
  • Oversee the design and construction of new datasets, tasks, and virtual or interactive environments to measure the performance of the Scientist AI across capabilities, safety (including honesty and goal-directedness), explainability, causal mechanisms, and detection of adversarial attacks.
  • Lead evaluation of the Scientist AI when deployed as a guardrail around frontier models, including its ability to comply with harm specifications, detect and block harmful responses, explain its decisions, and resist adversarial attacks such as jailbreaks, prompt injection, and data poisoning.
  • Establish and lead our automated and manual red‑teaming programmes, both in‑house and in partnership with external providers, to stress test the Scientist AI as a general‑purpose model and as a guardrail.
  • Lead the construction of internal tooling and infrastructure needed to run evaluations at scale, automating and standardizing the pipeline wherever possible.
  • As needed and where possible, directly support the research and product streams with their own internal evaluation and benchmarking requirements, to unblock them and speed up their work.
  • Own LawZero's public communication of evaluation results, including model and system cards, technical reports, peer‑reviewed publications and blog posts, to build trust with the wider AI safety community.
  • Represent LawZero externally on evaluations and AI safety measurement, including engagements with AI safety institutes, research collaborators, and grant funders.

Requirements

  • An advanced degree (MSc or higher) in machine learning, computer science, or a closely related field.
  • 10+ years of experience in machine learning, with at least 5 years in a leadership role building or scaling technical teams working on real-world ML products.
  • Hands‑on expertise in designing and running large‑scale evaluations of LLMs or other frontier ML systems across capabilities, safety, and adversarial robustness.
  • A track record of building evaluation datasets, benchmarks, or interactive environments from scratch, including for safety‑relevant properties such as honesty, sycophancy, refusal behaviour, and adversarial robustness.
  • Strong written and verbal communication skills, including the ability to translate technical results for non‑technical audiences such as executives, funders, and policymakers.
  • Comfortable operating in a research‑driven, fast‑moving environment with significant ambiguity, and able to bring structure to it without slowing it down.
  • Experience leading red‑teaming exercises (automated, manual, or both) and working with third‑party evaluation or red‑teaming partners is a bonus.
  • Experience working with third‑party partners for benchmark and dataset creation.
  • Experience releasing open‑source datasets, benchmarks, or evaluation tooling is a bonus.
  • Familiarity with current AI safety policy and standards work (UK AISI, CAISI, NIST, EU AI Act, etc.) is a bonus.
  • Experience contributing to or coordinating with external safety institutes, grant funders, or government bodies is a bonus.

Benefits

  • The opportunity to contribute to a unique mission with a major impact
  • Comprehensive health benefits
  • A minimum of 20 days vacation per year upon start
  • A minimum retirement savings employer contribution of 4%
  • Generous flexible benefits designed to contribute to your well-being
  • A team of passionate experts in their field
  • A collaborative and inclusive work environment with offices in the heart of Little Italy, in the trendy Mile-Ex district, close to public transportation

Health insurance · Paid time off · Flexible schedule · Performance bonus

Additional Information

LawZero is a non-profit building safe-by-design AI systems. We're building the Scientist AI, an advanced AI system designed from the ground up to be both highly capable and safe. As we develop both general‑purpose Scientist AI models and safety guardrails for frontier LLMs, we need rigorous, independent evaluation of every capability and safety claim we make. We are looking for a Director of Evaluations to build, lead, and grow LawZero's Evaluations Team. This is a foundational hire. You will define what world‑class evaluation looks like at LawZero, build the team and infrastructure to deliver it, and ensure that evaluations remain independent of the main research stream so that capability and safety claims can be trusted both internally and externally by the wider AI and AI safety community.


Interested in this role?

Apply on the company's website.
