Staff Software Engineer - AI Agent Evaluations

ID.me · Mountain View, CA
Full-time · On-site · 1w ago
CI/CD · Java · LangChain · Leadership · Observability · Python

About the role

This Staff Engineer role sits at the intersection of engineering, applied AI, testing, and developer experience. You will define and lead the discipline of testing AI agents, evaluating LLM behavior, and ensuring the reliability of agentic systems operating in production. The role requires deep engineering rigor, original thinking about what "correctness" means for non-deterministic systems, and the ability to build eval infrastructure and developer tooling that the entire engineering org depends on.

The ideal candidate is an expert in building and maintaining Retrieval-Augmented Generation (RAG) pipelines, with a deep focus on strategic data chunking and data quality enforcement, and has experience establishing pre-retrieval data quality gates to optimize vector search accuracy, minimize retrieval-induced noise, and significantly reduce LLM hallucination rates in production-deployed agent systems.

You will establish quality standards for how ID.me ships AI-powered features safely, mentor engineers across teams on AI testing best practices, and partner directly with product and platform teams to embed quality into every stage of agent development.

Responsibilities

  • Define AI Quality Standards: Own the framework for how ID.me evaluates, validates, and monitors AI agents - from prompt-based features to fully autonomous multi-step workflows.
  • Build Eval Infrastructure: Design and maintain evaluation pipelines for LLM outputs, agent behavior, tool use, and multi-turn interactions across development, staging, and production environments.
  • Production Observability for Agents: Instrument agentic systems to detect behavioral drift, regressions, and failure modes that traditional metrics miss, tracking latency, correctness, hallucination rate, tool misuse, and policy adherence.
  • Agentic Test Strategy: Lead the design of test suites that handle non-determinism - red-teaming agents, golden dataset construction, LLM-as-judge pipelines, and property-based testing for AI outputs.
  • Drive AI-First Engineering Culture: Raise the quality bar across the engineering org by establishing patterns, tooling, and education for how teams write, test, and deploy AI features responsibly.
  • Cross-Team Collaboration: Partner with Security, Platform, Product, and AI/ML teams to embed quality gates into agent development workflows.
  • Mentorship: Guide senior and mid-level engineers through evaluation design, observability strategy, and testing approaches specific to AI systems.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or equivalent experience
  • 8+ years building and operating production software systems
  • Demonstrated experience evaluating or testing LLM-powered features or autonomous agents in production
  • Proficiency with AI-assisted development tools (Claude Code, Cursor, or equivalent) - you build with AI every day
  • Strong backend engineering fundamentals in Python, Java, Go, or equivalent
  • Experience designing test infrastructure, CI/CD quality gates, or evaluation pipelines at scale
  • Experience improving developer experience - building internal tooling, reducing toil, or accelerating engineering workflows
  • Proven ability to lead cross-team technical initiatives and influence engineering standards
  • Strong written and verbal communication across engineering, product, and leadership
  • Experience building eval frameworks for LLM agents (e.g., correctness graders, LLM-as-judge, human-in-the-loop evals, benchmark dataset curation)
  • Familiarity with agentic frameworks (Claude API / Anthropic SDK, BrainTrust, LangChain, LangGraph, CrewAI, or similar)
  • Production monitoring experience for AI systems: behavioral drift detection, output sampling, shadow scoring
  • Experience with red-teaming or adversarial testing

Benefits

Health insurance

Additional Information

Company Overview

ID.me is the next-generation digital identity wallet that simplifies how individuals securely prove their identity online. Consumers can verify their identity with ID.me once and seamlessly log in across websites without having to create a new login and verify their identity again. Over 152 million users experience streamlined login and identity verification with ID.me at 20 federal agencies, 45 state government agencies, and 70+ healthcare organizations. More than 600 consumer brands use ID.me to verify communities and user segments to honor service and build more authentic relationships. ID.me's technology meets the federal standards for consumer authentication set by the Commerce Department and is approved as a NIST 800-63-3 IAL2 / AAL2 credential service provider by the Kantara Initiative. ID.me is committed to "No Identity Left Behind" to enable all people to have a secure digital identity. To learn more, visit https://network.id.me/.

