Define reference architectures for GenAI apps, RAG systems, and agent ecosystems (single/multi-agent) on GCP using ADK.
Establish domain and platform standards: model selection, RAG/generation patterns, memory architectures, security baselines, observability, and LLMOps.
Lead portfolio-wide technical decisions (build/buy, vendor selection, SLAs, quotas) with a focus on reliability, safety, and cost control.
Solution Design & Delivery
Architect and lead implementation of production-grade GenAI solutions (Vertex AI models, Grounding, Pipelines, Evaluation) and agentic services (planning, tools, memory, human-in-the-loop).
Design multi-tenant and hub-and-spoke patterns with Okta/IAP/Apigee for secure API exposure and tenant isolation.
Drive end-to-end delivery across teams: data ingestion (Dataflow/Composer), indexing (BigQuery vectors/Vertex Vector Search), services (Cloud Run/Workflows), events (Pub/Sub).
Platformization & Reuse
Build and maintain prompt libraries, tool catalogs, agent templates, and evaluation harnesses for organization-wide reuse.
Standardize LLMOps: CI/CD for prompts/models/agents, model registry, traceability, rollback, canaries, cost/performance scorecards.
Enable a marketplace of agents/services with productized APIs, documentation, chargeback, and KPIs.
LLMOps/MLOps: Vertex AI Pipelines, registry, CI/CD, trace correlation, cost/performance monitoring.
Security & Compliance: IAM, Secret Manager, VPC-SC, Private Service Connect, DLP, Okta/IAP, Apigee API policies.
Observability & Cost: Central telemetry, user feedback loops, drift/outlier detection, quota/capacity planning.
Requirements
12-15+ years in software/data/ML engineering; 2+ years hands-on with LLMs/GenAI and agentic systems.
Proven delivery of enterprise-scale GenAI/agent platforms on GCP (Vertex AI, BigQuery, Cloud Run, Pub/Sub, Workflows).
Demonstrated impact in platformization, governance, and multi-team technical leadership.
Strong proficiency in Python/TypeScript (or equivalent) and infrastructure-as-code (Terraform/GCP Deployment Manager).
Experience in security-by-design, privacy, and compliance audits.
Outcomes & KPIs (What "Great" Looks Like)
Reliability: SLOs met (e.g., p95 latency, error budget adherence); audited HA/DR playbooks; zero Sev1 incidents due to preventable guardrail gaps.
Quality & Safety: Sustained improvements on faithfulness/toxicity/grounding scores; red-team findings resolved within agreed SLAs.
Cost & Performance: ≥ 30% reduction in run-cost via routing, caching, and prompt/template optimization; budget adherence per tenant.
Productivity & Reuse: ≥ 50% reuse of tools/templates across teams; time-to-market reduced by ~40% for new AI features.
Benefits
Paid time off
Additional Information
Powering the agentic revolution in travel. Sabre is an AI-native technology leader, backed by one of the world's largest travel data clouds. Built on an open, modular, cloud-native architecture, Sabre serves as the backbone for both established leaders and bold, new disruptors, guiding them to the next age of travel retailing through intelligent, connected, and personalized experiences. With AI at its core and operating at unparalleled scale, Sabre transforms insights into innovation, empowering airlines, hoteliers, agencies and other partners to retail, distribute and fulfill travel worldwide.
The Principal AI/ML Engineer is the technical leader responsible for designing, building, and scaling AI systems that combine LLM-powered GenAI and ADK-based agentic workflows on Google Cloud Platform. This role sets architecture standards, leads multi-team delivery, builds and manages the platform, and governs safety, reliability, and cost at enterprise scale, accelerating product teams toward 10× productivity through reusable patterns, platforms, and guardrails.