Site Reliability Engineer - AI Applications (M/W/X)

Ubisoft2 · Saint-Mandé, France
Full-time · On-site · Today
Caching · Capacity Planning · CI/CD · Docker · GitLab · GitLab CI

Responsibilities

  • Operate and improve the reliability, scalability, and performance of the AI platform, including the AI Gateway, MCP servers, LLM serving, retrieval services, and related platform components.
  • Define and maintain SLOs, SLIs, and error budgets to guide reliability decisions and balance delivery speed with platform stability.
  • Build strong observability across infrastructure, services, and AI-specific signals, such as latency, token throughput, model/provider errors, cost, saturation, and quality indicators.
  • Improve the performance, efficiency, and cost profile of model and API serving across cloud, containerized, and high-throughput environments.
  • Design graceful degradation, fallback, and failover strategies, including model, provider, and region-level resilience patterns.
  • Maintain safe, repeatable deployment practices using GitLab CI/CD, infrastructure as code, automated testing, and progressive delivery where appropriate.
  • Lead or contribute to incident response, post-incident reviews, capacity planning, and reliability improvements.
  • Partner closely with software engineers, data scientists, ML engineers, security teams, and product stakeholders to productionize AI services responsibly.
  • Help establish operational standards, runbooks, dashboards, and best practices for AI application reliability.

Requirements

  • Proven experience in SRE, platform engineering, backend engineering, or infrastructure roles in compute-intensive, data-intensive, or distributed environments.
  • Strong understanding of reliability engineering practices, including SLOs, error budgets, observability, capacity planning, incident management, and post-incident analysis.
  • Strong programming skills in Python and practical experience with Rust, or a willingness to work deeply with Rust-based services.
  • Hands-on experience with cloud platforms, Docker, Kubernetes, and production-grade service operations.
  • Experience building and maintaining CI/CD pipelines, preferably with GitLab, and working with infrastructure as code.
  • Solid understanding of distributed systems, microservices, APIs, networking fundamentals, and production debugging.
  • Ability to move quickly in a fast-changing AI ecosystem while applying the rigor, discipline, and risk awareness expected from an SRE.
  • Clear communication skills and the ability to collaborate across engineering, data, ML, security, and product teams.
  • Experience operating AI infrastructure, such as AI gateways, MCP servers, LLM serving stacks, model routers, or provider abstraction layers.
  • Familiarity with RAG and agentic architectures, including embeddings, vector databases, hybrid search, query processing, tool use, and agent orchestration.
  • Experience with GPU-backed inference, high-throughput serving, batching, caching, autoscaling, or model performance optimization.
  • Experience with serverless, event-driven systems, queues, streaming platforms, or asynchronous processing.
  • Familiarity with AI safety, governance, auditability, data privacy, or secure-by-design platform practices.
  • Open-source contributions, technical writing, talks, or publications in relevant engineering, reliability, or AI infrastructure fields.
You will work on projects that shape the future of AI at Ubisoft. You will stay close to emerging AI infrastructure and reliability practices, influence how we operate AI platforms at scale, and help teams deliver useful, safe, and dependable AI-powered applications.

Ubisoft's perks

  • 💰 Profit sharing and a yearly company savings plan; 25 days of paid time off plus 12 additional paid days off; 50% of your transportation pass paid by the company; lunch vouchers (9€/day); healthcare for you and your family; and many additional Ubisoft perks.
  • 👶 Maternity leave of 20 weeks; paternity/co-parental leave of 7 weeks.
  • 📍 Our office is located in Saint-Mandé (Metro line 1, Saint-Mandé station). Gym available in the building.

Information about Ubisoft

Ubisoft offers the same job opportunities to all, without any distinction of gender, ethnicity, religion, sexual orientation, social status, disability, or age. Ubisoft ensures the developme

Benefits

Health insurance · Parental leave

Additional Information

At Ubisoft, we are reshaping how teams work with information through AI. We build and operate applications for document-rich, interactive, and intelligent use cases, including hybrid search, question answering, document generation, conversational assistants, agentic workflows, and coding assistants.

You will join the team responsible for the reliability, scalability, and operational excellence of the AI platform powering these experiences. This includes the AI Gateway, MCP servers, LLM serving infrastructure, retrieval services, and the surrounding platform components that help teams bring AI applications to production safely.

This role sits at the intersection of fast-moving AI engineering and disciplined site reliability engineering. You will help the platform evolve quickly while ensuring it remains observable, resilient, cost-efficient, and dependable at scale.

