Additional Information
We're looking for a Senior DevOps Engineer to join our Applied AI practice and work at the intersection of platform engineering and AI delivery. This is a hands-on role where you'll lead the optimisation and evolution of our cloud infrastructure, deployment pipelines, and operational practices to ensure we consistently deliver high-quality outcomes for our clients.
This isn't a standard DevOps role. You'll be building and operating the infrastructure that production AI systems actually run on (agentic pipelines, LLM integrations, and retrieval systems) in enterprise environments across financial services, government, insurance, and retail. That means bringing the same rigour you'd apply to any critical system, and then going further: LLMOps, inference cost engineering, evaluation harnesses, and resilience patterns purpose-built for non-deterministic APIs.
You'll work closely with AI Engineers, delivery teams, and client stakeholders to uplift platform capability, improve delivery velocity, and embed quality through automation, observability, and strong engineering standards.
Core DevOps
Architect, build, and continuously enhance CI/CD pipelines to automate and accelerate software delivery across the team.
Lead the management and optimisation of cloud infrastructure (AWS), ensuring scalability, security, and reliability while championing best practices.
Design, implement, and maintain Infrastructure as Code (IaC) with tools such as Terraform and CloudFormation, enabling the team to deploy with confidence and agility.
Proactively monitor, troubleshoot, and enhance system performance, availability, and security, ensuring operational excellence across client environments.
Drive the adoption of containerisation and orchestration technologies like Docker and Kubernetes to enable scalable, high-performance solutions.
Improve system observability by implementing advanced logging, monitoring, and alerting with tools such as Prometheus, Grafana, Datadog, CloudWatch, and the ELK Stack.
Lead the implementation of security best practices, including IAM, secrets management, and vulnerability assessments.
Collaborate closely with developers to continuously optimise build, deployment, and scaling strategies for seamless integration and continuous delivery.
Automate key operational tasks and apply SRE principles to enhance system reliability, uptime, and overall performance.
Take ownership of incident response and lead root cause analysis for production issues, ensuring swift resolution and ongoing improvement.
AI-Specific Responsibilities
Practise LLMOps: implement prompt versioning, model evaluation pipelines, and controlled promotion gates before anything reaches production.
Instrument beyond standard metrics: design observability for token costs, inference latency, retrieval quality, and model drift detection.
Build agentic resilience: implement rate limiting, circuit breakers, and graceful fallbacks for non-deterministic LLM APIs.
Own inference cost engineering: design throughput management, caching strategy, and cost-per-query alerting to keep AI systems economically viable at scale.
Design AI-native CI/CD pipelines with evaluation harnesses and golden dataset regression tests baked in before any model or prompt change reaches production.
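To make "agentic resilience" a little more concrete, here is a minimal Python sketch of a circuit breaker with a graceful fallback around an LLM call. It is an illustration only, not our production implementation: the `CircuitBreaker` class, its `max_failures` and `reset_after` parameters, and the `primary` and `fallback` callables are all hypothetical names chosen for this example.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors, then retries
    the primary call once `reset_after` seconds have elapsed."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """True while calls should skip the primary and go to the fallback."""
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, let a trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_after

    def call(
        self,
        primary: Callable[[str], str],
        fallback: Callable[[str], str],
        prompt: str,
    ) -> str:
        if self.is_open():
            return fallback(prompt)
        try:
            result = primary(prompt)
        except Exception:
            # Count the failure and (re)open the circuit at the threshold.
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback(prompt)
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In practice `primary` might wrap a hosted model endpoint and `fallback` might return a cached answer or route to a smaller model. A real system would likely also distinguish retryable errors (timeouts, rate limits) from permanent ones rather than catching every exception.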
Qualifications
5+ years of hands-on experience in DevOps, SRE, or Cloud Engineering.
Extensive expertise in the AWS cloud platform and its services.
Practical experience with Kubernetes and containerisation technologies.
Strong scripting and automation skills with Bash, Python, or Go.
In-depth knowledge of CI/CD tools including Jenkins, GitHub Actions, GitLab CI/CD, and ArgoCD.
Solid experience with Infrastructure as Code tools including Terraform and CloudFormation.
Comprehensive understanding of Linux administration and networking fundamentals.
Experience implementing security best practices including IAM, SSL/TLS, and compliance frameworks such as SOC 2, ISO 27001, and GDPR.
Proficiency in monitoring and logging tools such as the ELK Stack, Prometheus, Grafana, or Datadog.
Exceptional problem-solving skills and the ability to operate in a fast-moving, ambiguous environment.
Strong communication and collaboration skills to work effectively across cross-functional teams, including client stakeholders.