Infrastructure Tooling & Observability Engineer (UK)
About the role
We're a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable.
Role Overview
We are seeking an Infrastructure Tooling & Observability Engineer to act as a key engineering force within our global Infrastructure Operations organisation. Working closely with our SRE teams, you will translate high-level reliability objectives into scalable, production-ready systems that directly improve the resilience, efficiency, and performance of our global infrastructure.
This role goes beyond traditional monitoring. You will help design and build the internal control plane that enables operations at scale across a rapidly growing GPU fleet. Your work will focus on transforming complex, high-volume telemetry (spanning logs, metrics, and events across HPC, networking, and platform layers) into actionable insight that drives operational excellence and proactive reliability.
A core part of your responsibility will be developing intelligent observability and automation systems, including advanced alerting strategies, anomaly detection, and AI-driven tooling that reduces L1/L2 escalations and removes operational toil. You will also contribute to Continual Service Improvement (CSI) initiatives by building frameworks for reliability measurement, automated remediation, and system health evaluation.
In addition, you will play a central role in turning SRE reliability initiatives into scalable engineering solutions. This includes designing and delivering capabilities such as inventory management systems, performance testing frameworks, and automated performance result collection. You will also help eliminate manual workflows involved in onboarding new regions, facilities, and clusters, embedding automation and standardisation into every stage of infrastructure deployment.
As the organisation scales, you will act as a critical interface between operations and engineering teams. You will evaluate and mature internally built tooling, from capacity planning systems to autonomous remediation pipelines, and help integrate these capabilities into core infrastructure platforms to ensure consistent, high-performance, and highly reliable global operations.
What's in it for you?
Join a team building the internal platforms that enable large-scale infrastructure to operate reliably, efficiently, and at speed. As an Infrastructure Tooling & Observability Engineer, you will design and develop the systems that power visibility, automation, and operational intelligence across complex distributed environments.
This role goes beyond traditional monitoring. You will build the internal control plane that transforms high-volume telemetry (logs, metrics, and events) into actionable insight for engineering and operations teams. Your work will improve observability across infrastructure systems, strengthen signal quality, and help teams understand and respond to system behaviour in real time.
Working closely with SRE and infrastructure engineering teams, you will translate reliability goals into scalable, production-grade tooling. This includes frameworks for observability, alerting, anomaly detection, capacity planning, and service health tracking.
A key focus of the role is automation. You will help eliminate manual processes across infrastructure operations, including environment provisioning, cluster onboarding, inventory management, and recurring operational workflows. You will also contribute to performance engineering initiatives, building tooling for testing, benchmarking, and automated results collection at scale.
You will play a central role in turning SRE reliability initiatives into reusable engineering solutions, including automated remediation systems and tooling that reduces operational toil while improving system resilience.
You can also expect:
- Exposure to large-scale distributed infrastructure systems
- Opportunities to shape foundational internal platforms
- A collaborative, engineering-led culture with strong ownership
- High-impact work spanning observability, automation, and reliability
- Close partnership with SRE and infrastructure engineering teams
- A fast-moving environment where tooling directly improves operational performance
Responsibilities
- Design, build, and evolve internal tooling and observability platforms that support large-scale infrastructure operations across distributed environments.
- Develop systems that turn high-volume telemetry (logs, metrics, events) into actionable insight, improving visibility, alerting quality, and operational decision-making.
- Translate SRE reliability requirements into scalable, production-ready software solutions, including automation for incident detection, prevention, and remediation.
- Drive automation across infrastructure operations, reducing manual effort in a
Benefits