HPC Infrastructure Site Reliability Engineer

Radiant · Gloucestershire

Full-time · Remote · 1w ago

Incident Response · Linux · Machine Learning · Move · Observability · SAFe

About the role

We're a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable.

Role Overview

We are looking for a senior Infrastructure Site Reliability Engineer with deep experience operating large-scale distributed systems and recent hands-on expertise in high-performance computing (HPC) and AI infrastructure. This is an operations-first SRE role, working in a 24/7/365 on-call environment, responsible for ensuring reliability, performance, and continuous improvement of mission-critical infrastructure.

This role sits within a cross-functional organisation spanning network engineering, infrastructure SRE, platform SRE, infrastructure tooling engineers (software), and data centre operations. The ideal candidate has progressed through large-scale, globally distributed or multi-site infrastructure environments and has more recently specialised in GPU-accelerated HPC systems. This role provides exposure to the latest high-density AI compute platforms, including next-generation GPU infrastructure at significant scale.

You will bring strong breadth across bare metal, networking, storage, virtualisation, and orchestration, alongside deep HPC experience including NVIDIA GPU ecosystems, RDMA networking (RoCE and InfiniBand), and performance validation and benchmarking. Strong Linux and distributed systems expertise is essential.

Alongside operational ownership, this is a deeply technical Infrastructure SRE role centred on advanced operational troubleshooting and performance evaluation across large-scale HPC systems. You will investigate complex, cross-layer issues spanning GPU compute, networking, storage, and orchestration, building a clear understanding of system behaviour under real production AI and HPC workloads.

A key responsibility is performance evaluation, testing, and operational acceptance of new HPC environments, ensuring platforms meet defined reliability, scalability, and performance expectations before entering production. You will work across hardware, network, and software layers to validate readiness of high-density GPU infrastructure and support safe, predictable deployment at scale.

You will also play a central role in continuous service improvement (CSI): reducing operational toil, increasing automation, and improving reliability, consistency, and operational efficiency across the platform. This includes strengthening observability, refining operational workflows, and eliminating repetitive or failure-prone processes.

Over time, you will help shape future infrastructure design and deployment approaches, feeding operational insight back into infrastructure engineering decisions and ensuring production learnings directly influence next-generation HPC platform evolution.

Responsibilities

Operate and improve high-density AI/HPC infrastructure in a 24/7 production environment
Participate in a 24x7x365 on-call rotation, supporting mission-critical systems and incident response
Troubleshoot complex issues across compute, networking, storage, and orchestration layers in GPU-accelerated environments
Lead performance evaluation, testing, and operational acceptance of new HPC infrastructure before production release
Drive continuous service improvement (CSI), reducing toil through automation, tooling, and process refinement
Build and maintain infrastructure automation and tooling (IaC and scripting) to improve reliabi

Benefits

We move quickly, solve meaningful challenges, and give people the space to make an impact. If you thrive in fast-paced environments, enjoy working with advanced technology, and want to help shape the future of high-performance compute, you'll find both challenge and opportunity here.

You can also expect:

Exposure to industry-leading GPU and AI infrastructure
Opportunities to grow alongside a rapidly scaling global business
A collaborative, inclusive, and supportive engineering culture
Real ownership and the ability to influence operational excellence
Work that sits at the intersection of people, performance, and technology
A modern, flexible, globally connected workplace with ambitious goals
Flexible schedule
