Infrastructure Site Reliability Engineer

Radiant · Gloucestershire

Full-time · Remote · 2mo ago

Ansible · Bash · DNS · Grafana · Kubernetes · Linux

Responsibilities

Deploy and operate resilient, scalable infrastructure supporting AI/HPC workloads
Optimize Linux system configuration, BIOS/firmware, kernel, and disk subsystem for performance
Configure, monitor, and manage bare-metal infrastructure using IPMI, Redfish, etc.
Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting during incident resolution, and provide tooling for our support organisation
Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement.
Maintain and enhance our observability stack: Prometheus, Grafana, and custom monitoring integrations
Operate and support services in 24x7 production environments, including on-call rotation
Contribute to incident postmortems and root cause analysis, document learnings, and automate remediations
Mentor junior engineers and act as an operational requirements consultant to other departments
Communicate technical decisions clearly to non-technical stakeholders and customers
Uphold a culture of: do, document, automate
Cross-train with Platform Engineering/Platform SRE to fully support both our infrastructure and platform stacks
Cross-train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering

What you bring:
5+ years of proven experience in globally scaled, performance-intensive environments operating to a 24/7 support model
Expert-level Linux administration, especially Ubuntu distributions
Proficiency in system tuning, disk I/O optimization, and hardware-level performance tweaks
Familiarity with Out of Band management tools (IPMI, Redfish, PXE, etc.)
Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching
Strong experience with infrastructure scripting and automation (Bash, Python, Ansible)
Deep understanding of observability principles and tools (Prometheus, Grafana)
Hands-on experience operating orchestration platforms (Kubernetes, MAAS, Tinkerbell)
Strong grasp of ITSM and service operation best practices
Excellent communication and mentorship skills
Comfortable interfacing with internal stakeholders and external customers
Bonus: Knowledge of HPC workloads and GPU-based infrastructure
Bonus: Experience with InfiniBand networks and HPC performance tuning

Requirements

Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent experience
LPIC certifications
ITIL Foundation level qualification or equivalent experience

How you work:
You approach problems with a systems mindset - balancing practical execution with long-term scalability
You elevate the team, setting high standards for technical quality and engineering excellence.
You hold yourself and others accountable - giving direct feedback and expecting the same
You take initiative, owning challenges end-to-end and proactively driving solutions.
You invest in others, mentoring to build both capability and confidence.
You communicate clearly - translating complexity into clarity across engineering and business audiences

Why should you join us?
What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive.
Here are just some of the great things you can expect from us:
25 days of annual leave
A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
Learning time: we all have dedicated learning time to focus on new skills, projects, or interests that lie outside our day-to-day jobs.
Health & Wellbeing: we want everyone to feel healthy and happy, so we o

Benefits

Health insurance · Vision insurance · Performance bonus

Additional Information

About Radiant

Radiant is redefining how AI infrastructure is built. We design and operate AI-native cloud platforms engineered for sovereignty, performance, and scale. Our infrastructure powers GPU-native workloads, multi-tenant control planes, and high-performance AI systems designed for the most demanding environments. We are not building a generic cloud. We are building purpose-built AI infrastructure - from powered land, to compute, to software. As we scale our platform and expand our engineering organisation, we are looking for leaders who can build strong teams, uphold high standards, and deliver reliably at pace.

Job Summary:

We're looking for an experienced Infrastructure Site Reliability Engineer to run and evolve our infrastructure stack. You'll contribute across bare-metal, virtualization, and orchestration layers, keeping things stable and secure 24/7 x 365 - all while mentoring teammates, improving process and automation, and helping translate deep technical concepts for a wide range of collaborators and customers.
