Infrastructure Site Reliability Engineer
Responsibilities
- Deploy and operate resilient, scalable infrastructure supporting AI/HPC workloads
- Optimize Linux system configuration, BIOS/firmware, kernel, and disk subsystem for performance
- Configure, monitor and manage bare-metal infrastructure using IPMI, Redfish, etc.
- Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting during incident resolution, and provide tooling for our support organisation
- Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement.
- Maintain and enhance our observability stack: Prometheus, Grafana, and custom monitoring integrations
- Operate and support services in 24x7 production environments, including on-call rotation
- Contribute to incident postmortems and root cause analysis, document learnings, and automate remediations
- Mentor junior engineers and act as an Operational requirements consultant to other departments
- Communicate technical decisions clearly to non-technical stakeholders and customers
- Uphold a culture of: do, document, automate
- Willingness to cross-train with Platform Engineering/Platform SRE to fully support both our infrastructure and platform stacks
- Willingness to cross-train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering
What you bring:
- 5+ years of proven experience in globally scaled, performance-intensive environments operating to a 24/7 support model
- Expert-level Linux administration, especially Ubuntu distributions
- Proficiency in system tuning, disk I/O optimization, and hardware-level performance tweaks
- Familiarity with out-of-band management and provisioning tools (IPMI, Redfish, PXE, etc.)
- Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching
- Strong experience with infrastructure scripting and automation (Bash, Python, Ansible)
- Deep understanding of observability principles and tools (Prometheus, Grafana)
- Hands-on experience operating orchestration platforms (Kubernetes, MAAS, Tinkerbell)
- Strong grasp of ITSM and service operation best practices
- Excellent communication and mentorship skills
- Comfortable interfacing with internal stakeholders and external customers
- Bonus: Knowledge of HPC workloads and GPU-based infrastructure
- Bonus: Experience with InfiniBand networks and HPC performance tuning
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering or a related field, or equivalent experience
- LPIC Certifications
- ITIL Foundation level qualification or equivalent experience
How you work:
- You approach problems with a systems mindset - balancing practical execution with long-term scalability
- You elevate the team, setting high standards for technical quality and engineering excellence.
- You hold yourself and others accountable - giving direct feedback and expecting the same
- You take initiative, owning challenges end-to-end and proactively driving solutions.
- You invest in others, mentoring to build both capability and confidence.
- You communicate clearly - translating complexity into clarity across engineering and business audiences
Why should you join us?
What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive.
Here are just some of the great things you can expect from us:
- 25 days of annual leave
- A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
- Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
- Learning time: we all have dedicated learning time to focus on new skills, projects or interests that lie outside of our day-to-day jobs.
- Health & Wellbeing: we want everyone to feel healthy and happy, so we o
Additional Information
About Radiant
Radiant is redefining how AI infrastructure is built. We design and operate AI-native cloud platforms engineered for sovereignty, performance, and scale. Our infrastructure powers GPU-native workloads, multi-tenant control planes, and high-performance AI systems designed for the most demanding environments. We are not building a generic cloud. We are building purpose-built AI infrastructure - from powered land, to compute, to software. As we scale our platform and expand our engineering organisation, we are looking for leaders who can build strong teams, uphold high standards, and deliver reliably at pace.
Job Summary
We're looking for an experienced Infrastructure Site Reliability Engineer to run and evolve our infrastructure stack. You'll contribute across bare-metal, virtualization, and orchestration layers, keeping things stable and secure 24/7, 365 days a year, while mentoring teammates, improving process and automation, and helping translate deep technical concepts for a wide range of collaborators and customers.