Site Reliability Engineer
Responsibilities
- Support the deployment, configuration, and maintenance of high-end GPU servers, storage servers, networking equipment, and software components in highly secure environments.
- Perform hardware diagnostics, system functionality checks, and firmware updates as required.
- Collaborate with engineering teams to assist in deploying tailored customer environments (e.g., bare-metal systems, HPC clusters, Kubernetes, Slurm).
- Serve as the first line of engineering support for onsite operational issues, including troubleshooting hardware, network, and software problems, and firmware compliance.
- Troubleshoot incidents, escalate critical issues and provide feedback to appropriate teams for improvements.
- Participate in an on-call rotation to ensure 24/7 availability and responsiveness to critical issues.
- Provide technical support to the GOC Support Specialist team in troubleshooting problems related to compute infrastructure.
- Document incident details, resolutions, and lessons learned to enhance future problem-solving.
- Maintain clear, accurate, and up-to-date documentation to promote effective knowledge sharing across the team.
- Communicate effectively with GOC, HPC Engineers, internal teams, stakeholders, and end-users to ensure alignment on issue resolution.
- Take part in team meetings and knowledge-sharing sessions to foster collaboration and continuous learning.
Requirements
- Bachelor's degree in computer engineering, computer science, or a related technical field.
- 5+ years of experience in technical field service roles.
- Strong understanding of server hardware technology, firmware lifecycle, Linux environments and troubleshooting hardware problems, with adherence to physical and system-level security standards.
- Experience with scripting languages (e.g., Bash, Python).
- Familiarity with configuration management, CI/CD tools, workload managers and cluster software (e.g., Slurm, Kubernetes, NVIDIA BCM), and observability tools (e.g., Prometheus, Grafana, ELK).
- Excellent problem-solving and analytical skills.
- Ability to work independently and as part of a team.
- Strong communication skills, both written and verbal.
Location & Reporting
- Based in: Singapore
- Reporting to: Senior Operations Manager
Employment Basis
- Full-time
Diversity
- At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.
- Join us in our mission to revolutionize the AI industry.
Benefits
Additional Information
Firmus Technologies
Firmus Technologies is a global leader pioneering the solution to AI's energy challenge, founded in Australia in 2019 by a visionary team of entrepreneurs. Our mission is to create the most energy-efficient AI infrastructure, combining cutting edge technology with a steadfast commitment to sustainability. Through ground-breaking research and development, we invented a verticalized AI Factory - a new class of digital infrastructure that replaces traditional data centres. Built on new approaches to liquid cooling, energy management, water use and modular construction methodology, the Firmus AI Factory delivers low-cost AI tokens across Asia-Pacific.
Firmus AI Cloud
We provide customers with access to energy savings via our large-scale GPU cloud, Firmus AI Cloud. Rated Silver in The GPU Cloud ClusterMAX™ Rating System, our cloud empowers developers, enterprise, education and government users to train AI models with unmatched efficiency and cost savings. With an ever-growing list of services and applications, we are committed to building a cloud experience for our customers that is market-leading, proprietary and built to scale.
Why you'll love working here
- A fast-paced and dynamic environment working with next-gen technology.
- You'll be operating at the intersection of sustainability and artificial intelligence - helping to transform an industry.
- Working with and access to colleagues who are true innovators and leaders in their field.
- As an emerging company, we work as a close-knit team. Work with the founders, grow a strong network, and witness the impact you make first-hand as we democratise AI tools for everyone - more sustainably, and more affordably.
We believe that people from diverse backgrounds come together to do their best work, be their authentic selves, and build great things. We are proud to be an equal opportunity employer.
ROLE SUMMARY
Firmus Technologies is seeking a skilled Site Reliability Engineer to join our Operations team, supporting the daily operations and maintenance of our AI-accelerated High-Performance Computing (HPC) infrastructure. This role will work closely with Field Service Engineers, HPC and Network Engineering teams, and assist the Global Operations Centre (GOC). This is a unique opportunity to contribute directly to the stability and growth of cutting-edge AI infrastructure.