Site Reliability Engineer (SRE)

Talentvis Singapore · Orchard Gateway @ Emerald, Singapore

S$60K–S$96K/yr · Contract · Unknown · Today

Information Technology

About the role

We are seeking a Site Reliability Engineer (SRE) to support the reliability, availability, and operational performance of mission-critical systems and platforms. The successful candidate will be responsible for monitoring system health, responding to incidents, improving operational resilience, and driving performance optimisation to ensure high system availability and service continuity.

Responsibilities

Design, implement, and maintain monitoring, alerting, and observability solutions for production systems and services.
Monitor system performance, availability, and reliability to proactively identify and resolve operational issues.
Respond to production incidents, perform root cause analysis, and implement corrective and preventive actions.
Collaborate with software engineering, infrastructure, and operations teams to improve system resilience and operational efficiency.
Develop and maintain automation scripts and operational tools to reduce manual effort and improve service reliability.
Support capacity planning, performance tuning, and infrastructure optimisation initiatives.
Participate in incident management, problem management, and post-incident review activities.
Implement reliability engineering best practices, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets where applicable.
Develop operational runbooks, standard operating procedures, and disaster recovery documentation.
Support system upgrades, deployments, maintenance activities, and business continuity exercises while minimising service disruption.

Requirements

Bachelor's degree in Computer Science, Information Technology, Software Engineering, or a related discipline.
Demonstrated experience in Site Reliability Engineering, Systems Engineering, DevOps, Platform Engineering, or Production Operations.
Experience with system monitoring, observability, and performance analysis tools.
Experience supporting incident response and production troubleshooting in high-availability environments.
Experience in performance optimisation and ensuring system reliability and operational continuity.
Knowledge of Linux/Unix operating systems and networking fundamentals.
Familiarity with scripting or programming languages such as Python, Bash, Go, or similar.
Strong analytical and problem-solving skills with the ability to troubleshoot complex production issues.
Experience with cloud platforms (AWS, Azure, or Google Cloud Platform).
Experience with containerisation technologies such as Docker and Kubernetes.
Familiarity with Infrastructure as Code (Terraform, Ansible, or similar).
Experience implementing CI/CD pipelines and automation practices.
Knowledge of monitoring platforms such as Prometheus, Grafana, Splunk, ELK, Datadog, or equivalent.
Experience supporting mission-critical, financial services, telecommunications, government, or critical infrastructure systems would be advantageous.

Interested candidates are encouraged to submit their resume, along with a cover letter outlining their relevant experience and achievements, to apply88@talentvis.com, or to click Apply Now.
**We regret to inform you that only shortlisted candidates will be notified.**
EA License No: 04C3537
EA Personnel No: R22106683
EA Personnel Name: Yang Hui Shan, Sherri
