Lead Site Reliability Engineer (SRE)

NAB · 15 Tran Bach Dang, An Khanh Ward

Full-time · Hybrid · Today

Agile · Apache · AWS · Capacity Planning · Chaos Engineering · CloudFormation

Benefits

Health insurance · Performance bonus

Additional Information

Closing Date: 30 December 2026
Worker Type: Maximum Term/Fixed Term (Fixed Term)

We're looking for a hands-on, forward-thinking Lead Site Reliability Engineer to elevate the reliability, automation, and scalability of one of our most strategically important domains. You'll combine strong engineering capability with servant leadership to guide the team, automate production processes, improve resilience, and drive operational excellence. You'll enjoy solving complex operational challenges with code and mentoring others on engineering best practice in a high-stakes production environment.

YOUR JOB RESPONSIBILITIES:

Reliability & Resilience Engineering
- Design and automate production operational processes, including deployments, monitoring, alerting, and self-service capabilities.
- Relentlessly optimise best practice, balancing ITIL process rigour with lean principles.
- Improve system resilience, incident recovery, observability, and performance.
- Deliver resilience and recovery testing, including chaos engineering and performance scenarios.
- Balance development speed with reliability targets through well-defined SLOs and engineering standards.

Operational Excellence & Observability
- Analyse metrics across OS, platform, and application layers to support tuning, fault diagnosis, audits, and capacity planning.
- Oversee the SDLC for reliability-focused features, including code reviews, white-box testing, and maintaining test frameworks.

Change & Incident Management
- Participate in automated change delivery, including resilience testing, verification, change control, and user communication.
- Ensure operational readiness as workload and use cases scale, optimising both human and technical resources.

Leadership & Production Ownership
- Act as champion for production resilience and scale.
- Provide data-driven assessments and readiness reports to support program-level go/no-go decisions for major releases, migrations, and customer cutovers.
- Provide technical leadership and mentorship to engineers earlier in their career journey.
- Facilitate blameless post-mortems and drive engineering-first problem resolution.

YOUR SKILLS AND EXPERIENCE

- 8+ years of experience in DevOps or Site Reliability Engineering across all phases of the software lifecycle.
- An in-depth understanding of microservice architecture, API management, and distributed systems concepts.
- Experience with cloud services is essential, in particular our core AWS technologies (EC2, ECS, EKS, S3, DynamoDB, Lambda, CloudFormation, CloudWatch, SQS, SNS, ...).
- Proficiency with build and automation tools such as Docker, Jenkins, Python/Jython, Artifactory, Terraform, and SonarQube.
- Knowledge of event-driven architectures, with experience in Apache Kafka or a similar stack.
- Excellent English communication skills, with an ability to collaborate across engineering and business stakeholders.

Specialist Skills (Highly Desirable)
- Performance Testing: Ability to measure and validate response time, throughput, and reliability under expected concurrency levels.
- Resilience Engineering: Assess whether current patterns withstand unexpected scenarios and ensure services recover automatically.
- Stress Testing: Identify breaking points and understand system behaviour during and after failure conditions.
- Reliability Engineering: Validate that critical operations (e.g., key rotations, scaling events) occur with zero customer impact and maintain stability under load.
- Observability: Skill in proving production reliability, resilience, and performance using metrics, logs, traces, dashboards, and SLOs.

THE BENEFITS AND PERKS

We appreciate and reward our colleagues who do great work every day, from excelling for our customers to taking ownership of an issue to get it resolved. Here's how we support our people with a range of exclusive benefits.

1. Generous compensation and benefit package
- Attractive salary
- 20 days of paid annual leave and 7 days of paid sick leave
- 13th month salary and Annual Performance Bonus
- Premium healthcare for yourself and family members
- Monthly allowance for team activities
- Premium welcome kit and occasional gifts of appreciation
- Extra benefits on your work anniversary

2. Exciting career and development opportunities
- Large-scale products with modern technologies in the banking domain
- Clear roadmap for career advancement in both technical and leadership pathways
- Access to digital learning platforms such as Udemy
- Consistent and high-quality leadership training through the Distinctive Leadership program (DLP)
- Specialist capabilities and accreditations in key skill areas such as Cloud Engineering, Digital, Data, Security and SREs (Site Reliability Engineers)
- Sponsored English course with native teachers
- Opportunity for training in Australia

3. Professional and engaging working environment
- Hybrid working model and excellent work-life balance
- State-of-the-art, modern Agile office
- Food and beverages in the office
