Proficiency in SRE fundamentals: SLIs/SLOs, error budgets, capacity planning, chaos testing, and toil reduction.
Demonstrated experience leading incidents and collaborating across teams.
Strong interpersonal skills, with the ability to communicate effectively and interact appropriately with management, other Team Members, and outside contacts of different backgrounds and levels of experience.
Must be available to work varied shifts, including nights, weekends, and holidays, to ensure 24/7 coverage.
Provide off-hours support on an infrequent but as-needed basis during critical incidents. (Shifts may run 24/7 depending on the needs of the business.)
Team Members are required to be on site within the IT Command Center.
Certifications & Training
AZ-400: Azure DevOps Engineer Expert
AZ-305: Azure Solutions Architect Expert or AZ-104: Azure Administrator Associate
AZ-500: Azure Security Engineer Associate
ITIL v4 for operational ri
Benefits
Health insurance
Additional Information
Job Description:
Position Overview
The primary responsibility of the Senior Site Reliability Engineer (SRE) is to lead reliability engineering initiatives across our Azure estate and Command Center operations. This role focuses on scripting, automation, and observability to ensure uptime, performance, and rapid incident response. The Senior SRE will design and implement monitoring-as-code, optimize alerting, and build self-healing automation that reduces toil and accelerates recovery.
As part of our journey from traditional operations toward a mature SRE model, the Senior SRE will partner with product engineering, platform teams, and the Command Center, including Service Desk and Major Incident Command (MIC), to deliver measurable improvements in service reliability.
All duties are to be performed in accordance with departmental and Las Vegas Sands Corp.'s policies, practices, and procedures. All Las Vegas Sands Corp. Team Members are expected to conduct and carry themselves in a professional manner at all times. Team Members are required to observe the company's standards, work requirements and rules of conduct.
Essential Duties & Responsibilities
Observability & Monitoring
Architect end-to-end monitoring using Azure Monitor, Log Analytics, Application Insights, and ITRS Geneos.
Implement monitoring-as-code with Terraform/Bicep, including alerts, dashboards, and diagnostic settings.
Create actionable dashboards (Azure Workbooks, Grafana) for SLIs/SLOs and real-time service health.
Alerting & Incident Response
Design alert taxonomies with severity mapping (P0-P4), dynamic thresholds, and escalation policies.
Reduce alert noise and ensure 100% alert-to-runbook mapping.
Support Major Incident Command (MIC) during P0/P1 bridges with technical expertise and rapid remediation.
Automation & Tooling
Build automation using PowerShell, Python, and Azure Functions for alert lifecycle, runbooks, and self-healing workflows.
Integrate with ITSM (ServiceNow/Jira) for automated ticket enrichment and routing.
Eliminate repetitive operational tasks and reduce toil through automation-first practices.
Reliability Engineering
Define and enforce SLIs/SLOs, error budgets, and resilience patterns (bulkheads, retries, timeouts).
Conduct production readiness reviews, chaos drills, and failover rehearsals.
Partner with app teams to embed instrumentation and structured logging.
Governance & Compliance
Enforce desired state with Azure Policy, DSC/Guest Configuration, and drift detection.
Harden networking (VNet, NSGs, Private Link, Firewall), identity (Entra ID), and secrets (Key Vault).
Ensure auditability and compliance across environments.
Perform job duties in a safe manner.
Attend work as scheduled on a consistent and regular basis.
Perform other related duties as assigned.