Senior Principal Infrastructure Services (SRE Practice)
Responsibilities
- Reliability‑Focused System Design & Architecture
- Lead the design and evolution of highly reliable, scalable, and performant distributed systems, applying SRE principles across infrastructure and application layers.
- Partner with engineering and architecture teams to influence system design decisions that improve resilience, fault tolerance, and operational simplicity.
- Define and promote reliability patterns, architectural best practices, and non‑functional requirements aligned with business criticality.
- SRE Operations & Automation
- Drive an automation‑first approach by designing and developing tools, scripts, and platforms that reduce manual effort, operational toil, and human error.
- Embed reliability engineering into the software delivery lifecycle through CI/CD integration, safe deployments, and repeatable operational workflows.
- Establish clear operational metrics and service health indicators to ensure transparency and accountability.
- Incident Management & Root Cause Analysis
- Participate in and lead incident response for production systems, ensuring timely mitigation and minimal customer or business impact.
- Conduct and drive blameless post‑incident reviews, focusing on identifying systemic causes rather than individual faults.
- Implement long‑term corrective actions to prevent recurrence and measurably improve system reliability.
- Monitoring, Alerting & Observability
- Architect and implement end‑to‑end observability across systems using metrics, logs, and traces to enable rapid diagnosis and proactive issue detection.
- Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to balance reliability with feature velocity.
- Build and maintain actionable dashboards and alerts that provide real‑time insights into system health, performance, and risk.
- Continuous Reliability Improvement
- Identify reliability gaps through data analysis, failure reviews, and resilience testing, driving targeted improvement initiatives.
- Lead efforts such as capacity planning, load testing, chaos engineering, and fault injection to validate system behavior under stress.
- Continuously reduce operational toil, improve mean time to detect (MTTD) and mean time to recover (MTTR), and raise overall service maturity.
- Documentation & Knowledge Sharing
- Create and maintain clear, accurate, and actionable documentation including system architectures, runbooks, operational standards, and incident playbooks.
- Ensure documentation supports operational readiness, repeatability, and effective knowledge transfer across teams.
- Cross‑Functional Collaboration & Influence
- Work closely with product, development, platform, security, and operations teams to embed SRE principles into roadmap planning and delivery.
- Act as a trusted advisor, translating reliability data and operational risk into business‑relevant insights for technical and non‑technical stakeholders.
- Advocate for SRE best practices and help build a strong reliability culture across the organization.
- Project & Initiative Leadership
- Manage and prioritize multiple reliability‑focused initiatives, balancing short‑term operational needs with long‑term system health.
- Drive execution of strategic SRE programs that measurably improve system resilience, scalability, and operational efficiency.
- Qualifications & Experience
- Bachelor's degree in Computer Science,
Additional Information
About Northern Trust: Northern Trust, a Fortune 500 company, is a globally recognized, award-winning financial institution that has been in continuous operation since 1889. Northern Trust is proud to provide innovative financial services and guidance to the world's most successful individuals, families, and institutions by remaining true to our enduring principles of service, expertise, and integrity. With more than 130 years of financial experience and over 22,000 partners, we serve the world's most sophisticated clients using leading technology and exceptional service.
About the Company & Role
Northern Trust is seeking an experienced Sr. Principal Site Reliability Engineer with a strong focus on developing observability and automation. This role will play a pivotal part in ensuring the reliability and performance of the company's systems and services. As a Site Reliability DevOps Engineer, you will be responsible for defining and deploying key observability services with a deep focus on architecture, production operations, capacity planning, performance management, deployment, and release engineering. You will work with cross-functional teams to help improve the efficiency of our services. Your expertise in both software engineering and system operations will enable our partners to drive continuous improvements in our platform's reliability. This role will focus on bringing complete observability across all technologies and will be responsible for a number of key functions that both support and drive improvements to the reliability of Northern Trust's IT landscape.