Possesses extensive knowledge of own area of expertise and in-depth knowledge of the broader portfolio, enabling a comprehensive understanding of upstream and downstream impacts across technology infrastructure
Is responsible for designing technology solutions that prevent or minimize service disruptions
Prevents technology service disruptions through technology solution recommendations and automations
Fosters a culture of deep learning through blameless post-mortems to improve the shared goal of reliability across services
Transforms operations teams by facilitating internal change to adopt SRE standard methodologies across the organization and by driving strategic growth in this area within Global Technology
Analyzes incidents impacting technology availability for high-level trends across the broad portfolio
Drives initiatives to reduce or prevent technology failures in a complex, distributed technology environment
Pulls together information from disconnected systems into cohesive views of the technology portfolio for identifying trends, redundancies, and risk
Demonstrates outstanding awareness of the complexities of the tech and asset management industries
May lead initiatives of varying degrees of complexity that span multi-functional areas
Contributes to definition of target state architecture and design of the technology environment
Requirements
Required:
Bachelor's degree or the equivalent combination of education and relevant experience AND 10+ years of experience designing and operating cloud infrastructure with senior‑level impact.
5+ years of experience building and supporting solutions in Amazon Web Services (AWS)
5+ years of experience building and running a DevOps and/or SRE function
Experience with implementation and operation of the chaos model at scale
Strategic and program-level implementation experience
Demonstrable experience implementing new technology, tools, and platforms
System administration and scripting experience
Demonstrable experience leveraging automation to proactively prevent or quickly remediate incidents
Fluent in multiple programming languages (e.g., Python, Java, Go, Node.js, .NET Core)
Proficiency with database development (SQL Server, PostgreSQL, MySQL, etc.)
Proficiency with defining, right-sizing, tracking, and reporting on Service Level Objectives (SLOs), Service Level Indicators (SLIs), system availability, and the progress and outcomes related to reliability
Experience with implementing and managing Error Budgets
Proficiency with understanding and explaining incident situations and their recovery plans to prevent recurrence
Knowledge/experience driving dashboard standardization across the ecosystem for observability, APM and infrastructure monitoring, and application-specific logging
Knowledge/experience with observability tools such as New Relic, SolarWinds DPA, Elastic Stack, Prometheus, Grafana, Splunk, and cloud native tools
Knowl
Benefits
Equity / stock options
Additional Information
At T. Rowe Price, we identify and actively invest in opportunities to help people thrive in an evolving world. As a premier global asset management organization with more than 85 years of experience, we provide investment solutions and a broad range of equity, fixed income, and multi-asset capabilities to individuals, advisors, institutions, and retirement plan sponsors. We take an active, independent approach to investing, offering our dynamic perspective and meaningful partnership so our clients can feel more confident.
We believe doing the right thing for our clients and our associates is good business. With a career at the firm, you can expect opportunities to create real impact at work and in your community. You'll enjoy resources to support your career path, as well as compensation, benefits, and flexibility to enrich your life. Here, you'll find a collaborative culture that respects and values differences and colleagues who share a spirit of generosity.
Join us for the opportunity to grow and make a difference in ways that matter to you.
Role Summary
In this role as Principal Site Reliability Engineer, Infrastructure Observability, you will help formulate, develop, and implement a team of Site Reliability Engineers (SREs) focused on the observability, sustainability, scalability, measurability, and recoverability of T. Rowe Price's innovative cloud and on-prem solutions by leveraging automation and best-of-breed tools. The successful candidate will have a strong operations and engineering background, be hands-on when needed, and have expertise in cloud environments (public and private), infrastructure operations, DevOps practices, CI/CD toolchains and systems, code build and deployment, incident response, and 24x7 monitoring and support.
The candidate will also have extensive experience operating in an SRE function within a complex, distributed environment. They will have a demonstrated ability to work horizontally and vertically within an organization with diverse partners and sponsor groups.