The successful candidate will collaborate with product, infrastructure, operations, security, and production control teams to elicit and fulfill technical requirements, while driving site reliability, system observability, and operational excellence across the platform.
Primary Duties and Responsibilities
To perform this job successfully, an individual must be able to perform each primary duty satisfactorily.
Guides implementation using CI/CD pipelines in a Kubernetes environment
Directs review, configuration, and execution of Terraform and Ansible automation pipelines delivered by product teams
Guides the setup of common infrastructure platforms such as multi-region Kubernetes and Kafka clusters
Elicits requirements for application deployment and sizing to manage expected workloads
Defines and enforces Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets in collaboration with product teams
Leads blameless post-mortems and drives resolution of action items to reduce repeat incidents
Designs and implements observability frameworks covering metrics, logs, and distributed tracing across all platform services
Drives toil reduction initiatives by identifying and automating repetitive operational work
Partners with product teams to embed reliability requirements and non-functional requirements (NFRs) early in the software development lifecycle
Works with product teams to monitor application performance and tune systems
Confers with product team leads and practitioners to create deployment and reliability plans
Confers with Enterprise Architecture and Renaissance architecture teams to devise implementation architecture
Promotes application configuration standards that support the strongest security posture
Collaborates with access management and security teams on setting up roles and permissions using least-privilege strategies
Collaborates with integration/performance testing teams to leverage integrated release testing in the Release Acceptance environment
Collaborates with production controls teams on monitoring, failover, logging, and alerting strategies
Owns and continuously improves incident response runbooks, on-call rotations, and escalation procedures
Conducts capacity planning and load forecasting to proactively address scalability needs
Implements and validates infrastructure failover scenarios
Confers with the Network team on all connectivity plans and issue resolution (including connectivity between on-premises and AWS)
Follows and enables program-level agile practices for efficient collaboration and delivery
Develops documentation for ORT technical infrastructure, architecture, and reliability support
Supervisory Responsibilities
None
Requirements
The requirements listed are representative of the knowledge, skill, and/or ability required. Reasonable accommodations may be made to enable individuals with disabilities to perform the primary functions.
[Required] Understanding of Kanban and/or Agile methodologies
[Required] Familiarity with SRE principles as defined by Google SRE practices (error budgets, toil elimination, reliability hierarchy)
[Required] Able to succeed in a fast-paced environment with frequent changes
[Required] Comfortable communicating with both technical and non-technical audiences
[Required] Self-starter - takes initiative to research, learn, and deliver; anticipates the play
[Required] Team player - humble, collaborative, and focused on making the entire team succeed
[Required] Fluent with different data formats and structures: JSON, Protobuf, Avro
[Required] Experience with SQL and NoSQL databases and in-memory data stores
[Required] Software development experience in Java/Python/Scala/Golang
[Required] Two or more of the following: web/mobile application
Additional Information
To be considered for this position, apply directly to the posted job through our careers site; applications and resumes are not accepted through any other channel. We do not accept unsolicited resumes or sales solicitations from staffing agencies. Any OCC employee wishing to submit a referral must do so through their Workday account. Any resume submitted outside of an active job posting will not be considered for employment.