Design, deploy, and operate AKS clusters that host production application workloads - including ingress modernization, cluster upgrades, and workload optimization.
Own Azure infrastructure provisioning through Terraform (IaC), with strong discipline around module design, state management, peer review, and change controls.
Build and operate GitOps-based deployment pipelines using ArgoCD, partnering with application teams to deliver safe, repeatable releases.
Operate the platform's observability stack (Prometheus, Grafana, Loki, and alerting) and partner with application teams to drive Mean Time to Detect below 15 minutes.
Participate in the 24x7 on-call rotation for platform services. Respond to incidents promptly, lead root cause analysis, and drive durable fixes back into the platform.
Implement security controls in Azure (network segmentation, secrets management, audit logging, identity integration) appropriate for a regulated enterprise environment.
Document operational runbooks, decision records, and platform standards so knowledge scales beyond any one engineer.
Identify and reduce toil through automation - scripting, pipeline improvements, and self-service tooling.
Required Skills and Experience
6+ years of hands-on experience architecting and operating workloads on Microsoft Azure, with deep familiarity across compute, networking, identity, storage, and security services.
3+ years of production experience with Azure Kubernetes Service (AKS) - workload design, ingress and service mesh patterns, cluster upgrades, autoscaling, RBAC, and troubleshooting.
Strong Terraform skills, including module authoring, remote state, drift management, and CI-integrated plan / apply workflows.
Production experience with ArgoCD (or comparable GitOps tooling) - Application and ApplicationSet design, sync policies, multi-cluster deployment patterns, and rollback strategies.
Production experience with GitHub and GitHub Actions for source control and CI/CD - branching strategy, reusable workflows, secrets handling, and pipeline governance.
Strong scripting and automation skills in Bash and Python (PowerShell and Azure CLI also useful given the Azure footprint).
Demonstrated ability to diagnose complex production issues across the full stack - application, container, cluster, network, and cloud - and to drive a clean root cause analysis to closure.
Willingness to participate in a 24x7 on-call rotation is mandatory for this role. Candidates must be able to acknowledge and respond to high-severity incidents promptly during their on-call shifts.
Requirements
Experience operating under a formal change-management, SOC 2, SOX, or HIPAA-style control environment. Familiarity with audit evidence, peer review, and segregation-of-duties expectations.
Experience with Auth0 (or comparable identity platforms - Okta, Entra ID B2C) in a multi-application, multi-tenant configuration.
Hands-on experience with cloud-native observability tooling - Prometheus, Grafana, Loki, OpenTelemetry, or comparable stacks.
Familiarity with Backstage or other internal developer platforms.
Experience with Couchbase or other distributed NoSQL databases at production scale.
Experience with PagerDuty as both a responder and a service configuration owner.
Knowledge of the Palantir Foundry platform is a strong plus.
Experience with healthcare data platforms or other regulated data environments.
For this US-based position, the base pay range is $50,640.00 - $171,851.56 per year. Individual pay is determined by role, level, location, job-related skills, experience, and relevant education or training.
This job is eligible to participate in our annual bonus plan at a target of 10.00%.
The healthcare system is always evolving - and it's up to us to use our shared expertise to find new solutions that can keep up. On our growing team you'll find the opportunity to
Benefits
Health insurance
Vision insurance
Remote work options
Performance bonus
Additional Information
R1 is the leading provider of technology-driven solutions that transform the patient experience and financial performance of hospitals, health systems, and medical groups. We are the one company that combines the deep expertise of a global workforce of revenue cycle professionals with the industry's most advanced technology platform, encompassing sophisticated analytics, AI, intelligent automation, and workflow orchestration.
R1 is hiring a Platform Engineer III to join the Platform Engineering team. The team owns the cloud platform that R1's modern applications run on - AKS clusters, identity, Terraform-managed Azure infrastructure, observability tooling, databases, and incident response.
This engineer will operate at the intersection of cloud infrastructure, security, and operational excellence. We are looking for a senior individual contributor with deep Azure and Kubernetes experience who is comfortable owning critical services end-to-end, participating in a 24x7 on-call rotation, and partnering with application teams to keep the platform fast, secure, and reliable.