Senior DevOps Engineer
About the role
We are looking for a Staff DevOps Engineer to join our DevOps team at K Health. You will own and evolve the infrastructure underpinning a healthcare AI platform serving patients and enterprise health system partners. This is a high-ownership role: you will architect and operate cloud environments across K Health and its enterprise partners, lead complex infrastructure migrations, drive disaster recovery programs, and help build the next generation of AI-powered operations tooling. You will also mentor junior engineers and collaborate closely with product and engineering teams across the company.
This is a hybrid role that requires onsite presence in our New York City office 4 days a week and includes participation in a daytime on-call rotation. The role does not provide immigration support.
Behind every leading health system is K Health's AI-powered virtual care engine. Esteemed health systems like Mayo Clinic, Cedars-Sinai, Mass General Brigham, Hackensack Meridian Health, and Hartford Healthcare partner with K Health to build and run modern primary virtual care clinics on their behalf. Our deeply integrated model modernizes the primary care loop by using AI to put humans first. For our patients, we offer clinical AI (i.e., PatientGPT) and unparalleled access to close care gaps around the clock. For our providers, we deliver provider-serving agentic solutions (i.e., Perfect Note) to eliminate administrative overload and burnout. And for the health systems, we deploy our top-grade Virtualists in AI-powered virtual clinics 24/7 to capture patients' care journeys at step one, retain them within the system for longitudinal care, and strengthen profitability.
Responsibilities
- Own the design, implementation, and evolution of our GKE-based Kubernetes infrastructure across K Health and enterprise partner environments.
- Build and maintain our Terraform modular infrastructure library, including reusable modules with automated testing, across GCP, Cloudflare, and AWS.
- Architect, build, and maintain GitLab CI/CD shared pipeline templates used by all engineering teams (build, test, security scanning, deployment).
- Own and maintain self-hosted infrastructure software running in-cluster, including GitLab, ArgoCD, Langfuse, DependencyTrack, NGINX Ingress, and others.
- Implement and support security and compliance controls across infrastructure and the software supply chain: secrets management, pipeline secret detection, container scanning, and SOC2 and HIPAA compliance.
- Drive disaster recovery readiness: design failover scenarios, author runbooks, and lead periodic DR tests.
- Lead development of AI-powered operations tooling and agentic infrastructure.
- Monitor, troubleshoot, and improve production system reliability; respond to incidents during on-call shifts.
- Mentor junior DevOps engineers and establish team-wide engineering standards.
Requirements
- 5+ years of experience in DevOps, platform engineering, or site reliability engineering.
- Deep, hands-on experience with Kubernetes and the surrounding ecosystem - Helm, Helmfile, ArgoCD, Kyverno, cert-manager, and NGINX Ingress.
- Extensive experience with Google Cloud Platform - GKE, Cloud SQL, Memorystore, Cloud Storage, IAM, and Workload Identity.
- Strong Terraform expertise: modular architecture, multi-environment provisioning, and automated testing.
- Advanced knowledge of GitLab CI/CD and GitOps practices.
- Proficiency in Python and/or Go.
Plus:
- Advanced Bash scripting skills.
- Experience with secrets management solutions such as Akeyless or HashiCorp Vault.
- Experience with database administration across PostgreSQL, Redis, and MongoDB - including DR configuration and operational runbooks.
- Experience with Datadog or equivalent observability platform (APM, infrastructure, log management).
- Experience with Cloudflare for DNS, CDN, and security rules management.
- Demonstrated experience designing and executing disaster recovery programs, including failover testing and runbook authorship.
Bonus:
- Experience in highly regulated environments - SOC2 and HIPAA.
- Excellent communication skills with the ability to lead cross-functional infrastructure initiatives.
- Demonstrated leadership experience, including mentoring junior engineers.
- Experience with HPC or GPU cluster infrastructure, including Slurm.
- Experience building or operating AI agents or agentic infrastructure.
- Experience with microservices architecture and API gateway / reverse proxy patterns.
- Experience with AWS.
Benefits & Perks
- Hybrid work schedule with weekly lunches and stocked fridges
- Monthly social committees for company events
- 18 vacation days, 9 company holidays, 5 sick days, and 2 personal days
- Stock options for every full-time employee
- Paid parental leave
- 401k benefit
- Commuter Benefits
- Competitive health, dental, and vision insurance options