Staff Machine Learning Systems Engineer (MLOps)


Hims & Hers · Remote

Full-time · Remote · Today

Assembly · CI/CD · Compliance · Datadog · Docker · Helm


About the role

We're hiring a Staff ML Systems Engineer to design, build, and operate the production infrastructure that powers AI across Hims & Hers. This is a deeply technical, hands-on infrastructure role focused on the systems underneath AI - the Kubernetes platform, CI/CD and GitOps pipelines, infrastructure-as-code, inference and model-serving infrastructure, and the observability and tracing stack that keeps AI services reliable, debuggable, and compliant in production.

You won't just deploy models - you'll own the machinery that lets every AI team ship and operate safely. You'll own critical systems like our EKS clusters, deployment and autoscaling infrastructure, IAM and secrets management, LLM tracing/observability pipelines (Langfuse, Datadog, OpenTelemetry), and the developer platform that AI and product engineers rely on daily. You'll partner with ML engineers, product engineers, and clinical teams to ensure our AI systems are reliable, observable, secure, and trustworthy in a regulated healthcare environment.

This role is ideal for someone who thinks in systems and infrastructure, cares deeply about reliability, security, and cost, and wants to define how AI runs in production at a company where it directly impacts patient outcomes.

You Will:

Own and scale the AI compute and deployment platform

Own and evolve our containerized application deployment platform and related systems for AI workloads, encompassing general process and job orchestration (e.g. Kubernetes) - cluster operations, node lifecycle, autoscaling (Karpenter), storage (EBS CSI), and workload isolation across staging and production.

Build and maintain GitOps-based deployment pipelines (Helm/Kustomize overlays, environment promotion) that let teams ship AI services safely and repeatably.

Design ephemeral/preview environments, feature-branched deployments, and nightly release pipelines so teams can validate AI changes in production-like conditions before release.

Drive efficiency and cost management across compute, autoscaling, and inference infrastructure.

Build inference and model-serving infrastructure

Operate and scale inference infrastructure and a multi-provider LLM AI gateway (e.g. Bedrock, Vertex, and other providers) - including credentials, rate limits, and failover.

Build reliable serving patterns for LLM-powered workflows: routing, grounding, tool execution, and context assembly at the platform level.

Create reusable infrastructure abstractions and contracts that standardize how AI services are deployed, configured, and consumed across the company.

Own observability, tracing, and reliability

Own the LLM/AI observability and tracing stack - provisioning and scaling systems like Langfuse, Datadog (dd-trace), OpenTelemetry tracing (OTLP), and the underlying datastores (e.g. ClickHouse) - so AI behavior is auditable and debuggable in production.

Build analytics and monitoring pipelines that surface latency, error, quality, and regression signals to engineering and clinical stakeholders.

Define SLOs, alerting, on-call runbooks, and incident response for AI infrastructure; lead troubleshooting and continuously raise platform reliability.

Scale the AI developer platform and CI/CD

Own and improve the monorepo build system and CI/CD pipelines for AI workloads - including eval workflows, Docker image builds, automated PR checks and convention enforcement, and cross-platform test execution.

Own shared infrastructure tooling, CLIs, and IaC modules (Terraform, Scalr) that AI and product engineers use daily.

Identify and eliminate platform bottlenecks - reducing CI/CD cycle times, build latency, and deployment friction - to improve developer velocity across the Applied AI organization.

Drive security, compliance, and governance at the systems level

Build IAM, OIDC, and secrets management as first-class infrastructure - scoped, least-privilege roles, write-only secret rotation, and cross-account access audits.

Encode security-by-default, scope boundaries, and access controls into the platform so AI services are HIPAA-compliant and privacy

Benefits

Health insurance · Vision insurance · Remote work options · Flexible schedule

Additional Information

Hims & Hers is the leading health and wellness platform, on a mission to help the world feel great through the power of better health. We are redefining healthcare by putting the customer first and delivering access to care that is affordable, accessible, and personal, from diagnosis to treatment to delivery. No two people are the same, so we provide access to personalized care designed for results. By normalizing health & wellness challenges and innovating on their solutions, we're making better health outcomes easier to achieve. Hims & Hers is a public company, traded on the NYSE under the ticker symbol "HIMS." To learn more about the brand and offerings, you can visit hims.com/about and hims.com/how-it-works. For information on the company's outstanding benefits, culture, and its talent-first flexible/remote work approach, see below and visit www.hims.com/careers-professionals.
