Senior MLOps Engineer
About the role
We're hiring a Senior MLOps Engineer to be the data team's owner of production ML operations. You'll build the pipelines that take models from prototype to production, own the low-latency serving API behind our Next Best Action (NBA) engine, and stand up the monitoring, alerting, and reliability layer that keeps NBA models - and the LLM agents that consume them - healthy in production.
This is a builder's role at a builder's moment: NBA is going live, the production ML platform is being shaped now, and you'll define how Clutch ships and operates AI for years to come. When there isn't active MLOps work, you'll also contribute to data engineering and machine learning work across the team.
The Data team today is five people: one data scientist, two data engineers, one data analyst, and one product manager. We're small, ambitious, and shipping fast - ML models heading to production, a serving API being built, and AI agents in active development. You'll be the senior MLOps voice inside the team and the operational bridge to HAL, the platform team that runs Clutch's agent runtime. Expect tight feedback loops, real autonomy, and a team that values pragmatism over purity.
Responsibilities
Within 3 months, you will:
- Take ownership of the ML serving API that serves NBA recommendations, partnering with the data engineer who's been building it, and harden it for low-latency production traffic
- Build the first repeatable deployment pipeline: model artifact → versioned, deployable, rollback-able production service, with infrastructure defined as code
- Stand up the monitoring foundation: latency/error/drift dashboards, alerting, and audit/trace visibility across models and agents
- Build a working relationship with HAL and become the data team's go-to on ML serving and reliability decisions
Within 6 months, you will:
- Be the primary owner (with data engineer support) of the ML serving platform and deployment pipelines for NBA and our ML models
- Have at least one production model and one production agent fully instrumented - versioning, monitoring, alerting, and multi-tenant gating in place
- Define the data team's playbook for shipping a new ML model to production, end-to-end
- Drive architectural decisions across APIs, processing pipelines, distributed compute, storage, search, observability, cloud infrastructure, and model-serving workflows
- Mentor the data engineers on MLOps patterns so they can confidently support and extend the systems you own
Within 9 months, you will:
- Operate as the technical lead within the data team for NBA production ML operations - the person other teams come to when they want to understand how Clutch ships and runs ML reliably
- Have measurably improved serving cost and latency
- Be shaping the data team's roadmap for the next generation of ML infrastructure, in partnership with the PM and data scientist
- Help us decide what to hire next as the team scales
Requirements
Required
- 8+ years of experience in software, data, or ML engineering, with 4-5+ years running ML systems in production - you've taken models from prototype to production and own what happens after deploy
- Infrastructure as code. You manage cloud infrastructure (AWS Lambda, ECS) with Terraform or equivalent - no click-ops, everything reviewable and reproducible
- Monitoring & observability discipline. You instrument serving systems for latency, error rates, drift, and cost; you read audit rows and distributed traces; you set up alerting so regressions are caught before users feel them. You treat monitoring as a first-class deliverable, not an afterthought
- Reliability rigor. You design for failure: structured error handling, graceful degradation, rollback paths, and runbooks. You have a story about a production incident you handled and how you hardened the system afterward
- Experience building and operating low-latency production APIs (FastAPI, BentoML, or equivalent), with opinions on serving, batching, and caching
- Comfortable in AWS (Lambda especially), containers (Docker), and GitHub-based workflows
- Security & governance. You ensure security and governance across systems: IAM, KMS, access policies, and Secrets Manager/SSM
- DevOps / infrastructure knowledge, plus data manipulation and feature engineering
- Solid understanding of ML concepts: models, pipelines, metrics, and super
Benefits