Staff DevOps Engineer
External | Full-time | On-site | 1mo ago
Agile, Airflow, Apache, ArgoCD, AWS, Azure
Responsibilities
- Infrastructure Development & Integration
- Design, implement, and manage cloud-native infrastructure (AWS, Azure, GCP) to support healthcare platforms, AI agents, and clinical applications.
- Build and maintain scalable CI/CD pipelines to enable rapid and reliable delivery of software, data pipelines, and AI/ML models.
- Design and manage Kubernetes (K8s) clusters for container orchestration, workload scaling, and high availability, with integrated monitoring to ensure cluster health and performance.
- Implement Kubernetes-native tools (Helm, Kustomize, ArgoCD) for deployment automation and environment management, ensuring observability through monitoring dashboards and alerts.
- Collaborate with Staff Engineers/Architects to align infrastructure with enterprise goals for scalability, reliability, and performance, leveraging monitoring insights to inform architectural decisions.
- System Optimization & Reliability
- Implement and maintain comprehensive monitoring, logging, and alerting mechanisms (Prometheus, Grafana, ELK, Datadog, AWS CloudWatch, AWS CloudTrail) to ensure real-time visibility into system performance, resource utilization, and potential incidents, and to support system reliability and proactive incident response.
- Ensure data pipeline workflows (ETL/ELT, real-time streaming, batch processing) are observable, reliable, and auditable.
- Support observability and monitoring of GenAI pipelines, embeddings, vector databases, and agentic AI workflows.
- Proactively analyze monitoring data to identify bottlenecks, predict failures, and drive continuous improvement in system reliability.
- Compliance & Security
- Support audit trails and compliance reporting through automated DevSecOps practices.
- Implement security controls for LLM-based applications, AI agents, and healthcare data pipelines, including prompt injection prevention, API rate limiting, and data governance.
- Collaboration & Agile Practices
- Partner closely with software engineers, data engineers, AI/ML engineers, and product managers to deliver integrated, secure, and scalable solutions.
- Contribute to agile development processes including sprint planning, stand-ups, and retrospectives.
- Mentor junior engineers and share best practices in cloud-native infrastructure, CI/CD, Kubernetes, and automation.
- Innovation & Technical Expertise
- Stay informed about emerging DevOps practices, cloud-native architectures, MLOps/LLMOps, and data engineering tools.
- Prototype and evaluate new frameworks and tools to enhance infrastructure for data pipelines, GenAI, and agentic AI applications.
- Advocate for best practices in infrastructure design, focusing on modularity, maintainability, and scalability.
Requirements
- Education & Experience
- Bachelor's or Master's degree in Computer Science, Engineering, or related technical discipline.
- 10+ years of experience in DevOps, Site Reliability Engineering, or related roles, with at least 3 years building cloud-native infrastructure.
- Proven track record of managing production-grade Kubernetes clusters and cloud infrastructure in regulated environments.
- Experience supporting GenAI/LLM applications (e.g., OpenAI, Hugging Face, LangChain) and vector databases (e.g., Pinecone, Weaviate, FAISS).
- Hands-on experience supporting data pipeline products using ETL/ELT frameworks (Apache Airflow, dbt, Prefect) and streaming systems (Kafka, Spark, Flink).
- Experience deploying AI agents and orchestrating agent workflows in production environments.
- Technical Proficiency
- Expertise in Kubernetes (K8s) for orchestration, scaling, and managing containerized applications.
- Strong proficiency in containerization (Docker) and Kubernetes ecosystem tools (Helm, ArgoCD, Istio/Linkerd for service mesh).
- Hands-on experience with Infrastructure as Code (Terraform, CloudFormation, or Pulumi).
- Proficiency with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, ArgoCD, Spinnaker).
- Familiarity with monitoring and observability tools (Prometheus, Grafana, ELK, Datadog, AWS CloudWatch, and AWS CloudTrail), including setting up dashboards, alerts, and custom metrics for cloud-native and AI systems.
- Good to have: knowledge of healthcare data standards (FHIR, HL7) and secure deployment practices for AI/ML and data pipelines.
- Professional Skills
- Strong problem-solving skills with a focus on reliability, scalability, and security.
- Excellent collaboration and communication skills across cross-functional teams.
- Proactive, detail-oriented, and committed to technical excellence in a fast-paced healthcare environment.
- About Get Well:
Benefits
Health insurance