Staff DevOps Engineer

Getwellnetwork · Bengaluru, India

Full-time · On-site · 1mo ago

Agile · Airflow · Apache · ArgoCD · AWS · Azure

Responsibilities

Infrastructure Development & Integration
Design, implement, and manage cloud-native infrastructure (AWS, Azure, GCP) to support healthcare platforms, AI agents, and clinical applications.
Build and maintain scalable CI/CD pipelines to enable rapid and reliable delivery of software, data pipelines, and AI/ML models.
Design and manage Kubernetes (K8s) clusters for container orchestration, workload scaling, and high availability, with integrated monitoring to ensure cluster health and performance.
Implement Kubernetes-native tools (Helm, Kustomize, ArgoCD) for deployment automation and environment management, ensuring observability through monitoring dashboards and alerts.
Collaborate with Staff Engineers/Architects to align infrastructure with enterprise goals for scalability, reliability, and performance, leveraging monitoring insights to inform architectural decisions.
System Optimization & Reliability
Implement and maintain comprehensive monitoring, logging, and alerting mechanisms (Prometheus, Grafana, ELK, Datadog, AWS CloudWatch, AWS CloudTrail) to ensure real-time visibility into system performance, resource utilization, and potential incidents, supporting system reliability and proactive incident response.
Ensure data pipeline workflows (ETL/ELT, real-time streaming, batch processing) are observable, reliable, and auditable.
Support observability and monitoring of GenAI pipelines, embeddings, vector databases, and agentic AI workflows.
Proactively analyze monitoring data to identify bottlenecks, predict failures, and drive continuous improvement in system reliability.
Compliance & Security
Support audit trails and compliance reporting through automated DevSecOps practices.
Implement security controls for LLM-based applications, AI agents, and healthcare data pipelines, including prompt injection prevention, API rate limiting, and data governance.
Collaboration & Agile Practices
Partner closely with software engineers, data engineers, AI/ML engineers, and product managers to deliver integrated, secure, and scalable solutions.
Contribute to agile development processes including sprint planning, stand-ups, and retrospectives.
Mentor junior engineers and share best practices in cloud-native infrastructure, CI/CD, Kubernetes, and automation.
Innovation & Technical Expertise
Stay informed about emerging DevOps practices, cloud-native architectures, MLOps/LLMOps, and data engineering tools.
Prototype and evaluate new frameworks and tools to enhance infrastructure for data pipelines, GenAI, and agentic AI applications.
Advocate for best practices in infrastructure design, focusing on modularity, maintainability, and scalability.

Requirements

Education & Experience
Bachelor's or Master's degree in Computer Science, Engineering, or related technical discipline.
10+ years of experience in DevOps, Site Reliability Engineering, or related roles, including at least 3 years building cloud-native infrastructure.
Proven track record of managing production-grade Kubernetes clusters and cloud infrastructure in regulated environments.
Experience supporting GenAI/LLM applications (e.g., OpenAI, Hugging Face, LangChain) and vector databases (e.g., Pinecone, Weaviate, FAISS).
Hands-on experience supporting data pipeline products using ETL/ELT frameworks (Apache Airflow, dbt, Prefect) and streaming systems (Kafka, Spark, Flink).
Experience deploying AI agents and orchestrating agent workflows in production environments.
Technical Proficiency
Expertise in Kubernetes (K8s) for orchestration, scaling, and managing containerized applications.
Strong proficiency in containerization (Docker) and Kubernetes ecosystem tools (Helm, ArgoCD, Istio/Linkerd for service mesh).
Hands-on experience with Infrastructure as Code (Terraform, CloudFormation, or Pulumi).
Proficiency with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, ArgoCD, Spinnaker).
Familiarity with monitoring and observability tools (Prometheus, Grafana, ELK, Datadog, AWS CloudWatch, and AWS CloudTrail), including setting up dashboards, alerts, and custom metrics for cloud-native and AI systems.
Good to have: knowledge of healthcare data standards (FHIR, HL7) and secure deployment practices for AI/ML and data pipelines.
Professional Skills
Strong problem-solving skills with a focus on reliability, scalability, and security.
Excellent collaboration and communication skills across cross-functional teams.
Proactive, detail-oriented, and committed to technical excellence in a fast-paced healthcare environment.
About Get Well:

Benefits

Health insurance
