Senior DevOps Engineer

Nexla · Bengaluru, India

Full-time · On-site · 3w ago

BigQuery · Capacity Planning · Clustering · Kafka · Kubernetes · Linux

Responsibilities

Distributed Processing Engines: Operate and tune distributed system workloads in production in collaboration with backend teams, covering resource allocation, autoscaling, checkpointing, backpressure, and failure recovery for both batch and streaming jobs.
Stateful Services: Run Redis clusters and other stateful systems reliably - failover, persistence, liveness/readiness tuning, and capacity planning under heavy and bursty load.
Kubernetes & Operators: Take end-to-end ownership of Amazon EKS, Google GKE and the operators (Strimzi and others) running our stateful data workloads - cluster lifecycle, scaling, version upgrades, and resource governance.
Observability: Build deep, data-aware monitoring - consumer lag, throughput, partition skew, job latency, error rates - not just host and CPU metrics. Make the data plane's behavior legible before it breaks.
Incident Management: Lead root-cause analysis for distributed-systems failures (broker outages, crashloops, sink decommissions, control-plane race conditions) and drive durable fixes. Mitigate fast, but design out the recurrence.
Infrastructure as Code & Automation: Provision and manage cloud infrastructure with Terraform; build operational runbooks and automation, including for air-gapped / private enterprise installs (pre-staged images, operator-facing procedures).

Requirements

Experience: 8+ years in infrastructure, SRE, or DevOps, with significant time spent operating production distributed data systems (not just application/cloud infra).
Kafka: Deep, hands-on operational experience running Kafka at scale in production - ideally on Kubernetes via Strimzi - including upgrades, topic/partition management, performance tuning, and TLS/secret rotation.
Distributed Processing (Strong Plus): Production experience operating one or more of Spark, Flink, or Ray - resource tuning, checkpointing, failure recovery.
Stateful Systems (Must Have): Production experience with Redis (clustering, persistence, failover) and a solid understanding of operating stateful workloads on Kubernetes (StatefulSets, PVCs, probes, operators).
Data Warehouses: Familiarity operating against Snowflake, BigQuery, or similar, and an understanding of JDBC connectivity and sink reliability.
Kubernetes & EKS: S

Benefits

Health insurance · Vision insurance · Paid time off

Additional Information

About Nexla

Nexla is the leading integration platform, built with AI, for AI. Nexla takes a metadata-driven approach to converge diverse integrations across Data, Documents, Agents, Applications, and APIs into a single design pattern. We accelerate the development of solutions for GenAI, Analytics, and Inter-company data. Nexla makes data users and developers up to 10x more productive by delivering a true blend of no-code, low-code, and pro-code interfaces. Leading companies including DoorDash, LinkedIn, Johnson & Johnson, and LiveRamp trust Nexla for mission-critical data. Nexla was named in the 2022, 2023, and 2024 Gartner Magic Quadrant™ for Data Integration Tools, is top-rated by customers on Gartner Peer Insights, and is headquartered in San Mateo, California.

At Nexla, our culture is built around our core values: Have Empathy, Be Curious, Be Intellectually Honest, Achieve Excellence, and Remember to Relax. We put our customers at the heart of everything we do, foster a data-driven mindset, take ownership of our work, and believe in the power of teamwork to achieve ambitious goals.

Role

You will own the reliability of the distributed data systems at the heart of Nexla: the streaming runtime and processing engines that move hundreds of billions of rows per day for top-tier enterprises. This is an SRE role for our big data stack: Kafka, Spark, Flink, Ray, Redis, and data warehouses, all running on Kubernetes. This is not a cloud-provisioning role. We are looking for someone who has lived inside stateful, high-throughput systems in production, who has chased down a broker outage, a checkpoint stall, a crashlooping cache, and a sink that silently stopped writing, and who fixes the architecture rather than the symptom. If keeping a large, busy data platform alive and fast is the kind of problem you find satisfying, you will have a lot of fun working with us.

This is a unique opportunity to shape the foundation of a product that is defining the next wave of intelligent, context-aware data movement.
