Staff Engineer - Site Reliability


Nextiva · Bengaluru, India

Full-time · On-site · 1w ago

Capacity Planning · CI/CD · Datadog · Elasticsearch · GCP · Grafana


Responsibilities

Own the reliability, availability, scalability, and performance of middleware and cloud infrastructure platforms
Support and optimize Kafka environments, including performance tuning, capacity planning, upgrades, and troubleshooting
Administer, support, and optimize Vector Database platforms
Manage and support GCP and GKE environments
Drive infrastructure automation, operational excellence, and platform reliability initiatives
Lead production incident response, troubleshooting, root cause analysis, and post-incident reviews
Build and maintain monitoring, alerting, and observability solutions
Define and maintain SLIs, SLOs, and error budgets
Support database operations, performance optimization, and operational health

Required Qualifications

Senior Level: 5+ years of relevant experience
Staff Level: 10+ years of relevant experience
Strong experience operating distributed systems in production environments
Mandatory Skills:
Kafka (Highest Priority)
5+ years of hands-on Kafka production experience preferred.
Strong knowledge of Kafka architecture, brokers, partitions, replication, consumer groups, monitoring, and troubleshooting.
Vector Database (Mandatory)
2+ years of hands-on production experience with Vector Databases.
Experience with Weaviate is strongly preferred.
Strong understanding of vector search, embeddings, indexing, performance tuning, scaling, replication, and operational support.
GCP / GKE
5+ years of production experience with GCP and Kubernetes.
Experience managing and troubleshooting production GKE environments.
SRE / DevOps
Experience with incident management, root cause analysis, infrastructure automation, and CI/CD practices.
Infrastructure as Code experience, preferably Terraform.
Automation and scripting experience using Python, Go, Shell, or similar technologies.
Observability
Experience with Datadog, Splunk, OpenTelemetry, Prometheus, Grafana, or similar platforms.
Strong understanding of monitoring, alerting, metrics, logs, and distributed tracing.
Linux & Database Operations
Strong Linux administration and troubleshooting skills.
Experience supporting databases such as MongoDB, PostgreSQL, Redis, Elasticsearch, or ClickHouse.

Requirements

Experience supporting AI Native platforms.
Experience with LLM infrastructure, vector search, or embedding technologies.
Strong technical leadership and mentoring experience for Staff level candidates.
On-Call Expectations
Participate in a follow-the-sun support model.
Weekend on-call rotation approximately once every four weeks.
Participate in incident response and escalation management.
AI Expectations:
Candidates should demonstrate practical use of AI tools such as ChatGPT and cloud AI services to improve troubleshooting, automation, and operational efficiency while maintaining strong engineering fundamentals and problem so

Benefits

Health insurance

Additional Information

Redefine the future of customer experiences. One conversation at a time. At Nextiva, we're reimagining how businesses connect, bringing together customer experience and team collaboration on a single, conversation centric platform. Powered by AI, driven by human innovation. Our culture is forward thinking, customer obsessed and built on the belief that meaningful connections drive better business outcomes. Whether it's through our signature Amazing Service®, the technology we create, or the experiences we cultivate, connection is at the core of who we are. If you're ready to collaborate with incredible people, make an impact, and help businesses everywhere deliver truly amazing experiences, this is where you belong.

Location: This is an onsite role based at Nextiva's Bengaluru office (Wilshire III by MFAR, 492, Hobli, RHB Colony, Mahadevapura, Bengaluru, Karnataka 560048). Working together onsite strengthens how we operate, enabling faster decisions, clearer communication, and stronger execution, so you can make a greater impact and move work forward with speed and clarity.

In-Office Expectation: This role is expected to work onsite four days per week, with the potential to increase to five days per week, as required by the business. Specific scheduling and flexibility will be guided by your leader to support both team collaboration and individual productivity.

We are seeking a Senior or Staff Site Reliability Engineer to join the Middleware Engineering team supporting NCC Next, Nextiva's AI Native platform. This role is responsible for the reliability, scalability, performance, and operational excellence of critical middleware and cloud infrastructure services. The ideal candidate will have strong experience with Kafka, Vector Databases, Kubernetes, GCP, observability, automation, and distributed systems, along with a passion for building highly reliable platforms at scale.
If you enjoy owning systems end to end, writing clean automation, and working in a fast-moving team that values innovation, this role is for you.
