Staff Engineer - Site Reliability
Responsibilities
- Own the reliability, availability, scalability, and performance of middleware and cloud infrastructure platforms
- Support and optimize Kafka environments, including performance tuning, capacity planning, upgrades, and troubleshooting
- Administer, support, and optimize Vector Database platforms
- Manage and support GCP and GKE environments
- Drive infrastructure automation, operational excellence, and platform reliability initiatives
- Lead production incident response, troubleshooting, root cause analysis, and post-incident reviews
- Build and maintain monitoring, alerting, and observability solutions
- Define and maintain SLIs, SLOs, and error budgets
- Support database operations, performance optimization, and operational health
Required Qualifications
- Senior Level: 5+ years of relevant experience
- Staff Level: 10+ years of relevant experience
- Strong experience operating distributed systems in production environments
Mandatory Skills
- Kafka (Highest Priority)
  - 5+ years of hands-on Kafka production experience preferred.
  - Strong knowledge of Kafka architecture, brokers, partitions, replication, consumer groups, monitoring, and troubleshooting.
- Vector Database (Mandatory)
  - At least 2 years of hands-on production experience with Vector Databases.
  - Experience with Weaviate is strongly preferred.
  - Strong understanding of vector search, embeddings, indexing, performance tuning, scaling, replication, and operational support.
- GCP / GKE
  - 5+ years of production experience with GCP and Kubernetes.
  - Experience managing and troubleshooting production GKE environments.
- SRE / DevOps
  - Experience with incident management, root cause analysis, infrastructure automation, and CI/CD practices.
  - Infrastructure as Code experience, preferably Terraform.
  - Automation and scripting experience using Python, Go, Shell, or similar technologies.
- Observability
  - Experience with Datadog, Splunk, OpenTelemetry, Prometheus, Grafana, or similar platforms.
  - Strong understanding of monitoring, alerting, metrics, logs, and distributed tracing.
- Linux & Database Operations
  - Strong Linux administration and troubleshooting skills.
  - Experience supporting databases such as MongoDB, PostgreSQL, Redis, Elasticsearch, or ClickHouse.
Requirements
- Experience supporting AI Native platforms.
- Experience with LLM infrastructure, vector search, or embedding technologies.
- Strong technical leadership and mentoring experience for Staff level candidates.
On Call Expectations
- Participate in a Follow the Sun support model.
- Weekend on call rotation approximately once every four weeks.
- Participate in incident response and escalation management.
AI Expectations
- Candidates should demonstrate practical use of AI tools such as ChatGPT and cloud AI services to improve troubleshooting, automation, and operational efficiency while maintaining strong engineering fundamentals and problem so
Additional Information
Redefine the future of customer experiences. One conversation at a time.
At Nextiva, we're reimagining how businesses connect, bringing together customer experience and team collaboration on a single, conversation centric platform. Powered by AI, driven by human innovation. Our culture is forward thinking, customer obsessed and built on the belief that meaningful connections drive better business outcomes. Whether it's through our signature Amazing Service®, the technology we create, or the experiences we cultivate, connection is at the core of who we are. If you're ready to collaborate with incredible people, make an impact, and help businesses everywhere deliver truly amazing experiences, this is where you belong.
Location: This is an onsite role based at Nextiva's Bengaluru office (Wilshire III by MFAR, 492, Hobli, RHB Colony, Mahadevapura, Bengaluru, Karnataka 560048). Working together onsite strengthens how we operate, enabling faster decisions, clearer communication, and stronger execution, so you can make a greater impact and move work forward with speed and clarity.
In-Office Expectation: This role is expected to work onsite four days per week, with the potential to increase to five days per week, as required by the business. Specific scheduling and flexibility will be guided by your leader to support both team collaboration and individual productivity.
We are seeking a Senior or Staff Site Reliability Engineer to join the Middleware Engineering team supporting NCC Next, Nextiva's AI Native platform. This role is responsible for the reliability, scalability, performance, and operational excellence of critical middleware and cloud infrastructure services. The ideal candidate will have strong experience with Kafka, Vector Databases, Kubernetes, GCP, observability, automation, and distributed systems, along with a passion for building highly reliable platforms at scale.
If you enjoy owning systems end to end, writing clean automation, and working in a fast-moving team that values innovation, this role is for you.