Staff Site Reliability Engineer - Splunk Expert

Okta · Bengaluru, India

Full-time · On-site · 3w ago

AWS · Azure · DNS · GCP · Grafana · Incident Response

Responsibilities

Splunk Architecture & Optimisation: Lead the design and tuning of Splunk environments. Optimise indexer performance, search efficiency, and data models to ensure rapid troubleshooting and cost-efficiency.
Advanced Visualisation: Architect and maintain sophisticated Grafana dashboards that correlate disparate data sources into a single pane of glass for real-time system health.
Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using tools like Terraform.
Pipeline Engineering: Optimise the collection, processing, and storage of telemetry data (Metrics, Logs, Traces) to ensure high reliability and low latency.
Workflow Automation: Develop custom Splunk workflows and integrations that trigger automated responses to system events, reducing Mean Time to Resolution (MTTR).
Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements through "observability-driven development."
Required Skills & Experience (The Essentials)
Splunk Mastery: Deep, hands-on experience with Splunk administration, search optimisation (SPL), and architecting complex data pipelines. You know how to make Splunk "hum" at scale.
Grafana Expertise: Proven ability to build actionable, intuitive dashboards in Grafana that go beyond simple charts to provide deep operational insights.
SRE Mindset: 8+ years of experience in an SRE, DevOps, or Systems Engineering role with a focus on high-availability systems.
Programming Proficiency: Strong coding skills in Go, Python, or Ruby for building internal tools and automating observability workflows.
Telemetry Standards: Hands-on experience with OpenTelemetry (OTel), Prometheus, or similar frameworks for instrumenting applications.
Distributed Systems: Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/EKS).
Bonus Skills (The "Nice-to-Haves")
Tracing: Implementation of distributed tracing (Jaeger, Tempo, or Honeycomb) to visualise request flow across microservices.
Security Observability: Experience using Splunk for security orchestration (SOAR) or SIEM-related workflows.
Cloud Platforms: Experience managing native observability tools within AWS, Azure, or GCP.
P22381_3143209
The Okta Experience
Supporting Your Well-Being
Driving Social Impact
Developing Talent and Fostering Connection + Community
We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.
If reasonable accommodation is needed to complete any part of the job application, in

Benefits

Health insurance · Performance bonus

Additional Information

Secure Every Identity, from AI to Human

Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence. This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.

Workforce Identity Cloud

Okta Workforce Identity Cloud (WIC) provides easy, secure access for your workforce so you can focus on other strategic priorities, like reducing costs and doing more for your customers. If you like to be challenged and have a passion for solving large-scale automation, testing, and tuning problems, we would love to hear from you. The ideal candidate is someone who exemplifies the ethic "If you have to do something more than once, automate it" and who can rapidly self-educate on new concepts and tools.

Position Overview

We are seeking a highly technical Staff Site Reliability Engineer with deep expertise in Splunk and Grafana to own and evolve our observability ecosystem. In this role, you will move beyond simple monitoring to architect a comprehensive, scalable telemetry platform. You will be our subject-matter expert in Splunk optimisation, ensuring our logging architecture is performant, cost-effective, and deeply integrated with our automated workflows. You will treat infrastructure as code, utilising Terraform and strong coding proficiency in Go, Python, or Ruby to automate the deployment of agents and collectors across complex distributed systems.

Interested in this role?

Apply on the company's website.
