
Senior Platform SRE

External
IG · Bangalore, India
Full-time · Hybrid · Today
AWS · Capacity Planning · CI/CD · Datadog · Grafana · Incident Response


Responsibilities

Build and own the reliability platform

  • Implement comprehensive monitoring and observability using OpenTelemetry and distributed tracing. Maintain SLOs, error budgets, and burn-rate tracking
  • Establish and maintain 24/7 operational readiness including automated deployments, blue/green releases, and zero-downtime patching strategies
  • Engineer self-healing capabilities: auto-remediation, error-budget-gated rollback, and automated traffic rerouting
  • Design and run chaos experiments across the AWS estate, turning severe-but-plausible failure scenarios into engineering improvements
  • Build automation tools and CI/CD pipelines that embed reliability practices, while applying software engineering discipline including version control, code reviews, and testing
  • Contribute to the SRE AI agent, IG's agentic tooling for incident investigation and reliability review, built on AWS frontier models
  • Mentor junior SREs and Reliability Champions on reliability patterns and production engineering discipline

Set and uphold standards

  • Author and evolve the SRE standards that underpin the Guild: SLO methodology, error budget policy, observability instrumentation guide, and Production Readiness Review (PRR) checklist
  • Mentor developers on reliability patterns including circuit breakers, retry logic, and fault tolerance
  • Work with development teams and Reliability Champions to design SLOs on customer journeys rather than per service
  • Assist and guide teams in system design, capacity planning, and architectural reviews, and in closing observability gaps

Own incident response and learning

  • Facilitate blameless post-incident reviews (PIRs) within five working days using contributing-factor methodology
  • Maintain the Lessons Register, track remediation actions to closure, and surface patterns across incidents quarterly

What you'll need for this role

Essential Technical Skills

  • Observability and instrumentation: hands-on OpenTelemetry experience (spans, metrics, traces, context propagation) and production use of Honeycomb, Datadog, Dynatrace, or Grafana; able to instrument Java or Python services directly
  • SLOs and error budgets: proven track record designing customer-meaningful SLIs, setting error budgets, configuring multi-window burn-rate alerts, and working with development teams on reliability measurement
  • CI/CD and release engineering: experience building pipelines with safety mechanisms such as blue/green and canary releases, automated rollback, and DORA metrics integration
  • Container orchestration: Kubernetes (EKS, AKS, or GKE) required; HashiCorp Nomad is a strong advantage on IG's hybrid estate; solid understanding of cloud networking and IaC (Terraform preferred)
  • Software engineering: production-quality coding in Java and/or Python; comfortable contributing to application codebases to implement reliability patterns, not just configuring infrastructure around them
  • Distributed systems: strong understanding of how large-sca

Additional Information

Job Title: Senior Platform SRE

Job Description

So, who are we?

IG has been at the centre of retail trading and investment since 1974, when we helped create the market for financial spread betting. Today, we're a FTSE 100 fintech operating across five continents, serving over 700,000 clients and handling billions in transactions - built on decades of scale, trust and proof. We didn't pivot to innovation; it's how we've always operated. What that means for the people who work here is real: genuinely complex problems to solve, the technology and resources to tackle them properly, and the kind of scope that's rare in established businesses. The bar is high - bring a curious and forward-thinking mindset and we'll give you the platform to define what comes next. Join us at IG - the future gets built here.

Your team

The Platform SRE team is the engine of IG's reliability programme. We sit within Infrastructure & Operations, working across IG's hybrid estate of on-premises HashiCorp Nomad and AWS. We are not a reactive ops team. We build the platform, standards, and tooling that make reliability the default for every engineering team at IG. Through the SRE Guild, we connect with Domain SREs and Reliability Champions across the organisation, setting the bar and lifting it together.

Your role in the Team's Success

You will be a hands-on technical contributor at the heart of the Platform SRE team, owning pieces of the reliability platform that hundreds of engineers depend on. You will work at the intersection of software engineering, observability, and systems reliability, turning reliability from a reactive concern into a proactive engineering discipline. You will partner with Platform Engineering, product teams, and Reliability Champions to define what good looks like in production and then make it the default. You will contribute to the SRE Guild, mentor engineers across the organisation, and when things go wrong, you will be on the call helping to mitigate, understand, and prevent a repeat.

