Principal Software Engineer
Responsibilities
- Application Reliability & Support
- Own end‑to‑end reliability of multi‑tier applications spanning Angular, Node.js, Java, and Python stacks
- Monitor, triage, and resolve production incidents with speed and precision, minimizing customer impact and MTTR
- Perform root cause analysis (RCA) on recurring issues and drive permanent fixes through development or platform teams
- Define and track SLIs, SLOs, and error budgets aligned to business criticality
- Lead blameless post‑mortems and ensure actionable follow‑through on learnings
- Proactively identify reliability risks and work with engineering teams to address them before they impact production
- Incident Management & Technical Triage
- Lead technical triage bridges during P1/P2 incidents, coordinating across application, infrastructure, and vendor teams
- Rapidly diagnose issues across the full stack - front‑end rendering, API failures, JVM issues, database bottlenecks, and network anomalies
- Establish and maintain runbooks, escalation paths, and incident response playbooks
- Drive structured incident timelines, stakeholder communications, and resolution documentation
- Champion fast feedback loops between on‑call, engineering, and leadership during high‑severity events
- Observability & Monitoring
- Design and implement end‑to‑end observability strategies covering logs, metrics, traces, and synthetic monitoring
- Build and maintain dashboards, alerting rules, and anomaly detection for Angular, Node.js, Java, and Python applications
- Define golden signals (latency, traffic, errors, saturation) and SLO‑based alerting for all critical services
- Drive adoption of distributed tracing and correlation of signals across service boundaries
- Evaluate and integrate observability tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Dynatrace, Splunk, ELK)
- Continuously improve signal‑to‑noise ratio to reduce alert fatigue and improve detection confidence
- Automation & Toil Reduction
- Identify and eliminate operational toil through automation, scripting, and self‑healing mechanisms
- Build and maintain automation scripts in Python, Shell/Bash, or Node.js for diagnostics, remediation, and reporting
- Develop automated health checks, smoke tests, and canary validations for releases and deployments
- Automate repetitive support workflows such as log analysis, data reconciliation, and environment reset procedures
- Contribute to the internal tooling ecosystem to improve operational efficiency across teams
- Release & Change Management
- Coordinate application releases in alignment with change management processes and release calendars
- Conduct pre‑release readiness reviews, validating deployment readiness, rollback plans, and monitoring coverage
- Collaborate with development and DevOps teams to define and enforce safe deployment practices (blue‑green, canary, feature flags)
- Participate in change advisory board (CAB) processes, providing technical assessment of risk and impact
- Maintain deployment runbooks and ensure change traceability across environments
- Collaboration - Development, Architecture & Platform Teams
- Serve as the operational voice in engineering discussions, advocating for reliability, observability, and supportability
- Partner with development teams during design and sprint cycles to embed SRE best practices early
- Engage with architects to review designs for failure modes, observability gaps, and operability concerns
- Provide production insights and telemetry data to inform architectural decisions and technical debt prioritization
- Drive feedback loops from production back to development and architecture teams in a structured, data‑driven manner
- Cloud & Infrastructure
- Support and operate cloud‑native applications on Azure, AWS, or GCP, leveraging managed services effectively
- Manage and troubleshoot containerized workloads using Docker and Kubernetes (AKS / EKS / GKE)
- Understand and operate CI/CD pipelines, supporting deployment automation and pipeline r
Benefits
Additional Information
Principal Software Engineer - Site Reliability & Application Support, Chennai
We are looking for a Principal Software Engineer in Site Reliability Engineering (SRE) who defines and drives the reliability strategy for large‑scale, distributed, and cloud‑native applications. This role operates at a company and platform level, bridging the gap between software engineering and operations to ensure our applications are highly available, performant, and resilient at scale.
The scope spans the full application stack (Angular front‑end, Node.js services, Java back‑end, and Python tooling) and encompasses reliability engineering, observability, incident management, and continuous improvement of application health across production environments.
You will act as a technical authority for application reliability and support, leading triage efforts, driving automation to eliminate toil, setting company‑wide SRE standards, and collaborating with development, platform, and architecture teams to embed reliability as a first‑class engineering concern.