Principal Software Engineer

NielsenIQ · Chennai, India

Full-time · On-site · 4d ago

Angular, AWS, Azure, Bash, CI/CD, Datadog

Responsibilities

Application Reliability & Support
Own end‑to‑end reliability of multi‑tier applications spanning Angular, Node.js, Java, and Python stacks
Monitor, triage, and resolve production incidents with speed and precision, minimizing customer impact and MTTR
Perform root cause analysis (RCA) on recurring issues and drive permanent fixes through development or platform teams
Define and track SLIs, SLOs, and error budgets aligned to business criticality
Lead blameless post‑mortems and ensure actionable follow‑through on learnings
Proactively identify reliability risks and work with engineering teams to address them before they impact production
Incident Management & Technical Triage
Lead technical triage bridges during P1/P2 incidents, coordinating across application, infrastructure, and vendor teams
Rapidly diagnose issues across the full stack - front‑end rendering, API failures, JVM issues, database bottlenecks, and network anomalies
Establish and maintain runbooks, escalation paths, and incident response playbooks
Drive structured incident timelines, stakeholder communications, and resolution documentation
Champion fast feedback loops between on‑call, engineering, and leadership during high‑severity events
Observability & Monitoring
Design and implement end‑to‑end observability strategies covering logs, metrics, traces, and synthetic monitoring
Build and maintain dashboards, alerting rules, and anomaly detection for Angular, Node.js, Java, and Python applications
Define golden signals (latency, traffic, errors, saturation) and SLO‑based alerting for all critical services
Drive adoption of distributed tracing and correlation of signals across service boundaries
Evaluate and integrate observability tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Dynatrace, Splunk, ELK)
Continuously improve signal‑to‑noise ratio to reduce alert fatigue and improve detection confidence
Automation & Toil Reduction
Identify and eliminate operational toil through automation, scripting, and self‑healing mechanisms
Build and maintain automation scripts in Python, Shell/Bash, or Node.js for diagnostics, remediation, and reporting
Develop automated health checks, smoke tests, and canary validations for releases and deployments
Automate repetitive support workflows such as log analysis, data reconciliation, and environment reset procedures
Contribute to the internal tooling ecosystem to improve operational efficiency across teams
Release & Change Management
Coordinate application releases in alignment with change management processes and release calendars
Conduct pre‑release readiness reviews, validating deployment readiness, rollback plans, and monitoring coverage
Collaborate with development and DevOps teams to define and enforce safe deployment practices (blue‑green, canary, feature flags)
Participate in change advisory board (CAB) processes, providing technical assessment of risk and impact
Maintain deployment runbooks and ensure change traceability across environments
Collaboration - Development, Architecture & Platform Teams
Serve as the operational voice in engineering discussions, advocating for reliability, observability, and supportability
Partner with development teams during design and sprint cycles to embed SRE best practices early
Engage with architects to review designs for failure modes, observability gaps, and operability concerns
Provide production insights and telemetry data to inform architectural decisions and technical debt prioritization
Drive feedback loops from production back to development and architecture teams in a structured, data‑driven manner
Cloud & Infrastructure
Support and operate cloud‑native applications on Azure, AWS, or GCP, leveraging managed services effectively
Manage and troubleshoot containerized workloads using Docker and Kubernetes (AKS / EKS / GKE)
Understand and operate CI/CD pipelines, supporting deployment automation and pipeline r

Benefits

Health insurance

Additional Information

Principal Software Engineer - Site Reliability & Application Support, Chennai

We are looking for a Principal Software Engineer in Site Reliability Engineering (SRE) who defines and drives the reliability strategy for large‑scale, distributed, and cloud‑native applications. This role operates at a company and platform level, bridging the gap between software engineering and operations to ensure our applications are highly available, performant, and resilient at scale.

The scope spans the full application stack (Angular front‑end, Node.js services, Java back‑end, and Python tooling) and encompasses reliability engineering, observability, incident management, and continuous improvement of application health across production environments.

You will act as a technical authority for application reliability and support, leading triage efforts, driving automation to eliminate toil, setting company‑wide SRE standards, and collaborating with development, platform, and architecture teams to embed reliability as a first‑class engineering concern.

Interested in this role?

Apply on the company's website.