Design, implement, and maintain advanced monitoring and alerting solutions for cloud infrastructure, networks, and enterprise SaaS platforms (Workday, Salesforce, NetSuite, etc.)
Build and optimize dashboards to provide actionable insights into system health and performance
Automate repetitive monitoring tasks using scripts or API integrations (Python, PowerShell, or equivalent)
Continuously improve alert thresholds, correlation rules, and incident triage logic to reduce noise and improve detection accuracy
Incident Management & Root Cause Analysis
Act as the second line of defense for complex performance incidents, collaborating with Cloud, Network, and Application teams for resolution
Lead post-incident reviews and contribute to problem management processes
Develop and implement monitoring enhancements to prevent recurrence of major incidents
Provide mentorship and technical guidance to Level 1 Observability Associates
Performance Analysis & Continuous Improvement
Conduct trend analysis on performance metrics and system logs to identify potential bottlenecks and capacity issues
Partner with service owners and technical SMEs to improve observability coverage and service reliability
Propose and implement metric-based service-level indicators (SLIs) and service-level objectives (SLOs)
Evaluate and onboard new observability tools or features to enhance monitoring maturity
Documentation & Knowledge Sharing
Maintain up-to-date runbooks, SOPs, and architecture diagrams for observability systems
Develop internal knowledge articles and training materials for cross-functional teams
Contribute to continuous service improvement (CSI) initiatives within the ITSM framework
Requirements
Education: Bachelor's degree in Computer Science, IT, or a related discipline (or equivalent professional experience).
Experience: 5+ years of experience in IT Operations, NOC, or Observability roles, with at least 2 years in a Level 2 capacity.
Demonstrated experience managing observability for hybrid (cloud/on-premises) environments
Technical Skills: Proficiency with monitoring and observability tools such as Prometheus, Grafana, Datadog, New Relic, Splunk, ELK, or similar.
Strong understanding of networking, cloud infrastructure (AWS/Azure/GCP), and SaaS application monitoring.
Familiarity with APM (Application Performance Monitoring) and synthetic monitoring.
Scripting knowledge in Python, PowerShell, or Bash for automation and data processing.
Experience integrating observability tools with incident management systems (ServiceNow, Jira, PagerDuty, Opsgenie).
Problem-Solving: Strong analytical and troubleshooting abilities
Ability to prioritize and manage tasks in a fast-paced environment
Professionalism: Excellent analytical and problem-solving skills with a proactive mindset
Strong communication skills with the ability to convey technical insights to non-technical stakeholders
Proven ability to operate in a fast-paced, 24x5 global support environment
Working knowledge of ITIL/ITSM processes (Incident, Change, and Problem Management)
Collaboration: Ability to work cross-functionally to identify trends and improve IT services
Work Environment: Willingness to work in a 24x5 support environment
Benefits
Health insurance
Additional Information
Observability Analyst
Location: Bangalore or Pune
Department: Data, Technology & Security
We're seeking an experienced and proactive Observability Analyst (Level 2) to join Procore's Data, Technology & Security team in our Bangalore or Pune office. In this role, you'll play a critical part in enhancing our monitoring and observability practices across global IT and business systems. You will work closely with engineering, network, and application teams to ensure performance, reliability, and transparency across our enterprise platforms.
Reporting to the Director of End User Services & ITSM, Ganesh Annaswamy, you'll use your technical, analytical, and troubleshooting skills to maintain our IT services. Your contributions will be integral to our incident handling processes and overall team success.