Digital Incident & Service Management Expert

External

Globe · 18f The Globe Tower

Full-time · On-site · 2w ago

AWS · Classification · Documentation · GCP · Grafana · Incident Response

Benefits

Health insurance

Additional Information

At Globe, our goal is to create a wonderful world for our people, business, and nation. By uniting people of passion who believe they can make a difference, we are confident that we can achieve this goal.

Job Description

Responsible for ensuring the reliability, performance, and stability of digital channels and APIs across web and mobile platforms. Leads day-to-day digital and API operations, including proactive monitoring, incident management, and service recovery. Partners closely with engineering, IT, and vendors to identify systemic issues, restore services within SLAs, and translate technical incidents into customer and business impact. Drives operational excellence through robust monitoring, analytics, governance, and continuous improvement of incident response and service management practices across multiple environments.

1. Digital Channel & API Operations Management

Monitor end-to-end health of digital channels (web and mobile app), ensuring uptime, stability, and performance against defined SLOs and SLAs.
Proactively track and analyze API performance metrics, including success rate, latency, throughput, and error rates.
Identify systemic issues, degradation patterns, and recurring failure points impacting transactions and system reliability.
Coordinate with IT, platform teams, and external vendors to investigate incidents, validate root causes, and restore services within agreed timelines.
Quantify technical incidents in terms of operational and customer impact to support prioritization and remediation.

2. Incident & Recovery Operations

Serve as the operational incident lead during channel or API disruptions, driving coordination across L2/L3 support, engineering, infrastructure, and vendors.
Ensure proper incident triage, escalation, and resolution in accordance with incident management frameworks (e.g., ITIL).
Oversee execution of recovery actions, post-incident validation, and service normalization activities.
Ensure incidents are logged, tracked, and closed within SLA, with accurate classification for trend analysis and RCA.

3. Technical Monitoring, Analytics & Reporting

Develop and maintain operational dashboards covering availability, latency, error rates, traffic, and incident trends using monitoring tools (e.g., Grafana, AWS).
Analyze operational data to detect early warning signals, capacity risks, or reliability gaps.
Produce weekly and monthly technical operations reports summarizing channel health, incident patterns, and stability risks.
Partner with analytics and engineering teams to correlate system performance with downstream customer and business impact.
Technical skills in GCP and AWS APIs.

4. Operational Excellence & Governance

Execute and continuously improve the Digital Operations Playbook, including monitoring standards, escalation paths, and incident response procedures.
Enforce operational governance through consistent RCA documentation, post-incident reviews, and corrective action tracking.
Coach and guide operations analysts on technical monitoring practices, incident handling, and service management standards.
Govern service transitions from project delivery teams to steady-state L2/L3 operations, ensuring readiness, documentation completeness, and monitoring coverage.

Key KPIs (Technical Operations-Led)

Channel Availability: ≥ 99.5% uptime for web and mobile app
Incident Resolution: ≥ 95% resolved within SLA
Mean Time to Detect (MTTD) & Restore (MTTR): Continuous improvement targets
Recurring Incident Reduction: Measured quarter-over-quarter

Top Deliverables (Action-Oriented, KPI-Driven)

Ensure Channel & API Reliability - Maintain stable, high-performing digital channels through proactive monitoring and controls, driving improvements in availability, MTTD, and overall service reliability.
Drive Fast Incident Detection & Recovery - Lead incident triage, escalation, and resolution to minimize customer impact, with clear accountability for MTTD and MTTR performance.
Prevent Recurrence & Improve Resilience - Convert incident learnings into prioritized fixes, monitoring enhancements, and process changes that reduce repeat incidents and improve long-term MTTR and NPS.
Provide Performance Visibility & Governance - Deliver regular, outcome-focused reporting on MTTD, MTTR, SLA attainment, and NPS impact to enable leadership oversight and data-driven decisions.

Hiring Requirements

Work Experience

At least two years of full-time work experience in web front-end and back-end configuration.
Skills in GCP, MongoDB, AWS, Clairevoyance, and CMS.
Experience in crafting, delivering, or reviewing learning programs is a plus.
Experience in operational readiness, service delivery, or technical enablement is a plus.

Level of Knowledge & Skills

4-6 years of experience in IT, digital operations, platform operations, or service management.
Strong understanding of incident, problem, and service management (ITIL preferred).
Hands-on or working knowledge of monitor

Interested in this role?

Apply on the company's website.
