Senior Site Reliability Engineer

External

8x8inc · Manila-8x8 Asia

Full-time · On-site · Today

Grafana · Incident Response · Jira · Leadership

Responsibilities

Production Operations & Incident Response
Own platform reliability across global UC infrastructure, driving incident response and the overall reliability strategy for your subsystem rather than resolving issues in isolation.
Triage and resolve the hardest issues - service restarts, hung processes, infrastructure failures - and act as the senior escalation point for the NOC and for other engineers when frontline teams hit their limit.
Execute and improve the unglamorous but essential work - scheduled maintenance, certificate renewals, log rotation - and redesign these processes so failure is prevented systemically, not handled case by case.
Lead blameless post-mortems that produce real follow-through, and sign off on the corrective actions that come out of them.
Cross-Team Collaboration
Work directly with Support, Sales, Sales Engineering, NOC, Professional Services, and Engineering teams across 8x8 - this team sits at the operational center of the company.
Translate production events into clear, business-readable communication under pressure; stakeholders across the org depend on your judgment during incidents.
Feed operational insight back into engineering - turning recurring failures and patterns into actionable bug reports, platform improvements, and influence over the architectural roadmap.
Work closely with technical leads to align reliability and automation work with broader engineering goals, and help focus discussion on what matters most.
Reliability Engineering & Automation
Identify recurring manual work and build automation to eliminate it - we treat toil as a bug, not a requirement.
Drive design for the tooling and automation in your domain; anticipate how a change in one component impacts others and account for adjacent domains in your designs.
Understand the limits of our existing tools - and recognize when a problem exceeds those limits and deserves the effort of building a new one.
Take on large-scale technical debt and refactoring across the subsystem, and contribute to the team's coding methodologies and best practices.
Participate in 2-week sprint cycles to deliver automation, tooling improvements, runbook development, and infrastructure initiatives from a structured backlog. Own the functional specifications for large features and sign off on test plans.
Address security issues as they arise - CVEs, misconfigurations, access control gaps - treated as first-class work alongside incident response.
Define and track SLIs, SLOs, and SLAs to drive honest, data-driven conversations about where reliability investment is needed.
Build and maintain dashboards (Grafana, OCI Log Analytics) that give the team genuine signal; tune alerting to eliminate noise - a high-noise on-call is itself a reliability failure.
Leverage AI-powered tooling to accelerate diagnostics and reduce cognitive load at scale.
Technical Leadership & Mentorship
Provide technical leadership for projects involving 1-2 other engineers.
Consistently mentor more junior engineers; be the person other developers seek out for constructive, insightful feedback.
Frequently and actively share knowledge - of your own work, of areas you've worked in, and of obscure corners outside your immediate context - and encourage others to do the same.
Run workshops, contribute to how-to guides, present at demos, and contribute to the team's presentation portfolio.
On-Call & Coverage
Shared on-call rotation, approximately 1 week per month - same expectation for every engineer on the team.
Escalation is always an option and is encouraged; you are expected to drive the response, set the pace for others, and know when to pull people in - not to hero it alone.
Tooling: PagerDuty for alerting, Jira for tracking, OCI Log Analytics and Grafana for diagnostics.
What We'r

Additional Information

8x8 connects our customers and teams globally, empowering CX leaders with performance and insights to make smarter decisions, delight customers, and drive lasting business impact.

About 8x8 UC Operations

The UC Operations team manages the production infrastructure behind 8x8's Unified Communications platform - voice, fax, messaging, and collaboration services used by enterprise customers globally. The team oversees dozens of applications running across more than two thousand service instances worldwide, spanning VoIP infrastructure, messaging brokers, storage systems, and cloud workloads across Oracle Cloud Infrastructure and physical datacenters.

UC Ops sits at the operational center of 8x8 - taking escalations from the NOC, coordinating with Engineering, and working alongside Support, Sales, and Professional Services. The work is complex, the systems are live, and the stakes are real.

We are actively moving from reactive operations to a proactive, automation-first SRE model - and we are looking for a senior engineer who will help drive that transition: setting technical direction, leading initiatives across the subsystem, and raising the bar for the engineers around them.
