SRE Observability SLO Engineer

External

GE Vernova · Queretaro Vernova Que Mx 3

Full-time · On-site · Today

Ansible · AWS · Bash · Compliance · Datadog · Grafana

Benefits

Health insurance · Paid time off

Additional Information

Job Description Summary

GE Vernova's GridOS Platform Engineering team is building the next generation of SaaS reliability for critical energy infrastructure. The Observability & SLO Engineer is the eyes and ears of the GridOS SRE team. In this role you will build and own the full telemetry stack - from instrumentation standards to SLO dashboards to synthetic monitors - that gives GE Vernova and its utility customers real-time confidence in the reliability of mission-critical energy management systems. This is a cyclical, high-impact position: you will drive an intensive initial ramp to establish v1.0 observability coverage across all customer environments, then shift into an ongoing improvement cadence aligned to new product releases and customer onboarding.

Job Description

Roles and Responsibilities

Telemetry Standards & Architecture

Implement organization-wide telemetry standards covering metrics, logs, and distributed traces across all GridOS SaaS services.
Implement metrics collection for Kubernetes-hosted services (EKS/Rancher), including pod-level, namespace-level, and cluster-level metrics.
Working with the SRE Lead and SRE Platform Engineers, help define and implement data retention policies, cardinality budgets, and telemetry cost controls to keep observability economically sustainable.
Publish and maintain an Observability Runbook library covering onboarding, alert tuning, and dashboard standards for Platform SRE and Production DevOps teams.

SLO Definition, Tooling & Governance

Partner with product engineering, Platform SRE, and customer stakeholders to define meaningful Service Level Indicators (SLIs) and Service Level Objectives (SLOs) per product and customer tier.
Build and maintain SLO tooling - error budget burn-rate alerts, burn-rate dashboards, and automated SLO compliance reports.
Govern the SLO review cycle: facilitate monthly SLO reviews, identify reliability risks early, and drive prioritization of reliability work with the SRE Lead.
Translate SLOs into SLAs for customer-facing commitments in coordination with the SRE Team Lead.

Dashboards & Alerting

Design and build operational dashboards covering availability, latency, error rates, and saturation (the 'Golden Signals') for every GridOS SaaS product.
Implement alert policies with noise-reduction practices: symptom-based alerting, multi-window burn-rate rules, and alert deduplication.
Create executive-level dashboards for SRE leadership and customer-facing uptime/availability reports aligned to contractual SLAs.
Establish and maintain alert routing, escalation policies, and on-call schedules in coordination with the incident response workflow.

Synthetic Monitoring

Design and implement a synthetic monitoring plan covering critical user journeys for each GridOS SaaS product and customer environment.
Build synthetic checks for API health, UI flows, and integration endpoints using AWS CloudWatch Synthetics or equivalent tooling.
Define alerting thresholds for synthetic monitors and integrate them into the broader incident detection pipeline.

Continuous Improvement Cadence

After v1.0 delivery, transition into a roadmap-aligned improvement cycle: expand coverage for new features, tune alert signal-to-noise, and retire stale monitors.
Conduct periodic observability health reviews to identify gaps in coverage, reduce MTTD (Mean Time to Detect), and improve MTTR (Mean Time to Resolve).
Collaborate with the Production DevOps engineer on FinOps validation - correlate infrastructure cost metrics with performance and reliability data.

Required Experience

2-3 years in SRE, observability engineering, or infrastructure reliability roles.
Deep expertise with at least one major observability platform - Datadog, Grafana + Prometheus, AWS CloudWatch, Dynatrace, or New Relic.
Hands-on experience implementing SLIs, SLOs, and error budget burn-rate alerting in a production SaaS environment.
Strong understanding of distributed systems telemetry: metrics (Prometheus/CloudWatch), structured logging (CloudWatch Logs Insights, ELK), and distributed tracing (OpenTelemetry, AWS X-Ray).
Experience with Kubernetes observability - kube-state-metrics, node exporters, Helm-deployed monitoring stacks, and namespace-level resource metrics.
Proficiency in at least one query/visualization language: PromQL, Splunk SPL, Datadog Query Language, or CloudWatch Logs Insights query syntax.
Experience designing alerting strategies that minimize alert fatigue through symptom-based and burn-rate approaches.
Scripting skills in Python and/or Bash for automation of monitoring configuration and report generation.

Key Skills and Technologies

Cloud Technologies: AWS Cloud Infrastructure (EKS, RDS, MSK, S3, EC2, EBS, SQS, etc.)
Kubernetes: EKS, Rancher
Infrastructure as Code: Terraform
Deployment and Configuration Tools: Ansible, Chef, or Puppet
Telemetry standards and tools: OpenTelemetry, CloudWatch, CloudTrail
Observability tools and technology: Datadog,

Interested in this role?

Apply on the company's website.