Observability Platform Maintenance: Support and extend dashboards, metrics tracking, and APM tracing infrastructure inside Datadog and Sumo Logic. Maintain multi-tenant workspaces and universal tagging compliance across teams.
Incident Response Configuration: Configure and manage PagerDuty infrastructure. Maintain service orchestrations, alert routing rules, event intelligence settings, on-call schedules, and native alert integrations with collaboration platforms such as Slack.
FinOps Execution & Data Controls: Optimize telemetry pipeline data flows using Cribl to eliminate noise, drop duplicate fields, and strip out bloated payloads. Ensure high-value signals reach Sumo Logic and Datadog while directing low-value compliance logs to archival cold storage.
Ansible Configuration Management: Fully automate the deployment, onboarding, patch management, and state consistency of monitoring agents (Datadog agents, Sumo Logic collectors, Cribl Edge) and pipeline configurations using Ansible playbooks and roles.
Standardization Compliance: Enforce telemetry schemas, log signatures, and operational golden signals across the enterprise. Collaborate on the implementation and configuration of OpenTelemetry (OTel) collectors.
Team Upskilling & Collaboration: Serve as an engineering mentor across internal product teams, building out technical documentation, runbooks, and leading enablement sessions for modern logging and alerting procedures.
Requirements
3 years of Python development experience.
Proven expertise in Datadog, including AWS integrations and dashboard templating.
Experience with SignalFx/Splunk Observability Cloud and legacy monitoring paradigms.
Experience working across Infra, App, and DevOps teams to create relevant metrics.
Experience applying Site Reliability Engineering (SRE) concepts.
Strong understanding of AWS architecture and cloud-native observability.
Strong understanding of monitoring distributed systems.
Familiarity with OpenShift or Kubernetes.
Familiarity with Ansible.
Familiarity with Infrastructure-as-Code concepts.
Familiarity with OpenTelemetry.
Excellent communication and stakeholder management skills.
Certifications in Datadog, AWS, or related observability platforms.
Experience in enterprise-scale monitoring transformations.
Equal Opportunity Policy (EEO)
Red Hat is proud to be an equal opportunity workplace and an affirmative action employer. We review applications for employment without regard to their race, color, r
Benefits
Remote work options
Additional Information
As the Senior Platform Engineer for Monitoring and Logging, you will be a key engineer responsible for building, scaling, and maintaining our enterprise-wide observability and log management ecosystem. Collaborating closely with your team and the principal engineer, you will focus on the technical execution and engineering of our open telemetry pipeline transformation, ensuring that systems can seamlessly ingest terabytes of logs, metrics, and traces every day. You will directly configure and maintain our data distribution pipelines using Cribl, establish analytical environments in Sumo Logic (log management) and Datadog (monitoring), and help internal customers manage responsive alerting loops through PagerDuty.
This role spans core platform engineering, continuous infrastructure maintenance, and site reliability. You will ensure that product and system teams across the company have the visibility they need into their software stacks, while keeping tight control over data configurations, filtering workflows, and ingestion costs through automated configuration baselines.