Senior Site Reliability Engineer, Observability

Webflow · Remote

Full-time · Remote · 4w ago

AWS · Datadog · Docker · Elasticsearch · GCP · Grafana

About the role

Location: Remote-first (Argentina)
Full-time/Permanent
Application Information: Application deadline: applications accepted on an ongoing basis until position is closed and filled
This posting is for a new position
Reporting to the Engineering Manager of Observability

Requirements

BS / BA college degree or relevant experience
Business-level fluency to read, write and speak in English
You'll thrive as an Observability Site Reliability Engineer if you:
Join our newly formed Observability team responsible for ensuring engineers across Webflow have the tools, data, and practices they need to understand the health and performance of the Webflow application and our hosting services.
Own and evolve Webflow's observability stack, including OpenTelemetry and Datadog, to provide reliable, actionable metrics, traces, and logs across our services.
Regularly dive into the main Webflow application in TypeScript, Node, or Go to better debug (and sometimes fix) behavior in production.
Continuously raise the bar on observability practices by driving adoption of SLOs, distributed tracing, and structured logging throughout engineering.
Build and maintain AI-powered agents and automation that help engineers surface insights faster, reduce alert fatigue, and accelerate incident resolution.
Guide and empower engineers on other teams to instrument their services effectively and introduce new features into production with confidence.
Participate in and continuously improve on-call and incident response processes, with a focus on making observability data the foundation of faster, more effective responses.
Reduce toil by automating common observability workflows to keep the rest of engineering working smoothly with fewer interruptions.
Partner effectively with engineering teams to define, implement, and improve observability practices, enabling them to confidently ship and operate services in production.
Help define the culture of this growing team as it expands its international presence.
About you:
Have either a background as a software engineer with an enthusiasm for observability, infrastructure, and reliability, or a background as an infra or production engineer with an enthusiasm for code.
Have 5+ years of experience building, maintaining, and debugging distributed systems in a customer-facing environment that allows for little to no downtime.
Have hands-on experience with observability platforms and tooling such as Datadog, Grafana, Prometheus, ElasticSearch or similar, and a strong opinion on what good observability looks like.
Have experience with OpenTelemetry or similar instrumentation frameworks for collecting metrics, traces, profiles and logs across distributed services.
Have experience defining and operationalizing SLOs/SLIs at scale.
Have experience navigating and scaling multi-tier cloud environments on either AWS or GCP.
Have experience with container-centric architectures built with tools like Docker and Kubernetes (EKS, GKE, AKS, etc.), or ECS.
Have experience with infrastructure-as-code tools like Terraform or Pulumi.
Have experience contributing to full-stack applications built using software like React, Node.js, and MongoDB or PostgreSQL.
Stay curious and open to growth - demonstrating a proactive embrace of AI, and actively building and applying fluency in emerging technologies to elevate how we work, drive faster outcomes, and expand collective impact.
It would be a bonus if you had even one of the following:
Experience building or operating AI agents that interact with observability data (e.g., automated root cause analysis, intelligent alerting, or natural-language querying of telemetry).
Experience with OpenTelemetry, Kubernetes and Pulumi specifically.
Experience improving on-call and incident response processes for Engineering.
In addition to the responsibilities outlined above, at Webflow we will support

Benefits

Health insurance · Remote work options · Performance bonus

Additional Information

At Webflow, we're building the world's leading AI-native Digital Experience Platform, and we're doing it as a remote-first company built on trust, transparency, and a whole lot of creativity. This work takes grit, because we move fast, without ever sacrificing craft or quality.

Our mission is to bring development superpowers to everyone. From entrepreneurs launching their first idea to global enterprises scaling their digital presence, we empower teams to design, launch, and optimize for the web without barriers. We believe the future of the web, and work, is more open, more creative, and more equitable. And we're here to build it together.

We're looking for an Observability Site Reliability Engineer to improve the reliability and stability of Webflow's customer-facing production infrastructure, serving millions of page views per hour. Our product is used by over 2 million users worldwide across 190 countries, and you'll help ensure our platform is secure and scalable for these users as tens of thousands of projects are launched on Webflow each month.
