Sr. Staff Observability Engineer (GPU Cloud & Telemetry Platform)

Coupang · Seoul, South Korea

Full-time · On-site · 1w ago

Capacity Planning · CI/CD · Datadog · Encryption · Epic · Forecasting

Responsibilities

End-to-End Observability Platform Ownership: Design and scale telemetry pipelines using:
Grafana Alloy for metrics collection (Prometheus-compatible pipelines)
Datadog Vector for high-throughput log ingestion and transformation
Grafana Mimir for scalable time-series storage
Grafana Loki for log aggregation and querying
Strategic Roadmap: Define the multi-year vision for GPU infrastructure observability, transitioning from reactive monitoring to SLO-driven, predictive, and automated observability.
High-Cardinality Telemetry Design: Optimize pipelines for GPU workloads characterized by:
High-cardinality labels (GPU IDs, tenants, workloads)
Burst-heavy workloads (ML training, inference spikes)
Multi-tenant isolation requirements
Architect low-latency, high-throughput pipelines capable of ingesting:
GPU metrics (utilization, memory, thermals, MIG partitions)
Kubernetes and container telemetry
Distributed system logs and traces
Build and optimize metric pipelines (Alloy → Mimir) ensuring:
Efficient remote_write tuning
Cost-effective retention strategies
Horizontal scalability and compaction tuning
Design log pipelines (Vector → Loki) with:
Structured logging and enrichment
Intelligent filtering/sampling
Stream partitioning for high-ingest environments
Establish deep observability into:
GPU hardware (NVIDIA DCGM, MIG, NVLink, PCIe)
Kubernetes GPU operators and scheduling behavior
Network fabric (RDMA, InfiniBand, TCP performance)
Define GPU-specific SLIs/SLOs such as:
GPU utilization efficiency
Job scheduling latency
Cluster fragmentation
Thermal and power anomalies
Build rich Grafana dashboards for:
Real-time GPU fleet health
Tenant-level usage and billing insights
Capacity planning and forecasting
Standardize dashboard frameworks and reusable panels across teams
Enable self-service observability for platform and ML engineering teams
Drive adoption of SRE principles:
SLIs, SLOs, error budgets tailored to GPU workloads
Integrate observability into CI/CD and IaC pipelines (Terraform/Kubernetes):
Automated canary analysis
Observability-driven rollbacks
Build automation (Go/Python) for:
Pipeline health monitoring
Dynamic routing and scaling of telemetry workloads
Develop tooling and practices for cross-layer correlation:
GPU → Node → Kubernetes → Application → Network
Lead deep RCA efforts for:
GPU contention issues
Performance degradation in ML workloads
Telemetry pipeline backpressure/failures
Enable "needle-in-a-haystack" debugging using unified logs + metrics
Mentor engineers and lead design reviews for observability systems
Act as a force multiplier across SRE, Infra, and ML platform teams
Promote Observability-by-Design in all new GPU cluster deployments
Drive adoption and contribution to:
Grafana stack (Alloy, Mimir, Loki, Tempo)
OpenTelemetry ecosystem
Define build vs. buy decisions (Datadog vs OSS vs hybrid approaches)
Optimize interoperability between Vector and OTEL pipelines
Architect secure telemetry pipelines with:
Encryption in transit and at rest

Benefits

Health insurance · Vision insurance · Remote work options

Additional Information

About Coupang

We exist to wow our customers. We know we're doing the right thing when we hear our customers say, "How did we ever live without Coupang?" Born out of an obsession to make shopping, eating, and living easier than ever, we're collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies and have established an unparalleled reputation as a dominant and reliable force in South Korean commerce.

We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have maintained since our inception. We are all entrepreneurial, surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.

Our mission to build the future of commerce is real. We push the boundaries of what's possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.

Role Overview

We are seeking a Sr. Staff Observability Engineer to lead the design and evolution of our observability platform for a GPU-as-a-Service (GPUaaS) infrastructure. This role will own the end-to-end telemetry strategy, from high-throughput metric ingestion to log pipelines and real-time visualization, powering deep insights into GPU clusters, datacenter systems, and distributed workloads. You will architect and operate planet-scale telemetry pipelines leveraging Grafana Alloy, Mimir, Loki, and Vector, ensuring high-fidelity observability across GPU workloads, Kubernetes clusters, and datacenter infrastructure.

Interested in this role?

Apply on the company's website.
