Sr. Staff Observability Engineer (GPU Cloud & Telemetry Platform)

Coupang · Seoul, South Korea
Full-time · On-site · 1w ago
Capacity Planning · CI/CD · Datadog · Encryption · Epic · Forecasting

Responsibilities

  • End-to-End Observability Platform Ownership: Design and scale telemetry pipelines using:
      • Grafana Alloy for metrics collection (Prometheus-compatible pipelines)
      • Datadog Vector for high-throughput log ingestion and transformation
      • Grafana Mimir for scalable time-series storage
      • Grafana Loki for log aggregation and querying
  • Strategic Roadmap: Define the multi-year vision for GPU infrastructure observability, transitioning from reactive monitoring to SLO-driven, predictive, and automated observability.
  • High-Cardinality Telemetry Design: Optimize pipelines for GPU workloads characterized by:
      • High-cardinality labels (GPU IDs, tenants, workloads)
      • Burst-heavy workloads (ML training, inference spikes)
      • Multi-tenant isolation requirements
  • Architect low-latency, high-throughput pipelines capable of ingesting:
      • GPU metrics (utilization, memory, thermals, MIG partitions)
      • Kubernetes and container telemetry
      • Distributed system logs and traces
  • Build and optimize metric pipelines (Alloy → Mimir) ensuring:
      • Efficient remote_write tuning
      • Cost-effective retention strategies
      • Horizontal scalability and compaction tuning
  • Design log pipelines (Vector → Loki) with:
      • Structured logging and enrichment
      • Intelligent filtering/sampling
      • Stream partitioning for high-ingest environments
  • Establish deep observability into:
      • GPU hardware (NVIDIA DCGM, MIG, NVLink, PCIe)
      • Kubernetes GPU operators and scheduling behavior
      • Network fabric (RDMA, InfiniBand, TCP performance)
  • Define GPU-specific SLIs/SLOs such as:
      • GPU utilization efficiency
      • Job scheduling latency
      • Cluster fragmentation
      • Thermal and power anomalies
  • Build rich Grafana dashboards for:
      • Real-time GPU fleet health
      • Tenant-level usage and billing insights
      • Capacity planning and forecasting
  • Standardize dashboard frameworks and reusable panels across teams
  • Enable self-service observability for platform and ML engineering teams
  • Drive adoption of SRE principles:
      • SLIs, SLOs, error budgets tailored to GPU workloads
  • Integrate observability into CI/CD and IaC pipelines (Terraform/Kubernetes):
      • Automated canary analysis
      • Observability-driven rollbacks
  • Build automation (Go/Python) for:
      • Pipeline health monitoring
      • Dynamic routing and scaling of telemetry workloads
  • Develop tooling and practices for cross-layer correlation:
      • GPU → Node → Kubernetes → Application → Network
  • Lead deep RCA efforts for:
      • GPU contention issues
      • Performance degradation in ML workloads
      • Telemetry pipeline backpressure/failures
  • Enable "needle-in-a-haystack" debugging using unified logs + metrics
  • Mentor engineers and lead design reviews for observability systems
  • Act as a force multiplier across SRE, Infra, and ML platform teams
  • Promote Observability-by-Design in all new GPU cluster deployments
  • Drive adoption and contribution to:
      • Grafana stack (Alloy, Mimir, Loki, Tempo)
      • OpenTelemetry ecosystem
  • Define build vs. buy decisions (Datadog vs OSS vs hybrid approaches)
  • Optimize interoperability between Vector and OTEL pipelines
  • Architect secure telemetry pipelines with:
      • Encryption in transit and at rest

Benefits

Health insurance · Vision insurance · Remote work options

Additional Information

About Coupang

We exist to wow our customers. We know we're doing the right thing when we hear our customers say, "How did we ever live without Coupang?" Born out of an obsession to make shopping, eating, and living easier than ever, we're collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies and have established an unparalleled reputation as a dominant and reliable force in South Korean commerce.

We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have maintained since our inception. We are all entrepreneurial, surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.

Our mission to build the future of commerce is real. We push the boundaries of what's possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.

Role Overview

We are seeking a Sr. Staff Observability Engineer to lead the design and evolution of our observability platform for a GPU-as-a-Service (GPUaaS) infrastructure. This role will own the end-to-end telemetry strategy, from high-throughput metric ingestion to log pipelines and real-time visualization, powering deep insights into GPU clusters, datacenter systems, and distributed workloads. You will architect and operate planet-scale telemetry pipelines leveraging Grafana Alloy, Mimir, Loki, and Vector, ensuring high-fidelity observability across GPU workloads, Kubernetes clusters, and datacenter infrastructure.


Interested in this role?

Apply on the company's website.
