Platform Support Engineer (APAC)

Lightning AI · Worldwide

Full-time · Remote · 3w ago

Documentation · Grafana · Kubernetes · Linux · Machine Learning

About the role

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in. We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

Responsibilities

Work Directly With ML Engineers
Partner directly with customer engineering teams running training and inference workloads in production
Help customers diagnose and resolve complex distributed systems and ML infrastructure issues
Act as a technical advisor during high-impact incidents and platform degradation events
Translate infrastructure-level issues into actionable guidance for ML engineers
Build credibility with customers through strong technical reasoning and clear communication
Debug ML Infrastructure & Distributed Workloads
Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems
Troubleshoot issues related to PyTorch, CUDA, NCCL, and inference serving
Analyze logs, metrics, traces, and system behavior to isolate root causes
Debug containerized workloads running across Kubernetes and bare-metal GPU environments
Support customers scaling workloads across multi-node GPU systems
Diagnose performance bottlenecks involving compute, memory, networking, or storage
Improve Reliability & Platform Operations
Identify recurring patterns across customer issues and drive long-term reliability improvements
Contribute to post-incident reviews and operational improvements
Build internal tooling, automation, documentation, and runbooks
Partner closely with infrastructure, networking, and platform engineering teams
Help improve observability, operational visibility, and troubleshooting workflows
Improve the customer experience through better processes and technical guidance
What This Role Is Not
To set clear expectations:
This is not a traditional help desk or ticket-routing support role
This is not purely customer success or account management
This is not a backend engineering role
This is not a passive escalation position
This role is for engineers who enjoy solving difficult technical problems while working closely with other engineers.

Requirements

Lightning AI is looking to hire a Platform Support Engineer to join our APAC Customer Experience team, supporting ML engineers running large-scale training and inference workloads across cloud infrastructure, Kubernetes, and GPU platforms in production environments.
This role is remote and open to candidates based in either the Philippines or Singapore. The role follows a Thursday-Sunday schedule, with working hours from 7:00 AM to 5:00 PM local time (UTC+8).
Required Qualifications
Infrastructure & Systems
Strong software engineering and systems troubleshooting background
Experience with Kubernetes and containerized environments
Linux systems knowledge, including networking, storage, process management, and performance tuning
Experience with cloud infrastructure and distributed systems
Experience with observability and debugging tools such as Prometheus, Grafana, or OpenTelemetry
ML Infrastructure Experience
Hands-on experience operating machine learning workloads in production or research environments
Experience with distributed ML systems and tooling such as PyTorch, CUDA, or NCCL
Familiarity with GPU infrastructure and orchestration
Experience troubleshooting performance, reliability, or scaling issues in ML infrastructure
Understanding of the operational challenges involved in running ML systems at scale
Collaboration
Strong communication skills and ability to work directly with highly technical customers and engineering teams
Comfortable operating in fast-moving, highly ambiguous environments
Enjoys solving complex technical problems collaboratively
Experience with lar

Benefits

Remote work options
