AI Engineer (ML Systems & Infrastructure)

Swapetech · Singapore

S$192K–S$300K/yr · Full-time · Unknown · Today

Ansible · CI/CD · Grafana · Kubernetes · Linux · Machine Learning

About the role

We are looking for exceptional AI Engineers to build the next generation of AI infrastructure and Machine Learning Systems (MLSys). This role focuses on large-scale system infrastructure rather than model research. You will work on the core foundations that power large-scale AI training and inference systems, including Kubernetes cluster management, RDMA networking, unified KV Cache architecture, observability platforms, distributed systems, GPU orchestration, and CUDA kernel optimisation. You will collaborate closely with AI researchers, infrastructure architects, networking engineers, and platform teams to maximise the efficiency, scalability, and reliability of AI systems.

Responsibilities

AI Infrastructure & Kubernetes
Design, deploy, and operate large-scale Kubernetes-based AI infrastructure.
Develop cluster governance frameworks, scheduling policies, resource isolation, and multi-tenancy capabilities.
Build and optimize GPU orchestration platforms using Kubernetes, Slurm, Volcano, Kueue, Ray, and related technologies.
Improve cluster utilization, reliability, elasticity, and operational efficiency.
RDMA & High-Performance Networking
Design and optimize RDMA, InfiniBand, RoCE, and high-speed Ethernet fabrics for distributed AI workloads.
Optimize GPU-to-GPU and GPU-to-NIC communication paths.
Improve distributed communication efficiency for large-scale training and inference.
Analyze and eliminate networking bottlenecks across AI clusters.
Unified KV Cache & Distributed Memory Systems
Design and implement a unified KV Cache architecture across:
GPU HBM
CPU Memory
RDMA-accessible Memory
NVMe SSD
Distributed Storage
Develop efficient KV Cache sharing, migration, offloading, and scheduling mechanisms.
Optimize latency and throughput for large-scale inference systems.
CUDA & System Performance Optimisation
Develop and optimize CUDA kernels for training and inference workloads.
Profile and optimize GPU compute, memory, communication, and scheduling efficiency.
Contribute to low-level optimization of AI frameworks and inference engines.
Work on technologies such as FlashAttention, TensorRT, Triton, NCCL, CUTLASS, and custom operators.
Observability & Reliability
Build end-to-end observability platforms for AI infrastructure.
Design monitoring, logging, tracing, alerting, and troubleshooting frameworks.
Develop performance dashboards and SLO-driven operational systems.
Improve maintainability, debuggability, and operational excellence of AI platforms.
Automation & Platform Engineering
Build automation tools for deployment, provisioning, monitoring, and operations.
Develop Infrastructure-as-Code (IaC) solutions using Terraform, Ansible, and related tools.
Build CI/CD pipelines and engineering productivity platforms.
Improve platform scalability and operational efficiency.
Required Qualifications
Education
Bachelor's degree or above in Computer Science, Software Engineering, Electrical Engineering, or related fields.
Technical Skills
Strong software engineering and programming skills.
Excellent system design capability and strong engineering craftsmanship.
Strong coding standards and code quality awareness.
Strong sense of ownership, accountability, and execution.
System Fundamentals
Strong understanding of:
Operating Systems
Computer Networks
Distributed Systems
Data Structures and Algorithms
Linux Internals
Programming Languages
Proficiency in one or more of:
C++
Go
Python
Rust
AI Infrastructure Experience
Hands-on experience in one or more of:
Kubernetes
GPU Infrastructure
Distributed Systems
AI Infrastructure
HPC (High Performance Computing)
Cloud-Native Platforms
Networking Experience
Experience with:
RDMA
InfiniBand
RoCE/RoCEv2
GPUDirect
NCCL
UCX
High-Speed Ethernet
GPU & Performance Engineering
Experience with:
CUDA
GPU Performance Optimization
Multi-GPU Systems
Distributed Training
Distributed Inference

Requirements

Experience building large-scale AI training or inference clusters.
Experience with vLLM, SGLang, TensorRT-LLM, Triton, DeepSpeed, Megatron-LM, Ray, or similar frameworks.
Experience with unified KV Cache systems, memory hierarchy optimisation, or distributed storage systems.
Experience with Kubernetes GPU Operator and NVIDIA Network Operator.
Experience with Prometheus, Grafana, Loki, OpenTelemetry, and observability platforms.
Experience contributing to open-source projects such as vLLM, FlashAttention, CUTLASS, TVM, MLIR, Triton, Kubernetes, and NCCL.
Experience working across AI Infrastructure, HPC, Networking, and Silicon Systems is highly desirable.
