AI Engineer (ML Systems & Infrastructure)

SWAPETECH PTE. LTD. · Singapore
S$192K–S$300K/yr · Full-time · Unknown · Today
Ansible · CI/CD · Grafana · Kubernetes · Linux · Machine Learning

About the role

We are looking for exceptional AI Engineers to build the next generation of AI infrastructure and Machine Learning Systems (MLSys). This role focuses on large-scale system infrastructure rather than model research. You will work on the core foundations that power large-scale AI training and inference systems, including Kubernetes cluster management, RDMA networking, unified KV Cache architecture, observability platforms, distributed systems, GPU orchestration, and CUDA kernel optimisation. You will collaborate closely with AI researchers, infrastructure architects, networking engineers, and platform teams to maximize the efficiency, scalability, and reliability of AI systems.

Responsibilities

  • AI Infrastructure & Kubernetes
      • Design, deploy, and operate large-scale Kubernetes-based AI infrastructure.
      • Develop cluster governance frameworks, scheduling policies, resource isolation, and multi-tenancy capabilities.
      • Build and optimize GPU orchestration platforms using Kubernetes, Slurm, Volcano, Kueue, Ray, and related technologies.
      • Improve cluster utilization, reliability, elasticity, and operational efficiency.
  • RDMA & High-Performance Networking
      • Design and optimize RDMA, InfiniBand, RoCE, and high-speed Ethernet fabrics for distributed AI workloads.
      • Optimize GPU-to-GPU and GPU-to-NIC communication paths.
      • Improve distributed communication efficiency for large-scale training and inference.
      • Analyze and eliminate networking bottlenecks across AI clusters.
  • Unified KV Cache & Distributed Memory Systems
      • Design and implement unified KV Cache architecture across:
          • GPU HBM
          • CPU Memory
          • RDMA-accessible Memory
          • NVMe SSD
          • Distributed Storage
      • Develop efficient KV Cache sharing, migration, offloading, and scheduling mechanisms.
      • Optimize latency and throughput for large-scale inference systems.
  • CUDA & System Performance Optimisation
      • Develop and optimize CUDA kernels for training and inference workloads.
      • Profile and optimize GPU compute, memory, communication, and scheduling efficiency.
      • Contribute to low-level optimization of AI frameworks and inference engines.
      • Work on technologies such as FlashAttention, TensorRT, Triton, NCCL, CUTLASS, and custom operators.
  • Observability & Reliability
      • Build end-to-end observability platforms for AI infrastructure.
      • Design monitoring, logging, tracing, alerting, and troubleshooting frameworks.
      • Develop performance dashboards and SLO-driven operational systems.
      • Improve maintainability, debuggability, and operational excellence of AI platforms.
  • Automation & Platform Engineering
      • Build automation tools for deployment, provisioning, monitoring, and operations.
      • Develop Infrastructure-as-Code (IaC) solutions using Terraform, Ansible, and related tools.
      • Build CI/CD pipelines and engineering productivity platforms.
      • Improve platform scalability and operational efficiency.

Required Qualifications

  • Education
      • Bachelor's degree or above in Computer Science, Software Engineering, Electrical Engineering, or related fields.
  • Technical Skills
      • Strong software engineering and programming skills.
      • Excellent system design capability and strong engineering craftsmanship.
      • Strong coding standards and code quality awareness.
      • Strong sense of ownership, accountability, and execution.
  • System Fundamentals
      • Strong understanding of:
          • Operating Systems
          • Computer Networks
          • Distributed Systems
          • Data Structures and Algorithms
          • Linux Internals
  • Programming Languages
      • Proficiency in one or more of:
          • C++
          • Go
          • Python
          • Rust
  • AI Infrastructure Experience
      • Hands-on experience in one or more of:
          • Kubernetes
          • GPU Infrastructure
          • Distributed Systems
          • AI Infrastructure
          • HPC (High Performance Computing)
          • Cloud-Native Platforms
  • Networking Experience
      • Experience with:
          • RDMA
          • InfiniBand
          • RoCE/RoCEv2
          • GPUDirect
          • NCCL
          • UCX
          • High-Speed Ethernet
  • GPU & Performance Engineering
      • Experience with:
          • CUDA
          • GPU Performance Optimization
          • Multi-GPU Systems
          • Distributed Training
          • Distributed Inference

Requirements

  • Experience building large-scale AI training or inference clusters.
  • Experience with vLLM, SGLang, TensorRT-LLM, Triton, DeepSpeed, Megatron-LM, Ray, or similar frameworks.
  • Experience with unified KV Cache systems, memory hierarchy optimisation, or distributed storage systems.
  • Experience with the NVIDIA GPU Operator and NVIDIA Network Operator on Kubernetes.
  • Experience with Prometheus, Grafana, Loki, OpenTelemetry, and observability platforms.
  • Experience contributing to open-source projects such as vLLM, FlashAttention, CUTLASS, TVM, MLIR, Triton, Kubernetes, or NCCL.
  • Experience working across AI Infrastructure, HPC, Networking, and Silicon Systems is highly desirable.
