
LLM Pre-training & Distributed Engineer (AI Infrastructure)

Hyphenconnect · Boston
Full-time · On-site · 1mo ago
Kubernetes · Machine Learning · Python · PyTorch


Responsibilities

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking (InfiniBand/RDMA) for inter-node communication, and tune memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.

Required Skills

  • Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
  • Experience managing SLURM or Kubernetes-based GPU clusters.
  • Strong systems engineering background (C++, CUDA, Python).

Additional Information

We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience, and will keep training runs efficient and reliable.


Interested in this role?

Apply on the company's website.
