Distributed Training Engineer

Periodic Labs · Menlo Park

Full-time · Remote · 8mo ago

Reinforcement Learning

About the role

You will optimize, operate, and develop large-scale distributed LLM training systems that power AI scientific research. You will work closely with researchers to bring up, debug, and maintain mid-training and reinforcement learning workflows. You will build tools and directly support frontier-scale experiments to make Periodic Labs the world's best AI + science lab for physicists, computational materials scientists, AI researchers, and engineers. You will contribute to open-source large-scale LLM training frameworks.

You might thrive in this role if you have experience with:

- Training on clusters with ≥5,000 GPUs
- 5D parallel LLM training
- Distributed training frameworks such as Megatron-LM, FSDP, DeepSpeed, and TorchTitan
- Optimizing training throughput for large-scale Mixture-of-Experts models

Additional Information

About Periodic Labs

We are an AI + physical sciences lab building state-of-the-art models to make novel scientific discoveries. We are well funded and growing rapidly. Team members are owners who identify and solve problems without boundaries or bureaucracy. We eagerly learn new tools and new science to push forward our mission.
