Machine Learning Engineer - Distributed ML Systems

External

Pluralis-research · San Francisco

Full-timeRemote2mo ago

gRPCPythonRouting

Cover Letter Connect

Prepare for this interview

Elite

AI-generated questions, company research, and talking points tailored to this role

About the role

Pluralis Research carries out foundational research on Protocol Learning : multi-participant training of foundation models where no single participant has, or can ever obtain, a full copy of the model. The purpose of Protocol Learning is to facilitate the creation of community-trained and community-owned frontier models with self-sustaining economics. We're looking for Senior/Staff engineers with 5+ years of experience in distributed systems and ML large-scale training. You'll be implementing a novel substrate for training distributed ML models that work under consumer grade internet connection.

Responsibilities

Distributed Training Architecture & Optimization
Design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under low-bandwidth, high-latency conditions.
Develop and optimize model-parallel training strategies (data, tensor, pipeline parallelism) with custom sharding techniques that minimize communication overhead.
Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs.
Build monitoring and metrics systems to track training progress, model quality, and system bottlenecks.
Decentralized Networking & Resilience
Architect resilient training systems where nodes can fail, networks can partition, and participants can dynamically join or leave.
Design and optimize peer-to-peer topologies for decentralized coordination across non-co-located nodes.
Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
Profile and optimize communication patterns to reduce latency and bandwidth overhead in multi-participant environments.

Requirements

Strong experience building and operating distributed systems in production.
Hands-on expertise with distributed training frameworks (FSDP, DeepSpeed, Megatron, or similar).
Deep understanding of model parallelism (data, tensor, pipeline parallelism).
Expert-level Python with production experience (concurrency, error handling, retry logic, clean architecture).
Strong networking fundamentals: P2P systems, gRPC, routing, NAT traversal, distributed coordination.
Experience optimizing GPU workloads, memory management, and large-scale compute efficiency.

Benefits

Equity-heavy compensation with meaningful ownership in a mission-driven companyCompetitive base salary for senior engineering roles in AustraliaVisa sponsorship available for exceptional candidatesRemote-first with optional access to our Melbourne hubWorld-class team - team mates were previously at at Google, Amazon, Microsoft, and leading startupsRemote work optionsEquity / stock options

Your Match

How well this role fits your profile.

Company Intel

What employees say

Worked at pluralis-research? Share your experience

Interested in this role?

Apply on the company's website.

Cover Letter Connect