Product Manager, Networking

Fluidstack · San Francisco, CA

$175K–$275K/yr · Full-time · On-site · 1w ago

BGP · Grafana · gRPC · InfluxDB · Observability · Prometheus

About the role

We are hiring a Product Manager to own the tools and systems our team uses to design, deploy, operate, and remediate the networks that run our GPU clusters. That means frontend Ethernet fabrics, backend Ethernet and InfiniBand interconnects, out-of-band management networks, and building management systems. The surface area is wide: BOM generators, configuration generators, digital twins, observability pipelines, and performance profiling tools all sit under this charter.

This is not a role for someone who hands requirements to engineers and waits. You will be the person with the clearest opinion in the room on what needs to be built, why the current state is broken, and what the right architecture looks like. You should be fluent in the underlying technology, having worked hands-on with network gear, streaming telemetry, or large-scale fabric automation at some point in your career. The networking team will trust your judgment because you have earned it technically. The right candidate has a working mental model of how a 400G spine-leaf fabric is cabled, what gRPC-based telemetry looks like at 10,000 devices, and why config generation is harder than it sounds.

You Will

Own the product roadmap for all internal networking tooling: design automation, provisioning, observability, performance analysis, and incident remediation workflows across frontend, backend, OOB, and BMS networks.
Drive the strategy and requirements for digital twin tooling that models physical fabric topology, enabling engineers to validate designs, simulate failures, and test config changes before touching production.
Define and ship BOM generators that produce accurate, version-controlled bills of materials for frontend Ethernet, backend Ethernet, InfiniBand, and OOB networks tied directly to cluster topology specs.
Own the configuration generation pipeline: translate high-level cluster designs into device-ready configs across switches, routers, and OOB management infrastructure, with correctness guarantees and rollback support.
Build the observability stack requirements for network telemetry ingestion (gNMI, SNMP, streaming) into dashboards and alerting systems that give operators sub-minute visibility into fabric health and performance degradation.
Define performance profiling tooling that surfaces InfiniBand and RoCEv2 congestion, all-reduce bottlenecks, and east-west bandwidth saturation at the GPU job level, not just the interface level.
Work with network engineers and site operations to map the full lifecycle of a network event from detection through remediation, then build the tooling that compresses mean time to resolution.
Partner with infrastructure and software engineering teams to integrate networking tooling into the broader cluster lifecycle: from site design through rack-and-stack, burn-in, and steady-state operations.
Define the data model and schema standards that sit underneath all networking tools, ensuring BOM data, topology data, telemetry data, and config state are coherent and queryable across systems.
Conduct working sessions with network engineers, site leads, and operations staff to identify the highest-friction workflows, then prioritize ruthlessly based on operational impact.

Requirements

5+ years of product management experience with at least 3 years focused on infrastructure, networking, or platform tooling.
Direct working knowledge of data center networking technologies: spine-leaf topology, EVPN/VXLAN, BGP, 400G/800G Ethernet, and high-radix switch platforms from vendors such as Arista, Cisco Nexus, or Nvidia Spectrum.
Hands-on familiarity with high-performance interconnects: InfiniBand (HDR/NDR), RoCEv2, and the operational realities of running large-scale RDMA fabrics under AI training workloads.
Working knowledge of network telemetry protocols and frameworks: gNMI/gRPC streaming, SNMP, OpenConfig, and at least one observability stack built on top of them (Prometheus, InfluxDB, Grafana, or equivalent).
Experience shipp

Benefits

Health insurance
Vision insurance

Additional Information

About Fluidstack

We exist to make humanity more free. For most of human history, you farmed or you starved. Technology gave people more time for the things they wanted to do, instead of things they had to do. Powerful AI will be the biggest lever for human choice we've ever built - but only if models are aligned with what humanity actually wants. There are groups building AI who don't share these goals. Whoever deploys frontier compute infrastructure fastest will decide whether AI expands human freedom or shrinks it. We're singularly focused on delivering 10 to 100s of GWs of compute faster than anyone else, rethinking every layer of the stack. We acquire power, design and build data centers, and operate them - with teams spanning hardware and software. Speed and scale are our key differentiators. Come be a part of building civilization-scale infrastructure for AI. We hire people who care deeply about this problem space. If that is you, please apply!
