Network Engineer, Reliability & Observability

Fluidstack · San Francisco, CA

$150K–$250K/yr · Full-time · On-site · 3w ago

Agile · BGP · Incident Response · Leadership · Machine Learning · Observability

About the role

Fluidstack is seeking a Network Engineer, Reliability & Observability to serve as a reliability engineer who champions and builds processes, data collection, and reliability metrics, with the objective of improving the quality and reliability of AI networks from deployment through the full lifecycle of operations. This role is focused on developing processes, systems, tools, data and data pipelines, and observability to improve the quality of networks and deliver automated metrics (24x7) as well as periodic reliability reports for both internal and external customers.

This role is ideal for experienced network operators who are passionate about reliability and have experience designing and building full-lifecycle software such as Quality Assurance audits, circuit audits, periodic audits, failure rates, and failure analysis. You are passionate about hardware (electronics and optics) and software development, and you value and promote the use of data to make informed decisions in deployment, operations, and strategic sourcing. Experienced Site Reliability Engineers (SREs) with a passion for networking are encouraged to apply.

Focus

Ownership of Quality Assurance: Design, develop, and support QA processes for network hardware and networks.

Pipelines: Develop and deploy serverless workflows and server-based or manually triggered data pipelines that produce network quality and reliability observability for internal and external customers.

Deployment and Operations Support: Support full-lifecycle data collection and analysis, partnering with Deployment, Operations, DC hardware, and logistics teams to produce data that drives process improvements and delivers on SLAs and SLOs.

Process Engineering: Develop, pilot, and deploy process improvements for deployment and repair that produce data and consume data with Machine Learning to fulfill our mission.

Cross-Team Collaboration: Own without ego and execute in a collaborative team with design, deployment, and operations engineers and software developers.

Subject Matter Expert: Deep expertise in at least two subjects such as IP routing, optics, optical transport, Ethernet, RDMA/RoCE, or electrical power.

About You

Strong Operations Background: 5+ years in network engineering and 3+ years in operations with significant hands-on operational experience. You've run production networks or compute, responded to incidents at all hours, and debugged complex failures under pressure. You understand the difference between "working" and "production-ready".

Software Development: You have experience with ITIL, Agile (XP), and TDD, including developing and leading programs and projects. You have experience building hyperscale platforms, demonstrating fluency in Golang with supporting tools in Python or Rust.

Datacenter Fabric Expertise: Deep experience operating modern datacenter networks including EVPN/VXLAN, BGP, Clos topologies, and high-radix switching. You're comfortable troubleshooting Layer 2/3 issues, BGP routing problems, fabric misconfigurations, and physical media failures.

Incident Response Excellence: Proven ability to lead incident response, perform systematic troubleshooting, and drive issues to resolution. You remain calm during outages, communicate clearly with stakeholders, and know when to escalate versus when to dig deeper. You've been the person others call when things break.

Matrix Leadership Experience: You understand how to build relationships with onsite teams, coordinate physical infrastructure work, and represent network engineering in a field environment. You know how to get things done in operational settings with many internal and external teams and stakeholders.

Operational Pragmatism: You balance perfection with progress. You can troubleshoot with imperfect information, make pragmatic decisions under time pressure, and prioritize based on business impact. You document as you go and continuously improve operational processes.

Self Driven: You embrace complex challenges with undefined proces

Additional Information

About Fluidstack

We exist to make humanity more free. For most of human history, you farmed or you starved. Technology gave people more time for the things they wanted to do, instead of things they had to do. Powerful AI will be the biggest lever for human choice we've ever built - but only if models are aligned with what humanity actually wants. There are groups building AI who don't share these goals. Whoever deploys frontier compute infrastructure fastest will decide whether AI expands human freedom or shrinks it.

We're singularly focused on delivering 10 to 100s of GWs of compute faster than anyone else, rethinking every layer of the stack. We acquire power, design and build data centers, and operate them - with teams spanning hardware and software. Speed and scale are our key differentiators. Come be a part of building civilization-scale infrastructure for AI. We hire people who care deeply about this problem space. If that is you, please apply!
