Network Architect & Operations Lead

Gruve · Pune, India

Full-time · On-site · 1mo ago

AWS · Azure · BGP · Capacity Planning · Incident Response · Leadership

Responsibilities

Operate large-scale AI inferencing data center network environments across multiple sites.
Lead and manage a tiered NOC team (L1, L2, L3), driving operational discipline, escalation processes, and continuous uptime of inferencing infrastructure.
Own uptime and SLA commitments for GPU inferencing data centers, ensuring high availability and rapid incident response.
Design and maintain scalable network topologies supporting high-throughput GPU cluster traffic, east-west data flows, API traffic, VPN connectivity, and GPUaaS tenant workloads.
Lead technical design decisions for leaf-spine data center architectures using Arista switching platforms.
Configure, manage, and troubleshoot advanced routing and overlay technologies, with a strong focus on BGP, EVPN, VXLAN, and large-scale traffic engineering.
Manage and optimize Cisco firewall platforms (Firepower/FTD) to ensure secure and efficient traffic flows across tenant and infrastructure networks.
Support GPU cluster networking including high-bandwidth, low-latency east-west traffic between GPU nodes, with familiarity in RDMA over Converged Ethernet (RoCE) or similar low-latency fabrics.
Drive operational excellence through structured NOC methodologies, runbook standardization, incident management, and continuous optimization.
Improve network efficiency through automation, tooling, and operational workflows.
Perform deep troubleshooting and debugging of complex network and performance issues across Arista, Cisco, and Dell infrastructure.
Lead network capacity planning, upgrades, and infrastructure scaling initiatives in line with GPU compute growth.
Collaborate with cross-functional engineering teams (compute, storage, platform, security) to support business growth and ensure high availability.
Document architecture standards, operational procedures, runbooks, and best practices.

Requirements

10+ years of hands-on experience in network architecture or network operations within large-scale enterprise, cloud, or AI/HPC data center environments.
Strong expertise in advanced routing & switching technologies, including BGP, OSPF, EVPN, and VXLAN.
Deep operational understanding of multi-site, high-scale data center network infrastructure.
Hands-on experience with Arista EOS and Arista switching platforms in a data center environment.
Proven experience managing complex network topologies and distributed environments.
Experience leading and managing NOC teams (L1/L2/L3), including escalation frameworks, shift management, and SLA ownership.
Familiarity with high-performance compute (HPC) or GPU cluster networking and associated traffic patterns.
Strong troubleshooting, debugging, and analytical skills across multi-vendor environments.
Practical experience with network automation and operational optimization.
CCIE certification (or equivalent real-world expertise).
Arista ACE (Arista Certified Engineer) certification or equivalent hands-on Arista expertise.
Experience with GPU-as-a-Service (GPUaaS), AI/ML inferencing platforms, or hyperscale compute environments.
Hands-on experience with Cisco Firepower / FTD firewall platforms and enterprise security frameworks.
Experience with Dell PowerEdge

Additional Information

About Gruve

Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. We specialize in cybersecurity, customer experience, cloud infrastructure, and advanced technologies such as Large Language Models (LLMs). Our mission is to assist our customers in their business strategies, utilizing their data to make more intelligent decisions. As a well-funded early-stage startup, Gruve offers a dynamic environment with strong customer and partner networks.

Position Summary

We are looking for a highly experienced Network Architect & Operations Lead to drive the design, operation, and optimization of large-scale AI inferencing data center environments. The ideal candidate will have a strong background in architecting and running complex distributed infrastructures supporting GPU-as-a-Service (GPUaaS) workloads, high-throughput AI inferencing traffic, GPU cluster networking, and globally distributed network topologies. This role is central to ensuring the uptime, reliability, and scalability of our inferencing data centers.

This role requires a senior professional with CCIE-level expertise (or equivalent capability) who can combine architecture, operations, team leadership, and automation skills to ensure network reliability, scalability, and operational excellence across a 24/7 GPU inferencing environment. The ideal candidate will lead and mentor a tiered team of L1, L2, and L3 network engineers, owning end-to-end network operations and SLA commitments for our AI inferencing data centers.

Experience in high-scale environments such as Lambda Labs, Nvidia, Equinix, AWS, Azure, Google Cloud, or similar large AI infrastructure or hyperscale organizations is highly preferred.
