Experience: 5+ years in software engineering with proven experience building network platforms, SDN systems, or network automation for production environments.
Kubernetes Networking & Container Orchestration: Strong familiarity with Kubernetes networking architecture, CNI plugins, service networking, and network policies. Understanding of pod networking, services, ingress, and how Kubernetes manages network resources.
Programming Skills: Experience with Go and Python for performance-critical networking components and services is highly valued.
Linux Networking: Strong experience with the Linux networking stack, including network namespaces, iptables/nftables, Open vSwitch, and kernel networking subsystems.
DPU & SmartNIC Experience: Familiarity with DPU/SmartNIC architectures (NVIDIA BlueField or similar), SR-IOV, hardware offload capabilities, and programmable networking hardware, or a strong ability to learn quickly.
High-Performance Networking: Understanding of RDMA, RoCE, InfiniBand, and low-latency networking requirements for distributed computing and GPU workloads.
Problem-Solving & Architecture: Demonstrated ability to solve complex networking performance and scalability challenges while balancing pragmatic shipping with good long-term architecture.
Autonomy
Benefits
Vision insurance
Flexible schedule
Additional Information
Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.
Your Role:
You will be foundational to building our software-defined networking (SDN) platform that enables high-performance, isolated networking for distributed computing, model training, inference, and data-intensive workloads. Working closely with our network, infrastructure, and product teams, you'll design and implement the network orchestration and provisioning systems that manage DPU-accelerated networking, tenant isolation, and network lifecycle management - enabling researchers and engineers to access enterprise-grade networking with cloud-like simplicity.
Job Responsibilities:
Software-Defined Networking Architecture: Collaborate with infrastructure to design and build scalable SDN orchestration systems leveraging NVIDIA BlueField-3 DPUs to deliver programmable, high-performance networking for AI workloads with hardware-accelerated forwarding and isolation.
Research Cluster Networking: Design and implement networking systems for research computing environments including Kubernetes and SLURM clusters, enabling high-performance connectivity, optimized network topology for distributed workloads, and seamless integration with cluster orchestration systems.
Network Provisioning & Lifecycle Management: Implement automated SDN provisioning systems that handle VPC creation, subnet allocation, routing configuration, and network resource lifecycle from deployment through decommissioning.
DPU Platform Engineering: Develop platform capabilities for managing BlueField-3 DPUs including SR-IOV virtual function management, OVS offload configuration, network function deployment, and integration with compute orchestration systems.
Multi-Tenancy & Network Isolation: Build enterprise-grade network isolation using VPCs, VXLAN, and hardware-accelerated forwarding to ensure complete tenant separation while maintaining high-performance connectivity for GPU clusters and distributed workloads.
High-Performance Networking: Collaborate with infrastructure to optimize network paths for RDMA, RoCE, and GPU-to-GPU communication, ensuring minimal latency and maximum throughput for distributed training and large-scale computational workloads.
Network APIs & Integration: Develop robust APIs and SDKs for network resource management that integrate seamlessly with compute and storage platforms, enabling programmatic network provisioning and configuration.
Network Observability: Implement comprehensive network monitoring, telemetry, and troubleshooting systems that provide visibility into network performance, utilization, and tenant traffic patterns.
Security & Policy Management: Build platform network security features including security groups, firewall rules, and policy enforcement that protect tenant workloads while enabling flexible network configuration.