Principal Solutions Architect (Req#1048)

Eplusinc · San Ramon, CA

Full-time · On-site · 1mo ago

Machine Learning · Microservices · Reinforcement Learning

About the role

We are seeking an elite Solutions Architect to lead the end-to-end design, sizing, and deployment of NVIDIA AI Factory-aligned infrastructure. In this highly technical, customer-facing role you will translate complex AI and machine learning workload requirements into fully engineered infrastructure solutions spanning colocation facilities, GPU compute, high-performance networking, parallel storage, and the complete NVIDIA AI software stack. You will serve as a trusted technical advisor to enterprise and hyperscale customers, partnering with sales, product, and engineering teams to win and deliver transformational AI infrastructure programs. Your expertise will directly shape how organizations build and operate production AI Factories capable of training frontier models, running large-scale inference fleets, and accelerating data science pipelines at scale.

Responsibilities

Solution Design & Architecture
Lead discovery workshops to capture AI/ML workload requirements, including model training scale, inference SLAs, data pipeline throughput, and multi-tenancy needs.
Architect full-stack AI Factory solutions aligned to NVIDIA reference architectures, integrating colocation, GPU compute, networking, storage, and software layers.
Develop detailed Bills of Materials (BOMs), rack elevation diagrams, network topology drawings, and power/cooling budgets for customer proposals.
Define GPU cluster architectures using NVIDIA DGX, HGX, and MGX systems with B200, B300, and GB300 Blackwell SXM and NVLink-Switch configurations.
Design RTX PRO 6000 Blackwell Server Edition deployments for inference-optimized and enterprise AI workloads.
Conduct workload sizing and TCO/ROI modeling to validate infrastructure dimensioning for training, fine-tuning, and inference at scale.
Colocation & Facility Planning
Specify colocation requirements including critical power load (MW-scale), UPS and generator configurations, and PUE targets.
Design high-density GPU deployments utilizing air-cooled, direct liquid cooling (DLC), and rear-door heat exchanger configurations.
Define meet-me room (MMR) and cross-connect requirements; specify carrier-neutral telecom diversity strategies.
Engage colocation providers and data center operators to validate capacity availability and negotiate technical SLAs.
Coordinate with facilities and MEP engineers to validate power infrastructure from utility feed through PDU to rack level.
GPU Compute Infrastructure
Architect multi-node GPU clusters optimized for large language model (LLM) pre-training, fine-tuning, and reinforcement learning from human feedback (RLHF).
Size and configure DGX SuperPOD, HGX H/B-series, and MGX modular systems based on model parameter count, dataset size, and iteration timelines.
Define server firmware, BIOS, BMC, and DGXOS baselines for production GPU infrastructure.
Establish GPU health monitoring, RAS (Reliability, Availability, Serviceability) policies, and lifecycle management procedures.
High-Performance Networking
Design backend GPU fabric networks using NVIDIA Quantum InfiniBand (NDR 400Gb/s and HDR 200Gb/s) for distributed training traffic.
Architect Spectrum-X Ethernet-based AI networking solutions for inference clusters requiring high-bandwidth, low-latency connectivity.
Specify ConnectX-8/7 HCA deployments and configure RDMA over Converged Ethernet (RoCEv2) or InfiniBand transport for NCCL collective operations.
Integrate BlueField-3 DPUs for GPU-accelerated network functions, storage offload, zero-trust security isolation, and bare-metal provisioning.
Design leaf-spine and fat-tree topologies that provide non-blocking bisection bandwidth in GPU training clusters.
Define Quality of Service (QoS) policies separating storage, compute fabric, and management plane traffic.
Parallel Storage Architecture
Design high-performance parallel file system solutions using VAST Data, Hammerspace, and Pure Storage FlashBlade//E for AI training and checkpoint storage.
Size storage capacity, IOPS, and throughput based on dataset characteristics, checkpoint frequency, and concurrent reader/writer counts.
Architect multi-tier storage hierarchies: hot NVMe flash (VAST/FlashBlade) for active datasets, warm object storage for model archives, and cold tape/cloud for long-term retention.
Configure VAST Data Universal Storage for disaggregated storage with NFS, S3, and POSIX access; tune for large sequential read performance.
Deploy Hammerspace Global Data Environment for distributed data management and NFS-over-RDMA acceleration across geographically dispersed GPU clusters.
Define data pipeline architectures ingesting from cloud object stores (S3, GCS, ABS) to local flash for GPU-local data loading without I/O bottlenecks.
AI Software Stack & Orchestration
Deploy and configure NVIDIA AI Enterprise (NVAIE) software stack including NVIDIA GPU Operator, NIM microservices, and RAPIDS accelerated data science libraries.
Architect inference serving infrastructure using N

Benefits

Health insurance · Vision insurance

Interested in this role?

Apply on the company's website.
