Senior Solutions Architect, AI Factory Observability and Visualization - NVIS

NVIDIA · Worldwide
Full-time · Remote · Posted today

Responsibilities

  • Run AI factory validation tools, microbenchmarks, and workloads provided by the team, and interpret results to assess system health and performance.
  • Gain a comprehensive understanding of the system from start to finish, including network topology, interconnects, and compute.
  • Establish what "healthy" represents across the stack - the metrics, logs, and signals that confirm a system is functioning well, and the thresholds that show it isn't.
  • Build and extend the telemetry surface across hardware, fabric, and workload, crafting how data is collected, transformed, stored, and surfaced.
  • Serve as the observability expert, investigating gaps in visibility to ensure it reflects true system behavior.
  • Develop automation (Python, Shell) for collecting, transforming, and presenting system and network data.
  • Recommend improvements to system visibility, data sources, and reporting that give teams clearer insight.
  • Collaborate with hardware, software, networking, datacenter, and product groups to ready HPC systems and AI factories for customer deployment, contributing documentation and readiness materials throughout the process.

What We Need to See

  • Bachelor's degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field.
  • 6+ years of experience managing Linux-based systems in HPC, distributed systems, or large AI/ML settings.
  • Hands-on experience with the architecture of multi-GPU and/or multi-node clusters, including networking and interconnects.
  • Solid grasp of how HPC and AI factory systems fit together end to end, from network fabric through compute.
  • Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
  • Practical experience working with observability systems (e.g., Prometheus, Grafana, Loki, or similar), including building custom exporters or collectors, setting up alerts, and handling metric cardinality and retention on a large scale.
  • Experience transforming metrics, logs, and traces into clear, actionable insight for complex distributed environments.
  • Familiarity with GPU and fabric telemetry (e.g., DCGM, NVLink, InfiniBand/Ethernet fabric counters) and using it to diagnose performance regressions.
  • Strong communication skills and the ability to work effectively with cross-functional teams.

Ways to Stand Out From the Crowd

  • Experience with AI factory or large-scale AI infrastructure build, deployment, or operations.
  • Background in HPC systems engineering, SRE, or systems analysis for GPU-accelerated environments.
  • Experience building automation and data pipelines that feed dashboards and reporting at scale.
  • Demonstrated desire to use AI to solve practical problems, improve workflows, and guide data-driven decisions.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 28, 2026. This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

Additional Information

NVIDIA's Infrastructure Specialists team is hiring a Senior Solutions Architect - AI Factory Observability & Visualization! This remote role develops full-spectrum visibility that supports the smooth functioning of HPC systems and AI factories, transforming intricate telemetry across network and compute into straightforward, actionable views. The role requires a complete, end-to-end understanding of the HPC/AI system: you will run and interpret microbenchmarks and workloads to confirm system readiness, then establish the observability that maintains that state. The work involves collaborating across NVIDIA teams to help partners see, understand, and respond to HPC system and AI factory performance, from hardware to workload.

