Network Operations Center (NOC) Analyst
About the role
Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in. We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.
Responsibilities
- Monitor data center systems using telemetry data, dashboards, and alerting tools to detect anomalies and emerging issues
- Perform independent technical diagnosis across Linux systems, network connectivity, and hardware health using command-line tools, logs, and diagnostic utilities
- Troubleshoot network-layer issues including connectivity, routing, and interface errors
- Triage and escalate incidents to the appropriate teams (hardware, network, SRE) with technically accurate summaries, relevant logs, and telemetry findings
- Create and maintain detailed tickets documenting diagnostic steps, technical findings, and observed system behavior
- Identify recurring alert patterns through telemetry analysis and surface findings to improve monitoring coverage and reliability
Requirements
- This role offers a clear pathway toward positions in network engineering, site reliability, or data center operations, and the opportunity to work with next-generation AI hardware and some of the most advanced compute infrastructure deployed today.
- This role is based onsite at one of our data center facilities in Lisle, IL; Fort Worth, TX; or Quincy, WA. Shift flexibility is required to support our 24/7 operations environment. We are not able to provide visa sponsorship for this position at this time.
Required Qualifications
- Hands-on Linux experience including command-line proficiency and system log analysis
- Practical understanding of networking concepts: TCP/IP, DNS, routing, and diagnostic tools (ping, traceroute, netstat, tcpdump)
- Ability to independently diagnose technical issues and exercise sound judgment in ambiguous situations
- Clear, precise communication skills with strong technical documentation ability
- Availability to work overnight and rotating shifts in a 24/7 environment
Ideal Experience
- Experience with Grafana, Datadog, or Prometheus
- Familiarity with HPC or GPU-based infrastructure
- Scripting experience in Bash or Python