Technical Account Manager (TAM), AI Factory
External | Full-time | On-site | 1mo ago
Bash, Compliance, Grafana, Observability, Prometheus, Python
About the role
As a Dedicated AI Factory TAM at Together AI, you will serve as the named technical owner for one of our most strategic enterprise relationships. You will be the primary technical point of contact across all infrastructure domains - compute, networking, storage, and facilities - ensuring smooth delivery and operational health of large-scale GPU deployments. This role sits at the intersection of deep infrastructure expertise and high-stakes customer partnership, making you a critical driver of both customer success and company growth.
Responsibilities
- Serve as the named technical point of contact for a dedicated strategic customer, owning the end-to-end technical relationship across compute, networking, storage, and facilities
- Drive structured engagement through regular cadences including status reporting, technical steering meetings, and executive business reviews
- Translate customer operational feedback into actionable input for Engineering, Product, and Infrastructure roadmaps
- Lead issue lifecycle management, escalation, and RCA authorship across all infrastructure domains in partnership with Support, SRE, DC Ops, and Engineering teams
- Own end-to-end RMA coordination and hardware lifecycle management, including acceptance testing, spare inventory management, and hardware health reporting for large-scale GPU deployments
- Maintain deep technical expertise across the customer's infrastructure stack - GPU compute, high-speed fabric, and large-scale storage systems - advising on configuration, operational best practices, and incident resolution
- Own the observability strategy for the customer estate, including alert policy definition, dashboard development, and proactive health management across all infrastructure layers
- Coordinate DC operations and facilities events in partnership with internal teams and hosting providers, ensuring SLA compliance and cluster availability
- Act as project manager for all capacity expansions, owning the full node deployment lifecycle from freight receipt through production acceptance
Requirements
- 5+ years in a customer-facing technical role, with 2+ years in dedicated technical account management or solutions architecture for large-scale AI or HPC infrastructure
- Deep expertise in GPU infrastructure - GPU health diagnostics, RMA workflows, and hardware acceptance testing
- Hands-on experience with large-scale Ethernet and InfiniBand fabric architecture
- Working knowledge of enterprise storage systems, including high-density NVMe, parallel file systems, and metadata infrastructure
- Experience with DC operations, facilities coordination, and hosting provider SLA management
- Strong ownership mindset for incident management, RCA authorship, and executive-level customer communication
- Proficiency in infrastructure monitoring and observability tooling (Prometheus, Grafana, or equivalent)
- Proven ability to manage multiple concurrent workstreams with hyperscaler-level rigor and communication standards
- Proficiency in Python, Bash, or infrastructure automation tools preferred
About Together AI
Location: San Francisco, CA (Hybrid) or New York, NY (Hybrid)
Equal Opportunity: Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
Please see our Privacy Policy at https://www.together.ai/privacy
Benefits
- Health insurance
- Remote work options
- Equity / stock options