Senior Software Engineer - Together Cloud Platform
About the role
Together AI is building the AI Acceleration Cloud, an end-to-end platform for the full generative AI lifecycle, combining the fastest LLM inference engine with state-of-the-art AI cloud infrastructure. As a Senior Backend Engineer, you will play a key role in building the next generation AI cloud platform - a highly available, global, blazing-fast cloud infrastructure that virtualizes cutting-edge ML hardware (GB200s/GB300s, BlueField DPUs) and enables state-of-the-art ML practitioners with self-serve AI cloud services, such as on-demand + managed Kubernetes and Slurm clusters. This platform serves both our internal StaaS products (inference, fine-tuning) and our external cloud customers, spanning dozens of data centers across the world.
Some of what you'll work on:
- Work on a distributed GPU scheduling system for the on-demand clusters product, Instant Clusters
- Build out a global management plane for managing our data center compute, networking, and storage
- Design and build new customer-facing cloud platform services, delivering killer enterprise AI cloud features
Responsibilities
- Identify, design, and develop foundational backend services that power Together's cloud platform
- Analyze and improve the robustness and scalability of existing distributed systems, APIs, databases, and infrastructure
- Partner with product teams to understand functional requirements and deliver solutions that meet business needs
- Write clear, well-tested, and maintainable software and IaC for both new and existing systems
- Conduct design and code reviews, create developer documentation, and develop testing strategies for robustness and fault tolerance
- Participate in an on-call rotation to address critical incidents when necessary
Requirements
- 5+ years of demonstrated experience building large-scale, fault-tolerant distributed systems and API microservices
- Experience designing, analyzing, and improving the efficiency, scalability, and stability of various system resources
- Excellent communication skills - able to write clear design docs and work effectively with both technical and non-technical team members
- Demonstrated experience with building and operating high-performance and/or globally distributed microservice architectures across one or more cloud providers (AWS, Azure, GCP)
- Strong systems knowledge across compute, networking, and storage, including concurrency, memory management, performant I/O, and scale
- Experience developing against and managing a relational database, such as PostgreSQL
- Expert-level programming skills in one or more programming languages (Golang preferred)
- Proficiency in version control practices and integrating IaC with CI/CD pipelines
- Experience with Kubernetes and containers preferred
- Experience building and operating data infrastructure (Kinesis, Airflow, Kafka, etc.) is a plus
- Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience
About Together AI
Benefits