Software Engineer (Multiple Levels) - Machine Learning Infrastructure, Slack
About the role
The AI and ML Infrastructure team is part of Slack's Core Infrastructure organization and is responsible for the foundational systems that enable machine learning and AI across the company. The team designs, builds, and operates reliable, scalable, and high performance platforms that allow product and ML teams to develop, deploy, and operate AI driven capabilities with confidence. The team owns shared infrastructure, services, and tooling that support the full ML lifecycle, including model training, deployment, inference, and monitoring.
As Slack AI continues to grow, the team is evolving from traditional ML deployments toward large scale, highly distributed systems. This work involves deep architectural decisions around scalable model deployment strategies, real time feature serving at very high throughput, GPU accelerated inference at message scale, and responsible training of models on sensitive data with strong privacy and safety requirements.
Core Focus Areas
- ML Infrastructure - The ML Infrastructure focus area is responsible for the low level systems that power training and inference at scale. This includes architecting and maintaining distributed systems for model training, serving, and deployment using Kubernetes based platforms, GPU infrastructure, and open source ML stacks such as KubeRay and vLLM. The team delivers platform capabilities that improve the speed, reliability, and quality of ML development, including training pipelines, feature generation systems, and compute orchestration.
- AI Platform - The AI Platform focus area builds the tooling and platform layers that enable AI development across Slack. This includes creating developer facing tools, SDKs, and workflows that allow product teams to integrate AI into Slack features efficiently and safely. The platform supports LLM efficiency and model transition initiatives through integrations with managed services across multiple cloud providers acting as the connective layer between core infrastructure and product engineering teams.
We are looking for Software Engineers to join the ML Infrastructure focus area and help architect and operate the core systems that power AI at Slack. In this role, you will own foundational infrastructure for large scale model training and inference, and evolve it into a reliable, secure, and self service platform used across the company. You will work at the intersection of distributed systems, GPU infrastructure, and modern ML stacks, solving complex scalability and reliability challenges. This role blends deep systems engineering with a strong understanding of the ML lifecycle, and plays a critical part in shaping the long term technical foundations of Slack's AI capabilities.
Responsibilities
- Design, build, and operate systems to train, serve, and deploy machine learning models at scale, with a focus on reliability, performance, and operational simplicity
- Evolve GPU backed inference infrastructure to support high throughput, latency sensitive workloads, including large scale model serving
- Architect and optimize distributed training and data processing systems using platforms such as Ray, Airflow, Spark, or similar technologies
- Build and maintain Kubernetes based platforms and orchestration layers using tools such as KubeRay, vLLM, and internally developed services
- Architect solutions that bridge legacy systems with modern technologies while maintaining monolithic application stability
- Develop robust monitoring, observability, and alerting for production ML workloads to ensure operational excellence
- Partner closely with AI Platform, ML modeling, security, and product engineering teams to design infrastructure that supports evolving AI use cases
- Provide technical leadership through design reviews, mentorship, and by setting engineering standards and long term architectural direction for ML infrastructure
- Author technical design and architecture documentation, and contribute thought leadership through engineering blog posts
- Build and ship high-quality, production-grade software using modern engineering practices, with AI as a core part of your development workflow, pushing the boundaries of AI development tools to deliver secure, optimized code
- Design and orchestrate complex systems where AI agents integrate seamlessly into human workflows, driving efficiency and innova
Requirements
- ML Infrastructure - The ML Infrastructure focus area is responsible for the low level systems that power training and inference at scale. This includes architecting and maintaining distributed systems for model training, serving, and deployment using Kubernetes based platforms, GPU infrastructure, and open source ML stacks such as KubeRay and vLLM. The team delivers platform capabilities that improve the speed, reliability, and quality of ML development, including training pipelines, feature generation systems, and compute orchestration.
Additional Information
To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts.
Job Category
Software Engineering
Job Details
The software engineer role at Salesforce encompasses architecture, design, implementation, and testing to ensure we build products right and release them with high quality. Equally important is advanced prompt engineering - the ability to write precise, structured prompts and cultivate the system context that makes AI outputs reliable, secure, and production-ready.