Senior Platform Engineer - Cloud & ML Platform (m/f/d)

Quantum-Systems · Gilching

Full-time · On-site · Today

Ansible · AWS · Azure · Bash · Capacity Planning · CI/CD


About the role

As a Platform Engineer - Cloud & ML Platform (m/f/d), you will be a key contributor to the cloud-native infrastructure that powers our AI and autonomy development at global scale. You will design, deploy, operate, and continuously improve Kubernetes-based platforms that enable our teams to train, evaluate, deploy, and monitor machine learning workloads reliably across regions, clouds, and compute environments.

At Quantum Systems, we build intelligent unmanned systems that operate under real-world constraints. Our AI teams depend on scalable, secure, and high-performance infrastructure to turn data, models, and experiments into field-ready capabilities. In this role, you will help build the cloud and ML platform backbone that makes this possible. You will work closely with AI engineers, data engineers, software teams, security, IT, and product stakeholders to provide robust, automated, and developer-friendly infrastructure for large-scale ML workloads. Your work will directly support our mission to push the boundaries of autonomous systems through cutting-edge software, edge computing, and real-time AI-powered data processing.

What is your Day to Day Mission:

- Design, deploy, operate, and continuously improve Kubernetes-based platforms for machine learning and data-intensive workloads.
- Build and maintain globally distributed Kubernetes clusters with a strong focus on reliability, scalability, security, and observability.
- Own the lifecycle management of ML platform components, including Kubeflow, Metaflow, workflow orchestration, experiment tracking, and related MLOps tooling.
- Enable AI and data teams to run scalable training, inference, evaluation, and data processing pipelines across heterogeneous compute environments.
- Develop infrastructure-as-code, automation, and GitOps workflows to ensure reproducible, auditable, and efficient platform operations.
- Manage GPU-enabled workloads, scheduling, storage, networking, secrets, access control, and cost-aware resource utilization.
- Improve platform resilience through monitoring, alerting, incident response, backup strategies, disaster recovery, and capacity planning.
- Collaborate with AI, software, DevOps, security, and IT teams to define platform standards, best practices, and deployment patterns.
- Support hybrid and multi-cloud infrastructure scenarios, including on-premise, private cloud, and public cloud environments.
- Evaluate and integrate cloud providers and infrastructure technologies, including Azure, AWS, Telekom Cloud, or comparable platforms.
- Continuously improve developer experience for ML engineers through self-service tooling, documentation, templates, and platform abstractions.
- Help bring AI capabilities from prototype to production by providing a reliable, scalable, and secure ML infrastructure foundation.

What you bring to the team:

- Strong hands-on expertise with Kubernetes in production environments, including cluster operations, networking, storage, security, scaling, upgrades, and troubleshooting.
- Proven experience deploying and maintaining globally distributed, large-scale clusters for production or mission-critical workloads.
- Strong experience with Kubeflow and Metaflow in production or production-like ML platform environments.
- Solid understanding of MLOps workflows, including training pipelines, model lifecycle management, artifact handling, experiment tracking, reproducibility, and deployment automation.
- Experience operating GPU-enabled Kubernetes environments and supporting high-performance machine learning workloads.
- Strong infrastructure-as-code experience using tools such as Terraform, Helm, Kustomize, Argo CD, Flux, Crossplane, Ansible, or comparable technologies.
- Good understanding of cloud-native observability, including metrics, logs, traces, alerting, dashboards, and SLO-driven operations.
- Experience with containerization, CI/CD, GitOps, secrets management, identity and access control, and secure platform operations.
- Familiarity with cloud platforms such as Azure, AWS, Telekom Cloud, GCP, OpenStack, or comparable private/hybrid cloud environments.
- Strong scripting or programming skills in Python, Go, Bash, or a comparable language.
- Ability to analyze complex infrastructure issues, drive root-cause analysis, and implement robust long-term solutions.
- Structured, analytical mindset with a hands-on attitude and a strong sense of ownership.
- Strong communication skills and the ability to work with globally distributed engineering teams.
- Communication in English is a matter of course for you.

Additional plus:

- Experience building internal developer platforms or ML platform products.
- Experience with distributed storage systems, object storage, data lake architectures, or high-throughput data pipelines.
- Experience with service mesh technologies, policy engines, cluster federation, or multi-cluster management.
- Experience with security-sensitive, regulated, defense, rob
