Industrial AI Cloud - Platform Engineer (REF5501C)
Responsibilities
- Operate & Evolve Kubernetes Platform: Build, configure, and maintain bare metal hosts and Kubernetes clusters to run GPU/AI workloads.
- Design & Operate the NVIDIA AI-related software stack (Slurm, Run:ai)
- Provide customized application support for AI-related workloads
- Container Orchestration & Automation: Manage Helm charts, GitOps workflows, Ansible scripts, and possibly Terraform code, along with automation for deploying services and AI workloads.
- Operate Kubernetes Workloads: Act as primary contact for all Kubernetes-related topics, including troubleshooting, performance tuning, and scaling.
- CI/CD & GitOps: Develop and maintain CI/CD pipelines with Jenkins and GitLab; implement GitOps practices for consistent deployments and infrastructure changes; work with Terraform at a basic level.
- Monitoring & Observability: Operate and enhance Prometheus and Grafana monitoring stacks for bare metal hosts, Kubernetes and platform services.
- Container Images & Registries: Build, optimize, and secure container images (Docker, Podman); manage registries, versioning, and image scanning (Trivy).
- Object Storage & Persistent Volumes: Integrate and maintain object storage solutions for AI workloads.
- Run AI & HPC Workloads: Support and operate distributed AI workloads within bare metal hosts and Kubernetes environments.
- Collaboration with Infrastructure & AI Teams: Coordinate closely with Infrastructure Engineers, data center staff and AI developers to ensure smooth delivery of services.
- ITIL Processes: Follow incident, problem, and change management workflows; create and maintain operational runbooks. Adhere to ZERO outage guidelines.
Additional Information
General description / Purpose
NVIDIA and Deutsche Telekom are jointly developing the world's first industrial AI cloud for European manufacturers. This AI factory in Germany will host 10,000 GPUs across NVIDIA DGX B200 systems and RTX Pro Servers. Deutsche Telekom provides secure, sovereign and fast infrastructure, including data centers, operations, security, and AI solutions.
Role Overview
We are seeking a Platform Engineer to build, automate, and operate the platform services of the Industrial AI Cloud. This role focuses on running and evolving large-scale Kubernetes clusters, container orchestration platforms, CI/CD pipelines, GitOps and associated automation to support AI workloads, and it calls for experience with Infrastructure as Code. You'll be part of the team supporting Kubernetes workloads, ensuring smooth operations and continuous improvement of the platform layer, while collaborating with infrastructure, security, and AI teams.