Senior Cloud Native Platform Engineer

Nscaleoperationsukltd · US

Full-time · On-site · 1d ago

Bash · CI/CD · DNS · Incident Response · Kubernetes

About the role

We're hiring a Senior Cloud Native Platform Engineer to build, operate, and improve the cloud-native platform foundations that support AI applications and services at scale. In this hands-on platform engineering role, you'll work on shared Kubernetes-based platforms, deployment patterns, observability foundations, infrastructure automation, and operational tooling that help internal teams run services safely and efficiently on GPU-backed infrastructure. You'll partner closely with software engineering, infrastructure, and SRE teams to ensure platform capabilities meet real developer and operational needs. This role is important to the reliability, scalability, and usability of Nscale's platform. You'll take ownership of significant platform components, deliver complex technical work independently, and raise the quality of operations and engineering through practical improvements, sound technical judgement, and mentoring.

Responsibilities

Platform Operations & Engineering
Build and improve shared cloud-native platform capabilities used by internal engineering teams to run AI applications and services.
Own significant parts of the platform area, including Kubernetes cluster operations, workload runtime configuration, deployment workflows, observability foundations, or environment automation.
Improve the reliability, scalability, and supportability of platform services through practical engineering and operational enhancements.
Develop automation, tooling, and configuration that reduce manual effort, improve consistency, and make the platform easier to use and operate.
Apply software engineering where it creates leverage, including scripts, services, CI/CD automation, operational tooling, and platform integrations.
Reliability, Operability & Automation
Improve incident prevention, detection, response, and recovery across the platform areas you support.
Build and refine observability for platform services, including metrics, logs, tracing, dashboards, alerts, and other useful operational signals.
Strengthen rollout safety, capacity awareness, failure handling, and recovery procedures for production environments.
Debug and resolve complex issues spanning Kubernetes, Linux, networking, storage, workload runtime behaviour, and cloud or datacentre infrastructure dependencies.
Enhance operational playbooks, runbooks, and engineering practices to reduce toil and increase service resilience.
Team Technical Contribution
Contribute to design discussions, code reviews, and operational standards within the platform engineering team.
Collaborate with software engineering, infrastructure, and SRE teams to deliver platform capabilities that are practical, supportable, and aligned to operational needs.
Define sensible defaults, paved roads, and supportable patterns for service deployment and runtime operations.
Mentor less experienced engineers in platform engineering fundamentals, operational judgement, and good automation practices.

KPIs

Platform reliability and service resilience
Reduction in manual operational toil
Incident detection, response, and recovery effectiveness
Observability and operational readiness of platform services

About You

Strong hands-on experience operating and improving Kubernetes-based platforms in production.
Solid experience with infrastructure automation, CI/CD, configuration management, or GitOps-style workflows.
Strong understanding of reliability engineering principles, including observability, incident response, failure analysis, and operational readiness.
Experience writing production-quality automation, tooling, or backend code in Go, Python, Bash, or similar languages.
Good Linux fundamentals, including processes, filesystems, cgroups, service behaviour, and system debugging.
Good networking fundamentals, including TCP/IP, DNS, routing, load balancing, and container or overlay networking concepts.
Experience debugging complex production issues across multiple system layers.
Ability to work independently on substantial technical problems while collaborating effectively with adjacent teams.
Experience mentorin

Additional Information

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you'll be contributing to building the technology that powers the future.
