Staff HPC Systems Software Engineer

Nscale Operations UK Ltd · US

Full-time · On-site · 3d ago

Kubernetes · Observability · Python · System Design

About the role

We're hiring a Staff HPC Systems Software Engineer to define the technical direction and evolution of a core HPC platform domain at Nscale. In this role, you will operate beyond a single team, shaping how multiple teams build, automate, and run Slurm-based capabilities within Nscale's wider cloud-native platform. You'll work across engineering boundaries to bring coherence to architecture, interfaces, lifecycle models, and operational approaches, while partnering closely with teams working on platform tooling, infrastructure APIs, identity systems, and Kubernetes-adjacent systems. This is a high-impact staff-level role for someone who combines deep hands-on software engineering with strong systems judgement. Your work will help ensure Nscale's HPC services are robust, supportable, and maintainable, while creating leverage through shared patterns, reusable implementations, and clear technical direction across ambiguous, business-critical problem spaces.

Responsibilities

Domain Architecture & Technical Direction
Own and evolve the technical direction for a defined HPC systems domain, such as Slurm platform architecture, scheduler integrations, cluster lifecycle, workload environments, or service automation.
Make architectural decisions that balance software quality, operational realities, customer needs, and long-term maintainability.
Define how proven Slurm implementations should be packaged, automated, and exposed as a service.
Resolve ambiguity around ownership, interfaces, lifecycle boundaries, and operating models across teams.
Act as the technical escalation point for the most complex issues within the domain.
Cross-Team Engineering Leverage
Establish shared patterns and standards for automation, service lifecycle management, observability, reliability, and supportability across the HPC platform.
Drive cross-team design for integrations between Slurm, Kubernetes-adjacent systems, infrastructure APIs, identity systems, and platform tooling.
Create reusable modules, automation, deployment patterns, and reference implementations that increase engineering leverage.
Identify and correct avoidable technical divergence, duplicated effort, and fragile operating models.
Ensure domain designs reflect the realities of GPU scheduling, HPC networking, performance isolation, and production operations.
Delivery, Reliability & Influence
Lead technically critical initiatives spanning 2-4 teams or a defined HPC platform area.
Unblock delivery by clarifying technical direction and reducing ambiguity in complex system design problems.
Contribute hands-on where needed to de-risk or accelerate critical work.
Influence engineering teams without formal authority through strong judgement, design clarity, and practical solutions.
Partner with adjacent cloud-native software engineers so HPC implementations build on shared platform patterns rather than separate ones.
KPIs
Technical direction across a defined HPC domain
Delivery of critical initiatives across 2-4 teams
Reduction in technical divergence and duplicated effort
Reliability and supportability of Slurm-based HPC services
About You
Extensive experience designing and building production software and automation for HPC systems, especially Slurm-based environments.
Strong track record of writing maintainable, testable, and resilient software in Go, Python, or similar languages.
Proven ability to define technical direction across a domain spanning multiple teams or services.
Strong understanding of Slurm internals, scheduler behaviour, cluster lifecycle concerns, and operational trade-offs.
Strong practical understanding of GPU-backed infrastructure and HPC networking, including InfiniBand, RoCE, RDMA, and performance-sensitive workload characteristics.
Experience integrating HPC systems with cloud-native platforms, APIs, or service delivery models.
Experience creating engineering leverage through standards, reusable patterns, shared tooling, and architectural clarity.
Strong judgement in balancing short-term delivery with long-term platform health and supportability.
Strong written and

Benefits

Health insurance

Additional Information

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you'll be contributing to building the technology that powers the future.
