AI & HPC Infrastructure Engineer

FirstPrinciples · ON, Canada

Full-time · Remote · 3w ago

Ansible · ArgoCD · AWS · Azure · Capacity Planning · Documentation

Responsibilities

Design, deploy, and operate Kubernetes infrastructure for AI inference, research, and engineering workloads
Set up and manage GPU and HPC-style compute environments, including scheduling, utilization, job management, and node-level troubleshooting
Work with systems such as Kubernetes, Slurm or similar schedulers, container runtimes, GPU drivers and libraries (e.g., CUDA), storage systems, and observability tools
Build and manage Linux-based compute environments, including provisioning, networking, storage, monitoring, access control, and lifecycle management
Help architect bare metal, cloud, and hybrid infrastructure across AWS, GCP, Azure, or equivalent platforms
Own the reliability and operational health of infrastructure systems, including monitoring, alerting, incident response, capacity planning, and performance tuning
Improve deployment workflows, automation, configuration management, secrets management, and infrastructure-as-code practices
Partner with ML engineers, researchers, and software engineers to understand workload requirements and translate them into practical infrastructure designs
Evaluate tradeoffs between managed cloud services, self-managed Kubernetes, HPC schedulers, bare metal deployments, and multi-cloud architectures
Build tooling, documentation, runbooks, and operational practices that help the team move quickly without making infrastructure fragile or opaque
Balance speed and robustness, knowing when to prototype quickly and when to harden systems for long-term use

Requirements

Strong infrastructure builder with experience operating production, research, cloud, or high-performance compute systems
Deeply comfortable with Linux administration, including debugging networking, storage, system services, permissions, performance issues, and node-level failures
Experienced with Kubernetes in real environments, including cluster operations, deployments, networking, observability, scaling, and troubleshooting
Comfortable working with cloud infrastructure on AWS, GCP, Azure, or equivalent platforms
Familiar with infrastructure automation and configuration tools such as Terraform, Ansible, Helm, ArgoCD, GitOps workflows, or similar systems
Experienced with GPU-heavy, compute-heavy, or HPC-style workloads, especially in environments involving AI, ML, research computing, or scientific workloads
Able to work across bare metal and cloud environments, and interested in the practical tradeoffs between the two
Comfortable reasoning about resource scheduling, clu

Benefits

Health insurance · Vision insurance · Remote work options

Additional Information

About FirstPrinciples

FirstPrinciples is a research organization building AI infrastructure for discovery in fundamental science. Currently, our work focuses on building systems like Theo, the AI Physicist, which is a domain-specialized system for research in fundamental physics.

We're a fast-growing, remote-first team of builders, researchers, engineers, and thinkers working across Canada, the US, the UK, and expanding globally. What brings us together is a shared curiosity about how the universe works, and a belief that we can build systems that help us explore it more effectively.

We spend our time working on questions that don't have clear answers, like how to design AI that can reason through scientific problems, and how the scientific process as a whole might evolve. This is work that sits somewhere between creativity and rigorous thinking, and often requires comfort with ambiguity and iteration. If you're someone who enjoys tackling big, abstract problems and building the infrastructure that makes ambitious research possible, you'll likely find the work here interesting.

Why This Role Exists

We're building the next generation of infrastructure for AI-driven scientific discovery, and we need someone who can help own the systems that make our research and inference workloads reliable, scalable, and fast. This role is about building and operating the compute foundation behind our AI Physicist: Kubernetes clusters, Linux systems, GPU infrastructure, cloud environments, HPC-style compute, deployment workflows, monitoring, and automation. As our workloads grow, we need infrastructure that can support both experimentation and production-like inference across cloud, bare metal, and hybrid environments.

You'll play a central role in shaping how we run compute at FirstPrinciples. That includes provisioning and managing clusters, improving reliability and observability, reducing operational toil, supporting researchers and engineers, and helping us make practical decisions about when to use managed cloud services, self-managed Kubernetes, Slurm-style systems, or owned hardware.

We're looking for someone hands-on, systems-oriented, and comfortable working in a fast-moving research environment. You should have strong Kubernetes and Linux fundamentals, good operational instincts, and enough experience with cloud and HPC/GPU infrastructure to help us build toward a robust bare metal and multi-cloud inference platform.
