Senior Machine Learning Engineer - Research Optimisation

Canva · Sydney, Australia

Full-time · On-site · 1w ago

A/B Testing · AWS · Caching · CI/CD · Documentation · Kubernetes

About the role

You'll be the bridge between research and production. Partnering closely with researchers, you'll ensure experimental code is production ready, integrate models into our monorepo, build shared libraries and services, and create the tooling and processes that let multiple model variants ship safely and quickly. You'll also work across the training stack, profiling and tuning PyTorch workloads, improving GPU utilisation, and shaping how we use distributed training and storage to get the most out of our compute. Your work shortens the research-to-user loop, reduces duplication, and ensures our ML features are reliable, observable, and easy for other teams to adopt.

At the moment, this role is focused on:

Research-to-Production Pipeline: Hardening experimental models (containerisation, tests, CI/CD), making them deployable for real users.
Training Performance and GPU Efficiency: Profiling PyTorch training jobs, improving GPU utilisation, and applying techniques like mixed precision, efficient data loading, and distributed training strategies (FSDP, DDP, DeepSpeed) to reduce time and cost per experiment.
Library Development: Converting experiments into well-factored libraries with clear APIs, dependency hygiene, and versioning, so teams can import rather than copy-paste.
Developer Experience & Documentation: Creating templates, examples, and guidance; offering supportive, high-signal communication so others can adopt libraries confidently.
Reliability, Observability & Cost: Instrumenting services with metrics/logging/tracing, setting SLIs/SLOs, and optimising training and inference performance and spend.

Primary Responsibilities:

Productionise research models: refactor, test, containerise, and integrate them into the monorepo for scalable reuse.
Profile and optimise PyTorch training jobs, working with researchers to identify bottlenecks across compute, memory, I/O, and networking.
Improve distributed training setups (multi-GPU, multi-node) and help teams pick the right parallelism strategy for their workload.
Build and maintain inference services, SDKs, and shared libraries that standardise pre/post-processing and interfaces across variants.
Create multi-variant runners and rollout frameworks (feature flags, canaries, A/B testing, automated rollbacks).
Establish CI/CD workflows, artifact management, and reproducible builds for ML services and model assets.
Add robust observability (dashboards, alerts) and reliability practices (load tests, chaos/resiliency checks) across training and inference workloads.
Optimise inference (batching, caching, quantisation/compilation, hardware utilisation) to reduce latency and cost.
Work across the broader training stack, including Kubernetes orchestration, storage (e.g. Weka, Vast, Lustre), and data pipelines, to remove friction for research teams.
Partner with researchers and product engineers via code reviews, pair sessions, and clear documentation to accelerate adoption.
Drive good engineering hygiene in the research codebase: testing strategy, dependency management, and de-duplication across multiple model variants.

You're probably a match if you:

Have strong software engineering fundamentals and excellent Python skills; you're comfortable turning notebooks and prototypes into production-grade services.
Have shipped ML systems in production (containers, APIs, CI/CD), ideally within a monorepo environment.
Have hands-on experience optimising PyTorch training or inference, profiling workloads, and reasoning about GPU memory, compute, and throughput.
Are comfortable in containerised environments and understand Kubernetes concepts well enough to debug and improve ML workloads running on it.
Can read research code and refactor it into clean abstractions with stable, well-documented interfaces.
Understand service reliability and observability (metrics, tracing, logging) and how they apply to ML systems.
Think holistically about the stack, from storage and networking through to model code, and can hold a credible conversation with researchers, DevOps, and platform engineers alike.
Communicate clearly and empathetically, especially when guiding others to adopt libraries and best practices and mentoring engineers earlier in their ML journey.
Bring cloud experience (AWS a plus) without needing to be a deep specialist.

Requirements

Familiarity with model-serving/optimisation tooling (e.g., ONNX, TorchScript, Triton, quantisation).
Experience writing or optimising CUDA kernels, or using compilation frameworks (torch.compile, Triton, TensorRT) to speed up models.
Experience w

Additional Information

At Canva, our mission is to empower the world to design. To get cutting-edge research into the hands of millions of users faster, we're looking for a Machine Learning Engineer focused on research enablement and performance, turning promising experiments into stable, scalable, user-facing capabilities while making training and inference faster, cheaper, and more reliable.
