Senior Machine Learning Engineer, On-Device & Mobile AI Optimization

Unity · San Francisco, CA

Full-time · On-site · 6d ago

Budget Management · Caching · Machine Learning · TensorFlow · Transformers · Vulkan

About the role

We are building the next generation of AI-driven game experiences, running generative models on-device, right where the players are - on phones, tablets, laptops, and desktops. Our games run inside a modern, browser-native runtime (built on technologies such as WebGPU and WebNN), so the models that power these experiences must be deployed and accelerated entirely within that runtime.

As a Senior Machine Learning Engineer for On-Device & Mobile AI, you will take state-of-the-art multi-modal models - transformers, diffusion networks, and vision-language models (VLMs) - and make them run fast, small, and reliably on mobile and constrained hardware.

This is a deeply hands-on role. You will own the optimization and deployment of significant parts of the inference stack - from a trained checkpoint leaving research, through export, quantization, and kernel-level tuning, to a shipped feature running inside the engine at interactive frame rates within a fixed memory and power budget. Your work directly shapes the latency, quality, memory footprint, and battery profile of AI features experienced by billions of players.

This role is for an engineer who is energized by the gap between a research model and a shipping, on-device product. If you enjoy profilers, frame captures, op-fusion, and shaving milliseconds and megabytes, this is your role.

Responsibilities

Inference & On-Device Optimization
Own the optimization pipeline for the models you ship: model export, graph transformation, operator fusion, memory-layout planning, and hardware-specific tuning across NPU, mobile GPU, and desktop/laptop GPU.
Apply quantization (INT4/INT8/FP16), weight sharing, structured/unstructured pruning, and knowledge distillation to hit hard latency, memory, and power budgets - and validate them against quality bars.
Do low-level performance work: write and tune WebGPU compute shaders (WGSL) and, where relevant, native kernels (Metal, Vulkan/SPIR-V compute, CUDA); profile with browser and platform tools (Chrome/Dawn GPU traces, PIX, Instruments/Metal System Trace, Snapdragon Profiler, Nsight, RenderDoc); and eliminate bottlenecks at the op and memory-bandwidth level.
Apply efficiency techniques - dynamic resolution, token reduction, cross-frame caching/reuse, reduced-step diffusion samplers - as engineering levers to meet budgets on target SKUs.
Runtime & Systems Integration
Work with WebGPU-targeted inference runtimes (ONNX Runtime Web, Transformers.js, WebLLM, TensorFlow.js) alongside native options (CoreML, ONNX Runtime, TFLite, ExecuTorch), and extend or build glue code where off-the-shelf options fall short of our diffusion and VLM workloads.
Build parts of the integration between the ML runtime and the game engine: real-time scheduling, memory pooling, zero-copy buffer sharing between the inference and render paths, and frame-budget management alongside the renderer.
Build supporting engineering for your components: model packaging and asset pipelines, on-device fallbacks and SKU-aware capability tiers, crash/quality telemetry, and automated on-device benchmarking in CI.
Research Productionization
Partner with research scientists to turn novel CV and multi-modal architectures into implementations that are deployable, debuggable, and fast on device.
Provide a feedback loop into research: surface hardware constraints, op-support gaps, and cost models early so model design and deployment converge.
Track breakthroughs in efficient inference (efficient attention, distillation, reduced-step diffusion) and assess them pragmatically: what actually moves latency/memory/power on our target devices.
Collaboration & Engineering Quality
Contribute to engineering best practices, code-review standards, performance-regression gates, and on-device benchmarking methodology.
Support a culture of measurement: track KPIs for latency, quality, memory, and power for the systems you work on, across the device matrix.
Partner with platform engineers, product managers, and runtime teams to align your work with device-SKU constraints and product roadmaps.
Share knowledge and mentor junior and mid-level engineers through code review, pairing, and design discussion.

Requirements

5+ years in software/ML engineering, with meaningful time focused on on-device / edge inference or real-time, performance-critical systems.
Production deployment of transformer- and/or diffusion-based models (e.g., ViT, Stable Diffusion, CLIP/SigLIP-style encoders) on mobile, desktop, or embedded hardware - shipped, not just prototyped.
Hands-on experience with at least one major inference runtime (ONNX Runtime / ORT Web, CoreML, TFLite, ExecuTorch) and a working understanding of operator fusion, memory layout, and runtime scheduling.
Low-level performance engineering: solid command of at least one GPU/compute API - WebGPU/WGSL, Metal, Vulkan, D3D12, or CUDA - and the profiling tools to go with it. You can read a frame capture and

Benefits

Vision insurance · Paid time off

