Staff Machine Learning Engineer - Computer Vision & Multi-Modal AI
About the role
We are building the next generation of AI-driven game experiences - generative world models, neural rendering, and multi-modal understanding that turn images, text, and 3D primitives into interactive worlds. As our Staff Machine Learning Engineer, you will be a core technical leader bringing state-of-the-art computer vision and multi-modal models - transformers, diffusion networks, vision-language models (VLMs), and JEPA-style architectures - from research into robust, production-grade systems. This is a deeply hands-on, high-impact role. You will help define the modeling and deployment strategy, drive architectural decisions across the ML stack, and mentor a team of senior and mid-level engineers. Your work will directly shape the quality, capability, and performance of AI features experienced by billions of players - across cloud, server, and on-device targets.
Responsibilities
- Technical Leadership
- Help set the technical vision and roadmap for computer vision and multi-modal AI models, spanning transformers, diffusion models, vision-language models, and JEPA-style generative architectures.
- Drive design and implementation of models for image and video understanding, generation, segmentation, detection, and dense prediction, as well as multi-modal reasoning over images, text, and 3D inputs.
- Make sound decisions on model architecture, training strategy, data pipelines, and evaluation - balancing quality, capability, latency, and cost across deployment targets.
- Own the path from research prototype to production: training, fine-tuning, distillation, export, and serving, with deployment spanning cloud GPUs through to efficient on-device inference where the product requires it.
- Architecture & Research Translation
- Collaborate directly with research scientists to translate novel CV and multi-modal model architectures into deployable, well-engineered implementations.
- Design scalable systems for multi-modal inference that process diverse inputs - images, video, text, primitives, and metadata - and produce rich outputs, from semantic predictions to pixel-level generation.
- Track and rapidly adopt breakthroughs across the field: vision-language pretraining and alignment, efficient diffusion (e.g., consistency models, flow matching), efficient attention (e.g., FlashAttention, linear-attention variants), and tokenization/representation learning for vision.
- Where latency or device constraints demand it, apply compression, quantization, pruning, and knowledge distillation, and work with appropriate runtimes (e.g., TensorRT, ONNX Runtime, CoreML, TFLite) to meet performance budgets.
- Team & Cross-Functional Leadership
- Lead and mentor a team of ML engineers; define engineering best practices, code review standards, and rigorous benchmarking and evaluation methodology.
- Partner with research, platform engineers, product managers, and runtime teams to align ML capabilities with product roadmaps and target-platform constraints.
- Champion a culture of measurement: define KPIs for model quality, accuracy, latency, memory, and cost, and ensure the team tracks them rigorously.
Requirements
- 6+ years in ML engineering, with significant depth in computer vision and/or multi-modal modeling.
- Proven production experience with transformer-based and diffusion-based vision models (e.g., ViT, CLIP/SigLIP-style encoders, Stable Diffusion, DETR/SAM-style architectures).
- Strong command of the full model lifecycle: data curation, training and fine-tuning, evaluation, and serving at scale.
- Familiarity with efficient attention, diffusion samplers, multi-modal fusion, and vision-language alignment techniques.
- Strong Python and modern deep-learning tooling (PyTorch); solid software engineering fundamentals.
- Track record of technical leadership: setting direction, influencing cross-functional partners, and growing engineers.
You might also have
- Experience with world-model, video-generation, or neural rendering pipelines (NeRF, 3DGS, or similar).
- Experience deploying models to constrained or on-device targets, including quantization (INT8/INT4/FP16), pruning, distillation, and runtimes such as CoreML, TFLite, ONNX
- Familiarity with mobile SoC accelerators (Apple Neural Engine, Qualcomm Hexagon/Adreno, ARM Mali) or compiler stacks such as MLIR, TVM, or XLA.
- Contributions to open-source ML frameworks or peer-reviewed CV/ML research publications.
- Background in real-time graphics or game engine pipelines (Metal, Vulkan, OpenGL ES).
Additional information
- Relocation support is not available for this position.
- Work visa/immigration sponsorship is not available for this position.