Inference Optimization Engineer (local / edge runtime)

Intel · California, Santa Clara

Full-time · Hybrid · 3d ago

Linux · Phoenix · Python · Vulkan

Responsibilities

Profile and optimize local inference (llama.cpp-vulkan and vLLM) for latency, throughput, and memory on edge hardware
Tune KV cache, continuous batching, and scheduling for interactive agent workloads
Drive quantization strategy (GGUF / AWQ / GPTQ) and validate quality impact with the Post-Training team
Cut CPU overhead and improve engine startup, model load, and lifecycle (start / stop / health)
Benchmark across hardware tiers and publish honest performance comparisons
Upstream fixes and patches to open-source engines where it helps us

What you'll learn / grow into

Curiosity is required. You will develop an understanding of:
The internals of modern inference engines and where the milliseconds actually go
Hardware-aware optimization across iGPU / CPU paths (Vulkan, SYCL, oneAPI, CUDA where relevant)
The quality-vs-speed-vs-memory trade space for small models
Interest in local / edge AI and squeezing hardware

Requirements

You must possess the minimum qualifications to be initially considered for this position. Preferred qualifications are in addition to the minimum requirements and are considered a plus factor in identifying top candidates.

Required Qualifications

BS/MS in CS, EE, Math or related STEM field
5+ years of software development experience
Strong in C++ and/or Python; comfortable reading systems-level code
Understands how LLM inference works (attention, KV cache, decoding)
Has profiled and optimized real performance problems (CPU or GPU) and can prove the speedup
Linux, build systems, and low-level debugging expertise
Hands-on with llama.cpp, vLLM, ggml, or similar engines
Experience with GPU / accelerator programming (Vulkan, CUDA, SYCL, Metal) or SIMD / CPU kernels
Familiarity with quantization formats and their quality trade-offs
Open-source contributions to inference engines
Requirements listed would be obtained through a combination of industry-relevant job experience, internship experience, and/or schoolwork, classes, or research.

Shift: Shift 1 (United States of America)
Primary Location: US, California, Santa Clara
Additional Locations: US, Arizona, Phoenix; US, California, Folsom; US, Oregon, Hillsboro

Posting Statement:
All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, sex, national origin, ancestry, age, physical or mental disability, medical condition, genet

Benefits

Health insurance

Additional Information

Job Description

Our Mission

At Intel, our journey is to transform AI into something safer, more trustworthy, and respectful of human privacy by design. We believe transformative AI should have a positive impact on people: powerful in capability, yet honest about its limits and protective of the data and resources it touches.

To get there, we build agentic AI that combines the best of local and cloud intelligence: private, affordable, and sustainable by design. Small, efficient models run directly on the user's machine (AI PC, edge, on-prem, and beyond), keeping data private and token costs low, while powerful cloud models handle the hardest work: planning, reasoning, and complex problem-solving. Today, neither approach can deliver this alone. Together, they give people real capability without compromise: data stays private, spend stays predictable, and energy use stays in check. We're building intelligence that scales without sacrificing trust, cost, or the planet, because the future of AI should belong to the people it serves.

Role Summary

Make models fast on the hardware people actually own. You optimize inference engines (llama.cpp, vLLM) for constrained local and edge environments: GPUs/iGPUs and Vulkan backends, mostly PC/edge hardware rather than datacenter H100 environments. KV cache, batching, quantization, scheduling, and CPU-overhead reduction are your daily tools. This is the rare skill that makes a hybrid, low-cost agent product viable.
