Runtime Engineer
Requirements
- Strong experience in a systems programming language - Rust, C, C++, or Go - including memory management, allocator design, and FFI/ABI work
- Have built Python interop layers in production (PyO3, ctypes, pybind11, or equivalent C-ABI bridging)
- Have designed and maintained API or ABI contracts between teams - versioning, evolution, breaking-change discipline - not just consumed someone else's
- Hands-on with at least one accelerator programming model (CUDA, ROCm, oneAPI Level Zero, TPU, or comparable) - enough to reason about device memory, async execution, and kernel launch
- ML-systems literate - comfortable with the training and inference loop, what collectives do, what a tensor layout is. Research depth not required.
Bonus Points If You Have
- LLM inference internals - vLLM, TensorRT-LLM, or SGLang (paged attention, scheduler design)
- Rust at depth, including proc macros, unsafe with soundness reasoning, and complex lifetime/trait work
- Custom allocator design (slab, paged, arena) or other low-level memory work
- ML framework integration experience (PyTorch custom backends, JAX/XLA, ONNX runtime)
- Profiler or tracing infrastructure work (perfetto, Nsight, or a custom stack)
- Driver-adjacent or kernel-bypass work, or prior new-silicon bring-up
Additional Information
What MatX is Building
MatX is building custom silicon for large-language-model inference and training, with HW/SW co-design across ISA, RTL, simulator, compiler, and kernels so each layer benefits from the others. The runtime owns the host-side stack and the contracts that bind those teams together.
What You'll Do Here
- Build the host-side interface library - device memory management, DMA, streams and events, sync primitives - that every compiler-emitted program runs on top of
- Own and extend the executable format: the compiler→runtime contract, its versioning, the weight and quantization layouts that let compiler and runtime evolve independently
- Design the custom-kernel ABI - calling convention, sync semantics, lifecycle - and the host-side marshaling layer (DLPack, the buffer protocol, numpy) that gets Python tensors to the device
- Build Python bindings via PyO3, with a C-ABI shim as the alternative integration path for downstream consumers
- Build the LLM inference serving stack - paged KV cache, continuous batching, request scheduling, token streaming - and the cluster orchestration primitives underneath it
- Bring up interconnect topology from the host and own the failure-detection and clean-teardown path for stop-restructure-resume recovery across racks
- Design what the chip exposes to host-side profilers and debuggers - perf counters, traces, and the Python surfaces ML engineers actually use - and hit measurable performance targets on runtime overhead and serving throughput