Machine Learning Engineer - Inference Optimization

Featherlessai · Remote
Full-time · Remote · 4mo ago
Deep Learning · Machine Learning · Observability · PyTorch

About the role

We're looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You'll work at the intersection of research and production, turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users. This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.

Responsibilities

  • Optimize inference latency, throughput, and cost for large-scale ML models in production
  • Profile GPU/CPU inference pipelines and remove bottlenecks (memory, kernels, batching, IO)
  • Implement and tune techniques such as:
      • Quantization (fp16, bf16, int8, fp8)
      • KV-cache optimization & reuse
      • Speculative decoding, batching, and streaming
      • Model pruning or architectural simplifications for inference
  • Collaborate with research engineers to productionize new model architectures
  • Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)
  • Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
  • Improve system reliability, observability, and cost efficiency under real workloads

Requirements

  • Strong experience in ML inference optimization or high-performance ML systems
  • Solid understanding of deep learning internals (attention, memory layout, compute graphs)
  • Hands-on experience with PyTorch (or similar) and model deployment
  • Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
  • Experience scaling inference for real users (not just research benchmarks)
  • Comfortable working in fast-moving startup environments with ownership and ambiguity
  • Experience with LLM or long-context model inference
  • Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
  • Experience optimizing across different hardware vendors
  • Open-source contributions in ML systems or inference tooling
  • Background in distributed systems or low-latency services

Why Join Us

  • Real ownership over performance-critical systems
  • Direct impact on product reliability and unit economics
  • Close collaboration with research, infra, and product
  • Competitive compensation + meaningful equity at Series A
  • A team that cares about engineering quality, not hype

Benefits

Equity / stock options
