AI/ML Technical Leader - Language Model Inference & AI Ops

External

Cisco · San Jose, CA

Full-time · Hybrid · 1w ago

Capacity Planning · CI/CD · Generative AI · Java · LLMs · MLOps

Requirements

Bachelor's degree with 9+ years of related experience, or Master's degree with 7+ years of related experience.
Experience in Python, Java, or C++, and in building production services for ML/AI workloads.
Experience with PyTorch/TensorFlow and tooling across the ML lifecycle (data pipelines, training, evaluation, deployment).
Experience deploying and operating NLP/Generative AI systems in production, including performance tuning and reliability practices.
Experience working in cross-functional teams, delivering in fast-paced environments, and communicating technical concepts clearly.
~Inference & Serving
Proven experience productionizing LLMs/SLMs with GPU-backed inference and runtime optimization.
Hands-on experience with inference engines (vLLM, TensorRT-LLM, Triton, SGLang, llama.cpp) and GPU profiling (Nsight, PyTorch profiler).
Working knowledge of speculative/assisted decoding, continuous batching, paged/flash attention, KV-cache management, and structured/constrained decoding (guided JSON, grammar-based).
Experience with quantization techniques (GPTQ, AWQ, SmoothQuant, FP8, INT4) and accuracy/perf tradeoffs.
Familiarity with multi-GPU parallelism (tensor, pipeline, expert) and disaggregated serving patterns.
~Model Adaptation
Experience with PEFT (LoRA, QLoRA), distillation, and SLM specialization for domain-specific deployments.
Familiarity with LLM evaluation (LLM-as-a-judge, golden sets, drift detection, regression gates).
~On-Prem, Edge & Infra
Hands-on experience with on-prem deployment patterns (air-gapped, customer-managed), including packaging, integration, and upgrade strategy.
Exposure to edge/resource-constrained inference (CPU, NPU, small GPU; runtimes like llama.cpp, ONNX Runtime, OpenVINO, MLC).
Experience with AI infra and MLOps t

Additional Information

The application window is expected to close on: 09/30/2026. Job posting may be removed earlier if the position is filled or if a sufficient number of applications are received.

This is a HYBRID role in San Jose, CA. Must be able to work on site 3 days per week.

Meet the Team

Join Cisco's CX AI Incubation Team as an AI Operations Technical Leader and help productionize LLM/SLM capabilities for Intelligent Customer Experiences, across cloud and on-prem environments. In Cisco CX, you will build and operate scalable AI systems that move from prototype to production, powering delivery intelligence, network automation, infrastructure testing, and intelligence on the edge.

You will collaborate with product and engineering teams to deploy reliable, secure, and observable AI services, optimizing inference performance from CPU and small GPUs to large multi-GPU servers, including air-gapped and customer-managed deployments. You'll work on cutting-edge inference optimization (speculative decoding, continuous batching, quantization, and KV-cache strategies) to deliver cost-effective, low-latency AI across cloud, on-prem, and air-gapped environments.
