AI/ML Technical Leader - Language Model Inference & AI Ops
Requirements
- Bachelor's degree with 9+ years of related experience, or Master's degree with 7+ years of related experience.
- Experience in Python, Java or C++, and building production services for ML/AI workloads.
- Experience with PyTorch/TensorFlow and tooling across the ML lifecycle (data pipelines, training, evaluation, deployment).
- Experience deploying and operating NLP/Generative AI systems in production, including performance tuning and reliability practices.
- Experience working in cross-functional teams, delivering in fast-paced environments, and communicating technical concepts clearly.
Inference & Serving
- Proven experience productionizing LLMs/SLMs with GPU-backed inference and runtime optimization.
- Hands-on experience with inference engines (vLLM, TensorRT-LLM, Triton, SGLang, llama.cpp) and GPU profiling tools (Nsight, PyTorch profiler).
- Working knowledge of speculative/assisted decoding, continuous batching, paged/flash attention, KV-cache management, and structured/constrained decoding (guided JSON, grammar-based).
- Experience with quantization techniques (GPTQ, AWQ, SmoothQuant, FP8, INT4) and accuracy/perf tradeoffs.
- Familiarity with multi-GPU parallelism (tensor, pipeline, expert) and disaggregated serving patterns.
Model Adaptation
- Experience with PEFT (LoRA, QLoRA), distillation, and SLM specialization for domain-specific deployments.
- Familiarity with LLM evaluation (LLM-as-a-judge, golden sets, drift detection, regression gates).
On-Prem, Edge & Infra
- Hands-on experience with on-prem deployment patterns (air-gapped, customer-managed), packaging, integration, and upgrade strategy.
- Exposure to edge/resource-constrained inference (CPU, NPU, small GPU; runtimes like llama.cpp, ONNX Runtime, OpenVINO, MLC).
- Experience with AI infra and MLOps t
Additional Information
The application window is expected to close on 09/30/2026. The job posting may be removed earlier if the position is filled or if a sufficient number of applications are received.
This is a HYBRID role in San Jose, CA. Must be able to work on site 3 days per week.
Meet the Team
Join Cisco's CX AI Incubation Team as an AI Operations Technical Leader and help productionize LLM/SLM capabilities for Intelligent Customer Experiences across cloud and on-prem environments. In Cisco CX, you will build and operate scalable AI systems that move from prototype to production, powering delivery intelligence, network automation, infrastructure testing, and intelligence at the edge. You will collaborate with product and engineering teams to deploy reliable, secure, and observable AI services, optimizing inference performance from CPU and small GPUs to large multi-GPU servers, including air-gapped and customer-managed deployments. You'll work on cutting-edge inference optimization (speculative decoding, continuous batching, quantization, and KV-cache strategies) to deliver cost-effective, low-latency AI across cloud, on-prem, and air-gapped environments.