2026 · Position Paper
arXiv Technical Report
vLLM Semantic Router: Signal-Driven Decision Routing for Mixture-of-Modality Models
A signal-driven decision routing framework for Mixture-of-Modality deployments that composes heterogeneous signals into routing policies across cost, privacy, latency, and safety constraints.
Authors: vLLM Semantic Router Team
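The signal-composition idea above can be sketched as a two-stage policy: hard constraints (privacy, safety, latency) filter the candidate pool, then cost breaks ties. This is a minimal sketch; the field names and signal schema below are hypothetical, not the paper's actual API.

```python
def route(models, signals):
    """Compose heterogeneous signals into one routing decision:
    hard constraints filter first, then the cheapest survivor wins.
    All field names are illustrative placeholders."""
    feasible = [
        m for m in models
        if m["privacy_tier"] >= signals["required_privacy_tier"]
        and m["safety_score"] >= signals["min_safety"]
        and m["p50_latency_ms"] <= signals["latency_budget_ms"]
    ]
    return min(feasible, key=lambda m: m["cost_per_1k_tokens"]) if feasible else None
```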
2026 · Vision Paper
arXiv Technical Report
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
Synthesizes routing, fleet, multimodal, and governance results into the Workload-Router-Pool architecture, connecting signal-driven routing to full-stack inference optimization.
Authors: Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang
2026
arXiv Technical Report
Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents
Formalizes the visual confused deputy as a security failure mode in computer-using agents and introduces a dual-channel guardrail that checks click targets and action reasoning before execution.
Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
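The dual-channel guardrail described above can be illustrated as two independent gates that must both pass before an action executes. The allowlist and phrase checks below are toy stand-ins, not the paper's actual detectors.

```python
def dual_channel_guard(action, allowed_targets, banned_phrases):
    """Gate a computer-use action on two channels:
    (1) the resolved click target must be on an allowlist, and
    (2) the agent's stated reasoning must not carry banned intent.
    Field names and check logic are illustrative assumptions."""
    target_ok = action["click_target"] in allowed_targets
    reasoning = action["reasoning"].lower()
    reasoning_ok = not any(phrase in reasoning for phrase in banned_phrases)
    return target_ok and reasoning_ok
```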
2026
arXiv Technical Report
Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference
Introduces OATS, an offline embedding refinement method that improves semantic-router tool ranking under single-digit millisecond CPU budgets without serving-time model inference.
Authors: Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu
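The split between offline refinement and inference-free serving can be sketched as follows: tool embeddings are corrected ahead of time, and ranking at serving time is pure vector arithmetic on CPU. How OATS actually learns the corrections is not shown; the additive offsets here are a placeholder.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def refine_offline(tool_embeddings, offsets):
    # Offline step: shift each tool embedding by a learned correction
    # (the learning procedure itself is out of scope for this sketch).
    return {
        name: [x + d for x, d in zip(vec, offsets.get(name, [0.0] * len(vec)))]
        for name, vec in tool_embeddings.items()
    }

def rank_tools(query_embedding, tool_embeddings):
    # Serving time: similarity sort only, no model inference.
    return sorted(tool_embeddings,
                  key=lambda name: cosine(query_embedding, tool_embeddings[name]),
                  reverse=True)
```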
2026
arXiv Technical Report
Adaptive Vision-Language Model Routing for Computer Use Agents
Proposes Adaptive VLM Routing, which estimates action difficulty and routes computer-use agent steps to the cheapest model that still satisfies a target reliability threshold.
Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
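The routing rule above reduces to: among models whose estimated success probability at the current difficulty clears the target, pick the cheapest; if none clears it, fall back to the most reliable. The linear reliability curves in the example are purely illustrative.

```python
def route_step(models, difficulty, target_reliability):
    """Route one agent step to the cheapest model that still meets the
    target reliability at this difficulty; fall back to the most
    reliable model when none qualifies. A sketch, not the paper's
    actual difficulty estimator."""
    feasible = [m for m in models
                if m["reliability"](difficulty) >= target_reliability]
    if not feasible:
        return max(models, key=lambda m: m["reliability"](difficulty))
    return min(feasible, key=lambda m: m["cost"])
```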
2026
arXiv Technical Report
98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router
Uses Flash Attention, prompt compression, and near-streaming body processing to cut routing latency from seconds to tens of milliseconds while keeping the router lightweight enough to share hardware with serving.
Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
2026
arXiv Technical Report
inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference
A queueing-theory-grounded fleet planner and discrete-event simulator for sizing multi-pool LLM GPU fleets against P99 TTFT targets without up-front profiling runs.
Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu
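The queueing-theory core of such a planner can be sketched with the standard Erlang-C model: compute the probability of queueing in an M/M/c pool, derive a P99 queueing delay from the exponential conditional wait, and grow the replica count until the target is met. This is a textbook stand-in; the paper's simulator presumably models LLM-specific structure (prefill/decode phases, TTFT decomposition) that a plain M/M/c queue does not capture.

```python
import math

def erlang_c(c, offered_load):
    # Probability an arriving request must queue in an M/M/c system.
    rho = offered_load / c
    if rho >= 1.0:
        return 1.0
    head = sum(offered_load ** k / math.factorial(k) for k in range(c))
    tail = offered_load ** c / math.factorial(c) / (1.0 - rho)
    return tail / (head + tail)

def p99_queue_wait(c, arrival_rate, service_rate):
    # In M/M/c the conditional wait is exponential with rate
    # (c * mu - lambda), so the 99th percentile follows directly.
    p_wait = erlang_c(c, arrival_rate / service_rate)
    if p_wait <= 0.01:
        return 0.0
    return math.log(p_wait / 0.01) / (c * service_rate - arrival_rate)

def size_pool(arrival_rate, service_rate, p99_target, max_replicas=512):
    # Smallest replica count whose P99 queueing delay meets the target.
    c = max(1, math.ceil(arrival_rate / service_rate))
    while c <= max_replicas:
        if arrival_rate < c * service_rate and \
           p99_queue_wait(c, arrival_rate, service_rate) <= p99_target:
            return c
        c += 1
    raise ValueError("target unreachable within max_replicas")
```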
2026
arXiv Technical Report
FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism
Derives the minimum-cost two-pool LLM fleet directly from workload CDF and P99 TTFT targets, then uses Compress-and-Route to make the boundary deployable in practice.
Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu
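The two-pool boundary search can be illustrated with a toy cost model: serving a request on a pool costs proportionally to that pool's context window, requests at or under the boundary go to the small pool, and candidate boundaries are swept over an empirical length sample (a stand-in for the workload CDF). The linear cost model and constants are assumptions, not the paper's derivation.

```python
def fleet_cost(boundary, request_lengths, large_window=32768, unit=1.0 / 8192):
    # Toy model: one request on a pool with window W costs unit * W.
    # Requests fitting the boundary go to the small (cheap) pool.
    short = sum(1 for n in request_lengths if n <= boundary)
    long_ = len(request_lengths) - short
    return short * boundary * unit + long_ * large_window * unit

def best_boundary(request_lengths, candidates):
    # Sweep candidate small-pool windows; keep the minimum-cost one.
    return min(candidates, key=lambda b: fleet_cost(b, request_lengths))
```

A mostly-short workload pushes the boundary down: a tight small pool serves the bulk cheaply while the few long requests pay the large-pool rate.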
2026
arXiv Technical Report
The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency
Derives the 1/W law showing that tokens per watt roughly halve whenever the serving context window doubles, making context-length routing topology a larger energy-efficiency lever than GPU generation alone.
Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu
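The scaling claim above can be stated as tokens-per-watt proportional to 1/W, where W is the serving context window. Under that toy model, routing a short request from a 128K-window pool to an 8K-window pool is a 16× efficiency gain, which is why the abstract frames routing topology as a larger lever than a GPU generation jump. The constant below is hypothetical.

```python
def tokens_per_watt(window, k=2.0e6):
    # Illustrative 1/W scaling: efficiency inversely proportional to
    # the serving context window W; k is a hypothetical constant.
    return k / window
```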
2026
arXiv Technical Report
Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL
Shows how probabilistic ML predicates can silently co-fire on the same query and implements conflict detection plus a softmax-based prevention mechanism in the Semantic Router DSL.
Authors: Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, Xue Liu
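The co-firing failure and its softmax fix can be sketched directly: independent thresholds let two probabilistic predicates both fire on the same query, while softmax normalization forces exactly one winner. The predicate names and threshold are illustrative, not the DSL's actual syntax.

```python
import math

def softmax(scores):
    peak = max(scores.values())
    exps = {k: math.exp(v - peak) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def naive_fire(scores, threshold=0.5):
    # Independent thresholds: every score above the bar fires,
    # so two predicates can silently co-fire on one query.
    return sorted(k for k, v in scores.items() if v >= threshold)

def resolve_conflict(scores):
    # Prevention: softmax-normalize so exactly one predicate wins.
    probs = softmax(scores)
    return max(probs, key=probs.get)
```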
2026
arXiv Technical Report
From Inference Routing to Agent Orchestration: Declarative Policy Compilation with Cross-Layer Verification
Extends the Semantic Router DSL from per-request routing to multi-step agent workflows, emitting verified decision nodes for orchestration frameworks, Kubernetes artifacts, and protocol boundaries.
Authors: Huamin Chen, Xunzhuo Liu, Bowei He, Xue Liu
2026
arXiv Technical Report
Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents
Shows that conversational memory and retrieval-grounded routing let a lightweight 8B model recover most of a 235B model's performance on persistent user-specific queries while cutting cost by 96%.
Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
2026 · RAG Verification
arXiv Technical Report
Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems
A real-time verification component for long-document RAG that handles up to 32K-token contexts, balancing latency and grounding coverage for interactive systems.
Authors: Xunzhuo Liu, Bowei He, Xue Liu, Haichen Zhang, Huamin Chen
2025
NeurIPS - MLForSys
When to Reason: Semantic Router for vLLM
A semantic router that classifies queries by reasoning requirement and selectively applies reasoning only when it is beneficial.
Authors: Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen
2025
arXiv
Category-Aware Semantic Caching for Heterogeneous LLM Workloads
A category-aware semantic cache where similarity thresholds, TTLs, and quotas vary by query category, with a hybrid architecture separating in-memory HNSW search from external document storage.
Authors: Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen
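The per-category policy idea can be sketched as a cache whose similarity threshold and TTL are looked up by query category. A brute-force scan stands in for the paper's in-memory HNSW index, and the policy table values are hypothetical.

```python
import math

# Hypothetical per-category policy: (similarity threshold, TTL seconds).
POLICIES = {
    "code": (0.95, 300),      # strict match, short-lived answers
    "general": (0.85, 3600),  # looser match, longer reuse
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

class CategoryAwareCache:
    def __init__(self):
        self.entries = {}  # category -> [(embedding, answer, expiry)]

    def put(self, category, embedding, answer, now):
        _, ttl = POLICIES[category]
        self.entries.setdefault(category, []).append(
            (embedding, answer, now + ttl))

    def get(self, category, embedding, now):
        threshold, _ = POLICIES[category]
        best, best_sim = None, threshold
        for emb, answer, expiry in self.entries.get(category, []):
            if expiry < now:
                continue  # expired under this category's TTL
            sim = cosine(emb, embedding)
            if sim >= best_sim:
                best, best_sim = answer, sim
        return best
```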
2025
Internet Engineering Task Force (IETF)
Semantic Inference Routing Protocol (SIRP)
Specifies the Semantic Inference Routing Protocol, a framework for content-level classification and semantic routing in AI inference systems.
Authors: Huamin Chen, Luay Jalil
2025
IETF NMRG
Multi-Provider Extensions for Agentic AI Inference APIs
Specifies multi-provider extensions for agentic AI inference APIs.
Authors: Huamin Chen, Luay Jalil, N. Cocker