Research

Research for controllable AI systems.

Routing, safety, and runtime systems that ship.

Papers · OSS · production systems

About our research

We study the control layer above the model.

That means routing requests, enforcing safety, and making execution legible.

Research Focus

We work on a narrow set of infrastructure questions with outsized impact.

The common theme is control: how to route requests, decide when to reason, and make agent behavior inspectable.

Signal-driven selection

Model routing

We study signal learning, model selection, and inference policy so each request can be matched to the right model instead of treating every query the same.

Open GitHub

Guardrails that understand meaning

Safety and factuality

We study jailbreak detection, privacy protection, and workload-aware hallucination checks as runtime signals rather than after-the-fact filters.

Execution across tools, cache, and boundaries

Runtime intelligence

We study request lifecycles, semantic caching, and system interfaces that make multi-model execution usable in production.

Publications

Papers and talks.

A running list of 17 papers and 3 talks across routing, runtime, safety, fleet planning, and agent systems.

20 items total
2026Position Paper

arXiv Technical Report

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

A signal-driven decision routing framework for Mixture-of-Modality deployments that composes heterogeneous signals into routing policies across cost, privacy, latency, and safety constraints.

Authors: vLLM Semantic Router Team

2026Vision Paper

arXiv Technical Report

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Synthesizes routing, fleet, multimodal, and governance results into the Workload-Router-Pool architecture, connecting signal-driven routing to full-stack inference optimization.

Authors: Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang

2026

arXiv Technical Report

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Formalizes the visual confused deputy as a security failure mode in computer-using agents and introduces a dual-channel guardrail that checks click targets and action reasoning before execution.

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

2026

arXiv Technical Report

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Introduces OATS, an offline embedding refinement method that improves semantic-router tool ranking under single-digit millisecond CPU budgets without serving-time model inference.

Authors: Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu

2026

arXiv Technical Report

Adaptive Vision-Language Model Routing for Computer Use Agents

Proposes Adaptive VLM Routing, which estimates action difficulty and routes computer-use agent steps to the cheapest model that still satisfies a target reliability threshold.

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

2026

arXiv Technical Report

98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Uses Flash Attention, prompt compression, and near-streaming body processing to cut routing latency from seconds to tens of milliseconds while keeping the router lightweight enough to share hardware with serving.

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

2026

arXiv Technical Report

inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

A queueing-theory-grounded fleet planner and discrete-event simulator for sizing multi-pool LLM GPU fleets against P99 TTFT targets without up-front profiling runs.

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

2026

arXiv Technical Report

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

Derives the minimum-cost two-pool LLM fleet directly from workload CDF and P99 TTFT targets, then uses Compress-and-Route to make the boundary deployable in practice.

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

2026

arXiv Technical Report

The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

Derives the 1/W law showing that tokens per watt roughly halve whenever the serving context window doubles, making context-length routing topology a larger energy-efficiency lever than GPU generation alone.

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

2026

arXiv Technical Report

Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL

Shows how probabilistic ML predicates can silently co-fire on the same query and implements conflict detection plus a softmax-based prevention mechanism in the Semantic Router DSL.

Authors: Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, Xue Liu

2026

arXiv Technical Report

From Inference Routing to Agent Orchestration: Declarative Policy Compilation with Cross-Layer Verification

Extends the Semantic Router DSL from per-request routing to multi-step agent workflows, emitting verified decision nodes for orchestration frameworks, Kubernetes artifacts, and protocol boundaries.

Authors: Huamin Chen, Xunzhuo Liu, Bowei He, Xue Liu

2026

arXiv Technical Report

Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Shows that conversational memory and retrieval-grounded routing let a lightweight 8B model recover most of a 235B model's performance on persistent user-specific queries while cutting cost by 96%.

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

2026RAG Verification

arXiv Technical Report

Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

A real-time verification component for long-document RAG that handles up to 32K-token contexts, balancing latency and grounding coverage for interactive systems.

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Haichen Zhang, Huamin Chen

2025

NeurIPS - MLForSys

When to Reason: Semantic Router for vLLM

A semantic router that classifies queries by reasoning requirement and selectively applies reasoning only when it is beneficial.

Authors: Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen

2025

arXiv

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

A category-aware semantic cache where similarity thresholds, TTLs, and quotas vary by query category, with a hybrid architecture separating in-memory HNSW search from external document storage.

Authors: Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen

2025

Internet Engineering Task Force (IETF)

Semantic Inference Routing Protocol (SIRP)

Specifies the Semantic Inference Routing Protocol, a framework for content-level classification and semantic routing in AI inference systems.

Authors: Huamin Chen, Luay Jalil

2025

IETF NMRG

Multi-Provider Extensions for Agentic AI Inference APIs

Specifies multi-provider extensions for agentic AI inference APIs.

Authors: H. Chen, L. Jalil, N. Cocker

Research Method

Research ships in three forms.

We move papers, open source, and production systems forward together rather than treating them as separate tracks.

Papers

We publish the technical ideas that define our view of routing, safety, and runtime control.

Open source

Research is grounded in working software, from signal extraction and decision logic to provider-neutral execution.

Production systems

System ideas are tested against real deployment requirements, not just benchmark results.

Continue

See the platform and the company behind the work.

Platform shows how the research becomes product. About explains the thesis and team.