Research

Research for controllable AI systems.

Routing, safety, and runtime systems that ship.

Papers · OSS · production systems

About our research

We study the control layer above the model.

That means routing requests, enforcing safety, and making execution legible.

Research Focus

We work on a narrow set of infrastructure questions with outsized impact.

The common theme is control: how to route requests, decide when to reason, and make agent behavior inspectable.

Signal-driven selection

Model routing

We study signal learning, model selection, and inference policy so each request can be matched to the right model instead of treating every query the same.

Open GitHub

Guardrails that understand meaning

Safety and factuality

We study jailbreak detection, privacy protection, and workload-aware hallucination checks as runtime signals rather than after-the-fact filters.

Execution across tools, cache, and boundaries

Runtime intelligence

We study request lifecycles, semantic caching, and system interfaces that make multi-model execution usable in production.

Publications

Papers and talks.

A running list of 17 papers and 3 talks across routing, runtime, safety, fleet planning, and agent systems.

20 items total

2026Position Paper

arXiv Technical Report

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

A signal-driven decision routing framework for Mixture-of-Modality deployments that composes heterogeneous signals into routing policies across cost, privacy, latency, and safety constraints.

Authors: vLLM Semantic Router Team

Research for controllable AI systems.

We work on a narrow set of infrastructure questions with outsized impact.

Model routing

Safety and factuality

Runtime intelligence

Papers and talks.

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Adaptive Vision-Language Model Routing for Computer Use Agents

98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL

From Inference Routing to Agent Orchestration: Declarative Policy Compilation with Cross-Layer Verification

Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

When to Reason: Semantic Router for vLLM

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

Semantic Inference Routing Protocol (SIRP)

Multi-Provider Extensions for Agentic AI Inference APIs

Research ships in three forms.

Papers

Open source

Production systems

See the platform and the company behind the work.