Version: v0.0.0

Features

This section covers prefill-decode disaggregation, expert parallelism, routing strategies (load-aware and prefix cache-aware), presets, and auto-scaling.

📄️ Load-aware routing

Load-aware routing monitors the number of requests assigned to each inference instance (pod) and its real-time utilization metrics to decide where the next request should be routed. Because individual requests differ in workload characteristics and processing time, load-aware routing can achieve higher system-level efficiency than round-robin routing and, in particular, helps reduce latency variance across requests. Like other routing strategies such as prefix cache-aware routing, load-aware routing cannot serve as the sole routing criterion and should be combined with other metrics for better routing decisions.
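To make the idea concrete, here is a minimal sketch of how a load signal might be blended with a prefix-cache signal when choosing a pod. It is an illustration only, not the product's actual scoring logic: the `PodStats` fields, the `load_score` and `pick_pod` helpers, the equal split between queue depth and utilization, and the default weights are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class PodStats:
    """Hypothetical per-pod snapshot; field names are illustrative."""
    name: str
    assigned_requests: int          # requests currently assigned to the pod
    utilization: float              # assumed normalized utilization in [0, 1]
    prefix_hit_ratio: float = 0.0   # assumed prefix-cache score in [0, 1]


def load_score(pod: PodStats, max_assigned: int) -> float:
    """Return a load value in [0, 1]; higher means busier."""
    queue_term = pod.assigned_requests / max_assigned if max_assigned else 0.0
    # Equal weighting of queue depth and utilization is an arbitrary choice.
    return 0.5 * queue_term + 0.5 * pod.utilization


def pick_pod(pods, load_weight=0.7, prefix_weight=0.3):
    """Pick the pod with the lowest combined cost.

    Load raises the cost, an expected prefix-cache hit lowers it, so
    neither signal decides the route on its own.
    """
    if not pods:
        raise ValueError("no pods available")
    max_assigned = max(p.assigned_requests for p in pods)

    def cost(p: PodStats) -> float:
        return load_weight * load_score(p, max_assigned) - prefix_weight * p.prefix_hit_ratio

    return min(pods, key=cost)


pods = [
    PodStats("pod-a", assigned_requests=8, utilization=0.9),
    PodStats("pod-b", assigned_requests=2, utilization=0.3),
    PodStats("pod-c", assigned_requests=4, utilization=0.5),
]
chosen = pick_pod(pods)
```

With load as the only differing signal, the least-loaded pod (`pod-b`) is chosen. If `pod-c` instead reported a high `prefix_hit_ratio`, the combined cost could favor it despite its higher load, which is the kind of trade-off that combining metrics is meant to capture.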