📄️ Presets
The MoAI Inference Framework provides a set of pre-configured InferenceServiceTemplates, known as presets. These presets encapsulate standard configurations for various models and hardware setups, simplifying the deployment of inference services.
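As a rough illustration only (the preset name and every field below are hypothetical, not the actual InferenceServiceTemplate schema), a preset can be thought of as a named bundle of deployment parameters that is resolved when a service is created:

```python
# Hypothetical sketch: the preset name and fields are illustrative,
# not the real InferenceServiceTemplate schema.
PRESETS = {
    "example-llm-8gpu": {
        "model": "example-org/example-llm",  # assumed model identifier
        "gpu_count": 8,
        "tensor_parallel_size": 8,
        "max_batch_size": 64,
    },
}

def resolve_preset(name: str) -> dict:
    """Look up the pre-configured deployment parameters for a preset name."""
    return PRESETS[name]

print(resolve_preset("example-llm-8gpu"))
```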
📄️ Prefill-decode disaggregation
During LLM inference, computation occurs in two stages: prefill and decode. In the prefill phase, the model processes the entire input prompt to generate the first token — a highly parallel, compute-bound process. The decode phase then predicts one token at a time, reusing the growing KV cache, and is memory-bound.
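The toy sketch below (no real model; the token arithmetic is a stand-in for actual predictions) illustrates the control flow of the two phases: prefill consumes the whole prompt in one pass and builds the KV cache, while decode generates one token per step against that growing cache.

```python
# Toy sketch of the two inference phases; values are placeholders, not model output.
from typing import List, Tuple

def prefill(prompt_tokens: List[int]) -> Tuple[List[int], int]:
    """Process the whole prompt in one pass (compute-bound): build the KV cache
    for every prompt token and emit the first output token."""
    kv_cache = list(prompt_tokens)          # stand-in for per-token KV entries
    first_token = sum(prompt_tokens) % 100  # stand-in for the model's prediction
    return kv_cache, first_token

def decode(kv_cache: List[int], last_token: int, max_new_tokens: int) -> List[int]:
    """Generate one token at a time (memory-bound), appending to the KV cache."""
    output = [last_token]
    for _ in range(max_new_tokens - 1):
        kv_cache.append(last_token)
        last_token = (last_token + len(kv_cache)) % 100  # stand-in prediction
        output.append(last_token)
    return output

kv, first = prefill([5, 17, 42])
print(decode(kv, first, max_new_tokens=4))
```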
📄️ Expert parallelism
Expert parallelism (EP) refers to a parallelization method that assigns and executes multiple experts of a Mixture-of-Experts (MoE) model across different GPUs. While applying EP is not mandatory for MoE models, it is generally helpful for increasing overall inference throughput (total output tokens per second).
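A minimal sketch of the idea, assuming a static partition of experts across GPUs and a single-expert gate (real MoE layers route hidden states to the top-k experts and exchange them with all-to-all communication): tokens are grouped by the GPU that owns their selected expert.

```python
# Toy sketch of expert parallelism: experts are partitioned across GPUs,
# and each token is dispatched to the GPU owning its selected expert.
NUM_EXPERTS = 8
NUM_GPUS = 4
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # experts 0-1 on GPU 0, 2-3 on GPU 1, ...

def gate(token: int) -> int:
    """Stand-in router: pick one expert per token (real gates pick top-k)."""
    return token % NUM_EXPERTS

def dispatch(tokens: list[int]) -> dict[int, list[int]]:
    """Group tokens by the GPU that owns their selected expert."""
    per_gpu: dict[int, list[int]] = {g: [] for g in range(NUM_GPUS)}
    for t in tokens:
        per_gpu[gate(t) // EXPERTS_PER_GPU].append(t)
    return per_gpu

print(dispatch(list(range(16))))  # which GPU processes which tokens
```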
📄️ Prefix cache-aware routing
Prefix caching refers to a technique that stores the KV cache from previous queries, allowing subsequent queries with an identical prefix to reuse it, thereby eliminating redundant computation and improving performance. Since multiple queries often share common prefixes — such as system prompts, conversation history, or contextual documents — recomputing the KV cache for every request would be highly inefficient.
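A minimal sketch of the routing idea (not the framework's actual router): each request is sent to the instance whose cached prefixes cover the longest prefix of the incoming prompt, so the largest share of the KV cache can be reused.

```python
# Illustrative prefix cache-aware routing: pick the instance with the best prefix hit.
from typing import Dict, List

def longest_cached_prefix(prompt: str, cached_prefixes: List[str]) -> int:
    """Length of the longest cached prefix that the prompt starts with."""
    return max((len(p) for p in cached_prefixes if prompt.startswith(p)), default=0)

def route(prompt: str, instances: Dict[str, List[str]]) -> str:
    """Route to the instance whose cache covers the longest prefix of the prompt."""
    return max(instances, key=lambda name: longest_cached_prefix(prompt, instances[name]))

instances = {
    "pod-a": ["You are a helpful assistant."],
    "pod-b": ["You are a helpful assistant. Summarize the following document:"],
}
print(route("You are a helpful assistant. Summarize the following document: ...", instances))
# -> "pod-b" (largest reusable prefix)
```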
📄️ Load-aware routing
Load-aware routing monitors the number of assigned requests and real-time utilization metrics of each inference instance (pod) to determine where the next request should be routed. Since individual requests have different workload characteristics and processing times, load-aware routing can achieve higher system-level efficiency than round-robin routing and, in particular, reduce latency variance across requests. Like other routing strategies such as prefix cache-aware routing, load-aware routing cannot serve as the sole routing criterion and should be combined with other metrics for optimal decision-making.
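The sketch below shows a hypothetical scoring policy, not the framework's actual one: it combines the in-flight request count and utilization into a single load score and, as noted above, mixes in a prefix-affinity term so that load is not the only criterion. The weights are illustrative knobs.

```python
# Hypothetical load-aware routing policy; weights and fields are illustrative.
from dataclasses import dataclass

@dataclass
class InstanceStats:
    name: str
    in_flight_requests: int  # assigned but unfinished requests
    gpu_utilization: float   # 0.0 - 1.0, sampled in real time
    prefix_hit_ratio: float  # 0.0 - 1.0, signal from prefix cache-aware routing

def score(s: InstanceStats) -> float:
    """Lower is better: penalize load, reward cache affinity."""
    return 0.5 * s.in_flight_requests + 0.5 * s.gpu_utilization - 0.3 * s.prefix_hit_ratio

def route(stats: list[InstanceStats]) -> str:
    """Route the next request to the instance with the lowest combined score."""
    return min(stats, key=score).name

print(route([
    InstanceStats("pod-a", in_flight_requests=4, gpu_utilization=0.9, prefix_hit_ratio=0.1),
    InstanceStats("pod-b", in_flight_requests=1, gpu_utilization=0.4, prefix_hit_ratio=0.0),
]))  # -> "pod-b"
```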
📄️ Auto-scaling
The model inference endpoints provided by the MoAI Inference Framework are often just one of many workloads running on the overall AI compute infrastructure. It is therefore essential to allocate the right amount of GPU resources, i.e., run the right number of Pods, so that GPUs are not under-utilized while the service still handles all incoming traffic and meets the defined service level objectives (SLOs).
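As an illustration of the idea, the proportional scaling rule popularized by Kubernetes' Horizontal Pod Autoscaler picks the replica count that moves a per-pod metric back toward its target. The metric and target values below are assumptions, not the framework's actual auto-scaling configuration.

```python
# Illustrative proportional auto-scaling rule (HPA-style); values are assumptions.
import math

def desired_replicas(current_replicas: int,
                     observed_metric: float,  # e.g. in-flight requests per pod
                     target_metric: float,    # SLO-derived per-pod target
                     min_replicas: int = 1,
                     max_replicas: int = 16) -> int:
    """Scale replicas so the per-pod metric moves toward its target, within bounds."""
    desired = math.ceil(current_replicas * observed_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(current_replicas=4, observed_metric=30.0, target_metric=20.0))  # -> 6
print(desired_replicas(current_replicas=4, observed_metric=5.0, target_metric=20.0))   # -> 1
```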