📄️ Overview
MoAI Inference Framework is designed to enable efficient, automated distributed inference across cluster and Kubernetes environments. It supports a wide range of distributed inference techniques, such as prefill-decode disaggregation, expert parallelism, and prefix-cache-aware routing. Leveraging its cost model, it automatically identifies, applies, and dynamically adjusts the optimal way to utilize the available accelerators to meet the defined service level objectives (SLOs). All of these capabilities are seamlessly integrated not only for NVIDIA GPUs but also for other accelerators, especially AMD GPUs.
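To give a feel for one of the techniques mentioned above, the sketch below is a toy, hypothetical illustration of prefix-cache-aware routing: a request is sent to the replica whose cached prompt prefix overlaps most with the incoming prompt. It is not MoAI's actual cost model or API; the replica names and scoring rule are made up purely for illustration.

```python
# Toy illustration of prefix-cache-aware routing (NOT the MoAI API).
# Replica names and cached prefixes below are hypothetical.

def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading prefix between two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_replica(prompt: str, cached_prefixes: dict[str, str]) -> str:
    """Route to the replica whose cached prefix overlaps most with the prompt."""
    return max(cached_prefixes, key=lambda r: common_prefix_len(prompt, cached_prefixes[r]))

replicas = {
    "vllm-0": "You are a helpful assistant. Summarize:",
    "vllm-1": "Translate the following text to French:",
}
print(pick_replica("You are a helpful assistant. Summarize: MoAI docs", replicas))
# -> vllm-0, since it already holds the longest matching prefix in its cache
```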
📄️ Prerequisites
This document introduces the prerequisites for the MoAI Inference Framework and provides instructions on how to install them.
📄️ Quickstart
As an example, this quickstart launches two vLLM instances (pods) serving the Llama 3.2 1B Instruct model and exposes them through a single endpoint. Please make sure to install all prerequisites, including the component versions listed below, before starting this quickstart guide.
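Once the two vLLM pods are running behind one endpoint, requests can be sent to that endpoint with any OpenAI-compatible client, since vLLM exposes an OpenAI-compatible API; the routing layer decides which pod handles each request. The snippet below is a minimal sketch assuming the service is reachable at http://localhost:8000/v1 with a dummy API key; substitute the URL, key, and model name from your own deployment.

```python
# Minimal sketch: send a request to the single endpoint fronting both vLLM pods.
# base_url and api_key are placeholder assumptions, not values defined by this guide.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```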
📄️ Monitoring
This document describes how to access Grafana dashboards to monitor the MoAI Inference Framework and provides an overview of the available metrics. Please make sure to install all prerequisites before starting this monitoring guide.
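For a rough idea of what these metrics look like outside Grafana, the sketch below queries a Prometheus server directly over its HTTP API. The Prometheus URL and the metric name (a standard vLLM gauge) are assumptions about a typical deployment, not values defined by this guide; the Grafana dashboards visualize the same data.

```python
# Illustrative sketch: query Prometheus directly for one inference metric.
# The server URL and metric name are assumptions about a typical deployment.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # placeholder; use your Prometheus service

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": "vllm:num_requests_running"},
    timeout=10,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    timestamp, value = result["value"]
    print(labels.get("pod", "unknown-pod"), value)
```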
📄️ Supported devices
Accelerator labels