Multi-model serving
Multi-model serving enables serving multiple models through a single endpoint, routing requests to the appropriate model based on the model field in the request body. By combining the Body-Based Router (BBR) with Heimdall schedulers, the MoAI Inference Framework can serve multiple models across different GPU types through a unified API.
Architecture overview
This example deploys four InferenceService instances serving two models (Llama 3.2 1B and Qwen 3 1.7B) across two GPU types (AMD MI250 and AMD MI300X):
- Gateway: Receives all requests through a single endpoint (port 80)
- BBR (Body-Based Router): Extracts the model name from the request body and injects it as an HTTP header
- 2x Heimdall schedulers: One per model, each managing its own InferencePool
- 4x InferenceService instances: Two models across two GPU types
When a request arrives:
- The client sends a request with the `model` field in the body (e.g., `"model": "meta-llama/Llama-3.2-1B-Instruct"`)
- BBR extracts the model name and injects it as the `X-Gateway-Model-Name` header
- The gateway matches the header to the correct Heimdall instance via HTTPRoute rules
- Heimdall selects a pod within the InferencePool based on its scheduling profiles
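The header-to-pool matching in the third step is ordinary Gateway API routing: each Heimdall release creates HTTPRoute rules for the model names it serves. An illustrative route is sketched below — the names are hypothetical and the exact manifest is generated by the Heimdall chart, so it may differ from what you see in your cluster:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: heimdall-llama        # illustrative name
spec:
  parentRefs:
  - name: mif                 # the shared gateway
  rules:
  - matches:
    - headers:
      - name: X-Gateway-Model-Name
        value: meta-llama/Llama-3.2-1B-Instruct
    backendRefs:
    - group: inference.networking.x-k8s.io   # group may differ with your installed CRD version
      kind: InferencePool
      name: heimdall-llama    # illustrative pool name
```

Requests whose header matches the rule are forwarded to that Heimdall instance's InferencePool; everything else falls through to other routes.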
Prerequisites
- Complete the Prerequisites setup, including the MoAI Inference Framework and preset chart installation.
- Pre-download the required models to a shared persistent volume by following the HF Model Management (PV) guide. The `vllm-hf-hub-offline` template used in this guide requires a PVC named `models` in the deployment namespace.
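If the shared volume is not yet in place, the PVC might look like the following — a sketch only, since the storage class, access mode, and size depend on your cluster; the HF Model Management (PV) guide is the authoritative setup:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models                   # the name the vllm-hf-hub-offline template expects
spec:
  accessModes:
  - ReadWriteMany                # shared read access across pods on multiple nodes
  resources:
    requests:
      storage: 100Gi             # adjust to the size of your model cache
  storageClassName: <your-shared-storage-class>   # placeholder; set to your cluster's class
```

Apply it in the deployment namespace and populate it with the models before deploying the InferenceService instances.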
Create a namespace
kubectl create namespace multi-model-test
kubectl label namespace multi-model-test mif=enabled
Deploy the gateway
Create a gateway.yaml file. The contents vary depending on which gateway controller you use.
For Istio (the default):
apiVersion: v1
kind: ConfigMap
metadata:
name: mif-gateway-infrastructure
data:
service: |
spec:
type: ClusterIP
ports:
- name: http
port: 80
targetPort: http
deployment: |
spec:
template:
metadata:
annotations:
proxy.istio.io/config: |
accessLogFile: /dev/stdout
accessLogEncoding: JSON
spec:
containers:
- name: istio-proxy
resources:
limits: null
ports:
- name: http
containerPort: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: mif
spec:
gatewayClassName: istio
infrastructure:
parametersRef:
group: ""
kind: ConfigMap
name: mif-gateway-infrastructure
listeners:
- name: http
protocol: HTTP
port: 80
allowedRoutes:
namespaces:
from: All
For Kgateway:
apiVersion: gateway.kgateway.dev/v1alpha1
kind: GatewayParameters
metadata:
name: mif-gateway-infrastructure
spec:
kube:
service:
type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: mif
spec:
gatewayClassName: kgateway
infrastructure:
parametersRef:
group: gateway.kgateway.dev
kind: GatewayParameters
name: mif-gateway-infrastructure
listeners:
- name: http
protocol: HTTP
port: 80
allowedRoutes:
namespaces:
from: All
kubectl apply -n multi-model-test -f gateway.yaml
Deploy the Body-Based Router
BBR is an external processor extension to the gateway that parses request bodies and extracts the model field as an HTTP header (X-Gateway-Model-Name). This enables routing to different InferencePool backends based on the requested model.
BBR is only needed when clients send the model name in the request body (e.g., OpenAI-compatible API). If the client can set the X-Gateway-Model-Name header directly, you can skip the BBR deployment entirely — the gateway will route based on the header alone.
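For example, a client that sets the header itself could call the gateway like this — a sketch that assumes the port-forward from the Send requests section below is active on localhost:8000:

```shell
# The model name still goes in the body for vLLM, but routing is decided
# by the explicit header rather than by body parsing (no BBR needed).
MODEL="meta-llama/Llama-3.2-1B-Instruct"
BODY="{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}"

# Prints the completion response when the gateway is reachable.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Gateway-Model-Name: $MODEL" \
  -d "$BODY"
```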
This guide demonstrates BBR deployment using Istio. For other gateway controllers, refer to the Gateway API Inference Extension documentation.
Create a bbr-values.yaml file:
provider:
name: istio
inferenceGateway:
name: mif
helm upgrade -i body-based-router \
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/body-based-routing \
--version v1.4.0 \
-n multi-model-test \
-f bbr-values.yaml \
--wait
Deploy Heimdall schedulers
Deploy two Heimdall instances, one for each model. Each instance specifies gateway.bbr.models so that requests matching the model name in the X-Gateway-Model-Name header are routed to its InferencePool.
Set gateway.gatewayClassName to kgateway if you are using Kgateway as the gateway controller.
Create heimdall-llama-values.yaml:
global:
imagePullSecrets:
- name: moreh-registry
config:
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: single-profile-handler
- type: queue-scorer
- type: max-score-picker
schedulingProfiles:
- name: default
plugins:
- pluginRef: queue-scorer
- pluginRef: max-score-picker
gateway:
name: mif
gatewayClassName: istio
bbr:
models:
- meta-llama/Llama-3.2-1B-Instruct
inferencePool:
targetPorts:
- number: 8000
Create heimdall-qwen-values.yaml with the same structure but a different gateway.bbr.models value:
global:
imagePullSecrets:
- name: moreh-registry
config:
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: single-profile-handler
- type: queue-scorer
- type: max-score-picker
schedulingProfiles:
- name: default
plugins:
- pluginRef: queue-scorer
- pluginRef: max-score-picker
gateway:
name: mif
gatewayClassName: istio
bbr:
models:
- Qwen/Qwen3-1.7B
inferencePool:
targetPorts:
- number: 8000
Deploy both instances:
helm upgrade -i heimdall-llama moreh/heimdall \
--version <heimdallVersion> \
-n multi-model-test \
-f heimdall-llama-values.yaml
helm upgrade -i heimdall-qwen moreh/heimdall \
--version <heimdallVersion> \
-n multi-model-test \
-f heimdall-qwen-values.yaml
Verify that both Heimdall pods are running:
kubectl get pods -n multi-model-test -l app.kubernetes.io/name=heimdall
NAME READY STATUS RESTARTS AGE
heimdall-llama-7d54fcbfff-abc12 1/1 Running 0 30s
heimdall-qwen-8e65gdcggg-def34 1/1 Running 0 25s
Deploy InferenceService instances
Deploy four InferenceService instances: two models (Llama 3.2 1B and Qwen 3 1.7B) each on two GPU types (MI250 and MI300X). Each instance references the vllm-hf-hub-offline template to load models from the shared persistent volume.
The vllm-hf-hub-offline template is installed by the moai-inference-preset chart in the mif namespace. It configures offline mode (HF_HUB_OFFLINE=1) and mounts the models PVC at /mnt/models. Ensure the PVC exists in the multi-model-test namespace with the required models pre-downloaded. See HF Model Management (PV) for setup instructions.
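Conceptually, the template contributes something like the following to each pod spec — a simplified sketch of its effect based on the description above, not the template's actual contents:

```yaml
# Simplified sketch of what vllm-hf-hub-offline adds to a pod spec:
spec:
  containers:
  - name: vllm
    env:
    - name: HF_HUB_OFFLINE
      value: "1"              # never contact the Hugging Face Hub at runtime
    volumeMounts:
    - name: models
      mountPath: /mnt/models  # models are read from the shared volume
  volumes:
  - name: models
    persistentVolumeClaim:
      claimName: models       # the pre-populated PVC from the prerequisites
```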
Create isvc-llama-mi250.yaml:
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
metadata:
name: llama-mi250
spec:
replicas: 1
inferencePoolRefs:
- name: heimdall-llama
templateRefs:
- name: vllm
- name: quickstart-vllm-meta-llama-llama-3.2-1b-instruct-amd-mi250-tp2
- name: vllm-hf-hub-offline
Create isvc-llama-mi300x.yaml:
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
metadata:
name: llama-mi300x
spec:
replicas: 1
inferencePoolRefs:
- name: heimdall-llama
templateRefs:
- name: vllm
- name: quickstart-vllm-meta-llama-llama-3.2-1b-instruct-amd-mi300x-tp2
- name: vllm-hf-hub-offline
Create isvc-qwen-mi250.yaml:
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
metadata:
name: qwen-mi250
spec:
replicas: 1
inferencePoolRefs:
- name: heimdall-qwen
templateRefs:
- name: vllm
- name: quickstart-vllm-qwen-qwen3-1.7b-amd-mi250-tp2
- name: vllm-hf-hub-offline
Create isvc-qwen-mi300x.yaml:
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
metadata:
name: qwen-mi300x
spec:
replicas: 1
inferencePoolRefs:
- name: heimdall-qwen
templateRefs:
- name: vllm
- name: quickstart-vllm-qwen-qwen3-1.7b-amd-mi300x-tp2
- name: vllm-hf-hub-offline
Deploy all four instances:
kubectl apply -n multi-model-test \
-f isvc-llama-mi250.yaml \
-f isvc-llama-mi300x.yaml \
-f isvc-qwen-mi250.yaml \
-f isvc-qwen-mi300x.yaml
Wait for all services to be ready:
kubectl wait inferenceservice -n multi-model-test \
llama-mi250 llama-mi300x qwen-mi250 qwen-mi300x \
--for=condition=Ready \
--timeout=15m
Send requests
Set up port forwarding to access the gateway:
SERVICE=$(kubectl -n multi-model-test get service -l gateway.networking.k8s.io/gateway-name=mif -o name)
kubectl -n multi-model-test port-forward $SERVICE 8000:80
Send a request to the Llama model:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [
{
"role": "user",
"content": "Hello!"
}
]
}' | jq '.'
Send a request to the Qwen model:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-1.7B",
"messages": [
{
"role": "user",
"content": "Hello!"
}
]
}' | jq '.'
Both requests go to the same endpoint. BBR extracts the model name from each request body, the gateway routes the request to the matching Heimdall instance, and Heimdall distributes the load across the available pods in that model's InferencePool.
Cleanup
Delete all resources in reverse order:
# Delete InferenceService instances
kubectl delete -n multi-model-test \
-f isvc-llama-mi250.yaml \
-f isvc-llama-mi300x.yaml \
-f isvc-qwen-mi250.yaml \
-f isvc-qwen-mi300x.yaml
# Uninstall Heimdall instances
helm uninstall -n multi-model-test heimdall-llama
helm uninstall -n multi-model-test heimdall-qwen
# Uninstall BBR
helm uninstall -n multi-model-test body-based-router
# Delete gateway
kubectl delete -n multi-model-test -f gateway.yaml
# Delete namespace
kubectl delete namespace multi-model-test