Heimdall scheduler
Heimdall is the component that performs smart routing and scheduling across multiple inference pods, deciding which pod each request should be sent to. It is implemented according to the Kubernetes Gateway API Inference Extension and can operate together with various gateway controllers.
Heimdall's behavior can be flexibly defined by composing multiple scheduling profiles, filters, scorers, and pickers as plugins. The MoAI Inference Framework provides a wide range of plugins — from advanced plugins that enable fully automated routing and scheduling to low-level plugins that allow fine-grained customization.
Architecture​
Control flow​
sequenceDiagram
participant Gateway
participant Inference-pod as Inference pod
participant Heimdall
participant Scheduling-profile-1 as Scheduling profile 1
participant Scheduling-profile-2 as Scheduling profile 2
Gateway ->>+ Heimdall: Request
loop
Note over Heimdall: Profile handler plugin
alt If the scheduling profile 1 is picked
Heimdall ->>+ Scheduling-profile-1: Request
Note over Scheduling-profile-1: Filter plugins
Note over Scheduling-profile-1: Scorer plugins
Note over Scheduling-profile-1: Picker plugin
Scheduling-profile-1 -->>- Heimdall: Routing candidates
else If the scheduling profile 2 is picked
Heimdall ->>+ Scheduling-profile-2: Request
Note over Scheduling-profile-2: Filter plugins
Note over Scheduling-profile-2: Scorer plugins
Note over Scheduling-profile-2: Picker plugin
Scheduling-profile-2 -->>- Heimdall: Routing candidates
end
end
Note over Heimdall: Pre-request plugins
Heimdall -->>- Gateway: Updated request and routing information
Gateway ->>+ Inference-pod: Request
Inference-pod -->>- Gateway: Response
Gateway ->>+ Heimdall: Response
Note over Heimdall: Post-response plugins
Heimdall -->>- Gateway: Updated response
Plugins​
MoAI Inference Framework supports six categories of plugins — profile handler, pre-request, post-response, filter, scorer, and picker plugins. Users need to instantiate and activate some of the plugins to define the behavior of Heimdall.
| Category | Description | Instantiation and activation |
|---|---|---|
| Profile handler | Determines the scheduling profile(s) to be used for each incoming request. | Automatically invoked by Heimdall; exactly one profile handler must be instantiated. |
| Pre-request | Performs certain processing or transformations before the request is delivered to a pod. | Automatically invoked by Heimdall once instantiated. |
| Post-response | Performs certain processing or transformations before the response is delivered to the client. | Automatically invoked by Heimdall once instantiated. |
| Filter | Restricts the set of pods that a scheduling profile may consider as candidates. | Becomes active only when included in a scheduling profile. |
| Scorer | Adds a score to each individual pod. | Becomes active only when included in a scheduling profile. |
| Picker | Selects the routing target based on the scores of pods. | Becomes active only when included in a scheduling profile. |
Profile handler​
| Plugin | Description | Reference |
|---|---|---|
single-profile-handler | Uses the default scheduling profile to select a single pod. | |
pd-profile-handler | Uses the prefill and decode scheduling profile to select both prefill-only and decode-only pods. | Prefill-decode disaggregation |
Filter​
| Plugin | Description | Reference |
|---|---|---|
decode-filter | Selects only decode-only pods, and is used by the decode scheduling profile. | Prefill-decode disaggregation |
prefill-filter | Selects only prefill-only pods, and is used by the prefill scheduling profile. | Prefill-decode disaggregation |
Scorer​
| Plugin | Description | Reference |
|---|---|---|
active-request-scorer | Scores based on the number of active requests. | Load-aware routing |
load-aware-scorer | Scores based on the number of queued requests. | Load-aware routing |
no-hit-lru-scorer | Scores based on whether the request hits the prefix cache, and if it misses, when each pod last experienced a miss. | Load-aware routing |
precise-prefix-cache-scorer | Scores based on the prefix cache hit rate. | Prefix cache-aware routing |
queue-scorer | Scores based on the number of queued requests. | Load-aware routing |
session-affinity-scorer | Scores based on whether the pod has previously handled a request from the same session. | Load-aware routing |
Picker​
| Plugin | Description | Reference |
|---|---|---|
max-score-picker | Selects the pod with the highest total score. | |
random-picker | Selects a pod randomly without considering scores. | |
weighted-random-picker | Selects a pod randomly with probabilities weighted by their scores. |
Configuration​
global:
imagePullSecrets:
- name: moreh-registry
config:
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: pd-profile-handler
- type: prefill-filter
- type: decode-filter
- type: queue-scorer
- type: kv-cache-utilization-scorer
- type: max-score-picker
schedulingProfiles:
- name: prefill
plugins:
- pluginRef: prefill-filter
- pluginRef: queue-scorer
- pluginRef: kv-cache-utilization-scorer
- pluginRef: max-score-picker
- name: decode
plugins:
- pluginRef: decode-filter
- pluginRef: queue-scorer
- pluginRef: kv-cache-utilization-scorer
- pluginRef: max-score-picker
gateway:
name: mif
gatewayClassName: istio
serviceMonitor:
labels:
release: prometheus-stack
config section​
The config section defines Heimdall's routing rules. As explained earlier, this is achieved by appropriately combining and configuring multiple plugins.
This section begins with config.apiVersion and config.kind, which should be set to inference.networking.x-k8s.io/v1alpha1 and EndpointPickerConfig, respectively.
The config.plugins list instantiates the plugins to be used, specifying their respective parameters. In most cases, a single instance per plugin is sufficient, but you may instantiate the same plugin multiple times under different names if you need to apply different parameters. Each plugin instance is defined by an entry of the following form:
- name: <instanceName>
type: <pluginName>
parameters:
<parameter>: <value>
name(optional): The name used to reference the plugin instance in scheduling profiles. If omitted, the plugin'stypeis used as its name.type(required): The predefined name of the plugin to instantiate.parameters(optional): Configuration options specific to this plugin instance.
The config.schedulingProfiles combines zero or more filter and scorer plugins with exactly one picker plugin to define a rule for making routing decisions. Each scheduling profile is defined by an entry of the following form:
- name: <profileName>
plugins:
- pluginRef: <instanceName>
weight: <weight>
name(required): The name of the scheduling profile.plugins(required): A list of filter, scorer, and picker plugins that make up the scheduling profile.pluginRef(required): The name of the plugin instance as defined in theconfig.pluginssection.weight(optional, for scorer plugins only): A numerical weight that influences the scoring outcome. Higher weights increase the impact of the scorer on the final decision. If omitted, a default weight of 1 is applied.