
Heimdall plugins

All plugins are declared in the top-level plugins list of EndpointPickerConfig. Heimdall assigns each plugin to the appropriate extension point based on the interfaces it implements. A single plugin can implement multiple interfaces (e.g., context-length-aware acts as both a Filter and a Scorer).

  • Profile handlers manage the outer scheduling loop — selecting which profiles to run and aggregating results. They are not referenced from schedulingProfiles.
  • Filter, Scorer, and Picker run within a SchedulingProfile. Each profile executes them in order: Filters → Scorers → Picker. These are the only plugin types that can be referenced in schedulingProfiles[].plugins[].pluginRef.
  • Response plugins hook into the response lifecycle after scheduling. They are automatically activated when declared in the top-level plugins list and are not referenced from schedulingProfiles.
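
The layering above can be sketched as a single configuration. This is a hypothetical example, not a canonical one: the plugin names come from this page, but the exact EndpointPickerConfig envelope (apiVersion, field ordering) should be checked against your deployed version.

```yaml
# Hypothetical EndpointPickerConfig sketch — verify the schema against your version.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: single-profile-handler      # profile handler: never referenced below
  - type: by-label                    # Filter
    parameters:
      label: mif.moreh.io/role
      validValues: ["decode"]
  - type: queue-scorer                # Scorer
  - type: max-score-picker            # Picker
  - type: response-header-handler     # Response plugin: auto-activated
schedulingProfiles:
  - name: default
    plugins:                          # Filters -> Scorers -> Picker only
      - pluginRef: by-label
      - pluginRef: queue-scorer
      - pluginRef: max-score-picker
```

Note that the profile handler and the response plugin appear only in the top-level plugins list, while the filter, scorer, and picker are also referenced by pluginRef inside the profile.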

Profile handlers​

single-profile-handler​

Handles a single scheduling profile, which is always the primary profile.

No parameters.

pd-profile-handler​

Handles scheduler profiles for Prefill-Decode (PD) disaggregation.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | int | 0 | Threshold for decoding operations. |
| decodeProfile | string | "decode" | Name of the profile to use for decode operations. |
| prefillProfile | string | "prefill" | Name of the profile to use for prefill operations. |
| prefixPluginType | string | "prefix-cache-scorer" | Type of the prefix cache plugin to use. |
| prefixPluginName | string | "prefix-cache-scorer" | Name of the prefix cache plugin to use. |
| hashBlockSize | int | 64 | Block size used for hashing tokens. |
| primaryPort | int | 0 | Port number of the primary container (0 to disable). |
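
A hedged sketch of a PD setup using the defaults above (the two profile names match the decodeProfile/prefillProfile defaults; the filter and picker choices are illustrative):

```yaml
plugins:
  - type: pd-profile-handler
    parameters:
      decodeProfile: decode
      prefillProfile: prefill
      hashBlockSize: 64
  - type: prefill-filter
  - type: decode-filter
  - type: max-score-picker
schedulingProfiles:
  - name: prefill                 # selected by the handler for the prefill phase
    plugins:
      - pluginRef: prefill-filter
      - pluginRef: max-score-picker
  - name: decode                  # selected by the handler for the decode phase
    plugins:
      - pluginRef: decode-filter
      - pluginRef: max-score-picker
```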

Filters​

by-label​

Filters out pods whose value for the given label is not among the allowed values.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| label | string | - | The label key to filter by. (Required) |
| validValues | []string | - | List of allowed values for the label. (Required unless allowsNoLabel is true) |
| allowsNoLabel | bool | false | Whether to allow pods that do not have the specified label. |
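
For illustration (the label key and values are examples, not requirements):

```yaml
- type: by-label
  parameters:
    label: mif.moreh.io/role         # required
    validValues: ["decode", "both"]  # pods with other values are filtered out
    allowsNoLabel: true              # also keep pods without the label
```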

by-label-selector​

Filters out pods that do not match the configured label selector criteria.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| matchLabels | map[string]string | - | Key-value pairs of labels that must match. |
| matchExpressions | []LabelSelectorRequirement | - | List of label selector requirements (set-based matching). |
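
Assuming matchExpressions follows the standard Kubernetes LabelSelectorRequirement shape (key, operator, values), a sketch with illustrative label keys:

```yaml
- type: by-label-selector
  parameters:
    matchLabels:
      app: vllm                      # equality-based match
    matchExpressions:
      - key: mif.moreh.io/role       # set-based match
        operator: In
        values: ["decode", "both"]
```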

prefill-filter​

Filters for pods designated with the prefill role. It retains pods that have the label mif.moreh.io/role set to prefill.

No parameters.

decode-filter​

Filters for pods designated with the decode role. It retains pods that satisfy one of the following conditions:

  • The label mif.moreh.io/role is set to decode or both.
  • The label mif.moreh.io/role is not set.

No parameters.

context-length-aware​

Also functions as a filter when enableFiltering is set to true: pods whose label-defined context-length range does not cover the estimated token count of the request are removed. See the context-length-aware entry in the Scorers section for the parameter list.


Scorers​

active-request-scorer​

Scores pods based on the number of active requests being served. Scores are normalized from 0 to 1.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| requestTimeout | string | "2m" | Duration to consider a request active (e.g., "30s", "1m"). |

load-aware-scorer​

Scores pods based on load (waiting queue size). Pods with empty queues get higher scores.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | int | 128 | Queue size threshold for scoring. |

no-hit-lru-scorer​

Favors pods that were least recently used for cold requests to distribute cache growth.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prefixPluginType | string | "prefix-cache-scorer" | Type of the prefix cache plugin. |
| prefixPluginName | string | "prefix-cache-scorer" | Name of the prefix cache plugin. |
| lruSize | int | 1024 | Size of the LRU cache. |

precise-prefix-cache-scorer​

Scores pods based on precise prefix-cache KV-block locality using an internal indexer.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tokenProcessorConfig | Object | - | Configuration for the token processor. |
| indexerConfig | Object | - | Configuration for the KV cache indexer. |
| kvEventsConfig | Object | - | Configuration for KV events subscription. |

tokenProcessorConfig​

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| blockSize | int | 16 | Number of tokens per block. |
| hashSeed | string | "" | Seed for computing block hashes. Should match PYTHONHASHSEED. |

indexerConfig​

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| kvBlockIndexConfig | Object | - | Configuration for the KV-block index backend. |
| tokenizersPoolConfig | Object | - | Configuration for the tokenizers pool. |

kvBlockIndexConfig​

Only one of the following backends should be configured.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| inMemoryConfig | Object | - | Configuration for in-memory index. |
| redisConfig | Object | - | Configuration for Redis index. |
| valkeyConfig | Object | - | Configuration for Valkey index. |
| costAwareMemoryConfig | Object | - | Configuration for cost-aware memory index. |
| enableMetrics | bool | false | Whether to enable metrics for the indexer. |
| metricsLoggingInterval | string | "0s" | Interval for logging metrics (e.g., "10s"). |

inMemoryConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| size | int | 1e8 | Maximum number of keys in the index. |
| podCacheSize | int | 10 | Maximum number of pod entries per key. |

redisConfig / valkeyConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| address | string | "redis://127.0.0.1:6379" | Address of the Redis/Valkey server. |
| backendType | string | "redis" | Backend type ("redis" or "valkey"). |
| enableRDMA | bool | false | Enable RDMA (experimental, Valkey only). |

costAwareMemoryConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| size | string | "2GiB" | Maximum memory size (e.g., "2GiB", "500MiB"). |

tokenizersPoolConfig​

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| modelName | string | - | Base model name for the tokenizer. (Required) |
| workersCount | int | 5 | Number of concurrent tokenizer workers. |
| hf | Object | - | Configuration for HuggingFace tokenizer. |
| local | Object | - | Configuration for local tokenizer. |
| uds | Object | - | Configuration for UDS-based tokenizer. |

hf (HuggingFace Tokenizer)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Enable HuggingFace tokenizer. |
| huggingFaceToken | string | "" | HuggingFace API token. |
| tokenizersCacheDir | string | "bin" | Directory to cache downloaded tokenizers. |
| tokenizer | string | "" | Specific tokenizer to use (defaults to model name). |
| tokenizerMode | string | "auto" | Tokenizer mode ("auto", "hf", "limit", "mistral"). |
| tokenizerRevision | string | "" | Revision of the tokenizer. |

local (Local Tokenizer)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| autoDiscoveryDir | string | "/mnt/models" | Directory to search for tokenizers. |
| autoDiscoveryTokenizerFileName | string | "tokenizer.json" | Filename to search for. |
| modelTokenizerMap | map[string]string | - | Manual mapping of model names to tokenizer paths. |
| tokenizer | string | "" | Specific tokenizer to use (defaults to model name). |
| tokenizerMode | string | "auto" | Tokenizer mode ("auto", "hf", "limit", "mistral"). |
| tokenizerRevision | string | "" | Revision of the tokenizer. |

uds (UDS Tokenizer)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| socketFile | string | "/tmp/tokenizer/tokenizer-uds.socket" | Path to the UDS socket file. |
| useTCP | bool | false | Use TCP instead of Unix domain socket. |
| modelTokenizerMap | map[string]string | - | Manual mapping of model names to tokenizer paths. |

kvEventsConfig​

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| zmqEndpoint | string | - | ZMQ endpoint to connect to (e.g., "tcp://indexer:5557"). |
| topicFilter | string | "kv@" | ZMQ topic filter subscription. |
| concurrency | int | 4 | Number of event processing workers. |
| discoverPods | bool | true | Enable automatic pod discovery. |
| podDiscoveryConfig | Object | - | Configuration for pod discovery. |

podDiscoveryConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| podLabelSelector | string | "llm-d.ai/inferenceServing=true" | Label selector to find pods. |
| podNamespace | string | "" | Namespace to watch pods in (empty = all). |
| socketPort | int | 5557 | Port where pods expose their ZMQ socket. |
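
Putting the nested blocks together, a precise-prefix-cache-scorer declaration might look like the following. All values are illustrative (the model name in particular is an example), and hashSeed should match the model server's PYTHONHASHSEED:

```yaml
- type: precise-prefix-cache-scorer
  parameters:
    tokenProcessorConfig:
      blockSize: 16
      hashSeed: "42"                  # must match PYTHONHASHSEED on the server
    indexerConfig:
      kvBlockIndexConfig:
        inMemoryConfig:               # configure exactly one backend
          size: 100000000
          podCacheSize: 10
        enableMetrics: true
        metricsLoggingInterval: "10s"
      tokenizersPoolConfig:
        modelName: meta-llama/Llama-3.1-8B-Instruct   # example model
        workersCount: 5
        hf:
          enabled: true
          huggingFaceToken: ""        # set if the tokenizer repo is gated
    kvEventsConfig:
      zmqEndpoint: "tcp://indexer:5557"
      topicFilter: "kv@"
      discoverPods: true
      podDiscoveryConfig:
        podLabelSelector: "llm-d.ai/inferenceServing=true"
        socketPort: 5557
```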

session-affinity-scorer​

Routes subsequent requests in a session to the same pod as the first request.

This scorer relies on the x-session-token HTTP header to maintain session affinity:

  1. Response: When a request is served, the plugin sets the x-session-token header in the response with the Base64-encoded name of the serving pod.
  2. Request: For subsequent requests, the client must include this x-session-token header. The scorer decodes it to identify the target pod and assigns it a high score.

No parameters.

kv-cache-utilization-scorer​

Scores pods based on their KV cache utilization (lower utilization = higher score).

No parameters.

lora-affinity-scorer​

Scores pods based on LoRA adapter availability and capacity.

No parameters.

queue-scorer​

Scores pods based on their waiting queue size (smaller queue = higher score).

No parameters.

running-requests-size-scorer​

Scores pods based on their number of running requests.

No parameters.

context-length-aware​

Scores pods based on how well their context length range matches the estimated token count of the request. Pods with a matching range receive higher scores. Also functions as a filter when enableFiltering is enabled.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| label | string | "mif.moreh.io/context-length-range" | Pod label whose value specifies context length ranges (format: "min-max", comma-separated for multiple). |
| enableFiltering | bool | false | Whether to also filter out pods that do not match the request's context length. |
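
For example, combining the pod-label format with the plugin parameters (the range value is illustrative):

```yaml
# Pod label on the model-server pods, e.g.:
#   mif.moreh.io/context-length-range: "4096-32768"
- type: context-length-aware
  parameters:
    label: mif.moreh.io/context-length-range
    enableFiltering: true   # also drop pods whose ranges cannot cover the request
```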

prefix-cache-scorer​

Scores pods based on the length of the prefix match for the request prompt.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| autoTune | bool | true | Whether to automatically tune configuration based on metrics. |
| blockSize | int | 64 | Size of a token block for hashing. |
| maxPrefixBlocksToMatch | int | 256 | Maximum number of blocks to match for prefix caching. |
| lruCapacityPerServer | int | 31250 | Estimated LRU capacity per model server (in blocks). |
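
A minimal sketch pinning the defaults explicitly instead of relying on auto-tuning (values taken from the table above):

```yaml
- type: prefix-cache-scorer
  parameters:
    autoTune: false              # keep the settings below fixed
    blockSize: 64
    maxPrefixBlocksToMatch: 256
    lruCapacityPerServer: 31250
```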

Pickers​

max-score-picker​

Picks the pod(s) with the maximum score from the list of candidates.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxNumOfEndpoints | int | 1 | Maximum number of endpoints to pick. |

random-picker​

Picks random pod(s) from the candidates.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxNumOfEndpoints | int | 1 | Maximum number of endpoints to pick. |

weighted-random-picker​

Picks pod(s) based on weighted random sampling derived from their scores.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxNumOfEndpoints | int | 1 | Maximum number of endpoints to pick. |
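
A picker always runs last in a profile. As a sketch of a profile combining scorers with weighted random picking (the per-scorer weight field is assumed from similar endpoint-picker configs and may differ in your version):

```yaml
plugins:
  - type: queue-scorer
  - type: kv-cache-utilization-scorer
  - type: weighted-random-picker
    parameters:
      maxNumOfEndpoints: 2       # sample two endpoints by score weight
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: queue-scorer
        weight: 1                # assumed field, not documented above
      - pluginRef: kv-cache-utilization-scorer
        weight: 2
      - pluginRef: weighted-random-picker
```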

Response plugins​

Response plugins hook into the response lifecycle. They are invoked by the request-control layer in the following order:

  1. ResponseReceived — Called when response headers arrive from the model server, indicating the beginning of response handling.
  2. ResponseStreaming — Called after each chunk of a streaming response is sent.
  3. ResponseComplete — Called when the request lifecycle terminates (response fully sent, or request failed/disconnected after a pod was scheduled). This is the final cleanup hook.

response-header-handler​

Adds serving pod information to the response headers. Implements the ResponseReceived extension point.

  • x-decoder-host-port: Always set to the address and port of the pod that handled the decode phase (the primary target).
  • x-prefiller-host-port: Set to the address and port of the prefill pod, if a separate prefill pod was used (PD disaggregation).

No configuration parameters.

> **info:** When heimdall-proxy is deployed with --response-header, the proxy natively sets the same headers. In that case, this plugin is not needed.