Version: Dev 🚧

Heimdall plugins

All plugins are declared in the top-level plugins list of EndpointPickerConfig. Heimdall assigns each plugin to the appropriate extension point based on the interfaces it implements. A single plugin can implement multiple interfaces (for example, context-length-aware acts as both a Filter and a Scorer; active-request-scorer acts as both a Scorer and a response lifecycle hook).

  • Profile handlers manage the outer scheduling loop — selecting which profiles to run and aggregating results. They are not referenced from schedulingProfiles.
  • Deciders are helper plugins consumed by disagg-profile-handler (via its deciders.* parameters) or by the legacy pd-profile-handler (via its flat deciderPluginName parameter, prefill decider only). They are declared in the top-level plugins list but not referenced from schedulingProfiles.
  • Filter, Scorer, and Picker run within a SchedulingProfile. Each profile executes them in order: Filters → Scorers → Picker. These are the only plugin types that can be referenced in schedulingProfiles[].plugins[].pluginRef.
  • Pre-request handlers hook into the request path before the scheduler runs. They are activated automatically when declared in the top-level plugins list and are not referenced from schedulingProfiles.
  • Prepare-data plugins enrich the request with derived data (for example tokenized prompts) that downstream plugins can consume. They are activated automatically when declared in the top-level plugins list, provided the prepareDataPlugins feature gate is enabled.
  • Response plugins hook into the response lifecycle after scheduling. They are activated automatically when declared in the top-level plugins list and are not referenced from schedulingProfiles.
  • Data layer plugins (sources and extractors) feed pod metrics into the scheduler. They are referenced from the top-level data field, not from schedulingProfiles, and require the dataLayer feature gate to be enabled.
  • Store plugins manage multi-turn conversation state. They are activated automatically when declared in the top-level plugins list, provided the prepareDataPlugins feature gate is enabled.
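
As a rough sketch of how these roles fit together, the fragment below declares a profile handler, a scorer, and a picker in the top-level plugins list, and references only the scorer and picker from a profile. The profile name default is a placeholder chosen for illustration. The fragment also assumes that a plugin declared without an explicit name can be referenced by its type, which is how the canonical example later on this page does it.

```yaml
plugins:
- type: single-profile-handler   # profile handler: declared here, never referenced from a profile
- type: queue-scorer             # scorer
- type: max-score-picker         # picker
schedulingProfiles:
- name: default                  # hypothetical profile name
  plugins:
  - pluginRef: queue-scorer      # scorers run before the picker
  - pluginRef: max-score-picker
```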

Profile handlers​

single-profile-handler​

Handles a single profile, which is treated as the primary profile. Suitable when you only need one scheduling profile per request.

No parameters.

disagg-profile-handler​

Unified profile handler for disaggregated inference deployments. It orchestrates up to three stages — decode, prefill, and encode — and consults per-stage decider plugins to decide whether each stage should run for a given request.

Stage pipeline:

  1. Decode always runs first and selects the primary endpoint.
  2. Encode (optional) runs next when the deciders.encode plugin decides the request benefits from a dedicated encode stage (for example, multimodal inputs).
  3. Prefill (optional) runs last when the deciders.prefill plugin decides the request has enough uncached tokens to justify dedicated prefill.

When prefill or encode selects an endpoint, disagg-headers-handler (declared separately) writes the chosen endpoint(s) into request headers so the decode pod can reach them.

Parameters use a nested format (preferred). A legacy flat format is still accepted for backward compatibility.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| profiles.decode | string | "decode" | Name of the SchedulingProfile to use for decode endpoints. |
| profiles.prefill | string | "prefill" | Name of the SchedulingProfile to use for prefill endpoints. |
| profiles.encode | string | "encode" | Name of the SchedulingProfile to use for encode endpoints. |
| deciders.prefill | string | (unset) | Name of the decider plugin that decides whether prefill runs. Unset disables the prefill stage. Must be registered on the plugin list. |
| deciders.encode | string | (unset) | Name of the decider plugin that decides whether encode runs. Unset disables the encode stage. Must be registered on the plugin list. |

Legacy flat parameters (deprecated, still accepted — each logs a deprecation warning and maps into the nested form):

| Legacy parameter | Maps to |
| --- | --- |
| decodeProfile | profiles.decode |
| prefillProfile | profiles.prefill |
| encodeProfile | profiles.encode |
| prefillDeciderPluginName | deciders.prefill |
| encodeDeciderPluginName | deciders.encode |
| deciderPluginName | deciders.prefill (lower priority than prefillDeciderPluginName) |

Warning: When deciders.prefill or deciders.encode is set, disagg-profile-handler requires disagg-headers-handler to also be registered. The lookup happens at initialization, so both the referenced decider plugin and disagg-headers-handler must appear earlier in the top-level plugins list than disagg-profile-handler itself.
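
To illustrate the legacy-to-nested mapping above, the sketch below shows the same handler configuration written both ways. The two entries are alternatives for comparison and are not meant to be declared together. The profile and decider names are simply the defaults and the decider used elsewhere on this page.

```yaml
# Alternative 1: legacy flat form (deprecated; each key logs a deprecation warning)
- type: disagg-profile-handler
  parameters:
    decodeProfile: decode
    prefillProfile: prefill
    prefillDeciderPluginName: prefix-based-pd-decider

# Alternative 2: equivalent nested form (preferred)
- type: disagg-profile-handler
  parameters:
    profiles:
      decode: decode
      prefill: prefill
    deciders:
      prefill: prefix-based-pd-decider
```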

Minimal canonical example enabling prefill disaggregation with the prefix-based decider:

plugins:
- type: disagg-headers-handler
- type: prefix-based-pd-decider
  parameters:
    nonCachedTokens: 16
- type: prefill-filter
- type: decode-filter
- type: max-score-picker
- type: disagg-profile-handler
  parameters:
    deciders:
      prefill: prefix-based-pd-decider
schedulingProfiles:
- name: prefill
  plugins:
  - pluginRef: prefill-filter
  - pluginRef: max-score-picker
- name: decode
  plugins:
  - pluginRef: decode-filter
  - pluginRef: max-score-picker

For a full walkthrough, see PD disaggregation.
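
The encode stage can presumably be enabled the same way. The following is an untested sketch that mirrors the prefill example, swapping in always-disagg-multimodal-decider and encode-filter. It assumes the default profile name encode and that the ordering rule from the warning above applies unchanged.

```yaml
plugins:
- type: disagg-headers-handler            # must precede disagg-profile-handler
- type: always-disagg-multimodal-decider  # encode decider, also declared before the handler
- type: encode-filter
- type: decode-filter
- type: max-score-picker
- type: disagg-profile-handler
  parameters:
    deciders:
      encode: always-disagg-multimodal-decider
schedulingProfiles:
- name: encode                            # matches the default profiles.encode value
  plugins:
  - pluginRef: encode-filter
  - pluginRef: max-score-picker
- name: decode
  plugins:
  - pluginRef: decode-filter
  - pluginRef: max-score-picker
```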

pd-profile-handler (legacy)​

Separate profile handler for Prefill-Decode (PD) disaggregation, predating disagg-profile-handler. Still registered in moreh-v0.7.x with its own factory (parameter struct is flat, not nested).

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| decodeProfile | string | "decode" | Name of the SchedulingProfile to use for decode endpoints. |
| prefillProfile | string | "prefill" | Name of the SchedulingProfile to use for prefill endpoints. |
| prefixPluginType | string | "prefix-cache-scorer" | Plugin type of the prefix cache scorer the decider reads from. Must be the registered type string. |
| prefixPluginName | string | (value of prefixPluginType) | Plugin name (instance name) of the prefix cache scorer. |
| primaryPort | int | 0 | When non-zero, rewrites the decode endpoint's port to this value (used with data parallelism). Must be between 1 and 65535 when set. |
| deciderPluginName | string | "prefix-based-pd-decider" | Name of the decider plugin. The referenced plugin must implement the PD decider interface. |

Warning: Like disagg-profile-handler, the decider plugin (and disagg-headers-handler) must appear earlier in the top-level plugins list than pd-profile-handler. New deployments should prefer disagg-profile-handler; pd-profile-handler is kept for backward compatibility with existing heimdall-values.yaml files.
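
For comparison with the nested handler, a legacy declaration might look like the sketch below. It uses only the flat parameters from the table above, sets them to their documented defaults for clarity, and keeps the decider and headers handler ahead of the profile handler. The nonCachedTokens value is reused from the canonical example and is not a recommendation.

```yaml
plugins:
- type: disagg-headers-handler
- type: prefix-based-pd-decider
  parameters:
    nonCachedTokens: 16
- type: pd-profile-handler               # legacy handler, flat parameters only
  parameters:
    decodeProfile: decode
    prefillProfile: prefill
    deciderPluginName: prefix-based-pd-decider
```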


Deciders​

Decider plugins are consumed by disagg-profile-handler through its nested deciders.* parameters, and by the legacy pd-profile-handler through its flat deciderPluginName parameter. pd-profile-handler only supports a prefill decider; encode deciders are exclusive to disagg-profile-handler. Each decider answers one of two questions:

  • Prefill deciders — "should this request run prefill?" Consumed via disagg-profile-handler.deciders.prefill or pd-profile-handler.deciderPluginName.
  • Encode deciders — "should this request run encode?" Consumed via disagg-profile-handler.deciders.encode.

Declare the decider in the top-level plugins list (before the profile handler) and reference it by name.

prefix-based-pd-decider​

Runs prefill only when the request has enough non-cached tokens, based on how many prefix tokens already hit the cache. Prefill decider.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| nonCachedTokens | int | 0 | Minimum number of non-cached tokens required to trigger prefill. With the default 0, P/D disaggregation is disabled and prefill never runs; set a positive threshold to enable it. |

always-disagg-pd-decider​

Always requests prefill. Equivalent to "PD disaggregation enabled for every request." Prefill decider.

No parameters.

always-disagg-multimodal-decider​

Runs encode whenever the incoming request contains multimodal content (image, audio, or video blocks). Encode decider.

No parameters.


Filters​

by-label​

Keeps only pods whose value for the given label is listed in validValues. Pods that lack the label are filtered out unless allowsNoLabel is true.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| label | string | - | The label key to filter by. (Required) |
| validValues | []string | - | List of allowed values for the label. (Required unless allowsNoLabel is true) |
| allowsNoLabel | bool | false | Whether to allow pods that do not have the specified label. |
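
A minimal sketch of a by-label declaration, using only the parameters listed above. The label key and values are hypothetical placeholders.

```yaml
- type: by-label
  parameters:
    label: example.com/accelerator     # hypothetical label key
    validValues: ["a100", "h100"]      # hypothetical allowed values
    allowsNoLabel: false               # reject pods that lack the label
```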

by-label-selector​

Filters out pods that do not match the configured label selector criteria.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| matchLabels | map[string]string | - | Key-value pairs of labels that must match. |
| matchExpressions | []LabelSelectorRequirement | - | List of label selector requirements (set-based matching). |
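
The sketch below shows one way these two parameters might be combined. It assumes LabelSelectorRequirement has the same shape as the Kubernetes type of that name (key, operator, values); this page does not spell that out. All label keys and values are hypothetical.

```yaml
- type: by-label-selector
  parameters:
    matchLabels:
      example.com/tier: gold           # hypothetical equality match
    matchExpressions:
    - key: example.com/zone            # hypothetical set-based match
      operator: In
      values: ["zone-a", "zone-b"]
```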

prefill-filter​

Filters for pods designated with the prefill role. It retains pods whose label mif.moreh.io/role is set to prefill.

No parameters.

decode-filter​

Filters for pods designated with the decode role. It retains pods that satisfy one of the following conditions:

  • The label mif.moreh.io/role is set to decode or both.
  • The label mif.moreh.io/role is not set.

No parameters.

encode-filter​

Filters for pods designated with an encode role. It retains pods whose mif.moreh.io/role label value is one of encode, encode-prefill, or encode-prefill-decode. Pods without the role label are rejected.

No parameters.

context-length-aware​

Also functions as a filter when enableFiltering is set to true. Pods whose label-defined range does not cover the estimated token count of the request are removed. See the scorer section for parameters.


Scorers​

active-request-scorer​

Scores pods based on the number of active (in-flight) requests being served. Scores are normalized from 0 to 1. Also hooks the request/response lifecycle to maintain its in-flight counter.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| requestTimeout | string | "2m" | Go duration string (for example "30s", "1m"). A request older than this is treated as dropped. |
| idleThreshold | int | 0 | Maximum active-request count for a pod to be treated as idle. Idle pods score 1.0. |
| maxBusyScore | float | 1.0 | Upper bound on the score assigned to busy pods (range 0.0-1.0). Lower values widen the gap between idle and busy. |
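
As an illustration of how the three parameters interact, the sketch below treats pods with at most two in-flight requests as idle (score 1.0) and caps every busier pod at 0.5. The specific values are arbitrary examples, not tuning advice.

```yaml
- type: active-request-scorer
  parameters:
    requestTimeout: "30s"   # requests older than 30s are treated as dropped
    idleThreshold: 2        # pods with 0-2 active requests count as idle
    maxBusyScore: 0.5       # busy pods score at most 0.5
```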

load-aware-scorer​

Scores pods based on queue load. Pods with empty or lightly loaded queues receive higher scores.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | int | 128 | Queue-size threshold used when normalizing load. |

no-hit-lru-scorer​

Favors pods that were least recently used for cold requests (requests that missed the prefix cache). Spreads cache growth across pods instead of piling it onto a single pod.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prefixPluginType | string | "prefix-cache-scorer" | Plugin type of the prefix cache scorer whose hit/miss state is observed. |
| prefixPluginName | string | "prefix-cache-scorer" | Plugin name (instance name) of that prefix cache scorer. |
| lruSize | int | 1024 | Maximum number of endpoints tracked in the LRU window. |

precise-prefix-cache-scorer​

Scores pods based on precise prefix-cache KV-block locality, computed from real-time KV-cache events published by each pod. Requires a tokenizer for the target model.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tokenProcessorConfig | Object | Library defaults (vllm scheme, block size 16). | Configuration for the token processor. |
| indexerConfig | Object | Library defaults; tokenizersPoolConfig.modelName must be set. | Configuration for the KV cache indexer. |
| kvEventsConfig | Object | Library defaults. | Configuration for KV events subscription. |
| speculativeIndexing | bool | false | When true, proactively inserts predicted cache entries into the index right after a routing decision, closing the short window between the decision and KV-event arrival. |
| speculativeTTL | string | "2s" | Go duration string. TTL for speculative entries before they are evicted. Ignored when speculativeIndexing is false. |

tokenProcessorConfig​

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| blockSize | int | 16 | Number of tokens per block. Must match the InferenceService's --block-size (the value vLLM is started with on the inference pods). |

indexerConfig​

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| kvBlockIndexConfig | Object | - | Configuration for the KV-block index backend. |
| tokenizersPoolConfig | Object | - | Configuration for the tokenizers pool. (Required; must set modelName.) |

kvBlockIndexConfig​

Configure exactly one backend.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| inMemoryConfig | Object | - | Configuration for in-memory index. |
| redisConfig | Object | - | Configuration for Redis index. |
| valkeyConfig | Object | - | Configuration for Valkey index. |
| costAwareMemoryConfig | Object | - | Configuration for cost-aware memory index. |
| enableMetrics | bool | false | Whether to enable metrics for the indexer. |
| metricsLoggingInterval | string | 0s | Interval for logging metrics (for example, "10s"). |

inMemoryConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| size | int | 1e8 | Maximum number of keys in the index. |
| podCacheSize | int | 10 | Maximum number of pod entries per key. |

redisConfig / valkeyConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| address | string | "redis://127.0.0.1:6379" | Address of the Redis/Valkey server. |
| backendType | string | "redis" | Backend type ("redis" or "valkey"). |
| enableRDMA | bool | false | Enable RDMA (experimental, Valkey only). |

costAwareMemoryConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| size | string | "2GiB" | Maximum memory size (for example "2GiB", "500MiB"). |

tokenizersPoolConfig​

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| modelName | string | - | Base model name for the tokenizer. (Required) |
| workersCount | int | 5 | Number of concurrent tokenizer workers. |
| hf | Object | - | Configuration for HuggingFace tokenizer. |
| local | Object | - | Configuration for local tokenizer. |
| uds | Object | - | Configuration for UDS-based tokenizer. |

hf (HuggingFace Tokenizer)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Enable HuggingFace tokenizer. |
| huggingFaceToken | string | "" | HuggingFace API token. |
| tokenizersCacheDir | string | bin | Directory to cache downloaded tokenizers. |
| tokenizer | string | "" | Specific tokenizer to use (defaults to model name). |
| tokenizerMode | string | "auto" | Tokenizer mode. One of "auto", "hf", "slow", "mistral", "deepseek_v32". |
| tokenizerRevision | string | "" | Revision of the tokenizer. |

local (Local Tokenizer)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| autoDiscoveryDir | string | /mnt/models | Directory to search for tokenizers. |
| autoDiscoveryTokenizerFileName | string | tokenizer.json | Filename to search for. |
| modelTokenizerMap | map[string]string | - | Manual mapping of model names to tokenizer paths. |
| tokenizer | string | "" | Specific tokenizer to use (defaults to model name). |
| tokenizerMode | string | "auto" | Tokenizer mode. One of "auto", "hf", "slow", "mistral", "deepseek_v32". |
| tokenizerRevision | string | "" | Revision of the tokenizer. |

uds (UDS Tokenizer)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| socketFile | string | /tmp/tokenizer/tokenizer-uds.socket | Path to the UDS socket file. |
| useTCP | bool | false | Use TCP instead of Unix domain socket. |
| modelTokenizerMap | map[string]string | - | Manual mapping of model names to tokenizer paths. |

kvEventsConfig​

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| zmqEndpoint | string | - | ZMQ endpoint to connect to (for example tcp://indexer:5557). |
| topicFilter | string | "kv@" | ZMQ topic filter subscription. |
| concurrency | int | 4 | Number of event processing workers. |
| discoverPods | bool | true | Enable automatic pod discovery. |
| podDiscoveryConfig | Object | - | Configuration for pod discovery. |

podDiscoveryConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| podNamespace | string | "" | Namespace to watch pods in (empty = all). |
| socketPort | int | 5557 | Port where pods expose their ZMQ socket. |
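
Because the precise-prefix-cache-scorer parameters nest several levels deep, a sketch of how the pieces above might be assembled may help. It picks the in-memory index backend, relies on pod discovery rather than a fixed zmqEndpoint, and mostly restates documented defaults. The model name is borrowed from the tokenizer example on this page and stands in for whatever model the pods actually serve.

```yaml
- type: precise-prefix-cache-scorer
  parameters:
    tokenProcessorConfig:
      blockSize: 16                    # must match vLLM's --block-size on the pods
    indexerConfig:
      kvBlockIndexConfig:
        inMemoryConfig:                # exactly one backend is configured
          size: 100000000              # 1e8 keys, the documented default
          podCacheSize: 10
      tokenizersPoolConfig:
        modelName: meta-llama/Llama-3.2-1B-Instruct   # required; placeholder model
        workersCount: 5
    kvEventsConfig:
      topicFilter: "kv@"
      concurrency: 4
      discoverPods: true
      podDiscoveryConfig:
        socketPort: 5557
```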

prefix-cache-scorer​

Scores pods based on the length of an approximate prefix match against recent requests, using an in-process LRU indexer. Lighter-weight than precise-prefix-cache-scorer because it does not need a tokenizer or KV-cache event subscription.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| autoTune | bool | true | Automatically tunes blockSizeTokens, maxPrefixBlocksToMatch, and lruCapacityPerServer based on observed model server metrics. |
| blockSizeTokens | int | 16 | Number of tokens per hash block. Requests shorter than one block are ignored. |
| blockSize | int | 0 | Deprecated. Legacy block size expressed in characters. Setting only blockSize (with blockSizeTokens left unset) fails initialization. Prefer blockSizeTokens. |
| maxPrefixBlocksToMatch | int | 256 | Maximum number of prefix blocks to match. Longer prefixes are truncated at this limit. |
| lruCapacityPerServer | int | 31250 | LRU indexer capacity per model server (in blocks). |
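
If fixed values are preferred over auto-tuning, a declaration along the following lines could pin them. This is a sketch that turns autoTune off and restates the documented defaults. Whether explicit values are honored while autoTune remains true is not stated on this page, so the sketch disables it to be safe.

```yaml
- type: prefix-cache-scorer
  parameters:
    autoTune: false                # pin the values below instead of tuning from metrics
    blockSizeTokens: 16            # tokens per hash block
    maxPrefixBlocksToMatch: 256    # 256 blocks x 16 tokens = at most 4096 prefix tokens matched
    lruCapacityPerServer: 31250    # blocks tracked per model server
```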

session-affinity-scorer​

Routes subsequent requests in a session to the same pod as the first request. Relies on the x-session-token HTTP header to maintain affinity:

  1. Response: When a request is served, the plugin sets the x-session-token header on the response. The value is the Base64-encoded name of the serving pod.
  2. Request: For subsequent requests, the client includes this x-session-token header. The scorer decodes it to identify the target pod and assigns it a high score.

No parameters.

kv-cache-utilization-scorer​

Scores pods based on their KV cache utilization (lower utilization yields a higher score).

No parameters.

lora-affinity-scorer​

Scores pods based on LoRA adapter availability and capacity.

No parameters.

queue-scorer​

Scores pods based on their waiting queue size (smaller queue yields a higher score).

No parameters.

running-requests-size-scorer​

Scores pods based on the number of running requests.

No parameters.

context-length-aware​

Scores pods based on how well their context-length range matches the estimated token count of the request. Pods with a matching range receive higher scores. Also functions as a filter when enableFiltering is enabled.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| label | string | "mif.moreh.io/context-length-range" | Pod label whose value specifies context length ranges (format: "min-max", comma-separated for multiple). |
| enableFiltering | bool | false | Whether to also filter out pods that do not match the request's context length. |
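
A sketch of using context-length-aware as both filter and scorer in a single profile. The profile name default and the example pod label value are illustrative assumptions. Since the plugin implements both interfaces, one pluginRef entry is assumed to be enough for Heimdall to attach it to both extension points.

```yaml
# Pod side (assumed example): label mif.moreh.io/context-length-range: "0-8192"
plugins:
- type: single-profile-handler
- type: context-length-aware
  parameters:
    label: mif.moreh.io/context-length-range
    enableFiltering: true          # also drop pods whose range does not cover the request
- type: max-score-picker
schedulingProfiles:
- name: default                    # hypothetical profile name
  plugins:
  - pluginRef: context-length-aware
  - pluginRef: max-score-picker
```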

predicted-latency-scorer​

Advanced scorer that predicts TTFT (time-to-first-token) and TPOT (time-per-output-token) per pod using an online running-request model, then scores pods so the request is routed to the pod most likely to meet its latency SLO. Emits per-pod latency metrics.

Warning: This scorer has a large parameter surface (20+ fields covering sampling, headroom weights, affinity gates, and selection strategies). Most deployments should leave every field at its default. For the complete parameter list, refer to pkg/epp/framework/plugins/scheduling/scorer/predictedlatency/scorer.go in the moreh-dev/heimdall-inference-extension repository, and tune only after establishing a baseline.


Pickers​

max-score-picker​

Picks the pod(s) with the maximum score from the list of candidates.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxNumOfEndpoints | int | 1 | Maximum number of endpoints to pick. |

random-picker​

Picks random pod(s) from the candidates.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxNumOfEndpoints | int | 1 | Maximum number of endpoints to pick. |

weighted-random-picker​

Picks pod(s) based on weighted random sampling (A-Res algorithm) derived from their scores.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| maxNumOfEndpoints | int | 1 | Maximum number of endpoints to pick. |

Pre-request handlers​

disagg-headers-handler​

Publishes the endpoints selected by disagg-profile-handler or the legacy pd-profile-handler as request headers, so the decode pod can reach prefill / encode pods:

  • mif-prefill-endpoint: host:port of the prefill endpoint, when prefill ran.
  • mif-encode-endpoints: comma-separated host:port list of the encode endpoints, when encode ran.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prefillProfile | string | "prefill" | Name of the SchedulingProfile whose result provides the prefill endpoint. |
| encodeProfile | string | "encode" | Name of the SchedulingProfile whose result provides the encode endpoint list. |

Info: prefill-header-handler is kept as a legacy alias that resolves to this same plugin (both names share DisaggHeadersHandlerFactory). Existing heimdall-values.yaml files that reference prefill-header-handler continue to work.
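
When the scheduling profiles use non-default names, the headers handler presumably needs to be told the same names as the profile handler. The sketch below shows that pairing under this assumption. The profile name prefill-long is hypothetical.

```yaml
plugins:
- type: disagg-headers-handler
  parameters:
    prefillProfile: prefill-long   # hypothetical; should match profiles.prefill below
- type: prefix-based-pd-decider
  parameters:
    nonCachedTokens: 16
- type: disagg-profile-handler
  parameters:
    profiles:
      prefill: prefill-long
    deciders:
      prefill: prefix-based-pd-decider
```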


Prepare-data plugins​

tokenizer​

Runs a tokenizer on each incoming request and stores the tokenized prompt on the request so downstream plugins (for example precise-prefix-cache-scorer, disagg-profile-handler's deciders) can reuse it without re-tokenizing. Fails open: if tokenization errors, the request continues with no tokenized prompt attached.

Warning: This plugin only activates when the prepareDataPlugins feature gate is enabled. Add featureGates: [prepareDataPlugins] to the top of your EndpointPickerConfig; otherwise the plugin registration is silently skipped.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| modelName | string | - | Base model name for the tokenizer. (Required) |
| udsTokenizerConfig | Object | (unset) | Unix domain socket tokenizer configuration. When unset, falls back to the in-process default tokenizer. |

udsTokenizerConfig​

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| socketFile | string | /tmp/tokenizer/tokenizer-uds.socket | Path to the tokenizer UDS socket. |

Example:

- type: tokenizer
  parameters:
    modelName: meta-llama/Llama-3.2-1B-Instruct
    udsTokenizerConfig:
      socketFile: /tmp/tokenizer/tokenizer-uds.socket
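
Since the plugin is skipped without the feature gate, it may help to see the gate and the plugin in one fragment. The sketch below places featureGates at the top level of the EndpointPickerConfig, as the warning above describes, and omits udsTokenizerConfig so the in-process default tokenizer is used.

```yaml
featureGates:
- prepareDataPlugins               # without this gate the tokenizer plugin is silently skipped
plugins:
- type: tokenizer
  parameters:
    modelName: meta-llama/Llama-3.2-1B-Instruct
```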

Response plugins​

Response plugins hook into the response lifecycle. They are invoked by the request-control layer in the following order:

  1. ResponseReceived — Called when response headers arrive from the model server, indicating the beginning of response handling.
  2. ResponseStreaming — Called after each chunk of a streaming response is sent.
  3. ResponseComplete — Called when the request lifecycle terminates (response fully sent, or request failed/disconnected after a pod was scheduled). This is the final cleanup hook.

response-header-handler​

Adds serving-pod information to the response headers. Implements the ResponseReceived extension point.

  • x-decoder-host-port: Always set to the address and port of the pod that handled the decode phase (the primary target).
  • x-prefiller-host-port: Set to the address and port of the prefill pod, if a separate prefill pod was used (PD disaggregation).

No parameters.

Info: When heimdall-proxy is deployed with --response-header, the proxy natively sets the same headers. In that case, this plugin is not needed.


Data layer plugins​

Data layer plugins feed pod-level signals (metrics, running model names, and so on) into the scheduler. They are declared in the top-level plugins list and wired together through the data field of EndpointPickerConfig: a DataLayerSource references a source plugin via pluginRef and attaches a list of extractor plugins.

models-data-source​

Polls each pod's /v1/models endpoint (or a configurable path) to discover which models are currently being served.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| scheme | string | "http" | URL scheme used to reach the pod ("http" or "https"). |
| path | string | "/v1/models" | URL path of the models endpoint. |
| insecureSkipVerify | bool | true | Skip TLS certificate verification on the pod connection. |

model-server-protocol-models​

Extracts the list of running model identifiers from a models-data-source and publishes them on the pod's data-layer record, where downstream plugins can read them.

No parameters.
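
The wiring between a source and its extractors might look like the sketch below. The dataLayer feature gate and the pluginRef key come from this page; the sources and extractors key names under data are assumptions made for illustration and should be checked against the EndpointPickerConfig schema.

```yaml
featureGates:
- dataLayer                        # required for data layer plugins
plugins:
- type: models-data-source
  parameters:
    scheme: http
    path: /v1/models
- type: model-server-protocol-models
data:
  sources:                         # assumed key name
  - pluginRef: models-data-source
    extractors:                    # assumed key name
    - pluginRef: model-server-protocol-models
```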


Store plugins​

responses-store​

Persists multi-turn conversation state for the OpenAI Responses API (/v1/responses with previous_response_id). Exposes PrepareDataPlugin (to look up prior responses on request), ResponseStreaming (to accumulate streamed chunks), and ResponseComplete (to commit the final response to the store).

Warning: This plugin only activates when the prepareDataPlugins feature gate is enabled. Add featureGates: [prepareDataPlugins] to the top of your EndpointPickerConfig; otherwise the plugin registration is silently skipped.

Supported backends: in-memory or a Redis/Valkey-based tier with optional MongoDB tier-2 sync. Omit storeConfig entirely to use the default in-memory backend (ttl: 24h). When storeConfig is set, configure at least one of storeConfig.inMemoryConfig or storeConfig.tieredConfig; if both are set, tieredConfig takes precedence (Redis is required inside the tiered backend; MongoDB is optional).

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| storeConfig | Object | in-memory backend, ttl: 24h | Backend selection and configuration. When omitted, Heimdall uses an in-memory backend with a 24-hour TTL. |

Example (in-memory backend):

- type: responses-store
  parameters:
    storeConfig:
      inMemoryConfig:
        ttl: 24h
        maxEntries: 10000
        maxEntryBytes: 1048576

Example (tiered Redis + MongoDB backend):

- type: responses-store
  parameters:
    storeConfig:
      tieredConfig:
        redis:
          address: redis://redis.responses-store.svc:6379
          ttl: 24h
        mongo:
          uri: mongodb://mongo.responses-store.svc:27017
          database: heimdall
          collection: responses
          ttl: 720h
        stream:
          key: heimdall:responses:mongo_sync
          consumerGroup: mongo-sync
          batchSize: 100
          blockTimeout: 1s
          claimAge: 30s

storeConfig​

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| inMemoryConfig | Object | - | Configuration for the in-memory backend. Used when tieredConfig is not set; if both are configured, tieredConfig takes precedence. |
| tieredConfig | Object | - | Configuration for the Redis/Valkey + optional MongoDB tiered backend. Takes precedence over inMemoryConfig when both are configured. At least one of the two backends must be set. |

inMemoryConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| ttl | string | "24h" | Go duration string. TTL applied to entries. |
| maxEntries | int | 10000 | Maximum number of entries retained in memory. |
| maxEntryBytes | int | 1048576 | Maximum size in bytes for a single entry. |

tieredConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| redis | Object | - | Required Redis/Valkey configuration. Used for tier-1 storage and stream coordination. |
| mongo | Object | - | Optional MongoDB configuration. When set, enables tier-2 sync via the Redis stream consumer. |
| stream | Object | - | Stream tuning for the Redis-to-Mongo sync goroutine. |

redis

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| address | string | - | Standalone redis:// or valkey:// URL. Mutually exclusive with addresses. |
| addresses | []string | - | Host:port entries. Combine with masterName for Sentinel mode; multiple bare entries select Cluster. |
| masterName | string | "" | Sentinel master name. Required for Sentinel mode. |
| username | string | "" | Username used to authenticate to Redis/Valkey. |
| password | string | "" | Password used to authenticate to Redis/Valkey. |
| db | int | 0 | Database index. |
| ttl | string | "24h" | Go duration string. TTL applied to entries stored in Redis/Valkey. |
| maxEntryBytes | int | 1048576 | Maximum size in bytes for a single entry. |

mongo

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| uri | string | - | MongoDB connection URI. |
| database | string | "heimdall" | Database name. |
| collection | string | "responses" | Collection name. |
| ttl | string | "720h" | Go duration string. TTL applied to entries. |
| timeout | string | "500ms" | Go duration string. Per-operation timeout. |

stream

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| key | string | "heimdall:responses:mongo_sync" | Redis stream key used to buffer MongoDB syncs. |
| consumerGroup | string | "mongo-sync" | Stream consumer group name. |
| maxLen | int64 | 1000000 | Maximum stream length retained. |
| batchSize | int | 100 | Number of entries claimed per batch. |
| blockTimeout | string | "1s" | Go duration string. Block timeout when reading. |
| claimAge | string | "30s" | Go duration string. Minimum age to re-claim entries. |
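
The redis parameters above also allow a Sentinel setup. The following untested sketch combines addresses with masterName, as the table describes, and leaves out mongo so only tier-1 storage is used. The host names and master name are hypothetical.

```yaml
- type: responses-store
  parameters:
    storeConfig:
      tieredConfig:
        redis:
          addresses:                       # Sentinel endpoints (hypothetical hosts)
          - sentinel-0.example.svc:26379
          - sentinel-1.example.svc:26379
          - sentinel-2.example.svc:26379
          masterName: mymaster             # hypothetical; required for Sentinel mode
          ttl: 24h
```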