Prefill-Decode Disaggregation
During LLM inference, computation occurs in two phases: prefill and decode. In the prefill phase, the model processes the entire input prompt to generate the first token — a highly parallel, compute-bound process. The decode phase then predicts one token at a time, reusing the growing KV cache, and is memory-bound.
Because these phases have fundamentally different characteristics, prefill-decode (PD) disaggregation executes them on separate GPU resources. Prefill runs first on compute-optimized machines; the KV cache is then transferred to memory-optimized machines for decoding. This separation lets each phase use its own optimal parallelization strategy, batch size, and configuration, and prevents prefill work for incoming requests from interfering with ongoing decoding.
PD disaggregation can improve key metrics such as time to first token (TTFT) and time per output token (TPOT) — since TTFT depends on prefill and TPOT on decode, dedicated optimization for each leads to better overall performance. However, because it also introduces communication overhead, which may negatively affect TTFT, PD disaggregation should be applied judiciously to ensure net efficiency gains.
Key features
- The Heimdall scheduler manages prefill-only and decode-only instances separately, allowing each to scale independently, and routes requests between them.
- The framework can automatically determine whether to apply PD disaggregation and how to scale each phase according to defined service level objectives (SLOs).
- Moreh vLLM is optimized to efficiently execute both prefill and decode phases of various models on AMD MI200 and MI300 series GPUs. It applies distinct parallelization and optimization strategies tailored to prefill-only and decode-only instances.
Configuration
This section covers only the differences from the standard deployment described in the Quickstart. Read the Quickstart first to understand the base setup.
To enable PD disaggregation, you must configure both Heimdall (to route requests appropriately) and Odin InferenceServices (to deploy prefill and decode instances with correct roles and hardware).
Heimdall configuration
In the Heimdall configuration, you need to enable the pd-profile-handler plugin. This plugin selects the appropriate prefill or decode pod using the prefill and decode scheduling profiles. The prefill-filter and decode-filter plugins filter pods by the mif.moreh.io/role label within their respective profiles. The configuration is part of the EndpointPickerConfig.
Consider the following heimdall-values.yaml example:
```yaml
# ... (other values)
config:
  plugins:
    # 1. Enable PD profile handler
    - type: pd-profile-handler
    - type: prefill-filter
    - type: decode-filter
    # 2. Add other necessary plugins
    - type: queue-scorer
    - type: max-score-picker
  # 3. Define scheduling profiles for prefill and decode
  schedulingProfiles:
    - name: prefill
      plugins:
        - pluginRef: prefill-filter
        - pluginRef: queue-scorer
        - pluginRef: max-score-picker
    - name: decode
      plugins:
        - pluginRef: decode-filter
        - pluginRef: queue-scorer
        - pluginRef: max-score-picker
# ... (other values)
```
InferenceService configuration with presets
For the InferenceService, use presets (InferenceServiceTemplate) to simplify the configuration of hardware-specific settings (like parallelism strategy) for different GPU types. For more details on presets, see the Presets documentation.
PD disaggregation requires high-bandwidth network connectivity (e.g., RDMA) between prefill and decode pods for KV cache transfer. The specific network resource configuration (such as mellanox/hca) varies by cluster. Consult your cluster administrator to determine the correct network resource type and limits for your environment.
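To find out which RDMA-capable network resources your nodes actually expose, you can inspect a node's allocatable resources. This is a sketch using standard kubectl; the node name is a placeholder, and the exact resource name (e.g., `mellanox/hca`) depends on the device plugin installed in your cluster:

```shell
# Print the allocatable resources on a node; RDMA NICs appear alongside
# CPU, memory, and GPU entries (e.g., "mellanox/hca": "8")
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
```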
List available prefill and decode presets for a specific model:
```shell
kubectl get inferenceservicetemplate -n mif \
  -l mif.moreh.io/template.type=preset \
  -l mif.moreh.io/model.name=deepseek-r1
```
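Before referencing a preset in an InferenceService, it can be useful to inspect what it configures. A minimal sketch using standard kubectl; `<template-name>` is a placeholder for one of the names returned by the listing:

```shell
# Dump a preset's full spec to see the parallelism strategy and
# hardware settings it applies
kubectl get inferenceservicetemplate <template-name> -n mif -o yaml
```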
In this example, DeepSeek-R1 is deployed using:
- Prefill: AMD MI300X (DP=8, Expert Parallel)
- Decode: AMD MI308X (DP=8, Expert Parallel)
Two InferenceService resources are created. Because the presets use data parallelism (dp8-moe-ep8), the runtime-bases are vllm-prefill-dp and vllm-decode-dp, and the ISVC uses workerTemplate instead of template.
Choose the appropriate PD-specific runtime-base template according to your parallelism strategy:
- TP-only (Tensor Parallel): Use vllm-prefill and vllm-decode.
- DP (Data Parallel): Use vllm-prefill-dp and vllm-decode-dp.
- PP (Pipeline Parallel): Use vllm-prefill-pp and vllm-decode-pp.
In production environments, it is highly recommended to use offline hub templates (e.g., vllm-hf-hub-offline, vllm-dp-hf-hub-offline, or vllm-pp-hf-hub-offline) instead of downloading models with HF_TOKEN, so that pre-downloaded models are loaded from a Persistent Volume. This ensures reliability by avoiding dependencies on external network conditions during pod startup. These templates require a PVC named models in your namespace. Refer to Hugging Face model management with persistent volume for more details.
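A minimal sketch of the PVC the offline hub templates expect; the access mode, size, and storage class below are assumptions to adjust for your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models               # the offline hub templates expect this exact name
spec:
  accessModes:
    - ReadOnlyMany           # assumption: multiple pods read pre-downloaded models
  resources:
    requests:
      storage: 2Ti           # assumption: size it for your model weights
  storageClassName: <your-storage-class>
```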
1. Prefill Service (MI300X)
This service uses the MI300X prefill preset and the vllm-prefill-dp runtime-base for data-parallel prefill.
```yaml
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
metadata:
  name: deepseek-r1-prefill
spec:
  inferencePoolRefs:
    - name: heimdall
  templateRefs:
    # 1. Runtime base for data-parallel prefill
    - name: vllm-prefill-dp
    # 2. Preset for DeepSeek-R1 on MI300X with DP=8 and Expert Parallel
    - name: moreh-vllm-0.15.0-260226-rc2-deepseek-ai-deepseek-r1-prefill-amd-mi300x-dp8-moe-ep8
  workerTemplate:
    spec:
      containers:
        - name: main
          env:
            - name: HF_TOKEN
              value: <huggingFaceToken>
          # 3. Specify cluster-specific hardware resources (e.g., RDMA NICs for KV cache transfer)
          resources:
            limits:
              mellanox/hca: "1"
```
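After applying the manifest, you can confirm that the resulting pods carry the role label the prefill-filter plugin matches on. A sketch assuming the `mif.moreh.io/role` label described in the Heimdall configuration; the namespace is a placeholder:

```shell
# List pods labeled as prefill instances; the prefill-filter plugin
# selects pods by this label within the prefill scheduling profile
kubectl get pods -n <namespace> -l mif.moreh.io/role=prefill
```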
2. Decode Service (MI308X)
This service uses the MI308X decode preset and the vllm-decode-dp runtime-base for data-parallel decode.
```yaml
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
metadata:
  name: deepseek-r1-decode
spec:
  inferencePoolRefs:
    - name: heimdall
  templateRefs:
    # 1. Runtime base for data-parallel decode
    - name: vllm-decode-dp
    # 2. Decode preset for DeepSeek-R1 on MI308X with DP=8 and Expert Parallel
    - name: moreh-vllm-0.15.0-260226-rc2-deepseek-ai-deepseek-r1-decode-amd-mi308x-dp8-moe-ep8
  workerTemplate:
    spec:
      initContainers:
        - name: proxy
          env:
            # 3. (Optional) Enable response headers for debugging PD routing
            - name: ISVC_EXTRA_ARGS
              value: >-
                --pd-coordinator vllm/nixl
                --response-header
                --log-format json
                --log-level warn
      containers:
        - name: main
          env:
            - name: HF_TOKEN
              value: <huggingFaceToken>
          # 4. Specify cluster-specific hardware resources (e.g., RDMA NICs for KV cache transfer)
          resources:
            limits:
              mellanox/hca: "1"
```
When --response-header is set on the decode proxy, responses include x-decoder-host-port and x-prefiller-host-port headers. This is useful for verifying that PD routing is working correctly. Remove it in production to avoid leaking internal pod addresses.
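One way to check the headers is to send a test request and print the response headers. A sketch assuming an OpenAI-compatible completions endpoint; the gateway address and model name are placeholders for your deployment:

```shell
# Send a test request and dump response headers to verify PD routing;
# look for x-prefiller-host-port and x-decoder-host-port in the output
curl -sD - -o /dev/null http://<gateway-address>/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model-name>", "prompt": "Hello", "max_tokens": 8}'
```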