Version: v0.4.0

Prefix Cache-aware Routing

Prefix caching is a technique that stores the KV cache from previous queries, allowing subsequent queries with an identical prefix to reuse it. This eliminates redundant computation and significantly improves performance for workloads with common prefixes, such as system prompts, conversation history, or shared contextual documents.

In a system with multiple inference instances (pods), each instance maintains its own (L1) prefix cache in GPU memory. Consequently, the cache hit rate varies depending on which instance a request is routed to. Prefix cache-aware routing calculates the potential cache hit rate for each pod and prioritizes routing to the pod with the highest coverage. This reduces redundant KV computation, improving both Time to First Token (TTFT) and overall throughput.

In a production environment, cache hit rates are considered alongside other factors, such as pod load and hardware characteristics, to make optimal routing decisions.
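As a rough illustration of the idea (not Heimdall's actual implementation), a scorer can rank pods by how many leading blocks of the prompt each pod already holds in its prefix cache. The function and variable names below are hypothetical:

```python
# Illustrative sketch: score pods by the fraction of leading prompt blocks
# each pod already caches. A prefix cache only helps for a contiguous run
# of blocks from the start, so scanning stops at the first miss.
def score_pods(request_blocks, pod_caches):
    """request_blocks: ordered list of block hashes for the prompt.
    pod_caches: dict mapping pod name -> set of cached block hashes."""
    scores = {}
    for pod, cache in pod_caches.items():
        hits = 0
        for block in request_blocks:
            if block not in cache:
                break
            hits += 1
        scores[pod] = hits / len(request_blocks) if request_blocks else 0.0
    return scores

pods = {"pod-a": {"h1", "h2", "h3"}, "pod-b": {"h1", "h9"}}
scores = score_pods(["h1", "h2", "h4"], pods)
best = max(scores, key=scores.get)  # "pod-a", with 2/3 prefix coverage
```

In practice this coverage score is one input among several; the picker combines it with load and other signals before selecting a pod.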


Key features​

  • Heimdall Integration: The Heimdall scheduler tokenizes request prompts, calculates cache hit rates for candidate pods, and assigns scores used for routing. It stays synchronized with pod cache states through real-time ZMQ events.
  • SLO-based Importance: The framework can dynamically weight prefix cache hits based on Service Level Objectives (SLOs) and the specific cost of KV cache recomputation for different GPU architectures.

Scorer configuration​

To enable prefix cache-aware routing, you must configure the precise-prefix-cache-scorer plugin in Heimdall. This plugin tracks KV block locality across all pods.

The following example shows how to enable the scorer in heimdall-values.yaml:

heimdall-values.yaml
config:
  apiVersion: inference.networking.x-k8s.io/v1alpha1
  kind: EndpointPickerConfig
  plugins:
  - type: single-profile-handler
  - type: max-score-picker
  - type: precise-prefix-cache-scorer
    parameters:
      tokenProcessorConfig:
        blockSize: 32
        hashSeed: "12345"
      indexerConfig:
        tokenizersPoolConfig:
          modelName: "meta-llama/Llama-3.2-1B-Instruct"
          workersCount: 8
          hf:
            enabled: true
            huggingFaceToken: <huggingFaceToken>
          # local:
          #   autoDiscoveryDir: /mnt/models/hub
        kvBlockIndexConfig:
          inMemoryConfig:
            size: 100000000
          podCacheSize: 10
          enableMetrics: true
        kvEventsConfig:
          # Pods connect to Heimdall. Set discoverPods to false
          # and bind to all interfaces on port 5557.
          discoverPods: false
          zmqEndpoint: "tcp://0.0.0.0:5557"
          topicFilter: "kv@"
          concurrency: 16
  schedulingProfiles:
  - name: default
    plugins:
    - pluginRef: precise-prefix-cache-scorer
    - pluginRef: max-score-picker

# extraVolumes:
# - name: models
#   persistentVolumeClaim:
#     claimName: models

# extraVolumeMounts:
# - name: models
#   mountPath: /mnt/models
#   readOnly: true
info

In offline environments, use the local tokenizer config. If your models are downloaded via the hf CLI, set autoDiscoveryDir to /mnt/models/hub to match the standard Hugging Face cache structure.


Components​

Tokenizer and prefix store​

Since vLLM instances manage prefix caches using tokenized sequences, the scorer must tokenize incoming prompts. To minimize overhead, the scorer maintains a prefix store that caches previous tokenization results. If a new request shares a significant portion (e.g., >80%) with a cached prefix, the scorer reuses the tokens to estimate the hit rate without re-tokenizing the entire prompt.
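The reuse check can be sketched as follows. The `PrefixStore` class, its methods, and the 80% threshold constant are illustrative names, not the scorer's real API:

```python
# Hypothetical sketch of the prefix-store reuse check described above.
REUSE_THRESHOLD = 0.8

def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading character run of two strings."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

class PrefixStore:
    def __init__(self):
        self._entries = {}  # prompt text -> previously computed token ids

    def lookup(self, prompt: str):
        """Return cached tokens for a stored prompt that shares more than
        REUSE_THRESHOLD of the new prompt's text, else None (tokenize fresh)."""
        best = None
        for cached_prompt, tokens in self._entries.items():
            shared = common_prefix_len(prompt, cached_prompt)
            if shared / max(len(prompt), 1) > REUSE_THRESHOLD:
                if best is None or shared > best[0]:
                    best = (shared, tokens)
        return best[1] if best else None

    def insert(self, prompt: str, tokens):
        self._entries[prompt] = tokens
```

A real implementation would also map the shared character span back to a token boundary before reusing the cached tokens; that bookkeeping is omitted here.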

Token processor​

vLLM uses block hashes to look up prefixes. The scorer emulates this by hashing the tokenized prompt into blocks of size B. The blockSize and hashSeed must match the vLLM configuration of the worker pods.
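The chaining idea can be sketched as below. The hash function used here (SHA-256) is only illustrative; vLLM's actual hashing differs, which is why `blockSize` and `hashSeed` must be copied from the worker configuration rather than chosen independently:

```python
# Illustrative chained block hashing: each block's hash depends on the
# previous block's hash, so two prompts produce identical hash chains
# exactly as far as their token prefixes agree.
import hashlib
import struct

def hash_token_blocks(token_ids, block_size=32, hash_seed="12345"):
    hashes = []
    parent = hash_seed.encode()
    # Only full blocks are hashed; a trailing partial block cannot match.
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256(parent + struct.pack(f"{len(block)}i", *block)).digest()
        hashes.append(h.hex()[:16])
        parent = h
    return hashes
```

Because the chain restarts from `hash_seed`, changing the seed (or the block size) on either side breaks every lookup, so these values must match across the scorer and all worker pods.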

KV block index​

Pods publish events (BlockStored, BlockRemoved) via ZMQ when their cache updates. The scorer subscribes to these events to maintain a KV block index, mapping prefix hashes to the list of pods holding that cache. The pod with the most matching blocks for a request is assigned the highest score.
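A minimal sketch of such an index follows; the class and method names are hypothetical, and only the two event types named above are handled:

```python
# Illustrative KV block index updated from BlockStored / BlockRemoved events.
from collections import defaultdict

class KVBlockIndex:
    def __init__(self):
        self._index = defaultdict(set)  # block hash -> pods holding that block

    def on_event(self, pod: str, event: str, block_hashes):
        if event == "BlockStored":
            for h in block_hashes:
                self._index[h].add(pod)
        elif event == "BlockRemoved":  # e.g. evicted from the pod's cache
            for h in block_hashes:
                self._index[h].discard(pod)

    def matching_blocks(self, pod: str, request_hashes):
        """Count leading request blocks the pod already holds."""
        count = 0
        for h in request_hashes:
            if pod not in self._index.get(h, ()):
                break
            count += 1
        return count
```

With this structure, scoring a request is a lookup per candidate pod over the request's block-hash chain, and the pod with the longest matching run wins.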


Deployment example​

This section provides a complete example of deploying an inference service with prefix cache-aware routing using the Llama 3.2 1B Instruct model on AMD MI250 GPUs. For more details on using templates, see the Presets documentation.

1. Identify available presets​

First, find the appropriate preset for your model and hardware:

kubectl get inferenceservicetemplate -n mif \
-l mif.moreh.io/template.type=preset \
-l mif.moreh.io/model.name=llama-3.2-1b-instruct

2. Configure InferenceService​

The following InferenceService uses the quickstart-vllm-meta-llama-llama-3.2-1b-instruct-amd-mi250-tp2 preset. By setting ISVC_USE_KV_EVENTS: "true", the required vLLM arguments for ZMQ event publishing are automatically added.

info

In production environments, it is highly recommended to use offline hub templates (e.g., vllm-hf-hub-offline, vllm-dp-hf-hub-offline, or vllm-pp-hf-hub-offline) instead of HF_TOKEN to load pre-downloaded models from a Persistent Volume. This ensures reliability by avoiding dependence on external network conditions during pod startup. These templates require a PVC named models in your namespace. Refer to Hugging Face model management with persistent volume for more details.

llama-prefix-cache.yaml
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
metadata:
  name: llama-1b-prefix-cache
spec:
  replicas: 4
  inferencePoolRefs:
  - name: heimdall
  templateRefs:
  - name: vllm
  - name: quickstart-vllm-meta-llama-llama-3.2-1b-instruct-amd-mi250-tp2
  # - name: vllm-hf-hub-offline
  template:
    spec:
      containers:
      - name: main
        env:
        - name: HF_TOKEN
          value: <huggingFaceToken>
        # Enable automatic generation of --kv-events-config
        - name: ISVC_USE_KV_EVENTS
          value: "true"
        # Ensure the hash seed matches Heimdall scorer config
        - name: PYTHONHASHSEED
          value: "12345"
        # Maintain preset defaults while specifying the block size
        - name: ISVC_EXTRA_ARGS
          value: >-
            --disable-uvicorn-access-log --no-enable-log-requests
            --max-model-len 16384 --max-num-batched-tokens 8192
            --block-size 32
info

On AMD MI250, each physical GPU is recognized as two logical devices. In this example, the preset for tp2 uses 1 physical GPU (requesting amd.com/gpu: 2).