Hugging Face model management with a persistent volume
Efficient management of large language models (LLMs) is crucial for optimizing storage usage and reducing startup times. Instead of downloading models repeatedly for each pod, using a shared Persistent Volume (PV) allows multiple pods to access the same model files.
This workflow optimizes your LLM deployment pipeline by:
- Centralizing storage: Creating a ReadWriteMany (RWX) volume named models to store multiple large models.
- Decoupling management: Using a dedicated model manager pod to download model parameters into the volume.
- Accelerating deployment: Enabling vLLM or other engines to load the parameters directly from the local mount when an InferenceService is created.
Create a ReadWriteMany (RWX) PVC
Ensure your Kubernetes cluster supports a storage class that provides ReadWriteMany access mode (e.g., NFS, CephFS, or specific cloud provider storage).
Values enclosed in <> (e.g., <namespace>, <huggingFaceToken>) are placeholders. You must replace them with your actual values before applying the YAML files. Failure to do so will result in deployment errors.
First, create a PersistentVolumeClaim (PVC) with ReadWriteMany access mode. This allows multiple nodes to mount the same volume simultaneously, enabling concurrent access for distributed inference.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models
  namespace: <namespace>
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Ti
  storageClassName: <storageClassForReadWriteMany>
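Apply the manifest and check that the claim binds. The file name pvc.yaml is only an example; depending on the storage class's volumeBindingMode, the claim may stay Pending until the first pod mounts it.
kubectl apply -f pvc.yaml
kubectl get pvc models -n <namespace>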
Deploy a model manager pod
Create a lightweight deployment to serve as a workspace for managing model files. This pod will be used to download models using the hf command-line tool.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: huggingface-cli
  namespace: <namespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: huggingface-cli
  template:
    metadata:
      labels:
        app: huggingface-cli
    spec:
      containers:
        - name: main
          image: ubuntu:24.04
          command:
            - /bin/bash
            - -c
          args:
            - |
              apt-get update \
                && apt-get install -y python3 python3-pip \
                && pip install 'huggingface_hub[cli]' --break-system-packages
              sleep infinity
          env:
            - name: HF_TOKEN
              value: <huggingFaceToken>
            - name: HF_HOME
              value: /mnt/models
          volumeMounts:
            - name: models
              mountPath: /mnt/models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models
Apply this manifest and wait for the pod to be ready.
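For example, assuming the manifest was saved as deployment.yaml:
kubectl apply -f deployment.yaml
kubectl rollout status deployment/huggingface-cli -n <namespace>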
Download and manage models
Once the management pod is running, you can access it to manage your models. For more details about the hf command, refer to the Hugging Face CLI guides.
Access the pod:
kubectl exec -it deployment/huggingface-cli -n <namespace> -- bash
Download models:
Download models directly to the mounted volume. This example downloads the meta-llama/Llama-3.2-1B-Instruct model.
hf download meta-llama/Llama-3.2-1B-Instruct
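Other models can be pre-staged on the shared volume in exactly the same way; the cache listing below includes several models downloaded like this, for example:
hf download openai/gpt-oss-20b
Gated repositories such as the Llama models are authenticated with the HF_TOKEN environment variable configured in the deployment above.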
List downloaded models:
hf cache ls
ID                                      SIZE     LAST_ACCESSED LAST_MODIFIED REFS
--------------------------------------- -------- ------------- ------------- ----
model/meta-llama/Llama-3.2-1B-Instruct  7.4G     1 minute ago  1 minute ago  main
model/meta-llama/Llama-3.3-70B-Instruct 405.7G   1 days ago    1 days ago    main
model/openai/gpt-oss-120b               195.8G   1 days ago    1 days ago    main
model/openai/gpt-oss-20b                82.6G    1 days ago    1 days ago    main
...
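Since all models share the same 10Ti volume, it is worth checking the remaining capacity from inside the pod from time to time:
du -sh /mnt/models   # total size of the shared model cache
df -h /mnt/models    # free space left on the mounted volume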
Use models in InferenceService
To use the pre-downloaded models in an InferenceService, define an InferenceServiceTemplate that mounts the PVC and sets the environment to offline mode.
Create an offline template
This template configures the worker pods to mount the shared storage (/mnt/models) and tells the Hugging Face library to operate in offline mode.
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceServiceTemplate
metadata:
  name: template-hf-hub-offline
  namespace: <namespace>
spec:
  template:
    spec:
      containers:
        - name: main
          env:
            - name: HF_HOME
              value: /mnt/models
            - name: HF_HUB_OFFLINE
              value: '1'
          volumeMounts:
            - name: models
              mountPath: /mnt/models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models
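Apply the template and confirm that it was created. The file name and the plural resource name below are assumptions based on the usual Kubernetes convention for custom resources; adjust them if your installation registers a different name.
kubectl apply -f template-hf-hub-offline.yaml
kubectl get inferenceservicetemplates -n <namespace>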
Create the InferenceService
Reference the offline template in your InferenceService. Specify the model ID (e.g., meta-llama/Llama-3.2-1B-Instruct) in the args field to load the model from the offline cache.
The container name in the InferenceService (e.g., main) must match the container name defined in the InferenceServiceTemplate for the configurations to merge correctly.
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
metadata:
  name: llama3-1b-instruct
  namespace: <namespace>
spec:
  templateRefs:
    - name: template-hf-hub-offline
  template:
    spec:
      containers:
        - name: main
          image: 255250787067.dkr.ecr.ap-northeast-2.amazonaws.com/quickstart/moreh-vllm:20250915.1
          command:
            - vllm
            - serve
          args:
            - meta-llama/Llama-3.2-1B-Instruct
            - --port
            - '8000'
          resources:
            requests:
              amd.com/gpu: '1'
            limits:
              amd.com/gpu: '1'
          ports:
            - name: http
              containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 3
      tolerations:
        - key: amd.com/gpu
          operator: Exists
          effect: NoSchedule
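One way to smoke-test the running service, without assuming anything about how it is exposed externally, is to port-forward to the pod created for the InferenceService and call the OpenAI-compatible API that vllm serve provides. <podName> is a placeholder; look it up with the first command.
kubectl get pods -n <namespace>
kubectl port-forward pod/<podName> 8000:8000 -n <namespace>
# In another terminal:
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'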
Create an InferenceService without a template
Alternatively, you can configure the volume mount and environment variables directly within the InferenceService without using a template. However, as the number of InferenceServices and models grows, managing individual configurations can become complex. In such cases, using templates is recommended to simplify management and reduce duplication.
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
metadata:
  name: llama3-1b-instruct-no-template
  namespace: <namespace>
spec:
  template:
    spec:
      containers:
        - name: main
          image: 255250787067.dkr.ecr.ap-northeast-2.amazonaws.com/quickstart/moreh-vllm:20250915.1
          command:
            - vllm
            - serve
          args:
            - meta-llama/Llama-3.2-1B-Instruct
            - --port
            - '8000'
          env:
            - name: HF_HOME
              value: /mnt/models
            - name: HF_HUB_OFFLINE
              value: '1'
          resources:
            requests:
              amd.com/gpu: '1'
            limits:
              amd.com/gpu: '1'
          ports:
            - name: http
              containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          volumeMounts:
            - name: models
              mountPath: /mnt/models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models
      tolerations:
        - key: amd.com/gpu
          operator: Exists
          effect: NoSchedule