📄️ Resource allocation
This document describes how to allocate resources (accelerators and NICs) to your inference containers, select specific nodes using node selectors and node affinity, and tolerate node taints. When an InferenceService is created, it generates a Deployment or a LeaderWorkerSet, which ultimately creates Pods. In this context, placing a Pod is therefore equivalent to placing an inference container.
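For orientation, the sketch below shows how such a placement could be expressed, assuming a KServe-style InferenceService API. The model format, node label, taint key, and resource counts are illustrative placeholders, not values taken from this guide.

```yaml
# Illustrative sketch: an InferenceService whose predictor requests one GPU,
# is pinned to a labelled node pool, and tolerates a GPU taint.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-example                 # placeholder name
spec:
  predictor:
    nodeSelector:
      accelerator: nvidia-a100      # placeholder node label
    tolerations:
      - key: nvidia.com/gpu         # tolerate a GPU taint, if your nodes use one
        operator: Exists
        effect: NoSchedule
    model:
      modelFormat:
        name: huggingface           # placeholder model format
      resources:
        requests:
          nvidia.com/gpu: "1"       # one accelerator per Pod
        limits:
          nvidia.com/gpu: "1"
```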
📄️ Hugging Face model management with persistent volume
Efficient management of large language models (LLMs) is crucial for optimizing storage usage and reducing startup times. Rather than downloading a model separately for each Pod, a shared Persistent Volume (PV) lets multiple Pods access the same model files.
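As a rough illustration, assuming a KServe-style InferenceService and storage that supports shared access across nodes, model files on a PersistentVolumeClaim can be referenced through a `pvc://` storage URI. The PVC name, size, and model path below are placeholders.

```yaml
# Illustrative sketch: a PVC holding downloaded Hugging Face model files,
# shared by inference Pods instead of re-downloading the model per Pod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hf-models                   # placeholder PVC name
spec:
  accessModes:
    - ReadWriteMany                 # requires storage that supports shared access (e.g. NFS)
  resources:
    requests:
      storage: 200Gi                # placeholder size
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-example
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      # Load model files from a path on the shared PVC instead of downloading them.
      storageUri: pvc://hf-models/meta-llama/Llama-3.1-8B-Instruct   # placeholder path
```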