
Examples

We provide a set of examples to help you serve large language models. By default, we use vLLM as the inference backend.

Table of Contents

  • Deploy models from Hugging Face
  • Deploy models from ModelScope
  • Deploy models from ObjectStore
  • Deploy models via SGLang
  • Deploy models via llama.cpp
  • Deploy models via text-generation-inference
  • Deploy models via ollama
  • Speculative Decoding with vLLM
  • Multi-Host Inference

Deploy models from Hugging Face

Deploy models hosted on Hugging Face; see the example here.

Note: if your model requires a Hugging Face token to download weights, run `kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token>` beforehand.

In theory, we support models of any size; in practice, download bandwidth is the limiting factor. For example, the Llama 2 7B model weights take about 15 GB; with 200 Mbps of bandwidth, downloading them takes roughly 10 minutes, so bandwidth plays a vital role here.
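A quick back-of-the-envelope check of that estimate, as a minimal sketch (the 15 GB size and 200 Mbps bandwidth are the figures quoted above, not measurements):

```python
def download_minutes(size_gb: float, bandwidth_mbps: float) -> float:
    """Rough download time in minutes for `size_gb` gigabytes over a
    `bandwidth_mbps` megabit-per-second link (decimal units, no overhead)."""
    size_megabits = size_gb * 1000 * 8          # GB -> megabits
    return size_megabits / bandwidth_mbps / 60  # seconds -> minutes

# Llama 2 7B weights (~15 GB) over a 200 Mbps link:
print(f"{download_minutes(15, 200):.0f} minutes")  # -> 10 minutes
```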

Deploy models from ModelScope

Deploy models hosted on ModelScope; see the example here, which works similarly to the other backends.

Deploy models from ObjectStore

Deploy models stored in object stores. We support various providers; see the full list below.

For example, the Qwen2-7B model weights occupy about 14.2 GB; with an intranet bandwidth of about 800 Mbps, the download takes about 2 to 3 minutes, and the intranet bandwidth can usually be increased to shorten this further.

  • Alibaba Cloud OSS, see the example here

    Note: you should set OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET first by running `kubectl create secret generic oss-access-secret --from-literal=OSS_ACCESS_KEY_ID=<your ID> --from-literal=OSS_ACCESS_KEY_SECRET=<your secret>`

Deploy models via SGLang

By default, we use vLLM as the inference backend; if you want to use another backend such as SGLang, see the example here.

Deploy models via llama.cpp

llama.cpp can serve models on a wide variety of hardware, including CPUs; see the example here.

Deploy models via text-generation-inference

text-generation-inference is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints. See the example here.

Deploy models via ollama

ollama, built on llama.cpp, is aimed at local deployment. See the example here.

Speculative Decoding with vLLM

Speculative decoding can improve inference performance by having a small draft model propose tokens that the larger target model verifies in parallel; see the example here.
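For a rough idea of what such a setup looks like at the vLLM level (a minimal sketch, not this repository's example; the model names are placeholders, and the speculative_model / num_speculative_tokens arguments assume a vLLM version that exposes them):

```python
# Sketch of speculative decoding with vLLM's offline API.
# Assumes a vLLM version whose LLM class accepts `speculative_model`
# and `num_speculative_tokens`; model names are illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",                    # target model
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # draft model
    num_speculative_tokens=5,                                 # tokens proposed per step
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```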

Multi-Host Inference

Model sizes keep growing: Llama 3.1 405B at FP16 requires more than 750 GB of GPU memory for the weights alone, leaving the KV cache unconsidered. Even 8 x NVIDIA H100 GPUs with 80 GB of HBM each cannot fit it on a single host, so a multi-host deployment is required; see the example here.
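The arithmetic behind that claim, as a quick sketch (405B parameters at FP16 and 8 x 80 GB H100 GPUs are the figures quoted above):

```python
params = 405e9                   # Llama 3.1 405B parameters
bytes_per_param = 2              # FP16 = 2 bytes per parameter
weights_gb = params * bytes_per_param / 1e9
print(f"weights only: ~{weights_gb:.0f} GB")            # ~810 GB

hbm_per_gpu_gb = 80              # H100 HBM per GPU
gpus_per_host = 8
host_hbm_gb = hbm_per_gpu_gb * gpus_per_host
print(f"single-host HBM: {host_hbm_gb} GB")             # 640 GB
print("fits on one host:", weights_gb <= host_hbm_gb)   # False -> multi-host needed
```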