From 984b1f501297d3fa004edf7c3bbcdae670e63ce6 Mon Sep 17 00:00:00 2001 From: Harry Mellor <19981378+hmellor@users.noreply.github.com> Date: Mon, 13 Jan 2025 12:27:36 +0000 Subject: [PATCH] [Doc] Organise installation documentation into categories and tabs (#11935) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> --- docs/source/conf.py | 2 +- docs/source/deployment/docker.md | 4 + docs/source/features/compatibility_matrix.md | 4 +- .../hpu-gaudi.inc.md} | 72 ++-- .../installation/ai_accelerator/index.md | 375 ++++++++++++++++++ .../neuron.inc.md} | 67 ++-- .../openvino.inc.md} | 78 ++-- .../{tpu.md => ai_accelerator/tpu.inc.md} | 38 +- .../getting_started/installation/cpu-arm.md | 46 --- .../{cpu-apple.md => cpu/apple.inc.md} | 28 +- .../installation/cpu/arm.inc.md | 30 ++ .../installation/cpu/build.inc.md | 21 + .../installation/{cpu-x86.md => cpu/index.md} | 190 ++++++--- .../installation/cpu/x86.inc.md | 35 ++ .../installation/device.template.md | 17 + .../{gpu-cuda.md => gpu/cuda.inc.md} | 96 ++--- .../getting_started/installation/gpu/index.md | 300 ++++++++++++++ .../{gpu-rocm.md => gpu/rocm.inc.md} | 147 +++---- .../installation/{xpu.md => gpu/xpu.inc.md} | 51 ++- .../getting_started/installation/index.md | 13 +- .../installation/python_env_setup.inc.md | 19 + 21 files changed, 1241 insertions(+), 392 deletions(-) rename docs/source/getting_started/installation/{hpu-gaudi.md => ai_accelerator/hpu-gaudi.inc.md} (97%) create mode 100644 docs/source/getting_started/installation/ai_accelerator/index.md rename docs/source/getting_started/installation/{neuron.md => ai_accelerator/neuron.inc.md} (86%) rename docs/source/getting_started/installation/{openvino.md => ai_accelerator/openvino.inc.md} (69%) rename docs/source/getting_started/installation/{tpu.md => ai_accelerator/tpu.inc.md} (88%) delete mode 100644 docs/source/getting_started/installation/cpu-arm.md rename docs/source/getting_started/installation/{cpu-apple.md => cpu/apple.inc.md} (70%) create mode 100644 docs/source/getting_started/installation/cpu/arm.inc.md create mode 100644 docs/source/getting_started/installation/cpu/build.inc.md rename docs/source/getting_started/installation/{cpu-x86.md => cpu/index.md} (67%) create mode 100644 docs/source/getting_started/installation/cpu/x86.inc.md create mode 100644 docs/source/getting_started/installation/device.template.md rename docs/source/getting_started/installation/{gpu-cuda.md => gpu/cuda.inc.md} (84%) create mode 100644 docs/source/getting_started/installation/gpu/index.md rename docs/source/getting_started/installation/{gpu-rocm.md => gpu/rocm.inc.md} (87%) rename docs/source/getting_started/installation/{xpu.md => gpu/xpu.inc.md} (80%) create mode 100644 docs/source/getting_started/installation/python_env_setup.inc.md diff --git a/docs/source/conf.py b/docs/source/conf.py index bff0141ffbce8..7aa52db092e36 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -56,7 +56,7 @@ # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This pattern also affects html_static_path and html_extra_path. 
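+# Note: "**/*.inc.md" files are partial pages that are pulled into other
+# documents via {include} directives, so they are excluded here to keep
+# Sphinx from picking them up as standalone source files.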
-exclude_patterns: List[str] = ["**/*.template.md"] +exclude_patterns: List[str] = ["**/*.template.md", "**/*.inc.md"] # Exclude the prompt "$" when copying code copybutton_prompt_text = r"\$ " diff --git a/docs/source/deployment/docker.md b/docs/source/deployment/docker.md index c735bfd0e87a7..9e301483ef7f9 100644 --- a/docs/source/deployment/docker.md +++ b/docs/source/deployment/docker.md @@ -2,6 +2,8 @@ # Using Docker +(deployment-docker-pre-built-image)= + ## Use vLLM's Official Docker Image vLLM offers an official Docker image for deployment. @@ -23,6 +25,8 @@ container to access the host's shared memory. vLLM uses PyTorch, which uses shar memory to share data between processes under the hood, particularly for tensor parallel inference. ``` +(deployment-docker-build-image-from-source)= + ## Building vLLM's Docker Image from Source You can build and run vLLM from source via the provided . To build vLLM: diff --git a/docs/source/features/compatibility_matrix.md b/docs/source/features/compatibility_matrix.md index ea1d545ff3d73..86a82eb36df33 100644 --- a/docs/source/features/compatibility_matrix.md +++ b/docs/source/features/compatibility_matrix.md @@ -322,7 +322,9 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar ``` -### Feature x Hardware +(feature-x-hardware)= + +## Feature x Hardware ```{list-table} :header-rows: 1 diff --git a/docs/source/getting_started/installation/hpu-gaudi.md b/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md similarity index 97% rename from docs/source/getting_started/installation/hpu-gaudi.md rename to docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md index a829b1c9ff996..b4695d504b601 100644 --- a/docs/source/getting_started/installation/hpu-gaudi.md +++ b/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md @@ -1,10 +1,13 @@ -(installation-gaudi)= +# Installation -# Installation for Intel® Gaudi® +This tab provides instructions on running vLLM with Intel Gaudi devices. -This README provides instructions on running vLLM with Intel Gaudi devices. +## Requirements -## Requirements and Installation +- OS: Ubuntu 22.04 LTS +- Python: 3.10 +- Intel Gaudi accelerator +- Intel Gaudi software version 1.18.0 Please follow the instructions provided in the [Gaudi Installation Guide](https://docs.habana.ai/en/latest/Installation_Guide/index.html) @@ -12,27 +15,9 @@ to set up the execution environment. To achieve the best performance, please follow the methods outlined in the [Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html). -### Requirements - -- OS: Ubuntu 22.04 LTS -- Python: 3.10 -- Intel Gaudi accelerator -- Intel Gaudi software version 1.18.0 - -### Quick start using Dockerfile - -```console -docker build -f Dockerfile.hpu -t vllm-hpu-env . -docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env -``` - -```{tip} -If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered. 
-``` +## Configure a new environment -### Build from source - -#### Environment verification +### Environment verification To verify that the Intel Gaudi software was correctly installed, run: @@ -47,7 +32,7 @@ Refer to [Intel Gaudi Software Stack Verification](https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade) for more details. -#### Run Docker Image +### Run Docker Image It is highly recommended to use the latest Docker image from Intel Gaudi vault. Refer to the [Intel Gaudi @@ -61,7 +46,13 @@ docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-i docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest ``` -#### Build and Install vLLM +## Set up using Python + +### Pre-built wheels + +Currently, there are no pre-built Intel Gaudi wheels. + +### Build wheel from source To build and install vLLM from source, run: @@ -80,7 +71,26 @@ git checkout habana_main python setup.py develop ``` -## Supported Features +## Set up using Docker + +### Pre-built images + +Currently, there are no pre-built Intel Gaudi images. + +### Build image from source + +```console +docker build -f Dockerfile.hpu -t vllm-hpu-env . +docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env +``` + +```{tip} +If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered. +``` + +## Extra information + +## Supported features - [Offline inference](#offline-inference) - Online serving via [OpenAI-Compatible Server](#openai-compatible-server) @@ -94,14 +104,14 @@ python setup.py develop for accelerating low-batch latency and throughput - Attention with Linear Biases (ALiBi) -## Unsupported Features +## Unsupported features - Beam search - LoRA adapters - Quantization - Prefill chunking (mixed-batch inferencing) -## Supported Configurations +## Supported configurations The following configurations have been validated to be function with Gaudi2 devices. Configurations that are not listed may or may not work. @@ -137,7 +147,7 @@ Gaudi2 devices. Configurations that are not listed may or may not work. 
- [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling -## Performance Tuning +## Performance tuning ### Execution modes @@ -368,7 +378,7 @@ Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM - `PT_HPU_LAZY_MODE`: if `0`, PyTorch Eager backend for Gaudi will be used, if `1` PyTorch Lazy backend for Gaudi will be used, `1` is default - `PT_HPU_ENABLE_LAZY_COLLECTIVES`: required to be `true` for tensor parallel inference with HPU Graphs -## Troubleshooting: Tweaking HPU Graphs +## Troubleshooting: tweaking HPU graphs If you experience device out-of-memory issues or want to attempt inference at higher batch sizes, try tweaking HPU Graphs by following diff --git a/docs/source/getting_started/installation/ai_accelerator/index.md b/docs/source/getting_started/installation/ai_accelerator/index.md new file mode 100644 index 0000000000000..a6c4c44305a4c --- /dev/null +++ b/docs/source/getting_started/installation/ai_accelerator/index.md @@ -0,0 +1,375 @@ +# Other AI accelerators + +vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions: + +::::{tab-set} +:sync-group: device + +:::{tab-item} TPU +:sync: tpu + +```{include} tpu.inc.md +:start-after: "# Installation" +:end-before: "## Requirements" +``` + +::: + +:::{tab-item} Intel Gaudi +:sync: hpu-gaudi + +```{include} hpu-gaudi.inc.md +:start-after: "# Installation" +:end-before: "## Requirements" +``` + +::: + +:::{tab-item} Neuron +:sync: neuron + +```{include} neuron.inc.md +:start-after: "# Installation" +:end-before: "## Requirements" +``` + +::: + +:::{tab-item} OpenVINO +:sync: openvino + +```{include} openvino.inc.md +:start-after: "# Installation" +:end-before: "## Requirements" +``` + +::: + +:::: + +## Requirements + +::::{tab-set} +:sync-group: device + +:::{tab-item} TPU +:sync: tpu + +```{include} tpu.inc.md +:start-after: "## Requirements" +:end-before: "## Configure a new environment" +``` + +::: + +:::{tab-item} Intel Gaudi +:sync: hpu-gaudi + +```{include} hpu-gaudi.inc.md +:start-after: "## Requirements" +:end-before: "## Configure a new environment" +``` + +::: + +:::{tab-item} Neuron +:sync: neuron + +```{include} neuron.inc.md +:start-after: "## Requirements" +:end-before: "## Configure a new environment" +``` + +::: + +:::{tab-item} OpenVINO +:sync: openvino + +```{include} openvino.inc.md +:start-after: "## Requirements" +:end-before: "## Set up using Python" +``` + +::: + +:::: + +## Configure a new environment + +::::{tab-set} +:sync-group: device + +:::{tab-item} TPU +:sync: tpu + +```{include} tpu.inc.md +:start-after: "## Configure a new environment" +:end-before: "## Set up using Python" +``` + +::: + +:::{tab-item} Intel Gaudi +:sync: hpu-gaudi + +```{include} hpu-gaudi.inc.md +:start-after: "## Configure a new environment" +:end-before: "## Set up using Python" +``` + +::: + +:::{tab-item} Neuron +:sync: neuron + +```{include} neuron.inc.md +:start-after: "## Configure a new environment" +:end-before: "## Set up using Python" +``` + +::: + +:::{tab-item} OpenVINO +:sync: openvino + +```{include} ../python_env_setup.inc.md +``` + +::: + +:::: + +## Set up using Python + +### Pre-built wheels + +::::{tab-set} +:sync-group: device + +:::{tab-item} TPU +:sync: tpu + +```{include} tpu.inc.md +:start-after: "### Pre-built wheels" +:end-before: "### Build wheel from source" +``` + 
+::: + +:::{tab-item} Intel Gaudi +:sync: hpu-gaudi + +```{include} hpu-gaudi.inc.md +:start-after: "### Pre-built wheels" +:end-before: "### Build wheel from source" +``` + +::: + +:::{tab-item} Neuron +:sync: neuron + +```{include} neuron.inc.md +:start-after: "### Pre-built wheels" +:end-before: "### Build wheel from source" +``` + +::: + +:::{tab-item} OpenVINO +:sync: openvino + +```{include} openvino.inc.md +:start-after: "### Pre-built wheels" +:end-before: "### Build wheel from source" +``` + +::: + +:::: + +### Build wheel from source + +::::{tab-set} +:sync-group: device + +:::{tab-item} TPU +:sync: tpu + +```{include} tpu.inc.md +:start-after: "### Build wheel from source" +:end-before: "## Set up using Docker" +``` + +::: + +:::{tab-item} Intel Gaudi +:sync: hpu-gaudi + +```{include} hpu-gaudi.inc.md +:start-after: "### Build wheel from source" +:end-before: "## Set up using Docker" +``` + +::: + +:::{tab-item} Neuron +:sync: neuron + +```{include} neuron.inc.md +:start-after: "### Build wheel from source" +:end-before: "## Set up using Docker" +``` + +::: + +:::{tab-item} OpenVINO +:sync: openvino + +```{include} openvino.inc.md +:start-after: "### Build wheel from source" +:end-before: "## Set up using Docker" +``` + +::: + +:::: + +## Set up using Docker + +### Pre-built images + +::::{tab-set} +:sync-group: device + +:::{tab-item} TPU +:sync: tpu + +```{include} tpu.inc.md +:start-after: "### Pre-built images" +:end-before: "### Build image from source" +``` + +::: + +:::{tab-item} Intel Gaudi +:sync: hpu-gaudi + +```{include} hpu-gaudi.inc.md +:start-after: "### Pre-built images" +:end-before: "### Build image from source" +``` + +::: + +:::{tab-item} Neuron +:sync: neuron + +```{include} neuron.inc.md +:start-after: "### Pre-built images" +:end-before: "### Build image from source" +``` + +::: + +:::{tab-item} OpenVINO +:sync: openvino + +```{include} openvino.inc.md +:start-after: "### Pre-built images" +:end-before: "### Build image from source" +``` + +::: + +:::: + +### Build image from source + +::::{tab-set} +:sync-group: device + +:::{tab-item} TPU +:sync: tpu + +```{include} tpu.inc.md +:start-after: "### Build image from source" +:end-before: "## Extra information" +``` + +::: + +:::{tab-item} Intel Gaudi +:sync: hpu-gaudi + +```{include} hpu-gaudi.inc.md +:start-after: "### Build image from source" +:end-before: "## Extra information" +``` + +::: + +:::{tab-item} Neuron +:sync: neuron + +```{include} neuron.inc.md +:start-after: "### Build image from source" +:end-before: "## Extra information" +``` + +::: + +:::{tab-item} OpenVINO +:sync: openvino + +```{include} openvino.inc.md +:start-after: "### Build image from source" +:end-before: "## Extra information" +``` + +::: + +:::: + +## Extra information + +::::{tab-set} +:sync-group: device + +:::{tab-item} TPU +:sync: tpu + +```{include} tpu.inc.md +:start-after: "## Extra information" +``` + +::: + +:::{tab-item} Intel Gaudi +:sync: hpu-gaudi + +```{include} hpu-gaudi.inc.md +:start-after: "## Extra information" +``` + +::: + +:::{tab-item} Neuron +:sync: neuron + +```{include} neuron.inc.md +:start-after: "## Extra information" +``` + +::: + +:::{tab-item} OpenVINO +:sync: openvino + +```{include} openvino.inc.md +:start-after: "## Extra information" +``` + +::: + +:::: diff --git a/docs/source/getting_started/installation/neuron.md b/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md similarity index 86% rename from docs/source/getting_started/installation/neuron.md rename to 
docs/source/getting_started/installation/ai_accelerator/neuron.inc.md index 5581b1940ca46..575a9f9c2e2f0 100644 --- a/docs/source/getting_started/installation/neuron.md +++ b/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md @@ -1,6 +1,4 @@ -(installation-neuron)= - -# Installation for Neuron +# Installation vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching. Paged Attention and Chunked Prefill are currently in development and will be available soon. @@ -14,28 +12,9 @@ Data types currently supported in Neuron SDK are FP16 and BF16. - Pytorch 2.0.1/2.1.1 - AWS Neuron SDK 2.16/2.17 (Verified on python 3.8) -Installation steps: - -- [Build from source](#build-from-source-neuron) - - - [Step 0. Launch Trn1/Inf2 instances](#launch-instances) - - [Step 1. Install drivers and tools](#install-drivers) - - [Step 2. Install transformers-neuronx and its dependencies](#install-tnx) - - [Step 3. Install vLLM from source](#install-vllm) - -(build-from-source-neuron)= - -```{note} -The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel. -``` - -## Build from source - -Following instructions are applicable to Neuron SDK 2.16 and beyond. - -(launch-instances)= +## Configure a new environment -### Step 0. Launch Trn1/Inf2 instances +### Launch Trn1/Inf2 instances Here are the steps to launch trn1/inf2 instances, in order to install [PyTorch Neuron ("torch-neuronx") Setup on Ubuntu 22.04 LTS](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html). @@ -45,9 +24,7 @@ Here are the steps to launch trn1/inf2 instances, in order to install [PyTorch N - When launching a Trn1/Inf2, please adjust your primary EBS volume size to a minimum of 512GB. - After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance -(install-drivers)= - -### Step 1. Install drivers and tools +### Install drivers and tools The installation of drivers and tools wouldn't be necessary, if [Deep Learning AMI Neuron](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html) is installed. In case the drivers and tools are not installed on the operating system, follow the steps below: @@ -82,9 +59,21 @@ sudo apt-get install aws-neuronx-tools=2.* -y export PATH=/opt/aws/neuron/bin:$PATH ``` -(install-tnx)= +## Set up using Python + +### Pre-built wheels -### Step 2. Install transformers-neuronx and its dependencies +Currently, there are no pre-built Neuron wheels. + +### Build wheel from source + +```{note} +The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel. +``` + +Following instructions are applicable to Neuron SDK 2.16 and beyond. + +#### Install transformers-neuronx and its dependencies [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx) will be the backend to support inference on trn1/inf2 instances. 
Follow the steps below to install transformer-neuronx package and its dependencies. @@ -116,9 +105,7 @@ python -m pip install awscli python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx ``` -(install-vllm)= - -### Step 3. Install vLLM from source +#### Install vLLM from source Once neuronx-cc and transformers-neuronx packages are installed, we will be able to install vllm as follows: @@ -130,3 +117,19 @@ VLLM_TARGET_DEVICE="neuron" pip install . ``` If neuron packages are detected correctly in the installation process, `vllm-0.3.0+neuron212` will be installed. + +## Set up using Docker + +### Pre-built images + +Currently, there are no pre-built Neuron images. + +### Build image from source + +See for instructions on building the Docker image. + +Make sure to use in place of the default Dockerfile. + +## Extra information + +There is no extra information for this device. diff --git a/docs/source/getting_started/installation/openvino.md b/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md similarity index 69% rename from docs/source/getting_started/installation/openvino.md rename to docs/source/getting_started/installation/ai_accelerator/openvino.inc.md index d97d4173bf36b..a7867472583d6 100644 --- a/docs/source/getting_started/installation/openvino.md +++ b/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md @@ -1,63 +1,65 @@ -(installation-openvino)= +# Installation -# Installation for OpenVINO +vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](#supported-models) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). -vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](#supported-models) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). OpenVINO vLLM backend supports the following advanced vLLM features: +## Requirements -- Prefix caching (`--enable-prefix-caching`) -- Chunked prefill (`--enable-chunked-prefill`) +- OS: Linux +- Instruction set architecture (ISA) requirement: at least AVX2. -**Table of contents**: +## Set up using Python -- [Requirements](#openvino-backend-requirements) -- [Quick start using Dockerfile](#openvino-backend-quick-start-dockerfile) -- [Build from source](#install-openvino-backend-from-source) -- [Performance tips](#openvino-backend-performance-tips) -- [Limitations](#openvino-backend-limitations) +### Pre-built wheels -(openvino-backend-requirements)= +Currently, there are no pre-built OpenVINO wheels. -## Requirements +### Build wheel from source -- OS: Linux -- Instruction set architecture (ISA) requirement: at least AVX2. +First, install Python. For example, on Ubuntu 22.04, you can run: -(openvino-backend-quick-start-dockerfile)= +```console +sudo apt-get update -y +sudo apt-get install python3 +``` -## Quick start using Dockerfile +Second, install prerequisites vLLM OpenVINO backend installation: ```console -docker build -f Dockerfile.openvino -t vllm-openvino-env . 
-docker run -it --rm vllm-openvino-env +pip install --upgrade pip +pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu ``` -(install-openvino-backend-from-source)= +Finally, install vLLM with OpenVINO backend: -## Install from source +```console +PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE=openvino python -m pip install -v . +``` -- First, install Python. For example, on Ubuntu 22.04, you can run: +:::{tip} +To use vLLM OpenVINO backend with a GPU device, ensure your system is properly set up. Follow the instructions provided here: [https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html). +::: - ```console - sudo apt-get update -y - sudo apt-get install python3 - ``` +## Set up using Docker -- Second, install prerequisites vLLM OpenVINO backend installation: +### Pre-built images - ```console - pip install --upgrade pip - pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu - ``` +Currently, there are no pre-built OpenVINO images. -- Finally, install vLLM with OpenVINO backend: +### Build image from source - ```console - PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE=openvino python -m pip install -v . - ``` +```console +docker build -f Dockerfile.openvino -t vllm-openvino-env . +docker run -it --rm vllm-openvino-env +``` -- [Optional] To use vLLM OpenVINO backend with a GPU device, ensure your system is properly set up. Follow the instructions provided here: [https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html). +## Extra information -(openvino-backend-performance-tips)= +## Supported features + +OpenVINO vLLM backend supports the following advanced vLLM features: + +- Prefix caching (`--enable-prefix-caching`) +- Chunked prefill (`--enable-chunked-prefill`) ## Performance tips @@ -95,8 +97,6 @@ $ VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \ python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json ``` -(openvino-backend-limitations)= - ## Limitations - LoRA serving is not supported. diff --git a/docs/source/getting_started/installation/tpu.md b/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md similarity index 88% rename from docs/source/getting_started/installation/tpu.md rename to docs/source/getting_started/installation/ai_accelerator/tpu.inc.md index 1938785ade46a..6a911cc6b9eba 100644 --- a/docs/source/getting_started/installation/tpu.md +++ b/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md @@ -1,6 +1,4 @@ -(installation-tpu)= - -# Installation for TPUs +# Installation Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs @@ -54,7 +52,16 @@ In all of the following commands, replace the ALL CAPS parameter names with appropriate values. See the parameter descriptions table for more information. 
``` -## Provision a Cloud TPU with the queued resource API +### Provision Cloud TPUs with GKE + +For more information about using TPUs with GKE, see: +- +- +- + +## Configure a new environment + +### Provision a Cloud TPU with the queued resource API Create a TPU v5e with 4 TPU chips: @@ -102,6 +109,14 @@ Connect to your TPU using SSH: gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE ``` +## Set up using Python + +### Pre-built wheels + +Currently, there are no pre-built TPU wheels. + +### Build wheel from source + Install Miniconda: ```bash @@ -142,16 +157,13 @@ Run the setup script: VLLM_TARGET_DEVICE="tpu" python setup.py develop ``` -## Provision Cloud TPUs with GKE +## Set up using Docker -For more information about using TPUs with GKE, see - - - +### Pre-built images -(build-docker-tpu)= +See for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`. -## Build a docker image with {code}`Dockerfile.tpu` +### Build image from source You can use to build a Docker image with TPU support. @@ -189,3 +201,7 @@ Install OpenBLAS with the following command: $ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev ``` ```` + +## Extra information + +There is no extra information for this device. diff --git a/docs/source/getting_started/installation/cpu-arm.md b/docs/source/getting_started/installation/cpu-arm.md deleted file mode 100644 index e199073ed721f..0000000000000 --- a/docs/source/getting_started/installation/cpu-arm.md +++ /dev/null @@ -1,46 +0,0 @@ -(installation-arm)= - -# Installation for ARM CPUs - -vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. This guide provides installation instructions specific to ARM (which also apply to Apple Silicon, see [Installation for macOS](#installation-apple) for more). For additional details on supported features, refer to the [x86 CPU documentation](#installation-x86) covering: - -- CPU backend inference capabilities -- Relevant runtime environment variables -- Performance optimization tips - -ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes. -Contents: - -1. [Requirements](#arm-backend-requirements) -2. [Quick Start with Dockerfile](#arm-backend-quick-start-dockerfile) -3. [Building from Source](#build-arm-backend-from-source) - -(arm-backend-requirements)= - -## Requirements - -- **Operating System**: Linux or macOS -- **Compilers**: `gcc/g++ >= 12.3.0` (optional, but recommended) or `Apple Clang >= 15.0.0` for macOS -- **Instruction Set Architecture (ISA)**: NEON support is required - -(arm-backend-quick-start-dockerfile)= - -## Quick Start with Dockerfile - -You can quickly set up vLLM on ARM using Docker: - -```console -$ docker build -f Dockerfile.arm -t vllm-cpu-env --shm-size=4g . -$ docker run -it \ - --rm \ - --network=host \ - --cpuset-cpus= \ - --cpuset-mems= \ - vllm-cpu-env -``` - -(build-arm-backend-from-source)= - -## Building from Source - -To build vLLM from source on Ubuntu 22.04 or other Linux distributions, follow a similar process as with x86. Testing has been conducted on AWS Graviton3 instances for compatibility. 
diff --git a/docs/source/getting_started/installation/cpu-apple.md b/docs/source/getting_started/installation/cpu/apple.inc.md similarity index 70% rename from docs/source/getting_started/installation/cpu-apple.md rename to docs/source/getting_started/installation/cpu/apple.inc.md index 1068893f5bafa..56545253b1ef7 100644 --- a/docs/source/getting_started/installation/cpu-apple.md +++ b/docs/source/getting_started/installation/cpu/apple.inc.md @@ -1,20 +1,20 @@ -(installation-apple)= +# Installation -# Installation for macOS - -vLLM has experimental support for macOS with Apple Silicon. For now, users shall build from the source vLLM to natively run on macOS. For more details, like running on vLLM in a docker container, see [ARM CPU Documentation](installation-arm) +vLLM has experimental support for macOS with Apple silicon. For now, users shall build from the source vLLM to natively run on macOS. Currently the CPU implementation for macOS supports FP32 and FP16 datatypes. ## Requirements -- **Operating System**: `macOS Sonoma` or later -- **SDK** `XCode 15.4` or later with Command Line Tools -- **Compilers**: `Apple Clang >= 15.0.0` +- OS: `macOS Sonoma` or later +- SDK: `XCode 15.4` or later with Command Line Tools +- Compiler: `Apple Clang >= 15.0.0` + +## Set up using Python - +### Pre-built wheels -## Build and installation +### Build wheel from source After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source. @@ -29,7 +29,7 @@ pip install -e . On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device. ``` -## Troubleshooting +#### Troubleshooting If the build has error like the following snippet where standard C++ headers cannot be found, try to remove and reinstall your [Command Line Tools for Xcode](https://developer.apple.com/download/all/). @@ -46,3 +46,11 @@ If the build has error like the following snippet where standard C++ headers can | ^~~~~~~~~ 1 error generated. ``` + +## Set up using Docker + +### Pre-built images + +### Build image from source + +## Extra information diff --git a/docs/source/getting_started/installation/cpu/arm.inc.md b/docs/source/getting_started/installation/cpu/arm.inc.md new file mode 100644 index 0000000000000..08a764e1a25f4 --- /dev/null +++ b/docs/source/getting_started/installation/cpu/arm.inc.md @@ -0,0 +1,30 @@ +# Installation + +vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. + +ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes. + +## Requirements + +- OS: Linux +- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended) +- Instruction Set Architecture (ISA): NEON support is required + +## Set up using Python + +### Pre-built wheels + +### Build wheel from source + +:::{include} build.inc.md +::: + +Testing has been conducted on AWS Graviton3 instances for compatibility. + +## Set up using Docker + +### Pre-built images + +### Build image from source + +## Extra information diff --git a/docs/source/getting_started/installation/cpu/build.inc.md b/docs/source/getting_started/installation/cpu/build.inc.md new file mode 100644 index 0000000000000..f8d1044a0d198 --- /dev/null +++ b/docs/source/getting_started/installation/cpu/build.inc.md @@ -0,0 +1,21 @@ +First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. 
For example, on Ubuntu 22.4, you can run: + +```console +sudo apt-get update -y +sudo apt-get install -y gcc-12 g++-12 libnuma-dev +sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 +``` + +Second, install Python packages for vLLM CPU backend building: + +```console +pip install --upgrade pip +pip install cmake>=3.26 wheel packaging ninja "setuptools-scm>=8" numpy +pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu +``` + +Finally, build and install vLLM CPU backend: + +```console +VLLM_TARGET_DEVICE=cpu python setup.py install +``` diff --git a/docs/source/getting_started/installation/cpu-x86.md b/docs/source/getting_started/installation/cpu/index.md similarity index 67% rename from docs/source/getting_started/installation/cpu-x86.md rename to docs/source/getting_started/installation/cpu/index.md index c49c8e0f2a18c..4ec907c0e9fda 100644 --- a/docs/source/getting_started/installation/cpu-x86.md +++ b/docs/source/getting_started/installation/cpu/index.md @@ -1,91 +1,165 @@ -(installation-x86)= +# CPU -# Installation for x86 CPUs +vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions: -vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16. vLLM CPU backend supports the following vLLM features: +::::{tab-set} +:sync-group: device -- Tensor Parallel -- Model Quantization (`INT8 W8A8, AWQ, GPTQ`) -- Chunked-prefill -- Prefix-caching -- FP8-E5M2 KV-Caching (TODO) +:::{tab-item} x86 +:sync: x86 + +```{include} x86.inc.md +:start-after: "# Installation" +:end-before: "## Requirements" +``` -Table of contents: +::: + +:::{tab-item} ARM +:sync: arm + +```{include} arm.inc.md +:start-after: "# Installation" +:end-before: "## Requirements" +``` + +::: + +:::{tab-item} Apple silicon +:sync: apple + +```{include} apple.inc.md +:start-after: "# Installation" +:end-before: "## Requirements" +``` -1. [Requirements](#cpu-backend-requirements) -2. [Quick start using Dockerfile](#cpu-backend-quick-start-dockerfile) -3. [Build from source](#build-cpu-backend-from-source) -4. [Related runtime environment variables](#env-intro) -5. [Intel Extension for PyTorch](#ipex-guidance) -6. [Performance tips](#cpu-backend-performance-tips) +::: -(cpu-backend-requirements)= +:::: ## Requirements -- OS: Linux -- Compiler: `gcc/g++>=12.3.0` (optional, recommended) -- Instruction set architecture (ISA) requirement: AVX512 (optional, recommended) +- Python: 3.9 -- 3.12 -(cpu-backend-quick-start-dockerfile)= +::::{tab-set} +:sync-group: device -## Quick start using Dockerfile +:::{tab-item} x86 +:sync: x86 -```console -docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g . -docker run -it \ - --rm \ - --network=host \ - --cpuset-cpus= \ - --cpuset-mems= \ - vllm-cpu-env +```{include} x86.inc.md +:start-after: "## Requirements" +:end-before: "## Set up using Python" ``` -(build-cpu-backend-from-source)= +::: + +:::{tab-item} ARM +:sync: arm -## Build from source +```{include} arm.inc.md +:start-after: "## Requirements" +:end-before: "## Set up using Python" +``` -- First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. 
For example, on Ubuntu 22.4, you can run: +::: -```console -sudo apt-get update -y -sudo apt-get install -y gcc-12 g++-12 libnuma-dev -sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 +:::{tab-item} Apple silicon +:sync: apple + +```{include} apple.inc.md +:start-after: "## Requirements" +:end-before: "## Set up using Python" ``` -- Second, install Python packages for vLLM CPU backend building: +::: -```console -pip install --upgrade pip -pip install cmake>=3.26 wheel packaging ninja "setuptools-scm>=8" numpy -pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu +:::: + +## Set up using Python + +### Create a new Python environment + +```{include} ../python_env_setup.inc.md ``` -- Finally, build and install vLLM CPU backend: +### Pre-built wheels -```console -VLLM_TARGET_DEVICE=cpu python setup.py install +Currently, there are no pre-built CPU wheels. + +### Build wheel from source + +::::{tab-set} +:sync-group: device + +:::{tab-item} x86 +:sync: x86 + +```{include} x86.inc.md +:start-after: "### Build wheel from source" +:end-before: "## Set up using Docker" ``` -```{note} -- AVX512_BF16 is an extension ISA provides native BF16 data type conversion and vector product instructions, will brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16. -- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable `VLLM_CPU_AVX512BF16=1` before the building. +::: + +:::{tab-item} ARM +:sync: arm + +```{include} arm.inc.md +:start-after: "### Build wheel from source" +:end-before: "## Set up using Docker" ``` -(env-intro)= +::: -## Related runtime environment variables +:::{tab-item} Apple silicon +:sync: apple -- `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. -- `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores. +```{include} apple.inc.md +:start-after: "### Build wheel from source" +:end-before: "## Set up using Docker" +``` + +::: + +:::: + +## Set up using Docker + +### Pre-built images + +Currently, there are no pre-build CPU images. + +### Build image from source -(ipex-guidance)= +```console +$ docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g . +$ docker run -it \ + --rm \ + --network=host \ + --cpuset-cpus= \ + --cpuset-mems= \ + vllm-cpu-env +``` -## Intel Extension for PyTorch +:::{tip} +For ARM or Apple silicon, use `Dockerfile.arm` +::: -- [Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware. 
+## Supported features -(cpu-backend-performance-tips)= +vLLM CPU backend supports the following vLLM features: + +- Tensor Parallel +- Model Quantization (`INT8 W8A8, AWQ, GPTQ`) +- Chunked-prefill +- Prefix-caching +- FP8-E5M2 KV-Caching (TODO) + +## Related runtime environment variables + +- `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. +- `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores. ## Performance tips @@ -137,13 +211,13 @@ $ python examples/offline_inference/basic.py - If using vLLM CPU backend on a multi-socket machine with NUMA, be aware to set CPU cores using `VLLM_CPU_OMP_THREADS_BIND` to avoid cross NUMA node memory access. -## CPU Backend Considerations +## Other considerations - The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance. - Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU, while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, the HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. Therefore, it is strongly recommended to segregate these two components for improved performance. -- On CPU based setup with NUMA enabled, the memory access performance may be largely impacted by the [topology](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa). For NUMA architecture, two optimizations are to recommended: Tensor Parallel or Data Parallel. +- On CPU based setup with NUMA enabled, the memory access performance may be largely impacted by the [topology](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.inc.md#non-uniform-memory-access-numa). For NUMA architecture, two optimizations are to recommended: Tensor Parallel or Data Parallel. - Using Tensor Parallel for a latency constraints deployment: following GPU backend design, a Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With [TP feature on CPU](gh-pr:6125) merged, Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving: @@ -151,4 +225,4 @@ $ python examples/offline_inference/basic.py VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp ``` - - Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. 
Common solutions like [Nginx](#nginxloadbalancer) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md). + - Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](#nginxloadbalancer) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.inc.md). diff --git a/docs/source/getting_started/installation/cpu/x86.inc.md b/docs/source/getting_started/installation/cpu/x86.inc.md new file mode 100644 index 0000000000000..e4f99d3cebdf2 --- /dev/null +++ b/docs/source/getting_started/installation/cpu/x86.inc.md @@ -0,0 +1,35 @@ +# Installation + +vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16. + +## Requirements + +- OS: Linux +- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended) +- Instruction Set Architecture (ISA): AVX512 (optional, recommended) + +## Set up using Python + +### Pre-built wheels + +### Build wheel from source + +:::{include} build.inc.md +::: + +```{note} +- AVX512_BF16 is an extension ISA provides native BF16 data type conversion and vector product instructions, will brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16. +- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable `VLLM_CPU_AVX512BF16=1` before the building. +``` + +## Set up using Docker + +### Pre-built images + +### Build image from source + +## Extra information + +## Intel Extension for PyTorch + +- [Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware. diff --git a/docs/source/getting_started/installation/device.template.md b/docs/source/getting_started/installation/device.template.md new file mode 100644 index 0000000000000..44f538da93659 --- /dev/null +++ b/docs/source/getting_started/installation/device.template.md @@ -0,0 +1,17 @@ +# Installation + +## Requirements + +## Set up using Python + +### Pre-built wheels + +### Build wheel from source + +## Set up using Docker + +### Pre-built images + +### Build image from source + +## Extra information diff --git a/docs/source/getting_started/installation/gpu-cuda.md b/docs/source/getting_started/installation/gpu/cuda.inc.md similarity index 84% rename from docs/source/getting_started/installation/gpu-cuda.md rename to docs/source/getting_started/installation/gpu/cuda.inc.md index 727486abbd10f..4cce65278c069 100644 --- a/docs/source/getting_started/installation/gpu-cuda.md +++ b/docs/source/getting_started/installation/gpu/cuda.inc.md @@ -1,44 +1,24 @@ -(installation-cuda)= +# Installation -# Installation for CUDA - -vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries. +vLLM contains pre-compiled C++ and CUDA (12.1) binaries. 
## Requirements -- OS: Linux -- Python: 3.9 -- 3.12 - GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.) -## Install released versions +## Set up using Python ### Create a new Python environment -You can create a new Python environment using `conda`: - -```console -# (Recommended) Create a new conda environment. -conda create -n myenv python=3.12 -y -conda activate myenv -``` - ```{note} -[PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please only use it to create Python environment rather than installing packages. In particular, the PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See for more details. -``` - -Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command: - -```console -# (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment. -uv venv myenv --python 3.12 --seed -source myenv/bin/activate +PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See for more details. ``` In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations. Therefore, it is recommended to install vLLM with a **fresh new** environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See [below](#build-from-source) for more details. -### Install vLLM +### Pre-built wheels You can install vLLM using either `pip` or `uv pip`: @@ -59,11 +39,11 @@ pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSI (install-the-latest-code)= -## Install the latest code +#### Install the latest code LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux running on a x86 platform with CUDA 12 for every commit since `v0.5.3`. -### Install the latest code using `pip` +##### Install the latest code using `pip` ```console pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly @@ -80,7 +60,7 @@ pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manyl Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in the wheel metadata (the wheels listed in the extra index url have correct versions). Although we don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before. 
-### Install the latest code using `uv` +##### Install the latest code using `uv` Another way to install the latest code is to use `uv`: @@ -97,26 +77,9 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT} The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version. -### Install the latest code using `docker` - -Another way to access the latest code is to use the docker images: - -```console -export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch -docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT} -``` - -These docker images are used for CI and testing only, and they are not intended for production use. They will be expired after several days. - -The latest code can contain bugs and may not be stable. Please use it with caution. - -(build-from-source)= +### Build wheel from source -## Build from source - -(python-only-build)= - -### Python-only build (without compilation) +#### Set up using Python-only build (without compilation) If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM: @@ -135,14 +98,14 @@ export VLLM_PRECOMPILED_WHEEL_LOCATION=https://files.pythonhosted.org/packages/4 pip install --editable . ``` -You can find more information about vLLM's wheels [above](#install-the-latest-code). +You can find more information about vLLM's wheels in . ```{note} There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors. -It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to [the section above](#install-the-latest-code) for instructions on how to install a specified wheel. +It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to for instructions on how to install a specified wheel. ``` -### Full build (with compilation) +#### Full build (with compilation) If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes: @@ -162,7 +125,7 @@ As long as `which ccache` command can find the `ccache` binary, it will be used The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`. ``` -#### Use an existing PyTorch installation +##### Use an existing PyTorch installation There are scenarios where the PyTorch dependency cannot be easily installed via pip, e.g.: @@ -179,7 +142,7 @@ pip install -r requirements-build.txt pip install -e . 
--no-build-isolation ``` -#### Use the local cutlass for compilation +##### Use the local cutlass for compilation Currently, before starting the build process, vLLM fetches cutlass code from GitHub. However, there may be scenarios where you want to use a local version of cutlass instead. To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to point to your local cutlass directory. @@ -190,7 +153,7 @@ cd vllm VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e . ``` -#### Troubleshooting +##### Troubleshooting To avoid your system being overloaded, you can limit the number of compilation jobs to be run simultaneously, via the environment variable `MAX_JOBS`. For example: @@ -224,7 +187,7 @@ nvcc --version # verify that nvcc is in your PATH ${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME ``` -### Unsupported OS build +#### Unsupported OS build vLLM can fully run only on Linux but for development purposes, you can still build it on other systems (for example, macOS), allowing for imports and a more convenient development environment. The binaries will not be compiled and won't work on non-Linux systems. @@ -234,3 +197,28 @@ Simply disable the `VLLM_TARGET_DEVICE` environment variable before installing: export VLLM_TARGET_DEVICE=empty pip install -e . ``` + +## Set up using Docker + +### Pre-built images + +See for instructions on using the official Docker image. + +Another way to access the latest code is to use the docker images: + +```console +export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch +docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT} +``` + +These docker images are used for CI and testing only, and they are not intended for production use. They will be expired after several days. + +The latest code can contain bugs and may not be stable. Please use it with caution. + +### Build image from source + +See for instructions on building the Docker image. + +## Supported features + +See compatibility matrix for feature support information. diff --git a/docs/source/getting_started/installation/gpu/index.md b/docs/source/getting_started/installation/gpu/index.md new file mode 100644 index 0000000000000..6c007382b2c3d --- /dev/null +++ b/docs/source/getting_started/installation/gpu/index.md @@ -0,0 +1,300 @@ +# GPU + +vLLM is a Python library that supports the following GPU variants. 
Select your GPU type to see vendor specific instructions: + +::::{tab-set} +:sync-group: device + +:::{tab-item} CUDA +:sync: cuda + +```{include} cuda.inc.md +:start-after: "# Installation" +:end-before: "## Requirements" +``` + +::: + +:::{tab-item} ROCm +:sync: rocm + +```{include} rocm.inc.md +:start-after: "# Installation" +:end-before: "## Requirements" +``` + +::: + +:::{tab-item} XPU +:sync: xpu + +```{include} xpu.inc.md +:start-after: "# Installation" +:end-before: "## Requirements" +``` + +::: + +:::: + +## Requirements + +- OS: Linux +- Python: 3.9 -- 3.12 + +::::{tab-set} +:sync-group: device + +:::{tab-item} CUDA +:sync: cuda + +```{include} cuda.inc.md +:start-after: "## Requirements" +:end-before: "## Set up using Python" +``` + +::: + +:::{tab-item} ROCm +:sync: rocm + +```{include} rocm.inc.md +:start-after: "## Requirements" +:end-before: "## Set up using Python" +``` + +::: + +:::{tab-item} XPU +:sync: xpu + +```{include} xpu.inc.md +:start-after: "## Requirements" +:end-before: "## Set up using Python" +``` + +::: + +:::: + +## Set up using Python + +### Create a new Python environment + +```{include} ../python_env_setup.inc.md +``` + +::::{tab-set} +:sync-group: device + +:::{tab-item} CUDA +:sync: cuda + +```{include} cuda.inc.md +:start-after: "## Create a new Python environment" +:end-before: "### Pre-built wheels" +``` + +::: + +:::{tab-item} ROCm +:sync: rocm + +There is no extra information on creating a new Python environment for this device. + +::: + +:::{tab-item} XPU +:sync: xpu + +There is no extra information on creating a new Python environment for this device. + +::: + +:::: + +### Pre-built wheels + +::::{tab-set} +:sync-group: device + +:::{tab-item} CUDA +:sync: cuda + +```{include} cuda.inc.md +:start-after: "### Pre-built wheels" +:end-before: "### Build wheel from source" +``` + +::: + +:::{tab-item} ROCm +:sync: rocm + +```{include} rocm.inc.md +:start-after: "### Pre-built wheels" +:end-before: "### Build wheel from source" +``` + +::: + +:::{tab-item} XPU +:sync: xpu + +```{include} xpu.inc.md +:start-after: "### Pre-built wheels" +:end-before: "### Build wheel from source" +``` + +::: + +:::: + +(build-from-source)= + +### Build wheel from source + +::::{tab-set} +:sync-group: device + +:::{tab-item} CUDA +:sync: cuda + +```{include} cuda.inc.md +:start-after: "### Build wheel from source" +:end-before: "## Set up using Docker" +``` + +::: + +:::{tab-item} ROCm +:sync: rocm + +```{include} rocm.inc.md +:start-after: "### Build wheel from source" +:end-before: "## Set up using Docker" +``` + +::: + +:::{tab-item} XPU +:sync: xpu + +```{include} xpu.inc.md +:start-after: "### Build wheel from source" +:end-before: "## Set up using Docker" +``` + +::: + +:::: + +## Set up using Docker + +### Pre-built images + +::::{tab-set} +:sync-group: device + +:::{tab-item} CUDA +:sync: cuda + +```{include} cuda.inc.md +:start-after: "### Pre-built images" +:end-before: "### Build image from source" +``` + +::: + +:::{tab-item} ROCm +:sync: rocm + +```{include} rocm.inc.md +:start-after: "### Pre-built images" +:end-before: "### Build image from source" +``` + +::: + +:::{tab-item} XPU +:sync: xpu + +```{include} xpu.inc.md +:start-after: "### Pre-built images" +:end-before: "### Build image from source" +``` + +::: + +:::: + +### Build image from source + +::::{tab-set} +:sync-group: device + +:::{tab-item} CUDA +:sync: cuda + +```{include} cuda.inc.md +:start-after: "### Build image from source" +:end-before: "## Supported features" +``` + +::: + 
+:::{tab-item} ROCm +:sync: rocm + +```{include} rocm.inc.md +:start-after: "### Build image from source" +:end-before: "## Supported features" +``` + +::: + +:::{tab-item} XPU +:sync: xpu + +```{include} xpu.inc.md +:start-after: "### Build image from source" +:end-before: "## Supported features" +``` + +::: + +:::: + +## Supported features + +::::{tab-set} +:sync-group: device + +:::{tab-item} CUDA +:sync: cuda + +```{include} cuda.inc.md +:start-after: "## Supported features" +``` + +::: + +:::{tab-item} ROCm +:sync: rocm + +```{include} rocm.inc.md +:start-after: "## Supported features" +``` + +::: + +:::{tab-item} XPU +:sync: xpu + +```{include} xpu.inc.md +:start-after: "## Supported features" +``` + +::: + +:::: diff --git a/docs/source/getting_started/installation/gpu-rocm.md b/docs/source/getting_started/installation/gpu/rocm.inc.md similarity index 87% rename from docs/source/getting_started/installation/gpu-rocm.md rename to docs/source/getting_started/installation/gpu/rocm.inc.md index a8971bb96248c..f6f9d3c303f89 100644 --- a/docs/source/getting_started/installation/gpu-rocm.md +++ b/docs/source/getting_started/installation/gpu/rocm.inc.md @@ -1,82 +1,19 @@ -(installation-rocm)= - -# Installation for ROCm +# Installation vLLM supports AMD GPUs with ROCm 6.2. ## Requirements -- OS: Linux -- Python: 3.9 -- 3.12 - GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100) - ROCm 6.2 -Installation options: - -1. [Build from source with docker](#build-from-source-docker-rocm) -2. [Build from source](#build-from-source-rocm) - -(build-from-source-docker-rocm)= - -## Option 1: Build from source with docker (recommended) - -You can build and install vLLM from source. - -First, build a docker image from and launch a docker container from the image. -It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon: - -```console -{ - "features": { - "buildkit": true - } -} -``` - - uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0 and 6.1 in older vLLM branches. -It provides flexibility to customize the build of docker image using the following arguments: - -- `BASE_IMAGE`: specifies the base image used when running `docker build`, specifically the PyTorch on ROCm base image. -- `BUILD_FA`: specifies whether to build CK flash-attention. The default is 1. For [Radeon RX 7900 series (gfx1100)](https://rocm.docs.amd.com/projects/radeon/en/latest/index.html), this should be set to 0 before flash-attention supports this target. -- `FX_GFX_ARCHS`: specifies the GFX architecture that is used to build CK flash-attention, for example, `gfx90a;gfx942` for MI200 and MI300. The default is `gfx90a;gfx942` -- `FA_BRANCH`: specifies the branch used to build the CK flash-attention in [ROCm's flash-attention repo](https://github.com/ROCmSoftwarePlatform/flash-attention). The default is `ae7928c` -- `BUILD_TRITON`: specifies whether to build triton flash-attention. The default value is 1. - -Their values can be passed in when running `docker build` with `--build-arg` options. - -To build vllm on ROCm 6.2 for MI200 and MI300 series, you can use the default: - -```console -DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm . 
-``` +## Set up using Python -To build vllm on ROCm 6.2 for Radeon RX7900 series (gfx1100), you should specify `BUILD_FA` as below: +### Pre-built wheels -```console -DOCKER_BUILDKIT=1 docker build --build-arg BUILD_FA="0" -f Dockerfile.rocm -t vllm-rocm . -``` +Currently, there are no pre-built ROCm wheels. -To run the above docker image `vllm-rocm`, use the below command: - -```console -$ docker run -it \ - --network=host \ - --group-add=video \ - --ipc=host \ - --cap-add=SYS_PTRACE \ - --security-opt seccomp=unconfined \ - --device /dev/kfd \ - --device /dev/dri \ - -v :/app/model \ - vllm-rocm \ - bash -``` - -Where the `` is the location where the model is stored, for example, the weights for llama2 or llama3 models. - -(build-from-source-rocm)= - -## Option 2: Build from source +### Build wheel from source 0. Install prerequisites (skip if you are already in an environment/docker with the following installed): @@ -157,7 +94,73 @@ Where the `` is the location where the model is stored, for examp - The ROCm version of PyTorch, ideally, should match the ROCm driver version. ``` - ```{tip} - - For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level. - For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization). - ``` +```{tip} +- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level. + For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization). +``` + +## Set up using Docker + +### Pre-built images + +Currently, there are no pre-built ROCm images. + +### Build image from source + +Building the Docker image from source is the recommended way to use vLLM with ROCm. + +First, build a docker image from and launch a docker container from the image. +It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon: + +```console +{ + "features": { + "buildkit": true + } +} +``` + + uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0 and 6.1 in older vLLM branches. +It provides flexibility to customize the build of docker image using the following arguments: + +- `BASE_IMAGE`: specifies the base image used when running `docker build`, specifically the PyTorch on ROCm base image. +- `BUILD_FA`: specifies whether to build CK flash-attention. The default is 1. For [Radeon RX 7900 series (gfx1100)](https://rocm.docs.amd.com/projects/radeon/en/latest/index.html), this should be set to 0 before flash-attention supports this target. +- `FX_GFX_ARCHS`: specifies the GFX architecture that is used to build CK flash-attention, for example, `gfx90a;gfx942` for MI200 and MI300. The default is `gfx90a;gfx942` +- `FA_BRANCH`: specifies the branch used to build the CK flash-attention in [ROCm's flash-attention repo](https://github.com/ROCmSoftwarePlatform/flash-attention). 
The default is `ae7928c` +- `BUILD_TRITON`: specifies whether to build triton flash-attention. The default value is 1. + +Their values can be passed in when running `docker build` with `--build-arg` options. + +To build vllm on ROCm 6.2 for MI200 and MI300 series, you can use the default: + +```console +DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm . +``` + +To build vllm on ROCm 6.2 for Radeon RX7900 series (gfx1100), you should specify `BUILD_FA` as below: + +```console +DOCKER_BUILDKIT=1 docker build --build-arg BUILD_FA="0" -f Dockerfile.rocm -t vllm-rocm . +``` + +To run the above docker image `vllm-rocm`, use the below command: + +```console +docker run -it \ + --network=host \ + --group-add=video \ + --ipc=host \ + --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + --device /dev/kfd \ + --device /dev/dri \ + -v :/app/model \ + vllm-rocm \ + bash +``` + +Where the `` is the location where the model is stored, for example, the weights for llama2 or llama3 models. + +## Supported features + +See compatibility matrix for feature support information. diff --git a/docs/source/getting_started/installation/xpu.md b/docs/source/getting_started/installation/gpu/xpu.inc.md similarity index 80% rename from docs/source/getting_started/installation/xpu.md rename to docs/source/getting_started/installation/gpu/xpu.inc.md index 73758f37cf0f6..577986eba74fd 100644 --- a/docs/source/getting_started/installation/xpu.md +++ b/docs/source/getting_started/installation/gpu/xpu.inc.md @@ -1,40 +1,19 @@ -(installation-xpu)= - -# Installation for XPUs +# Installation vLLM initially supports basic model inferencing and serving on Intel GPU platform. -Table of contents: - -1. [Requirements](#xpu-backend-requirements) -2. [Quick start using Dockerfile](#xpu-backend-quick-start-dockerfile) -3. [Build from source](#build-xpu-backend-from-source) - -(xpu-backend-requirements)= - ## Requirements -- OS: Linux - Supported Hardware: Intel Data Center GPU, Intel ARC GPU - OneAPI requirements: oneAPI 2024.2 -(xpu-backend-quick-start-dockerfile)= +## Set up using Python -## Quick start using Dockerfile - -```console -$ docker build -f Dockerfile.xpu -t vllm-xpu-env --shm-size=4g . -$ docker run -it \ - --rm \ - --network=host \ - --device /dev/dri \ - -v /dev/dri/by-path:/dev/dri/by-path \ - vllm-xpu-env -``` +### Pre-built wheels -(build-xpu-backend-from-source)= +Currently, there are no pre-built XPU wheels. -## Build from source +### Build wheel from source - First, install required driver and intel OneAPI 2024.2 or later. - Second, install Python packages for vLLM XPU backend building: @@ -56,7 +35,25 @@ VLLM_TARGET_DEVICE=xpu python setup.py install type will be supported in the future. ``` -## Distributed inference and serving +## Set up using Docker + +### Pre-built images + +Currently, there are no pre-built XPU images. + +### Build image from source + +```console +$ docker build -f Dockerfile.xpu -t vllm-xpu-env --shm-size=4g . +$ docker run -it \ + --rm \ + --network=host \ + --device /dev/dri \ + -v /dev/dri/by-path:/dev/dri/by-path \ + vllm-xpu-env +``` + +## Supported features XPU platform supports tensor-parallel inference/serving and also supports pipeline parallel as a beta feature for online serving. We requires Ray as the distributed runtime backend. 
For example, a reference execution likes following: diff --git a/docs/source/getting_started/installation/index.md b/docs/source/getting_started/installation/index.md index 0ebadca2ccec9..bc1d268bf0c7e 100644 --- a/docs/source/getting_started/installation/index.md +++ b/docs/source/getting_started/installation/index.md @@ -7,14 +7,7 @@ vLLM supports the following hardware platforms: ```{toctree} :maxdepth: 1 -gpu-cuda -gpu-rocm -cpu-x86 -cpu-arm -cpu-apple -hpu-gaudi -tpu -xpu -openvino -neuron +gpu/index +cpu/index +ai_accelerator/index ``` diff --git a/docs/source/getting_started/installation/python_env_setup.inc.md b/docs/source/getting_started/installation/python_env_setup.inc.md new file mode 100644 index 0000000000000..25cfac5f58aa7 --- /dev/null +++ b/docs/source/getting_started/installation/python_env_setup.inc.md @@ -0,0 +1,19 @@ +You can create a new Python environment using `conda`: + +```console +# (Recommended) Create a new conda environment. +conda create -n myenv python=3.12 -y +conda activate myenv +``` + +```{note} +[PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please only use it to create Python environment rather than installing packages. +``` + +Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command: + +```console +# (Recommended) Create a new uv environment. Use `--seed` to install `pip` and `setuptools` in the environment. +uv venv myenv --python 3.12 --seed +source myenv/bin/activate +```
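
With either environment activated, installing vLLM itself is typically a single command. The sketch below assumes the default CUDA build; the other platforms covered in this guide use their device-specific wheels or source builds instead:

```console
pip install vllm
```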