From 241f0f0b315d2b0b36f537a92f314369c561615b Mon Sep 17 00:00:00 2001
From: Sourashis Roy
Date: Sun, 22 Dec 2024 18:40:10 +0000
Subject: [PATCH 1/4] Documentation for using EAGLE in vLLM

Signed-off-by: Sourashis Roy
---
 docs/source/usage/spec_decode.rst | 52 +++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/docs/source/usage/spec_decode.rst b/docs/source/usage/spec_decode.rst
index f1f1917f974bb..1bf1773793bfe 100644
--- a/docs/source/usage/spec_decode.rst
+++ b/docs/source/usage/spec_decode.rst
@@ -161,6 +161,58 @@ A variety of speculative models of this type are available on HF hub:
 * `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
 * `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_
 
+Speculating using EAGLE based draft models
+------------------------------------------
+
+The following code configures vLLM to use speculative decoding where proposals are generated by
+an `EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) <https://arxiv.org/pdf/2401.15077>`_ based draft model.
+
+.. code-block:: python
+
+    from vllm import LLM, SamplingParams
+
+    prompts = [
+        "The future of AI is",
+    ]
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+    llm = LLM(
+        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
+        tensor_parallel_size=4,
+        speculative_model="ibm-fms/llama3-70b-accelerator",
+        speculative_draft_tensor_parallel_size=1,
+    )
+    outputs = llm.generate(prompts, sampling_params)
+
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+A few important things to consider when using the EAGLE based draft models:
+
+1. The EAGLE based draft models currently need to be run without tensor parallelism, although
+   it is possible to run the main model using tensor parallelism (see the example above). Since the
+   speculative models are relatively small, we still see significant speedups. However, this
+   limitation will be fixed in a future release.
+
+2. The EAGLE draft models available in `this Hugging Face repository <https://huggingface.co/yuhuili>`_ cannot be used directly
+   with vLLM due to differences in the expected layer names and model definition. To use these
+   models with vLLM, use the provided script to convert them. Note that this script does not
+   modify the model's weights.
+
+3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
+   expected when using EAGLE-based draft models for speculative decoding.
+   This issue is under investigation and tracked here: `https://github.com/vllm-project/vllm/issues/9565`.
+   Known differences between the vLLM implementation of EAGLE-based speculation and the original EAGLE implementation include:
+
+   a. ......
+   b. .....
+
+A variety of EAGLE draft models are available on HF hub:
+
 Lossless guarantees of Speculative Decoding
 -------------------------------------------
 In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy.
 This section addresses the lossless guarantees of
From bdb5a1564301cad1248f70a107139ee5df14cccb Mon Sep 17 00:00:00 2001
From: Sourashis Roy
Date: Tue, 7 Jan 2025 01:49:10 +0000
Subject: [PATCH 2/4] Add documentation for Eagle Usage

Signed-off-by: Sourashis Roy
---
 docs/source/features/spec_decode.md | 67 +++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/docs/source/features/spec_decode.md b/docs/source/features/spec_decode.md
index 8c52c97a41e48..b24d23b8652f6 100644
--- a/docs/source/features/spec_decode.md
+++ b/docs/source/features/spec_decode.md
@@ -159,6 +159,73 @@ A variety of speculative models of this type are available on HF hub:
 - [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
 - [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)
 
+## Speculating using EAGLE based draft models
+
+The following code configures vLLM to use speculative decoding where proposals are generated by
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model.
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+llm = LLM(
+    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
+    tensor_parallel_size=4,
+    speculative_model="path/to/modified/eagle/model",
+    speculative_draft_tensor_parallel_size=1,
+)
+
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+A few important things to consider when using the EAGLE based draft models:
+
+1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
+   used directly with vLLM due to differences in the expected layer names and model definition.
+   To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
+   to convert them. Note that this script does not modify the model's weights.
+   In the example above, use the script to first convert
+   the [yuhuili/EAGLE-LLaMA3-Instruct-70B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-70B) model
+   and then use the converted checkpoint as the draft model in vLLM.
+
+2. The EAGLE based draft models currently need to be run without tensor parallelism
+   (i.e. speculative_draft_tensor_parallel_size is set to 1), although
+   it is possible to run the main model using tensor parallelism (see example above). Since the
+   speculative models are relatively small, we still see significant speedups. However, this
+   limitation will be fixed in a future release.
+
+3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
+   expected when using EAGLE-based draft models for speculative decoding. This issue is under
+   investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
+
+A variety of EAGLE draft models are available on the Hugging Face hub:
+
+| Base Model                 | EAGLE on Hugging Face               | # EAGLE Parameters |
+|----------------------------|-------------------------------------|--------------------|
+| Vicuna-7B-v1.3             | yuhuili/EAGLE-Vicuna-7B-v1.3        | 0.24B              |
+| Vicuna-13B-v1.3            | yuhuili/EAGLE-Vicuna-13B-v1.3       | 0.37B              |
+| Vicuna-33B-v1.3            | yuhuili/EAGLE-Vicuna-33B-v1.3       | 0.56B              |
+| LLaMA2-Chat 7B             | yuhuili/EAGLE-llama2-chat-7B        | 0.24B              |
+| LLaMA2-Chat 13B            | yuhuili/EAGLE-llama2-chat-13B       | 0.37B              |
+| LLaMA2-Chat 70B            | yuhuili/EAGLE-llama2-chat-70B       | 0.99B              |
+| Mixtral-8x7B-Instruct-v0.1 | yuhuili/EAGLE-mixtral-instruct-8x7B | 0.28B              |
+| LLaMA3-Instruct 8B         | yuhuili/EAGLE-LLaMA3-Instruct-8B    | 0.25B              |
+| LLaMA3-Instruct 70B        | yuhuili/EAGLE-LLaMA3-Instruct-70B   | 0.99B              |
+| Qwen2-7B-Instruct          | yuhuili/EAGLE-Qwen2-7B-Instruct     | 0.26B              |
+| Qwen2-72B-Instruct         | yuhuili/EAGLE-Qwen2-72B-Instruct    | 1.05B              |
+
 ## Lossless guarantees of Speculative Decoding
 In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy.
 This section addresses the lossless guarantees of
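The conversion step in point 1 above is easy to picture in code. What follows is a minimal sketch of the idea only, not the linked gist: the rename table and the file names are invented for illustration, and the authoritative key mapping lives in the script referenced in point 1. The sketch demonstrates the documented guarantee that only layer names change while tensor values pass through untouched.

```python
# Illustrative sketch only -- the real mapping is in the gist linked in
# point 1 above. It shows the *kind* of transformation performed: state-dict
# keys are renamed to what vLLM expects; tensor values are never modified.
import torch

# Hypothetical renames, for illustration: e.g. an EAGLE checkpoint might
# name its decoder block "layers.*" where vLLM expects "model.layers.*".
RENAMES = {
    "layers.": "model.layers.",
    "embed_tokens.": "model.embed_tokens.",
}

def convert_eagle_state_dict(src: str, dst: str) -> None:
    state = torch.load(src, map_location="cpu")
    converted = {}
    for key, tensor in state.items():
        for old, new in RENAMES.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        converted[key] = tensor  # the weight tensor passes through unchanged
    torch.save(converted, dst)

# Assumed file names; EAGLE checkpoints commonly ship a pytorch_model.bin.
convert_eagle_state_dict("pytorch_model.bin", "pytorch_model.converted.bin")
```

Run against a real checkpoint, a script of this shape leaves every tensor bit-identical; only the keys under which vLLM looks the tensors up differ.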
From da31a00fc2159bc3235dcb8dec75770d553db82b Mon Sep 17 00:00:00 2001
From: Sourashis Roy
Date: Tue, 7 Jan 2025 01:59:51 +0000
Subject: [PATCH 3/4] Comments

Signed-off-by: Sourashis Roy
---
 docs/source/features/spec_decode.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/docs/source/features/spec_decode.md b/docs/source/features/spec_decode.md
index b24d23b8652f6..18282e1b4c1e2 100644
--- a/docs/source/features/spec_decode.md
+++ b/docs/source/features/spec_decode.md
@@ -194,18 +194,17 @@ A few important things to consider when using the EAGLE based draft models:
    used directly with vLLM due to differences in the expected layer names and model definition.
    To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
    to convert them. Note that this script does not modify the model's weights.
-   In the example above, use the script to first convert
+
+   In the above example, use the script to first convert
    the [yuhuili/EAGLE-LLaMA3-Instruct-70B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-70B) model
    and then use the converted checkpoint as the draft model in vLLM.
 
-2. The EAGLE based draft models currently need to be run without tensor parallelism
+2. The EAGLE based draft models need to be run without tensor parallelism
    (i.e. speculative_draft_tensor_parallel_size is set to 1), although
-   it is possible to run the main model using tensor parallelism (see example above). Since the
-   speculative models are relatively small, we still see significant speedups. However, this
-   limitation will be fixed in a future release.
+   it is possible to run the main model using tensor parallelism (see example above).
 
 3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
-   expected when using EAGLE-based draft models for speculative decoding. This issue is under
+   reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
    investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
 
 A variety of EAGLE draft models are available on the Hugging Face hub:
From abd94c379426fc973275dc988cc719d6d332f6a6 Mon Sep 17 00:00:00 2001
From: Sourashis Roy
Date: Tue, 7 Jan 2025 18:18:38 +0000
Subject: [PATCH 4/4] Address comments

Signed-off-by: Sourashis Roy
---
 docs/source/features/spec_decode.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/features/spec_decode.md b/docs/source/features/spec_decode.md
index 18282e1b4c1e2..29f9a3b8a536b 100644
--- a/docs/source/features/spec_decode.md
+++ b/docs/source/features/spec_decode.md
@@ -173,7 +173,7 @@ prompts = [
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 
 llm = LLM(
-    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
+    model="meta-llama/Meta-Llama-3-8B-Instruct",
     tensor_parallel_size=4,
     speculative_model="path/to/modified/eagle/model",
     speculative_draft_tensor_parallel_size=1,
@@ -196,7 +196,7 @@ A few important things to consider when using the EAGLE based draft models:
    to convert them. Note that this script does not modify the model's weights.
 
    In the above example, use the script to first convert
-   the [yuhuili/EAGLE-LLaMA3-Instruct-70B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-70B) model
+   the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
    and then use the converted checkpoint as the draft model in vLLM.
 
 2. The EAGLE based draft models need to be run without tensor parallelism
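To put a number on the speedup discussion in point 3, a small harness like the one below can compare tokens per second with and without the EAGLE draft model. It is a sketch that reuses only the `LLM` arguments already shown in the documented example; the model and draft paths are placeholders. Run the two configurations in separate processes so GPU memory is fully released between runs.

```python
# Minimal throughput probe: run once with USE_EAGLE=0 and once with
# USE_EAGLE=1 (in separate processes), then compare tokens/second.
import os
import time

from vllm import LLM, SamplingParams

use_eagle = os.environ.get("USE_EAGLE", "0") == "1"

kwargs = dict(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=4)
if use_eagle:
    kwargs.update(
        speculative_model="path/to/modified/eagle/model",  # converted checkpoint
        speculative_draft_tensor_parallel_size=1,
    )

llm = LLM(**kwargs)
prompts = ["The future of AI is"] * 32
params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy, capped length

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests to compute decode throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"eagle={use_eagle} tokens={generated} tok/s={generated / elapsed:.1f}")
```

Greedy sampling keeps the two runs comparable; absolute numbers will vary with hardware, batch size, and prompt mix, which is one reason measured speedups can differ from those reported for the reference implementation.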