From 241f0f0b315d2b0b36f537a92f314369c561615b Mon Sep 17 00:00:00 2001
From: Sourashis Roy
Date: Sun, 22 Dec 2024 18:40:10 +0000
Subject: [PATCH 1/4] Documentation for using EAGLE in vLLM

Signed-off-by: Sourashis Roy
---
 docs/source/usage/spec_decode.rst | 52 +++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/docs/source/usage/spec_decode.rst b/docs/source/usage/spec_decode.rst
index f1f1917f974bb..1bf1773793bfe 100644
--- a/docs/source/usage/spec_decode.rst
+++ b/docs/source/usage/spec_decode.rst
@@ -161,6 +161,58 @@ A variety of speculative models of this type are available on HF hub:
 * `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
 * `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_
 
+Speculating using EAGLE based draft models
+------------------------------------------
+
+The following code configures vLLM to use speculative decoding where proposals are generated by
+an `EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) <https://arxiv.org/pdf/2401.15077>`_ based draft model.
+
+.. code-block:: python
+
+    from vllm import LLM, SamplingParams
+
+    prompts = [
+        "The future of AI is",
+    ]
+    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+    llm = LLM(
+        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
+        tensor_parallel_size=4,
+        speculative_model="ibm-fms/llama3-70b-accelerator",
+        speculative_draft_tensor_parallel_size=1,
+    )
+    outputs = llm.generate(prompts, sampling_params)
+
+    for output in outputs:
+        prompt = output.prompt
+        generated_text = output.outputs[0].text
+        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+A few important things to consider when using the EAGLE based draft models:
+
+1. The EAGLE based draft models currently need to be run without tensor parallelism, although
+   it is possible to run the main model using tensor parallelism (see the example above). Since the
+   speculative models are relatively small, we still see significant speedups. However, this
+   limitation will be fixed in a future release.
+
+2. The EAGLE draft models available in `this Hugging Face repository <https://huggingface.co/yuhuili>`_ cannot be used directly
+   with vLLM due to differences in the expected layer names and model definition. To use these
+   models with vLLM, use the provided script to convert them. Note that this script does not
+   modify the model's weights.
+
+3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
+   expected when using EAGLE-based draft models for speculative decoding.
+   This issue is under investigation and tracked here: `https://github.com/vllm-project/vllm/issues/9565`.
+   Known differences between the vLLM implementation of EAGLE-based speculation and the original EAGLE implementation include:
+
+   a. ......
+   b. .....
+
+A variety of EAGLE draft models are available on HF hub:
+
 Lossless guarantees of Speculative Decoding
 -------------------------------------------
 In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy.
 This section addresses the lossless guarantees of
From bdb5a1564301cad1248f70a107139ee5df14cccb Mon Sep 17 00:00:00 2001
From: Sourashis Roy
Date: Tue, 7 Jan 2025 01:49:10 +0000
Subject: [PATCH 2/4] Add documentation for Eagle Usage

Signed-off-by: Sourashis Roy
---
 docs/source/features/spec_decode.md | 67 +++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/docs/source/features/spec_decode.md b/docs/source/features/spec_decode.md
index 8c52c97a41e48..b24d23b8652f6 100644
--- a/docs/source/features/spec_decode.md
+++ b/docs/source/features/spec_decode.md
@@ -159,6 +159,73 @@ A variety of speculative models of this type are available on HF hub:
 - [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
 - [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)
 
+## Speculating using EAGLE based draft models
+
+The following code configures vLLM to use speculative decoding where proposals are generated by
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model.
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+llm = LLM(
+    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
+    tensor_parallel_size=4,
+    speculative_model="path/to/modified/eagle/model",
+    speculative_draft_tensor_parallel_size=1,
+)
+
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+
+A few important things to consider when using the EAGLE based draft models:
+
+1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
+   used directly with vLLM due to differences in the expected layer names and model definition.
+   To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
+   to convert them. Note that this script does not modify the model's weights.
+   In the example above, use the script to first convert
+   the [yuhuili/EAGLE-LLaMA3-Instruct-70B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-70B) model
+   and then use the converted checkpoint as the draft model in vLLM.
+
+2. The EAGLE based draft models currently need to be run without tensor parallelism
+   (i.e. speculative_draft_tensor_parallel_size is set to 1), although
+   it is possible to run the main model using tensor parallelism (see example above). Since the
+   speculative models are relatively small, we still see significant speedups. However, this
+   limitation will be fixed in a future release.
+
+3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
+   expected when using EAGLE-based draft models for speculative decoding. This issue is under
+   investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
+
+A variety of EAGLE draft models are available on the Hugging Face hub:
+
+| Base Model                 | EAGLE on Hugging Face               | # EAGLE Parameters |
+|----------------------------|-------------------------------------|--------------------|
+| Vicuna-7B-v1.3             | yuhuili/EAGLE-Vicuna-7B-v1.3        | 0.24B              |
+| Vicuna-13B-v1.3            | yuhuili/EAGLE-Vicuna-13B-v1.3       | 0.37B              |
+| Vicuna-33B-v1.3            | yuhuili/EAGLE-Vicuna-33B-v1.3       | 0.56B              |
+| LLaMA2-Chat 7B             | yuhuili/EAGLE-llama2-chat-7B        | 0.24B              |
+| LLaMA2-Chat 13B            | yuhuili/EAGLE-llama2-chat-13B       | 0.37B              |
+| LLaMA2-Chat 70B            | yuhuili/EAGLE-llama2-chat-70B       | 0.99B              |
+| Mixtral-8x7B-Instruct-v0.1 | yuhuili/EAGLE-mixtral-instruct-8x7B | 0.28B              |
+| LLaMA3-Instruct 8B         | yuhuili/EAGLE-LLaMA3-Instruct-8B    | 0.25B              |
+| LLaMA3-Instruct 70B        | yuhuili/EAGLE-LLaMA3-Instruct-70B   | 0.99B              |
+| Qwen2-7B-Instruct          | yuhuili/EAGLE-Qwen2-7B-Instruct     | 0.26B              |
+| Qwen2-72B-Instruct         | yuhuili/EAGLE-Qwen2-72B-Instruct    | 1.05B              |
+
 ## Lossless guarantees of Speculative Decoding
 In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy.
 This section addresses the lossless guarantees of
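The conversion step in point 1 above is easy to picture in code. What follows is a minimal sketch of the idea only, not the linked gist: the rename table and the file names are invented for illustration, and the authoritative key mapping lives in the script referenced in point 1. The sketch demonstrates the documented guarantee that only layer names change while tensor values pass through untouched.

```python
# Illustrative sketch only -- the real mapping is in the gist linked in
# point 1 above. It shows the *kind* of transformation performed: state-dict
# keys are renamed to what vLLM expects; tensor values are never modified.
import torch

# Hypothetical renames, for illustration: e.g. an EAGLE checkpoint might
# name its decoder block "layers.*" where vLLM expects "model.layers.*".
RENAMES = {
    "layers.": "model.layers.",
    "embed_tokens.": "model.embed_tokens.",
}

def convert_eagle_state_dict(src: str, dst: str) -> None:
    state = torch.load(src, map_location="cpu")
    converted = {}
    for key, tensor in state.items():
        for old, new in RENAMES.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        converted[key] = tensor  # the weight tensor passes through unchanged
    torch.save(converted, dst)

# Assumed file names; EAGLE checkpoints commonly ship a pytorch_model.bin.
convert_eagle_state_dict("pytorch_model.bin", "pytorch_model.converted.bin")
```

Run against a real checkpoint, a script of this shape leaves every tensor bit-identical; only the keys under which vLLM looks the tensors up differ.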
From da31a00fc2159bc3235dcb8dec75770d553db82b Mon Sep 17 00:00:00 2001
From: Sourashis Roy
Date: Tue, 7 Jan 2025 01:59:51 +0000
Subject: [PATCH 3/4] Comments

Signed-off-by: Sourashis Roy
---
 docs/source/features/spec_decode.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/docs/source/features/spec_decode.md b/docs/source/features/spec_decode.md
index b24d23b8652f6..18282e1b4c1e2 100644
--- a/docs/source/features/spec_decode.md
+++ b/docs/source/features/spec_decode.md
@@ -194,18 +194,17 @@ A few important things to consider when using the EAGLE based draft models:
    used directly with vLLM due to differences in the expected layer names and model definition.
    To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
    to convert them. Note that this script does not modify the model's weights.
-   In the example above, use the script to first convert
+
+   In the above example, use the script to first convert
    the [yuhuili/EAGLE-LLaMA3-Instruct-70B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-70B) model
    and then use the converted checkpoint as the draft model in vLLM.
 
-2. The EAGLE based draft models currently need to be run without tensor parallelism
+2. The EAGLE based draft models need to be run without tensor parallelism
    (i.e. speculative_draft_tensor_parallel_size is set to 1), although
-   it is possible to run the main model using tensor parallelism (see example above). Since the
-   speculative models are relatively small, we still see significant speedups. However, this
-   limitation will be fixed in a future release.
+   it is possible to run the main model using tensor parallelism (see example above).
 
 3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
-   expected when using EAGLE-based draft models for speculative decoding. This issue is under
+   reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
    investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
 
 A variety of EAGLE draft models are available on the Hugging Face hub:
From abd94c379426fc973275dc988cc719d6d332f6a6 Mon Sep 17 00:00:00 2001
From: Sourashis Roy
Date: Tue, 7 Jan 2025 18:18:38 +0000
Subject: [PATCH 4/4] Address comments

Signed-off-by: Sourashis Roy
---
 docs/source/features/spec_decode.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/features/spec_decode.md b/docs/source/features/spec_decode.md
index 18282e1b4c1e2..29f9a3b8a536b 100644
--- a/docs/source/features/spec_decode.md
+++ b/docs/source/features/spec_decode.md
@@ -173,7 +173,7 @@ prompts = [
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 
 llm = LLM(
-    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
+    model="meta-llama/Meta-Llama-3-8B-Instruct",
     tensor_parallel_size=4,
     speculative_model="path/to/modified/eagle/model",
     speculative_draft_tensor_parallel_size=1,
@@ -196,7 +196,7 @@ A few important things to consider when using the EAGLE based draft models:
    to convert them. Note that this script does not modify the model's weights.
 
    In the above example, use the script to first convert
-   the [yuhuili/EAGLE-LLaMA3-Instruct-70B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-70B) model
+   the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
    and then use the converted checkpoint as the draft model in vLLM.
 
 2. The EAGLE based draft models need to be run without tensor parallelism
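To put a number on the speedup discussion in point 3, a small harness like the one below can compare tokens per second with and without the EAGLE draft model. It is a sketch that reuses only the `LLM` arguments already shown in the documented example; the model and draft paths are placeholders. Run the two configurations in separate processes so GPU memory is fully released between runs.

```python
# Minimal throughput probe: run once with USE_EAGLE=0 and once with
# USE_EAGLE=1 (in separate processes), then compare tokens/second.
import os
import time

from vllm import LLM, SamplingParams

use_eagle = os.environ.get("USE_EAGLE", "0") == "1"

kwargs = dict(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=4)
if use_eagle:
    kwargs.update(
        speculative_model="path/to/modified/eagle/model",  # converted checkpoint
        speculative_draft_tensor_parallel_size=1,
    )

llm = LLM(**kwargs)
prompts = ["The future of AI is"] * 32
params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy, capped length

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests to compute decode throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"eagle={use_eagle} tokens={generated} tok/s={generated / elapsed:.1f}")
```

Greedy sampling keeps the two runs comparable; absolute numbers will vary with hardware, batch size, and prompt mix, which is one reason measured speedups can differ from those reported for the reference implementation.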