From 8ff87ee1cfcc1a09c405262c5d53fc305fbeb812 Mon Sep 17 00:00:00 2001
From: GitHub Actions
Date: Mon, 15 Apr 2024 15:27:33 +0000
Subject: [PATCH 1/5] Update placeholder in llama2.md and README.md

---
 README.md      | 25 +++++++++----------------
 docs/llama2.md | 34 ++++++++++++++--------------------
 2 files changed, 23 insertions(+), 36 deletions(-)

diff --git a/README.md b/README.md
index 9c546c3..af323a6 100644
--- a/README.md
+++ b/README.md
@@ -34,22 +34,15 @@ Take a first glance of Llama-2-7B Model Performance Metrics Across Different Pre
 
 | Engine | float32 | float16 | int8 | int4 |
 |---------------------------------------------|--------------|----------------|---------------|---------------|
-| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
-| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
-| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
-| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
-| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
-| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20|
-| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
-| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
-| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
-| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
-| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
-| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
-| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
-| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |
-
-*(Data updated: `05th April 2024`)
+| [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
+
+
+| Engine | float32 | float16 | int8 | int4 |
+|---------------------------------------------|----------|----------|----------|----------|
+| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 41324.38 | 21384.66 | 12830.38 |
+
+
+*(Data updated: `15th April 2024`)

diff --git a/docs/llama2.md b/docs/llama2.md
index d1fd4bb..cb8baa7 100644
--- a/docs/llama2.md
+++ b/docs/llama2.md
@@ -3,30 +3,24 @@
 
 ## A100 80GB Inference Bench:
 
 **Environment:**
-- Model: LLAMA-2-7B
-- CUDA Version: 11.7
-- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --prompt 'Write an essay about the transformer model architecture'`
+- Model: Llama 2 7B Chat
+- CUDA Version: 12.1
+- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model llama --prompt 'Write an essay about the transformer model architecture'`
 
 **Performance Metrics:** (unit: Tokens / second)
 
 | Engine | float32 | float16 | int8 | int4 |
 |---------------------------------------------|--------------|----------------|---------------|---------------|
-| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
-| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
-| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
-| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
-| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
-| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20|
-| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
-| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
-| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
-| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
-| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
-| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
-| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
-| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |
-
-*(Data updated: `05th April 2024`)
+| [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
+
+**Performance Metrics:** GPU Memory Consumption (unit: MB)
+
+| Engine | float32 | float16 | int8 | int4 |
+|---------------------------------------------|----------|----------|----------|----------|
+| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 41324.38 | 21384.66 | 12830.38 |
+
+
+*(Data updated: `15th April 2024`)
 
 ## M2 MAX 32GB Inference Bench:
@@ -57,4 +51,4 @@
 | [llama.cpp](/bench_llamacpp/) | - | - | 30.11 ± 0.45 | 44.27 ± 0.12 |
 | [ctransformers](/bench_ctransformers/) | - | - | 20.75 ± 0.36 | 34.04 ± 2.11 |
 
-*(Data updated: `05th April 2024`)
+*(Data updated: `15th April 2024`)

From cab6dbde693624f9d437ec4809368535cd50adf9 Mon Sep 17 00:00:00 2001
From: Anindyadeep
Date: Wed, 17 Apr 2024 16:00:53 +0530
Subject: [PATCH 2/5] Update llama2.md.template with previous values.

With the latest merge of the benchmark base class, this file was
overridden. Until the v2 release, this patch reverts it to its previous
state.
---
 docs/llama2.md.template | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/docs/llama2.md.template b/docs/llama2.md.template
index 912d1a9..fa9e729 100644
--- a/docs/llama2.md.template
+++ b/docs/llama2.md.template
@@ -15,9 +15,22 @@
 
 **Performance Metrics:** GPU Memory Consumption (unit: MB)
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|----------|----------|----------|----------|
-| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 41324.38 | 21384.66 | 12830.38 |
+| Engine | float32 | float16 | int8 | int4 |
+|---------------------------------------------|--------------|----------------|---------------|---------------|
+| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
+| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
+| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
+| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
+| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
+| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20|
+| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
+| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
+| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
+| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
+| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
+| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
+| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
+| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |
 
 *(Data updated: ``)

From 0860789d5f4c11299067209d23ffdc3d5c13f5c1 Mon Sep 17 00:00:00 2001
From: GitHub Actions
Date: Wed, 17 Apr 2024 10:35:55 +0000
Subject: [PATCH 3/5] Update placeholder in llama2.md and README.md

---
 README.md      | 25 +++++++++++++++++++------
 docs/llama2.md | 23 ++++++++++++++++++-----
 2 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index af323a6..31d0557 100644
--- a/README.md
+++ b/README.md
@@ -37,12 +37,25 @@ Take a first glance of Llama-2-7B Model Performance Metrics Across Different Pre
 | [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
 
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|----------|----------|----------|----------|
-| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 41324.38 | 21384.66 | 12830.38 |
-
-
-*(Data updated: `15th April 2024`)
+| Engine | float32 | float16 | int8 | int4 |
+|---------------------------------------------|--------------|----------------|---------------|---------------|
+| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
+| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
+| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
+| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
+| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
+| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20|
+| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
+| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
+| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
+| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
+| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
+| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
+| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
+| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |
+
+
+*(Data updated: `17th April 2024`)

diff --git a/docs/llama2.md b/docs/llama2.md
index cb8baa7..717bc19 100644
--- a/docs/llama2.md
+++ b/docs/llama2.md
@@ -15,12 +15,25 @@
 
 **Performance Metrics:** GPU Memory Consumption (unit: MB)
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|----------|----------|----------|----------|
-| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 41324.38 | 21384.66 | 12830.38 |
+| Engine | float32 | float16 | int8 | int4 |
+|---------------------------------------------|--------------|----------------|---------------|---------------|
+| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
+| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
+| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
+| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
+| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
+| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20|
+| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
+| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
+| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
+| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
+| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
+| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
+| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
+| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |
 
-*(Data updated: `15th April 2024`)
+*(Data updated: `17th April 2024`)
 
 ## M2 MAX 32GB Inference Bench:
@@ -51,4 +64,4 @@
 | [llama.cpp](/bench_llamacpp/) | - | - | 30.11 ± 0.45 | 44.27 ± 0.12 |
 | [ctransformers](/bench_ctransformers/) | - | - | 20.75 ± 0.36 | 34.04 ± 2.11 |
 
-*(Data updated: `15th April 2024`)
+*(Data updated: `17th April 2024`)

From d9f29703818c23c21c477a2f9a29b68f9823af67 Mon Sep 17 00:00:00 2001
From: Anindyadeep
Date: Wed, 17 Apr 2024 16:26:43 +0530
Subject: [PATCH 4/5] Update llama2.md.template

---
 docs/llama2.md.template | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/docs/llama2.md.template b/docs/llama2.md.template
index fa9e729..678972c 100644
--- a/docs/llama2.md.template
+++ b/docs/llama2.md.template
@@ -5,16 +5,10 @@
 
 **Environment:**
 - Model: Llama 2 7B Chat
 - CUDA Version: 12.1
-- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model llama --prompt 'Write an essay about the transformer model architecture'`
+- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --prompt 'Write an essay about the transformer model architecture'`
 
 **Performance Metrics:** (unit: Tokens / second)
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|--------------|----------------|---------------|---------------|
-| [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
-
-**Performance Metrics:** GPU Memory Consumption (unit: MB)
-
 | Engine | float32 | float16 | int8 | int4 |
 |---------------------------------------------|--------------|----------------|---------------|---------------|
 | [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |

From d89116f85a82f1943b87c7ec49168de2e153d01e Mon Sep 17 00:00:00 2001
From: GitHub Actions
Date: Wed, 17 Apr 2024 10:56:54 +0000
Subject: [PATCH 5/5] Update placeholder in llama2.md and README.md

---
 README.md      | 5 -----
 docs/llama2.md | 8 +-------
 2 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 31d0557..8c70afe 100644
--- a/README.md
+++ b/README.md
@@ -32,11 +32,6 @@
 Take a first glance of Llama-2-7B Model Performance Metrics Across Different Precision and Inference Engines. Metric used: `tokens/sec`
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|--------------|----------------|---------------|---------------|
-| [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
-
-
 | Engine | float32 | float16 | int8 | int4 |
 |---------------------------------------------|--------------|----------------|---------------|---------------|
 | [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |

diff --git a/docs/llama2.md b/docs/llama2.md
index 717bc19..624a798 100644
--- a/docs/llama2.md
+++ b/docs/llama2.md
@@ -5,16 +5,10 @@
 
 **Environment:**
 - Model: Llama 2 7B Chat
 - CUDA Version: 12.1
-- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model llama --prompt 'Write an essay about the transformer model architecture'`
+- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --prompt 'Write an essay about the transformer model architecture'`
 
 **Performance Metrics:** (unit: Tokens / second)
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|--------------|----------------|---------------|---------------|
-| [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
-
-**Performance Metrics:** GPU Memory Consumption (unit: MB)
-
 | Engine | float32 | float16 | int8 | int4 |
 |---------------------------------------------|--------------|----------------|---------------|---------------|
 | [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
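
Note: the `mean ± std` entries these patches write into the tables come from repeated runs (`--repetitions 10`). As a minimal sketch of how such figures can be aggregated, assuming a plain mean over per-repetition tokens/sec and a population standard deviation (the sample values below are invented for illustration and are not taken from `benchmark.sh`):

```shell
# Invented tokens/sec samples standing in for the per-repetition
# measurements a benchmark run would produce.
samples="43.1 44.0 43.6 44.2 43.8"

# One sample per line; awk computes mean and population std dev and
# prints them in the "mean ± std" form used by the tables.
echo "$samples" | tr ' ' '\n' | awk '
  { sum += $1; sumsq += $1 * $1; n++ }
  END {
    mean = sum / n
    std = sqrt(sumsq / n - mean * mean)
    printf "%.2f ± %.2f\n", mean, std
  }'
# prints: 43.74 ± 0.38
```

Whether the actual harness uses the population (`n`) or sample (`n-1`) standard deviation is an assumption here; with 10 repetitions the difference is small but nonzero.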