From 8ff87ee1cfcc1a09c405262c5d53fc305fbeb812 Mon Sep 17 00:00:00 2001
From: GitHub Actions
Date: Mon, 15 Apr 2024 15:27:33 +0000
Subject: [PATCH 1/5] Update placeholder in llama2.md and README.md

---
 README.md      | 25 +++++++++----------------
 docs/llama2.md | 34 ++++++++++++++--------------------
 2 files changed, 23 insertions(+), 36 deletions(-)

diff --git a/README.md b/README.md
index 9c546c3..af323a6 100644
--- a/README.md
+++ b/README.md
@@ -34,22 +34,15 @@ Take a first glance of Llama-2-7B Model Performance Metrics Across Different Pre
 
 | Engine | float32 | float16 | int8 | int4 |
 |---------------------------------------------|--------------|----------------|---------------|---------------|
-| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
-| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
-| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
-| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
-| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
-| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20|
-| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
-| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
-| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
-| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
-| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
-| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
-| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
-| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |
-
-*(Data updated: `05th April 2024`)
+| [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
+
+
+| Engine | float32 | float16 | int8 | int4 |
+|---------------------------------------------|----------|----------|----------|----------|
+| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 41324.38 | 21384.66 | 12830.38 |
+
+
+*(Data updated: `15th April 2024`)

diff --git a/docs/llama2.md b/docs/llama2.md
index d1fd4bb..cb8baa7 100644
--- a/docs/llama2.md
+++ b/docs/llama2.md
@@ -3,30 +3,24 @@
 
 ## A100 80GB Inference Bench:
 
 **Environment:**
-- Model: LLAMA-2-7B
-- CUDA Version: 11.7
-- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --prompt 'Write an essay about the transformer model architecture'`
+- Model: Llama 2 7B Chat
+- CUDA Version: 12.1
+- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model llama --prompt 'Write an essay about the transformer model architecture'`
 
 **Performance Metrics:** (unit: Tokens / second)
 
 | Engine | float32 | float16 | int8 | int4 |
 |---------------------------------------------|--------------|----------------|---------------|---------------|
-| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
-| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
-| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
-| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
-| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
-| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20|
-| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
-| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
-| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
-| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
-| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
-| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
-| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
-| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |
-
-*(Data updated: `05th April 2024`)
+| [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
+
+**Performance Metrics:** GPU Memory Consumption (unit: MB)
+
+| Engine | float32 | float16 | int8 | int4 |
+|---------------------------------------------|----------|----------|----------|----------|
+| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 41324.38 | 21384.66 | 12830.38 |
+
+
+*(Data updated: `15th April 2024`)
 
 ## M2 MAX 32GB Inference Bench:
@@ -57,4 +51,4 @@
 | [llama.cpp](/bench_llamacpp/) | - | - | 30.11 ± 0.45 | 44.27 ± 0.12 |
 | [ctransformers](/bench_ctransformers/) | - | - | 20.75 ± 0.36 | 34.04 ± 2.11 |
 
-*(Data updated: `05th April 2024`)
+*(Data updated: `15th April 2024`)

From cab6dbde693624f9d437ec4809368535cd50adf9 Mon Sep 17 00:00:00 2001
From: Anindyadeep
Date: Wed, 17 Apr 2024 16:00:53 +0530
Subject: [PATCH 2/5] Update llama2.md.template with previous values.

With the latest merge of the benchmark base class, this file was
overridden. Until the v2 release, this patch reverts it to its previous
state.
---
 docs/llama2.md.template | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/docs/llama2.md.template b/docs/llama2.md.template
index 912d1a9..fa9e729 100644
--- a/docs/llama2.md.template
+++ b/docs/llama2.md.template
@@ -15,9 +15,22 @@
 
 **Performance Metrics:** GPU Memory Consumption (unit: MB)
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|----------|----------|----------|----------|
-| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 41324.38 | 21384.66 | 12830.38 |
+| Engine | float32 | float16 | int8 | int4 |
+|---------------------------------------------|--------------|----------------|---------------|---------------|
+| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
+| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
+| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
+| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
+| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
+| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20|
+| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
+| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
+| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
+| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
+| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
+| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
+| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
+| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |
 
 *(Data updated: ``)

From 0860789d5f4c11299067209d23ffdc3d5c13f5c1 Mon Sep 17 00:00:00 2001
From: GitHub Actions
Date: Wed, 17 Apr 2024 10:35:55 +0000
Subject: [PATCH 3/5] Update placeholder in llama2.md and README.md

---
 README.md      | 25 +++++++++++++++++++------
 docs/llama2.md | 23 ++++++++++++++++++-----
 2 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index af323a6..31d0557 100644
--- a/README.md
+++ b/README.md
@@ -37,12 +37,25 @@ Take a first glance of Llama-2-7B Model Performance Metrics Across Different Pre
 | [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
 
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|----------|----------|----------|----------|
-| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 41324.38 | 21384.66 | 12830.38 |
-
-
-*(Data updated: `15th April 2024`)
+| Engine | float32 | float16 | int8 | int4 |
+|---------------------------------------------|--------------|----------------|---------------|---------------|
+| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
+| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
+| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
+| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
+| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
+| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20|
+| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
+| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
+| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
+| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
+| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
+| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
+| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
+| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |
+
+
+*(Data updated: `17th April 2024`)

diff --git a/docs/llama2.md b/docs/llama2.md
index cb8baa7..717bc19 100644
--- a/docs/llama2.md
+++ b/docs/llama2.md
@@ -15,12 +15,25 @@
 
 **Performance Metrics:** GPU Memory Consumption (unit: MB)
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|----------|----------|----------|----------|
-| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 41324.38 | 21384.66 | 12830.38 |
+| Engine | float32 | float16 | int8 | int4 |
+|---------------------------------------------|--------------|----------------|---------------|---------------|
+| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
+| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 |
+| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - |
+| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - |
+| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 |
+| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20|
+| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 |
+| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 |
+| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - |
+| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 |
+| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | |
+| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 |
+| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - |
+| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 |
 
-*(Data updated: `15th April 2024`)
+*(Data updated: `17th April 2024`)
 
 ## M2 MAX 32GB Inference Bench:
@@ -51,4 +64,4 @@
 | [llama.cpp](/bench_llamacpp/) | - | - | 30.11 ± 0.45 | 44.27 ± 0.12 |
 | [ctransformers](/bench_ctransformers/) | - | - | 20.75 ± 0.36 | 34.04 ± 2.11 |
 
-*(Data updated: `15th April 2024`)
+*(Data updated: `17th April 2024`)

From d9f29703818c23c21c477a2f9a29b68f9823af67 Mon Sep 17 00:00:00 2001
From: Anindyadeep
Date: Wed, 17 Apr 2024 16:26:43 +0530
Subject: [PATCH 4/5] Update llama2.md.template

---
 docs/llama2.md.template | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/docs/llama2.md.template b/docs/llama2.md.template
index fa9e729..678972c 100644
--- a/docs/llama2.md.template
+++ b/docs/llama2.md.template
@@ -5,16 +5,10 @@
 
 **Environment:**
 - Model: Llama 2 7B Chat
 - CUDA Version: 12.1
-- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model llama --prompt 'Write an essay about the transformer model architecture'`
+- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --prompt 'Write an essay about the transformer model architecture'`
 
 **Performance Metrics:** (unit: Tokens / second)
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|--------------|----------------|---------------|---------------|
-| [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
-
-**Performance Metrics:** GPU Memory Consumption (unit: MB)
-
 | Engine | float32 | float16 | int8 | int4 |
 |---------------------------------------------|--------------|----------------|---------------|---------------|
 | [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |

From d89116f85a82f1943b87c7ec49168de2e153d01e Mon Sep 17 00:00:00 2001
From: GitHub Actions
Date: Wed, 17 Apr 2024 10:56:54 +0000
Subject: [PATCH 5/5] Update placeholder in llama2.md and README.md

---
 README.md      | 5 -----
 docs/llama2.md | 8 +-------
 2 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 31d0557..8c70afe 100644
--- a/README.md
+++ b/README.md
@@ -32,11 +32,6 @@
 Take a first glance of Llama-2-7B Model Performance Metrics Across Different Precision and Inference Engines. Metric used: `tokens/sec`
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|--------------|----------------|---------------|---------------|
-| [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
-
-
 | Engine | float32 | float16 | int8 | int4 |
 |---------------------------------------------|--------------|----------------|---------------|---------------|
 | [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |

diff --git a/docs/llama2.md b/docs/llama2.md
index 717bc19..624a798 100644
--- a/docs/llama2.md
+++ b/docs/llama2.md
@@ -5,16 +5,10 @@
 
 **Environment:**
 - Model: Llama 2 7B Chat
 - CUDA Version: 12.1
-- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model llama --prompt 'Write an essay about the transformer model architecture'`
+- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --prompt 'Write an essay about the transformer model architecture'`
 
 **Performance Metrics:** (unit: Tokens / second)
 
-| Engine | float32 | float16 | int8 | int4 |
-|---------------------------------------------|--------------|----------------|---------------|---------------|
-| [transformers (pytorch)](/bench_pytorch/) | 37.37 ± 0.45 | 34.42 ± 0.45 | 7.07 ± 0.08 | 18.88 ± 0.08 |
-
-**Performance Metrics:** GPU Memory Consumption (unit: MB)
-
 | Engine | float32 | float16 | int8 | int4 |
 |---------------------------------------------|--------------|----------------|---------------|---------------|
 | [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - |
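
Note: the `mean ± std` entries these patches write into the tables come from repeated runs (`--repetitions 10`). As a minimal sketch of how such figures can be aggregated, assuming a plain mean over per-repetition tokens/sec and a population standard deviation (the sample values below are invented for illustration and are not taken from `benchmark.sh`):

```shell
# Invented tokens/sec samples standing in for the per-repetition
# measurements a benchmark run would produce.
samples="43.1 44.0 43.6 44.2 43.8"

# One sample per line; awk computes mean and population std dev and
# prints them in the "mean ± std" form used by the tables.
echo "$samples" | tr ' ' '\n' | awk '
  { sum += $1; sumsq += $1 * $1; n++ }
  END {
    mean = sum / n
    std = sqrt(sumsq / n - mean * mean)
    printf "%.2f ± %.2f\n", mean, std
  }'
# prints: 43.74 ± 0.38
```

Whether the actual harness uses the population (`n`) or sample (`n-1`) standard deviation is an assumption here; with 10 repetitions the difference is small but nonzero.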