diff --git a/.github/workflows/update_benchmark.yaml b/.github/workflows/update_benchmark.yaml deleted file mode 100644 index f0e5ca90..00000000 --- a/.github/workflows/update_benchmark.yaml +++ /dev/null @@ -1,34 +0,0 @@ -name: Update Benchmark - -on: - push: - branches: ["main"] - paths: - - docs/llama2.md.template - workflow_dispatch: - -jobs: - update-readme: - runs-on: ubuntu-latest - steps: - - name: Checkout Code Repository - uses: actions/checkout@v3 - - - name: Update Benchmark - run: | - sed "s||$(date -u +"%dth %B %Y")|g" docs/llama2.md.template > docs/llama2.md - sed -n '/^## A100 80GB Inference Bench:/,/^## M2 MAX 32GB Inference Bench:/p' docs/llama2.md | sed '$d' | awk '/^\*\*Performance Metrics:\*\*/{p=1; next} p; /^\*\*\(Data updated:/{exit}' > first_table.md - awk '//{system("cat first_table.md"); next} 1' README.md.template > README.md - - - name: Commit changes - run: | - git config --global user.email "actions@github.com" - git config --global user.name "GitHub Actions" - git add docs/llama2.md README.md - git commit -m "Update placeholder in llama2.md and README.md" || true - - - name: Push changes - uses: ad-m/github-push-action@master - with: - github_token: ${{ secrets.GITHUB_TOKEN }} - branch: ${{ github.ref }} diff --git a/README.md b/README.md index 9c546c31..4170c02a 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@
-

🕹️ Benchmarks

-

A fully reproducible Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models.

+

🕹️ Benchmarks

+

A fully reproducible Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models

[![GitHub contributors](https://img.shields.io/github/contributors/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks/graphs/contributors) @@ -11,109 +11,212 @@ [![GitHub issues](https://img.shields.io/github/issues/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks/issues) [![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE) - -
- Table of Contents -
    -
  1. Quick glance towards performance metrics for Llama-2-7B
  2. -
  3. Getting started
  4. -
  5. Usage
  6. -
  7. Contribute
  8. -
  9. Roadmap
  10. -
  11. Introducing Prem Grant Program
  12. -
+ Table of Contents +
    +
  1. Quick glance towards performance metrics
  2. +
  3. ML Engines
  4. +
  5. Why Benchmarks
  6. +
  7. Usage and workflow
  8. +
  9. Contribute
  10. +
-
- -## 📊 Quick glance towards performance metrics for Llama-2-7B - -Take a first glance of Llama-2-7B Model Performance Metrics Across Different Precision and Inference Engines. Metric used: `tokens/sec` - - -| Engine | float32 | float16 | int8 | int4 | -|---------------------------------------------|--------------|----------------|---------------|---------------| -| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - | -| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 | -| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - | -| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - | -| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 | -| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20| -| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 | -| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 | -| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - | -| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 | -| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | | -| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 | -| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - | -| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 | - -*(Data updated: `05th April 2024`) - +## 🥽 Quick glance towards performance benchmarks --- The above benchmarking is done on A100-80GB GPU. You can find more details for other devices like CPU/Metal under [docs](docs/llama2.md) folder. +Take a first glance at [Mistral 7B v0.1 Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) and [Llama 2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) Performance Metrics Across Different Precision and Inference Engines. Here is our run specification that generated this performance benchmark reports. -- Also if you want to see more detailed information about each of the benchmark, you can find those details the respective benchmark folders. +**Environment:** +- Model: Mistral 7B v0.1 Instruct / Llama 2 7B Chat +- CUDA Version: 12.1 +- Batch size: 1 -- If you want to compare side by side which inference engines supports which precision and device, you can check out the [ml_engines.md](/docs/ml_engines.md) file. Please note that this file is incomplete and a better comparision of engines will be added in the later versions. +**Command:** -Benchmarks can also be considered as a repository of hackable scripts, that contains the code and all the knowledge base to run the popular inference engines. - -## 🚀 Getting Started - -Welcome to our benchmarking repository! This organized structure is designed to simplify benchmark management and execution. Here's a quick guide to get you started: - -- **Benchmark Organization:** Each benchmark is uniquely identified as `bench_name` and resides in its dedicated folder, named `bench_{bench_name}`. - -- **Benchmark Script (`bench.sh`):** Within these benchmark folders, you'll find a common script named `bench.sh`. This script takes care of everything from setup and environment configuration to actual execution. 
- -### Benchmark Script Parameters +``` +./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model mistral/llama --prompt 'Write an essay about the transformer model architecture' +``` -The `bench.sh` script supports the following key parameters, allowing for customization and flexibility: +### Mistral 7B v0.1 Instruct + +**Performance Metrics:** (unit: Tokens/second) + +| Engine | float32 | float16 | int8 | int4 | +| ------------------------------------------ | ------------- | ------------- | ------------- | ------------- | +| [transformers (pytorch)](/bench_pytorch/) | 39.61 ± 0.65 | 37.05 ± 0.49 | 5.08 ± 0.01 | 19.58 ± 0.38 | +| [AutoAWQ](/bench_autoawq/) | - | - | - | 63.12 ± 2.19 | +| [AutoGPTQ](/bench_autogptq/) | 39.11 ± 0.42 | 42.94 ± 0.80 | | | +| [DeepSpeed](/bench_deepspeed/) | | 79.88 ± 0.32 | | | +| [ctransformers](/bench_ctransformers/) | - | - | 86.14 ± 1.40 | 87.22 ± 1.54 | +| [llama.cpp](/bench_llamacpp/) | - | - | 88.27 ± 0.72 | 95.33 ± 5.54 | +| [ctranslate](/bench_ctranslate/) | 43.17 ± 2.97 | 68.03 ± 0.27 | 45.14 ± 0.24 | - | +| [PyTorch Lightning](/bench_lightning/) | 32.79 ± 2.74 | 43.01 ± 2.90 | 7.75 ± 0.12 | - | +| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 117.04 ± 2.16 | 206.59 ± 6.93 | 390.49 ± 4.86 | 427.40 ± 4.84 | +| [vllm](/bench_vllm/) | 84.91 ± 0.27 | 84.89 ± 0.28 | - | 106.03 ± 0.53 | +| [exllamav2](/bench_exllamav2/) | - | - | 114.81 ± 1.47 | 126.29 ± 3.05 | +| [onnx](/bench_onnxruntime/) | 15.75 ± 0.15 | 22.39 ± 0.14 | - | - | +| [Optimum Nvidia](/bench_optimum_nvidia/) | 50.77 ± 0.85 | 50.91 ± 0.19 | - | - | + +**Performance Metrics:** GPU Memory Consumption (unit: MB) + +| Engine | float32 | float16 | int8 | int4 | +| ------------------------------------------ | -------- | -------- | -------- | -------- | +| [transformers (pytorch)](/bench_pytorch/) | 31071.4 | 15976.1 | 10963.91 | 5681.18 | +| [AutoGPTQ](/bench_autogptq/) | 13400.80 | 6633.29 | | | +| [AutoAWQ](/bench_autoawq/) | - | - | - | 6572.47 | +| [DeepSpeed](/bench_deepspeed/) | | 80104.34 | | | +| [ctransformers](/bench_ctransformers/) | - | - | 10255.07 | 6966.74 | +| [llama.cpp](/bench_llamacpp/) | - | - | 9141.49 | 5880.41 | +| [ctranslate](/bench_ctranslate/) | 32602.32 | 17523.8 | 10074.72 | - | +| [PyTorch Lightning](/bench_lightning/) | 48783.95 | 18738.05 | 10680.32 | - | +| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 79536.59 | 78341.21 | 77689.0 | 77311.51 | +| [vllm](/bench_vllm/) | 73568.09 | 73790.39 | - | 74016.88 | +| [exllamav2](/bench_exllamav2/) | - | - | 21483.23 | 9460.25 | +| [onnx](/bench_onnxruntime/) | 33629.93 | 19537.07 | - | - | +| [Optimum Nvidia](/bench_optimum_nvidia/) | 79563.85 | 79496.74 | - | - | + +*(Data updated: `30th April 2024`) + +### Llama 2 7B Chat + +**Performance Metrics:** (unit: Tokens / second) + +| Engine | float32 | float16 | int8 | int4 | +| ------------------------------------------ | ------------- | ------------- | ------------- | ------------- | +| [transformers (pytorch)](/bench_pytorch/) | 36.65 ± 0.61 | 34.20 ± 0.51 | 6.91 ± 0.14 | 17.83 ± 0.40 | +| [AutoAWQ](/bench_autoawq/) | - | - | - | 63.59 ± 1.86 | +| [AutoGPTQ](/bench_autogptq/) | 34.36 ± 0.51 | 36.63 ± 0.61 | | | +| [DeepSpeed](/bench_deepspeed/) | | 84.60 ± 0.25 | | | +| [ctransformers](/bench_ctransformers/) | - | - | 85.50 ± 1.00 | 86.66 ± 1.06 | +| [llama.cpp](/bench_llamacpp/) | - | - | 89.90 ± 2.26 | 97.35 ± 4.71 | +| [ctranslate](/bench_ctranslate/) | 46.26 ± 1.59 | 79.41 ± 0.37 | 48.20 ± 0.14 | - | +| [PyTorch Lightning](/bench_lightning/) | 38.01 ± 0.09 
| 48.09 ± 1.12 | 10.68 ± 0.43 | - | +| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 104.07 ± 1.61 | 191.00 ± 4.60 | 316.77 ± 2.14 | 358.49 ± 2.38 | +| [vllm](/bench_vllm/) | 89.40 ± 0.22 | 89.43 ± 0.19 | - | 115.52 ± 0.49 | +| [exllamav2](/bench_exllamav2/) | - | - | 125.58 ± 1.23 | 159.68 ± 1.85 | +| [onnx](/bench_onnxruntime/) | 14.28 ± 0.12 | 19.42 ± 0.08 | - | - | +| [Optimum Nvidia](/bench_optimum_nvidia/) | 53.64 ± 0.78 | 53.82 ± 0.11 | - | - | + + +**Performance Metrics:** GPU Memory Consumption (unit: MB) + +| Engine | float32 | float16 | int8 | int4 | +| ------------------------------------------ | -------- | -------- | -------- | -------- | +| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 14931.72 | 8596.23 | 5643.44 | +| [AutoAWQ](/bench_autoawq/) | - | - | - | 7149.19 | +| [AutoGPTQ](/bench_autogptq/) | 10718.54 | 5706.35 | | | +| [DeepSpeed](/bench_deepspeed/) | | 83978.35 | | | +| [ctransformers](/bench_ctransformers/) | - | - | 9774.83 | 6889.14 | +| [llama.cpp](/bench_llamacpp/) | - | - | 8797.55 | 5783.95 | +| [ctranslate](/bench_ctranslate/) | 29951.52 | 16282.29 | 9470.74 | - | +| [PyTorch Lightning](/bench_lightning/) | 42748.35 | 14736.69 | 8028.16 | - | +| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 79421.24 | 78295.07 | 77642.86 | 77256.98 | +| [vllm](/bench_vllm/) | 77928.07 | 77928.07 | - | 77768.69 | +| [exllamav2](/bench_exllamav2/) | - | - | 16582.18 | 7201.62 | +| [onnx](/bench_onnxruntime/) | 33072.09 | 19180.55 | - | - | +| [Optimum Nvidia](/bench_optimum_nvidia/) | 79429.63 | 79295.41 | - | - | + +*(Data updated: `30th April 2024`) + +> Our latest version benchmarks Llama 2 7B Chat and Mistral 7B v0.1 Instruct. Since our primary focus is enterprise workloads, the latest version benchmarks only on an A100 80GB GPU. Previous versions also benchmarked Llama 2 7B on CUDA and on Mac (M1/M2) CPU and Metal; you can find those results in the [archive.md](/docs/archive.md) file. Please note that the engines are continuously maintained and improved, so those older numbers may be somewhat outdated. + +## 🛳 ML Engines + +There are several ML engines on the market today. Here is a quick glance at all the engines used for this benchmark, along with a summary of their support matrix. You can find the details about the nuances [here](/docs/ml_engines.md). 
+ +| Engine | Float32 | Float16 | Int8 | Int4 | CUDA | ROCM | Mac M1/M2 | Training | +| ------------------------------------------ | :-----: | :-----: | :---: | :---: | :---: | :---: | :-------: | :------: | +| [candle](/bench_candle/) | ⚠️ | ✅ | ⚠️ | ⚠️ | ✅ | ❌ | 🚧 | ❌ | +| [llama.cpp](/bench_llamacpp/) | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | 🚧 | ❌ | +| [ctranslate](/bench_ctranslate/) | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | 🚧 | ❌ | +| [onnx](/bench_onnxruntime/) | ✅ | ✅ | ❌ | ❌ | ✅ | ⚠️ | ❌ | ❌ | +| [transformers (pytorch)](/bench_pytorch/) | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | +| [vllm](/bench_vllm/) | ✅ | ✅ | ❌ | ✅ | ✅ | 🚧 | ❌ | ❌ | +| [exllamav2](/bench_exllamav2/) | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | ❌ | ❌ | +| [ctransformers](/bench_ctransformers/) | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | 🚧 | ❌ | +| [AutoGPTQ](/bench_autogptq/) | ✅ | ✅ | ⚠️ | ⚠️ | ✅ | ❌ | ❌ | ❌ | +| [AutoAWQ](/bench_autoawq/) | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | +| [DeepSpeed-MII](/bench_deepspeed/) | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ⚠️ | +| [PyTorch Lightning](/bench_lightning/) | ✅ | ✅ | ✅ | ✅ | ✅ | ⚠️ | ⚠️ | ✅ | +| [Optimum Nvidia](/bench_optimum_nvidia/) | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | +| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | + + +### Legend: +- ✅ Supported +- ❌ Not Supported +- ⚠️ There is a catch related to this +- 🚧 It is supported but not implemented in this current version + +You can check out the nuances related to ⚠️ and 🚧 in detail [here](/docs/ml_engines.md). + +## 🤔 Why Benchmarks + +This is a common question: what benefits can you expect from this repository? Here are some quick pointers. + +1. It is easy to get lost among the many choices of engine and precision for an LLM inference workflow, especially when you have compute constraints or other requirements. This repository gives you a quick idea of what to use for your particular needs. + +2. There is often a quality-versus-speed tradeoff between engines and precisions. This repository keeps track of those tradeoffs so you can weigh them against your own priorities. + +3. It is a fully reproducible set of hackable scripts. The latest benchmarks follow a number of best practices so that they run robustly on GPU devices, and you can reference and extend the implementations to build your own workflows. + +## 🚀 Usage and workflow + +Welcome to our benchmarking repository! This organized structure is designed to simplify benchmark management and execution. Each benchmark runs an inference engine that provides some optimization, either through quantization alone or through device-specific optimizations such as custom CUDA kernels. + +To get started, you need to download the models first. The command below downloads [Llama 2 7B Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and [Mistral 7B v0.1 Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1): -- `prompt`: Benchmark-specific prompt. -- `max_tokens`: Maximum tokens for the benchmark. -- `repetitions`: Number of benchmark repetitions. -- `log_file`: File for storing benchmark logs. -- `device`: Specify the device for benchmark execution (CPU, CUDA, Metal). -- `models_dir`: Directory containing necessary model files. 
+```bash +./download.sh +``` -### Streamlined Execution +Please note that to run the [Llama 2 7B Chat weights](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), we assume that you have already agreed to the required [terms and conditions](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and have been verified to download the weights. -The overarching [`benchmark.sh`](./benchmark.sh) script further simplifies the benchmark execution process: +### A Benchmark workflow -- **File Download:** It automatically downloads essential files required for benchmarking. -- **Folder Iteration:** The script iterates through all benchmark folders in the repository, streamlining the process for multiple benchmarks. +When you run a benchmark, the following set of events occurs: -This approach empowers users to effortlessly execute benchmarks based on their preferences. To run a specific benchmark, navigate to the corresponding benchmark folder (e.g., `bench_{bench_name}`) and execute the `bench.sh` script with the required parameters. +- Automatically setting up the environments and installing the required dependencies. +- Converting the models to some specific format (if required) and saving them. +- Running the benchmarks and storing the results inside the logs folder. Each log folder has the following structure: -## 📄 Usage + - `performance.log`: Tracks the performance of the model runs. You can see the `tokens/sec` and `memory consumption (MB)` here. + - `quality.md`: An automatically generated readme file containing qualitative comparisons across the different precisions of an engine. We take 5 prompts, run them for each supported precision of that engine, and put the results side by side. Our ground truth is the output of the Hugging Face PyTorch model with raw float32 weights. + - `quality.json`: The same comparison as the readme file, but in raw JSON format. -To utilize the benchmarking capabilities of this repository, follow these usage examples: +Inside each benchmark folder, you will also find a readme.md file which contains all the information and the qualitative comparison for that engine. For example: [bench_tensorrtllm](/bench_tensorrtllm/README.md). -### Run a Specific Benchmark +### Running a Benchmark -Navigate to the benchmark folder and execute the `bench.sh` script with the desired parameters: +Here is how to run a benchmark for an inference engine: ```bash -./bench_{bench_name}/bench.sh --prompt --max_tokens --repetitions --log_file --device --models_dir +./bench_{bench_name}/bench.sh \ + --prompt \ # Enter a prompt string + --max_tokens \ # Maximum number of tokens to output + --repetitions \ # Number of repetitions to be made for the prompt. + --device \ # The device on which we want to benchmark. + --model_name # The name of the model. (options: 'llama' for Llama2 and 'mistral' for Mistral-7B-v0.1) ``` -Replace `` with the specific values for your benchmark, and `` and `` with the appropriate file and directory paths. - -### Run All Benchmarks Collectively - -For a comprehensive execution of all benchmarks, use the overarching `benchmark.sh` script: +Here is an example. Let's say we want to benchmark Nvidia TensorRT-LLM. The command would look like this: ```bash -./bench.sh --prompt --max_tokens --repetitions --log_file --device --models_dir +./bench_tensorrtllm/bench.sh -d cuda -n llama -r 10 ``` -Again, customize the parameters according to your preferences, ensuring that and point to the correct locations. 
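If you want to run the same configuration across several engines, a small wrapper loop around the per-engine `bench.sh` scripts is enough. The sketch below is only an illustration and not part of the repository: it assumes the listed `bench_*` folders are already set up on your machine, and it simply reuses the command line flags documented below.

```bash
#!/bin/bash
# Illustrative sketch (not part of the repo): run one benchmark configuration
# across a few engines by calling each engine's bench.sh with the same flags.
set -euo pipefail

# Pick any subset of the bench_* folders you have set up locally.
engines=("llamacpp" "vllm" "exllamav2")

for engine in "${engines[@]}"; do
  ./bench_"${engine}"/bench.sh \
    --device cuda \
    --model_name llama \
    --repetitions 10 \
    --max_tokens 512 \
    --prompt 'Write an essay about the transformer model architecture'
done
```

Each run then writes its numbers into the corresponding logs folder described above, so the results stay comparable across engines.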
+To know more, here is more detailed info on each command line argument. -Feel free to adjust the parameters as needed for your specific benchmarking requirements. Please note that, running all the benchmarks collectively can requires lot of storage (around 500 GB). Please make sure that you have enough storage to run all of them at once. +``` + -p, --prompt Prompt for benchmarks (default: 'Write an essay about the transformer model architecture') + -r, --repetitions Number of repetitions for benchmarks (default: 10) + -m, --max_tokens Maximum number of tokens for benchmarks (default: 512) + -d, --device Device for benchmarks (possible values: 'metal', 'cuda', and 'CPU', default: 'cuda') + -n, --model_name The name of the model to benchmark (possible values: 'llama' for using Llama2, 'mistral' for using Mistral 7B v0.1) + -lf, --log_file Logging file name. + -h, --help Show this help message +``` ## 🤝 Contribute @@ -135,9 +238,9 @@ Inside the new benchmark folder, include the following structure ``` bench_{new_bench_name} -├── bench.sh # Benchmark script for setup and execution -├── requirements.txt # Dependencies required for the benchmark -└── ... # Any additional files needed for the benchmark +├── bench.sh # Benchmark script for setup and execution +├── requirements.txt # Dependencies required for the benchmark +└── ... # Any additional files needed for the benchmark ``` **3. Benchmark Script (`bench.sh`):** @@ -161,24 +264,3 @@ pre-commit install ``` The existing pre-commit configuration will be used for automatic checks before each commit, ensuring code quality and adherence to defined standards. - - -## 🗾 Roadmap - -In our upcoming versions, we will be adding support for the following: - -1. Add more metrics on memory consumption. This includes how much RAM/GPU memory is consumed when we run the benchmarks. -2. Add support for more models. Upcoming versions will support popular LLMs like [Mamba](https://huggingface.co/state-spaces/mamba-2.8b), [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), [Phi2](https://huggingface.co/microsoft/phi-2) etc. -3. Add ways to understand and articulate on change of generation quality with the change of frameworks and precision. We will try to add ways to understand how the generation quality of an LLM changes when we change the precision of the models or use a different inference engine framework. -4. Add support for batching. Since batching is very important while deploying LLMs. So coming versions will benchmark LLMs on batched inputs. - -If you feel like there is something more to add, feel free to open an issue or a PR. We would be super happy to take contributions from the community. - - -## 🏆 Introducing Prem Grant Program - -![Alt Text](https://blog.premai.io/content/images/size/w1200/2024/01/IMG.jpg) - -🌟 Exciting news, AI enthusiasts! Prem is thrilled to launch the Prem Grant Program, exclusively designed for forward-thinking AI startups ready to reshape the future. With this program, you get six months of free access to OpenAI, Anthropic, Cohere, Llama2, Mistral (or any other open-source model) APIs, opening doors to endless AI possibilities at zero cost. Enjoy free fine-tuning, seamless model deployment, and expert ML support. This is more than a grant; it's an invite to lead the AI revolution. Don't miss out – apply now and let's build the future of AI together with Prem! 
🌟 - -Read more about the Prem Startup grant program [here](https://blog.premai.io/announcing-our-startup-grants-program/). You can directly apply to the program from [here](https://docs.google.com/forms/d/e/1FAIpQLSdv1WuZ5aC7raefnupMTla5z_-7p1XD9D28HK0nZ7JkKkQwRQ/viewform). diff --git a/README.md.template b/README.md.template deleted file mode 100644 index 44e46fc6..00000000 --- a/README.md.template +++ /dev/null @@ -1,164 +0,0 @@ -
- -

🕹️ Benchmarks

-

A fully reproducible Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models.

-
- -[![GitHub contributors](https://img.shields.io/github/contributors/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks/graphs/contributors) -[![GitHub commit activity](https://img.shields.io/github/commit-activity/m/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks/commits/master) -[![GitHub last commit](https://img.shields.io/github/last-commit/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks/commits/master) -[![GitHub top language](https://img.shields.io/github/languages/top/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks) -[![GitHub issues](https://img.shields.io/github/issues/premAI-io/benchmarks.svg)](https://github.com/premAI-io/benchmarks/issues) -[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE) - - - -
- Table of Contents -
    -
  1. Quick glance towards performance metrics for Llama-2-7B
  2. -
  3. Getting started
  4. -
  5. Usage
  6. -
  7. Contribute
  8. -
  9. Roadmap
  10. -
  11. Introducing Prem Grant Program
  12. -
-
- -
- -## 📊 Quick glance towards performance metrics for Llama-2-7B - -Take a first glance of Llama-2-7B Model Performance Metrics Across Different Precision and Inference Engines. Metric used: `tokens/sec` - - - --- The above benchmarking is done on A100-80GB GPU. You can find more details for other devices like CPU/Metal under [docs](docs/llama2.md) folder. - -- Also if you want to see more detailed information about each of the benchmark, you can find those details the respective benchmark folders. - -- If you want to compare side by side which inference engines supports which precision and device, you can check out the [ml_engines.md](/docs/ml_engines.md) file. Please note that this file is incomplete and a better comparision of engines will be added in the later versions. - -Benchmarks can also be considered as a repository of hackable scripts, that contains the code and all the knowledge base to run the popular inference engines. - -## 🚀 Getting Started - -Welcome to our benchmarking repository! This organized structure is designed to simplify benchmark management and execution. Here's a quick guide to get you started: - -- **Benchmark Organization:** Each benchmark is uniquely identified as `bench_name` and resides in its dedicated folder, named `bench_{bench_name}`. - -- **Benchmark Script (`bench.sh`):** Within these benchmark folders, you'll find a common script named `bench.sh`. This script takes care of everything from setup and environment configuration to actual execution. - -### Benchmark Script Parameters - -The `bench.sh` script supports the following key parameters, allowing for customization and flexibility: - -- `prompt`: Benchmark-specific prompt. -- `max_tokens`: Maximum tokens for the benchmark. -- `repetitions`: Number of benchmark repetitions. -- `log_file`: File for storing benchmark logs. -- `device`: Specify the device for benchmark execution (CPU, CUDA, Metal). -- `models_dir`: Directory containing necessary model files. - -### Streamlined Execution - -The overarching [`benchmark.sh`](./benchmark.sh) script further simplifies the benchmark execution process: - -- **File Download:** It automatically downloads essential files required for benchmarking. -- **Folder Iteration:** The script iterates through all benchmark folders in the repository, streamlining the process for multiple benchmarks. - -This approach empowers users to effortlessly execute benchmarks based on their preferences. To run a specific benchmark, navigate to the corresponding benchmark folder (e.g., `bench_{bench_name}`) and execute the `bench.sh` script with the required parameters. - -## 📄 Usage - -To utilize the benchmarking capabilities of this repository, follow these usage examples: - -### Run a Specific Benchmark - -Navigate to the benchmark folder and execute the `bench.sh` script with the desired parameters: - -```bash -./bench_{bench_name}/bench.sh --prompt --max_tokens --repetitions --log_file --device --models_dir -``` - -Replace `` with the specific values for your benchmark, and `` and `` with the appropriate file and directory paths. - -### Run All Benchmarks Collectively - -For a comprehensive execution of all benchmarks, use the overarching `benchmark.sh` script: - -```bash -./bench.sh --prompt --max_tokens --repetitions --log_file --device --models_dir -``` - -Again, customize the parameters according to your preferences, ensuring that and point to the correct locations. - -Feel free to adjust the parameters as needed for your specific benchmarking requirements. 
Please note that, running all the benchmarks collectively can requires lot of storage (around 500 GB). Please make sure that you have enough storage to run all of them at once. - -## 🤝 Contribute - -We welcome contributions to enhance and expand our benchmarking repository. If you'd like to contribute a new benchmark, follow these steps: - -### Creating a New Benchmark - -**1. Create a New Folder** - -Start by creating a new folder for your benchmark. Name it `bench_{new_bench_name}` for consistency. - -```bash -mkdir bench_{new_bench_name} -``` - -**2. Folder Structure** - -Inside the new benchmark folder, include the following structure - -``` -bench_{new_bench_name} -├── bench.sh # Benchmark script for setup and execution -├── requirements.txt # Dependencies required for the benchmark -└── ... # Any additional files needed for the benchmark -``` - -**3. Benchmark Script (`bench.sh`):** - -The `bench.sh` script should handle setup, environment configuration, and the actual execution of the benchmark. Ensure it supports the parameters mentioned in the [Benchmark Script Parameters](#benchmark-script-parameters) section. - -### Pre-commit Hooks - -We use pre-commit hooks to maintain code quality and consistency. - -**1. Install Pre-commit:** Ensure you have `pre-commit` installed - -```bash -pip install pre-commit -``` - -**2. Install Hooks:** Run the following command to install the pre-commit hooks - -```bash -pre-commit install -``` - -The existing pre-commit configuration will be used for automatic checks before each commit, ensuring code quality and adherence to defined standards. - - -## 🗾 Roadmap - -In our upcoming versions, we will be adding support for the following: - -1. Add more metrics on memory consumption. This includes how much RAM/GPU memory is consumed when we run the benchmarks. -2. Add support for more models. Upcoming versions will support popular LLMs like [Mamba](https://huggingface.co/state-spaces/mamba-2.8b), [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), [Phi2](https://huggingface.co/microsoft/phi-2) etc. -3. Add ways to understand and articulate on change of generation quality with the change of frameworks and precision. We will try to add ways to understand how the generation quality of an LLM changes when we change the precision of the models or use a different inference engine framework. -4. Add support for batching. Since batching is very important while deploying LLMs. So coming versions will benchmark LLMs on batched inputs. - -If you feel like there is something more to add, feel free to open an issue or a PR. We would be super happy to take contributions from the community. - - -## 🏆 Introducing Prem Grant Program - -![Alt Text](https://blog.premai.io/content/images/size/w1200/2024/01/IMG.jpg) - -🌟 Exciting news, AI enthusiasts! Prem is thrilled to launch the Prem Grant Program, exclusively designed for forward-thinking AI startups ready to reshape the future. With this program, you get six months of free access to OpenAI, Anthropic, Cohere, Llama2, Mistral (or any other open-source model) APIs, opening doors to endless AI possibilities at zero cost. Enjoy free fine-tuning, seamless model deployment, and expert ML support. This is more than a grant; it's an invite to lead the AI revolution. Don't miss out – apply now and let's build the future of AI together with Prem! 
🌟 - -Read more about the Prem Startup grant program [here](https://blog.premai.io/announcing-our-startup-grants-program/). You can directly apply to the program from [here](https://docs.google.com/forms/d/e/1FAIpQLSdv1WuZ5aC7raefnupMTla5z_-7p1XD9D28HK0nZ7JkKkQwRQ/viewform). diff --git a/docs/archive.md b/docs/archive.md new file mode 100644 index 00000000..506cf03a --- /dev/null +++ b/docs/archive.md @@ -0,0 +1,63 @@ +# ⚙️ Benchmarking ML Engines + +This file contains benchmark numbers for different engines and precisions. Because the models and the engines have received many upgrades since these runs, the results are now archived. The latest implementation, however, does not include benchmarks for Metal or Mac CPU, so if you want those numbers, you can find them here. + +## A100 80GB Inference Bench: + +**Environment:** +- Model: LLAMA-2-7B +- CUDA Version: 11.7 +- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --prompt 'Write an essay about the transformer model architecture'` + +**Performance Metrics:** (unit: Tokens / second) + +| Engine | float32 | float16 | int8 | int4 | +| ------------------------------------------ | ------------- | ------------- | ------------- | -------------- | +| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - | +| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 | +| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - | +| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - | +| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 | +| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20 | +| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 | +| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 | +| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - | +| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 | +| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | | +| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 | +| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52 | 109.09 ± 4.26 | - | - | +| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 | + +*(Data updated: `05th April 2024`) + + +## M2 MAX 32GB Inference Bench: + +### CPU + +**Environment:** +- Model: LLAMA-2-7B +- CUDA Version: NA +- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cpu --prompt 'Write an essay about the transformer model architecture'` + +**Performance Metrics:** (unit: Tokens / second) +| Engine | float32 | float16 | int8 | int4 | +| -------------------------------------- | ------- | ----------- | ------------ | ------------ | +| [candle](/bench_candle/) | - | 3.43 ± 0.02 | - | - | +| [llama.cpp](/bench_llamacpp/) | - | - | 13.24 ± 0.62 | 21.43 ± 0.47 | +| [ctranslate](/bench_ctranslate/) | - | - | 1.87 ± 0.14 | - | +| [ctransformers](/bench_ctransformers/) | - | - | 13.50 ± 0.48 | 20.57 ± 2.50 | + + +### GPU (Metal) + +**Command:** `./benchmark.sh --repetitions 10 --max_tokens 512 --device metal --prompt 'Write an essay about the transformer model architecture'` + +**Performance Metrics:** (unit: Tokens / second) +| Engine | float32 | float16 | int8 | int4 | +| -------------------------------------- | ------- | ------- | ------------ | ------------ | +| [llama.cpp](/bench_llamacpp/) | - | - | 30.11 ± 0.45 | 44.27 ± 0.12 | +| 
[ctransformers](/bench_ctransformers/) | - | - | 20.75 ± 0.36 | 34.04 ± 2.11 | + +*(Data updated: `05th April 2024`) diff --git a/docs/llama2.md b/docs/llama2.md deleted file mode 100644 index d1fd4bbd..00000000 --- a/docs/llama2.md +++ /dev/null @@ -1,60 +0,0 @@ -# ⚙️ Benchmarking ML Engines - -## A100 80GB Inference Bench: - -**Environment:** -- Model: LLAMA-2-7B -- CUDA Version: 11.7 -- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --prompt 'Write an essay about the transformer model architecture'` - -**Performance Metrics:** (unit: Tokens / second) - -| Engine | float32 | float16 | int8 | int4 | -|---------------------------------------------|--------------|----------------|---------------|---------------| -| [candle](/bench_candle/) | - | 36.78 ± 2.17 | - | - | -| [llama.cpp](/bench_llamacpp/) | - | - | 79.15 ± 1.20 | 100.90 ± 1.46 | -| [ctranslate](/bench_ctranslate/) | 35.23 ± 4.01 | 55.72 ± 16.66 | 35.73 ± 10.87 | - | -| [onnx](/bench_onnxruntime/) | - | 54.16 ± 3.15 | - | - | -| [transformers (pytorch)](/bench_pytorch/) | 43.79 ± 0.61 | 46.39 ± 0.28 | 6.98 ± 0.05 | 21.72 ± 0.11 | -| [vllm](/bench_vllm/) | 90.78 ± 1.60 | 90.54 ± 2.22 | - | 114.69 ± 11.20| -| [exllamav2](/bench_exllamav2/) | - | - | 121.63 ± 0.74 | 130.16 ± 0.35 | -| [ctransformers](/bench_ctransformers/) | - | - | 76.75 ± 10.36 | 84.26 ± 5.79 | -| [AutoGPTQ](/bench_autogptq/) | 42.01 ± 1.03 | 30.24 ± 0.41 | - | - | -| [AutoAWQ](/bench_autoawq/) | - | - | - | 109.20 ± 3.28 | -| [DeepSpeed](/bench_deepspeed/) | - | 81.44 ± 8.13 | - | | -| [PyTorch Lightning](/bench_lightning/) | 24.85 ± 0.07 | 44.56 ± 2.89 | 10.50 ± 0.12 | 24.83 ± 0.05 | -| [Optimum Nvidia](/bench_optimum_nvidia/) | 110.36 ± 0.52| 109.09 ± 4.26 | - | - | -| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 55.19 ± 1.03 | 85.03 ± 0.62 | 167.66 ± 2.05 | 235.18 ± 3.20 | - -*(Data updated: `05th April 2024`) - - -## M2 MAX 32GB Inference Bench: - -### CPU - -**Environment:** -- Model: LLAMA-2-7B -- CUDA Version: NA -- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cpu --prompt 'Write an essay about the transformer model architecture'` - -**Performance Metrics:** (unit: Tokens / second) -| Engine | float32 | float16 | int8 | int4 | -|----------------------------------------|--------------|--------------|--------------|--------------| -| [candle](/bench_candle/) | - | 3.43 ± 0.02 | - | - | -| [llama.cpp](/bench_llamacpp/) | - | - | 13.24 ± 0.62 | 21.43 ± 0.47 | -| [ctranslate](/bench_ctranslate/) | - | - | 1.87 ± 0.14 | - | -| [ctransformers](/bench_ctransformers/) | - | - | 13.50 ± 0.48 | 20.57 ± 2.50 | - - -### GPU (Metal) - -**Command:** `./benchmark.sh --repetitions 10 --max_tokens 512 --device metal --prompt 'Write an essay about the transformer model architecture'` - -**Performance Metrics:** (unit: Tokens / second) -| Engine | float32 | float16 | int8 | int4 | -|-----------------------------------------|--------------|---------------|--------------|--------------| -| [llama.cpp](/bench_llamacpp/) | - | - | 30.11 ± 0.45 | 44.27 ± 0.12 | -| [ctransformers](/bench_ctransformers/) | - | - | 20.75 ± 0.36 | 34.04 ± 2.11 | - -*(Data updated: `05th April 2024`) diff --git a/docs/llama2.md.template b/docs/llama2.md.template deleted file mode 100644 index 621d66fd..00000000 --- a/docs/llama2.md.template +++ /dev/null @@ -1,78 +0,0 @@ -# ⚙️ Benchmarking ML Engines - -## A100 80GB Inference Bench: - -**Environment:** -- Model: Llama 2 7B Chat -- CUDA Version: 12.1 -- Command: `./benchmark.sh --repetitions 10 
--max_tokens 512 --device cuda --model llama --prompt 'Write an essay about the transformer model architecture'` - -**Performance Metrics:** (unit: Tokens / second) - -| Engine | float32 | float16 | int8 | int4 | -|---------------------------------------------|--------------|----------------|---------------|---------------| -| [transformers (pytorch)](/bench_pytorch/) | 36.65 ± 0.61 | 34.20 ± 0.51 | 6.91 ± 0.14 | 17.83 ± 0.40 | -| [AutoAWQ](/bench_autoawq/) | - | - | - | 63.59 ± 1.86 | -| [AutoGPTQ](/bench_autogptq/) | 34.36 ± 0.51 | 36.63 ± 0.61 | | | -| [DeepSpeed](/bench_deepspeed/) | | 84.60 ± 0.25 | | | -| [ctransformers](/bench_ctransformers/) | - | - | 85.50 ± 1.00 | 86.66 ± 1.06 | -| [llama.cpp](/bench_llamacpp/) | - | - | 89.90 ± 2.26 | 97.35 ± 4.71 | -| [ctranslate](/bench_ctranslate/) | 46.26 ± 1.59 | 79.41 ± 0.37 | 48.20 ± 0.14 | - | -| [PyTorch Lightning](/bench_lightning/) | 38.01 ± 0.09 | 48.09 ± 1.12 | 10.68 ± 0.43 | - | -| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 104.07 ± 1.61| 191.00 ± 4.60 | 316.77 ± 2.14 | 358.49 ± 2.38 | -| [vllm](/bench_vllm/) | 89.40 ± 0.22 | 89.43 ± 0.19 | - | 115.52 ± 0.49 | -| [exllamav2](/bench_exllamav2/) | - | - | 125.58 ± 1.23 | 159.68 ± 1.85 | -| [onnx](/bench_onnxruntime/) | 14.28 ± 0.12 | 19.42 ± 0.08 | - | - | -| [Optimum Nvidia](/bench_optimum_nvidia/) | 53.64 ± 0.78 | 53.82 ± 0.11 | - | - | - - -**Performance Metrics:** GPU Memory Consumption (unit: MB) - -| Engine | float32 | float16 | int8 | int4 | -|---------------------------------------------|----------|----------|----------|----------| -| [transformers (pytorch)](/bench_pytorch/) | 29114.76 | 14931.72 | 8596.23 | 5643.44 | -| [AutoAWQ](/bench_autoawq/) | - | - | - | 7149.19 | -| [AutoGPTQ](/bench_autogptq/) | 10718.54 | 5706.35 | | | -| [DeepSpeed](/bench_deepspeed/) | | 83978.35 | | | -| [ctransformers](/bench_ctransformers/) | - | - | 9774.83 | 6889.14 | -| [llama.cpp](/bench_llamacpp/) | - | - | 8797.55 | 5783.95 | -| [ctranslate](/bench_ctranslate/) | 29951.52 | 16282.29 | 9470.74 | - | -| [PyTorch Lightning](/bench_lightning/) | 42748.35 | 14736.69 | 8028.16 | - | -| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 79421.24 | 78295.07 | 77642.86 | 77256.98 | -| [vllm](/bench_vllm/) | 77928.07 | 77928.07 | - | 77768.69 | -| [exllamav2](/bench_exllamav2/) | - | - | 16582.18 | 7201.62 | -| [onnx](/bench_onnxruntime/) | 33072.09 | 19180.55 | - | - | -| [Optimum Nvidia](/bench_optimum_nvidia/) | 79429.63 | 79295.41 | - | - | - -*(Data updated: ``) - - -## M2 MAX 32GB Inference Bench: - -### CPU - -**Environment:** -- Model: LLAMA-2-7B -- CUDA Version: NA -- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cpu --prompt 'Write an essay about the transformer model architecture'` - -**Performance Metrics:** (unit: Tokens / second) -| Engine | float32 | float16 | int8 | int4 | -|----------------------------------------|--------------|--------------|--------------|--------------| -| [candle](/bench_candle/) | - | 3.43 ± 0.02 | - | - | -| [llama.cpp](/bench_llamacpp/) | - | - | 13.24 ± 0.62 | 21.43 ± 0.47 | -| [ctranslate](/bench_ctranslate/) | - | - | 1.87 ± 0.14 | - | -| [ctransformers](/bench_ctransformers/) | - | - | 13.50 ± 0.48 | 20.57 ± 2.50 | - - -### GPU (Metal) - -**Command:** `./benchmark.sh --repetitions 10 --max_tokens 512 --device metal --prompt 'Write an essay about the transformer model architecture'` - -**Performance Metrics:** (unit: Tokens / second) -| Engine | float32 | float16 | int8 | int4 | 
-|-----------------------------------------|--------------|---------------|--------------|--------------| -| [llama.cpp](/bench_llamacpp/) | - | - | 30.11 ± 0.45 | 44.27 ± 0.12 | -| [ctransformers](/bench_ctransformers/) | - | - | 20.75 ± 0.36 | 34.04 ± 2.11 | - -*(Data updated: ``) diff --git a/docs/mistral.md b/docs/mistral.md deleted file mode 100644 index e69de29b..00000000 diff --git a/docs/mistral.md.template b/docs/mistral.md.template deleted file mode 100644 index 3df227cf..00000000 --- a/docs/mistral.md.template +++ /dev/null @@ -1,47 +0,0 @@ -# ⚙️ Benchmarking ML Engines - -## A100 80GB Inference Bench: - -**Environment:** -- Model: Mistral 7B v0.1 Instruct -- CUDA Version: 12.1 -- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model mistral --prompt 'Write an essay about the transformer model architecture'` - -**Performance Metrics:** (unit: Tokens / second) - -| Engine | float32 | float16 | int8 | int4 | -|---------------------------------------------|--------------|----------------|---------------|---------------| -| [transformers (pytorch)](/bench_pytorch/) | 39.61 ± 0.65 | 37.05 ± 0.49 | 5.08 ± 0.01 | 19.58 ± 0.38 | -| [AutoAWQ](/bench_autoawq/) | - | - | - | 63.12 ± 2.19 | -| [AutoGPTQ](/bench_autogptq/) | 39.11 ± 0.42 | 42.94 ± 0.80 | | | -| [DeepSpeed](/bench_deepspeed/) | | 79.88 ± 0.32 | | | -| [ctransformers](/bench_ctransformers/) | - | - | 86.14 ± 1.40 | 87.22 ± 1.54 | -| [llama.cpp](/bench_llamacpp/) | - | - | 88.27 ± 0.72 | 95.33 ± 5.54 | -| [ctranslate](/bench_ctranslate/) | 43.17 ± 2.97 | 68.03 ± 0.27 | 45.14 ± 0.24 | - | -| [PyTorch Lightning](/bench_lightning/) | 32.79 ± 2.74 | 43.01 ± 2.90 | 7.75 ± 0.12 | - | -| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 117.04 ± 2.16| 206.59 ± 6.93 | 390.49 ± 4.86 | 427.40 ± 4.84 | -| [vllm](/bench_vllm/) | 84.91 ± 0.27 | 84.89 ± 0.28 | - | 106.03 ± 0.53 | -| [exllamav2](/bench_exllamav2/) | - | - | 114.81 ± 1.47 | 126.29 ± 3.05 | -| [onnx](/bench_onnxruntime/) | 15.75 ± 0.15 | 22.39 ± 0.14 | - | - | -| [Optimum Nvidia](/bench_optimum_nvidia/) | 50.77 ± 0.85 | 50.91 ± 0.19 | - | - | - -**Performance Metrics:** GPU Memory Consumption (unit: MB) - -| Engine | float32 | float16 | int8 | int4 | -|---------------------------------------------|----------|----------|----------|----------| -| [transformers (pytorch)](/bench_pytorch/) | 31071.4 | 15976.1 | 10963.91 | 5681.18 | -| [AutoGPTQ](/bench_autogptq/) | 13400.80 | 6633.29 | | | -| [AutoAWQ](/bench_autoawq/) | - | - | - | 6572.47 | -| [DeepSpeed](/bench_deepspeed/) | | 80104.34 | | | -| [ctransformers](/bench_ctransformers/) | - | - | 10255.07 | 6966.74 | -| [llama.cpp](/bench_llamacpp/) | - | - | 9141.49 | 5880.41 | -| [ctranslate](/bench_ctranslate/) | 32602.32 | 17523.8 | 10074.72 | - | -| [PyTorch Lightning](/bench_lightning/) | 48783.95 | 18738.05 | 10680.32 | - | -| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | 79536.59 | 78341.21 | 77689.0 | 77311.51 | -| [vllm](/bench_vllm/) | 73568.09 | 73790.39| - | 74016.88 | -| [exllamav2](/bench_exllamav2/) | - | - | 21483.23 | 9460.25 | -| [onnx](/bench_onnxruntime/) | 33629.93 | 19537.07 | - | - | -| [Optimum Nvidia](/bench_optimum_nvidia/) | 79563.85 | 79496.74 | - | - | - - -*(Data updated: ``) diff --git a/docs/ml_engines.md b/docs/ml_engines.md index 70f93bf1..10e4a450 100644 --- a/docs/ml_engines.md +++ b/docs/ml_engines.md @@ -1,23 +1,44 @@ # 🔧 ML Engines -## Features - -| Features | pytorch | burn | llama.cpp | candle | tinygrad | onnxruntime | CTranslate2 | -| --------------------------- | 
------- | ---- | --------- | ------ | -------- | ----------- | ----------- | -| Inference support | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| 16-bit quantization support | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| 8-bit quantization support | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | -| 4-bit quantization support | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | -| 2/3bit quantization support | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | -| CUDA support | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -| ROCM support | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | -| Intel OneAPI/SYCL support | ✅** | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | -| Mac M1/M2 support | ✅ | ✅ | ✅ | ⭐ | ✅ | ✅ | ⭐ | -| BLAS support(CPU) | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | -| Model Parallel support | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | -| Tensor Parallel support | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | -| Onnx Format support | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | -| Training support | ✅ | 🌟 | ❌ | 🌟 | ❌ | ❌ | ❌ | - -⭐ = No Metal Support -🌟 = Partial Support for Training (Finetuning already works, but training from scratch may not work) +### Model Framework Support Matrix + +| Engine | Float32 | Float16 | Int8 | Int4 | CUDA | ROCM | Mac M1/M2 | Training | +| ------------------------------------------ | :-----: | :-----: | :---: | :---: | :---: | :---: | :-------: | :------: | +| [candle](/bench_candle/) | ⚠️ | ✅ | ⚠️ | ⚠️ | ✅ | ❌ | 🚧 | ❌ | +| [llama.cpp](/bench_llamacpp/) | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | 🚧 | ❌ | +| [ctranslate](/bench_ctranslate/) | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | 🚧 | ❌ | +| [onnx](/bench_onnxruntime/) | ✅ | ✅ | ❌ | ❌ | ✅ | ⚠️ | ❌ | ❌ | +| [transformers (pytorch)](/bench_pytorch/) | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | +| [vllm](/bench_vllm/) | ✅ | ✅ | ❌ | ✅ | ✅ | 🚧 | ❌ | ❌ | +| [exllamav2](/bench_exllamav2/) | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | ❌ | ❌ | +| [ctransformers](/bench_ctransformers/) | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | 🚧 | ❌ | +| [AutoGPTQ](/bench_autogptq/) | ✅ | ✅ | ⚠️ | ⚠️ | ✅ | ❌ | ❌ | ❌ | +| [AutoAWQ](/bench_autoawq/) | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | +| [DeepSpeed-MII](/bench_deepspeed/) | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ⚠️ | +| [PyTorch Lightning](/bench_lightning/) | ✅ | ✅ | ✅ | ✅ | ✅ | ⚠️ | ⚠️ | ✅ | +| [Optimum Nvidia](/bench_optimum_nvidia/) | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | +| [Nvidia TensorRT-LLM](/bench_tensorrtllm/) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | + + +### Legend: +- ✅ Supported +- ❌ Not Supported +- ⚠️ There is a catch related to this +- 🚧 It is supported but not implemented in this current version + + +### Some pointers to note: +The names are by the name of engines. Except when the name is `Generic` then it means that the nuance applies to all the engines. + + +| Name | Type | Description | +| ----------------- | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| candle | ⚠️ | Metal backend is supported but it gives terrible performance even in small models like Phi2. For AMD ROCM there is no support as per this [issue](https://github.com/huggingface/candle/issues/346). | +| candle | 🚧 | Latest performance for Candle is not implemented. If you want to see the numbers, please check out [archive.md](/docs/archive.md) which contains the benchmark numbers for [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b). | +| ctranslate2 | ⚠️ | ROCM is not supported; however, works are in progress to have this feature on CTranslate2. No support for Mac M1/M2. 
| onnxruntime | ⚠️ | ONNXRuntime generally supports ROCM, but for LLMs, ONNXRuntime with HuggingFace Optimum only supports the CUDAExecutionProvider right now. CPU is available but very slow. | +| pytorch lightning | ⚠️ | ROCM is supported but not tested for PyTorch Lightning. See this [issue](https://github.com/Lightning-AI/litgpt/issues/1220). | +| pytorch lightning | ⚠️ | Metal is supported in PyTorch Lightning, but for Llama 2 7B Chat or Mistral 7B it is very slow. | +| AutoGPTQ | ⚠️ | AutoGPTQ is a weight-only quantization algorithm; activations remain in either float32 or float16. We used a 4-bit weight-quantized model for our benchmark experiments. | +| Generic | 🚧 | For all the engines that support Metal, please check out [archive.md](/docs/archive.md), which contains the benchmark numbers for [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b). | +| Deepspeed | ⚠️ | [DeepSpeed](https://github.com/microsoft/DeepSpeed) supports training; however, for inference we have used [DeepSpeed MII](https://github.com/microsoft/DeepSpeed-MII). |