Add LLMs quantization model list and recipes (#1504)
Signed-off-by: chensuyue <[email protected]>
chensuyue authored Dec 29, 2023
1 parent 7634409 commit f19cc9d
Showing 7 changed files with 48 additions and 8 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -5,12 +5,12 @@ Intel® Neural Compressor
<h3> An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, ONNX Runtime, and MXNet)</h3>

[![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/neural-compressor)
[![version](https://img.shields.io/badge/release-2.4-green)](https://github.com/intel/neural-compressor/releases)
[![version](https://img.shields.io/badge/release-2.4.1-green)](https://github.com/intel/neural-compressor/releases)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/neural-compressor/blob/master/LICENSE)
[![coverage](https://img.shields.io/badge/coverage-85%25-green)](https://github.com/intel/neural-compressor)
[![Downloads](https://static.pepy.tech/personalized-badge/neural-compressor?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/neural-compressor)

[Architecture](./docs/source/design.md#architecture)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Workflow](./docs/source/design.md#workflow)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/validated_model_list.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/README.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentations](https://intel.github.io/neural-compressor)
[Architecture](./docs/source/design.md#architecture)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Workflow](./docs/source/design.md#workflow)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[LLMs Recipes](./docs/source/llm_recipes.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/validated_model_list.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentations](https://intel.github.io/neural-compressor)

---
<div align="left">
@@ -72,8 +72,9 @@ q_model = fit(
<tr>
<td colspan="2" align="center"><a href="./docs/source/design.md#architecture">Architecture</a></td>
<td colspan="2" align="center"><a href="./docs/source/design.md#workflow">Workflow</a></td>
<td colspan="1" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
<td colspan="1" align="center"><a href="./docs/source/llm_recipes.md">LLMs Recipes</a></td>
<td colspan="2" align="center"><a href="examples/README.md">Examples</a></td>
<td colspan="2" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
</tr>
</tbody>
<thead>
2 changes: 1 addition & 1 deletion conda_meta/basic/meta.yaml
@@ -1,4 +1,4 @@
{% set version = "2.4" %}
{% set version = "2.4.1" %}
{% set buildnumber = 0 %}
package:
name: neural-compressor
2 changes: 1 addition & 1 deletion conda_meta/neural_insights/meta.yaml
@@ -1,4 +1,4 @@
{% set version = "2.4" %}
{% set version = "2.4.1" %}
{% set buildnumber = 0 %}
package:
name: neural-insights
2 changes: 1 addition & 1 deletion conda_meta/neural_solution/meta.yaml
@@ -1,4 +1,4 @@
{% set version = "2.4" %}
{% set version = "2.4.1" %}
{% set buildnumber = 0 %}
package:
name: neural-solution
27 changes: 27 additions & 0 deletions docs/source/llm_recipes.md
@@ -0,0 +1,27 @@
LLM Quantization Models and Recipes
---

Intel® Neural Compressor supports advanced large language model (LLM) quantization technologies, including SmoothQuant (SQ) and Weight-Only Quantization (WOQ),
and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with [PyTorch](https://pytorch.org/),
[Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).
This document publishes the specific recipes we obtained for popular LLMs, to help users quickly get an optimized LLM with accuracy loss limited to 1%.

> Notes:
> - The quantization algorithms are provided by [Intel® Neural Compressor](https://github.com/intel/neural-compressor), and the evaluation functions are provided by [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).
> - The model list is continually updated; expect to find more LLMs in the future.
## IPEX key models
| Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
|:-------------------------:|---------|:--------:|:--------:|
| EleutherAI/gpt-j-6b ||||
| facebook/opt-1.3b ||||
| facebook/opt-30b ||||
| meta-llama/Llama-2-7b-hf ||||
| meta-llama/Llama-2-13b-hf ||||
| meta-llama/Llama-2-70b-hf ||||
| tiiuae/falcon-40b ||||

**Detailed recipes can be found [HERE](https://github.com/intel/intel-extension-for-transformers/examples/huggingface/pytorch/text-generation/quantization/llm_quantization_recipes.md).**
> Notes:
> - This model list comes from [IPEX](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/llm.html).
> - WOQ INT4 recipes will be published soon.
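
For readers unfamiliar with the SQ column above, the idea behind SmoothQuant can be sketched in a few lines of plain Python. This is an illustration only, not Intel® Neural Compressor's implementation: the helper names `smooth_scales` and `matmul`, the toy tensors, and the `eps` guard are all assumptions made for this sketch. It uses the commonly cited per-channel factor `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)` and shows that dividing activations by `s` while multiplying the matching weight rows by `s` leaves the layer output unchanged and shrinks the activation outlier.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-5):
    """Per-input-channel factors s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).

    eps is a hypothetical guard against zero ranges, chosen for this sketch.
    """
    return [
        max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
        for a, w in zip(act_absmax, weight_absmax)
    ]


def matmul(x, w):
    """Naive (tokens x in) @ (in x out) product on nested lists."""
    return [[sum(xi * wi for xi, wi in zip(row, col)) for col in zip(*w)] for row in x]


# Toy data: 2 tokens, 3 input channels; channel 1 carries an activation outlier.
x = [[0.5, 40.0, -1.0], [-0.25, -60.0, 0.75]]
w = [[0.2, -0.4], [0.1, 0.3], [-0.5, 0.6]]  # in_features x out_features

n_in = len(w)
act_absmax = [max(abs(row[j]) for row in x) for j in range(n_in)]
weight_absmax = [max(abs(v) for v in w[j]) for j in range(n_in)]

s = smooth_scales(act_absmax, weight_absmax, alpha=0.5)

# Divide each activation channel by s_j and scale the matching weight row by s_j.
x_smooth = [[v / s[j] for j, v in enumerate(row)] for row in x]
w_smooth = [[v * s[j] for v in w[j]] for j in range(n_in)]

reference = matmul(x, w)
smoothed = matmul(x_smooth, w_smooth)
```

With `alpha=1.0` the factor reduces to the activation range alone, so all of the quantization difficulty is moved into the weights; smaller values split it between activations and weights.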
@@ -79,6 +79,10 @@ function run_benchmark {
model_name_or_path="facebook/opt-125m"
approach="weight_only"
extra_cmd=$extra_cmd" --woq_algo GPTQ"
elif [ "${topology}" = "opt_125m_woq_gptq_debug_int4" ]; then
model_name_or_path="facebook/opt-125m"
approach="weight_only"
extra_cmd=$extra_cmd" --woq_algo GPTQ --woq_bits 4 --woq_scheme asym --woq_group_size 128 --gptq_use_max_length --gptq_debug"
elif [ "${topology}" = "opt_125m_woq_teq" ]; then
model_name_or_path="facebook/opt-125m"
approach="weight_only"
@@ -98,13 +102,21 @@ function run_benchmark {
elif [ "${topology}" = "gpt_j_ipex_sq" ]; then
model_name_or_path="EleutherAI/gpt-j-6b"
extra_cmd=$extra_cmd" --ipex --sq --alpha 1.0"
elif [ "${topology}" = "gpt_j_woq_rtn" ]; then
elif [ "${topology}" = "gpt_j_woq_rtn_int4" ]; then
model_name_or_path="EleutherAI/gpt-j-6b"
approach="weight_only"
extra_cmd=$extra_cmd" --woq_algo RTN --woq_bits 4 --woq_group_size 128 --woq_scheme asym --woq_enable_mse_search"
elif [ "${topology}" = "gpt_j_woq_gptq_debug_int4" ]; then
model_name_or_path="EleutherAI/gpt-j-6b"
approach="weight_only"
extra_cmd=$extra_cmd" --woq_algo GPTQ --woq_bits 4 --woq_group_size 128 --woq_scheme asym --gptq_use_max_length --gptq_debug"
elif [ "${topology}" = "falcon_7b_sq" ]; then
model_name_or_path="tiiuae/falcon-7b-instruct"
extra_cmd=$extra_cmd" --sq --alpha 0.5"
elif [ "${topology}" = "falcon_7b_woq_gptq_debug_int4" ]; then
model_name_or_path="tiiuae/falcon-7b-instruct"
approach="weight_only"
extra_cmd=$extra_cmd" --woq_algo GPTQ --woq_bits 4 --woq_group_size 128 --woq_scheme asym --gptq_use_max_length --gptq_debug"
fi

python -u run_clm_no_trainer.py \
2 changes: 1 addition & 1 deletion neural_compressor/version.py
@@ -15,4 +15,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Intel® Neural Compressor: An open-source Python library supporting popular model compression techniques."""
__version__ = "2.4"
__version__ = "2.4.1"
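
The new `*_int4` topologies in the benchmark script pass `--woq_bits 4 --woq_group_size 128 --woq_scheme asym`. As a rough illustration of what those three settings mean for the `RTN` (round-to-nearest) case, here is a standard-library sketch. It is not the code that `run_clm_no_trainer.py` or Intel® Neural Compressor runs: the function names are hypothetical, the min/max-based scale and zero point are an assumed (though common) formulation of asymmetric quantization, and options such as `--woq_enable_mse_search` are not modeled.

```python
import random


def quantize_group_asym(values, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights.

    Returns (integer codes, scale, zero_point). The zero point is left
    unclamped for simplicity, which a real kernel would likely not do.
    """
    qmax = (1 << bits) - 1  # 15 when bits == 4
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    if scale == 0.0:  # constant group: any non-zero scale works
        scale = 1.0
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size=128, bits=4):
    """Quantize one weight row in independent groups of `group_size` values."""
    return [
        quantize_group_asym(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


random.seed(0)
row = [random.uniform(-1.0, 1.0) for _ in range(256)]

groups = quantize_row(row, group_size=128, bits=4)  # two groups of 128 weights
restored = [v for g in groups for v in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

Each group stores its own scale and zero point, so a smaller group size tracks local weight ranges more closely at the cost of more metadata per weight.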
