# SparAMX

Official implementation of **SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs**.
This repo contains the code for SparAMX, a set of open-source customized sparse kernels that can speed up any PyTorch model by automatically replacing all of its linear layers with our customized layer. Furthermore, we demonstrate for the first time the use of unstructured sparsity in the attention computation, achieving a **1.14×** speedup over current systems without compromising accuracy.
*Demo: side-by-side token generation, stock PyTorch vs. SparAMX.*
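To give a concrete picture of the automatic layer replacement described above, here is a minimal sketch. `SparAMXLinear` is a hypothetical stand-in for the repo's actual custom layer, whose real forward pass dispatches to the AMX sparse kernels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparAMXLinear(nn.Module):
    """Hypothetical stand-in for the custom sparse layer; the real
    implementation dispatches to SparAMX's AMX kernels instead."""
    def __init__(self, dense: nn.Linear):
        super().__init__()
        self.weight = dense.weight  # the real layer would repack/compress these
        self.bias = dense.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight, self.bias)

def replace_linear_layers(module: nn.Module) -> None:
    """Recursively replace every nn.Linear in `module` in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, SparAMXLinear(child))
        else:
            replace_linear_layers(child)
```

Recursing over `named_children` and calling `setattr` is a common way to patch every linear layer of a pretrained model in place.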
## Custom Linear Implementation via a Torch Extension

Install the dependencies and build the extension:
```bash
pip install -r requirements.txt
python setup.py install
```
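For orientation, a `setup.py` that builds a PyTorch C++ extension typically follows the pattern below; the module and source-file names here are illustrative assumptions, not the repo's actual build configuration:

```python
from setuptools import setup
from torch.utils.cpp_extension import CppExtension, BuildExtension

# Illustrative only: module and source names are assumptions,
# not the repo's actual build configuration.
setup(
    name="sparamx",
    ext_modules=[
        CppExtension(
            name="sparamx_cpp",
            sources=["csrc/sparse_linear.cpp"],   # hypothetical source file
            extra_compile_args=["-march=native"], # enables AMX where the CPU supports it
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

Note that AMX instructions require a CPU that exposes them (e.g., 4th-gen Intel Xeon).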
Please make sure you are logged in to Hugging Face through the CLI if you will be using a gated or private model.
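If you are not logged in yet, you can do so with:

```bash
huggingface-cli login
```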
## Running Experiments

Define the experiments you want to run in `generate_experiments.py`, then run:

```bash
python generate_experiments.py
```
This generates an `experiments.csv` file; modify it if needed.
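As a rough sketch of what this step does, `generate_experiments.py` writes one row per experiment configuration; the field names below are hypothetical, and the actual schema is defined by the script itself:

```python
import csv

# Hypothetical experiment grid; the real script defines its own fields.
experiments = [
    {"model": "meta-llama/Llama-2-7b-hf", "sparsity": 0.5, "num_threads": 32},
    {"model": "meta-llama/Llama-2-7b-hf", "sparsity": 0.7, "num_threads": 32},
]

with open("experiments.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=experiments[0].keys())
    writer.writeheader()
    writer.writerows(experiments)
```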
After that, run:

```bash
./run_experiments.sh
```
Your results will be saved in `experiment_results/YYYY-MM-DD_HH-MM-SS/`.