# SparAMX

Official implementation of **SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs**.
This repo contains the code for SparAMX, a set of open-source customized sparse kernels that can speed up any PyTorch model by automatically replacing all of its linear layers with our customized layer. Furthermore, we demonstrate for the first time the use of unstructured sparsity in the attention computation, achieving a **1.14×** speedup over current systems without compromising accuracy.
*Demo: side-by-side token generation, stock PyTorch vs. SparAMX.*
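To give a concrete picture of the automatic layer replacement described above, here is a minimal sketch. `SparAMXLinear` is a hypothetical stand-in for the repo's actual custom layer, whose real forward pass dispatches to the AMX sparse kernels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparAMXLinear(nn.Module):
    """Hypothetical stand-in for the custom sparse layer; the real
    implementation dispatches to SparAMX's AMX kernels instead."""
    def __init__(self, dense: nn.Linear):
        super().__init__()
        self.weight = dense.weight  # the real layer would repack/compress these
        self.bias = dense.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight, self.bias)

def replace_linear_layers(module: nn.Module) -> None:
    """Recursively replace every nn.Linear in `module` in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, SparAMXLinear(child))
        else:
            replace_linear_layers(child)
```

Recursing over `named_children` and calling `setattr` is a common way to patch every linear layer of a pretrained model in place.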
## Custom Linear Implementation via a Torch Extension

Install the dependencies and build the extension:
```bash
pip install -r requirements.txt
python setup.py install
```
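For orientation, a `setup.py` that builds a PyTorch C++ extension typically follows the pattern below; the module and source-file names here are illustrative assumptions, not the repo's actual build configuration:

```python
from setuptools import setup
from torch.utils.cpp_extension import CppExtension, BuildExtension

# Illustrative only: module and source names are assumptions,
# not the repo's actual build configuration.
setup(
    name="sparamx",
    ext_modules=[
        CppExtension(
            name="sparamx_cpp",
            sources=["csrc/sparse_linear.cpp"],   # hypothetical source file
            extra_compile_args=["-march=native"], # enables AMX where the CPU supports it
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

Note that AMX instructions require a CPU that exposes them (e.g., 4th-gen Intel Xeon).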
Please make sure you are logged in to Hugging Face through the CLI if you will be using a gated or private model.
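If you are not logged in yet, you can do so with:

```bash
huggingface-cli login
```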
## Running Experiments

Define the experiments you want to run in `generate_experiments.py`, then run:

```bash
python generate_experiments.py
```
This generates an `experiments.csv` file; modify it if needed.
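As a rough sketch of what this step does, `generate_experiments.py` writes one row per experiment configuration; the field names below are hypothetical, and the actual schema is defined by the script itself:

```python
import csv

# Hypothetical experiment grid; the real script defines its own fields.
experiments = [
    {"model": "meta-llama/Llama-2-7b-hf", "sparsity": 0.5, "num_threads": 32},
    {"model": "meta-llama/Llama-2-7b-hf", "sparsity": 0.7, "num_threads": 32},
]

with open("experiments.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=experiments[0].keys())
    writer.writeheader()
    writer.writerows(experiments)
```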
After that, run:

```bash
./run_experiments.sh
```
Your results will be saved in `experiment_results/YYYY-MM-DD_HH-MM-SS/`.