Simplistic small language model 3D-parallelism training using NumPy and MPI. Inspired by Megatron-LM and Nanotron and based only on NumPy and MPI for Python, NuMPItron offers a variety of ways to train your Transformer at a snail's pace.
This library is meant as a learning experience for implementing distributed training strategies. Ideally the library will be capable of both 3D parallelism (TP + PP + DP) and ZeRO. If you want to follow along, make sure to check out my blog.
Core functionality will be 3D parallel and ZeRO stage 1 since these can be combined in general:
- Single Core
- Tensor Parallel
- Distributed Data Parallel
- Pipeline Parallel
- Distributed sampling strategies
- ZeRO
When/if this is done, we will look at expert parallel strategies.
First, ensure mpi4py is installed by following the instructions on the MPI for Python page.
Then, install the library using:
git clone https://github.com/lweitkamp/numpitron
cd numpitron
pip install -e . # -e .[dev] for unit tests
You will need to download the Shakespeare dataset (shakespeare_char_{train|val}.bin) from Google Drive and place it in the data folder.
Training with tensor/data parallelism can be done using the train_shakespeare.py script:
mpirun -n {1, 2, ...} python train_shakespeare.py \
--tensor-parallel-size {1, 2, ...} \
--data-parallel-size {1, 2, ...}
Make sure that the product of --tensor-parallel-size and --data-parallel-size is equal to -n. Parameters and optimizer state will be stored at data/model.npy to be used for sampling. Training takes about 12 hours with --tensor-parallel-size 2 and 32 hours without tensor parallelism, reaching a loss of about 1.801 after a couple of hours, depending on your hardware (I'm using a 2015 MacBook Pro):
Note that the graph above only implies that on CPU you are better off performing smaller matmuls (data/tensor parallel combinations). This makes sense given that you are compute bound quite easily on the CPU.
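To make the constraint on -n concrete, here is a minimal sketch of how a flat MPI rank could be split into a data-parallel index and a tensor-parallel index. The function name rank_layout and the choice to keep tensor-parallel ranks contiguous are assumptions for illustration only; the layout NuMPItron actually uses may differ.

```python
def rank_layout(world_size, tensor_parallel_size, data_parallel_size):
    """Map each flat rank to (rank, data_parallel_rank, tensor_parallel_rank).

    Hypothetical helper, not part of NuMPItron. It assumes that ranks in the
    same tensor-parallel group are numbered contiguously.
    """
    # The product of the parallel sizes must match the number of processes (-n).
    if tensor_parallel_size * data_parallel_size != world_size:
        raise ValueError(
            f"tensor ({tensor_parallel_size}) x data ({data_parallel_size}) "
            f"parallel size must equal the number of processes ({world_size})"
        )
    return [
        (rank, rank // tensor_parallel_size, rank % tensor_parallel_size)
        for rank in range(world_size)
    ]


if __name__ == "__main__":
    # Corresponds to: mpirun -n 4 ... --tensor-parallel-size 2 --data-parallel-size 2
    for rank, dp_rank, tp_rank in rank_layout(4, 2, 2):
        print(f"rank {rank}: data-parallel {dp_rank}, tensor-parallel {tp_rank}")
```

Under this assumed layout, ranks 0 and 1 would share one model replica split across two tensor-parallel shards, while ranks 2 and 3 hold a second replica that sees a different slice of each batch.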
Run a sample generation using the following:
mpirun -n {1, 2, ...} python sample.py \
--tensor-parallel-size {1, 2, ...}
With the pretrained model loaded, you should see text like the sample below. Not bad, not great.
Seecon:
Commendom:
Who tear pout mine so I profit in.
BRUTUS:
Why, bear are dreadful he gnot letted and Chrown.
AUFIDIUS:
The may my heart, John my moone, with have glo:
But the bluike to ther opeesusate! Camille,
A marin curstifies will to a lise