In addition to providing the implementation code of the modules and architectures in the Wonderful Matrices paper, this project is also a continuation of the discussion section.
The Doge architecture is a Transformer model that uses Dynamic Mask Attention to understand training with self-attention and inference with state-space.
We hope to further explore whether the Transformer framework allows for more complex feedforward network structures by training a small language model (SLM) based on the Doge architecture, enabling the model to have fewer cache states and larger knowledge capacity.
We also hope to use open-source tools and frameworks as much as possible to simplify the process from data processing to model training, so that beginners can easily understand and use them.
- Windows or Linux
- NVIDIA GPU
- Python 3.10+
- PyTorch 2.0+
- CUDA 11.8+
We highly recommend that you install the latest version of PyTorch and CUDA for optimal performance.
Of course, you can also use the open-source Docker PyTorch image to avoid the hassle of configuring the environment.
docker pull nvcr.io/nvidia/pytorch:24.12-py3
docker run --privileged --gpus all -it --name PyTorch --shm-size=32g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 -v <your code path>:/workspace -v <your datasets path>:/workspace/Doge/datasets nvcr.io/nvidia/pytorch:24.12-py3
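Whichever route you take, it may help to check the environment against the minimum versions listed in the requirements. The following is only a minimal sketch: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical and not part of this repository, and the version parsing is deliberately simple (it ignores build suffixes such as `+cu121`).

```python
import sys


def parse_version(text):
    """Turn a version string such as '2.5.1+cu121' into a tuple like (2, 5, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break  # stop at pre-release tags such as '0a0'
        parts.append(int(piece))
    # Pad to three components so that '2' and '2.0' compare as equal
    return tuple((parts + [0, 0, 0])[:3])


def meets_minimum(found, minimum):
    """Return True if the found version is at least the minimum version."""
    return parse_version(found) >= parse_version(minimum)


def check_environment():
    """Compare the local setup with the minimums from the requirements list."""
    report = {"python": sys.version_info[:2] >= (3, 10)}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(torch.__version__, "2.0")
    cuda_version = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = (
        bool(cuda_version)
        and torch.cuda.is_available()
        and meets_minimum(cuda_version, "11.8")
    )
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

If any entry is reported as missing or too old, the Docker image above is probably the quicker fix.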
- pip install transformers: The core framework for all subsequent work.
- pip install datasets sentencepiece boto3: Used to download and process datasets.
- pip install accelerate: Used for distributed training.
- pip install einx: Fast implementation dependency for the CDMoE module.
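To confirm that the packages listed above are visible to the current Python interpreter, something like the following sketch could be used. The helper name `missing_packages` is hypothetical, and the check only looks for importable top-level modules; it does not verify versions.

```python
from importlib.util import find_spec

# Import names of the packages installed above
REQUIRED = ["transformers", "datasets", "sentencepiece", "boto3", "accelerate", "einx"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```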
git clone https://github.com/LoserCheems/WonderfulMatrices.git
cd WonderfulMatrices
pip install -e .
We have written a notebook (still being updated) that demonstrates the entire process of dataset processing, model training, and model evaluation. You can use either the complete architectures or the individual modules below.
The modeling code of the Cheems architecture.
Source code: modeling_cheems.py
Usage:
import torch
from wonderful_matrices.models.configuration_cheems import CheemsConfig
from wonderful_matrices.models.modeling_cheems import CheemsForCausalLM
from transformers import AutoTokenizer
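tokenizer = AutoTokenizer.from_pretrained("<your_model_path_or_name>")
# A model built directly from a config has randomly initialized weights,
# so the generated text is not meaningful until the model is trained.
config = CheemsConfig()
model = CheemsForCausalLM(config)
# The tokenizer returns a dict of tensors (input_ids and, typically, attention_mask)
inputs = tokenizer("Hi, how are you today?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.batch_decode(outputs))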
The modeling code of the Doge architecture.
Source code: modeling_doge.py
Usage:
import torch
from wonderful_matrices.models.configuration_doge import DogeConfig
from wonderful_matrices.models.modeling_doge import DogeForCausalLM
from transformers import AutoTokenizer
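tokenizer = AutoTokenizer.from_pretrained("<your_model_path_or_name>")
# A model built directly from a config has randomly initialized weights,
# so the generated text is not meaningful until the model is trained.
config = DogeConfig()
model = DogeForCausalLM(config)
# The tokenizer returns a dict of tensors (input_ids and, typically, attention_mask)
inputs = tokenizer("Hi, how are you today?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.batch_decode(outputs))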
The sequence transformation module of the Doge model.
Source code: dmattn.py
Usage:
import torch
from wonderful_matrices.modules.dmattn import DMAttn
batch, seq_len, dim = 2, 16, 64
x = torch.rand(batch, seq_len, dim)
attention_mask = torch.ones(batch, seq_len)
attn = DMAttn(
    d_model=dim,
    n_heads=1,
    max_position_embeddings=seq_len,
    layer_idx=0,
)
y, past_key_values = attn(x, attention_mask)
print(f"Input shape: {x.shape}, Output shape: {y.shape}")
The state transformation module of the Doge model.
Source code: cdmoe.py
Usage:
import torch
from wonderful_matrices.modules.cdmoe import CDMoE
batch, seq_len, dim = 2, 16, 64
x = torch.rand(batch, seq_len, dim)
cdmoe = CDMoE(
    d_model=dim,
    act_fn="silu",
    d_ff=dim * 4,
    d_private_expert_retrieval=64,
    n_experts=64,
    n_experts_heads=1,
    n_experts_per_head=2,
)
y = cdmoe(x)
print(f"Input shape: {x.shape}, Output shape: {y.shape}")
If you use this codebase, or otherwise find our work valuable, please cite our paper:
@misc{shi2024wonderfulmatrices,
  title={Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture},
  author={Jingze Shi and Bingheng Wu},
  year={2024},
  eprint={2412.11834},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2412.11834},
}