training_bert

神霄 edited this page Oct 21, 2022 · 1 revision

Bert training tutorial

Environment setup

We recommend using Anaconda to set up your own Python virtual environment.

# if pip install fails, switching the pip index to a mirror may help
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# build virtual environment
conda env create -f environment.yaml

# activate virtual environment
conda activate maas

Fetch language resources

  1. Install git-lfs (Git Large File Storage)

  2. Clone resources repository

git clone http://www.modelscope.cn/damo/speech_sambert-hifigan_tts_zhitian_emo_zh-cn_16k.git

The cloned repository contains:

.
├── configuration.json
├── description
├── README.md
├── resource
├── resource.zip      <----- dependency resources
└── voices.zip

The resource.zip will be used in the next step.

Data processing

Currently, only plain text data is supported. Make sure your plain text data file looks like the following, with one sentence per line:

徐玠诡谲多智,善揣摩,知道徐知询不可辅佐,掌握着他的短处以归附徐知诰。
许乐夫生于山东省临朐县杨善镇大辛庄,毕业于抗大一分校。
宣统元年(1909年),顺德绅士冯国材在香山大黄圃成立安洲农务分会,管辖东海十六沙,冯国材任总理。
学生们大多住在校区宿舍,通过参加不同的体育文化俱乐部及社交活动,形成一个友谊长存的社会圈。
学校的“三节一会”(艺术节、社团节、科技节、运动会)是显示青春才华的盛大活动。
雪是先天自闭症患者,不懂与人沟通,却拥有灵敏听觉,而且对复杂动作过目不忘。
勋章通过一柱状螺孔和螺钉附着在衣物上。
雅恩雷根斯堡足球俱乐部()是一家位于德国雷根斯堡的足球俱乐部,处于德国足球丙级联赛。
亚历山大·格罗滕迪克于1957年证明了一个深远的推广,现在叫做格罗滕迪克–黎曼–罗赫定理。
...
...
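
Before preprocessing, it can help to sanity-check the text file. The sketch below is not part of KAN-TTS; `check_text_file` is a hypothetical helper, and it assumes the file is UTF-8 encoded with one sentence per line, as in the sample above.

```python
def check_text_file(path):
    """Return (line_number, reason) pairs for lines that look problematic."""
    problems = []
    # Assumption: UTF-8 encoding, one sentence per line.
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            text = line.rstrip("\n")
            if not text.strip():
                problems.append((lineno, "empty line"))
            elif text != text.strip():
                problems.append((lineno, "leading or trailing whitespace"))
    return problems
```

An empty result means every line holds a non-empty sentence without surrounding whitespace; a file that is not valid UTF-8 raises `UnicodeDecodeError` instead.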

For a quick start, a demo dataset is available on xxx.

python kantts/preprocess/text_process.py --text_file TEXT_FILE_PATH --resources_zip_file RESOURCE_ZIPFILE_PATH --output_dir OUTPUT_DATA_FEATURE_PATH 

This produces the features for training:

.
├── bert_train.lst
├── bert_valid.lst
└── raw_metafile.txt
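
To confirm that preprocessing finished, you can check that the three files listed above exist in the output directory. This is a minimal sketch; `missing_feature_files` is a hypothetical helper, not a KAN-TTS function, and it only checks for presence, not content.

```python
import os

# File names taken from the directory listing above.
EXPECTED_FILES = ("bert_train.lst", "bert_valid.lst", "raw_metafile.txt")


def missing_feature_files(output_dir):
    """Return the expected feature files that are absent from output_dir."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(output_dir, name))
    ]
```

An empty list means all three files are present and OUTPUT_DATA_FEATURE_PATH can be passed to training as --root_dir.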

Training

Our training recipe is config-driven. A default Bert model config can be found at kantts/configs/sybert.yaml; you can modify that config to create your own Bert model :)

Now that you have the sword and shield (data and model :-|), give it a try.

CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_sybert.py --model_config YOUR_MODEL_CONFIG  --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH

Distributed training

If you have enough GPU devices, you can use distributed training, which is a lot faster than single-GPU training. Assign the GPU device indexes with the CUDA_VISIBLE_DEVICES environment variable; --nproc_per_node is the number of GPU devices. For example:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 kantts/bin/train_sybert.py --model_config YOUR_MODEL_CONFIG  --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH
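
Because --nproc_per_node has to match the number of devices listed in CUDA_VISIBLE_DEVICES, it can be convenient to derive both from one list. The sketch below assumes PyTorch's `torch.distributed.launch` module as the launcher; `build_launch` is a hypothetical helper, not part of KAN-TTS.

```python
import os


def build_launch(gpu_ids, model_config, root_dir, stage_dir):
    """Build the environment and argv for a distributed training run.

    Hypothetical helper: derives --nproc_per_node from the GPU list so
    the two values cannot drift apart.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU index is required")
    if len(set(gpu_ids)) != len(gpu_ids):
        raise ValueError("GPU indexes must be unique")
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(gpu_ids)}",
        "kantts/bin/train_sybert.py",
        "--model_config", model_config,
        "--root_dir", root_dir,
        "--stage_dir", stage_dir,
    ]
    return env, cmd
```

The returned pair can be passed to `subprocess.run(cmd, env=env, check=True)` from the repository root.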

Resume training

--resume_path can be used to resume training with a pre-trained model, or continue training from a previous checkpoint.

CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_sybert.py --model_config YOUR_MODEL_CONFIG  --root_dir OUTPUT_DATA_FEATURE_PATH --stage_dir TRAINING_STAGE_PATH --resume_path CHECKPOINT_PATH

After training is done, your TRAINING_STAGE_PATH will look like this:

.
├── ckpt/
├── config.yaml
├── log/
└── stdout.log

Model checkpoints are stored in the ckpt directory:

./ckpt
├── checkpoint_10000.pth
├── checkpoint_12000.pth
├── checkpoint_14000.pth
├── checkpoint_16000.pth
├── checkpoint_18000.pth
├── checkpoint_2000.pth
├── checkpoint_4000.pth
├── checkpoint_6000.pth
└── checkpoint_8000.pth
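
Note that the listing above is sorted alphabetically, so checkpoint_8000.pth appears last even though checkpoint_18000.pth is the most recent. When picking a file for --resume_path, compare the step numbers numerically. A minimal sketch, assuming the `checkpoint_<step>.pth` naming shown above (`latest_checkpoint` is a hypothetical helper, not part of KAN-TTS):

```python
import os
import re

# Matches names such as checkpoint_18000.pth and captures the step count.
_CKPT_RE = re.compile(r"^checkpoint_(\d+)\.pth$")


def latest_checkpoint(ckpt_dir):
    """Return the path of the checkpoint with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(ckpt_dir):
        match = _CKPT_RE.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(ckpt_dir, name)
    return best_path
```

Files that do not match the pattern are ignored, and an empty directory yields `None`.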

Pretrained Model (TODO)

XXXXXX XXXXXX

Plugins (TODO)

XXXXXX XXXXXX

References

Our implementation refers to the following repositories and papers.

[ming024's FastSpeech2 Implementation](https://github.com/ming024/FastSpeech2)

[Bert](https://arxiv.org/abs/1810.04805)