We used the pre-trained weights provided by CodeBERT (Feng et al., 2020) as the initial weights.
You can train the added model using CodeBERT's official repository. The cleaned CodeSearchNet was used as the training data. See this document for details. Training took about 23 hours with a batch size of 256.
cd code2nl
lang=python #programming language
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
data_dir=../data/code2nl/CodeSearchNet
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
eval_steps=1000 #400 for ruby, 600 for javascript, 1000 for others
train_steps=50000 #20000 for ruby, 30000 for javascript, 50000 for others
pretrained_model=microsoft/codebert-base #Roberta: roberta-base
python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --train_steps $train_steps --eval_steps $eval_steps
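The comments in the script above give different step counts for Ruby and JavaScript. As a minimal sketch, the per-language choice could be wrapped in a small Python helper that builds the same `run.py` argument list. The helper name `build_train_command` and the lookup table are hypothetical and not part of CodeBERT's repository; the values are copied from the script.

```python
# Step counts copied from the comments in the shell script above.
STEPS_BY_LANG = {
    "ruby": {"eval_steps": 400, "train_steps": 20000},
    "javascript": {"eval_steps": 600, "train_steps": 30000},
}
DEFAULT_STEPS = {"eval_steps": 1000, "train_steps": 50000}


def build_train_command(lang, data_dir="../data/code2nl/CodeSearchNet"):
    """Return the run.py argument list for one language (hypothetical helper)."""
    steps = STEPS_BY_LANG.get(lang, DEFAULT_STEPS)
    return [
        "python", "run.py", "--do_train", "--do_eval",
        "--model_type", "roberta",
        "--model_name_or_path", "microsoft/codebert-base",
        "--train_filename", f"{data_dir}/{lang}/train.jsonl",
        "--dev_filename", f"{data_dir}/{lang}/valid.jsonl",
        "--output_dir", f"model/{lang}",
        "--max_source_length", "256",
        "--max_target_length", "128",
        "--beam_size", "10",
        "--train_batch_size", "64",
        "--eval_batch_size", "64",
        "--learning_rate", "5e-5",
        "--train_steps", str(steps["train_steps"]),
        "--eval_steps", str(steps["eval_steps"]),
    ]
```

The returned list can be passed to `subprocess.run` from inside the `code2nl` directory, which matches the `cd code2nl` step of the script.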
To train the Diff model, we have to use our own code, because it needs an implementation that distinguishes between added and diff inputs.
As for the training data, only the top 100 Python repositories listed in the document were cloned (gitcloner.py), and the commit messages, added code, and deleted code were preprocessed into JSONL format (gitparser.py). The data we used is available on Google Drive.
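To make the preprocessing step more concrete, here is a minimal sketch of how one commit might be turned into a JSONL record. It is not the actual gitparser.py logic: the field names `msg`, `added`, and `deleted` are hypothetical, and the header check is a simplification that would also skip content lines that happen to start with `++` or `--`.

```python
import json


def split_diff(diff_text):
    """Split a unified diff into added and deleted lines, skipping file headers."""
    added, deleted = [], []
    for line in diff_text.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue  # file header lines, not changed content (simplified check)
        if line.startswith("+"):
            added.append(line[1:])
        elif line.startswith("-"):
            deleted.append(line[1:])
    return added, deleted


def to_jsonl_record(message, diff_text):
    """Serialize one commit as a single JSON line (field names are hypothetical)."""
    added, deleted = split_diff(diff_text)
    return json.dumps({"msg": message, "added": added, "deleted": deleted})
```

Writing one such line per commit produces a `.jsonl` file of the kind expected by `--train_filename` and `--dev_filename`, assuming the training code reads the same field names.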
As with the added model, training took about 20 hours with a batch size of 256.
Note that the weights of the added model were used as the initial weights. Be sure to set them with the load_model_path argument.
lr=5e-5
batch_size=64
beam_size=10
source_length=256
target_length=128
output_dir=model/python
train_file=train.jsonl
dev_file=valid.jsonl
eval_steps=1000
train_steps=50000
saved_model=pytorch_model.bin # this is added model weight
python train.py --do_train --do_eval --model_type roberta \
--model_name_or_path microsoft/codebert-base \
--load_model_path $saved_model \
--train_filename $train_file \
--dev_filename $dev_file \
--output_dir $output_dir \
--max_source_length $source_length \
--max_target_length $target_length \
--beam_size $beam_size \
--train_batch_size $batch_size \
--eval_batch_size $batch_size \
--learning_rate $lr \
--train_steps $train_steps \
--eval_steps $eval_steps
See the Diff model section above for the role of the code.
This code clones all repositories in repositories.txt.
usage: gitcloner.py [-h] --repositories REPOSITORIES --repos_dir REPOS_DIR [--num_worker_threads NUM_WORKER_THREADS]
optional arguments:
-h, --help show this help message and exit
--repositories REPOSITORIES
repositories file path.
--repos_dir REPOS_DIR
directory that all repositories will be downloaded.
--num_worker_threads NUM_WORKER_THREADS
number of threads in a worker
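The options above suggest a simple pattern: read a list of repositories, then clone them concurrently with a fixed number of worker threads. The following is a minimal sketch of that pattern, not the actual gitcloner.py. It assumes each line of the repositories file is a GitHub `owner/name` entry, and all function names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_repositories(path):
    """Read one repository entry per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def clone_command(repo, repos_dir):
    """Build the git command for one entry, assuming the 'owner/name' form."""
    target = Path(repos_dir) / repo.replace("/", "_")
    return ["git", "clone", f"https://github.com/{repo}.git", str(target)]


def clone_all(repositories, repos_dir, num_worker_threads=4):
    """Clone every listed repository, running up to num_worker_threads at once."""
    Path(repos_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=num_worker_threads) as pool:
        results = pool.map(
            lambda repo: subprocess.run(clone_command(repo, repos_dir)).returncode,
            read_repositories(repositories),
        )
        return list(results)  # one git exit code per repository
```

Threads are a reasonable fit here because each worker spends its time waiting on a `git` subprocess and the network rather than running Python code.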
This code preprocesses cloned repositories and divides them into train, valid, and test data.
usage: gitparser.py [-h] --repositories REPOSITORIES --repos_dir REPOS_DIR --output_dir OUTPUT_DIR [--tokenizer_name TOKENIZER_NAME] [--num_workers NUM_WORKERS]
[--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH]
optional arguments:
-h, --help show this help message and exit
--repositories REPOSITORIES
repositories file path.
--repos_dir REPOS_DIR
directory that all repositories had been downloaded.
--output_dir OUTPUT_DIR
The output directory where the preprocessed data will be written.
--tokenizer_name TOKENIZER_NAME
The name of tokenizer
--num_workers NUM_WORKERS
number of process
--max_source_length MAX_SOURCE_LENGTH
The maximum total source sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
--max_target_length MAX_TARGET_LENGTH
The maximum total target sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
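The `--max_source_length` and `--max_target_length` descriptions above say that longer sequences are truncated and shorter ones are padded. A minimal sketch of that behavior on a list of token ids follows. The function name is hypothetical, padding on the right is an assumption, and `pad_id` must be taken from the tokenizer actually in use.

```python
def fit_to_length(token_ids, max_length, pad_id):
    """Truncate token_ids to max_length, or right-pad with pad_id up to max_length."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))
```

For example, `fit_to_length([5, 6], 4, pad_id=1)` returns `[5, 6, 1, 1]`, so every sequence in a batch ends up with the same length.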
If a UnicodeDecodeError occurs while using gitparser.py, you must use a version of the GitPython package that includes at least this commit.
The added model has already been trained for Python. So, if you only want to build a Diff model for Python, step 3 can be skipped. However, for the other languages (JavaScript, Go, Ruby, PHP, and Java), Code2NL training is required to produce the initial weights for the model used in step 4.
Train the Diff model for each language, using the weights of the added model as the initial weights.
usage: train.py [-h] --model_type MODEL_TYPE --model_name_or_path MODEL_NAME_OR_PATH --output_dir OUTPUT_DIR [--load_model_path LOAD_MODEL_PATH]
[--train_filename TRAIN_FILENAME] [--dev_filename DEV_FILENAME] [--test_filename TEST_FILENAME] [--config_name CONFIG_NAME] [--tokenizer_name TOKENIZER_NAME]
[--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH] [--do_train] [--do_eval] [--do_test] [--do_lower_case] [--no_cuda]
[--train_batch_size TRAIN_BATCH_SIZE] [--eval_batch_size EVAL_BATCH_SIZE] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
[--learning_rate LEARNING_RATE] [--beam_size BEAM_SIZE] [--weight_decay WEIGHT_DECAY] [--adam_epsilon ADAM_EPSILON] [--max_grad_norm MAX_GRAD_NORM]
[--num_train_epochs NUM_TRAIN_EPOCHS] [--max_steps MAX_STEPS] [--eval_steps EVAL_STEPS] [--train_steps TRAIN_STEPS] [--warmup_steps WARMUP_STEPS]
[--local_rank LOCAL_RANK] [--seed SEED]
optional arguments:
-h, --help show this help message and exit
--model_type MODEL_TYPE
Model type: e.g. roberta
--model_name_or_path MODEL_NAME_OR_PATH
Path to pre-trained model: e.g. roberta-base
--output_dir OUTPUT_DIR
The output directory where the model predictions and checkpoints will be written.
--load_model_path LOAD_MODEL_PATH
Path to trained model: Should contain the .bin files
--train_filename TRAIN_FILENAME
The train filename. Should contain the .jsonl files for this task.
--dev_filename DEV_FILENAME
The dev filename. Should contain the .jsonl files for this task.
--test_filename TEST_FILENAME
The test filename. Should contain the .jsonl files for this task.
--config_name CONFIG_NAME
Pretrained config name or path if not the same as model_name
--tokenizer_name TOKENIZER_NAME
Pretrained tokenizer name or path if not the same as model_name
--max_source_length MAX_SOURCE_LENGTH
The maximum total source sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
--max_target_length MAX_TARGET_LENGTH
The maximum total target sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
--do_train Whether to run training.
--do_eval Whether to run eval on the dev set.
--do_test Whether to run eval on the dev set.
--do_lower_case Set this flag if you are using an uncased model.
--no_cuda Avoid using CUDA when available
--train_batch_size TRAIN_BATCH_SIZE
Batch size per GPU/CPU for training.
--eval_batch_size EVAL_BATCH_SIZE
Batch size per GPU/CPU for evaluation.
--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
Number of updates steps to accumulate before performing a backward/update pass.
--learning_rate LEARNING_RATE
The initial learning rate for Adam.
--beam_size BEAM_SIZE
beam size for beam search
--weight_decay WEIGHT_DECAY
Weight deay if we apply some.
--adam_epsilon ADAM_EPSILON
Epsilon for Adam optimizer.
--max_grad_norm MAX_GRAD_NORM
Max gradient norm.
--num_train_epochs NUM_TRAIN_EPOCHS
Total number of training epochs to perform.
--max_steps MAX_STEPS
If > 0: set total number of training steps to perform. Override num_train_epochs.
--eval_steps EVAL_STEPS
--train_steps TRAIN_STEPS
--warmup_steps WARMUP_STEPS
Linear warmup over warmup_steps.
--local_rank LOCAL_RANK
For distributed training: local_rank
--seed SEED random seed for initialization
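The `--warmup_steps` option is described only as "Linear warmup over warmup_steps". As a minimal sketch of what such a schedule usually looks like, the function below ramps a learning-rate multiplier from 0 to 1 during warmup and then decays it linearly to 0 at `train_steps`. The linear decay after warmup is an assumption about train.py, not something the help text states, and the default of 1000 warmup steps is a hypothetical value.

```python
def linear_warmup_factor(step, warmup_steps, train_steps):
    """Learning-rate multiplier: ramp 0 -> 1 over warmup, then decay linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # Assumed behavior after warmup: linear decay that reaches 0 at train_steps.
    return max(0.0, (train_steps - step) / max(1, train_steps - warmup_steps))


def learning_rate_at(step, base_lr=5e-5, warmup_steps=1000, train_steps=50000):
    """Effective learning rate at a given step (warmup_steps default is hypothetical)."""
    return base_lr * linear_warmup_factor(step, warmup_steps, train_steps)
```

With `--warmup_steps 0` the multiplier starts at 1, so training begins at the full `--learning_rate`.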