Preliminary

Download Pretrained Checkpoints

We use clip-vit-large-patch14-336 as the vision encoder for both the teacher and student models. Additionally, we use Qwen-1.5 / Qwen-2 models of different sizes as the LLMs for the teacher and student models. These pretrained checkpoints can be downloaded from HuggingFace.
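
A minimal download sketch using huggingface-cli (the exact Qwen variants and local directories are placeholders; pick the teacher and student sizes you need):

huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir checkpoints/clip-vit-large-patch14-336
huggingface-cli download Qwen/Qwen2-7B --local-dir checkpoints/Qwen2-7B          # e.g. teacher LLM
huggingface-cli download Qwen/Qwen1.5-1.8B --local-dir checkpoints/Qwen1.5-1.8B  # e.g. student LLM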

Prepare Teacher Model

We follow the approach of LLaVA-1.5 to train the teacher model, replacing Vicuna-1.5-7B with Qwen-2-7B, while keeping the training dataset and strategy unchanged.
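
In practice, this amounts to pointing the LLaVA-1.5 pretrain/finetune launch scripts at the Qwen-2-7B checkpoint; a minimal sketch of the relevant arguments (the checkpoint paths are placeholders, and everything else follows the official LLaVA-1.5 recipe):

# in the LLaVA-1.5 pretrain / finetune launch scripts (all other arguments unchanged):
--model_name_or_path checkpoints/Qwen2-7B \
--vision_tower checkpoints/clip-vit-large-patch14-336 \
# note: the conversation template also has to match Qwen (cf. CONV="qwen" in the evaluation example below)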

Training

The training of LLaVA-MoD comprises three stages:

  • Adaptor Initialization: 0.6 million general captioning samples are employed to bridge the gap between visual and language modalities.
  • Mimic Distillation:
    • Dense-to-Dense Distillation: 2.4 million general captioning and conversation samples are utilized to distill general knowledge.
    • Dense-to-Sparse Distillation: 1.4 million multi-task samples, covering VQA, document, science, and OCR data, are used to distill specialized knowledge.
  • Preference Distillation: 80,000 preference samples are utilized to distill preference knowledge.
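
Taken together, the stages map onto four training scripts, run in the following order (each is described in the sections below):

bash shells/train/qwen/pretrain.sh                   # Adaptor Initialization
bash shells/train/qwen/dense2dense_distillation.sh   # Mimic Distillation: dense-to-dense
bash shells/train/qwen/dense2sparse_distillation.sh  # Mimic Distillation: dense-to-sparse
bash shells/train/qwen/preference_distillation.sh    # Preference Distillation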

Adaptor Initialization

  • first, download the caption dataset LLaVA-Pretrain
  • then, run the following script:
bash shells/train/qwen/pretrain.sh
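
The caption data is hosted on HuggingFace; a minimal download sketch (the local directory is a placeholder, and the data paths inside pretrain.sh must point to wherever you unpack it):

huggingface-cli download liuhaotian/LLaVA-Pretrain --repo-type dataset --local-dir data/LLaVA-Pretrain
unzip data/LLaVA-Pretrain/images.zip -d data/LLaVA-Pretrain/images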

Mimic Distillation

In this stage, we initially conduct Dense-to-Dense Distillation on the dense student model. Subsequently, we up-cycle the student model from dense to sparse and conduct Dense-to-Sparse Distillation.

Dense-to-Dense Distillation

  • first, set the distillation and model configuration (the student is still dense, so the MoE options stay disabled):
# KD config
POLICY_MODEL_TYPE='dense'  # student
REF_MODEL_TYPE='dense'     # teacher
LOSS_TYPE='only_kd'  # kd_lm | only_kd
DISTILL_ALL_TOKENS=False  # False: only response, True: multimodal instruction + response

# MoE config
MOE_LOSS_ENABLE=False
MOE_ENABLE=False
MOE_FINETUNE=False
MOE_MODE="sparse"
NUM_EXPERTS=4
TOP_K_EXPERTS=2
USE_RESIDUAL=False
ROUTER_AUX_LOSS_COEF=0.01
CAPACITY_FACTOR=1.5
  • finally, run the following script:
bash shells/train/qwen/dense2dense_distillation.sh

Dense-to-Sparse Distillation

  • first, set the distillation and model configuration (the student is now a sparse MoE model, so the MoE options are enabled):
# KD config
POLICY_MODEL_TYPE='sparse'  # student: sparse MoE (up-cycled from the dense student)
REF_MODEL_TYPE='dense'      # teacher
LOSS_TYPE='only_kd'  # kd_lm | only_kd
DISTILL_ALL_TOKENS=False  # False: only response, True: multimodal instruction + response

# MoE config
MOE_LOSS_ENABLE=True
MOE_ENABLE=True
MOE_FINETUNE=False
MOE_MODE="sparse"
NUM_EXPERTS=4              # experts per MoE layer
TOP_K_EXPERTS=2            # each token is routed to the top-2 of the 4 experts
USE_RESIDUAL=False
ROUTER_AUX_LOSS_COEF=0.01  # weight of the router load-balancing (auxiliary) loss
CAPACITY_FACTOR=1.5
  • finally, run the following script:
bash shells/train/qwen/dense2sparse_distillation.sh

Preference Distillation

  • first, download the preference dataset from RLAIF-V (a download sketch is given at the end of this section).
  • then, set the distillation and model configuration:
# KD config
POLICY_MODEL_TYPE='sparse'  # student (sparse MoE)
REF_MODEL_TYPE='dense'      # teacher
LOSS_TYPE='kto_pair'  # kto_pair | sigmoid
DISTILL_ALL_TOKENS=False  # False: only response, True: multimodal instruction + response

# MoE config
MOE_LOSS_ENABLE=True
MOE_ENABLE=True
MOE_FINETUNE=True
MOE_MODE="sparse"
NUM_EXPERTS=4
TOP_K_EXPERTS=2
USE_RESIDUAL=False
ROUTER_AUX_LOSS_COEF=0.01
CAPACITY_FACTOR=1.5
  • finally, run the following script:
bash shells/train/qwen/preference_distillation.sh
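
As referenced in the first step above, the RLAIF-V preference data can be fetched from HuggingFace; a minimal sketch (the dataset repo id and local directory are assumptions, so check the RLAIF-V release for the authoritative location):

huggingface-cli download openbmb/RLAIF-V-Dataset --repo-type dataset --local-dir data/RLAIF-V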

Evaluation

We follow LLaVA-1.5 to evaluate on the comprehension benchmarks (TextVQA, GQA, ScienceQA, VizWiz, MME, and MMBench) and RLAIF-V to evaluate on the hallucination benchmarks (MMHal-Bench, POPE, and Object HalBench). Please refer to these resources to organize the evaluation datasets. All evaluation scripts are located under shells/eval. Here is an example for MMBench.

#!/bin/bash
MODEL_NAME='your_model_name'   # used to name the answer files
MODEL_PATH='your_model_path'   # path to the trained LLaVA-MoD checkpoint

CONV="qwen"                        # conversation template
SPLIT="mmbench_dev_en_20231003"    # MMBench dev split (tsv file name)
EVAL="benchmark"                   # root directory of the organized evaluation data

deepspeed --include localhost:0 --master_port 20029 llavamod/eval/model_vqa_mmbench.py \
     --model-path ${MODEL_PATH} \
     --question-file ${EVAL}/mmbench/$SPLIT.tsv \
     --answers-file ${EVAL}/mmbench/answers/$SPLIT/${MODEL_NAME}.jsonl \
     --single-pred-prompt \
     --temperature 0 \
     --conv-mode ${CONV}

mkdir -p ${EVAL}/mmbench/answers_upload/$SPLIT

python3 scripts/convert_mmbench_for_submission.py \
    --annotation-file ${EVAL}/mmbench/$SPLIT.tsv \
    --result-dir ${EVAL}/mmbench/answers/$SPLIT \
    --upload-dir ${EVAL}/mmbench/answers_upload/$SPLIT \
    --experiment ${MODEL_NAME}
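
For reference, the paths in this example imply the following layout under the evaluation root (EVAL="benchmark"); the tree is inferred from the script above rather than being an additional requirement:

benchmark/
└── mmbench/
    ├── mmbench_dev_en_20231003.tsv              # questions ($SPLIT.tsv), organized per the MMBench release
    ├── answers/mmbench_dev_en_20231003/         # raw predictions written by model_vqa_mmbench.py
    └── answers_upload/mmbench_dev_en_20231003/  # submission files written by convert_mmbench_for_submission.py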