We use clip-vit-large-patch14-336 as the vision encoder for both the teacher and the student models. For the LLM, we use Qwen-1.5 / Qwen-2 models of different sizes for the teacher and the student. The pretrained checkpoints can be downloaded from HuggingFace.
We follow the approach of LLaVA-1.5 to train the teacher model, replacing Vicuna-1.5-7B with Qwen-2-7B, while keeping the training dataset and strategy unchanged.
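The checkpoints mentioned above can be fetched with `huggingface-cli`, for example. The repository IDs and local directories below are illustrative; pick the Qwen sizes you actually use for the teacher and the student:

```bash
# Fetch the vision encoder and example teacher / student LLMs from HuggingFace.
# Repo IDs and target directories are illustrative; adjust to your setup.
huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir checkpoints/clip-vit-large-patch14-336
huggingface-cli download Qwen/Qwen2-7B-Instruct --local-dir checkpoints/Qwen2-7B-Instruct  # e.g. teacher LLM
huggingface-cli download Qwen/Qwen1.5-1.8B --local-dir checkpoints/Qwen1.5-1.8B            # e.g. a smaller student LLM
```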
The training of LLaVA-MoD comprises three stages:
- Adaptor Initialization: 0.6 million general captioning samples are employed to bridge the gap between visual and language modalities.
- Mimic Distillation:
  - Dense-to-Dense Distillation: 2.4 million general captioning and conversation samples are used to distill general knowledge.
  - Dense-to-Sparse Distillation: 1.4 million multi-task samples, covering VQA, documents, science, and OCR, are used to distill specialized knowledge.
- Preference Distillation: 80,000 preference samples are used to distill preference knowledge.
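These stages map onto the launch scripts under `shells/train/qwen/` used throughout the rest of this section:

```
shells/train/qwen/
├── pretrain.sh                    # Adaptor Initialization
├── dense2dense_distillation.sh    # Mimic Distillation: dense-to-dense
├── dense2sparse_distillation.sh   # Mimic Distillation: dense-to-sparse
└── preference_distillation.sh     # Preference Distillation
```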
- first, for the Adaptor Initialization stage, download the caption dataset LLaVA-Pretrain (see the download sketch below)
- then, run the following script:
bash shells/train/qwen/pretrain.sh
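A minimal sketch of the LLaVA-Pretrain download step, assuming the data is pulled from its HuggingFace dataset repo; the local directory is illustrative:

```bash
# Download the LLaVA-Pretrain caption data and unpack the images.
# The dataset repo ID and local paths are assumptions; adjust to your setup.
huggingface-cli download liuhaotian/LLaVA-Pretrain --repo-type dataset --local-dir data/llava-pretrain
unzip data/llava-pretrain/images.zip -d data/llava-pretrain/images
```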
In the Mimic Distillation stage, we first conduct Dense-to-Dense Distillation on the dense student model. We then up-cycle the student model from dense to sparse and conduct Dense-to-Sparse Distillation.
- first, for Dense-to-Dense Distillation, download the general caption datasets (ShareGPT4V-Captioner and ALLaVA-Caption-LAION-4V) and the general conversation datasets (SViT, LVIS, LRV, MIMIC-IT). The general datasets have also been packaged and can be downloaded from MoE-LLaVA.
- then, set the distillation and model configuration:
# KD config
POLICY_MODEL_TYPE='dense'
REF_MODEL_TYPE='dense'
LOSS_TYPE='only_kd' # kd_lm | only_kd
DISTILL_ALL_TOKENS=False # False: only response, True: multimodal instruction + response
# MoE config
MOE_LOSS_ENABLE=False
MOE_ENABLE=False
MOE_FINETUNE=False
MOE_MODE="sparse"
NUM_EXPERTS=4
TOP_K_EXPERTS=2
USE_RESIDUAL=False
ROUTER_AUX_LOSS_COEF=0.01
CAPACITY_FACTOR=1.5
- finally, run the following script:
bash shells/train/qwen/dense2dense_distillation.sh
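If you want to change any of the settings above, one option is to edit them in the launch script before running it. The one-liner below is only a sketch: it assumes these variables are defined near the top of `shells/train/qwen/dense2dense_distillation.sh`, so check your copy of the script first.

```bash
# Example: switch to the combined distillation + language-modeling loss
# (per the "kd_lm | only_kd" comment above), then launch the stage.
# Assumes LOSS_TYPE is set inside the launch script.
sed -i "s/^LOSS_TYPE=.*/LOSS_TYPE='kd_lm'/" shells/train/qwen/dense2dense_distillation.sh
bash shells/train/qwen/dense2dense_distillation.sh
```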
- first, for Dense-to-Sparse Distillation, download the multi-task datasets: Text-VQA, IConQA, SQA, and SBU; follow ShareGPT4V to download images from LAION-CC-SBU-558K, COCO, WebData, SAM, GQA, OCR-VQA, TextVQA, and VisualGenome (Part1, Part2); and follow InternVL to download DVQA, ChartQA, AI2D, DocVQA, GeoQA+, and SynthDoG-EN. The JSON files have also been packaged and can be downloaded from MobileVLM and InternVL.
- then, set the distillation and model configuration:
# KD config
POLICY_MODEL_TYPE='dense'
REF_MODEL_TYPE='dense'
LOSS_TYPE='only_kd' # kd_lm | only_kd
DISTILL_ALL_TOKENS=False # False: only response, True: multimodal instruction + response
# MoE config
MOE_LOSS_ENABLE=False
MOE_ENABLE=False
MOE_FINETUNE=False
MOE_MODE="sparse"
NUM_EXPERTS=4
TOP_K_EXPERTS=2
USE_RESIDUAL=False
ROUTER_AUX_LOSS_COEF=0.01
CAPACITY_FACTOR=1.5
- finally, run the following script:
bash shells/train/qwen/dense2sparse_distillation.sh
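Before launching, it can help to verify that the image folders referenced by the packaged JSON files are in place. The directory names below follow the common LLaVA-1.5 / ShareGPT4V layout and are assumptions; adjust them to wherever you actually placed the images.

```bash
# Quick sanity check that the expected image folders exist under the data root.
# DATA_ROOT and the folder list are illustrative, not a layout required by the repo.
DATA_ROOT=data/multitask
for d in coco gqa ocr_vqa textvqa vg sam; do
  if [ ! -d "${DATA_ROOT}/${d}" ]; then
    echo "missing image folder: ${DATA_ROOT}/${d}"
  fi
done
```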
- first, download the preference dataset from RLAIF-V (see the download sketch below).
- then, set the distillation and model configuration:
# KD config
POLICY_MODEL_TYPE='sparse'
REF_MODEL_TYPE='dense'
LOSS_TYPE='kto_pair' # kto_pair | sigmoid
DISTILL_ALL_TOKENS=False # False: only response, True: multimodal instruction + response
# MoE config
MOE_LOSS_ENABLE=True
MOE_ENABLE=True
MOE_FINETUNE=True
MOE_MODE="sparse"
NUM_EXPERTS=4
TOP_K_EXPERTS=2
USE_RESIDUAL=False
ROUTER_AUX_LOSS_COEF=0.01
CAPACITY_FACTOR=1.5
- finally, run the following script:
bash shells/train/qwen/preference_distillation.sh
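A sketch of the download step, assuming the RLAIF-V preference data is pulled from its HuggingFace dataset repo; the repo ID and local directory are assumptions to adjust to your setup:

```bash
# Download the RLAIF-V preference data used for Preference Distillation.
# Repo ID and local path are assumptions; adjust to your setup.
huggingface-cli download openbmb/RLAIF-V-Dataset --repo-type dataset --local-dir data/rlaif-v
```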
We follow LLaVA-1.5 to evaluate on comprehension benchmarks (TextVQA, GQA, ScienceQA, VizWiz, MME, MMBench) and RLAIF-V to evaluate on hallucination benchmarks (MMHal-Bench, POPE, and Object HalBench). Please refer to these resources to organize the evaluation datasets. All the evaluation scripts are located under shells/eval. Here is an example for MMBench.
#!/bin/bash
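# Evaluate a model on MMBench and package the predictions for submission.
# MODEL_NAME / MODEL_PATH: name and checkpoint directory of the model to evaluate.
# CONV: conversation template; "qwen" matches the Qwen-based student models.
# SPLIT: MMBench split (dev, English, 2023-10-03 release).
# EVAL: root directory holding the organized evaluation data.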
MODEL_NAME='your_model_name'
MODEL_PATH='your_model_path'
CONV="qwen"
SPLIT="mmbench_dev_en_20231003"
EVAL="benchmark"
deepspeed --include localhost:0 --master_port 20029 llavamod/eval/model_vqa_mmbench.py \
--model-path ${MODEL_PATH} \
--question-file ${EVAL}/mmbench/$SPLIT.tsv \
--answers-file ${EVAL}/mmbench/answers/$SPLIT/${MODEL_NAME}.jsonl \
--single-pred-prompt \
--temperature 0 \
--conv-mode ${CONV}
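
# Convert the raw .jsonl answers into the format expected for MMBench submission.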
mkdir -p ${EVAL}/mmbench/answers_upload/$SPLIT
python3 scripts/convert_mmbench_for_submission.py \
--annotation-file ${EVAL}/mmbench/$SPLIT.tsv \
--result-dir ${EVAL}/mmbench/answers/$SPLIT \
--upload-dir ${EVAL}/mmbench/answers_upload/$SPLIT \
--experiment ${MODEL_NAME}
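A typical run then looks like the following; the script name under `shells/eval` is illustrative, and the exact output paths depend on the values you set above:

```bash
# Run the MMBench evaluation script, then inspect the generated answers and
# the converted files prepared for upload. Script and directory names are illustrative.
bash shells/eval/mmbench.sh
ls benchmark/mmbench/answers/mmbench_dev_en_20231003/
ls benchmark/mmbench/answers_upload/mmbench_dev_en_20231003/
```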