We use clip-vit-large-patch14-336 as the vision encoder for both the teacher and the student models. For the LLM, we use Qwen-1.5 / Qwen-2 models of different sizes for the teacher and the student. The pretrained checkpoints can be downloaded from HuggingFace.
We follow the approach of LLaVA-1.5 to train the teacher model, replacing Vicuna-1.5-7B with Qwen-2-7B, while keeping the training dataset and strategy unchanged.
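The checkpoints mentioned above can be fetched with `huggingface-cli`, for example. The repository IDs and local directories below are illustrative; pick the Qwen sizes you actually use for the teacher and the student:

```bash
# Fetch the vision encoder and example teacher / student LLMs from HuggingFace.
# Repo IDs and target directories are illustrative; adjust to your setup.
huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir checkpoints/clip-vit-large-patch14-336
huggingface-cli download Qwen/Qwen2-7B-Instruct --local-dir checkpoints/Qwen2-7B-Instruct  # e.g. teacher LLM
huggingface-cli download Qwen/Qwen1.5-1.8B --local-dir checkpoints/Qwen1.5-1.8B            # e.g. a smaller student LLM
```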
The training of LLaVA-MoD comprises three stages:
- Adaptor Initialization: 0.6 million general captioning samples are employed to bridge the gap between visual and language modalities.
- Mimic Distillation:
  - Dense-to-Dense Distillation: 2.4 million general captioning and conversation samples are used to distill general knowledge.
  - Dense-to-Sparse Distillation: 1.4 million multi-task samples, covering VQA, documents, science, and OCR, are used to distill specialized knowledge.
- Preference Distillation: 80,000 preference samples are used to distill preference knowledge.
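These stages map onto the launch scripts under `shells/train/qwen/` used throughout the rest of this section:

```
shells/train/qwen/
├── pretrain.sh                    # Adaptor Initialization
├── dense2dense_distillation.sh    # Mimic Distillation: dense-to-dense
├── dense2sparse_distillation.sh   # Mimic Distillation: dense-to-sparse
└── preference_distillation.sh     # Preference Distillation
```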
- first, for the Adaptor Initialization stage, download the caption dataset LLaVA-Pretrain (see the download sketch below)
- then, run the following script:
bash shells/train/qwen/pretrain.sh
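A minimal sketch of the LLaVA-Pretrain download step, assuming the data is pulled from its HuggingFace dataset repo; the local directory is illustrative:

```bash
# Download the LLaVA-Pretrain caption data and unpack the images.
# The dataset repo ID and local paths are assumptions; adjust to your setup.
huggingface-cli download liuhaotian/LLaVA-Pretrain --repo-type dataset --local-dir data/llava-pretrain
unzip data/llava-pretrain/images.zip -d data/llava-pretrain/images
```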
In the Mimic Distillation stage, we first conduct Dense-to-Dense Distillation on the dense student model. We then up-cycle the student model from dense to sparse and conduct Dense-to-Sparse Distillation.
- first, for Dense-to-Dense Distillation, download the general caption datasets (ShareGPT4V-Captioner and ALLaVA-Caption-LAION-4V) and the general conversation datasets (SViT, LVIS, LRV, MIMIC-IT). The general datasets have also been packaged and can be downloaded from MoE-LLaVA.
- then, set the distillation and model configuration:
# KD config
POLICY_MODEL_TYPE='dense'
REF_MODEL_TYPE='dense'
LOSS_TYPE='only_kd' # kd_lm | only_kd
DISTILL_ALL_TOKENS=False # False: only response, True: multimodal instruction + response
# MoE config
MOE_LOSS_ENABLE=False
MOE_ENABLE=False
MOE_FINETUNE=False
MOE_MODE="sparse"
NUM_EXPERTS=4
TOP_K_EXPERTS=2
USE_RESIDUAL=False
ROUTER_AUX_LOSS_COEF=0.01
CAPACITY_FACTOR=1.5
- finally, run the following script:
bash shells/train/qwen/dense2dense_distillation.sh
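If you want to change any of the settings above, one option is to edit them in the launch script before running it. The one-liner below is only a sketch: it assumes these variables are defined near the top of `shells/train/qwen/dense2dense_distillation.sh`, so check your copy of the script first.

```bash
# Example: switch to the combined distillation + language-modeling loss
# (per the "kd_lm | only_kd" comment above), then launch the stage.
# Assumes LOSS_TYPE is set inside the launch script.
sed -i "s/^LOSS_TYPE=.*/LOSS_TYPE='kd_lm'/" shells/train/qwen/dense2dense_distillation.sh
bash shells/train/qwen/dense2dense_distillation.sh
```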
- first, for Dense-to-Sparse Distillation, download the multi-task datasets: Text-VQA, IConQA, SQA, and SBU; follow ShareGPT4V to download images from LAION-CC-SBU-558K, COCO, WebData, SAM, GQA, OCR-VQA, TextVQA, and VisualGenome (Part1, Part2); and follow InternVL to download DVQA, ChartQA, AI2D, DocVQA, GeoQA+, and SynthDoG-EN. The JSON files have also been packaged and can be downloaded from MobileVLM and InternVL.
- then, set the distillation and model configuration:
# KD config
POLICY_MODEL_TYPE='dense'
REF_MODEL_TYPE='dense'
LOSS_TYPE='only_kd' # kd_lm | only_kd
DISTILL_ALL_TOKENS=False # False: only response, True: multimodal instruction + response
# MoE config
MOE_LOSS_ENABLE=False
MOE_ENABLE=False
MOE_FINETUNE=False
MOE_MODE="sparse"
NUM_EXPERTS=4
TOP_K_EXPERTS=2
USE_RESIDUAL=False
ROUTER_AUX_LOSS_COEF=0.01
CAPACITY_FACTOR=1.5
- finally, run the following script:
bash shells/train/qwen/dense2sparse_distillation.sh
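Before launching, it can help to verify that the image folders referenced by the packaged JSON files are in place. The directory names below follow the common LLaVA-1.5 / ShareGPT4V layout and are assumptions; adjust them to wherever you actually placed the images.

```bash
# Quick sanity check that the expected image folders exist under the data root.
# DATA_ROOT and the folder list are illustrative, not a layout required by the repo.
DATA_ROOT=data/multitask
for d in coco gqa ocr_vqa textvqa vg sam; do
  if [ ! -d "${DATA_ROOT}/${d}" ]; then
    echo "missing image folder: ${DATA_ROOT}/${d}"
  fi
done
```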
- first, download the preference dataset from RLAIF-V (see the download sketch below).
- then, set the distillation and model configuration:
# KD config
POLICY_MODEL_TYPE='sparse'
REF_MODEL_TYPE='dense'
LOSS_TYPE='kto_pair' # kto_pair | sigmoid
DISTILL_ALL_TOKENS=False # False: only response, True: multimodal instruction + response
# MoE config
MOE_LOSS_ENABLE=True
MOE_ENABLE=True
MOE_FINETUNE=True
MOE_MODE="sparse"
NUM_EXPERTS=4
TOP_K_EXPERTS=2
USE_RESIDUAL=False
ROUTER_AUX_LOSS_COEF=0.01
CAPACITY_FACTOR=1.5
- finally, run the following script:
bash shells/train/qwen/preference_distillation.sh
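A sketch of the download step, assuming the RLAIF-V preference data is pulled from its HuggingFace dataset repo; the repo ID and local directory are assumptions to adjust to your setup:

```bash
# Download the RLAIF-V preference data used for Preference Distillation.
# Repo ID and local path are assumptions; adjust to your setup.
huggingface-cli download openbmb/RLAIF-V-Dataset --repo-type dataset --local-dir data/rlaif-v
```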
We follow LLaVA-1.5 to evaluate on comprehension benchmarks (TextVQA, GQA, ScienceQA, VizWiz, MME, MMBench) and RLAIF-V to evaluate on hallucination benchmarks (MMHal-Bench, POPE, and Object HalBench). Please refer to these resources to organize the evaluation datasets. All the evaluation scripts are located under shells/eval. Here is an example for MMBench.
#!/bin/bash
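# Evaluate a model on MMBench and package the predictions for submission.
# MODEL_NAME / MODEL_PATH: name and checkpoint directory of the model to evaluate.
# CONV: conversation template; "qwen" matches the Qwen-based student models.
# SPLIT: MMBench split (dev, English, 2023-10-03 release).
# EVAL: root directory holding the organized evaluation data.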
MODEL_NAME='your_model_name'
MODEL_PATH='your_model_path'
CONV="qwen"
SPLIT="mmbench_dev_en_20231003"
EVAL="benchmark"
deepspeed --include localhost:0 --master_port 20029 llavamod/eval/model_vqa_mmbench.py \
--model-path ${MODEL_PATH} \
--question-file ${EVAL}/mmbench/$SPLIT.tsv \
--answers-file ${EVAL}/mmbench/answers/$SPLIT/${MODEL_NAME}.jsonl \
--single-pred-prompt \
--temperature 0 \
--conv-mode ${CONV}
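
# Convert the raw .jsonl answers into the format expected for MMBench submission.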
mkdir -p ${EVAL}/mmbench/answers_upload/$SPLIT
python3 scripts/convert_mmbench_for_submission.py \
--annotation-file ${EVAL}/mmbench/$SPLIT.tsv \
--result-dir ${EVAL}/mmbench/answers/$SPLIT \
--upload-dir ${EVAL}/mmbench/answers_upload/$SPLIT \
--experiment ${MODEL_NAME}
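A typical run then looks like the following; the script name under `shells/eval` is illustrative, and the exact output paths depend on the values you set above:

```bash
# Run the MMBench evaluation script, then inspect the generated answers and
# the converted files prepared for upload. Script and directory names are illustrative.
bash shells/eval/mmbench.sh
ls benchmark/mmbench/answers/mmbench_dev_en_20231003/
ls benchmark/mmbench/answers_upload/mmbench_dev_en_20231003/
```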