[Online Demo] [Paper] [Project] [Models]
Figure 1: Multi-token in, Multi-token out Training and Inference.
Figure 2: Unified Foundation Vision Tower.
- [2025/01] 🎉 VILA-U has been accepted to ICLR2025!
- [2024/10] Online demo of VILA-U is available: https://vila-u.mit.edu. Have a try!
- [2024/10] We release the code and models for VILA-U!
VILA-U is a Unified foundation model that integrates Video, Image, Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: the unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception, and autoregressive image generation can achieve similar quality as diffusion models with high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.
git clone https://github.com/mit-han-lab/vila-u
cd vila-u
./environment_setup.sh vila-u
Please download our models from HuggingFace.
git lfs install
git clone https://huggingface.co/mit-han-lab/vila-u-7b-256
Run the following command to launch a local gradio demo:
CUDA_VISIBLE_DEVICES=0 python app.py --model_path path/to/your_downloaded_model
# Image Understanding
CUDA_VISIBLE_DEVICES=0 python inference.py --model_path path/to/your_downloaded_model --image_path assets/example_image1.jpg --query "Can you describe what is happening?"
# Video Understanding
CUDA_VISIBLE_DEVICES=0 python inference.py --model_path path/to/your_downloaded_model --video_path assets/example_video1.mp4 --query "Elaborate on the visual and narrative elements of the video in detail."
# Image Generation
CUDA_VISIBLE_DEVICES=0 python inference.py --model_path path/to/your_downloaded_model --prompt "A snowy mountain." --save_path path/to/save_images --generation_nums 8
# Video Generation
CUDA_VISIBLE_DEVICES=0 python inference.py --model_path path/to/your_downloaded_model --prompt "Fireworks in the air." --video_generation True --save_path path/to/save_videos
Evaluate VILA-U on visual language benchmarks with the following command:
vila_u-eval -m path/to/model -c vicuna_v1 -ti local
Please refer to vila_u/cli/eval.py
for more argument details.
Note: Please prepare data before training. Data preparation details are in the file vila_u/data/datasets_mixture.py
.
# Pretrain
srun -p your_slurm_partition -N 8 -t 04:00:00 -A your_slurm_account -J vila-u:pretrain --gpus-per-node 8 --exclusive --dependency singleton bash scripts/train/pretrain.sh &
# SFT
srun -p your_slurm_partition -N 8 -t 04:00:00 -A your_slurm_account -J vila-u:sft --gpus-per-node 8 --exclusive --dependency singleton bash scripts/train/sft.sh &
We thank Zhijian Liu from NVIDIA for his assistance with the evaluation setup.
If you find VILA-U useful or relevant to your project and research, please kindly cite our paper:
@article{wu2024vila,
title={Vila-u: a unified foundation model integrating visual understanding and generation},
author={Wu, Yecheng and Zhang, Zhuoyang and Chen, Junyu and Tang, Haotian and Li, Dacheng and Fang, Yunhao and Zhu, Ligeng and Xie, Enze and Yin, Hongxu and Yi, Li and others},
journal={arXiv preprint arXiv:2409.04429},
year={2024}
}