UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
This repo contains evaluation code for the paper "UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios" [AAAI 2025]
🌐 Homepage | 🤗 Dataset | 📑 Paper | 💻 Code | 📖 arXiv
- 🔥[2024.12.11] UrBench has been accepted to AAAI 2025 main track!
We propose UrBench, a multi-view benchmark designed to evaluate LMMs' performance in urban environments. Our benchmark includes 14 urban tasks that we categorize into multiple dimensions. These tasks encompass both region-level evaluations that assess LMMs' capabilities in urban planning and role-level evaluations that examine LMMs' responses to daily issues.
- Region-level and role-level questions. UrBench contains diverse questions at both the region level and the role level, whereas previous benchmarks generally focus on region-level questions.
- Multi-view data. UrBench incorporates both street-view and satellite data, as well as their paired cross-view data. Prior benchmarks generally focus on evaluations from a single view.
- Diverse task types. UrBench contains 14 diverse task types categorized into four task dimensions, while previous benchmarks offer only a limited set of task types such as counting and object recognition (see the illustrative sample sketch below).
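To make the question format concrete, here is a minimal sketch of what a single UrBench record might look like. The field names and values below are illustrative assumptions only, not the actual dataset schema; please refer to the 🤗 Dataset page for the real format.

```python
# Hypothetical UrBench-style record. All field names and values are assumptions
# for illustration; the real schema is defined by the released dataset.
sample = {
    "task": "counting",            # one of the 14 task types
    "level": "role",               # region-level or role-level question
    "images": [                    # paired cross-view inputs
        "street_view_001.jpg",
        "satellite_001.jpg",
    ],
    "question": "How many intersections are visible along the street shown in the satellite image?",
    "choices": ["A. 1", "B. 2", "C. 3", "D. 4"],
    "answer": "B",
}
```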
UrBench poses significant challenges to current SoTA LMMs. We find that the best-performing closed-source model, GPT-4o, and the best open-source model, VILA-1.5-40B, achieve only 61.2% and 53.1% accuracy, respectively. Interestingly, our findings indicate that the primary limitation of these models lies in their ability to comprehend UrBench questions, not in their capacity to process multiple images: multi-image models perform little differently from their single-image counterparts, such as LLaVA-NeXT-Interleave versus LLaVA-NeXT-8B in the table. Overall, the challenging nature of our benchmark indicates that current LMMs' strong performance on general benchmarks does not generalize to multi-view urban scenarios.
Performances of LMMs and human experts on the UrBench test set.

Please clone our repository and change into its directory:
git clone https://github.com/opendatalab/Urbench.git
cd Urbench
Create a new Python environment and install the requirements:
conda create -n urbench python=3.10
conda activate urbench
pip install -e .
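As a quick sanity check that the editable install worked, you can try importing the `lmms_eval` package used by the evaluation command below (assuming the repository installs it, as that command suggests):

```python
# Sanity check: after `pip install -e .`, this import should resolve to the
# cloned repository. Assumes the repo provides the lmms_eval package, which
# the evaluation command below relies on.
import lmms_eval
print(lmms_eval.__file__)
```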
Here's an example of running an evaluation on UrBench's test set with TinyLLaVA:
python -m accelerate.commands.launch --num_processes=2 --main_process_port=10043 -m lmms_eval --model=llava_hf --model_args="pretrained=bczhou/tiny-llava-v1-hf,device=" --log_samples --log_samples_suffix tinyllava --tasks citybench_test_all --output_path ./logs
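After the run finishes, the aggregate results and per-sample logs are written under the `--output_path` directory (`./logs` here). The exact file layout depends on the lmms_eval version, so the sketch below simply walks that directory and prints any aggregate `results` entries it finds; treat the `results` key as an assumption about the log schema rather than a guaranteed field.

```python
# Minimal sketch for inspecting whatever JSON logs the run produced under ./logs.
# The "results" key is an assumption about the log schema, not a guaranteed field.
import json
from pathlib import Path

for path in Path("./logs").rglob("*.json"):
    print(path)
    with open(path) as f:
        data = json.load(f)
    if isinstance(data, dict) and "results" in data:
        print(json.dumps(data["results"], indent=2))
```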
@article{zhou2024urbench,
title={Urbench: A comprehensive benchmark for evaluating large multimodal models in multi-view urban scenarios},
author={Zhou, Baichuan and Yang, Haote and Chen, Dairong and Ye, Junyan and Bai, Tianyi and Yu, Jinhua and Zhang, Songyang and Lin, Dahua and He, Conghui and Li, Weijia},
journal={arXiv preprint arXiv:2408.17267},
year={2024}
}