This repo contains evaluation code for the paper "UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios" [AAAI 2025]

🌐 Homepage | 🤗 Dataset | 📑 Paper | 💻 Code | 📖 arXiv

🎉 News

  • 🔥[2024.12.11] UrBench has been accepted to the AAAI 2025 main track!

Introduction

We propose UrBench, a multi-view benchmark designed to evaluate LMMs’ performance in urban environments. The benchmark includes 14 urban tasks organized into four dimensions. These tasks encompass both region-level evaluations that assess LMMs’ capabilities in urban planning and role-level evaluations that examine how LMMs respond to citizens’ daily issues.

UrBench Overview

Comparison with Existing Benchmarks

Compared to previous benchmarks, UrBench offers:
  • Region-level and role-level questions. UrBench contains diverse questions at both the region and role levels, whereas previous benchmarks generally focus on region-level questions.
  • Multi-view data. UrBench incorporates both street-view and satellite data, as well as paired cross-view data, whereas prior benchmarks generally evaluate from a single view.
  • Diverse task types. UrBench contains 14 diverse task types categorized into four task dimensions, whereas previous benchmarks offer only a limited set of task types such as counting and object recognition.

Evaluation Results

UrBench poses significant challenges to current SoTA LMMs. We find that the best-performing closed-source model, GPT-4o, and the best open-source model, VILA-1.5-40B, achieve only 61.2% and 53.1% accuracy, respectively. Interestingly, our findings indicate that the primary limitation of these models lies in comprehending UrBench questions rather than in processing multiple images: multi-image models and their single-image counterparts, such as LLaVA-NeXT-Interleave and LLaVA-NeXT-8B in the table, perform similarly. Overall, the challenging nature of our benchmark indicates that current LMMs’ strong performance on general benchmarks does not generalize to multi-view urban scenarios.

Performance of LMMs and human experts on the UrBench test set.

📊 Evaluation

🛠️ Installation

Please clone our repository and change into its directory:

git clone https://github.com/opendatalab/UrBench.git
cd UrBench

Create a new Python environment and install the requirements:

conda create -n urbench python=3.10
conda activate urbench
pip install -e .
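
As an optional sanity check that the install registered the benchmark tasks, the sketch below assumes the lmms-eval convention of listing tasks with --tasks list; the UrBench tasks use the citybench prefix that appears in the evaluation example below.

# Optional sanity check (a sketch, assuming lmms-eval supports "--tasks list")
python -m lmms_eval --tasks list | grep -i citybench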

Start evaluating

Here is an example of running evaluation on UrBench's test set with TinyLLaVA:

python -m accelerate.commands.launch --num_processes=2 --main_process_port=10043 \
    -m lmms_eval \
    --model=llava_hf \
    --model_args="pretrained="bczhou/tiny-llava-v1-hf",device=""" \
    --log_samples --log_samples_suffix tinyllava \
    --tasks citybench_test_all \
    --output_path ./logs
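
If only a single GPU is available, the accelerate launcher can typically be dropped. The following is a minimal single-process sketch derived from the command above (same model, task, and output path), not a verbatim command from this repository.

# Single-process sketch (assumes one GPU and the same lmms-eval CLI as above)
python -m lmms_eval --model=llava_hf \
    --model_args="pretrained=bczhou/tiny-llava-v1-hf" \
    --tasks citybench_test_all \
    --log_samples --log_samples_suffix tinyllava \
    --output_path ./logs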

Citation

@article{zhou2024urbench,
  title={{UrBench}: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios},
  author={Zhou, Baichuan and Yang, Haote and Chen, Dairong and Ye, Junyan and Bai, Tianyi and Yu, Jinhua and Zhang, Songyang and Lin, Dahua and He, Conghui and Li, Weijia},
  journal={arXiv preprint arXiv:2408.17267},
  year={2024}
}
