This repo contains evaluation code for the paper "UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios" [AAAI 2025]

🌐 Homepage | 🤗 Dataset | 📑 Paper | 💻 Code | 📖 arXiv

🎉 News

  • 🔥[2024.12.11] UrBench has been accepted to the AAAI 2025 main track!

Introduction

We propose UrBench, a multi-view benchmark designed to evaluate LMMs’ performance in urban environments. The benchmark includes 14 urban tasks organized into four dimensions. These tasks encompass both region-level evaluations that assess LMMs’ capabilities in urban planning and role-level evaluations that examine how LMMs respond to citizens’ daily issues.

UrBench Overview

Comparison with Existing Benchmarks

Compared to previous benchmarks, UrBench offers:
  • Region-level and role-level questions. UrBench contains diverse questions at both the region and role levels, whereas previous benchmarks generally focus on region-level questions.
  • Multi-view data. UrBench incorporates both street-view and satellite data, as well as paired cross-view data, whereas prior benchmarks generally evaluate from a single view.
  • Diverse task types. UrBench contains 14 diverse task types categorized into four task dimensions, whereas previous benchmarks offer only a limited set of task types such as counting and object recognition.

Evaluation Results

UrBench poses significant challenges to current SoTA LMMs. We find that the best-performing closed-source model, GPT-4o, and the best open-source model, VILA-1.5-40B, achieve only 61.2% and 53.1% accuracy, respectively. Interestingly, our findings indicate that the primary limitation of these models lies in comprehending UrBench questions rather than in processing multiple images: multi-image models and their single-image counterparts, such as LLaVA-NeXT-Interleave and LLaVA-NeXT-8B in the table, perform similarly. Overall, the challenging nature of our benchmark indicates that current LMMs’ strong performance on general benchmarks does not generalize to multi-view urban scenarios.

Performance of LMMs and human experts on the UrBench test set.

📊 Evaluation

🛠️ Installation

Please clone our repository and change into its directory:

git clone https://github.com/opendatalab/UrBench.git
cd UrBench

Create a new Python environment and install the requirements:

conda create -n urbench python=3.10
conda activate urbench
pip install -e .
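
As an optional sanity check that the install registered the benchmark tasks, the sketch below assumes the lmms-eval convention of listing tasks with --tasks list; the UrBench tasks use the citybench prefix that appears in the evaluation example below.

# Optional sanity check (a sketch, assuming lmms-eval supports "--tasks list")
python -m lmms_eval --tasks list | grep -i citybench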

Start evaluating

Here is an example of running evaluation on UrBench's test set with TinyLLaVA:

python -m accelerate.commands.launch --num_processes=2 --main_process_port=10043 \
    -m lmms_eval \
    --model=llava_hf \
    --model_args="pretrained="bczhou/tiny-llava-v1-hf",device=""" \
    --log_samples --log_samples_suffix tinyllava \
    --tasks citybench_test_all \
    --output_path ./logs
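
If only a single GPU is available, the accelerate launcher can typically be dropped. The following is a minimal single-process sketch derived from the command above (same model, task, and output path), not a verbatim command from this repository.

# Single-process sketch (assumes one GPU and the same lmms-eval CLI as above)
python -m lmms_eval --model=llava_hf \
    --model_args="pretrained=bczhou/tiny-llava-v1-hf" \
    --tasks citybench_test_all \
    --log_samples --log_samples_suffix tinyllava \
    --output_path ./logs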

Citation

@article{zhou2024urbench,
  title={{UrBench}: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios},
  author={Zhou, Baichuan and Yang, Haote and Chen, Dairong and Ye, Junyan and Bai, Tianyi and Yu, Jinhua and Zhang, Songyang and Lin, Dahua and He, Conghui and Li, Weijia},
  journal={arXiv preprint arXiv:2408.17267},
  year={2024}
}
