
Hallucination Elimination and Semantic Enhancement Framework for Vision-Language Models in Traffic Scenarios


Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation tasks. However, these models occasionally generate hallucinatory texts, resulting in descriptions that seem reasonable but do not correspond to the image. This phenomenon can lead to incorrect driving decisions by autonomous driving systems. To address this challenge, this paper proposes HCOENet, a plug-and-play chain-of-thought correction method designed to eliminate object hallucinations and generate enhanced descriptions for critical objects overlooked in the initial response. Specifically, HCOENet employs a cross-checking mechanism to filter entities and directly extracts critical objects from the given image, enriching the descriptive text. Experimental results on the POPE benchmark demonstrate that HCOENet improves the F1-score of the Mini-InternVL-4B and mPLUG-Owl3 models by 12.58% and 4.28%, respectively. Additionally, qualitative results on images collected in an open campus scene further highlight the practical applicability of the proposed method. Compared with the GPT-4o model, HCOENet achieves comparable descriptive performance while significantly reducing costs. Finally, two novel semantic understanding datasets, CODA_desc and nuScenes_desc, are created for traffic scenarios to support future research.

If you have any questions, please feel free to email [email protected].

🔥 News

📖 Model

(HCOENet framework overview figure)
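
The flow can be summarized with the short sketch below. It only illustrates the two branches described in the abstract (stage numbers follow Table 1 in the results section); all helper callables are hypothetical placeholders, not the repository's actual API.

from typing import Callable, List

def hcoenet_refine(
    caption: str,
    image_path: str,
    split_sentences: Callable[[str], List[str]],          # stage 1: sentence split
    extract_entities: Callable[[str], List[str]],         # stage 1: key entity extraction
    entity_in_image: Callable[[str, str], bool],          # stage 2: entity cross-checking (VQA models)
    detect_critical_objects: Callable[[str], List[str]],  # stage 4: critical-object identification
    describe_object: Callable[[str, str], str],           # stage 5: object description
) -> str:
    # Branch 1: hallucination elimination. Sentences whose entities fail the
    # cross-check are simply dropped here; the paper's stage 3 corrects them instead.
    kept = [s for s in split_sentences(caption)
            if all(entity_in_image(image_path, e) for e in extract_entities(s))]

    # Branch 2: semantic enhancement for critical objects the caption misses.
    mentioned = {e for s in kept for e in extract_entities(s)}
    extra = [describe_object(image_path, obj)
             for obj in detect_critical_objects(image_path)
             if obj not in mentioned]

    # Stage 6: integrate the descriptions from the two branches.
    return " ".join(kept + extra)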

💊 Installation

  1. LLaVA: https://github.com/haotian-liu/LLaVA
  2. mPLUG-Owl: https://github.com/X-PLUG/mPLUG-Owl
  3. MiniGPT-4: https://github.com/Vision-CAIR/MiniGPT-4
  4. InternVL: https://github.com/OpenGVLab/InternVL
  5. BLIP-2: https://huggingface.co/Salesforce/blip2-flan-t5-xxl
  6. InstructBLIP: https://huggingface.co/Salesforce/instructblip-flan-t5-xxl
  7. RAM: https://github.com/xinyu1205/recognize-anything
  8. GroundingDINO: https://github.com/IDEA-Research/GroundingDINO

⭐ Inference

  1. Download the traffic dataset from the CODA website (https://coda-dataset.github.io).
  2. Generate question-answer pairs using the POPE code (see the POPE sampling sketch after this list):
$ python POPE codes/CODA2022/CODA2022_pope_random.json
$ python POPE codes/CODA2022/CODA2022_pope_popular.json
$ python POPE codes/CODA2022/CODA2022_pope_adversarial.json
  3. Generate the initial response for each image using a specific LVLM, such as LLaVA-1.5 or mPLUG-Owl.
  4. Generate the refined response using the HCOENet framework (see the cross-checking and critical-object sketches after this list):
$ python inference_split_sents.py
$ python inference_named_entity.py
$ python inference_blip2_3.py
$ python inference_instructblip_3.py
$ python inference_entity_update_3.py
$ python inference_groundingdino_4.py
$ python inference_groundingdino_words_update_4.py
$ python inference_groundingdino_write_captions_5.py
$ python inference_entity_captions_update_6.py
  5. Evaluate the model on the POPE benchmark (see the metrics sketch after this list).
  6. Generate more refined descriptions using the nuScenes dataset.
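
For step 2, the three JSON files correspond to POPE's random, popular, and adversarial negative-sampling settings. The sketch below shows how such yes/no probes are typically built; the field names and sampling details are assumptions based on the POPE benchmark, not the exact format of the files under POPE codes/CODA2022/.

import random
from collections import Counter

def build_pope_probes(image_id, objects_in_image, all_image_objects, setting="random", k=3):
    # Positive probes: objects annotated in this image.
    probes = [{"image": image_id, "text": f"Is there a {obj} in the image?", "label": "yes"}
              for obj in objects_in_image[:k]]

    # Candidate negatives: objects that appear elsewhere in the dataset but not here.
    absent = [o for objs in all_image_objects.values() for o in objs
              if o not in objects_in_image]

    if setting == "random":        # uniformly sampled absent objects
        negatives = random.sample(sorted(set(absent)), k)
    elif setting == "popular":     # most frequent absent objects
        negatives = [o for o, _ in Counter(absent).most_common(k)]
    else:                          # "adversarial": absent objects that co-occur most often
        co_occurring = Counter(o for objs in all_image_objects.values()
                               if set(objs) & set(objects_in_image)
                               for o in objs if o not in objects_in_image)
        negatives = [o for o, _ in co_occurring.most_common(k)]

    probes += [{"image": image_id, "text": f"Is there a {obj} in the image?", "label": "no"}
               for obj in negatives]
    return probes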
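
For step 4, the inference_blip2_3.py, inference_instructblip_3.py, and inference_entity_update_3.py stages cross-check the entities extracted from the initial caption against the image. A minimal sketch of that idea is shown below, using the HuggingFace transformers API and the checkpoints listed under Installation; how the repository actually combines the two models' answers may differ.

import torch
from PIL import Image
from transformers import (Blip2Processor, Blip2ForConditionalGeneration,
                          InstructBlipProcessor, InstructBlipForConditionalGeneration)

device = "cuda"  # the flan-t5-xxl checkpoints are large; in practice load them
                 # in half precision or 8-bit to fit in GPU memory
blip2_proc = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl").to(device)
iblip_proc = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-flan-t5-xxl")
iblip = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-flan-t5-xxl").to(device)

def model_says_yes(model, processor, image, entity):
    # Ask a closed-form existence question and check the short generated answer.
    prompt = f"Question: Is there a {entity} in the image? Answer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=5)
    answer = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip().lower()
    return answer.startswith("yes")

def cross_check(image_path, entities):
    # Keep an entity only when both VQA models agree it appears in the image.
    image = Image.open(image_path).convert("RGB")
    return [e for e in entities
            if model_says_yes(blip2, blip2_proc, image, e)
            and model_says_yes(iblip, iblip_proc, image, e)]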
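
The inference_groundingdino_*.py steps then locate critical objects so they can be described and merged into the caption. The sketch below follows GroundingDINO's documented quickstart; the category prompt, config/checkpoint paths, and thresholds are placeholders rather than values taken from this repository.

from groundingdino.util.inference import load_model, load_image, predict

# Paths follow GroundingDINO's own quickstart; adjust them to your local setup.
model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py",
                   "weights/groundingdino_swint_ogc.pth")

# Example open-vocabulary prompt for traffic scenes (placeholder categories).
TEXT_PROMPT = "car . truck . pedestrian . cyclist . traffic cone . traffic light ."

image_source, image = load_image("example_traffic_image.jpg")
boxes, logits, phrases = predict(model=model,
                                 image=image,
                                 caption=TEXT_PROMPT,
                                 box_threshold=0.35,
                                 text_threshold=0.25)

# The grounded phrases are the candidate critical objects to describe
# and integrate into the refined caption.
critical_objects = sorted(set(phrases))
print(critical_objects)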
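
For step 5, POPE evaluation reduces to binary classification over the yes/no probes. The small sketch below computes the standard metrics (including the F1-score reported in the tables that follow), assuming the model answers have already been normalized to "yes"/"no".

def pope_metrics(labels, answers):
    # labels/answers: parallel lists of "yes"/"no" strings, one entry per probe.
    pairs = list(zip(labels, answers))
    tp = sum(l == "yes" and a == "yes" for l, a in pairs)
    fp = sum(l == "no" and a == "yes" for l, a in pairs)
    fn = sum(l == "yes" and a == "no" for l, a in pairs)
    tn = sum(l == "no" and a == "no" for l, a in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example: pope_metrics(["yes", "no", "yes"], ["yes", "yes", "no"])
# -> accuracy ~0.33, precision = recall = f1 = 0.5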

🏆 Experimental Results

Ablation studies

Table 1. Ablation studies of the effectiveness of each stage in HCOENet. Stage 1 refers to sentence splitting and key entity extraction, Stage 2 to entity cross-checking, Stage 3 to hallucination correction, Stage 4 to critical-object identification, Stage 5 to object description, and Stage 6 to integrating the descriptions from the two frameworks. (%)

Quantitative results

Table 2. Evaluation results of five LVLMs on the POPE benchmark under three negative sampling settings. (%)
Table 3. Comparison results between different mPLUG-Owl models on the POPE benchmark. (%)
Table 4. Comparison with the GPT-4o model on the POPE benchmark. B denotes billion and T denotes trillion. (%)

Qualitative results

(Qualitative comparison figure)

🌻 Acknowledgement

This repository benefits from the open-source projects listed in the Installation section. Thanks for their awesome work.

📜 Citation

@article{fan2024hallucination,
  title={Hallucination Elimination and Semantic Enhancement Framework for Vision-Language Models in Traffic Scenarios},
  author={Fan, Jiaqi and Wu, Jianhua and Chu, Hongqing and Ge, Quanbo and Gao, Bingzhao},
  journal={arXiv preprint arXiv:2412.07518},
  year={2024}
}
