Awesome-ECCV2024-AIGC

A Collection of Papers and Codes for ECCV2024 AIGC

This repository collects this year's ECCV papers and code related to AIGC, as listed below.

Please feel free to star, fork, or open a PR if you find this helpful~

Please credit the source when referencing or reposting.

ECCV 2024 official website: https://eccv.ecva.net/

ECCV accepted papers list:

ECCV full paper archive: https://www.ecva.net/papers.php

Conference dates: September 29 - October 4, 2024

Paper acceptance announcement: 2024

【Contents】

1. Image Generation/Image Synthesis
2. Image Editing
3. Video Generation/Video Synthesis
4. Video Editing
5. 3D Generation/3D Synthesis
6. 3D Editing
7. Multi-Modal Large Language Model
8. Others

1. Image Generation/Image Synthesis

∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions

Paper: https://arxiv.org/abs/2407.14709
Code:

Accelerating Diffusion Sampling with Optimized Time Steps

Paper: https://arxiv.org/abs/2402.17376
Code: https://github.com/scxue/DM-NonUniform

Accelerating Image Generation with Sub-path Linear Approximation Model

Paper: https://arxiv.org/abs/2404.13903
Code: https://github.com/MCG-NJU/SPLAM

AccDiffusion: An Accurate Method for Higher-Resolution Image Generation

Paper: https://arxiv.org/abs/2407.10738v1
Code: https://github.com/lzhxmu/AccDiffusion

AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation

Paper: https://arxiv.org/abs/2409.00342
Code: https://github.com/LeapLabTHU/AdaNAT

AID-AppEAL: Automatic Image Dataset and Algorithm for Content Appeal Enhancement and Assessment Labeling

Paper: https://arxiv.org/abs/2407.05546v1
Code: https://github.com/SherryXTChen/AID-Appeal

AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation

Paper: https://arxiv.org/abs/2406.18958
Code: https://github.com/open-mmlab/AnyControl

Arc2Face: A Foundation Model for ID-Consistent Human Faces

Paper: https://arxiv.org/abs/2403.11641
Code: https://github.com/foivospar/Arc2Face

Assessing Sample Quality via the Latent Space of Generative Models

Paper: https://arxiv.org/abs/2407.15171
Code: https://github.com/cvlab-stonybrook/LS-sample-quality

AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild

Paper: https://arxiv.org/abs/2407.18034
Code: https://github.com/redorangeyellowy/AttentionHand

A Watermark-Conditioned Diffusion Model for IP Protection

Paper: https://arxiv.org/abs/2403.10893
Code: https://github.com/rmin2000/WaDiff

Beta-Tuned Timestep Diffusion Model

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/328_ECCV_2024_paper.php
Code:

BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Paper: https://arxiv.org/abs/2404.04544
Code: https://github.com/gwang-kim/BeyondScene

Block-removed Knowledge-distilled Stable Diffusion

Paper: https://arxiv.org/abs/2305.15798
Code: https://github.com/Nota-NetsPresso/BK-SDM

Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation

Paper: https://arxiv.org/abs/2403.07860
Code: https://github.com/ShihaoZhaoZSH/LaVi-Bridge

COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation

Paper: https://arxiv.org/abs/2407.11294
Code: https://github.com/Arking1995/COHO

ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement

Paper: https://arxiv.org/abs/2407.07197
Code: https://github.com/moatifbutt/color-peel

ComFusion: Personalized Subject Generation in Multiple Specific Scenes From Single Image

Paper: https://arxiv.org/abs/2402.11849
Code:

ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

Paper: https://arxiv.org/abs/2407.07077
Code: https://github.com/haoosz/ConceptExpress

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

Paper: https://arxiv.org/abs/2404.07987
Code: https://github.com/liming-ai/ControlNet_Plus_Plus

Co-synthesis of Histopathology Nuclei Image-Label Pairs using a Context-Conditioned Joint Diffusion Model

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2037_ECCV_2024_paper.php
Code:

D4-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On

Paper: https://arxiv.org/abs/2407.15111
Code:

Data Augmentation for Saliency Prediction via Latent Diffusion

Paper:
Code: https://github.com/IVRL/AugSal

DataDream: Few-shot Guided Dataset Generation

Paper: https://arxiv.org/abs/2407.10910
Code: https://github.com/ExplainableML/DataDream

DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation

Paper: https://arxiv.org/abs/2409.03755
Code: https://github.com/wl-zhao/DC-Solver

Defect Spectrum: A Granular Look of Large-Scale Defect Datasets with Rich Semantics

Paper: https://arxiv.org/abs/2310.17316
Code: https://github.com/EnVision-Research/Defect_Spectrum

DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

Paper: https://arxiv.org/abs/2312.03048
Code: https://github.com/prs-eth/DGInStyle

DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

Paper:
Code: https://github.com/murphytju/DiffFAS

DiffiT: Diffusion Vision Transformers for Image Generation

Paper: https://arxiv.org/abs/2312.02139
Code: https://github.com/NVlabs/DiffiT

Diffusion2GAN: Distilling Diffusion Models into Conditional GANs

Paper: https://arxiv.org/abs/2405.05967
Code: https://github.com/mingukkang/elatentlpips

Efficient Training with Denoised Neural Weights

Paper: https://arxiv.org/abs/2407.11966
Code:

Energy-Calibrated VAE with Test Time Free Lunch

Paper: https://arxiv.org/abs/2311.04071
Code: https://github.com/Luo-Yihong/EC-VAE

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

Paper: https://arxiv.org/abs/2311.15657
Code: https://github.com/chaofengc/TexForce

Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models

Paper: https://arxiv.org/abs/2403.06381
Code: https://github.com/YaNgZhAnG-V5/attention_regulation

FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

Paper: https://arxiv.org/abs/2403.12963
Code: https://github.com/LeonHLJ/FouriScale

FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2529_ECCV_2024_paper.php
Code: https://github.com/aim-uofa/FreeCompose

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Paper: https://arxiv.org/abs/2404.01197
Code: https://github.com/SPRIGHT-T2I/SPRIGHT

Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

Paper: https://arxiv.org/abs/2403.09622
Code: https://github.com/AIGText/Glyph-ByT5

HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

Paper: https://arxiv.org/abs/2311.17528
Code: https://github.com/megvii-research/HiDiffusion

HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

Paper: https://arxiv.org/abs/2407.06937
Code: https://github.com/Enderfga/HumanRefiner

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Paper: https://arxiv.org/abs/2403.05139
Code: https://github.com/yisol/IDM-VTON

Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance

Paper: https://arxiv.org/abs/2406.04551
Code: https://github.com/facebookresearch/Contextualized-Vendi-Score-Guidance

Improving Virtual Try-On with Garment-focused Diffusion Models

Paper: https://arxiv.org/abs/2409.08258
Code: https://github.com/siqi0905/GarDiff/tree/master

Inserting Anybody in Diffusion Models via Celeb Basis

Paper: https://arxiv.org/abs/2306.00926
Code: https://github.com/rishubhpar/PreciseControl

Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models

Paper:
Code: https://github.com/liuxiao-guan/IET_AGC

Large-scale Reinforcement Learning for Diffusion Models

Paper: https://arxiv.org/abs/2401.12244
Code: https://github.com/pinterest/atg-research/tree/main/joint-rl-diffusion

Latent Guard: a Safety Framework for Text-to-image Generation

Paper: https://arxiv.org/abs/2404.08031
Code: https://github.com/rt219/LatentGuard

LayoutFlow: Flow Matching for Layout Generation

Paper: https://arxiv.org/abs/2403.18187
Code: https://github.com/JulianGuerreiro/LayoutFlow

Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/1066_ECCV_2024_paper.php
Code:

Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2309_ECCV_2024_paper.php
Code:

LogoSticker: Inserting Logos into Diffusion Models for Customized Generation

Paper: https://arxiv.org/abs/2407.13752
Code:

Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models

Paper:
Code: https://github.com/RossoneriZhao/iced_coke

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Paper: https://arxiv.org/abs/2402.10491
Code: https://github.com/GuoLanqing/Self-Cascade

MasterWeaver: Taming Editability and Identity for Personalized Text-to-Image Generation

Paper: https://arxiv.org/abs/2405.05806
Code: https://github.com/csyxwei/MasterWeaver

Memory-Efficient Fine-Tuning for Quantized Diffusion Model

Paper: https://arxiv.org/abs/2401.04339
Code: https://github.com/ugonfor/TuneQDM

Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

Paper: https://arxiv.org/abs/2408.15660
Code: https://github.com/aimagelab/MAD

MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

Paper: https://arxiv.org/abs/2405.17873
Code: https://github.com/thu-nics/MixDQ

Navigating Text-to-Image Generative Bias across Indic Languages

Paper: https://arxiv.org/abs/2408.00283
Code: https://github.com/surbhim18/IndicTTI

NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation

Paper: https://arxiv.org/abs/2403.18211
Code: https://github.com/jingyanghuo/neuropictor

Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

Paper: https://arxiv.org/abs/2404.07389
Code: https://github.com/YasminZhang/EBAMA

OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models

Paper: https://arxiv.org/abs/2403.10983
Code: https://github.com/kongzhecn/OMG

One-Shot Diffusion Mimicker for Handwritten Text Generation

Paper: https://arxiv.org/abs/2409.04004
Code: https://github.com/dailenson/One-DM

PartCraft: Crafting Creative Objects by Parts

Paper: https://arxiv.org/abs/2311.15477
Code: https://github.com/kamwoh/partcraft

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

Paper: https://arxiv.org/abs/2403.04692
Code: https://github.com/PixArt-alpha/PixArt-sigma

Post-training Quantization for Text-to-Image Diffusion Models with Progressive Calibration and Activation Relaxing

Paper:
Code:

PosterLlama: Bridging Design Ability of Langauge Model to Contents-Aware Layout Generation

Paper: https://arxiv.org/abs/2404.00995
Code:

Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Paper: https://arxiv.org/abs/2407.06642
Code: https://github.com/wfanyue/DPG-T2I-Personalization

ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation

Paper: https://arxiv.org/abs/2408.02226
Code: https://github.com/Agentic-Learning-AI-Lab/procreate-diffusion-public

Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers

Paper: https://arxiv.org/abs/2311.17717
Code:

ReGround: Improving Textual and Spatial Grounding at No Cost

Paper: https://arxiv.org/abs/2403.13589
Code:

Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models

Paper: https://arxiv.org/abs/2407.12383
Code: https://github.com/CharlesGong12/RECE

Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion

Paper: https://arxiv.org/abs/2407.21032
Code: https://github.com/nannullna/safeguard-hfi

Self-Guided Generation of Minority Samples Using Diffusion Models

Paper: https://arxiv.org/abs/2407.11555
Code: https://github.com/soobin-um/sg-minority

Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance

Paper: https://arxiv.org/abs/2403.17377
Code: https://github.com/sunovivid/Perturbed-Attention-Guidance

SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow

Paper: https://arxiv.org/abs/2407.12718
Code: https://github.com/yuanzhi-zhu/SlimFlow

SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions

Paper: https://arxiv.org/abs/2404.06451
Code: https://github.com/liuxiaoyu1104/SmartControl

SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

Paper: https://arxiv.org/abs/2408.14176
Code:

Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts

Paper: https://arxiv.org/abs/2403.09176
Code: https://github.com/byeongjun-park/Switch-DiT

StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion

Paper: https://arxiv.org/abs/2404.05979
Code: https://github.com/tobran/StoryImager

T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models

Paper:
Code: https://github.com/Robin-WZQ/T2IShield

The Gaussian Discriminant Variational Autoencoder (GdVAE): A Self-Explainable Model with Counterfactual Explanations

Paper: https://arxiv.org/abs/2409.12952
Code: https://github.com/trustinai/gdvaecode

Timestep-Aware Correction for Quantized Diffusion Models

Paper: https://arxiv.org/abs/2407.03917
Code:

Towards Reliable Advertising Image Generation Using Human Feedback

Paper: https://arxiv.org/abs/2408.00418
Code: https://github.com/ZhenbangDu/Reliable_AD

Training-free Composite Scene Generation for Layout-to-Image Synthesis

Paper: https://arxiv.org/abs/2407.13609
Code:

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

Paper: https://arxiv.org/abs/2312.04884
Code: https://github.com/ZYM-PKU/UDiffText

Unmasking Bias in Diffusion Model Training

Paper: https://arxiv.org/abs/2310.08442
Code: https://github.com/yuhuUSTC/Debias

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Paper: https://arxiv.org/abs/2404.02905
Code: https://github.com/FoundationVision/VAR

Zero-shot Text-guided Infinite Image Synthesis with LLM guidance

Paper: https://arxiv.org/abs/2407.12642
Code:

ZigMa: A DiT-Style Mamba-based Diffusion Model

Paper: https://arxiv.org/abs/2403.13802
Code: https://github.com/CompVis/zigma

2. Image Editing

A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

Paper: https://arxiv.org/abs/2312.03594
Code: https://github.com/open-mmlab/PowerPaint

BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

Paper: https://arxiv.org/abs/2403.06976
Code: https://github.com/TencentARC/BrushNet

COMPOSE: Comprehensive Portrait Shadow Editing

Paper: https://arxiv.org/abs/2408.13922
Code:

CQS: CBAM and Query-Selection Diffusion Model for text-driven Content-aware Image Style Transfer

Paper:
Code: https://github.com/john09282922/CQS

Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/1909_ECCV_2024_paper.php
Code: https://github.com/JS-Lee525/PIC

DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation

Paper: https://arxiv.org/abs/2403.11415
Code: https://github.com/DreamSampler/dream-sampler

EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2096_ECCV_2024_paper.php
Code:

Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning

Paper: https://arxiv.org/abs/2406.04413
Code: https://github.com/VIROBO-15/Efficient-3D-Aware-Facial-Image-Editing

Enhanced Controllability of Diffusion Models via Feature Disentanglement and Realism-Enhanced Sampling Methods

Paper: https://arxiv.org/abs/2302.14368
Code:

Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2157_ECCV_2024_paper.php
Code: https://github.com/furiosa-ai/eta-inversion

Every Pixel Has its Moments: Ultra-High-Resolution Unpaired Image-to-Image Translation via Dense Normalization

Paper: https://arxiv.org/abs/2407.04245
Code: https://github.com/Kaminyou/Dense-Normalization

Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

Paper: https://arxiv.org/abs/2405.12970
Code: https://github.com/FaceAdapter/Face-Adapter

Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation

Paper: https://arxiv.org/abs/2312.14223
Code: https://github.com/nina-weng/FastDiME_Med

FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing

Paper:
Code: https://github.com/kookie12/FlexiEdit

FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/759_ECCV_2024_paper.php
Code: https://github.com/Thermal-Dynamics/FreeDiff

GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections

Paper: https://arxiv.org/abs/2408.12352
Code:

Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing

Paper: https://arxiv.org/abs/2409.01322
Code: https://github.com/FusionBrainLab/Guide-and-Rescale

GroupDiff: Diffusion-based Group Portrait Editing

Paper: https://arxiv.org/abs/2409.14379
Code: https://github.com/yumingj/GroupDiff

InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser

Paper: https://arxiv.org/abs/2311.15040
Code: https://github.com/cuixing100876/InstaStyle

InstructGIE: Towards Generalizable Image Editing

Paper: https://arxiv.org/abs/2403.05018
Code:

Lazy Diffusion Transformer for Interactive Image Editing

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/3436_ECCV_2024_paper.php
Code:

Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling

Paper: https://arxiv.org/abs/2409.13431
Code: https://github.com/wzx99/TMIM

MERLiN: Single-Shot Material Estimation and Relighting for Photometric Stereo

Paper: https://arxiv.org/abs/2409.00674
Code:

Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization

Paper: https://arxiv.org/abs/2308.14469
Code: https://github.com/yangxy/PASD

RadEdit: stress-testing biomedical vision models via diffusion image editing

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/1923_ECCV_2024_paper.php
Code:

Real-time 3D-aware Portrait Editing from a Single Image

Paper: https://arxiv.org/abs/2402.14000
Code: https://github.com/EzioBy/3dpe

RegionDrag: Fast Region-Based Image Editing with Diffusion Models

Paper: https://arxiv.org/abs/2407.18247
Code: https://github.com/Visual-AI/RegionDrag

Robust-Wide: Robust Watermarking against Instruction-driven Image Editing

Paper: https://arxiv.org/abs/2402.12688
Code: https://github.com/hurunyi/Robust-Wide

ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model

Paper: https://arxiv.org/abs/2404.04833
Code:

Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

Paper: https://arxiv.org/abs/2403.11105
Code: https://github.com/leeruibin/SPDInv

StableDrag: Stable Dragging for Point-based Image Editing

Paper: https://arxiv.org/abs/2403.04437
Code:

StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models

Paper: https://arxiv.org/abs/2409.02543
Code: https://github.com/alipay/style-tokenizer

Taming Latent Diffusion Model for Neural Radiance Field Inpainting

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/354_ECCV_2024_paper.php
Code:

TinyBeauty: Toward Tiny and High-quality Facial Makeup with Data Amplify Learning

Paper: https://arxiv.org/abs/2403.15033
Code: https://github.com/TinyBeauty/TinyBeauty

Tuning-Free Image Customization with Image and Text Guidance

Paper: https://arxiv.org/abs/2403.12658
Code:

TurboEdit: Instant text-based image editing

Paper: https://arxiv.org/abs/2408.08332
Code:

Watch Your Steps: Local Image and Scene Editing by Text Instructions

Paper: https://arxiv.org/abs/2308.08947
Code: https://github.com/SamsungLabs/WatchYourSteps

3. Video Generation/Video Synthesis

Animate Your Motion: Turning Still Images into Dynamic Videos

Paper: https://arxiv.org/abs/2403.10179
Code: https://github.com/Mingxiao-Li/Animate-Your-Motion

Audio-Synchronized Visual Animation

Paper: https://arxiv.org/abs/2403.05659
Code: https://github.com/lzhangbj/ASVA

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance

Paper: https://arxiv.org/abs/2403.14781
Code: https://github.com/fudan-generative-vision/champ

Dyadic Interaction Modeling for Social Behavior Generation

Paper: https://arxiv.org/abs/2403.09069
Code: https://github.com/Boese0601/Dyadic-Interaction-Modeling

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Paper: https://arxiv.org/abs/2310.12190
Code: https://github.com/Doubiiu/DynamiCrafter

EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis

Paper: https://arxiv.org/abs/2404.01647
Code: https://github.com/tanshuai0219/EDTalk

FreeInit: Bridging Initialization Gap in Video Diffusion Models

Paper: https://arxiv.org/abs/2312.07537
Code: https://github.com/TianxingWu/FreeInit

Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Paper: https://arxiv.org/abs/2402.13729
Code: https://github.com/hxngiee/HVDM

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation

Paper: https://arxiv.org/abs/2407.10937
Code: https://github.com/yhZhai/idol

Kinetic Typography Diffusion Model

Paper: https://arxiv.org/abs/2407.10476
Code: https://github.com/SeonmiP/KineTy

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2738_ECCV_2024_paper.php
Code:

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

Paper: https://arxiv.org/abs/2405.20222
Code: https://github.com/MyNiuuu/MOFA-Video

MotionDirector: Motion Customization of Text-to-Video Diffusion Models

Paper: https://arxiv.org/abs/2310.08465
Code: https://github.com/showlab/MotionDirector

MoVideo: Motion-Aware Video Generation with Diffusion Models

Paper: https://arxiv.org/abs/2311.11325
Code:

Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2790_ECCV_2024_paper.php
Code: https://github.com/hechang25/MVSD

Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

Paper: https://arxiv.org/abs/2407.10285
Code: https://github.com/yangqy1110/NC-SDEdit

PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

Paper: https://arxiv.org/abs/2409.18964
Code: https://github.com/stevenlsw/physgen

VEnhancer: Generative Space-Time Enhancement for Video Generation

Paper:
Code: https://github.com/Vchitect/VEnhancer

ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

Paper: https://arxiv.org/abs/2310.01324
Code: https://github.com/leexinhao/ZeroI2V

4. Video Editing

Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

Paper: https://arxiv.org/abs/2403.13745
Code: https://github.com/G-U-N/Be-Your-Outpainter

DNI: Dilutional Noise Initialization for Diffusion Video Editing

Paper: https://arxiv.org/abs/2409.13037
Code:

DragAnything: Motion Control for Anything using Entity Representation

Paper: https://arxiv.org/abs/2403.07420
Code: https://github.com/showlab/DragAnything

DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing

Paper: https://arxiv.org/abs/2403.12002
Code:

DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion

Paper: https://arxiv.org/abs/2409.09605
Code: https://github.com/leoShen917/DreamMover

Fast Sprite Decomposition from Animated Graphics

Paper: https://arxiv.org/abs/2408.03923
Code: https://github.com/CyberAgentAILab/sprite-decompose

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2738_ECCV_2024_paper.php
Code:

TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models

Paper: https://arxiv.org/abs/2407.09012
Code: https://github.com/eccv2024tcan/TCAN

Self-Supervised Audio-Visual Soundscape Stylization

Paper: https://arxiv.org/abs/2409.14340
Code: https://github.com/Tinglok/avsoundscape

WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2554_ECCV_2024_paper.php
Code:

5. 3D Generation/3D Synthesis

BAMM: Bidirectional Autoregressive Motion Model

Paper: https://arxiv.org/abs/2403.19435
Code: https://github.com/exitudio/BAMM

Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation

Paper: https://arxiv.org/abs/2407.07554
Code:

Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models

Paper: https://arxiv.org/abs/2401.12978
Code: https://github.com/snuvclab/coma

CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images

Paper: https://arxiv.org/abs/2407.04345
Code: https://github.com/jsshin98/CanonicalFusion

Connecting Consistency Distillation to Score Distillation for Text-to-3D Generation

Paper: https://arxiv.org/abs/2407.13584
Code: https://github.com/LMozart/ECCV2024-GCS-BEG

DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose

Paper: https://arxiv.org/abs/2408.14860
Code:

DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/1847_ECCV_2024_paper.php
Code:

DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2100_ECCV_2024_paper.php
Code: https://github.com/HyoKong/DreamDrone

DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation

Paper: https://arxiv.org/abs/2404.06119
Code: https://github.com/iSEE-Laboratory/DreamView

EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion

Paper: https://arxiv.org/abs/2405.00915
Code: https://github.com/ymxlzgy/echoscene

EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Human Motion Generation

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/168_ECCV_2024_paper.php
Code: https://github.com/Frank-ZY-Dou/EMDM

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

Paper: https://arxiv.org/abs/2408.00296
Code:

Expressive Whole-Body 3D Gaussian Avatar

Paper: https://arxiv.org/abs/2408.00297
Code:

Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation

Paper: https://arxiv.org/abs/2312.07231
Code:

GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes

Paper:
Code: https://github.com/ibrahimethemhamamci/GenerateCT

GenRC: Generative 3D Room Completion from Sparse Image Collections

Paper: https://arxiv.org/abs/2407.12939
Code: https://github.com/minfenli/GenRC

GVGEN: Text-to-3D Generation with Volumetric Representation

Paper: https://arxiv.org/abs/2403.12957
Code: https://github.com/SOTAMak1r/GVGEN

Head360: Learning a Parametric 3D Full-Head for Free-View Synthesis in 360°

Paper:
Code:

HiFi-123: Towards High-fidelity One Image to 3D Content Generation

Paper: https://arxiv.org/abs/2310.06744
Code: https://github.com/AILab-CVC/HiFi-123

iHuman: Instant Animatable Digital Humans From Monocular Videos

Paper: https://arxiv.org/abs/2407.11174
Code:

JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation

Paper: https://arxiv.org/abs/2407.12291
Code:

KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

Paper:
Code: https://github.com/ffxzh/KMTalk

Length-Aware Motion Synthesis via Latent Diffusion

Paper: https://arxiv.org/abs/2407.11532
Code:

LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/501_ECCV_2024_paper.php
Code: https://github.com/NIRVANALAN/LN3Diff

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Paper: https://arxiv.org/abs/2407.10528
Code:

MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos

Paper: https://arxiv.org/abs/2407.08414
Code: https://github.com/shad0wta9/meshavatar

MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model

Paper: https://arxiv.org/abs/2404.19759
Code: https://github.com/Dai-Wenxun/MotionLCM

Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM

Paper: https://arxiv.org/abs/2403.07487
Code: https://github.com/steve-zeyu-zhang/MotionMamba

MVDiffHD: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2446_ECCV_2024_paper.php
Code: https://github.com/Tangshitao/MVDiffusion_plusplus

NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation

Paper: https://arxiv.org/abs/2403.18241
Code:

PanoFree: Tuning-Free Holistic Multi-view Image Generation with Cross-view Self-Guidance

Paper: https://arxiv.org/abs/2408.02157
Code: https://github.com/zxcvfd13502/PanoFree

ParCo: Part-Coordinating Text-to-Motion Synthesis

Paper: https://arxiv.org/abs/2403.18512
Code: https://github.com/qrzou/ParCo

Pyramid Diffusion for Fine 3D Large Scene Generation

Paper: https://arxiv.org/abs/2311.12085
Code: https://github.com/yuhengliu02/pyramid-discrete-diffusion

Realistic Human Motion Generation with Cross-Diffusion Models

Paper: https://arxiv.org/abs/2312.10993
Code: https://github.com/THUSIGSICLAB/crossdiff

Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable 2D Repainting

Paper: https://arxiv.org/abs/2312.13271
Code: https://github.com/PKU-YuanGroup/repaint123

RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

Paper: https://arxiv.org/abs/2407.06938
Code: https://github.com/RodinHD/RodinHD

ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

Paper: https://arxiv.org/abs/2407.02040
Code: https://github.com/theEricMa/ScaleDreamer

ScanTalk: 3D Talking Heads from Unregistered Scans

Paper: https://arxiv.org/abs/2403.10942
Code: https://github.com/miccunifi/ScanTalk

SceneTeller: Language-to-3D Scene Generation

Paper: https://arxiv.org/abs/2407.20727
Code:

StructLDM: Structured Latent Diffusion for 3D Human Generation

Paper: https://arxiv.org/abs/2404.01241
Code: https://github.com/TaoHuUMD/StructLDM

Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models

Paper: https://arxiv.org/abs/2311.17050
Code: https://github.com/Yzmblog/SurfD

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/150_ECCV_2024_paper.php
Code:

TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling

Paper: https://arxiv.org/abs/2408.01291
Code:

UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/698_ECCV_2024_paper.php
Code: https://github.com/YG256Li/UniDream

VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing

Paper: https://arxiv.org/abs/2407.04461
Code:

VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

Paper: https://arxiv.org/abs/2403.12034
Code: https://github.com/facebookresearch/vfusion3d

Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/1890_ECCV_2024_paper.php
Code: https://github.com/sfanxiang/videoshop

Viewpoint Textual Inversion: Discovering Scene Representations and 3D View Control in 2D Diffusion Models

Paper: https://arxiv.org/abs/2309.07986
Code: https://github.com/jmhb0/view_neti

VividDreamer: Invariant Score Distillation For Hyper-Realistic Text-to-3D Generation

Paper:
Code: https://github.com/SupstarZh/VividDreamer

6. 3D Editing

3DEgo: 3D Editing on the Go!

Paper: https://arxiv.org/abs/2407.10102
Code:

Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts

Paper: https://arxiv.org/abs/2407.06842
Code: https://github.com/Fangkang515/CE3D

DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing

Paper: https://arxiv.org/abs/2404.18929
Code: https://github.com/silent-chen/DGE

Free-Editor: Zero-shot Text-driven 3D Scene Editing

Paper: https://arxiv.org/abs/2312.13663
Code: https://github.com/nazmul-karim170/FreeEditor-Text-to-3D-Scene-Editing

GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing

Paper: https://arxiv.org/abs/2403.08733
Code: https://github.com/ActiveVisionLab/gaussctrl

Gaussian Grouping: Segment and Edit Anything in 3D Scenes

Paper: https://arxiv.org/abs/2312.00732
Code: https://github.com/lkeab/gaussian-grouping

KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

Paper: https://arxiv.org/abs/2409.01113
Code: https://github.com/ffxzh/KMTalk

LatentEditor: Text Driven Local Editing of 3D Scenes

Paper: https://arxiv.org/abs/2312.09313
Code: https://github.com/umarkhalidAI/LatentEditor

RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/8662_ECCV_2024_paper.php
Code: https://github.com/qwang666/RoomTex-

SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer

Paper: https://arxiv.org/abs/2404.03736
Code: https://github.com/JarrentWu1031/SC4D

Shapefusion: 3D localized human diffusion models

Paper: https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/2155_ECCV_2024_paper.php
Code:

SMooDi: Stylized Motion Diffusion Model

Paper: https://arxiv.org/abs/2407.12783
Code: https://github.com/neu-vi/SMooDi

StyleCity: Large-Scale 3D Urban Scenes Stylization

Paper: https://arxiv.org/abs/2404.10681
Code: https://github.com/chenyingshu/stylecity3d

Texture-GS: Disentangling the Geometry and Texture for 3D Gaussian Splatting Editing

Paper: https://arxiv.org/abs/2403.10050
Code: https://github.com/slothfulxtx/Texture-GS

Towards High-Quality 3D Motion Transfer with Realistic Apparel Animation

Paper: https://arxiv.org/abs/2407.11266
Code: https://github.com/rongakowang/MMDMC

View-Consistent 3D Editing with Gaussian Splatting

Paper: https://arxiv.org/abs/2403.11868
Code: https://github.com/Yuxuan-W/vcedit

Watch Your Steps: Local Image and Scene Editing by Text Instructions

Paper: https://arxiv.org/abs/2308.08947
Code: https://github.com/SamsungLabs/WatchYourSteps

7. Multi-Modal Large Language Models

AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting

Paper: https://arxiv.org/abs/2403.09513
Code: https://github.com/SaFoLab-WISC/AdaShield

AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization

Paper: https://arxiv.org/abs/2407.08156
Code: https://github.com/xsx1001/AddressCLIP

Adversarial Prompt Tuning for Vision-Language Models

Paper:
Code: https://github.com/jiamingzhang94/Adversarial-Prompt-Tuning

A Large Multimodal Model Perceiving Any Aspect Ratio and High-Resolution Images

Paper: https://arxiv.org/abs/2403.11703
Code: https://github.com/thunlp/LLaVA-UHD

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Paper: https://arxiv.org/abs/2403.06764
Code: https://github.com/pkunlp-icler/FastV

API: Attention Prompting on Image for Large Vision-Language Models

Paper:
Code: https://github.com/yu-rp/apiprompting

Bi-directional Contextual Attention for 3D Dense Captioning

Paper: https://arxiv.org/abs/2408.06662
Code:

BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation

Paper: https://arxiv.org/abs/2408.05926
Code:

CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts

Paper: https://arxiv.org/abs/2311.16445
Code:

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Paper: https://arxiv.org/abs/2407.12442
Code: https://github.com/mc-lan/ClearCLIP

ControlCap: Controllable Region-level Captioning

Paper: https://arxiv.org/abs/2401.17910
Code: https://github.com/callsys/ControlCap

Controllable Navigation Instruction Generation with Chain of Thought Prompting

Paper: https://arxiv.org/abs/2407.07433
Code:

DreamLIP: Language-Image Pre-training with Long Captions

Paper: https://arxiv.org/abs/2403.17007
Code: https://github.com/zyf0619sjtu/DreamLIP

DriveLM: Driving with Graph Visual Question Answering

Paper: https://arxiv.org/abs/2312.14150
Code: https://github.com/OpenDriveLab/DriveLM

Elysium: Exploring Object-level Perception in Videos via MLLM

Paper: https://arxiv.org/abs/2403.16558
Code: https://github.com/Hon-Wong/Elysium

Emergent Visual-Semantic Hierarchies in Image-Text Representations

Paper: https://arxiv.org/abs/2407.08521
Code: https://github.com/TAU-VAILab/hierarcaps

EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding

Paper: https://arxiv.org/abs/2308.03135
Code: https://github.com/jiazhou-garland/EventBind

FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance

Paper: https://arxiv.org/abs/2407.05578
Code: https://github.com/pumpkin805/FALIP

GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths

Paper: https://arxiv.org/abs/2408.02788
Code:

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

Paper: https://arxiv.org/abs/2312.06731
Code: https://github.com/zhaohengyuan1/Genixer

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

Paper: https://arxiv.org/abs/2311.15826
Code: https://github.com/mbzuai-oryx/GeoChat

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Paper: https://arxiv.org/abs/2403.09394
Code: https://github.com/Haiyang-W/GiT

Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

Paper: https://arxiv.org/abs/2407.12679
Code: https://github.com/Vision-CAIR/MiniGPT4-video

Groma: Grounded Multimodal Assistant

Paper: https://arxiv.org/abs/2404.13013
Code: https://github.com/FoundationVision/Groma

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

Paper: https://arxiv.org/abs/2311.16101
Code: https://github.com/UCSC-VLAA/vllm-safety-benchmark

InternVideo: Video Foundation Models for Multimodal Understanding

Paper: https://arxiv.org/abs/2212.03191
Code: https://github.com/OpenGVLab/InternVideo

Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks

Paper: https://arxiv.org/abs/2403.09377
Code: https://github.com/tingyu215/Routing_VLPEFT

Instruction Tuning-free Visual Token Complement for Multimodal LLMs

Paper: https://arxiv.org/abs/2408.05019
Code:

LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models

Paper:
Code: https://github.com/YBZh/LAPT

Learning Video Context as Interleaved Multimodal Sequences

Paper: https://arxiv.org/abs/2407.21757
Code: https://github.com/showlab/MovieSeq

LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

Paper: https://arxiv.org/abs/2405.02363
Code: https://github.com/llm-as-dataset-analyst/SSDLLM

LLMGA: Multimodal Large Language Model based Generation Assistant

Paper: https://arxiv.org/abs/2311.16500
Code: https://github.com/dvlab-research/LLMGA

Long-CLIP: Unlocking the Long-Text Capability of CLIP

Paper: https://arxiv.org/abs/2403.15378
Code: https://github.com/beichenzbc/Long-CLIP

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Paper: https://arxiv.org/abs/2403.14624
Code: https://github.com/ZrrSkywalker/MathVerse

Merlin: Empowering Multimodal LLMs with Foresight Minds

Paper: https://arxiv.org/abs/2312.00589
Code: https://github.com/Ahnsun/merlin

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

Paper: https://arxiv.org/abs/2403.11755
Code: https://github.com/jmiemirza/Meta-Prompting

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Paper: https://arxiv.org/abs/2312.03766
Code: https://github.com/BrianG13/MismatchQuest

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Paper: https://arxiv.org/abs/2311.17600
Code: https://github.com/isXinLiu/MM-SafetyBench

NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

Paper: https://arxiv.org/abs/2305.16986
Code: https://github.com/GengzeZhou/NavGPT-2

Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

Paper: https://arxiv.org/abs/2404.12139
Code: https://github.com/Heathcliff-saku/Omniview_Tuning

Parrot Captions Teach CLIP to Spot Text

Paper: https://arxiv.org/abs/2312.14232
Code: https://github.com/opendatalab/CLIP-Parrot-Bias

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

Paper: https://arxiv.org/abs/2407.21771
Code: https://github.com/LALBJ/PAI

Platypus: A Generalized Specialist Model for Reading Text in Various Forms

Paper: https://arxiv.org/abs/2408.14805
Code: https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/Platypus

PointLLM: Empowering Large Language Models to Understand Point Clouds

Paper: https://arxiv.org/abs/2308.16911
Code: https://github.com/OpenRobotLab/PointLLM

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

Paper: https://arxiv.org/abs/2403.14598
Code: https://github.com/zamling/PSALM

R2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations

Paper: https://arxiv.org/abs/2403.04924
Code: https://github.com/lxa9867/r2bench

Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

Paper: https://arxiv.org/abs/2407.11422
Code: https://github.com/zjr2000/REVERIE

REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

Paper:
Code: https://github.com/agneet42/revision

Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Paper: https://arxiv.org/abs/2311.16254
Code: https://github.com/aimagelab/safe-clip

SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

Paper: https://arxiv.org/abs/2409.10542
Code: https://github.com/AI-Application-and-Integration-Lab/SAM4MLLM

SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models

Paper:
Code: https://github.com/wuyongjianCODE/SDPT

Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities

Paper: https://arxiv.org/abs/2403.04908
Code: https://github.com/ramdrop/edgevl

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Paper: https://arxiv.org/abs/2311.12793
Code: https://github.com/ShareGPT4Omni/ShareGPT4V

Soft Prompt Generation

Paper: https://arxiv.org/abs/2404.19286v2
Code: https://github.com/renytek13/Soft-Prompt-Generation

SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

Paper: https://arxiv.org/abs/2403.11299
Code:

ST-LLM: Large Language Models Are Effective Temporal Learners

Paper: https://arxiv.org/abs/2404.00308
Code: https://github.com/TencentARC/ST-LLM

Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits

Paper: https://arxiv.org/abs/2409.01690
Code: https://github.com/insait-institute/MUZE

TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

Paper: https://arxiv.org/abs/2404.00384
Code: https://github.com/shjo-april/TTD

UMBRAE: Unified Multimodal Brain Decoding

Paper: https://arxiv.org/abs/2404.07202
Code: https://github.com/weihaox/UMBRAE

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

Paper: https://arxiv.org/abs/2311.17136
Code: https://github.com/TIGER-AI-Lab/UniIR

Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model

Paper: https://arxiv.org/abs/2402.19150
Code: https://github.com/ChaduCheng/TypoDeceptions

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Paper: https://arxiv.org/abs/2312.06109
Code: https://github.com/Ucas-HaoranWei/Vary

VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks

Paper: https://arxiv.org/abs/2403.00522
Code: https://github.com/Meituan-AutoML/VisionLLaMA

X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

Paper: https://arxiv.org/abs/2407.13851
Code:

8. Others

Which Model Generated This Image? A Model-Agnostic Approach for Origin Attribution

Paper: https://arxiv.org/abs/2404.02697v2
Code: https://github.com/uwFengyuan/OCC-CLIP

Continuously updated~
