*"The baby, assailed by eyes, ears, nose, skin, and entrails at once, feels it all as one great blooming, buzzing confusion."* -- William James
1. Introduction «🎯Back To Top»
The human perceptual system is a complex and multifaceted construct. The five basic senses of hearing, touch, taste, smell, and vision serve as the primary channels of perception, allowing us to perceive and interpret most of the external stimuli encountered in this “blooming, buzzing confusion” of a world. These stimuli typically arise from multiple events that are spatially and temporally distributed.
In other words, we constantly perceive the world in a “multimodal” manner: we combine different information channels to pick out features within the confusion, seamlessly integrate sensations from multiple modalities, and acquire knowledge through experience.
2. Background «🎯Back To Top»
2.1 Datasets «🎯Back To Top»
Table 1. Chronological timeline of representative text-to-image datasets.
“Public” links to each dataset (✔ if publicly available) or to its paper (❌ otherwise).
“Annotations” denotes the number of text descriptions per image.
“Attrs” denotes the total number of attributes in each dataset.
Year | Dataset | Public | Category | Image (Resolution) | Annotations | Attrs | Other Information |
---|---|---|---|---|---|---|---|
2008 | Oxford-102 Flowers | ✔ | Flower | 8,189 (-) | 10 | - | - |
2011 | CUB-200-2011 | ✔ | Bird | 11,788 (-) | 10 | - | BBox, Segmentation... |
2014 | MS-COCO2014 | ✔ | Iconic Objects | 120k (-) | 5 | - | BBox, Segmentation... |
2018 | Face2Text | ✔ | Face | 10,177 (-) | ~1 | - | - |
2019 | SCU-Text2face | ❌ | Face | 1,000 (256×256) | 5 | - | - |
2020 | Multi-Modal-CelebA-HQ | ✔ | Face | 30,000 (512×512) | 10 | 38 | Masks, Sketches |
2021 | FFHQ-Text | ✔ | Face | 760 (1024×1024) | 9 | 162 | BBox |
2021 | M2C-Fashion | ❌ | Clothing | 10,855,753 (256×256) | 1 | - | - |
2021 | CelebA-Dialog | ✔ | Face | 202,599 (178×218) | ~5 | 5 | Identity Label... |
2021 | Faces a la Carte | ❌ | Face | 202,599 (178×218) | ~10 | 40 | - |
2021 | LAION-400M | ✔ | Random Crawled | 400M (-) | 1 | - | KNN Index... |
2022 | Bento800 | ✔ | Food | 800 (600×600) | 9 | - | BBox, Segmentation, Label... |
2022 | LAION-5B | ✔ | Random Crawled | 5.85B (-) | 1 | - | URL, Similarity, Language... |
2022 | DiffusionDB | ✔ | Synthetic Images | 14M (-) | 1 | - | Size, Random Seed... |
2022 | COYO-700M | ✔ | Random Crawled | 747M (-) | 1 | - | URL, Aesthetic Score... |
2022 | DeepFashion-MultiModal | ✔ | Full Body | 44,096 (750×1101) | 1 | - | Densepose, Keypoints... |
2023 | ANNA | ✔ | News | 29,625 (256×256) | 1 | - | - |
2023 | DreamBooth | ✔ | Objects & Pets | 158 (-) | 25 | - | - |
2.2 Evaluation Metrics «🎯Back To Top»
👆🏻: Higher is better. 👇🏻: Lower is better.
- [NIPS 2016] Inception Score (IS) 👆🏻
- [Paper] [Python Code (Pytorch)] [(New!)Python Code (Tensorflow)] [Ref.Code(AttnGAN)]
- [NIPS 2017] Fréchet Inception Distance (FID) 👇🏻
- [Paper] [Python Code (Pytorch)] [(New!)Python Code (Tensorflow)] [Ref.Code(DM-GAN)]
- [CVPR 2018] R-precision (RP) 👆🏻
- [Paper] [Ref.Code(CPGAN)]
- [TPAMI 2020] Semantic Object Accuracy (SOA) 👆🏻
- [ECCV 2022] Positional Alignment (PA) 👆🏻
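To make the automatic metrics above concrete, here is a minimal sketch of how IS and FID are computed from the outputs of a pretrained Inception-v3 network (assuming NumPy and SciPy; feature extraction is omitted and the function names are illustrative). The linked reference implementations should be preferred when reporting results, since preprocessing details affect the scores; R-precision, SOA, and PA additionally require task-specific pretrained models or annotations and are not sketched here.

```python
# Minimal sketches of Inception Score (IS) and Fréchet Inception Distance (FID),
# assuming Inception-v3 outputs have already been extracted for the images.
import numpy as np
from scipy.linalg import sqrtm


def inception_score(probs, eps=1e-12):
    """IS from class probabilities of generated images.

    probs: (N, 1000) softmax outputs of a pretrained Inception-v3.
    IS = exp(E_x[KL(p(y|x) || p(y))]); it is usually reported as
    mean ± std over several splits of the generated set.
    """
    marginal = probs.mean(axis=0, keepdims=True)                 # p(y)
    kl = probs * (np.log(probs + eps) - np.log(marginal + eps))  # per-image KL terms
    return float(np.exp(kl.sum(axis=1).mean()))


def frechet_inception_distance(real_feats, gen_feats):
    """FID between real and generated image feature sets.

    real_feats, gen_feats: (N, 2048) pool3 activations of a pretrained
    Inception-v3. FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2(C_r C_g)^{1/2}).
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```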
Besides these automatic metrics, human evaluation is widely used: participants are asked to rate generated images on two criteria, plausibility (covering object accuracy, counting, positional alignment, and image-text alignment) and naturalness (whether the image appears natural and realistic).
The evaluation protocol typically follows a 5-point Likert scale: human evaluators rate each generated image on a scale of 1 to 5, with 5 representing the best and 1 the worst.
Human evaluation is especially important for rare object combinations that require common-sense understanding, and when the goal is to assess or avoid biases related to race or gender.
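Aggregating such ratings is straightforward; as an illustrative sketch (criterion names and scores below are made up), one can report the mean rating per criterion together with a normal-approximation 95% confidence interval:

```python
# Illustrative aggregation of 5-point Likert ratings per criterion.
import numpy as np


def summarize_likert(ratings):
    """ratings: dict mapping a criterion (e.g. "plausibility", "naturalness")
    to a list of integer scores in [1, 5], one per (rater, image) pair.
    Returns {criterion: (mean score, 95% confidence half-width)}."""
    summary = {}
    for criterion, scores in ratings.items():
        scores = np.asarray(scores, dtype=float)
        half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
        summary[criterion] = (scores.mean(), half_width)
    return summary


# Made-up scores for a single model:
print(summarize_likert({
    "plausibility": [5, 4, 4, 3, 5, 4],
    "naturalness": [4, 4, 3, 4, 5, 3],
}))
```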
3. Generative Models «🎯Back To Top»
A comprehensive list of text-to-image approaches. The pioneering works in each development stage are highlighted. Text-to-face generation works start with an emoji (👸).
3.1 GAN Model «🎯Back To Top»
- 2016~2021:
- Generative Adversarial Text to Image Synthesis [Paper] [Code]
- Learning What and Where to Draw [Paper] [Code]
- Adversarial nets with perceptual losses for text-to-image synthesis [Paper]
- I2T2I: Learning Text to Image Synthesis with Textual Data Augmentation [Paper] [Code]
- Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis [Paper]
- MC-GAN: Multi-conditional Generative Adversarial Network for Image Synthesis [Paper] [Code]
- Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction [Paper] [Code]
- 2017:
- 2018:
- StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks [Paper] [Code]
- Text-to-image-to-text translation using cycle consistent adversarial networks [Paper] [Code]
- AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks [Paper] [Code]
- ChatPainter: Improving Text to Image Generation using Dialogue [Paper]
- 2019:
- 👸 FTGAN: A Fully-trained Generative Adversarial Networks for Text to Face Generation [Paper]
- C4Synth: Cross-Caption Cycle-Consistent Text-to-Image Synthesis [Paper]
- Semantics-Enhanced Adversarial Nets for Text-to-Image Synthesis [Paper]
- Semantics Disentangling for Text-to-Image Generation [Paper] [Website]
- MirrorGAN: Learning Text-to-image Generation by Redescription [Paper] [Code]
- Controllable Text-to-Image Generation [Paper] [Code]
- DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis [Paper] [Code]
- 2020:
- CookGAN: Causality based Text-to-Image Synthesis [Paper]
- RiFeGAN: Rich Feature Generation for Text-to-Image Synthesis From Prior Knowledge [Paper]
- KT-GAN: Knowledge-Transfer Generative Adversarial Network for Text-to-Image Synthesis [Paper]
- CPGAN: Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis [Paper] [Code]
- End-to-End Text-to-Image Synthesis with Spatial Constrains [Paper]
- Semantic Object Accuracy for Generative Text-to-Image Synthesis [Paper] [Code]
- 2021:
- 👸 Multi-caption Text-to-Face Synthesis: Dataset and Algorithm [Paper] [Code]
- 👸 Generative Adversarial Network for Text-to-Face Synthesis and Manipulation [Paper]
- 👸 Generative Adversarial Network for Text-to-Face Synthesis and Manipulation with Pretrained BERT Model [Paper]
- Multi-Sentence Auxiliary Adversarial Networks for Fine-Grained Text-to-Image Synthesis [Paper]
- Unsupervised text-to-image synthesis [Paper]
- RiFeGAN2: Rich Feature Generation for Text-to-Image Synthesis from Constrained Prior Knowledge [Paper]
- 2022:
- 👸 DualG-GAN, a Dual-channel Generator based Generative Adversarial Network for text-to-face synthesis [Paper]
- 👸 CMAFGAN: A Cross-Modal Attention Fusion based Generative Adversarial Network for attribute word-to-face synthesis [Paper]
- DR-GAN: Distribution Regularization for Text-to-Image Generation [Paper] [Code]
- T-Person-GAN: Text-to-Person Image Generation with Identity-Consistency and Manifold Mix-Up [Paper] [Code]
- 2021:
- 2022:
- 👸 Text-Free Learning of a Natural Language Interface for Pretrained Face Generators [Paper] [Code]
- 👸 clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP [Paper] [Code]
- 👸 TextFace: Text-to-Style Mapping based Face Generation and Manipulation [Paper]
- 👸 AnyFace: Free-style Text-to-Face Synthesis and Manipulation [Paper]
- 👸 StyleT2F: Generating Human Faces from Textual Description Using StyleGAN2 [Paper] [Code]
- 👸 StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [Paper] [Code]
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [Paper] [Code]
[Others]
- 2018:
- 2021:
- 2022:
3.2 Autoregressive Model «🎯Back To Top»
- 2021:
- Zero-Shot Text-to-Image Generation [Paper] [Code] [Blog] [Model Card] [Colab] [Code(Pytorch)]
- CogView: Mastering Text-to-Image Generation via Transformers [Paper] [Code] [Demo Website(Chinese)]
- Unifying Multimodal Transformer for Bi-directional Image and Text Generation [Paper] [Code]
- 2022:
- CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers [Paper] [Code]
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [Paper] [Code] [Project]
- Neural Architecture Search with a Lightweight Transformer for Text-to-Image Synthesis [Paper]
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers [Paper] [Code]
- CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP [Paper] [Code]
- Text-to-Image Synthesis based on Object-Guided Joint-Decoding Transformer [Paper]
- Autoregressive Image Generation using Residual Quantization [Paper] [Code]
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [Paper] [Code] [The Little Red Boat Story]
3.3 Diffusion Model «🎯Back To Top»
- 2022:
- High-Resolution Image Synthesis with Latent Diffusion Models [Paper] [Code] [Stable Diffusion Code]
- Vector Quantized Diffusion Model for Text-to-Image Synthesis [Paper] [Code]
- Hierarchical Text-Conditional Image Generation with CLIP Latents [Paper] [Blog] [Risks and Limitations] [Unofficial Code]
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [Paper] [Blog]
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [Paper] [Code]
- Compositional Visual Generation with Composable Diffusion Models [Paper] [Code] [Project] [Hugging Face]
- Prompt-to-Prompt Image Editing with Cross Attention Control [Paper] [Code] [Unofficial Code] [Project]
- Creative Painting with Latent Diffusion Models [Paper]
- DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics [Paper] [Project]
- Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation [Paper]
- ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts [Paper]
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [Paper] [Project] [Video]
- Multi-Concept Customization of Text-to-Image Diffusion [Paper] [Project] [Code] [Hugging Face]
- 2023:
- GLIGEN: Open-Set Grounded Text-to-Image Generation [Paper] [Code] [Project] [Hugging Face Demo]
- Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis [Paper (arXiv)] [Paper (OpenReview)] [Code]
- Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models [Paper] [Project] [Code]
- Adding Conditional Control to Text-to-Image Diffusion Models [Paper] [Code]
- Editing Implicit Assumptions in Text-to-Image Diffusion Models [Paper] [Project] [Code]
4. Generative Applications «🎯Back To Top»
4.1 Text-to-Image «🎯Back To Top»
Figure 1. Diverse text-to-face results generated from GAN-based / Diffusion-based / Transformer-based models.
Images in orange boxes are captured from the original papers (a) [zhou2021generative], (b) [pinkney2022clip2latent], and (c) [li2022stylet2i]; the others are generated from textual descriptions by a pre-trained model [pinkney2022clip2latent] [(b) bottom-left row], and by the Dreamstudio [(a-c) middle row] and DALL-E 2 [(a-c) right row] online platforms.
Please refer to Section 3 (Generative Models) for more details about text-to-image.
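As a complement to Figure 1, the sketch below shows how such samples can be reproduced locally with the open-source diffusers library and a publicly released latent diffusion checkpoint (the checkpoint identifier, prompt, and sampling settings are illustrative and may need to be adapted):

```python
# Minimal text-to-image sketch with Hugging Face diffusers (illustrative settings).
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly released Stable Diffusion checkpoint (identifier may change over time).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # move the text encoder, U-Net, and VAE to the GPU

prompt = "a portrait photo of a young woman with long blonde hair and blue eyes"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("text_to_face_sample.png")
```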
4.2 Text-to-X «🎯Back To Top»
Figure 2. Selected representative samples on Text-to-X.
Images are captured from original papers ((a) [ho2022imagen], (b)-Left [xu2022dream3d], (b)-Right [poole2022dreamfusion], (c) [tevet2022human]) and remade.
4.3 X-to-Image «🎯Back To Top»
Figure 3. Selected representative samples on X-to-Image.
Images are captured from original papers and remade.
(a) Layered Editing [bar2022text2live] (Left), Recontextualization [ruiz2023dreambooth] (Middle), Image Editing [brooks2022instructpix2pix] (Right).
(b) Context-Aware Generation [he2021context] (Left), Model Complex Scenes [yang2022modeling] (Right).
(c) Face Reconstruction [dado2022hyperrealistic] (Left), High-resolution Image Reconstruction [takagi2022high] (Right).
(d) Speech to Image [wang2021generating] (Left), Sound Guided Image Manipulation [lee2022robust] (Middle), Robotic Painting [misra2023robot] (Right).
Legend: X excluding “Additional Input Image” (Blue dotted line box, top row). Additional Input Image (Green box, middle row). Ground Truth (Red box, middle row). Generated / Edited / Reconstructed Image (Black box, bottom row).