*"The baby, assailed by eyes, ears, nose, skin, and entrails at once, feels it all as one great blooming, buzzing confusion."* -- William James
1. Introduction «🎯Back To Top»
The human perceptual system is a complex and multifaceted construct. The five basic senses of hearing, touch, taste, smell, and vision serve as the primary channels of perception, allowing us to perceive and interpret most of the external stimuli encountered in this “blooming, buzzing confusion” of a world. These stimuli typically arise from multiple events that are spatially and temporally distributed.
In other words, we constantly perceive the world in a “multimodal” manner: we combine different information channels to pick out features within the confusion, seamlessly integrate sensations from multiple modalities, and acquire knowledge through experience.
2. Background «🎯Back To Top»
2.1 Datasets «🎯Back To Top»
Table 1. Chronological timeline of representative text-to-image datasets.
“Public” links to each dataset (✔ if publicly available) or to its paper (❌ otherwise).
“Annotations” denotes the number of text descriptions per image.
“Attrs” denotes the total number of attributes in each dataset.
Year | Dataset | Public | Category | Image (Resolution) | Annotations | Attrs | Other Information |
---|---|---|---|---|---|---|---|
2008 | Oxford-102 Flowers | ✔ | Flower | 8,189 (-) | 10 | - | - |
2011 | CUB-200-2011 | ✔ | Bird | 11,788 (-) | 10 | - | BBox, Segmentation... |
2014 | MS-COCO2014 | ✔ | Iconic Objects | 120k (-) | 5 | - | BBox, Segmentation... |
2018 | Face2Text | ✔ | Face | 10,177 (-) | ~1 | - | - |
2019 | SCU-Text2face | ❌ | Face | 1,000 (256×256) | 5 | - | - |
2020 | Multi-Modal-CelebA-HQ | ✔ | Face | 30,000 (512×512) | 10 | 38 | Masks, Sketches |
2021 | FFHQ-Text | ✔ | Face | 760 (1024×1024) | 9 | 162 | BBox |
2021 | M2C-Fashion | ❌ | Clothing | 10,855,753 (256×256) | 1 | - | - |
2021 | CelebA-Dialog | ✔ | Face | 202,599 (178×218) | ~5 | 5 | Identity Label... |
2021 | Faces a la Carte | ❌ | Face | 202,599 (178×218) | ~10 | 40 | - |
2021 | LAION-400M | ✔ | Random Crawled | 400M (-) | 1 | - | KNN Index... |
2022 | Bento800 | ✔ | Food | 800 (600×600) | 9 | - | BBox, Segmentation, Label... |
2022 | LAION-5B | ✔ | Random Crawled | 5.85B (-) | 1 | - | URL, Similarity, Language... |
2022 | DiffusionDB | ✔ | Synthetic Images | 14M (-) | 1 | - | Size, Random Seed... |
2022 | COYO-700M | ✔ | Random Crawled | 747M (-) | 1 | - | URL, Aesthetic Score... |
2022 | DeepFashion-MultiModal | ✔ | Full Body | 44,096 (750×1101) | 1 | - | Densepose, Keypoints... |
2023 | ANNA | ✔ | News | 29,625 (256×256) | 1 | - | - |
2023 | DreamBooth | ✔ | Objects & Pets | 158 (-) | 25 | - | - |
2.2 Evaluation Metrics «🎯Back To Top»
👆🏻: Higher is better. 👇🏻: Lower is better.
- [NIPS 2016] Inception Score (IS) 👆🏻
- [Paper] [Python Code (Pytorch)] [(New!)Python Code (Tensorflow)] [Ref.Code(AttnGAN)]
- [NIPS 2017] Fréchet Inception Distance (FID) 👇🏻
- [Paper] [Python Code (Pytorch)] [(New!)Python Code (Tensorflow)] [Ref.Code(DM-GAN)]
- [CVPR 2018] R-precision (RP) 👆🏻
- [Paper] [Ref.Code(CPGAN)]
- [TPAMI 2020] Semantic Object Accuracy (SOA) 👆🏻
- [ECCV 2022] Positional Alignment (PA) 👆🏻
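To make the automatic metrics above concrete, here is a minimal sketch of how IS and FID are computed from the outputs of a pretrained Inception-v3 network (assuming NumPy and SciPy; feature extraction is omitted and the function names are illustrative). The linked reference implementations should be preferred when reporting results, since preprocessing details affect the scores; R-precision, SOA, and PA additionally require task-specific pretrained models or annotations and are not sketched here.

```python
# Minimal sketches of Inception Score (IS) and Fréchet Inception Distance (FID),
# assuming Inception-v3 outputs have already been extracted for the images.
import numpy as np
from scipy.linalg import sqrtm


def inception_score(probs, eps=1e-12):
    """IS from class probabilities of generated images.

    probs: (N, 1000) softmax outputs of a pretrained Inception-v3.
    IS = exp(E_x[KL(p(y|x) || p(y))]); it is usually reported as
    mean ± std over several splits of the generated set.
    """
    marginal = probs.mean(axis=0, keepdims=True)                 # p(y)
    kl = probs * (np.log(probs + eps) - np.log(marginal + eps))  # per-image KL terms
    return float(np.exp(kl.sum(axis=1).mean()))


def frechet_inception_distance(real_feats, gen_feats):
    """FID between real and generated image feature sets.

    real_feats, gen_feats: (N, 2048) pool3 activations of a pretrained
    Inception-v3. FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2(C_r C_g)^{1/2}).
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```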
Besides these automatic metrics, human evaluation is widely used: participants are asked to rate generated images on two criteria, plausibility (covering object accuracy, counting, positional alignment, and image-text alignment) and naturalness (whether the image appears natural and realistic).
The evaluation protocol typically follows a 5-point Likert scale: human evaluators rate each generated image on a scale of 1 to 5, with 5 representing the best and 1 the worst.
Human evaluation is especially important for rare object combinations that require common-sense understanding, and when the goal is to assess or avoid biases related to race or gender.
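Aggregating such ratings is straightforward; as an illustrative sketch (criterion names and scores below are made up), one can report the mean rating per criterion together with a normal-approximation 95% confidence interval:

```python
# Illustrative aggregation of 5-point Likert ratings per criterion.
import numpy as np


def summarize_likert(ratings):
    """ratings: dict mapping a criterion (e.g. "plausibility", "naturalness")
    to a list of integer scores in [1, 5], one per (rater, image) pair.
    Returns {criterion: (mean score, 95% confidence half-width)}."""
    summary = {}
    for criterion, scores in ratings.items():
        scores = np.asarray(scores, dtype=float)
        half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
        summary[criterion] = (scores.mean(), half_width)
    return summary


# Made-up scores for a single model:
print(summarize_likert({
    "plausibility": [5, 4, 4, 3, 5, 4],
    "naturalness": [4, 4, 3, 4, 5, 3],
}))
```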
3. Generative Models «🎯Back To Top»
A comprehensive list of text-to-image approaches. The pioneering works in each development stage are highlighted. Text-to-face generation works start with an emoji (👸).
3.1 GAN Model «🎯Back To Top»
- 2016~2021:
- Generative Adversarial Text to Image Synthesis [Paper] [Code]
- Learning What and Where to Draw [Paper] [Code]
- Adversarial nets with perceptual losses for text-to-image synthesis [Paper]
- I2T2I: Learning Text to Image Synthesis with Textual Data Augmentation [Paper] [Code]
- Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis [Paper]
- MC-GAN: Multi-conditional Generative Adversarial Network for Image Synthesis [Paper] [Code]
- Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction [Paper] [Code]
- 2017:
- 2018:
- StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks [Paper] [Code]
- Text-to-image-to-text translation using cycle consistent adversarial networks [Paper] [Code]
- AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks [Paper] [Code]
- ChatPainter: Improving Text to Image Generation using Dialogue [Paper]
- 2019:
- 👸 FTGAN: A Fully-trained Generative Adversarial Networks for Text to Face Generation [Paper]
- C4Synth: Cross-Caption Cycle-Consistent Text-to-Image Synthesis [Paper]
- Semantics-Enhanced Adversarial Nets for Text-to-Image Synthesis [Paper]
- Semantics Disentangling for Text-to-Image Generation [Paper] [Website]
- MirrorGAN: Learning Text-to-image Generation by Redescription [Paper] [Code]
- Controllable Text-to-Image Generation [Paper] [Code]
- DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis [Paper] [Code]
- 2020:
- CookGAN: Causality based Text-to-Image Synthesis [Paper]
- RiFeGAN: Rich Feature Generation for Text-to-Image Synthesis From Prior Knowledge [Paper]
- KT-GAN: Knowledge-Transfer Generative Adversarial Network for Text-to-Image Synthesis [Paper]
- CPGAN: Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis [Paper] [Code]
- End-to-End Text-to-Image Synthesis with Spatial Constrains [Paper]
- Semantic Object Accuracy for Generative Text-to-Image Synthesis [Paper] [Code]
- 2021:
- 👸 Multi-caption Text-to-Face Synthesis: Dataset and Algorithm [Paper] [Code]
- 👸 Generative Adversarial Network for Text-to-Face Synthesis and Manipulation [Paper]
- 👸 Generative Adversarial Network for Text-to-Face Synthesis and Manipulation with Pretrained BERT Model [Paper]
- Multi-Sentence Auxiliary Adversarial Networks for Fine-Grained Text-to-Image Synthesis [Paper]
- Unsupervised text-to-image synthesis [Paper]
- RiFeGAN2: Rich Feature Generation for Text-to-Image Synthesis from Constrained Prior Knowledge [Paper]
- 2022:
- 👸 DualG-GAN, a Dual-channel Generator based Generative Adversarial Network for text-to-face synthesis [Paper]
- 👸 CMAFGAN: A Cross-Modal Attention Fusion based Generative Adversarial Network for attribute word-to-face synthesis [Paper]
- DR-GAN: Distribution Regularization for Text-to-Image Generation [Paper] [Code]
- T-Person-GAN: Text-to-Person Image Generation with Identity-Consistency and Manifold Mix-Up [Paper] [Code]
- 2021:
- 2022:
- 👸 Text-Free Learning of a Natural Language Interface for Pretrained Face Generators [Paper] [Code]
- 👸 clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP [Paper] [Code]
- 👸 TextFace: Text-to-Style Mapping based Face Generation and Manipulation [Paper]
- 👸 AnyFace: Free-style Text-to-Face Synthesis and Manipulation [Paper]
- 👸 StyleT2F: Generating Human Faces from Textual Description Using StyleGAN2 [Paper] [Code]
- 👸 StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [Paper] [Code]
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [Paper] [Code]
[Others]
- 2018:
- 2021:
- 2022:
3.2 Autoregressive Model «🎯Back To Top»
- 2021:
- Zero-Shot Text-to-Image Generation [Paper] [Code] [Blog] [Model Card] [Colab] [Code(Pytorch)]
- CogView: Mastering Text-to-Image Generation via Transformers [Paper] [Code] [Demo Website(Chinese)]
- Unifying Multimodal Transformer for Bi-directional Image and Text Generation [Paper] [Code]
- 2022:
- CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers [Paper] [Code]
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [Paper] [Code] [Project]
- Neural Architecture Search with a Lightweight Transformer for Text-to-Image Synthesis [Paper]
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers [Paper] [Code]
- CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP [Paper] [Code]
- Text-to-Image Synthesis based on Object-Guided Joint-Decoding Transformer [Paper]
- Autoregressive Image Generation using Residual Quantization [Paper] [Code]
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [Paper] [Code] [The Little Red Boat Story]
3.3 Diffusion Model «🎯Back To Top»
- 2022:
- High-Resolution Image Synthesis with Latent Diffusion Models [Paper] [Code] [Stable Diffusion Code]
- Vector Quantized Diffusion Model for Text-to-Image Synthesis [Paper] [Code]
- Hierarchical Text-Conditional Image Generation with CLIP Latents [Paper] [Blog] [Risks and Limitations] [Unofficial Code]
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [Paper] [Blog]
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [Paper] [Code]
- Compositional Visual Generation with Composable Diffusion Models [Paper] [Code] [Project] [Hugging Face]
- Prompt-to-Prompt Image Editing with Cross Attention Control [Paper] [Code] [Unofficial Code] [Project]
- Creative Painting with Latent Diffusion Models [Paper]
- DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics [Paper] [Project]
- Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation [Paper]
- ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts [Paper]
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [Paper] [Project] [Video]
- Multi-Concept Customization of Text-to-Image Diffusion [Paper] [Project] [Code] [Hugging Face]
- 2023:
- GLIGEN: Open-Set Grounded Text-to-Image Generation [Paper] [Code] [Project] [Hugging Face Demo]
- Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis [Paper (arXiv)] [Paper (OpenReview)] [Code]
- Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models [Paper] [Project] [Code]
- Adding Conditional Control to Text-to-Image Diffusion Models [Paper] [Code]
- Editing Implicit Assumptions in Text-to-Image Diffusion Models [Paper] [Project] [Code]
4. Generative Applications «🎯Back To Top»
4.1 Text-to-Image «🎯Back To Top»
Figure 1. Diverse text-to-face results generated from GAN-based / Diffusion-based / Transformer-based models.
Images in orange boxes are captured from the original papers (a) [zhou2021generative], (b) [pinkney2022clip2latent], and (c) [li2022stylet2i]; the others are generated from textual descriptions by a pre-trained model [pinkney2022clip2latent] [(b) bottom-left row], and by the Dreamstudio [(a-c) middle row] and DALL-E 2 [(a-c) right row] online platforms.
Please refer to Section 3 (Generative Models) for more details about text-to-image.
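As a complement to Figure 1, the sketch below shows how such samples can be reproduced locally with the open-source diffusers library and a publicly released latent diffusion checkpoint (the checkpoint identifier, prompt, and sampling settings are illustrative and may need to be adapted):

```python
# Minimal text-to-image sketch with Hugging Face diffusers (illustrative settings).
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly released Stable Diffusion checkpoint (identifier may change over time).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # move the text encoder, U-Net, and VAE to the GPU

prompt = "a portrait photo of a young woman with long blonde hair and blue eyes"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("text_to_face_sample.png")
```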
4.2 Text-to-X «🎯Back To Top»
Figure 2. Selected representative samples on Text-to-X.
Images are captured from original papers ((a) [ho2022imagen], (b)-Left [xu2022dream3d], (b)-Right [poole2022dreamfusion], (c) [tevet2022human]) and remade.
4.3 X-to-Image «🎯Back To Top»
Figure 3. Selected representative samples on X-to-Image.
Images are captured from original papers and remade.
(a) Layered Editing [bar2022text2live] (Left), Recontextualization [ruiz2023dreambooth] (Middle), Image Editing [brooks2022instructpix2pix] (Right).
(b) Context-Aware Generation [he2021context] (Left), Model Complex Scenes [yang2022modeling] (Right).
(c) Face Reconstruction [dado2022hyperrealistic] (Left), High-resolution Image Reconstruction [takagi2022high] (Right).
(d) Speech to Image [wang2021generating] (Left), Sound Guided Image Manipulation [lee2022robust] (Middle), Robotic Painting [misra2023robot] (Right).
Legend: X excluding “Additional Input Image” (Blue dotted line box, top row). Additional Input Image (Green box, middle row). Ground Truth (Red box, middle row). Generated / Edited / Reconstructed Image (Black box, bottom row).