diff --git a/README.md b/README.md
index ebaa5b75..c361ca3d 100644
--- a/README.md
+++ b/README.md
@@ -24,28 +24,68 @@
+
-# Showcase
+## 📸 Showcase

 https://github.com/fudan-generative-vision/hallo/assets/17402682/294e78ef-c60d-4c32-8e3c-7f8d6934c6bd

+### 🎬 Honoring Classic Films
-# Framework
+
+<table>
+  <tr>
+    <td><b>Devil Wears Prada</b></td>
+    <td><b>Green Book</b></td>
+    <td><b>Infernal Affairs</b></td>
+  </tr>
+  <tr>
+    <td><b>Patch Adams</b></td>
+    <td><b>Tough Love</b></td>
+    <td><b>Shawshank Redemption</b></td>
+  </tr>
+</table>
+
-![abstract](assets/framework_1.jpg)
-![framework](assets/framework_2.jpg)
+Explore [more examples](https://fudan-generative-vision.github.io/hallo).
+
+## 📰 News
+
+- **`2024/06/15`**: ✨✨✨ Released some images and audio samples for inference testing on [🤗Huggingface](https://huggingface.co/datasets/fudan-generative-ai/hallo_inference_samples).
+- **`2024/06/15`**: 🎉🎉🎉 Launched the first version on 🫡[GitHub](https://github.com/fudan-generative-vision/hallo).
+
+## 🤝 Community Resources
+
+Explore the resources developed by our community to enhance your experience with Hallo:
+
+- [Demo on Huggingface](https://huggingface.co/spaces/multimodalart/hallo) - Check out this easy-to-use Gradio demo by [@multimodalart](https://huggingface.co/multimodalart).
+- [hallo-webui](https://github.com/daswer123/hallo-webui) - Explore the WebUI created by [@daswer123](https://github.com/daswer123).
+- [hallo-for-windows](https://github.com/sdbds/hallo-for-windows) - Utilize Hallo on Windows with the guide by [@sdbds](https://github.com/sdbds).
+- [ComfyUI-Hallo](https://github.com/AIFSH/ComfyUI-Hallo) - Integrate Hallo with the ComfyUI tool by [@AIFSH](https://github.com/AIFSH).
+
+Thanks to all of them.
+
+Join our community and explore these amazing resources to make the most of Hallo. Enjoy, and elevate your creative projects!

-# News
+## 🔧️ Framework

-- **`2024/06/15`**: 🎉🎉🎉 Release the first version on [GitHub](https://github.com/fudan-generative-vision/hallo).
-- **`2024/06/15`**: ✨✨✨ Release some images and audios for inference testing on [Huggingface](https://huggingface.co/datasets/fudan-generative-ai/hallo_inference_samples).
+![abstract](assets/framework_1.jpg)
+![framework](assets/framework_2.jpg)

-# Installation
+## ⚙️ Installation

 - System requirement: Ubuntu 20.04/Ubuntu 22.04, CUDA 12.1
 - Tested GPUs: A100

@@ -69,7 +109,7 @@ Besides, ffmpeg is also needed:

 ```
 apt-get install ffmpeg
 ```

-# Inference
+## 🗝️ Usage

 The inference entrypoint script is `scripts/inference.py`. Before testing your cases, there are three preparations that need to be completed:

@@ -77,7 +117,7 @@ The inference entrypoint script is `scripts/inference.py`. Before testing your cases, there are three preparations that need to be completed:
 2. [Prepare source image and driving audio pairs](#prepare-inference-data).
 3. [Run inference](#run-inference).

-## Download pretrained models
+### 📥 Download Pretrained Models

 You can easily get all pretrained models required by inference from our [HuggingFace repo](https://huggingface.co/fudan-generative-ai/hallo).

@@ -91,12 +131,12 @@ git clone https://huggingface.co/fudan-generative-ai/hallo pretrained_models
 Or you can download them separately from their source repo:

 - [hallo](https://huggingface.co/fudan-generative-ai/hallo/tree/main/hallo): Our checkpoints consist of denoising UNet, face locator, image & audio proj.
-- [audio_separator](https://huggingface.co/huangjackson/Kim_Vocal_2): Kim\_Vocal\_2 MDX-Net vocal removal model by [KimberleyJensen](https://github.com/KimberleyJensen). (_Thanks to runwayml_)
+- [audio_separator](https://huggingface.co/huangjackson/Kim_Vocal_2): Kim\_Vocal\_2 MDX-Net vocal removal model. (_Thanks to [KimberleyJensen](https://github.com/KimberleyJensen)_)
 - [insightface](https://github.com/deepinsight/insightface/tree/master/python-package#model-zoo): 2D and 3D Face Analysis placed into `pretrained_models/face_analysis/models/`. (_Thanks to deepinsight_)
 - [face landmarker](https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task): Face detection & mesh model from [mediapipe](https://ai.google.dev/edge/mediapipe/solutions/vision/face_landmarker#models) placed into `pretrained_models/face_analysis/models`.
-- [motion module](https://github.com/guoyww/AnimateDiff/blob/main/README.md#202309-animatediff-v2): motion module from [AnimateDiff](https://github.com/guoyww/AnimateDiff). (_Thanks to guoyww_).
-- [sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse): Weights are intended to be used with the diffusers library. (_Thanks to stablilityai_)
-- [StableDiffusion V1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5): Initialized and fine-tuned from Stable-Diffusion-v1-2. (_Thanks to runwayml_)
+- [motion module](https://github.com/guoyww/AnimateDiff/blob/main/README.md#202309-animatediff-v2): motion module from [AnimateDiff](https://github.com/guoyww/AnimateDiff). (_Thanks to [guoyww](https://github.com/guoyww)_)
+- [sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse): Weights are intended to be used with the diffusers library. (_Thanks to [stabilityai](https://huggingface.co/stabilityai)_)
+- [StableDiffusion V1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5): Initialized and fine-tuned from Stable-Diffusion-v1-2. (_Thanks to [runwayml](https://huggingface.co/runwayml)_)
 - [wav2vec](https://huggingface.co/facebook/wav2vec2-base-960h): wav audio to vector model from [Facebook](https://huggingface.co/facebook/wav2vec2-base-960h).
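+
+If you prefer to script the download rather than use `git`, the same snapshot can be fetched with the `huggingface_hub` Python API. This is an optional sketch, not part of the repo's tooling; it assumes `pip install huggingface_hub`:
+
+```python
+# Fetch all pretrained weights into ./pretrained_models in one call,
+# mirroring the git clone command above.
+from huggingface_hub import snapshot_download
+
+snapshot_download(
+    repo_id="fudan-generative-ai/hallo",
+    local_dir="pretrained_models",
+)
+```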

 Finally, these pretrained models should be organized as follows:

@@ -137,7 +177,7 @@ Finally, these pretrained models should be organized as follows:
 |   `-- vocab.json
 ```

-## Prepare Inference Data
+### 🛠️ Prepare Inference Data

 Hallo has a few simple requirements for input data:

@@ -153,9 +193,9 @@ For the driving audio:
 2. It must be in English since our training datasets are only in this language.
 3. Ensure the vocals are clear; background music is acceptable.
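+
+These checks are easy to script. The sketch below is illustrative rather than official: it pads the portrait to a square canvas and re-encodes the driving audio as 16 kHz mono WAV with ffmpeg, which is already required above. The square shape, the 512-pixel target size, and the file names are assumptions for illustration, and Pillow is assumed to be installed:
+
+```python
+# Prepare a source image / driving audio pair for scripts/inference.py.
+import subprocess
+from PIL import Image, ImageOps
+
+def to_square(src: str, dst: str, size: int = 512) -> None:
+    # Pad to a square canvas without distorting the face, then resize.
+    img = Image.open(src).convert("RGB")
+    side = max(img.size)
+    ImageOps.pad(img, (side, side), color=(0, 0, 0)).resize((size, size)).save(dst)
+
+def to_wav(src: str, dst: str) -> None:
+    # 16 kHz mono is the rate wav2vec-style audio encoders expect.
+    subprocess.run(["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst], check=True)
+
+to_square("my_portrait.png", "source_image.jpg")
+to_wav("my_speech.mp3", "driving_audio.wav")
+```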

-We have provided some samples for your reference.
+We have provided [some samples](examples/) for your reference.

-## Run inference
+### 🎮 Run Inference

 Simply run `scripts/inference.py`, passing `source_image` and `driving_audio` as input:

@@ -189,31 +229,45 @@ options:
                         face region
 ```
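+
+As a worked example, the usage above can also be driven from a small batch script. This is an illustrative sketch, not an official tool; the flag names follow the usage text above, and the file paths are placeholders for your own pairs:
+
+```python
+# Run scripts/inference.py over several source-image / driving-audio pairs.
+import subprocess
+
+pairs = [
+    ("examples/source_1.jpg", "examples/audio_1.wav"),
+    ("examples/source_2.jpg", "examples/audio_2.wav"),
+]
+for image, audio in pairs:
+    subprocess.run(
+        ["python", "scripts/inference.py",
+         "--source_image", image,
+         "--driving_audio", audio],
+        check=True,
+    )
+```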

-# Roadmap
+## 📅️ Roadmap

 | Status | Milestone                                                                                          |    ETA     |
 | :----: | :------------------------------------------------------------------------------------------------ | :--------: |
 |   ✅   | **[Inference source code available on GitHub](https://github.com/fudan-generative-vision/hallo)**  | 2024-06-15 |
 |   ✅   | **[Pretrained models on Huggingface](https://huggingface.co/fudan-generative-ai/hallo)**           | 2024-06-15 |
-| 🚀🚀🚀 | **[Training: data preparation and training scripts]()**                                            | 2024-06-25 |
-| 🚀🚀🚀 | **[Optimize inference performance in Mandarin]()**                                                 |    TBD     |
+|   🚧   | **[Optimizing inference performance]()**                                                           | 2024-06-23 |
+|   🚧   | **[Optimizing performance on images with a resolution of 256x256]()**                              | 2024-06-23 |
+|   🚀   | **[Improving the model's performance on Mandarin Chinese]()**                                      | 2024-06-25 |
+|   🚀   | **[Releasing data preparation and training scripts]()**                                            | 2024-06-28 |
+
+<details>
+<summary>Other Enhancements</summary>
+
+- [ ] Enhancement: Test and ensure compatibility with the Windows operating system. [#39](https://github.com/fudan-generative-vision/hallo/issues/39)
+- [ ] Bug: Output video may lose several frames. [#41](https://github.com/fudan-generative-vision/hallo/issues/41)
+- [ ] Bug: Sound volume affecting inference results (audio normalization; see the sketch after this list).
+- [ ] Enhancement: Inference code logic optimization.
+- [ ] Enhancement: Enhancing performance on low resolutions (256x256) to support more efficient usage.
+
+</details>
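+
+Until the sound-volume issue above is resolved, a user-side workaround is to peak-normalize the driving audio before inference. This is an illustrative sketch only; it assumes the `numpy` and `soundfile` packages are installed:
+
+```python
+# Peak-normalize a driving-audio WAV so loudness does not skew inference.
+import numpy as np
+import soundfile as sf
+
+audio, sr = sf.read("driving_audio.wav")
+peak = np.max(np.abs(audio))
+if peak > 0:
+    audio = 0.95 * audio / peak  # rescale, leaving a little headroom
+sf.write("driving_audio_normalized.wav", audio, sr)
+```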

-# Citation
+
+## 📝 Citation

 If you find our work useful for your research, please consider citing the paper:

 ```
 @misc{xu2024hallo,
   title={Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation},
-  author={Mingwang Xu and Hui Li and Qingkun Su and Hanlin Shang and Liwei Zhang and Ce Liu and Jingdong Wang and Yao Yao and Siyu zhu},
-  year={2024},
-  eprint={2406.08801},
-  archivePrefix={arXiv},
-  primaryClass={cs.CV}
+  author={Mingwang Xu and Hui Li and Qingkun Su and Hanlin Shang and Liwei Zhang and Ce Liu and Jingdong Wang and Yao Yao and Siyu Zhu},
+  year={2024},
+  eprint={2406.08801},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV}
 }
 ```

-# Opportunities available
+## 🌟 Opportunities Available

 Multiple research positions are open at the **Generative Vision Lab, Fudan University**! These include:

@@ -224,6 +278,14 @@ Multiple research positions are open at the **Generative Vision Lab, Fudan University**!
 Interested individuals are encouraged to contact us at [siyuzhu@fudan.edu.cn](mailto:siyuzhu@fudan.edu.cn) for further information.

-# Social Risks and Mitigations
+## ⚠️ Social Risks and Mitigations

 The development of portrait image animation technologies driven by audio inputs poses social risks, such as the ethical implications of creating realistic portraits that could be misused for deepfakes. To mitigate these risks, it is crucial to establish ethical guidelines and responsible use practices. Privacy and consent concerns also arise from using individuals' images and voices. Addressing these involves transparent data usage policies, informed consent, and safeguarding privacy rights. By addressing these risks and implementing mitigations, the research aims to ensure the responsible and ethical development of this technology.
+
+## 👏 Community Contributors
+
+Thank you to all the contributors who have helped to make this project better!