- System requirements: Ubuntu 20.04 / Ubuntu 22.04, CUDA 12.1
- Tested GPUs: A100
Create a conda environment:

```bash
conda create -n hallo python=3.10
conda activate hallo
```
Install packages with pip:

```bash
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
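A quick sanity check that the install worked and PyTorch can see your GPU (output varies by machine):

```bash
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```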
In addition, ffmpeg is required:

```bash
apt-get install ffmpeg
```
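You can confirm ffmpeg is on your PATH with:

```bash
ffmpeg -version | head -n 1
```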
You can easily get all pretrained models required for inference from our HuggingFace repo.

Clone the pretrained models into the `${PROJECT_ROOT}/pretrained_models` directory with the command below:

```bash
git lfs install
git clone https://huggingface.co/fudan-generative-ai/hallo2 pretrained_models
```
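One pitfall worth checking: if `git lfs install` was skipped before cloning, the checkpoints come down as tiny pointer files instead of real weights. The actual `.pth` files should each be on the order of tens to hundreds of MB:

```bash
du -sh pretrained_models/*   # LFS pointer files are only a few hundred bytes
# if the sizes look wrong, fetch the real files:
cd pretrained_models && git lfs pull
```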
Or you can download them separately from their source repos:

- hallo2: our checkpoint for video super-resolution.
- facelib: pretrained face parsing models.
- realesrgan: background upsampling model.
- CodeFormer: pretrained CodeFormer model; optional, needed only if you want to train our video super-resolution model from scratch.
Finally, these pretrained models should be organized as follows:
```text
./pretrained_models/
|-- CodeFormer/
|   |-- codeformer.pth
|   `-- vqgan_code1024.pth
|-- facelib/
|   |-- detection_mobilenet0.25_Final.pth
|   |-- detection_Resnet50_Final.pth
|   |-- parsing_parsenet.pth
|   |-- yolov5l-face.pth
|   `-- yolov5n-face.pth
|-- hallo2/
|   `-- net_g.pth
`-- realesrgan/
    `-- RealESRGAN_x2plus.pth
```
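A small shell check to confirm the layout matches the tree above (the file list is taken directly from the tree):

```bash
for f in CodeFormer/codeformer.pth CodeFormer/vqgan_code1024.pth \
         facelib/detection_mobilenet0.25_Final.pth facelib/detection_Resnet50_Final.pth \
         facelib/parsing_parsenet.pth facelib/yolov5l-face.pth facelib/yolov5n-face.pth \
         hallo2/net_g.pth realesrgan/RealESRGAN_x2plus.pth; do
  [ -f "pretrained_models/$f" ] || echo "missing: pretrained_models/$f"
done
```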
Simply run `scripts/video_sr.py` and pass `input_path` and `output_path`:

```bash
python scripts/video_sr.py --input_path [input_video] --output_path [output_dir] --bg_upsampler realesrgan --face_upsample -w 1 -s 4
```

Animation results will be saved at `output_dir`.
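To enhance several clips in one go, the same command can be wrapped in a loop (the `videos/` and `output/` paths here are illustrative):

```bash
for clip in videos/*.mp4; do
  python scripts/video_sr.py --input_path "$clip" \
      --output_path "output/$(basename "$clip" .mp4)" \
      --bg_upsampler realesrgan --face_upsample -w 1 -s 4
done
```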
For more options:
```text
usage: video_sr.py [-h] [-i INPUT_PATH] [-o OUTPUT_PATH] [-w FIDELITY_WEIGHT] [-s UPSCALE] [--has_aligned] [--only_center_face] [--draw_box]
                   [--detection_model DETECTION_MODEL] [--bg_upsampler BG_UPSAMPLER] [--face_upsample] [--bg_tile BG_TILE] [--suffix SUFFIX]

options:
  -h, --help            show this help message and exit
  -i INPUT_PATH, --input_path INPUT_PATH
                        Input video
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
                        Output folder.
  -w FIDELITY_WEIGHT, --fidelity_weight FIDELITY_WEIGHT
                        Balance the quality and fidelity. Default: 0.5
  -s UPSCALE, --upscale UPSCALE
                        The final upsampling scale of the image. Default: 2
  --has_aligned         Input are cropped and aligned faces. Default: False
  --only_center_face    Only restore the center face. Default: False
  --draw_box            Draw the bounding box for the detected faces. Default: False
  --detection_model DETECTION_MODEL
                        Face detector. Optional: retinaface_resnet50, retinaface_mobile0.25, YOLOv5l, YOLOv5n. Default: retinaface_resnet50
  --bg_upsampler BG_UPSAMPLER
                        Background upsampler. Optional: realesrgan
  --face_upsample       Face upsampler after enhancement. Default: False
  --bg_tile BG_TILE     Tile size for background sampler. Default: 400
  --suffix SUFFIX       Suffix of the restored faces. Default: None
```
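Since `-w` trades restoration quality against fidelity to the input face, comparing two runs is the quickest way to choose a value (output directory names here are illustrative):

```bash
python scripts/video_sr.py -i input.mp4 -o out_w0.5 -w 0.5 -s 4 --bg_upsampler realesrgan --face_upsample
python scripts/video_sr.py -i input.mp4 -o out_w1.0 -w 1.0 -s 4 --bg_upsampler realesrgan --face_upsample
```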
We use the VFHQ dataset for training; you can download it from its homepage. Then update `dataroot_gt` in `./configs/train/video_sr.yaml`.
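To point the config at your local copy, first locate the key (the path below is a placeholder; in BasicSR-style configs `dataroot_gt` usually sits under the train dataset section, but check your copy of the file):

```bash
grep -n "dataroot_gt" ./configs/train/video_sr.yaml
# then edit the matching line, e.g.:
#   dataroot_gt: /path/to/VFHQ
```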
Start training with the following command:
```bash
python -m torch.distributed.launch --nproc_per_node=8 --master_port=4652 \
       basicsr/train.py -opt ./configs/train/video_sr.yaml \
       --launcher pytorch
```
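`--nproc_per_node` should match the number of GPUs on the node. For a quick single-GPU debug run, BasicSR-style training scripts typically also accept a plain, non-distributed launch; this is an assumption, so verify against this repo's `basicsr/train.py`:

```bash
# Single-GPU, non-distributed run (assumption: the script's default launcher is "none")
python basicsr/train.py -opt ./configs/train/video_sr.yaml
```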