This repository contains ViSpeR, a large-scale dataset and models for Visual Speech Recognition for English, Arabic, Chinese, French and Spanish.
Given the scarcity of publicly available VSR data for non-English languages, we collected VSR data at scale for four of the most widely spoken languages: French, Spanish, Arabic and Chinese.
Comparison of VSR datasets. Our ViSpeR dataset is larger than other datasets covering non-English languages for the VSR task. Sizes are given in hours; for our dataset, the numbers in parentheses denote the number of clips. We also report the clip coverage of the TedX and Wild subsets of ViSpeR.
Dataset | French (fr) | Spanish (es) | Arabic (ar) | Chinese (zh) |
---|---|---|---|---|
MuAVIC | 176 | 178 | 16 | -- |
VoxCeleb2 | 124 | 42 | -- | -- |
AVSpeech | 122 | 270 | -- | -- |
ViSpeR (TedX) | 192 (160k) | 207 (151k) | 49 (48k) | 129 (143k) |
ViSpeR (Wild) | 680 (481k) | 587 (383k) | 1152 (1.01M) | 658 (593k) |
ViSpeR (full) | 872 (641k) | 794 (534k) | 1200 (1.06M) | 787 (736k) |
First, use the provided video lists to download the videos and place them in separate per-language folders (a download sketch is given below, after the directory tree). The available splits for each language are:
Languages | Split |
---|---|
French | train, test_tedx, test_wild |
Spanish | train, test_tedx, test_wild |
Chinese | train, test_tedx, test_wild |
Arabic | train (coming soon), test_tedx, test_wild |

The raw data should be structured as follows:
Data/
├── Chinese/
│ ├── video_id.mp4
│ └── ...
├── Arabic/
│ ├── video_id.mp4
│ └── ...
├── French/
│ ├── video_id.mp4
│ └── ...
├── Spanish/
│ ├── video_id.mp4
│ └── ...
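To download the videos, one option is yt-dlp. The exact format of the provided video lists may differ, but assuming each list is a plain-text file with one YouTube video ID per line, a minimal download loop could look like the sketch below (the list filenames are placeholders):

```python
# download_videos.py -- minimal sketch for fetching the listed videos into per-language folders.
# Assumes each list file contains one YouTube video ID per line; adjust to the actual list format.
import subprocess
from pathlib import Path

LISTS = {                                  # hypothetical list filenames; use the lists provided in the repo
    "French": "lists/french.txt",
    "Spanish": "lists/spanish.txt",
    "Arabic": "lists/arabic.txt",
    "Chinese": "lists/chinese.txt",
}

for language, list_file in LISTS.items():
    out_dir = Path("Data") / language
    out_dir.mkdir(parents=True, exist_ok=True)
    for video_id in Path(list_file).read_text().split():
        subprocess.run([
            "yt-dlp",
            "-f", "mp4",                                    # prefer an mp4 container
            "-o", str(out_dir / f"{video_id}.mp4"),         # Data/<Language>/<video_id>.mp4
            f"https://www.youtube.com/watch?v={video_id}",
        ], check=False)                                     # skip videos that are no longer available
```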
- Set up the environment and clone the repo:
conda create --name visper python=3.10
conda activate visper
git clone https://github.com/YasserdahouML/visper
cd visper
- Install fairseq within the repository:
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
cd ..
- Install PyTorch (tested with v2.2.2) and other packages:
pip install torch torchvision torchaudio
pip install pytorch-lightning
pip install sentencepiece
pip install av
pip install hydra-core --upgrade
- Install ffmpeg:
conda install "ffmpeg<5" -c conda-forge
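Optionally, you can sanity-check the environment with a short script. This is just a convenience sketch; it only verifies that the key packages import and that ffmpeg is on the PATH:

```python
# check_env.py -- optional sanity check for the installed environment.
import shutil

import av                       # PyAV, used for video decoding
import pytorch_lightning as pl
import sentencepiece as spm
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("pytorch-lightning:", pl.__version__)
print("PyAV:", av.__version__)
print("sentencepiece:", spm.__version__)
print("ffmpeg on PATH:", shutil.which("ffmpeg") is not None)
```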
You need to download the metadata from Hugging Face🤗; this includes train.tar.gz and test.tar.gz. Then, use the provided metadata to process the raw data and create the ViSpeR dataset. You can use crop_videos.py to process the data; note that all clips are cropped and transformed according to the metadata. The available splits are:
Languages | Split |
---|---|
French | train, test |
Spanish | train, test |
Chinese | train, test |
Arabic | train (coming soon), test |
python data_prepare/crop_videos.py --video_dir [path_to_data_language] --save_path [save_path_language] --json_path [language_metadata_path] --use_ffmpeg True
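To process all languages in one pass, a simple wrapper around the command above could look like the sketch below (the metadata filenames are placeholders for wherever you extracted train.tar.gz and test.tar.gz):

```python
# prepare_all.py -- sketch: run crop_videos.py once per language.
import subprocess

for language in ["French", "Spanish", "Arabic", "Chinese"]:
    subprocess.run([
        "python", "data_prepare/crop_videos.py",
        "--video_dir", f"Data/{language}",             # raw videos downloaded earlier
        "--save_path", f"ViSpeR/{language}",           # where the processed clips are written
        "--json_path", f"metadata/{language}.json",    # placeholder path to this language's metadata
        "--use_ffmpeg", "True",
    ], check=True)
```

The processed data will then be structured as follows: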
ViSpeR/
├── Chinese/
│   ├── video_id/
│   │   ├── 00001.mp4
│   │   └── 00001.json
│   └── ...
├── Arabic/
│   ├── video_id/
│   │   ├── 00001.mp4
│   │   └── 00001.json
│   └── ...
├── French/
│   ├── video_id/
│   │   ├── 00001.mp4
│   │   └── 00001.json
│   └── ...
├── Spanish/
│   ├── video_id/
│   │   ├── 00001.mp4
│   │   └── 00001.json
│   └── ...
Each video_id/xxxx.json file contains the 'label' of the corresponding video video_id/xxxx.mp4.
For English, you can refer to LRS3 and VoxCeleb-en.
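Once processed, each clip can be paired with its transcription by reading the JSON file next to it. A minimal loader sketch, assuming only the 'label' key described above:

```python
# list_pairs.py -- sketch: walk the processed ViSpeR tree and pair each clip with its label.
import json
from pathlib import Path

def iter_clips(root: str, language: str):
    """Yield (mp4_path, label) pairs for one language."""
    for json_path in sorted(Path(root, language).rglob("*.json")):
        mp4_path = json_path.with_suffix(".mp4")
        if not mp4_path.exists():
            continue
        label = json.loads(json_path.read_text())["label"]   # transcription of the clip
        yield mp4_path, label

for mp4_path, label in iter_clips("ViSpeR", "French"):
    print(mp4_path, "->", label)
    break   # print a single example
```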
The processed multilingual VSR video-text pairs are used to train a multilingual encoder-decoder model in a fully supervised manner. The supported languages are English, Arabic, French, Spanish and Chinese. For English, we leverage the combined 1,759 hours from LRS3 and VoxCeleb-en. The encoder has 12 layers and the decoder has 6; the hidden size, MLP size and number of attention heads are set to 768, 3072 and 12, respectively. A unigram tokenizer with a vocabulary size of 21k is learned over all languages. Results are presented below (WER, except CER for Chinese):
Language | VSR | AVSR |
---|---|---|
French | 29.8 | 5.7 |
Spanish | 39.4 | 4.4 |
Arabic | 47.8 | 8.4 |
Chinese | 51.3 (CER) | 15.4 (CER) |
English | 49.1 | 8.1 |
Model weights can be found on Hugging Face🤗 (a download sketch follows the table):
Languages | Task | Size | Checkpoint |
---|---|---|---|
en, fr, es, ar, zh | AVSR | Base | visper_avsr_base.pth |
en, fr, es, ar, zh | VSR | Base | visper_vsr_base.pth |
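The checkpoints can be fetched programmatically with huggingface_hub; the repository id below is a placeholder, so substitute the actual ViSpeR model repo:

```python
# fetch_checkpoint.py -- sketch: download a checkpoint from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<org>/ViSpeR",            # placeholder repository id
    filename="visper_vsr_base.pth",    # or visper_avsr_base.pth for the AVSR model
)
print("checkpoint saved at:", ckpt_path)
```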
Run evaluation on the videos using:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python infer.py \
ckpt_path=visper_vsr_base.pth \
data.modality=video infer_path=/path/to/files.npy \
infer_lang=[LANG]
To evaluate with the AVSR model, set data.modality=audiovisual and ckpt_path=visper_avsr_base.pth in the command above. [LANG] should be set to one of the five supported languages (arabic, chinese, french, spanish or english).
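infer_path points to a .npy file listing the clips to evaluate. Assuming it is simply an array of video file paths (check infer.py for the exact expected format), it can be built like this:

```python
# make_file_list.py -- sketch: build a .npy list of clips for infer.py.
# Assumes infer_path expects an array of video file paths; verify against infer.py.
import numpy as np
from pathlib import Path

files = sorted(str(p) for p in Path("ViSpeR/French").rglob("*.mp4"))
np.save("files.npy", np.array(files))
print(f"wrote {len(files)} paths to files.npy")
```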
To test on English, please get the data from WildVSR-en.
This dataset can be used to train models for visual speech recognition. It is particularly useful for research and development in audio-visual content processing, and for assessing the performance of current and future models.
Because the data was collected from YouTube, biases inherent to the platform may be present in the dataset. In addition, while measures were taken to ensure diversity in content, the dataset might still be skewed towards certain types of content due to the filtering process.
This repository is built using the espnet, fairseq, auto_avsr and avhubert repositories.
@article{narayan2024visper,
title={ViSpeR: Multilingual Audio-Visual Speech Recognition},
author={Narayan, Sanath and Djilali, Yasser Abdelaziz Dahou and Singh, Ankit and Bihan, Eustache Le and Hacid, Hakim},
journal={arXiv preprint arXiv:2406.00038},
year={2024}
}
@inproceedings{djilali2023lip2vec,
title={Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping},
author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and Boussaid, Haithem and Almazrouei, Ebtessam and Debbah, Merouane},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={13790--13801},
year={2023}
}
@inproceedings{djilali2024vsr,
title={Do VSR Models Generalize Beyond LRS3?},
author={Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and LeBihan, Eustache and Boussaid, Haithem and Almazrouei, Ebtesam and Debbah, Merouane},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={6635--6644},
year={2024}
}