less data preparation time #1

leo19941227 · 2021-09-20T23:38:17Z

As we discussed a long time ago, it would be nice to have one centralized LibriMix repo instead of one repo per task, with a separate data preparation script for each task.

Here I created one repo based on Jiatong's version, since his version has many more modifications from the official release, including the RTTM label files for SD. There are now two data preparation scripts:

generate_librimix_sd.sh
generate_librimix_ss.sh

I also made some changes to reduce the data preparation time, including ignoring train-clean-360 for both tasks and ignoring WHAM noise augmentation for SS. Furthermore, SD and SS now each have a specific setting in terms of the min/max condition and the mix_clean/mix_both condition, so I think the data preparation scripts can prepare only the specific setting we use for benchmarking. I believe these changes can save users some time waiting for the data to be ready.

Could you please take a look and check whether my changes fit your needs?
Thanks!!
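
For reference, the per-task settings that the two scripts end up using (as shown in the script diffs further down this thread) can be summarized in a small sketch. The dictionary layout and the helper name `build_args` are illustrative assumptions and are not code from the repo; only the flag names and values come from the diffs.

```python
# Flag values are taken from the generate_librimix_ss.sh and
# generate_librimix_sd.sh diffs in this PR. The dict and the helper
# below are hypothetical, written only to illustrate the settings.
BENCHMARK_SETTINGS = {
    "ss": {"freqs": ["16k"], "modes": ["min"], "types": ["mix_clean"]},
    "sd": {"freqs": ["16k"], "modes": ["max"], "types": ["mix_both"]},
}


def build_args(task):
    """Return the --freqs/--modes/--types arguments for one task."""
    settings = BENCHMARK_SETTINGS[task]
    args = []
    for flag in ("freqs", "modes", "types"):
        args.append(f"--{flag}")
        args.extend(settings[flag])
    return args


if __name__ == "__main__":
    for task in ("ss", "sd"):
        print(task, " ".join(build_args(task)))
```

Each task now prepares exactly one frequency, one mode, and one mixture type, which is where most of the saved time is expected to come from.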

leo19941227 · 2021-09-20T23:42:10Z

Sorry, I found a bug for SS. Let me look into this further.

leo19941227

Hi @ftshijt and @HuangZiliAndy ,

I made some updates, and now both scripts for generating mixtures work.
Please help me proofread the changes if you have time. Thanks!

leo19941227 · 2021-09-21T00:47:12Z

generate_librimix_ss.sh

+    --freqs 16k \
+    --modes min \
+    --types mix_clean


@HuangZiliAndy Please help me proof-read this! Thanks!

leo19941227 · 2021-09-21T00:50:35Z

scripts/create_librimix_from_metadata.py

+    print("[Warning] - train-clean-360 is ignored in create_librimix_from_metadata.py for less data preparation time."\
+        " Please note that in S3PRL we only use the train-clean-100 for downstream tasks.")
    md_filename_list = [file for file in os.listdir(metadata_dir)
-                        if 'info' not in file]
+                        if 'info' not in file and '360' not in file]


This skips the mixture creation for train-clean-360, which takes much more time.
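
A minimal standalone sketch of the filter in the diff above may help when checking which metadata files are kept. The function name `select_metadata_files` and the example file names are hypothetical; only the two substring conditions come from the diff.

```python
def select_metadata_files(filenames):
    """Keep mixture metadata files, skipping *info* files and anything
    whose name contains '360' (i.e. the train-clean-360 split)."""
    return [name for name in filenames
            if 'info' not in name and '360' not in name]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    example = [
        "libri2mix_train-clean-100.csv",
        "libri2mix_train-clean-100_info.csv",
        "libri2mix_train-clean-360.csv",
        "libri2mix_dev.csv",
        "libri2mix_test.csv",
    ]
    print(select_metadata_files(example))
```

Note that this is a plain substring match, so any metadata file with "360" anywhere in its name would also be skipped.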

leo19941227 · 2021-09-21T00:50:56Z

generate_librimix_sd.sh

@@ -79,5 +78,5 @@ for n_src in 2; do
    --n_src $n_src \
    --freqs 16k \
    --modes max \
-    --types mix_clean mix_both
+    --types mix_both


Can @ftshijt please help me proof-read this?
Thanks!

ftshijt · 2021-09-21

LGTM! Thanks
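
As a rough illustration of why trimming `--types` matters for SD, the sketch below counts setting combinations before and after the change. It assumes the generator writes one set of mixtures per combination of `--freqs`, `--modes`, and `--types`, which is an assumption about the script and is not confirmed in this thread; the helper name `count_variants` is hypothetical.

```python
from itertools import product


def count_variants(freqs, modes, types):
    """Count (freq, mode, type) combinations, assuming one output set
    is generated per combination."""
    return len(list(product(freqs, modes, types)))


if __name__ == "__main__":
    before = count_variants(["16k"], ["max"], ["mix_clean", "mix_both"])
    after = count_variants(["16k"], ["max"], ["mix_both"])
    print(f"SD variants before: {before}, after: {after}")
```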

less data preparation time

4c5259a

leo19941227 requested review from ftshijt and HuangZiliAndy September 20, 2021 23:38

noise must be presented even for mix_clean

4c6a5b6

leo19941227 merged commit c985b5b into master Sep 23, 2021

leo19941227 deleted the less-data-preparation-time branch September 23, 2021 04:54
