less data preparation time #1

Merged 2 commits on Sep 23, 2021

Conversation

leo19941227
Member

Hey @ftshijt and @HuangZiliAndy !

As we discussed a long time ago, it would be nice to have a centralized LibriMix repo instead of one repo per task, and to just use a different data preparation script for each task.

I created one repo based on Jiatong's version, since his version has many more modifications from the official release, including the RTTM label files for SD. There are now two data preparation scripts:

  • generate_librimix_sd.sh
  • generate_librimix_ss.sh

I also made some changes to reduce data preparation time, including skipping train-clean-360 for both tasks and skipping WHAM noise augmentation for SS. Furthermore, SD and SS now each have a specific setting for the min/max condition and the mix_clean/mix_both condition, so I think each data preparation script can prepare only the setting we use for benchmarking. I believe these changes will save users some time waiting for the data to be ready.

Could you please take a look and check whether my changes fit your needs?
Thanks!!
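
For reference, the per-task settings described above can be sketched as a small lookup table. This is only an illustration: the flag names (`--n_src`, `--freqs`, `--modes`, `--types`) are taken from the diffs quoted further down in this thread, while the mapping of each setting to SS or SD is an assumption inferred from the description, and `TASK_SETTINGS` and `build_args` are hypothetical names that do not exist in the repo.

```python
# Hypothetical summary of the benchmark settings discussed in this PR.
# Assumption: SS uses min/mix_clean and SD uses max/mix_both, inferred
# from the description and the review comments; check the scripts themselves.
TASK_SETTINGS = {
    "ss": {"freqs": ["16k"], "modes": ["min"], "types": ["mix_clean"]},
    "sd": {"freqs": ["16k"], "modes": ["max"], "types": ["mix_both"]},
}


def build_args(task, n_src=2):
    """Build the command-line argument list for one task's single setting."""
    settings = TASK_SETTINGS[task]
    args = ["--n_src", str(n_src)]
    for flag in ("freqs", "modes", "types"):
        args.append(f"--{flag}")
        args.extend(settings[flag])
    return args
```

Preparing only one mode and one mixture type per task is what saves the time: each extra value passed to `--modes` or `--types` adds another full pass of mixture generation.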

@leo19941227
Member Author

Sorry, I found a bug in the SS script; let me look into it further.

Member Author

@leo19941227 leo19941227 left a comment

Hi @ftshijt and @HuangZiliAndy ,

I made some updates, and both mixture-generation scripts now work.
Please help me proofread the changes if you have time. Thanks!

Comment on lines +79 to +81
--freqs 16k \
--modes min \
--types mix_clean
Member Author

@HuangZiliAndy Please help me proof-read this! Thanks!

Comment on lines +66 to +69
+ print("[Warning] - train-clean-360 is ignored in create_librimix_from_metadata.py for less data preparation time."\
+ " Please note that in S3PRL we only use the train-clean-100 for downstream tasks.")
  md_filename_list = [file for file in os.listdir(metadata_dir)
- if 'info' not in file]
+ if 'info' not in file and '360' not in file]
Member Author

This skips mixture creation for train-clean-360, which would take much more time.
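
A minimal standalone sketch of the same filter, for readers who want to try it outside the script. The function name `filter_metadata` and the `skip_360` switch are hypothetical; the substring checks mirror the list comprehension in the diff above.

```python
def filter_metadata(filenames, skip_360=True):
    """Keep metadata files, dropping '*info*' files and, optionally, train-clean-360.

    Mirrors the substring checks in the diff: a name is dropped if it
    contains 'info', or if skip_360 is set and it contains '360'.
    """
    kept = []
    for name in filenames:
        if "info" in name:
            continue
        if skip_360 and "360" in name:
            continue
        kept.append(name)
    return kept
```

It could be called as `filter_metadata(os.listdir(metadata_dir))`. Note that the check is a plain substring match, so any file name containing `360` is skipped, not only the train-clean-360 metadata.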

@@ -79,5 +78,5 @@ for n_src in 2; do
  --n_src $n_src \
  --freqs 16k \
  --modes max \
- --types mix_clean mix_both
+ --types mix_both
Member Author

Can @ftshijt please help me proof-read this?
Thanks!

LGTM! Thanks

@leo19941227 leo19941227 merged commit c985b5b into master Sep 23, 2021
@leo19941227 leo19941227 deleted the less-data-preparation-time branch September 23, 2021 04:54