Any chance of hosting the VoxCeleb datasets? #34

Open
NeuroForLunch opened this issue Aug 11, 2021 · 3 comments

Comments

@NeuroForLunch

The downloads from their site are very slow, the mirrors do not always work, and their Google Drive link is dead.

https://www.robots.ox.ac.uk/~vgg/data/voxceleb/

@yarikoptic
Member

wow -- https://www.robots.ox.ac.uk/~vgg/data is an awesome collection of datasets. I think collecting them under http://datasets.datalad.org/?dir=/labs/vgg (or maybe even straight on the top level?) would be worthwhile. Some are an easy job for the crawler. Running now

datalad crawl-init --save --template=simple_with_archives 'a_href_match_=.*/data/.*\.zip' url=https://www.robots.ox.ac.uk/~vgg/data/voxconverse/ leading_dirs_depth=0
datalad crawl
datalad install -d . -s https://github.com/joonson/voxconverse labels

to see what happens for the voxconverse one... You can see the result at https://github.com/yarikoptic/demo-vgg-voxconverse (I am not redistributing any data files there, so for this one "datalad get" will fetch the entire original archive from its original location), which I published there via

datalad create-sibling-github --github-login yarikoptic -s gh-yarikoptic demo-vgg-voxconverse
datalad push --to gh-yarikoptic  # after tuning url to be ssh since github no longer allows user/pw..

The voxceleb one is trickier due to all the split archives: our crawler can fetch them, but then we would really need to re-distribute the extracted files after manually "cat"ing the parts together.
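
For reference, the manual join step could look roughly like the sketch below. It assumes the parts of one archive share a common prefix and that their suffixes sort lexically in the right order (e.g. hypothetical names like vox1_dev_wav_partaa, vox1_dev_wav_partab); join_parts is just an illustrative helper, not something datalad or the crawler provides.

```shell
# join_parts PREFIX OUTPUT
# Concatenate every file whose name starts with PREFIX into OUTPUT.
# Assumption: the shell glob sorts the parts in the order they were split
# (true for suffixes such as aa, ab, ac, ...).
# Pick an OUTPUT name that does not itself start with PREFIX.
join_parts() {
    cat "$1"* > "$2"
}

# Hypothetical usage (file names are assumptions, not verified):
#   join_parts vox1_dev_wav_part vox1_dev_wav.zip
#   unzip vox1_dev_wav.zip -d vox1_dev_wav
```

Only after this step do the individual extracted files exist, which is why they would have to be re-distributed rather than simply pointed at their original URLs.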

(attn @joonson)
We would be glad to help ensure dissemination and easier access, but I am not sure we could host and re-distribute all of it from datasets.datalad.org, where we generally prefer not to mirror the data. Maybe we could/should provide re-distribution through the datalad-osf special remote, i.e. by depositing to OSF...

overall -- the chance exists, but needs thinking/time investment to make it happen. Interested to join the effort? ;-)

@NeuroForLunch
Author

> The voxceleb is trickier due to all the split archives, and our crawler can fetch them but then we really would need to re-distribute extracted files after manual "cat"ing them all together

It would be awesome to be able to download a certain number of files instead of the giant archives.

@yarikoptic
Member

> > The voxceleb is trickier due to all the split archives, and our crawler can fetch them but then we really would need to re-distribute extracted files after manual "cat"ing them all together
>
> It would be awesome to be able to download a certain number of files instead of the giant archives.

I do get the incentive and it should be possible
