Any chance of hosting the VoxCeleb datasets? #34

NeuroForLunch · 2021-08-11T20:18:07Z

Downloads from their site are very slow, the mirrors do not always work, and their Google Drive link is dead.

https://www.robots.ox.ac.uk/~vgg/data/voxceleb/

yarikoptic · 2021-08-12T14:25:14Z

wow -- https://www.robots.ox.ac.uk/~vgg/data is an awesome collection of datasets. I think collecting them under http://datasets.datalad.org/?dir=/labs/vgg (or maybe even straight at the top level?) would be worthwhile. Some are an easy job for the crawler. Running now

datalad crawl-init --save --template=simple_with_archives 'a_href_match_=.*/data/.*\.zip' url=https://www.robots.ox.ac.uk/~vgg/data/voxconverse/ leading_dirs_depth=0
datalad crawl
datalad install -d . -s https://github.com/joonson/voxconverse labels

to see what happens for the voxconverse one... You can see the result at https://github.com/yarikoptic/demo-vgg-voxconverse (I am not redistributing any data files there, so datalad get will fetch the entire original archive from its original location for this one), which I published there via

datalad create-sibling-github --github-login yarikoptic -s gh-yarikoptic demo-vgg-voxconverse
datalad push --to gh-yarikoptic  # after tuning url to be ssh since github no longer allows user/pw..

VoxCeleb is trickier because of all the split archives: our crawler can fetch them, but then we would really need to redistribute the extracted files after manually "cat"ing the parts together.
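
For reference, rejoining split parts is plain byte concatenation. The following is only a minimal, self-contained sketch and not the actual VoxCeleb procedure: it builds a dummy file, splits it, and rejoins it with cat. The file names (original.zip, demo_part*) and the sizes are made up for illustration; the real part names would have to be substituted.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for a large archive: 1 MiB of random bytes (hypothetical file name).
head -c 1048576 /dev/urandom > original.zip

# Split into 256 KiB pieces named demo_partaa, demo_partab, ... to mimic split archives.
split -b 262144 original.zip demo_part

# Rejoin: the shell expands the glob in sorted order, so the pieces are
# concatenated in the same sequence in which split produced them.
cat demo_part* > rejoined.zip

# Verify that the rejoined file is byte-identical to the original.
cmp original.zip rejoined.zip && echo "rejoined archive matches"
```

Only after such a rejoin could the archive be extracted and its individual files be tracked and redistributed.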

(attn @joonson) We would be glad to help ensure dissemination and easier access, but I am not sure we could host and redistribute all of it from datasets.datalad.org, where we generally prefer not to mirror data. Maybe we could/should provide redistribution through the datalad-osf special remote, i.e. by depositing to OSF...

Overall -- the chance exists, but it needs thinking/time investment to make it happen. Interested in joining the effort? ;-)

NeuroForLunch · 2021-08-12T20:29:40Z

> VoxCeleb is trickier because of all the split archives: our crawler can fetch them, but then we would really need to redistribute the extracted files after manually "cat"ing the parts together.

It would be awesome to be able to download a certain number of files instead of the giant archives.

yarikoptic · 2021-08-13T14:48:29Z

> > VoxCeleb is trickier because of all the split archives: our crawler can fetch them, but then we would really need to redistribute the extracted files after manually "cat"ing the parts together.
>
> It would be awesome to be able to download a certain number of files instead of the giant archives.

I do get the incentive, and it should be possible.
