> It's awesome that the dataset is offered with a CC-0 license: https://voice.mo...

archgoon · on July 2, 2018

Well, the download took longer than expected :).

Anyhow, here's a sample from the CSV file:

  filename,text,up_votes,down_votes,age,gender,accent,duration
  cv-valid-test/sample-001224.mp3,but i felt miserable watching him wither away like a shriveled dandelion,1,0,thirties,male,england,
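
For anyone who wants to poke at the metadata, here's a minimal sketch of reading rows like the one above with Python's standard `csv` module. It parses the sample from an inline string so it runs without the download. The `load_rows` helper and the choice to turn empty fields (like `duration` here) into `None` are my own assumptions, not anything from the dataset's tooling.

```python
import csv
import io

# Header and row copied from the sample above; for the real data you would
# pass an open file handle for the CSV instead of this inline string.
SAMPLE = (
    "filename,text,up_votes,down_votes,age,gender,accent,duration\n"
    "cv-valid-test/sample-001224.mp3,but i felt miserable watching him wither away "
    "like a shriveled dandelion,1,0,thirties,male,england,\n"
)


def load_rows(handle):
    """Hypothetical helper: parse clip metadata from an open CSV handle.

    Empty fields become None and the vote counts become ints.
    """
    rows = []
    for raw in csv.DictReader(handle):
        row = {key: (value if value != "" else None) for key, value in raw.items()}
        row["up_votes"] = int(row["up_votes"] or 0)
        row["down_votes"] = int(row["down_votes"] or 0)
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
print(rows[0]["filename"], rows[0]["accent"], rows[0]["duration"])
```

With the real file, something like `load_rows(open(path, newline="", encoding="utf-8"))` should work the same way, assuming the file is UTF-8 and every row follows this header.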

Not sure how some of these fields are being populated, but yeah, there are several additional folders, including one for invalid mp3s, a splintered train set (not sure how it was selected), and a test set folder.

Here's the README.txt. Looks cool! Have happy hacky fun! :)

https://gist.github.com/cwgreene/f7f4df4ddcd9da017b9f4694b3f...

Interestingly, many of the 'invalid' mp3s are actually (mostly) correct. It's fun to listen to them and guess why they were downvoted.

punchingwater · on July 2, 2018

We also keep the README in the repo: https://github.com/mozilla/voice-web/blob/master/docs/corpus...

archgoon · on July 2, 2018

Thanks! I couldn't find the source of the README in the zip file. Can you talk about the update process for this file? How often is it updated? Is there a way to download just the new files? Is there a tarball script for this in the repo somewhere?

I see that you have instructions for S3. Are the files actually backed by S3? Is it possible to download them from S3 (possibly using requester pays)?

punchingwater · on July 3, 2018

We have no plans to allow users to download the "raw" data from S3 (i.e., before we perform the train/dev/test split). But we want to eventually build some tools to automate this. See here for some background:

https://discourse.mozilla.org/t/the-mozilla-guarantee-publis...

Rodd45 · on July 2, 2018

Can't wait to try it either.