
> It's awesome that the dataset is offered with a CC-0 license: https://voice.mozilla.org/en/data. Does anyone know if it includes the answers from the survey?

I'm downloading it now, I'll have an answer in a half hour. Does anyone know if there is a torrent for it?




Well, download took longer than expected :).

Anyhow, here's a sample from the csv file:

  filename,text,up_votes,down_votes,age,gender,accent,duration
  cv-valid-test/sample-001224.mp3,but i felt miserable watching him wither away like a shriveled dandelion,1,0,thirties,male,england,

Not sure how some of these fields are being populated, but yeah, there are several additional folders, including one of invalid mp3s, a splintered train set (not sure how it was selected) and a test set folder.
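
If anyone wants to poke at the metadata, here's a minimal sketch of reading that sample with Python's csv module. The column names come from the header row above; the inline string just stands in for the real CSV in the archive (I'm not assuming its file name), and `read_clips` is a made-up helper name. Treating an empty vote field as 0 is also my assumption.

```python
import csv
import io

# Header and row copied from the sample above. In practice you would
# open the CSV shipped in the archive instead of this inline string.
SAMPLE = (
    "filename,text,up_votes,down_votes,age,gender,accent,duration\n"
    "cv-valid-test/sample-001224.mp3,but i felt miserable watching him wither away "
    "like a shriveled dandelion,1,0,thirties,male,england,\n"
)

def read_clips(handle):
    """Yield one dict per clip (hypothetical helper, not part of the dataset)."""
    for row in csv.DictReader(handle):
        # Empty fields (such as duration in the sample) become None.
        clip = {key: (value if value != "" else None) for key, value in row.items()}
        # Assumption: a missing vote count is treated as 0.
        clip["up_votes"] = int(clip["up_votes"] or 0)
        clip["down_votes"] = int(clip["down_votes"] or 0)
        yield clip

clips = list(read_clips(io.StringIO(SAMPLE)))
```

To read the real file, pass `open(path, newline="", encoding="utf-8")` to `read_clips` instead of the `StringIO` object.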

Here's the README.txt. Looks cool! Have happy hacky fun! :)

https://gist.github.com/cwgreene/f7f4df4ddcd9da017b9f4694b3f...

Interestingly, many of the 'invalid' mp3s are actually (mostly) correct. It's fun to listen to them and guess why they were downvoted.



Thanks! Couldn't find the source of the Readme in the zipfile. Can you talk about what the update process for this file is? How often is it updated? Is there a way to just download the new files? Is there a tarball script for this in the repo somewhere?

I see that you have instructions for s3. Are the files actually backed by s3? Is it possible to download them from s3 (possibly using requester pays)?


We have no plans to allow users to download the "raw" data from s3 (i.e. before we perform the train/dev/test split). But we want to eventually build some tools to automate this. See here for some background:

https://discourse.mozilla.org/t/the-mozilla-guarantee-publis...


Can't wait to try it either.



