
Labeled data is indeed a problem. The only sizable corpus I know of is TIMIT, which costs $300 and, I think, prohibits commercial use. That said, phonetic labeling is becoming less important thanks to designs like this...

I wonder if you could bootstrap a sizable speech dataset by trawling audio off YouTube and then using one of the really good cloud speech recognition services to label it. :)




IMHO, the TIMIT corpus should no longer be used in most application-driven speech recognition research, as it’s small and completely unrealistic for any real-world application. Furthermore, nobody cares about phone error rates, as recognizing phones is not the ultimate goal.

Much better, larger datasets have been available for a long time. For example, the Fisher English conversational telephone speech corpus was released in 2004 and contains ~1950h of transcribed speech. There are tons of other datasets in various languages and for various applications (conversational speech, broadcast transcription, etc.).


Isn't there some value in being able to benchmark acoustic models in isolation, no matter how weak they may be, without downstream language models?


The labeled data is $300? That's basically free, even for somebody who's just a serious hobbyist, much less any funded public or private research group.

Edit: It's even less [1]:

    $0.00 1993 Member
    $250.00 Non-Member
    $125.00 Reduced-License
[1]: https://catalog.ldc.upenn.edu/LDC93S1



