Gnuspeech (gnu.org)
156 points by conductor on Oct 20, 2015 | 55 comments



Steps to get this running on Linux, without messing with GNUstep:

    wget https://ftp.gnu.org/gnu/gnuspeech/gnuspeechsa-0.1.5.tar.gz
    tar -zxvf gnuspeechsa-0.1.5.tar.gz
    cd gnuspeechsa-0.1.5/
    mkdir build
    cmake -D CMAKE_BUILD_TYPE=Release ..
    make
    ./gnuspeech_sa -c ../data/en -p /tmp/test_param.txt -o /tmp/test.wav "Hello world." && aplay -q /tmp/test.wav
This uses Gnuspeech_SA, the cross-platform version. No idea how to get it to use a voice other than the default male one.


Shouldn't there be a "cd build" command just after "mkdir build"?


Yes, thanks. Unfortunately I can't edit my comment anymore.


    wget -O- https://ftp.gnu.org/gnu/gnuspeech/gnuspeechsa-0.1.5.tar.gz | tar -xz
    cd gnuspeechsa-0.1.5
    cmake .
    cmake --build .
    ./gnuspeech_sa -c data/en -p /tmp/test_param.txt -o /dev/stdout "Hello world." | aplay

    $VISUAL data/en/ttm_control_model.config


cc1plus: error: unrecognized command line option ‘-std=c++11’

apparently I need a newer gcc


In case it helps in finding the right version: that flag was added in gcc 4.7.
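
If it helps to check what you have: g++ --version will tell you, and on gcc 4.3 through 4.6 the same dialect was spelled -std=c++0x, so that may work as a stopgap (test.cpp below is just a placeholder file name):

    g++ --version                 # -std=c++11 needs gcc >= 4.7
    g++ -std=c++0x -c test.cpp    # older spelling, accepted by gcc 4.3-4.6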


Can anyone in the field comment on what the best offline TTS available today is? This project is old, hardly maintained, and the samples sound horrible. Festival is... not very good, but 'acceptable'. The MS built-in TTS engines are OK-ish. Dragon has some TTS, but it seems hard to use outside the use cases it was designed for, and I'm not sure they're even using their own TTS engine; I haven't had a chance to try it out myself. Any major ones I'm missing? And just how much better are the online ones, like Google's? Does anyone have any quantitative data on that?


AFAIK there are a few that produce results that cannot be distinguished from a real person. None of the really good ones are open source, and the best ones I have heard of are not even for sale.

The best implementations confer a market advantage, so they are well guarded. We don't know about the best implementations because we haven't noticed a thing. For example, some phone operators have replaced their customer service with TTS/STT solutions. Because people tend to lock up when they realize they are talking with a computer, they have had to make those systems sound very natural.

I know of a few pretty crappy ones, but a few are plain spooky. Customers who joke and flirt with the computer, hoping for an emotional response, are probably the most likely to notice them.

Then there's the case where the US intelligence services demonstrated their capabilities to a politician (senator/congressman) by recording him and producing a voice clip of him saying something like "death to America", so convincingly that no one could distinguish it from the real speaker. It also seemingly passed further voice analysis. Google it, it's a pretty interesting read.


> I know of a few pretty crappy ones, but a few are plain spooky. Customers who joke and flirt with the computer, hoping for an emotional response, are probably the most likely to notice them.

I got a call two years ago from a telemarketer who kept asking me yes/no questions in a robotic voice and didn't leave me any space in the conversation to say anything. I really felt like I was talking to a computer, so I tried to Turing-test him by forcing him to answer an open-ended question. It took two minutes, and the caller turned out to be human after all. That wasn't very reassuring, though. Humans shouldn't speak with such a zombified voice.


>Humans shouldn't speak with such a zombified voice.

Why. Not. Let. The. Telemarketer. Have. Some. Fun. At. Their. Crappy. Job.


You spoke to Daniel Suarez's Daemon!


You claim that some are not open source and not for sale; for what purpose are they made, then?


I guess they are used in services that sell voice recordings, are developed for internal use, or are born in research.


For English (and I think Japanese), HTS with the STRAIGHT vocoder can be pretty amazing. Very amazing. Licensing on both of those pieces is problematic. (Edit: I should probably mention that HTS is not easy to use. It will require some mental fortitude.)

Festival can produce very good results, but you have to get into the weeds with it and do a lot of planning. If you don't like Scheme, you're not going to like dealing with Festival, and if you do like Scheme, you're going to be frustrated by their particular Scheme.
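
For a quick taste before committing to the Scheme side, the text2wave script that ships with Festival gives a one-line pipeline. A minimal sketch (the voice name below is just an example of an installed HTS voice; yours may differ):

    # Synthesize with the default voice
    echo "Hello world." | text2wave -o hello.wav

    # Pick a specific installed voice via an eval'd Scheme form
    echo "Hello world." | text2wave -eval "(voice_cmu_us_slt_arctic_hts)" -o hello2.wav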

I've also noticed a lot of folks looking at post filtering, but that is a sea of dragons at the moment.


Assuming that for their demo at http://flite-hts-engine.sp.nitech.ac.jp/ they would have selected a good data set and parameters, I have to say that I'm not particularly impressed. Do you know of any online examples of a well-tuned Festival setup? I didn't find the 'official' Festival or Festvox examples especially great, certainly worse than the TTS services available online, and I have never been able to find an example of a well-tuned Festival result.


I don't know of any online examples, sorry. The problem is that a well-aligned Festival voice is not academically interesting, so people generally don't publish it. Festival performs well when it can rely solely on unit selection and has multiple choices for the same phoneme sequence. Once it has to do some guessing on the synthesis, you'll see a drop in quality.

With a small bit of work, you can have Festival use an HTS voice directly. I usually use Festival to generate my label files, post-filter the phoneme timing, synthesize with HTS, then apply a post-filter.
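
For the HTS leg of that pipeline, a rough sketch using the hts_engine command-line tool (the voice model and label file names here are hypothetical, and the exact flags vary a bit between hts_engine versions):

    # Synthesize a WAV from a full-context label file
    hts_engine -m voice.htsvoice -ow output.wav input.lab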


Flite is not Festival. It is a very cut-down version for embedded machines.


Flite lacks intonation on questions and such.

It will work better on the Zipit Z2 and Pentium 2-class machines.

Enough for listening to short stories, but it's not suited to listening to a full book from your bed.


Yes I know, but the GP commented on both Flite and Festival, and I addressed both comments - albeit, I'll admit, in a tangled way.


It is worth mentioning that STRAIGHT requires MATLAB.



Mary sounds quite good. Apparently, if you know how to write input files in the correct format, you can get it to speak much more like a real human by including stress/speed/tone/etc. variations too.
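
For reference, a sketch of driving a locally running MaryTTS server over its HTTP interface, assuming the default port 59125; the stress/speed/tone variations mentioned above would go in as INPUT_TYPE=RAWMARYXML markup rather than plain TEXT:

    # Plain text to WAV via the MaryTTS HTTP server
    curl -G "http://localhost:59125/process" \
         --data-urlencode "INPUT_TEXT=Hello world." \
         --data "INPUT_TYPE=TEXT&OUTPUT_TYPE=AUDIO&AUDIO=WAVE_FILE&LOCALE=en_US" \
         -o hello.wav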


>http://espeak.sourceforge.net/

Can't confirm how well espeak works, since I haven't tried it, but I have tried pyttsx on Windows. (pyttsx is a Python library which uses SAPI5 on Windows and espeak on Linux.)

Worked okay for me. Not tested a lot though. Here's my simple Python snippet for trying pyttsx on Windows:

http://code.activestate.com/recipes/578839-python-text-to-sp...

(A commenter on the above ActiveState recipe said my pyttsx recipe worked fine for him with espeak on Crux Linux.)

And more details on the same here:

http://jugad2.blogspot.in/2014/03/speech-synthesis-in-python...

"The pyttsx library is a cross-platform wrapper that supports the native text-to-speech libraries of Windows and Linux at least, using SAPI5 on Windows and eSpeak on Linux."

Also, check out the synthesized train announcer's voice in my blog post above - somewhat eerie and cool :)


And a similar small trial of the reverse problem - speech recognition, using the Python 'speech' library:

http://jugad2.blogspot.in/2014/03/speech-recognition-with-py...


Perfect moment to mention my little project.

http://simulationcorner.net/index.php?page=sam

I have to admit that the quality is not good, but it also works on embedded systems.


The TTS engine built into OS X works offline and is good. In my opinion it's as good as any of the online ones for most uses. But the OS X license limits it to "personal, non-commercial use". I'm not sure whether it's possible to get the same technology under a commercial license. Some of the voices are licensed from Nuance (parent company of Dragon), but it's not clear whether the engine is too.

Some better free-software offerings would really open this area up for experimentation, since the commercial offerings tend to be pretty black-box and targeted at very specific use-cases.


Last time I looked, the stock Festival voices were intentionally not production quality - they were demonstrations of how well the Festvox automated voice-building scripts worked without manual tweaking. They're basically academic research that the researchers let everyone use. If you want high-quality voices you'll have to pay for them, because so far no one who's put in the work is willing to give it away for free.


Amazon acquired Ivona, a Polish TTS company.


My knowledge is probably 10 years out of date, but the Asterisk guys used Cepstral.


And GNU drops the ball again...missing a perfect opportunity to call one of their projects "GNUSpeak". ;-)


Egads! It was so obvious that I kept calling it GNUSpeak in my head, enjoying the cleverness of it all, and only now, following your comment, noticed it's actually GNUSpeech...



Thanks, I was looking for some.

Also, holy crap! http://pages.cpsc.ucalgary.ca/~hill/extra-synthesis-examples...


GNU vocaloids?


If you're on a Mac, you can have some fun with voices (plenty designed for multiple languages):

    say -v ?
Or just run this script to pass hours of time in joy and terror: https://gist.github.com/DelvarWorld/3f700aac8d7972b053f0
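
A couple of concrete examples to start with (Alex and Zarvox ship with most OS X releases, but the installed voice set varies by version):

    say -v Alex "Hello world."
    say -v Zarvox "Greetings, human."
    say -v Alex -r 90 "And -r changes the speaking rate in words per minute."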


Honestly, this sounds absolutely terrible. Why is the quality so much worse than, say, Google's TTS running on my phone?


It's probably the best technology that the early 90s had to offer.

FTA:

    gnuspeech is currently fully available as a
    NextSTEP 3.x version in the SVN repository
    along with the Gnu/Linux/GNUStep version,
    which is incomplete though functional
Wikipedia[1]:

    3.0 September 8, 1992
    3.1 May 25, 1993
    3.2 October 1993
    3.3 February 1995
[1] https://en.wikipedia.org/wiki/NeXTSTEP#Release_history


From what I can tell, gnuspeech is fully synthesized speech. Siri, for example, uses a voice actor to say tens of thousands of utterances, and in order to generate words/phrases that don't exist in the recordings, it concatenates (more or less) segments from the recorded phrases.


I think it may sound worse because it's based on a more accurate and complex model of the human voice; while it sounds worse now, it has the potential to improve and perhaps surpass implementations using more widely used methods.

Can the voice synthesis method used on Android sing, scream, or make other sounds that are not speech?


It does sound...vintage. The male voice (especially in The Chaos example) sounds exactly like the computer's voice in WarGames.


That's the price you pay for ideological freedom.


I was recently looking at voice generation. The newer Festival is better than the one that's probably in your package manager.

http://www.cstr.ed.ac.uk/projects/festival/morevoices.html

I wonder if we will see any of the deep learning libraries try voice recognition (a good open source implementation is lacking) or voice generation.


Do you mean: speech to text (called "automatic speech recognition"), or speech to the identity of the speaker (called "speaker ID")?

(Also, "voice generation" is called "TTS" or "text to speech")

There are multiple world-class open ASR systems using DNNs. Check out Kaldi.


I am not sure of the source of the confusion. I meant TTS in the first part of my comment ("voice generation" isn't far from "speech synthesis", which is the title of the Wikipedia article for it in English) and both ASR and TTS in the second. Regarding ASR, I hadn't heard of Kaldi; thanks for the tip.


According to the FAQ, the voices used on that demo page aren't currently publicly available: http://www.cstr.ed.ac.uk/projects/festival/demofaq.html#voic... (Also, the FAQ seems to imply that you'll have to pay to use them commercially when they are.)


It might be just me, but I can't figure out how to even start compiling this to test it out on Linux. That said, I'm always happy to see alternatives to Festival TTS. Hopefully it'll become a bit more user-friendly in time.


The GNU project has _so much stuff_ in it. It seems like I'm always hearing about something I didn't know about before.


Very true, though the first release of this package just came out a few days ago.


Apparently it's been part of GNU for 13 years... but never had an official release?

https://en.m.wikipedia.org/wiki/Gnuspeech


Interesting! Thanks! Also, I interpreted the statement about the v1.0 release as being the initial release; perhaps there was a v0.1 release before that...


This seems super interesting! They say that they are using a novel approach that models the actual way humans create sounds. Is this completely new? I would assume this implies that you can reuse so much more among languages, and have a more flexible system - like getting it to shout or sing, making strange noises, whistling, etc. But I don't know a lot about it. Is what I'm saying Science Fiction?

This is the kind of article where I'd love (and have almost come to expect!) to find comments that provide so much more breadth and depth of information than the actual link, here on HN; this is one of the things that makes this community so great.


The idea is actually quite old (as they half-way admit, this is based on a program that was developed on NeXT machines when those were current; the research goes back to the late 60s and 70s). Wikipedia has some background naturally: https://en.wikipedia.org/wiki/Speech_synthesis#Articulatory_...

There is also research into using this type of physical modeling for singing[1] and laughing[2], so it's not far-fetched to imagine a model that can speak, sing, laugh, shout, and perform any of the various vocal articulations humans can make. The main obstacles seem to be figuring out how to control the model over time, how to smoothly transition between control states, and how to capture enough fine detail to cross the uncanny valley.

[1] http://www.cs.princeton.edu/~prc/SingingSynth.html

[2] https://ccrma.stanford.edu/groups/mcd/publish/files/2013-nim...


> The idea is actually quite old (as they half-way admit, this is based on a program that was developed on NeXT machines when those were current; the research goes back to the late 60s and 70s).

I don't think linking the papers that form the basis of the work, and describing the history in fair detail (including the original development for the NeXT, and the technical change from real-time DSP-based synthesis to real-time CPU-based synthesis enabled by CPU speed improvements since the NeXT was current), constitutes "half-way admitting" that the idea has been around for a while. It's pretty explicit.


Requires GNUstep on Linux? This is very baked into the NeXT world, it would appear.


Unar uses GNUstep too, and it's not the same as a full GNUstep desktop setup. It uses far fewer dependencies.



