Hacker News

Does anyone else find that the text-to-speech on the "Tauri in 100 Seconds" video sounds very unnatural?

Tauri seems like a step in the right direction so will be giving it a go.



It's not text-to-speech, it's the video (and voice) of a YouTuber (Fireship) who makes a lot of "in 100 Seconds" videos about all kinds of frameworks, programming languages, etc.


I find it surprisingly hard to judge, because the edit style is so obnoxious and poorly executed. I think it’s done by a human, but the speech and editing are both sufficiently lousy that I’m not confident from this video alone—though another couple of videos I tried were definitely somewhat better. The aggressive cutting style is just bad editing, and the speaker has spoken each phrase independently with no attempt to bridge them (which a competent speaker would do). The first ten seconds are particularly grating, with thoroughly unnatural emphasis on the start of most syllables, almost as though each word or even syllable had been spoken independently and then glued together (e.g. And, A, Light, Weight, Rust, Back, End). The prosody is also regularly quite a bit off, and it’s hard to determine if that’s related to the bad editing and speech segmentation, or an independent issue. There’s enough that feels natural in things like intonation that I don’t think it’s TTS, but it’s also making a lot of the sorts of errors that even the best TTS engines habitually make.

I get the impression that it’s a human who doesn’t know how to speak or edit particularly well but normally gets away with it not being too bad (as I say, other videos seem to be better, though they still suffer from the aggressive cut style), but compromised harshly on quality in this case in order to squish more into the hundred seconds… and still ended up 50% over time.

I very strongly dislike the general style.


Different strokes. I love Jeff's style and video pacing, though I don't think this is one of his best.


I guess that happens when you try to fit everything in 100 seconds.

Also, it will become almost literally impossible to hear the difference in the near future, so I personally wouldn't keep assuming it's one or the other.

Also also, did you give this feedback to Jeff?


> I guess that happens when you try to fit everything in 100 seconds.

What, you overshoot by 50%? :-) But more seriously, the job that has been done is simply low-quality, even disregarding the choppy edit style—probably a mixture of shoddy work (from comparing it with some of the others), and the creator not being capable of better as regards the speech—most aren’t particularly aware of how to improve. Much better is readily possible.

> Also, it will become almost literally impossible to hear the difference in the near future

I am an excellent reader, highly regarded for my diction and prosody and for conveying the sense of a matter. So far, I haven’t heard a TTS demo, hand-picked or otherwise, that would stand a chance against a skilled reader: the veriest dunce would savvy which was which within a couple of sentences. (And at least a couple of the demos were deliberately trying to address these sorts of shortcomings in inflection and such.) Certainly they’ve reached the stage where you often can’t reliably identify the computer versus the mediocre reader (and there are an awful lot of them), but I don’t think we’ll see computers beating skilled readers and speakers any time soon, when I consider how poor a job AI is still doing on coherent longform writing, and then add the degrees of nuance and information conveyed in speech. (Supplant them? Sure. But you don’t have to be better than something to supplant it, just cheaper, or some such thing. My favourite example of this is book binding where hot melt glue is vastly inferior to cold glue, but it’s much faster and thus cheaper to produce with, and it’s good enough that you’ll struggle to find cold glue ever used in production now.)


I guess it's been processed to remove gaps, which makes it sound unnatural and, to me as well as GP, antagonizing.


Surely that's not a human voice?!


Like a lot of newer video content creators, this author has perfected their ability to speak exactly like a text-to-speech synth does: the same inflection over sentences, no human variability between lines, a lot of post-processing.


What would drive this? I've noticed more videos like this recently and it has an uncanny valley effect that makes me shut it off almost immediately.


Maybe TikTok TTS is now the standard for vocal communication :( If I ever hear "the young people" talking like this to each other when I'm out shopping, I think I will retire and just move off the grid.


That should be a subgenre of popping-and-locking dance competitions: you also have to sound like a TTS.

The AGIs should be concerned about us, not the other way around.


I am the same way, but I've noticed folks 10-15 years younger than me don't mind it at all. Perhaps it's just some weird internet culture thing that I'm too crusty to understand :)


I literally closed the tab after about 8 seconds of it. It's unbearable.


lol, Jeff (the video creator and narrator) will probably take this as a compliment.


The introduction being a video, and only a video, is already a major annoyance. I don't suppose the text is available for reading somewhere?



