Does anyone else find that the text-to-speech on the "Tauri in 100 Seconds" vide...

j0ma · on June 16, 2022

It's not text-to-speech; it's the video (and voice) of a YouTuber (Fireship) who makes a lot of "in 100 Seconds" videos about all kinds of frameworks, programming languages, etc.

chrismorgan · on June 16, 2022

I find it surprisingly hard to judge, because the edit style is so obnoxious and poorly executed. I think it’s done by a human, but the speech and editing are both sufficiently lousy that I’m not confident from this video alone—though another couple of videos I tried were definitely somewhat better. The aggressive cutting style is just bad editing, and the speaker has spoken each phrase independently with no attempt to bridge them (which a competent speaker would do). The first ten seconds are particularly grating, with thoroughly unnatural emphasis on the start of most syllables, almost as though each word or even syllable had been spoken independently and then glued together (e.g. And, A, Light, Weight, Rust, Back, End). The prosody is also regularly quite a bit off, and it’s hard to determine if that’s related to the bad editing and speech segmentation, or an independent issue. There’s enough that feels natural in things like intonation that I don’t think it’s TTS, but it’s also making a lot of the sorts of errors that even the best TTS engines habitually make.

I get the impression that it’s a human who doesn’t know how to speak or edit particularly well but normally gets away with it not being too bad (as I say, other videos seem to be better, though they still suffer from the aggressive cut style), but compromised harshly on quality in this case in order to squish more into the hundred seconds… and still ended up 50% over time.

I very strongly dislike the general style.

syzygyhack · on June 16, 2022

Different strokes. I love Jeff's style and video pacing, though I don't think this is one of his best.

jeffhuys · on June 16, 2022

I guess that happens when you try to fit everything in 100 seconds.

Also, it will become almost literally impossible to hear the difference in the near future, so I personally wouldn't keep assuming it's one or the other.

Also also, did you give this feedback to Jeff?

chrismorgan · on June 16, 2022

> I guess that happens when you try to fit everything in 100 seconds.

What, you overshoot by 50%? :-) But more seriously, the job that has been done is simply low-quality, even disregarding the choppy edit style—probably a mixture of shoddy work (from comparing it with some of the others), and the creator not being capable of better as regards the speech—most aren’t particularly aware of how to improve. Much better is readily possible.

> Also, it will become almost literally impossible to hear the difference in the near future

I am an excellent reader, highly regarded for my diction and prosody and for conveying the sense of a matter. So far, I haven’t heard a TTS demo, hand-picked or otherwise, that would stand a chance against a skilled reader: the veriest dunce would savvy which was which within a couple of sentences. (And at least a couple of the demos were deliberately trying to address these sorts of shortcomings in inflection and such.) Certainly they’ve reached the stage where you often can’t reliably identify the computer versus the mediocre reader (and there are an awful lot of them), but I don’t think we’ll see computers beating skilled readers and speakers any time soon, when I consider how poor a job AI is still doing on coherent longform writing, and then add the degrees of nuance and information conveyed in speech. (Supplant them? Sure. But you don’t have to be better than something to supplant it, just cheaper, or some such thing. My favourite example of this is book binding where hot melt glue is vastly inferior to cold glue, but it’s much faster and thus cheaper to produce with, and it’s good enough that you’ll struggle to find cold glue ever used in production now.)

cachehit · on June 16, 2022

I guess it's been processed to remove gaps, which makes it sound unnatural and, to me as well as GP I guess, antagonizing.

drcongo · on June 16, 2022

Surely that's not a human voice?!

krageon · on June 16, 2022

Like a lot of newer video content creators, this author has perfected their ability to speak exactly like a text-to-speech synth does: the same inflection over sentences, no human variability between lines, a lot of post-processing.

jamincan · on June 16, 2022

What would drive this? I've noticed more videos like this recently and it has an uncanny valley effect that makes me shut it off almost immediately.

stjohnswarts · on June 16, 2022

Maybe TikTok TTS is now the standard for vocal communication :(. If I ever hear "the young people" talking like this to each other when I'm out shopping I think I will retire and just move off the grid.

sitkack · on June 16, 2022

That should be a subgenre of popping-and-locking dance competition: you also have to sound like a TTS.

The AGIs should be concerned about us, not the other way around.

krageon · on June 16, 2022

I am the same way, but I've noticed folks 10-15 years younger than me don't mind it at all. Perhaps it's just some weird internet culture thing that I'm too crusty to understand :)

drcongo · on June 16, 2022

I literally closed the tab after about 8 seconds of it. It's unbearable.

bil7 · on June 16, 2022

lol, Jeff (the video creator and narrator) will probably take this as a compliment.

zeta0134 · on June 16, 2022

The introduction being a video, and only a video, is already a major annoyance. I don't suppose the text is available for reading somewhere?