Ironically you seem to be focusing too much on the exact words and phrases I used rather than the deeper meaning. So let's just get completely away from words like "knows" and "understanding" which seem to be tripping multiple people up.
> It's just a substring match.
Let's just say this is true. That is a super simple process, but what would it look like?
Step 1: Transcribe the audio into text
Step 2: Run substring match on text
The transcribing/closed captioning feature only does step 1. This shows that a step 2 is possible. I think you would have to be naive to think the capability to do this type of analysis on the transcribed text was designed for only this feature and would never be used for anything else. This feature is announcing that Youtube isn't merely creating transcripts of the audio in videos; it is running some unknown amount of analysis on that data.
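To make that concrete, step 2 could be as trivial as a substring check over the transcript text. A rough sketch (the transcript and trigger phrase here are made up; nobody outside Google knows what the real pipeline looks like):

    // Step 1's output: a transcript string (shape assumed for illustration)
    const transcript = "thanks for watching, don't forget to like and subscribe";

    // Step 2: a plain substring match over that text
    const triggered = transcript.includes("like and subscribe");
    console.log(triggered); // true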
As I said in my original comment, "it isn't that I didn't know Google had this ability", but this is literally a glowing sign pointing to this fact. I think the danger of Google reminding people of this has the potential to outweigh the benefit of the "that's cute" reaction that this is designed to elicit.
> I think you would have to be naive to think the capability to do this type of analysis on the transcribed text was designed for only this feature and would never be used for anything else.
Why would you have to be naive to believe this? The subtitles, with timings, are available on the client side already. You seem to be implying that this would require some sort of deep analysis work. I think it's really more like 5 lines of JS, and 4 of them are producing the fun animated gradient :P
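Something in the spirit of those 5 lines might look like this; the cue format, the trigger phrase, and the scheduleLikeButtonAnimation helper are all invented for illustration, since I have no idea what the actual player code does:

    // Hypothetical shape of the caption track the player already has client-side:
    // an array of cues with start times (in seconds) and text
    const cues = [
      { start: 12.4, text: "welcome back to the channel" },
      { start: 304.7, text: "if you enjoyed this, leave a like" },
    ];

    // Stand-in for whatever plays the animated gradient at that timestamp
    function scheduleLikeButtonAnimation(seconds) {
      console.log(`would animate the like button at ${seconds}s`);
    }

    // Find the first cue containing the trigger phrase and schedule the effect there
    const hit = cues.find(cue => cue.text.includes("leave a like"));
    if (hit) scheduleLikeButtonAnimation(hit.start);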
It isn’t about this requiring “deep analysis work”. It is the difference between some analysis and no analysis. That increase from 0 to 1 is always the biggest hurdle to clear when it comes to any corporate behavior like this.
This feature is like walking into your kitchen one day to find a dead cockroach. I’m saying that is an indication of a cockroach problem while you’re effectively responding with “it’s fine, the one cockroach is dead and there is no reason to believe there are any others.”
> This feature suggests Google understands what was said.
Running .includes() on the client does not imply that Google has any "understanding" of what was said. It only implies that they ran .includes() on the client. includes() does not "understand" anything.
The thing I really don't understand is this: the fact that Google has closed captions at all implies they do enormously more "analysis" than this minor feature could possibly require. If you understand how Google does CCs and what that means, this shouldn't have bothered you at all.
In your analogy, it's like you see a mound of a hundred thousand cockroaches, but you're worried about a dust speck in another room.
>Running .includes() on the client does not imply that Google has any "understanding" of what was said. It only implies that they ran .includes() on the client. includes() does not "understand" anything.
Do you want to have a good faith conversation about this? Because going back to debating the meaning of "understanding" after I already said this was misleading is not a good way to have a conversation.
>The thing I really don't understand is this: the fact that Google has closed captions at all implies they do enormously more "analysis" than this minor feature could possibly require. If you understand how Google does CCs and what that means, this shouldn't have bothered you at all.
Can we set up a baseline that there is a difference between content agnostic analysis and content aware analysis? Transcripts are content agnostic in that they can be produced without any comprehension of the words said. This feature is content aware in that it is looking for specific meaning in the words said. Do you not see any difference between these two?
> Do you want to have a good faith conversation about this? Because going back to debating the meaning of "understanding" after I already said this was misleading is not a good way to have a conversation.
Call it "understanding", call it "content aware analysis". I guarantee that their closed captioning service has much more of that quality than this new feature does.
> Can we set up a baseline that there is a difference between content agnostic analysis and content aware analysis? Transcripts are content agnostic in that they can be produced without any comprehension of the words said. This feature is content aware in that it is looking for specific meaning in the words said. Do you not see any difference between these two?
Again, I don't see it. CCs are not content agnostic: they have to have semantic understanding of the words said in order to produce accurate results. How do you think CCs differentiate between the words "to", "too" and "two" without looking at the surrounding words and having some idea of contextual usage? How do you think CCs can tell between "there" and "they're" without understanding if the speaker is referring to a person or a location? This is only the tip of the iceberg as to how CCs actually work, and more "content aware analysis" will always lead to more accurate CCs.
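As a toy illustration of the "surrounding words" point (this is nothing like the real system, which uses a full statistical language model; the word lists here are invented):

    // Pick the homophone whose known neighboring words best match the local context
    const neighbors = {
      two: ["one", "three", "dollars", "times"],
      too: ["much", "many", "late", "is"],
      to:  ["go", "the", "want", "need"],
    };

    function pickHomophone(prevWord, nextWord) {
      let best = "to", bestScore = -1;
      for (const [word, context] of Object.entries(neighbors)) {
        const score = context.filter(w => w === prevWord || w === nextWord).length;
        if (score > bestScore) { best = word; bestScore = score; }
      }
      return best;
    }

    console.log(pickHomophone("want", "go")); // "to"
    console.log(pickHomophone("is", "late")); // "too"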
>Again, I don't see it. CCs are not content agnostic: they have to have semantic understanding of the words said in order to produce accurate results. How do you think CCs differentiate between the words "to", "too" and "two" without looking at the surrounding words and having some idea of contextual usage? How do you think CCs can tell between "there" and "they're" without understanding if the speaker is referring to a person or a location? This is only the tip of the iceberg as to how CCs actually work, and more "content aware analysis" will always lead to more accurate CCs.
Still can't get away from that "understanding" debate. You're also now equating an understanding of context with an understanding of meaning. An understanding of meaning isn't needed to differentiate between "to", "two", and "too" because they're all used differently in sentences. When the system encounters those, I don't think it goes to the definitions and tries to find which word makes the most meaningful sentence. Most of the time the specific homophone can be inferred based on things like part of speech, and the part of speech can often be inferred from a sentence without knowing any meaning.
For example, would the system be able to properly handle homophones that are grammatically similar? Could it consistently transcribe sentences like "I have Celiac disease and enjoy the taste of rose water, so I prefer flower to flour in my desserts." That is an easy sentence to understand for anyone who knows the meaning of those words, but there are no grammatical or structural indications as to which flower/flour to use.
But either way, that is getting way too deep in the weeds compared to where my point started. This feature calls attention to an analysis of meaning because the user sees the software reacting to the meaning of the content of the video. A transcript does not call attention to an analysis of meaning because the behavior of the software does not change based on the content of the video.
> But either way, that is getting way too deep in the weeds compared to where my point started.
Your first comment - the one that started all this - was, as far as I can understand, arguing that this feature indicated that Google had the capabilities to do more advanced - understanding? processing? meaning analysis? - than it had done in the past. If I keep coming back to that, well, it's because it appears to be your main point. If it's not, please correct me.
> Most of the time the specific homophone can be inferred based on things like part of speech, and the part of speech can often be inferred from a sentence without knowing any meaning.
This is not true. I don't think I have enough responses left on HN to fully explain why homophones cannot be inferred without understanding meaning, but I encourage you to go and read about how transcription works!
> For example, would the system be able to properly handle homophones that are grammatically similar?
I mean, this is easy enough for you to check. Here's some videos about flour / flower - notice how the CCs correctly determine if the word is flour or flower with almost 100% accuracy.
> This feature calls attention to an analysis of meaning because the user sees the software reacting to the meaning of the content of the video.
Are you saying you specifically think that YT is analyzing meaning from this feature, or just some generic user? I think you are smart enough to know that it's not true, but perhaps my mom might not understand that CCs require infinitely more processing power and this feature is just a drop in the bucket. (If you really still don't think it's true, definitely go read more about how CCs are made!)
>Your first comment - the one that started all this - was, as far as I can understand, arguing that this feature indicated that Google had the capabilities to do more advanced - understanding? processing? meaning analysis? - than it had done in the past. If I keep coming back to that, well, it's because it appears to be your main point. If it's not, please correct me.
Here is what I said: "It highlights how much Google analyses the content of its videos... It isn't that I didn't know Google had this ability...". My point was not that I learned about Google's capability from this feature or that this capability was new; it is that this calls attention to Google looking for meaning in the content of the video. A transcript does not call attention to Google looking for meaning, regardless of how the transcripts are prepared.
>I mean, this is easy enough for you to check. Here's some videos about flour / flower - notice how the CCs correctly determine if the word is flour or flower with almost 100% accuracy.
Both of those videos include the correct homophone in the title and description of the video. Choosing the correct one is not an indication of the system using the meaning of those words; it is pattern recognition. Every use of "flower" means the next usage is less likely to be "flour". The specificity of the example I used was important because it used both "flower" and "flour" in a way that can only be distinguished by the meaning of the words.
>Are you saying you specifically think that YT is analyzing meaning from this feature, or just some generic user? I think you are smart enough to know that it's not true, but perhaps my mom might not understand that CCs require infinitely more processing power and this feature is just a drop in the bucket. (If you really still don't think it's true, definitely go read more about how CCs are made!)
This feature is a glowing sign that Youtube as a company analyses the content of its videos for the meaning of what is said in them. You are too deep into the technical details, trying to assign credit for which aspect of Youtube does the "understanding" or which part "require[s] infinitely more processing power".
Think of this feature like receiving mail and you see one of the letters has already been opened. That could make you feel like your privacy was invaded in a way you wouldn't feel after receiving a postcard. And now we have spent several comments debating whether a torn envelope indicates whether anyone read the letter and whether a postcard is private.
That depends on how the transcription software is written. Are swear words filtered out, or are they just never in the system's vocabulary in the first place? I assumed the latter, but that's a fair point. It is possible my categorization needs more thought.
Regardless, there is in my opinion a clear distinction in sophistication between a filter and something that triggers a timed action. And that was really what my original comment was about: this feature's elevated sophistication is a conscious reminder of Google's capabilities. Normally that is out of sight and out of mind, which is probably better for Google.
So you're not so worried that they do this analysis (this is very tame compared to what they really do), but rather that they are transparent about it?