People keep saying stuff like this, but I don't believe it: "There is no technic...

staticman2 · on Dec 30, 2024

Is conversation data really that valuable?

If I have the LLM translate a text from French to English... what is there to learn from that? Maybe the translation is great maybe it's awful, but there's no "correct" translation to evaluate the LLM against to improve it.

If I ask the chatbot for working code and it can't provide it, again, there's no "correct" code to train it against found in the conversation.

If I ask an LLM to interpret a bible passage, whether it does a good job or a terrible job there's no "correct" answer that the provider has to use as the gold standard, just the noise of people chatting with arbitrary answers.

motoxpro · on Dec 30, 2024

When will this come to pass? OpenAI has many orders of magnitude more conversational data and Anthropic just keeps catching up. Until there is some evidence (OpenAI winning, or Google winning rather than Open Source catching up), I don't belive this is true.

ijustlovemath · on Dec 30, 2024

What do you think they've been building these multimodal models with? If I had Google application logs from the past few decades I would absolutely be transforming it and loading into my training dataset. I would not be surprised if Google/Meta/Msft are not doing this already

When the big companies say they're running out of data, I think they mean it literally. They have hoovered up everything external and internal and are now facing the overwhelming mediocrity that synthetic data provides.

skydhash · on Dec 30, 2024

Digital data are only a tiny part of the influx of information that people interact with. It's the same error platforms made with releasing movies straight to their services. Yes, people watch more movies at home, but going to cinemas is a whole experience that encompasses more than just watching a movie. Yes, books are great, but traveling and tutoring is much more impactful.

ijustlovemath · on Dec 30, 2024

Sorry, what does that have to do with an OpenAI tech moat?

skydhash · on Dec 30, 2024

>>> The company that captures the most human-AI interaction data will have a TREMENDOUS moat.

>> When the big companies say they're running out of data, I think they mean it literally. They have hoovered up everything external and internal and are now facing the overwhelming mediocrity that synthetic data provides.

> Digital data are only a tiny part of the influx of information that people interact with.

ijustlovemath · on Dec 30, 2024

I'm not sure how you'd get that non-digital data, though. Fundamentally that sounds like a process that doesn't scale to tbe level that they need. Can you explain more?

skydhash · on Dec 30, 2024

Sorry, I wasn't clear enough. I'm saying that for most problems, a lot of the relevant data is not digital. I'm a software developer, and most of the time, the task is to transcript some real world process to a digital equivalent. But most of the time, you lose the richness of interactions to gain repeatability, correctness, speed,...

So what people bothers writing down are just a pale reflection of what has been, the reader has to relies on his experience and imagination to recreate it. If we take drawing for example, you may read all the books on the subject, you still have to practice to properly internalize that knowledge. Same with music, or even pure science (the axioms you start with are grounded in reality).

I believe LLMs are great at extracting patterns from written text and other forms of notation. They may be even good at translating between them. But as anyone who is polyglot may attest, literal translation is often inadequate because lot of terms are not equivalent. Without experiencing the full semantic meaning of both, you'll always be at risk at being confusing.

With traditional software, we were the ones providing meanings so that different tools can interact with each other (when I click this icon, a page will be printed out). LLMs are mostly translation machines, but with just a thin veneer of syntax rules and terms relationships, but with no actual meaning, because of all the information that they lack.

ijustlovemath · on Dec 30, 2024

I actually think LLMs power comes as a result of their deep semantic understanding. For example, embeddings of gendered language, like "king" and "queen," have a very similar vector difference to "man" and "woman". This is true across all sorts of concepts if you really dive into the embeddings. That doesn't come without semantic understanding.

As another example, LLMs are kind of magical when it comes to what I'd call "bad memory spelunking". Is there a video game, book, or movie from your childhood, which you only have some vague fragments of, which you'd like to rediscover? Format those fragments into a request for a list of candidates, and if your description contains just enough detail, you will activate that semantic understanding to uncover what you were looking for.

I'd encourage you to check out 3blue1brown's LLM series for more on this!

I think it's true they lack a lot of information and understanding, and that they probably won't get better without more data, which we are running out of. That's sort of the point I was originally trying to make.

mrbungie · on Dec 30, 2024

Data is becoming a commodity in this regard. That can't be really their moat when Google, Anthropic, etc are publishing similar products.

asadotzler · on Dec 30, 2024

Well, once the powerful steal it all, of course it becomes commoditization. Had they been required to abide by the law, something that few Silicon Valley VC-backed companies really worry about, this kind of "information" would not be a commodity, it'd be a lucrative property. But, Silicon Valley doesn't give a fuck about anyone so they just stole it all and now it's basically worthless as a result.

robertlagrant · on Dec 30, 2024

> something that few Silicon Valley VC-backed companies really worry about

Is that actually true?

paxys · on Dec 30, 2024

Yup. We have seen time and again that companies don't need a "technical moat" to stay in the lead. First mover advantage and a 1-2 year head start is always enough. Of course they also need to keep their foot on the gas and not let their product get overtaken. With all the talent OpenAI has I'm pretty sure they will manage.