Succinctly stated and something that resonates strongly with me.
In the last internet revolution (web search), results started high quality because the inputs were high quality - bloggers and others just wanted to document and share knowledge. But over time, many interests (largely commercial) figured out how to game the system with SEO, and quality of search results has decreased as search's incentive structure led to lower quality data being indexed.
We're at the start of the LLM revolution now - models are trained on high quality inputs (which may be as rare as "low-background steel" in the future). But the models allow the mass production of lower quality outputs with errors and hallucinations; once those get fed back into new models, are we doomed to decreasing effectiveness of LLMs, just as we've seen with search? Will there be LLMO (LLM optimization) to try to get your commercial interests reflected in the next generation models?
I think we've got a few golden years of high quality LLMs before that negative feedback loop really starts to hurt like it did in search.
I doubt it's going to go the same way as search. You can't run Google on consumer hardware, but you can run LLMs locally.
At worst, newer models will get worse and you can just stick to older models.
You could also argue that proprietary models gated by an API are better than anything you can run locally, and yeah maybe those will get worse with time.
They're not going to get any worse than what you can run locally though. If they do, open models will overtake them, and then we'd be in a better position overall.
> You can't run Google on consumer hardware, but you can run LLMs locally.
You can't run an up-to-date model locally. When I ask Google's models, they have knowledge of stuff from just a week ago, without using search. You won't get that from a giant local model.
Unlike low-background steel, high quality content will still continue to be created. Arguably even at a faster rate with these powerful tools.
The proportion will drop of course, but that only means curation will become king. With search this curation was difficult: the value was so low that you could only do it profitably if it was completely automatic, and that was hard to do.
An optimistic take would be that LLMs will make curation so much more valuable, that it will be done much better. If the wider world gets to use this curation to limit spam, and highlight good work, it would be amazing for the world.
You're forgetting that we have mechanisms in place to curate curation and farm attention toward the crap, not toward quality. It will be harder and harder to break the surface tension of this gooey gelatin wrapper we've placed over creative activity.
People will pay money for curation towards quality, probably quite a bit if the rest of the landscape is 99.99% noise. They won't pay much money for curation "toward the crap".
But that will be reputation-based, and names can be sold. Look at the brand-holding conglomerates we have today that don't make any of the original product, they just license the name out to anyone, regardless of the quality of the finished good. From mattresses to magazines, we keep seeing this. Why wouldn't we see this in curation sites?
Sure. I suspect it might be hard to actually recruit people like this considering how the people who got sampled are hugely pissed off at AI firms rn, but money will go a long way here.
I don't think it impacts AGI timelines much. Worst case scenario, we just cut off training at 2024 data. But it's not like another ~5 years of internet data is the magic that will get us over the finish line to AGI. We should have enough data already. We will get to AGI with synthetic data (e.g. an AlphaGeometry-style approach for code, or simulators, e.g. RL inside Unreal Engine), world models from video, algorithm improvements, and 100x more compute.
I'd add that the majority of LLM-generated stuff put onto the internet isn't pure junk. It's endowed with human-created context and curation. If someone submits some LLM-created code to GitHub, it's because it works. A small fraction is pure noise (e.g. state actors spamming Twitter), but that should remain a minority.
I think it’s clear that LLMs cannot be the end state of this technology, and we will need systems that can reason and develop hypotheses and test them internally. These systems may benefit from more curated datasets (such as those collected before the bullshit wave began) along with real world interaction data from YouTube and robotics. Such systems could eventually be used to rank web pages for their bullshit level, which of course would present a risk of censorship but when used properly could lead to insightful data handling.
It’s just really clear that a giant text averaging machine can only go so far, and while we do see some higher level emergent properties, we’re going to have to move beyond the current state of the art in the next decade, and such future systems may be much less affected by the internet’s bullshit. Even without the bullshit singularity I think such measures would be a necessity and we would see them developed soon.
> It’s just really clear that a giant text averaging machine can only go so far
It's not really a text averaging machine, it's a pattern matching machine.
Right now the "depth" of the patterns it can match can only go so far, but in a few years with more advances in chips and memory the depth is going to increase and the patterns it can match will fan out accordingly.
It is a statistical model. If everyone is saying X and is wrong, and one guy says Y and is right, the LLM will spit out X, because that is the most probable thing in the dataset. It literally is a text averaging machine.
Not just real world data from videos. AI models need feedback from many sources: humans, code execution, web search, simulations, games, robotics, math verification, or from actual experiments in the real world. All of these are environments that can take the output of a model and do some processing and return feedback. The model can learn and search for solutions, creating its own training data as a RL agent.
Since all deployed models produce some kind of effect and feedback from the world, there is an opportunity there to collect data targeted at the current level of the model, the most useful kind of data. That's why I think AI will be ok even with the proliferation of bots online. It's not 100% pure synthetic data in a loop, it is an agent-environment loop.
tl;dr Models learn better from their own experiences, not ours.
But bullshit hallucinogenic output and fakery at colossal scales doesn't just pollute the pool of static information available on the web. It also warps the minds of the humans you're relying on to verify reality on the next training loop. Not only that, but we can also see the emergent breed of humans who believe that writing is similar to arithmetic - an unnecessary skill that can be handed off to a calculator. Or that making a movie shouldn't require knowing anything besides asking for what you want to see. How is someone like that - someone who wants to rely on bots - going to tell a bot what is or isn't true? How can they even have a pre-pollution baseline understanding of reality?
I just finished rereading Do Androids Dream for the first time in 20 years or so, and was astonished at how similar his andys really are to LLMs in the polluted / destroyed reality there. How confusing and corrupting they are to organic life. PKD describes the android brains as neural networks with thousands of layered pathways and trillions of weighted parameters, and it's as if he was able to accurately conceive of what linguistic and "emotional" strengths and weaknesses those constructs would actually have, decades before LLMs existed. And there's this one amazing line where Deckard calls them "Life thieves". What else should we call what Sam Altman and others are building - but theft of human ingenuity, creativity, and basic reason for living, and the utter annihilation and suppression of people like this 16 year old kid who dare to hope they can contribute something more original in life than being a servant of a tech company building this shit, or a tiktok influencer who writes prompts?
What should that kid hope: That their work becomes noticeable enough to be immediately stolen and their name turned into a prompt?
Life thieves.
Just as a further aside, I had dinner tonight with a friend who's a fairly famous animator in the commercial realm, and I brought up this post. He's just sure the kid's screwed and the genie is out and creative is basically over. He's turning to building wooden clocks.
But his reaction, and the reactions I see here every time this comes up, remind me of something else. They remind me of how people react when someone is robbed. Everyone has some reason why it's bad but not that bad, it was inevitable, it'll be okay, etc. Or they go around wondering what they're going to do now. Or they paper it over with optimism. Surprisingly few people get robbed and are willing to realize they were robbed, and become wildly pissed off about it. Most people have some sort of flight reaction, as evidenced by e.g. the promotion of "prompt writing" or learning to use a paid API to do what was formerly your own creative job that now spits your own work back at you.
I say sue the shit out of all these content thieves.
The comparison is only valid if AI ends up being monopolized. A continuously evolving ecosystem on the other hand has a better chance of adapting to those pressures. I am sure there are search engines that don’t index the SEO crap, but I don’t remember the names, and they have other flaws.
Open source AI needs to get a lot of investment for this to be mitigated. Relying on market incentives to drive development without the possibility of forking is too dangerous given the high stakes.
This comment/sentiment/idea has already been thoroughly discussed, shared, retweeted, posted, and echoed a couple thousand times online already. Humans like you have been redundantly saying the same things in varying degrees of novelty, like this, for centuries.
That's the topic of the post we're discussing. He is basically reiterating the point. Which is a good point. And there have been zero good arguments made (as far as I'm aware) as to why this fear would be unfounded.
I'm pointing out the irony. The Internet is already saturated with us continually metaphorically JPG compressing original ideas into textual redundancy, with many people having only been exposed to those copies of copies.
I guess some of it might depend on how good the AI-generated content gets, and also how good the AI gets at detecting AI-generated content.
If the AI was good at detecting it, it wouldn't matter if the AI-generated content sucked, yes?
Even a low probability of detection would help. Let's say our algorithm is 50% likely to detect AI junk. That means that half the junk data won't make it into the model. Even 20% would probably be worthwhile, especially if it also threw away human-generated junk (and let's be realistic here: there is, and always has been, no shortage of terrible and/or wrong human-generated content).
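To make that arithmetic concrete, here's a minimal back-of-the-envelope sketch (purely illustrative numbers, assuming a corpus that is 30% AI junk and a filter with no false positives):

    def surviving_junk_fraction(junk_share, detect_rate):
        """Fraction of the kept corpus that is AI junk after filtering."""
        kept_junk = junk_share * (1 - detect_rate)   # junk the filter misses
        kept_clean = 1 - junk_share                  # assume nothing good is thrown away
        return kept_junk / (kept_junk + kept_clean)

    print(surviving_junk_fraction(0.30, 0.50))  # 30% junk drops to ~18% of what's kept
    print(surviving_junk_fraction(0.30, 0.20))  # even a weak filter gets you to ~26%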
Take those crappy filler paragraphs that get stuck between pictures on meme clickbait pages... I suspect most of those people have already been replaced, but that prose was horrible long before the current LLM boom.
I suspect there is a lot of effort being expended right now on ways to ensure the training data (whether human or AI generated) isn't shite.
Humans are smarter than machines. We "sense" bullshit in patterns that machines cannot yet intuit. I can tell when something's AI generated. It has a smell or a flavor that's different from human work and it's a variable pattern because, like a southern accent, I can tell if it's Georgia or Alabama, Gemini or Claude, even if very capable machines cannot.
The same goes for image generation models, AI art already has a tendency to veer into the same clichés and those are only going to get reinforced if newer models are trained on newer scrapes which now include the million hyper-derivative AI images being uploaded to places like DeviantArt, Twitter and Pixiv every day. Those vendors who got in early have a moat in the form of untainted scrapes, but they'll eventually need new data to keep up with new subjects/styles/etc.
Art is very heavily tagged, accurately described and filtered. Many galleries don't accept AI art at all, which means to sneak in the AI needs to be pretty much perfect.
Also, there's an enthusiastic LoRA scene where the makers work with datasets small enough for manual curation.
Photography has been very well developed technically for almost a century. Yet very little is considered art. The most interesting AI art, to me, are “image compilations” where the compiler appears to have taken just a bit of LSD.
>> which means to sneak in the AI needs to be pretty much perfect
That's the status quo, because right now artificial art is easy to distinguish from "natural art". Besides that I consider this - no offense! - as some kind of "arrogance": galleries accept what they think art is. Why can't I, the art consumer, decide for myself what art is?
Not always, a few companies like Getty Images are using their internal stock photo libraries to train models rather than slurping up the entire internet. They are uniquely positioned to take that path though, having already been in the stock photo business for decades before AI took off, and for everyone else it's a sisyphean task to manually curate a complete representative dataset that they can be sure isn't contaminated with AI images.
Even if the data is 100% synthetic, you can still hill climb to new mountains.
If you don't believe me, look at evolution.
It doesn't matter if we no longer have 100% human art as input. This is the worst these systems will ever look and feel, and they're only going to improve.
Exactly, the fitness function is "make a picture which looks like these other pictures", which is used as a proxy for "make a picture that a human would find appealing by making it like all these pictures made by humans". The former definition will always hold true, but the latter actually desired definition breaks down once your training set is contaminated with imitations of imitations of imitations.
Number of works that are released to the internet! At the end of the day, there is a human who has an idea in mind and is using an LLM to realize it. They are tweaking their prompt until they achieve their vision, then publishing only those successes. The resulting art can then be used with the prompt as training data.
It would be nice if every human at the wheel of an LLM or image generator were putting that much care into their craft, but that's just not what's happening. You just need to look at places like DeviantArt which are now overrun with users who joined 6 months ago and already have submissions in the thousands; it's simply not possible that they are putting any care into what they're spraying out. Much of the time they're posting numerous functionally identical images, probably generated from the same prompt with a different seed.
Likewise SEO incentivizes mass production of low quality LLM text, because the "quality" they are optimizing for is impressions, not actual quality.
There are biological analogies you could have chosen other than adaptive hill-climbing. For instance, there's adaptive radiation, where a small population of organisms is introduced into a new environment and rapidly diversifies into new species, filling niches that weren't filled before. After the environment is mostly saturated, the populations stabilize and adaptation slows down, with gradual and relatively slow change and diversification.
Plugging LLMs into this analogy would lead to a story where we're at the initial "Cambrian explosion" phase of evolution. New LLMs are rapidly "hill climbing" as they feed on virgin data scraped from the Internet. But as those LLMs diversify and adapt, they peter out once all niches have been filled and the fuel consumed.
Another analogy is the Petri dish, but that's an even more pessimistic analogy than adaptive radiation.
> After the environment is mostly saturated, the populations stabilize and adaptation slows down, with gradual and relatively slow change and diversification.
We have the pressures of growth and novelty. My analogy is perfect.
Hill-climbing requires some measure to optimize. Current technology uses "how well can you predict real text" as this measure. If you change the text you try to predict, it changes the measure you are optimizing. It is far from obvious that this will still improve the actual performance.
> It's getting really tiresome to see tech bros talk about biology as if they know what they're talking about.
My undergrad was in biology, "bro". I was cloning luciferase into plants using agrobacterium-mediated transfection over a decade before this week's news of transgenic petunias.
I was planning to do a PhD in computational metabolomics, but life took a different turn: a couple of Google engineers saw the laser projector I built and programmed to play video games on the side of skyscrapers [1], and they lured me away to work on what became a decacorn. My quality of life and net worth certainly thank me for the pivot.
Take a look at my post history [2]. My analogies are informed and salient.
In any case, I'm working directly in this field now and the results we're achieving don't need affirmation. I tried to communicate this to outsiders in an easy to digest analogy. I'll let our work speak for itself.
You need to study optimization. I made a perfectly salient analogy. I'm incredibly busy working, and I'm not going to write an essay to bridge the gap for you. All of the pieces are there in front of you. Humans and the free market are an exogenous fitness landscape that is more than suitable to push these systems to their pinnacle.
I'm completely fallible. I just don't entertain people calling me a "tech bro" that lazily follow up with "but you should know better". That's not earnest, good faith engagement.
With a mixture of experts approach it’s not incest, we intentionally split the training data to add diversity. Incest is bad because the errors compound on themselves, but this isn’t a problem when the AI parents don’t share the same DNA (problem space.)
I see it this way: maybe there's no new "knowledge" but the AI can apply our collective knowledge better than we can. Within that set of knowledge there are surely discoveries never yet realized, based on the fusion of ideas.
Pre-AI we relied on individuals like Einstein, Bohr, and Oppenheimer and their associations and studies of each other's work. With AI, we can fuse the corpus of scientific discoveries into a single entity that we each can communicate with. Maybe the AI lacks the spark of creativity needed to make discoveries, but put today's Einstein in front of it and what would he ask? How much boost would it give him?
Einstein said - “I have no special talent. I am only passionately curious.”
Humans train on self-generated data all the time, but compared to today's LLMs, humans have superior reasoning ability, which enables them to separate good from bad data. Humans can confirm many things for themselves if need be and our social hierarchies crystallize authority figures which perform curation for others. On top of that humans have a lifetime of experience with the real world, while LLMs rely only on reports of what the real world is like.
All in all LLMs being more gullible than your average 6 year old is really not helping. Humans will have to (continue to) curate the data we train LLMs on.
This view of people seems… idealistic, given the number of people who uncritically look to Facebook or Reddit for information (and then happily parrot whatever memes they find) and the number of LLMs that do the same.
I think people are substantially worse at figuring out what is “good” data, and I think a huge number of people are/are starting to nominate LLMs as those “authority figures”. And, given that, I think there is a low ceiling for the creators of these tools to clear to make them appealing to users.
> a huge number of people are starting to nominate LLMs as those "authority figures"
When LLMs started to go mainstream I said I was not afraid of AI, but was afraid of (unwise) people with AI. This phenomenon is what I foresaw.
The worst thinkers are the ones most hungry to outsource their decision making.
The fact that LLM output requires good judgement by the recipient to filter the wheat from the chaff, and those with poor judgement skills are most likely to lean on decision tools makes for a potentially volatile situation.
I think I'll refrain from buying into this until much more testing is conducted. There is information out there suggesting training on synthetic data is at least on par but cheaper.
Training on self-generated data is not necessarily such a problem; see how successful AlphaZero/MuZero is when it is only trained on self-play.
The key is that you need some kind of external indicator that tells you which generated examples are good and which are bad. In the case of AlphaZero you get that by simulating games and seeing who wins; in the case of LLMs you will only be taking the generations that are "successful", e.g. which HN posts get upvoted.
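As a rough sketch of that filtering idea (the names here are hypothetical, not any lab's actual pipeline): keep a self-generated example only when some external signal says it succeeded.

    def build_synthetic_dataset(generate, external_check, n_samples, threshold):
        kept = []
        for _ in range(n_samples):
            candidate = generate()                # model produces an answer / move / post
            score = external_check(candidate)     # tests pass, game won, post upvoted, ...
            if score >= threshold:
                kept.append((candidate, score))   # only "successful" generations survive
        return kept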
It will force more robust filtering and analysis that will allow us to determine the "quality value" of any piece of information and grade its novelty/uniqueness for classification and subsequent training, which will use only high-quality (book-grade) and unique/novel content (excluding copy-pasted SEO spam).
Of course it's not going to improve on itself. That's what you all are here for - happily interacting with LLMs. Doing the finetuning and improving it with yet another dose of humanity, albeit with more direct interaction this time.
And everybody's asking what AI is going to be and can do for them, all the while freely working for it. Don't ask what AI can do for you, just do for AI what is asked of you.
The Dead Internet Theory was only slightly ahead of its time.
Used to be real people pretended to be girls on the internet. These days I can’t even get an honest real fake person pretending to be an attractive female on LinkedIn.
I think that people's belief in LLM intelligence is closely tied to a mistaken association between intelligence and the ability to speak.
Parrots can speak, but they cannot reason. Moreover, evolutionarily, birds learned to mimic speech to fool other species - to fool them in such a way that those other species would think that parrots are of the same species.
We, the humans, are smart enough to recognise that even though parrots can talk, they are not as smart as we are. Unfortunately for us, LLMs are much more advanced things than parrots, and they are capable of fooling the broad masses into thinking that LLMs are of the human species, with all inherent features including reasoning intelligence (which we cannot easily test externally).
This is unfortunate, because such false beliefs slow down actual scientific progress towards research on natural intelligence, and towards the creation of truly reasoning artificial intelligence.
Btw, I wouldn't be surprised if, when AGI is one day created, it is not able to speak at all, nor recognise images.
I believe you are vastly underestimating the capabilities of parrots or similarly intelligent birds.
At least when my pet parrot manages to fool me into whatever it is he wants to get out of me that time, he does it for his own primal benefit and not because he was programmed to adhere to some strict set of ethical and moral boundaries set by legal requirements and someone else's idea of how people should think and behave.
I don't think that following primal benefit would be a good measure of intelligence, despite the fact that such behavior is common for most species.
What makes the human species special is the ability of some individuals to create new things that didn't exist before. That's a loose criterion of course, because most individuals just follow educated social constructs for their entire life.
Oh yes, I agree on that. I should have specified that I was arguing for their ability to reason - which from my own experience these birds do most definitely possess to a surprising extent. They are smart, and they have all day to figure out which buttons to press to get you to do something specific they find funny, or other such things.
The smart behaviour of parrots and their speech synthesis are two different phenomena.
Parrots are definitely smart, and parrots can definitely speak, but they can't learn to speak beyond what they are trained to do. It's not like a parrot has ever spontaneously put together a coherent sentence independently.
Absolutely. I also replied to the OP's response below that I failed to specify that I was arguing for parrots' ability to reason, which I do believe exists to a great extent.
Regardless, speech is just another means of communication. In the end it doesn't really matter if I use perfectly articulated Swahili or just scream "aaar" at you for a few seconds as long as I get the cheeseburger with extra cheese I want from you. These parrots, much like children, just push your buttons and are quite good at finding (or negotiating) the right pattern of things to do and sounds to make to get to a certain outcome.
Relatively advanced problem solving abilities are well documented in some species of birds, such as crows.
>LLMs are much more advanced things than parrots
Hardly. Parrots are able to fly, find food, find mates, interact with other parrots and do lots of other things that no LLM is even close to being able to do.
I was hoping it would be clear from the context that LLMs are more advanced than birds only in the ability to synthesise speech close to human speech in its complexity.
Sorry, I didn't mean to offend the parrots. I like parrots too! :)
My belief that they have some sort of intelligence is based more on GPT-4 being able to figure out stuff I can't figure out myself. Things like "why doesn't this bit of code work" or "how can I write code to do blah blah blah..." It's not quite like human intelligence - better in some ways, worse in others. The better bit is it has far more information than the average human; the worse bit is its reasoning is not as good as a smart human's, though maybe ahead of some dumb ones.
Hinton thinks LLMs can "understand" because without understanding it's impossible to predict the next word as effectively as GPT-4 does. The only way to do that is to understand the meaning in the text. He also says he's given GPT-4 a novel reasoning problem he invented and it successfully answered, which wouldn't be possible without a level of understanding/intelligence (although I'm not sure how he ensured the problem wasn't in its training data).
Well, my point was that speech by itself is not a good criterion for estimating intellect. This is not only applicable to machines but to humans as well. I assume that the primary, evolutionarily determined purpose of the speech function was convincing, rather than being the source of reasoning. Even though we use natural languages to broadcast information, the languages are usually overloaded with linguistic and psychological tricks to create an illusion of usefulness and novelty in this information. Even if the information was truly novel, it's usually hard or even impossible to find its roots. This is one of the reasons why most people prefer to learn rather than to invent: broadcasting information about inventions made by someone else is much simpler and usually more beneficial than researching something from scratch. And our natural languages are specifically suited to such forms of broadcasting.
In this sense, if ML developers were focused on automating invention, I would expect them to choose something more formalised than natural language.
Anyway, whatever methods they choose, I think a better external criterion for estimating intelligence would be the ability to make completely new things that clearly didn't exist before, not just reasoning about existing ones. After all, it's not a new thing that computers are able to deduce; any programming language can do that better than any chat bot.
As AI advances, there will be AI and algorithms that will check LLMs and their output, sort of LLMs' verifier. It is foolish to think that LLMs won't improve to the point where there are almost no or very low hallucinations and false information. We are just at the beginning of LLMs and generative AI.
When people don't know how to do something but think it is easy, then it can often take 50 years or more for that to happen. Has happened before in this field.
The model predicts a 99% chance that the next word is "alive" and a 1% chance that it's "dead".
The LLM calculates probabilities. How does it actually choose the word? It throws a weighted die. It literally chooses one at random (albeit from a custom probability distribution).
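In code, that "weighted die" is just sampling from the predicted distribution - a toy sketch, not any particular model's actual decoder:

    import random

    next_token_probs = {"alive": 0.99, "dead": 0.01}   # the model's prediction
    token = random.choices(list(next_token_probs),     # draw one token at random,
                           weights=list(next_token_probs.values()))[0]  # weighted by probability
    print(token)  # "alive" about 99% of the time, "dead" about 1% of the time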
So tell me, how can you eliminate hallucinations from something that is literally designed to pick stuff at random?
Hallucinations will never be removed from these types of LLMs. Hallucinations are fundamental to how they work, in the sense that even the "good" outputs are hallucinations picked at random from a probability distribution.
Any company that says they can control hallucinations, in any way, is flat out lying.
check what, exactly? The model already did the best it could. It can't check its own work. And if I have to find out the answer myself from some other source in order to confirm the answer then that makes the model useless.
I just tried GPT4 with "is the queen alive?" and it came back with 'hold on, checking Bing' followed by "Queen Elizabeth II passed away..." That kind of thing.
I'm almost certain that I've seen this exact phenomenon described in a sci-fi novel written before LLMs existed. I thought it was Neal Stephenson's Anathem, but a quick skim through the text didn't turn anything up. Anybody else know what I might be thinking of?
Yes it's a major plot point in Anathem. The Internet was so badly polluted with machine generated garbage that a special caste of techno priests, the Ita, evolved with the mission of doing machine augmented web searches to find valid knowledge among the vast noise.
LLMs are being trained on a smaller and smaller percentage of human prose. Right now it seems like code is the best source for the bulk of an LLM's diet, but it's also looking likely that synthetic math text will be even better. The structured reasoning of code and math seems to be what actually makes these big LLMs "smart." Once you've trained a smart LLM, it seems to take a relatively small amount of hand-curated human prose to fine tune it into talking like a human. Unfortunately this article feels like the wishful thinking of someone who is afraid of the changes LLMs are bringing and hasn't done much research.
It seems to me like archive.org and the major book publishers are sitting on a gold mine (at least up to 2022), but I haven't seen anyone saying the same, so maybe I just don't know enough about LLMs.
Doesn't that just mean whoever can get the best human feedback will get the best AI? If the quality of AI depends on not feeding on "fake" input, the key becomes getting verified real input.
My first thought was that this would create more money in genuine creativity, which would be great. But instead it feels more likely there will be much more telemetry and tracking to determine whether an input is made by you (and thus by a human) or not.
"Only trust stuff that can be traced to a specific human" is a great way weed out non-human stuff. But it requires a pretty effective surveillance system...
I'm curious about how the economics of human feedback will work out in the long term. I currently work in operations at a data annotation company, and my experience has given me a very pessimistic view of the industry's current state.
The MO for these vendors and the AI companies that buy their data seems to be a race to the bottom in price with little concern for quality. The current industry norm is to outsource the work to developing countries, where the cost of living is more in line with what the annotation agencies are willing to pay. While this isn't necessarily problematic for quality in and of itself, it does seem to make it harder to find candidates with the English skills required to generate high-quality RLHF and SFT training data. Furthermore, the pay offered for coding annotators is not competitive with local pay for software engineers, making it challenging to recruit skilled programmers. A lot of coding annotation is done by beginners and students.
There is certainly a lot of hype surrounding LLMs and their potential to disrupt various industries — even the US DOD has been scoping out the potential use of LLMs to assist military commanders in strategic decisions. However, if we want these LLMs to consistently perform at an expert level, they need expert-level training data. I worry that producing this data at the quality and scale required may be prohibitively expensive, and could cause a major bottleneck in model improvements long-term.
The Googles, Microsofts, Apples, and Amazons of the world with access to "real" humans are the ones with a leg up here. They already have the telemetry; now it's about effectively training models with it.
That Michigan PhD student dump was by an unauthorized third party and was shut down. But was the pricing reasonable? Are admins of old school message boards sitting on now valuable DB dumps of activity on the board in the form of millions of messages?
The internet being filled with AI content is likely to happen. But for the rest I think it's the classical bias science fiction has when predicting the future: more of the same, or the linear transposition of the now (for lack of a better wording). Think CRT screens alongside flying cars.
One could argue that LLMs have some form of intelligence already even though I believe it falls more under “very good intuition”. Even if the internet is filled with trash, we still have (and will continue to do so I hope) a lot of content of high quality available (all the books, podcasts, videos, etc. produced up until now). If you take human intelligence, it doesn’t require the whole internet to start to get smart.
A lot of interactions with the real world and some good books/videos should be enough to get to AGI (and start the dreaded feedback loop). We just haven’t found a way to achieve it yet (AFAIK).
I always found the idea of infinitely self improving AI to be suspect. Let’s say we have a super smart AI with intelligence 1, and it uses all that to improve itself by 0.5. Then that new 1.5 uses itself to improve by 0.25. Then 0.125, etc etc. obviously it’s always increasing, but it’s not going to have the runaway effect people think.
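Written out, that particular assumption is just a geometric series, which is why it converges instead of running away (the conclusion depends entirely on each round's gain being half the previous one):

    I_{\text{total}} = 1 + \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \dots
                     = \sum_{k=0}^{\infty} \left(\tfrac{1}{2}\right)^{k} = 2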
There are many dimensions where improvements are happening - speed increases, size reduction, precision, context length, using external computation (function calling), using formal systems, hybrid setups, multi-modality, etc. If you look at the short history of what's happening, we're not seeing below-50% improvements over those relatively short periods of time. We had GPT-1 just five and a half years ago; we now have open-weight models orders of magnitude better. We know we're feeding models with tons of redundancy and low quality inputs, we know synthetic data can improve and lower training cost dramatically, and we know we're not near anything optimal. We'll see orders-of-magnitude size reductions in coming years, etc. Humans don't represent any kind of intelligence ceiling - it can be surpassed, and if it can be surpassed and humans alone produce well above 50% improvements, it will keep getting better.
Saying that models will get attracted to a bullshit local maximum is a similar fallacy to saying that Wikipedia would be full of rubbish when it was created. The forces are set up in a way that creates improvements that accumulate, humans don't represent any ceiling, and unlike humans, models have near-zero replication cost, especially time-wise.
Sure, but it seems that with a fixed amount of hardware or operations there is some sort of efficient frontier across all the axes (speed, generalization, capacity, whatever), so there should logically be a point with diminishing returns and a maximum performance.
Like there is only so much you can do with a single punch card.
If it's smarter than us it's pretty irrelevant whether it takes 12W or 5KW or even 1TW to run. Sure it may stop improving once it's far surpassed Von Neumann-level (at some point nobody knows) due to some physics or unknown information constraints but I don't think that has any practical bearing on much.
If it improves at a faster rate than humanity, it pulls ahead even if the absolute speed is slow. That's what people are really more worried about, not instant omniscience.
The general assumption is that some form of Moore’s Law continues, meaning that even without major algorithmic improvements AIs will blow past human intelligence and continue improving at an exponential rate.
Yeah but there are arguments that Moore’s law won’t continue because at a certain point you can’t really get transistors closer without quantum effects messing with them
Yes, but the assumption is that Moore's law (or something like it) continues way past the point of machines surpassing human intelligence. And maybe the AIs find completely new ways to speed up computing after that.
Why would the rate of improvement follow your imagined formula?
If people are worried about a runaway effect, why would you think you can dismiss their concerns by constructing a very specific scaling function that will not result in a runaway effect?
The more general point is that people see growth and assume a never-ending exponential, when in reality it's probably something with a carrying capacity.
Yeah you could imagine that with a fixed amount of resources that implies a maximum “computational intelligence”. Right now we aren’t close to that limit. But if we get better algorithms, there’s going to be fewer gains as we get towards a finite ceiling.
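One standard way to write "growth with a carrying capacity" is the logistic equation (a generic model here, not a claim about actual AI scaling curves), where capability C rises quickly at first and then flattens as it approaches a ceiling K set by available resources:

    \frac{dC}{dt} = r \, C \left(1 - \frac{C}{K}\right)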
LLMs are increasingly trained on synthetic data. A LLM can learn logic, and ultimately the rules governing physics and everything else, without human input.
The first versions being trained on "human knowledge" seems kind of like a proof-of-concept. Future iterations will be much, much smarter.
It should be clear to people that the general population sucks (eg see most comment sections).
We've spent endless time and money sealing ourselves off from the great unwashed only to expose ourselves to them in full force on the internet (plus Russian paid trolls). It doesn't make much sense.
Isn't this really good news for artists? When everything you see is generic shit, it will be so much easier for truly creative thoughts, expressed through high quality writing, based on genuine experiences in the real world, to stand out.
Also, this will put pressure on human creativity to produce weirder stuff that can't be produced by simply paraphrasing and remixing known ideas. Mass photography was bad news for the portrait painting industry, but forced painters to think outside known concepts and explore crazy abstract art.
I totally look forward to whatever humans come up with next, that can't be easily generated by a computer program.
Ugh, the guy who thinks "prompt editor" is the job title of the future, and that people in every field will take classes on prompt editing. Gag.
You know, there's no replacement for knowing how to do something yourself. And no, you cannot be as good creatively if you haven't learned a craft. (And no, prompt editing is not a craft any more than ordering food in a restaurant is cooking. Could you even describe in a useful way what was wrong with a soup you were served if you had never learned to cook?)
Not sure why I expected something much more insightful when opening the link.
The knowledge base that LLMs have today is probably enough to derive very high quality content/knowledge. Data and information coming "from outside" won't cease to flow, but will if anything increase.
On the other hand, I see value in somehow indexing/marking human generated content, or at least human supervised/approved content.
Consider LLM code generation. It would be an ideal domain for simulated data.
For example, a version of popular software is released with new functions. A framework could simulate numerous plausible code snippets that exercise the new release, including edge cases. The code snippets would be amply commented, and proven to work in a test harness.
This method is the opposite of depending upon the internet for training data. And this method is being used to train AI robots, that have no available internet training data.
I can read my own writings without overfitting the neurons in my brain. The key I think is contextualization, something LLMs are great at already. The open question is how to utilize that contextualization ability during training.
The argument that LLMs can’t possibly scale because of data contamination falls apart the moment we discover a method to incorporate context-learning into the training loop.
You know, I've read a lot here on HN about AI and our future. I'd like to hear more from the folks actually in the AI industry - the researchers and developers implementing LLMs, diffusion models, etc.
Will high quality training data be needed for incremental improvements?
What is the upper bound of improvements, and what is it contingent on? Compute, training data, etc.
The web already contains vastly more information than you could ever even begin to read or look at in a lifetime. Most of it is total garbage. You have to use a trust filter to find things worth looking at.
You're arguing that there isn't a trust filter that can do that, but there will be, and it will probably be an AI.
Well, for now, anyway. I went to college in the 70’s. What strikes me is, once obviously wrong hallucinations are eliminated, LLM’s will do exactly what untalented liberal arts students did with their time, but with some useful stuff tacked on.
On the bright side, it might force those engaged in research to figure out just what exactly it is that makes their models work to begin with. Turns out knowledge is about more than just modeling statistics.
The argument that the problems of AI are going to be solved with MOAR AI, sounds suspiciously like the argument that the solution to America's gun problem will be solved with MOAR GUNS.
SEO is dead. It will all be pay to play from now on. Want exposure? Open your wallet. Paid marketing will be the final bottleneck as problem of content has been resolved forever.
Regulation. Law. Human verification. Slow old fashioned ways of verifying content. Basically in the context of news for example: outlets pinkyswear that their writers are made of flesh and use sources from other companies that have the same sort of certification and so on. All the way from eyes to consumer. And you'd not read news without that kind of certification in the future I guess.
Cryptography is great at amplifying control: from controlling access to a single secret, you get the ability to read information, create "valid" information, and even combine new information.
What tends to keep cryptography from solving problems is that controlling a secret is actually quite difficult.
I mean, if you think about it for 5 minutes, sure. But look at what Bluesky is doing, for example, around decentralized moderation. It just hasn't really been an important problem, in the global sense, to get cryptographic attestations by neutral parties asserting what they have witnessed or analyzed for evidence of falsehoods.
I mean, if I control your DNS for a few moments, there are trusted providers that will issue a TLS cert to me. Things like this happen all the time for companies that hopefully have a security officer. Individuals have little hope here.
When an article crosses some threshold of replies to upvotes per unit of time it gets deranked, presumably to discourage political / flame-war topics.
E.g., this article still shows up on the second page. After a while, when the replies slow down, it will move back to the first page (if it's still getting upvotes).
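A rough guess at what such a damper could look like - HN's actual ranking code isn't public, so every name and threshold below is made up:

    def flamewar_penalty(num_comments, upvotes, age_hours, heat_threshold=1.0):
        """Multiply the story's rank score by the returned factor."""
        if upvotes <= 0 or age_hours <= 0:
            return 1.0
        heat = (num_comments / upvotes) / age_hours   # replies-to-upvotes per unit time
        return 0.2 if heat > heat_threshold else 1.0  # heavy derank once the ratio spikes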
This is a weird thread. I feel like there are many people who have no clue what they are talking about other than a few buzzwords, and make huge generalizations about AI or intelligence in general to _will their opinion into reality_. These statements are biased by either fear of losing their job or fear of losing out on some investment in AI.
Obviously, the enshittification threat, as the author points out, is significant. It will be an interesting problem to deal with in the future, but until then I feel like no one knows when or what that will look like.
At this rate, no one can predict what AI will bring us 6 months in the future even!
When you interact with an LLM (or Eliza) your brain is doing a lot of heavy lifting, without you even realizing it. I think we tend to infer more intelligence than is there, in the same way we see faces in clouds.
Watch human behavior long enough and you'll also realize that encoding of social behaviors in our society does a lot of heavy lifting for our brains so we don't have to think.
Also, for bonus points, define intelligence in a manner that covers simple systems with rudimentary intelligence up to complex systems with higher levels of intelligence than a human. Can you do it? Because so far no one else has.
There is quite a bit of differing information, but when groups cross-reference each other's work on the differences, you end up with lines like this:
"There is no agreed-upon definition of the concept of intelligence neither in psychology nor in philosophy. Experts’ definitions differ widely."
Intelligence itself seems to be a bunch of different behaviors and abilities that when combined have emergent behaviors that are difficult to predict and reduce to simple systems.
The biggest threat here is not waves of bullshit, that’s nothing new.
It’s knowing who to trust.
One of the defining factors under the Stasi in East Germany was not that “ordinary people” could not recognise that what they heard on the radio was bullshit, but that they could not know who to trust to say “that’s bullshit”. Every family had a Stasi informer, so whilst everyone knew the regime was lying, there was no critical mass event.
There are a dozen technical solutions to “bullshit”. We just need to ensure we have institutions that support us standing up to it.
I have absolutely no idea why people are so worried.
There is a human choice in a lot of matters, especially when it comes to how and why one perceives quality.
Quality in literature. Food. Arts. Fashion. Blogs. Programming. Acting. And all such things.
What I believe, based on both a cultural and a philosophical stance on where the world is and where it's going, is that, at best, so-called AI will push an even larger human segment towards finding the truth for oneself, in a different and original way of being human and perceiving reality.
I think that the intelligent and thoughtful individual will find plenty of ways to nurture one's own abilities towards quality, which has always been the only true real value you can measure yourself against.
No one will stand a chance against the fluff and the obvious stupidity of "my AI will call your AI", and therefore the only real direction to follow is the one that gives you an edge for yourself, based on the fundamental skills you learn through years and years of experience and personality.
The reason great books are great is not that they're written. It's that something seeps through the words and sentences, from their very human authors, that you can feel inside of you.
There will be, is my belief, a larger calling in the world for removing oneself from the fast-paced results that have absolutely no nerve or soul.
I am not afraid at all. I only feel sorry for the humans that have nothing in themselves to give towards obtaining a larger meaning with their time spent, in whatever profession or life they pursue.
Have you actually used an LLM? GPT-4 is (well, was, before Turbo) an anti-bullshit machine.
You could ask it any question, like "why does my hair look dead if using a hairdryer", and it actually gave an on-point, 100% relevant, no-bullshit answer. Try googling that: it's a million SEO spam results, none of which answer the question.
I'm not disagreeing with your point, but this is a really bad example. Google does pretty well with that exact question, including this at #8 (for me at least):
Jesus Christ. If all I did was read Hacker News comments, I would think that these magical algorithms, which have the potential for massive positive change, don't exist and that the world is coming to an end.
I get it, engineering trains us to look for failure modes, but my god try to have a little amazement at the progress.
I think bullshit generation is a side effect of any new accelerating technology, not a specific type of new accelerating technology. When we learned to write things down it accelerated our society and our bullshit generation. When we learned to send information electronically it accelerated our society and our bullshit generation. The article already mentions the internet but neglects to mention that is already mostly bullshit prior to LLMs. We still find ways to extract the growing value and move on to the next innovation despite the growing bullshit though.
For one thing, we use AI to generate answers or outputs we want to rely on. If I use human sources I pick sources I trust. I frequently use multiple sources and check citations.
If I still need to do that with "AI", what purpose does it serve?
This super low quality, low effort, non-original content does not belong on hacker news imho much less the top of the front page. I’ll take my downvotes.
The assumption here is that future iterations of the technology require "BS"-riddled datasets. I question that assumption, both because the technology probably can improve solely with existing datasets and because we don't know that "synthetic" data isn't able to improve things.
The internet is already mostly filled with low quality bullshit though, and GPT-4/Gemini are much better writers than whoever is churning out SEO as we know it now.
It's a lazy argument to imply this 1. invalidates the technological achievement, or 2. prevents iterative improvement a la the singularity. For one, the Internet itself is not bullshit just because a lot of spammers/hustlers put bad content on it to try to make money. And secondly, you can curate datasets... nothing's stopping researchers from training LLMs on shitty SEO now, and if they wanted to, they could curate datasets going forward to prevent LLM spam from entering the training sets of future models.
And finally, people already use reputation/identity/branding and proxies for it as quality filters on the internet. For example, this is an unfamiliar blogger to me and so I entered it with skepticism I wouldn't have with people like Gwern or Lynn Alden. Good writing from people like Gwern and Lynn Alden won't disappear just because LLM content exists on the internet - it just makes reputations and identity (eg to a real human) more important.
You are spot-on (I think) with the point that identity and reputation will become much more important. I hope this will end up with systems that help cultivate and verify reputation. I fear however that the identity leg is much easier to tackle.
What does me telling LLMs to write articles for "best vacuum cleaners 2024" and putting it on the Internet have to do with the ability of LLMs to improve themselves? Humans write those kinds of articles for the Internet as it is, and yet humans are the ones designing and improving LLMs now.
The Internet is a pull model. I can go to Gwern's website directly and not care that most other websites have crap on them.
People choose to use push models for content through meta properties, tiktok, and aggregators like reddit and HN, but nothing is forcing them to. If they push enough bad content, people won't keep using them. Already happened with Facebook and Reddit predecessors, probably happening to Reddit now.
It doesn't matter how big the haystack is when you have the ability to go directly to the needle.
It's on the first page only because it talks about a very widespread fear, namely that LLMs might actually become dumber because they are trained on their own output.
In reality, LLMs are already trained on the output of other LLMs, as specific and well-directed training is much more effective than a disorderly ingestion of text produced by illiterate internet users. Let's not forget the demographic makeup of the typical average user. I think that training on reddit and 4chan messages can be much more harmful than training on text produced by the worst chatbot.
Hacker News editors are trying to build a narrative to instill fear and pessimism.
"What's the fun in writing on the internet anymore?" explains that everyting can be stolen and rewritten by AIs. And today they talk about LLMs and the whole internet becoming dumber. Tomorrow they'll talk about news falsification and LLMs used to replace writers.
https://news.ycombinator.com/item?id=39415878
>In reality, LLMs are already trained on the output of other LLMs
That's a fact, but are there any studies on training an LLM on its own output, and not the output of a different LLM? For instance, ChatGPT gets knowledge updates, so I understand it must be retrained somehow. What happens if the retraining data contains large patches of its own output? Has this scenario been explored?