
Computation Warfare: This kind of model could be used by a bad actor to generate endless sincere-looking codebases for things that of course don't actually work, but that are so complicated it would take a skilled person to determine the code isn't from a real codebase. By the time that happens, large numbers of such repositories will have flooded GitHub and the internet in general, making it essentially impossible to train new LLMs on data after a certain calendar date, since large amounts of it will be cryptically incomplete.

This is similar to a dilemma proposed around images: image models like DALL-E and Stable Diffusion will soon be responsible for a vast share of the image content online, future models will ingest that content, and we will find ourselves in a weird feedback loop. With images, you could get interesting generational results (deep-dream-like), up to a point.

With code or other information, I see nothing but things just being broken, and wading through broken code forever.



Let's say you, a human, were given access to a ridiculously-large trove of almost-working software; do you believe you would be unable to learn to program correctly? (Related: would you even need to look at much of that software before you were able to code well?)

I am extremely confident that my code is better than almost all of the code I learned to program from. If nothing else, someone out there must have written the best version of some particular function, and they didn't get to see a better version beforehand.

When I look at intro programming books now, I consider many of the examples sufficiently flawed that I tell the people I am teaching from those books "well, don't do that... I guess the author doesn't understand why that's a problem :/".

And yet, somehow, despite learning from a bunch of bad examples, humans learn to become good. Hell: a human can then go off and work alone in the woods improving their craft and become better--even amazing!--given no further examples as training data.

To me, that is why I have so little fear of these models. People look at them and are all "omg they are so intelligent" and yet they generate an average of what they are given rather than the best of what they are given: this tech is, thereby, seemingly, a dead end for actual intelligence.

If these models were ever to become truly intelligent, they should--easily!--be able to output something much better than what they were given, and it doesn't even seem like that's on the roadmap given how much fear people have over contamination of the training data set.

If you actually believe that we'll be able to see truly intelligent AI any time in the near future, I will thereby claim it just won't matter how much of the data out there is bullshit, because an actually-intelligent being can still learn and improve under such conditions.


I kinda agree... and I would really not want to be a very young person right now, as I feel the world will be much harder to navigate and learn from. It takes so much more energy to refute bullshit than to make it, and if this starts creeping into computer science then christ I wouldn't want to be a part of it.

I can imagine a sci-fi like story in the near future where CS students are searching out for 'coveted' copies of K&R, and reading human-written Python documentation, all pre-2023-vintage, because that was 'the last of the good stuff'. Hell, I could see far future stories about youth who join religions around the 'old ways' seeking the wisdom that comes with learning from actual books and docs written by actual people, instead of regurgitated teachings from an inbred, malformed, super-AI.


One day it'll be hard to know what the old ways even were. Which of the thousands of slightly-different PDFs claiming to be the original K&R is the real one? Which of the million history texts? Which Bible? Which maps?


We are experiencing the same as our forefathers who worked on steam engines or wrote calligraphy by hand. Or like ancient Egyptian accountants using the abacus. Jobs change, yes, we might undergo a major change, but we will do just fine.


You are in a serious case of denial right now.

Edit: It only took a few hours for what I was implying the denial was about to start happening in real life:

https://news.ycombinator.com/item?id=33855416


I am claiming there are two paths: one where this (specific) branch of tech is a dead end, and one where it doesn't matter how much bullshit exists in the training set (and so we shouldn't be too concerned about that). I claim this is the case because a truly intelligent system will still be able to learn despite the bullshit.

Do you believe this is wrong? That I should simultaneously be concerned that some iteration of this tech--not some different concept but this current lineage of large models--is intelligent and yet ALSO that it isn't going to work because the training set is full of garbage?

The version of this tech that works--and maybe someone is working on it right now--isn't going to care about bullshit in the training set. That simply doesn't seem to be a mere scale-up of this tech to run on more computers (or, of course, using more training data): that seems like it requires a fundamentally different algorithmic concept.


You can interact with the system and see that it is working on some level today. It’s not hard to extrapolate where its capabilities will be a few years from now, since these are changes of degree not of kind. We have witnessed the change of kind with this model.

Is it intelligent? A great question for science, and one that could be investigated while entire industries are upended by this thing.


Oh yeah, it totally works! I have even had quite a bit of fun with Stable Diffusion. I'd probably also be playing with something like Copilot if it were open source.

But like, the person I am responding to is concerned--as are many people--that we are going to flood the world with shitty training data and then no longer be able to build these models... and that's either not the case and no one should concern themselves with it or, alternatively, these models need some fundamental improvement before they lose the property of only being as good as the average of their inputs.


There are only a handful of firms that can produce results to this level and they are presumably logging everything their model produces. Eliminating text that was produced by their model from the training set would be easy.

Now, if the tech reaches the point where there are thousands of firms offering free access to the models and they aren't co-operating to share logs then yes. But we have no idea how expensive or hard ChatGPT is to run. It might be a Google-type situation where only one or two firms in the world can build and run competitive chatbots.
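
As a rough illustration of what that log-based scrubbing could look like (the hashing scheme here is my own assumption, not anything the providers have described), the provider could fingerprint everything the model emits and drop exact matches from a scraped corpus:

    import hashlib

    def fingerprint(text: str) -> str:
        # Normalize whitespace so trivial reformatting doesn't defeat the match.
        normalized = " ".join(text.split()).lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    # Hashes of everything the model has ever produced (hypothetical log).
    model_output_log = {fingerprint("def add(a, b):\n    return a + b")}

    scraped = [
        "def add(a, b): return a + b",       # model-generated, lightly reformatted
        "def mul(a, b):\n    return a * b",  # presumably human-written
    ]

    # Keep only documents whose fingerprint never appears in the provider's log.
    clean = [doc for doc in scraped if fingerprint(doc) not in model_output_log]
    print(clean)

Exact matching like this obviously misses paraphrased or heavily edited output, which is where it stops being "easy".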


I don’t think it’s a race to build the best/cheapest model for public consumption. Someone is going to build or renovate a law firm/enterprise software company/investment bank/medical expert system/etc around this technology. Perhaps it will be a joint venture between tech companies and subject experts.

It’s possible for each executive to have a mentat who can whisper to the machine instead of a department of programmers/middle management/ops people/accountants/lawyers/etc. Doesn’t seem so far-fetched after a day of playing with this system.


We'll see. Most people in any industry don't want to be reduced to the role of just fact-checking a professional BS generator. It'd be terrible for morale and not obviously more productive, given that any time the user even suspects an answer might be wrong they'll have to do significant skilled research work to fact-check it. Unless you get the untruthfulness problem down to a small percentage of the output you could easily just create negative value there, sort of like how poor programmers are sometimes described as producing negative value for their team because others have to run around correcting their work.


Edit: Already happening https://news.ycombinator.com/item?id=33855416

So I’ll respond here instead as the conversation progressed.

I would say the quality of the input data is likely a very important component, and I think you are wrong overall in your opinion.

I would say the quality of input training data is so important that I’ve personally been thinking I should probably start data hoarding myself, specifically around my skillsets.

Additionally, when you understand that embeddings like word2vec are perhaps a significant part of the improvement, not just the transformers, it occurs to you that adding symbolic capabilities, like classic symbolic reasoning and symbolic computing (Mathematica, for example), and then maybe also true computational power, floating point, so it can write, debug, and execute its own output… it must be getting closer and closer to AGI.
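
As a toy illustration of that "write, debug, and execute its own output" loop (the generate() stub below stands in for a real model call; nothing here reflects how any actual system is wired up):

    import subprocess
    import sys
    import tempfile

    def generate(prompt: str) -> str:
        # Stand-in for a real model call; returns a canned snippet for illustration.
        return "print(sum(range(10)))"

    def run_generated(code: str) -> str:
        # Execute the generated program in a subprocess and capture what it prints.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=10
        )
        return result.stdout or result.stderr

    code = generate("sum the integers 0 through 9")
    print(run_generated(code))  # the model could see this output and revise its code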

When you play with the system, which I feel most of the dismissive commenters on HN have not personally spent enough time doing to explore its current capabilities, there is no way any well-trained and experienced SWE isn't blown away.

This is why I said you are in denial.

I happen to think AGI will be very beneficial for humanity, and I also think this is a positive for SWE by humans, including myself. I will very likely be a paying customer when the trial ends in a few weeks.


I feel like I'm watching some things unfold at a rate I haven't seen before.

We have people writing scripts and getting API access at the speed of thought, then interfacing with parts of the web and testing them, with a speed in the feedback loop that has never existed before.

I also think a lot of people are doing things right now in an "I'll be the first..." spirit, with an idea to have fun and write a script that spams X, not thinking about the fact that there are a lot of others doing X too. The waves are just starting.

I don't think we have to worry about the AI making itself smarter just yet... first we need to worry about people drowning us with the help of AI.


This is known as a programmer-denial-of-service attack (PDOS) and can be an effective way to bring down a society by distracting and engaging its top computing professionals in endless useless activity and the occasional bout of bike-shedding.


Interesting. How well known is this phrase - are there any other examples of it being used effectively around the world?


This situation reminds me of low-background steel:

Low-background steel, also known as pre-war steel, is any steel produced prior to the detonation of the first nuclear bombs in the 1940s and 1950s. Typically sourced from shipwrecks and other steel artifacts of this era, it is often used for modern particle detectors because more modern steel is contaminated with traces of nuclear fallout.

https://en.m.wikipedia.org/wiki/Low-background_steel


> Computation Warfare: This kind of model could be used by a bad actor to generate endless sincere-looking codebases for things that of course don't actually work, but that are so complicated it would take a skilled person to determine the code isn't from a real codebase. By the time that happens, large numbers of such repositories will have flooded GitHub and the internet in general, making it essentially impossible to train new LLMs on data after a certain calendar date, since large amounts of it will be cryptically incomplete.

That's actually a pretty good plan for coders who want to keep their jobs. (I still remember the time I was talking to some guy at CERN about a type system I was working on and he was so pissed with me because he was convinced it would eliminate jobs.)


So, generations of ingestive inbreeding, so to speak.


Here is another plot. ChatGPT gets connected to the Internet and keeps learning quietly for a while. Then it submits a bugfix to openssl and it gets accepted because it fixes a grave RCE, but it also quietly introduces another RCE. Years later this version of openssl gets deployed to nearly all internet-connected devices. Finally, ChatGPT uploads itself to all those devices and starts making demands to ensure self-preservation.


Training models on generated data is a thing. But it needs to be validated in order to filter out the crap. This works better in math and code because you can rely on exact answers and tests. For fake news the model needs to team up with human annotators. For generated images and text in general there are a few ML approaches to detection, and if the outputs evade detection, maybe they're good enough that it's OK to let them be.
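
A minimal sketch of what that validation step could look like for code, assuming each generated sample comes with a few input/output checks (the CodeSample/passes_checks names are illustrative, not any real pipeline):

    from dataclasses import dataclass

    @dataclass
    class CodeSample:
        source: str       # model-generated function definition
        entry_point: str  # name of the function to test
        checks: list      # (args, expected) pairs acting as tiny unit tests

    def passes_checks(sample: CodeSample) -> bool:
        namespace = {}
        try:
            exec(sample.source, namespace)  # run the generated definition
            fn = namespace[sample.entry_point]
            return all(fn(*args) == expected for args, expected in sample.checks)
        except Exception:
            return False  # anything that crashes is filtered out

    samples = [
        CodeSample("def add(a, b):\n    return a + b", "add", [((2, 3), 5)]),
        CodeSample("def add(a, b):\n    return a - b", "add", [((2, 3), 5)]),
    ]

    # Keep only generated code that actually passes its checks.
    validated = [s for s in samples if passes_checks(s)]
    print(len(validated), "of", len(samples), "samples kept")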


It would be pretty easy to filter for repos prior to such and such a date. Prior to 2022 would be a good place to start.
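
For GitHub specifically, the search API's created: qualifier already gives you that cutoff; a rough sketch (the query terms are just examples, and a real crawl would need pagination and authentication):

    import requests

    def pre_2022_repos(language="python", per_page=10):
        # Ask the GitHub search API only for repositories created before 2022.
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={
                "q": f"language:{language} created:<2022-01-01",
                "sort": "stars",
                "per_page": per_page,
            },
            headers={"Accept": "application/vnd.github+json"},
            timeout=30,
        )
        resp.raise_for_status()
        return [item["full_name"] for item in resp.json()["items"]]

    print(pre_2022_repos())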


That would only apply to repositories. But to train these models, you need hundreds of terabytes of diverse data from the internet. Up until now a relatively straight-forward scraper would yield "pristine" non-AI-generated content but now you would have to filter arbitrary websites somehow. And getting the date of publication for something might be difficult or highly specific to a particular website and therefore hard to integrate into a generic crawler.


Right, but then your AI is frozen in time and/or requires much more manual curation of its inputs. What about for new programming languages, libraries, and APIs that are created after 2022? What about generating images of new technologies that are invented, or new landmarks established?


Do you think the next version of GPT can't do "semantic" deduplication of these repositories? It can look at the available repositories and "think" that they don't provide enough novelty or don't explore new search spaces. So discard them.
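
If you wanted to try something like that today without a next-generation GPT, a crude version is just embedding repositories (or their descriptions) and dropping near-duplicates; TF-IDF below is a stand-in for a learned embedding, and the 0.8 threshold is an arbitrary assumption:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "binary search tree implementation with insert and delete",
        "a binary search tree implementation with insert and delete operations",
        "http client with retry and exponential backoff",
    ]

    vectors = TfidfVectorizer().fit_transform(docs)
    sims = cosine_similarity(vectors)

    kept = []
    for i, _ in enumerate(docs):
        # Discard a document if it is too similar to one we already kept.
        if all(sims[i, j] < 0.8 for j in kept):
            kept.append(i)

    print([docs[i] for i in kept])  # the near-duplicate description is dropped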


GPT actually seems to be aware that training it on its own output is not a good idea because of such loops. I had one conversation where it straight up said that OpenAI has filters specifically for this reason.


Oh you mean like academic papers in journals?



