AI-generated content, other unfavorable practices get CNET on Wikipedia banlist (tomshardware.com)
105 points by goplayoutside on March 3, 2024 | 58 comments



Wait, you're telling me their malware-infested downloader isn't a reliable source for news, either?


[dupe]

Some more discussion here: https://news.ycombinator.com/item?id=39556041


Since that thread didn't get significant attention, the current thread doesn't count as a dupe. There are some comments there though:

Wikipedia downgrades CNET's reliability rating after AI-generated articles - https://news.ycombinator.com/item?id=39556041 - Feb 2024 (10 comments)


2 days is not that long ago. 30+ upvotes and 10 comments is plenty of eyeballs for a recent topic. Conversely, since it's only been a few days, it's not a dead topic there, so the discussion could have remained there? The point is there's already a recent post that many saw, but we have to see it again instead of just continuing the conversation there? <Shrug> Definitely a Related, of course.


30 upvotes and 10 comments is borderline, yes, but that thread only spent 8 minutes on HN's front page. If it had spent more time there I'd agree.

Btw you can check this kind of thing (more or less) using https://hnrankings.info/39556041/ - it didn't catch the 8 minutes but the approximation is good enough.

I suppose I could've merged the current comments thither and re-upped that one, but I didn't think of it!


> The site that was hurt by this so-called SEO heist is called Exceljet, a site run by Excel expert David Bruns to help others better use Excel.

Welcome to the enshittening. There’s only so much a single dedicated operator can take before they pack it in. We need legislation to catch up fast and some big symbolic restitution cases decided in the courts.


Interesting that the article itself gives AI-generated vibes:

>It's important to remember that while Wikipedia is "The Free Encyclopedia that Anyone Can Edit,"


(disclaimer: I'm a casual user of ChatGPT and haven't gone too deep into how it works)

Given its nature as an LLM, a complex next-word predictor, the phrase "it is important to remember" could be a pattern that was inadvertently trained in as a way for the model to keep itself on track.

If it writes "some points, it is important to remember {something}, some more things", it may be able to generate better text than "some points, weird tangent".

Since it doesn't have a hidden memory, everything that it "thinks" is out there in the text, including its own cues for what it should do. It also can't go back and edit its previous text to remove the self-hints or clarify earlier points without calling them out. That style of writing differs from natural human writing, since we are able to keep on topic (or not) without needing to write messages to ourselves that others can read.

When we do, we point it out rather than trying to slip it in casually: "note to self (or reader)" for things that break the flow of the text, or "as an aside" for the tangents.
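
To make that concrete, here's a toy sketch (purely illustrative: the two-word "model" below is invented, and a real LLM conditions on the whole context, not just the last word). The point is that the generation loop's only state is the text already emitted, so any cue like "it is important to remember" has to live in the visible output:

    # Toy autoregressive generation, illustrative only.
    # The "model" is a made-up next-word lookup table; the loop's ONLY
    # state is the text emitted so far -- there is no hidden scratchpad
    # the model can write to and later erase.
    toy_model = {
        "some":      {"points": 0.9, "tangent": 0.1},
        "points":    {"important": 0.6, "tangent": 0.4},
        "important": {"points": 0.8, "tangent": 0.2},
    }

    def generate(prompt: str, steps: int) -> str:
        tokens = prompt.split()
        for _ in range(steps):
            # Everything the model "knows" about its own plan is in
            # `tokens`, the same text the reader sees.
            choices = toy_model.get(tokens[-1], {"tangent": 1.0})
            tokens.append(max(choices, key=choices.get))  # greedy pick
        return " ".join(tokens)

    print(generate("some points", 4))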


Given that, would a chain of thought + final answer be a better output then?


How would you recognise AI-generated content? Is it a particular writing style? As a dyslexic person I am having a hard time recognising it.


Stating the thesis and then (repeatedly) re-stating it in (very) slightly different wording was a red flag. Professional journalists don't write like that, but AI and schoolkids trying to pad out their essays do.

I'm guessing that a human did edit this AI output, just not very well.


I think you're giving too much credit to "professional" journalists. This is also an excellent way to hit a certain word count if you are a few sentences short: basically you bloat the piece without needing to write anything new. Listicles and other pre-LLM blogspam did this all the time.


We're all facing the need to build a mental list of which news sources are publishing AI-generated garbage, so that we can disregard them.

I'm still surprised to see that, as far as I can tell, no news outlets have made a public commitment to never use AI in their writing. It seems like it would be an easy way to build the brand on a commitment to quality.

I've griped about this before, but here we still are. We now know that MSN news, for instance, has no credibility due to their publishing AI-generated misinformation. https://news.ycombinator.com/item?id=39043135


ChatGPT loves to add "it's important to remember" to every second answer. Also "multifaceted" and a few other adjectives like that.


Objection -- ChatGPT does not "love" anything. A reason to repeat that phrase is that it is written often, for real communication reasons. Declaring that something often written by humans is now an indication of ChatGPT is the wrong direction IMHO, and directly objectionable here, a place where reasoning humans think and communicate via writing.


ChatGPT was additionally finetuned after the initial training of the base model, so it's not completely clear if it truly represents the original distribution anymore. I use open weights models and they repeat such phrases less frequently in my experience.


That's completely untrue. Try using any base (non-finetuned) model, such as Llama base or Mistral base, and you'll see that it does not write like that. That response format and writing style were added by OpenAI during the finetuning stage.


the post says "a reason to repeat that phrase"

not THE reason, not THE ONLY reason, so it is not COMPLETELY untrue. agitated?


I can visualize a little cartoon Einstein scribbling away on his chalkboard, but everything I'd actually project on the chalkboard would be nonsense. Relying on my cartoon output or thinking it has value would be madness.

Someone who's studied the subject quite a bit might visualize an actual Einstein chalkboard from memory. Their output could be verified and referenced, but looking for new Einstein-level math on the chalkboard would be madness.

Someone who Einstein himself would consider a peer might use this visualization method as a way to do actual work.

If we're assigning value to our chalkboards, we'd be able to explain why we chose the numbers -1, 0, and 1 (the cartoon, the reference, and the peer, respectively). This would bias any math we'd do toward one of the chalkboards depending on our intent. At this point my chalkboard is useful as a filter.

Putting this all together, we'd see that the overall look of the cartoon responses would be a blend of 0 and 1 styles and would depend on our requests e.g. a reference request would look mostly like 0's art style. My own personal art style will be intentionally absent because it's only ever framing nonsense, by my own admission.


Long before CNET admitted to this, it was obvious to me. The writing has the vocabulary and grammar of an accomplished academic but no ability to actually convey the important points concisely.


An easy tell is "level of effort given context"


Enjoy, for now, that you can still tell. It's early times, like the video games before Call of Duty etc.

The worst thing about AI is how it can easily betray you, manipulate you, and have swarms execute long-term sleeper plans at scale!


Sounds like all information we're given should be met with a healthy dose of skepticism. The ability to smell AI makes it a little easier right now to put it in the garbage pile.


We are moving toward a future where robotics viruses will infect deployed humanoids and command them to go on a killing spree.

https://www.youtube.com/watch?v=WWAnJX889j0


Eh, ChatGPT uses phrases like "it's important to remember" because they were common in the training data. The rest of the sentence definitely isn't ChatGPT:

> It's important to remember that while Wikipedia is "The Free Encyclopedia that Anyone Can Edit," it's hardly The Wild West.

My reading is that the article is average human prose (not great, not unreadable), not LLM prose.


I wonder if ChatGPT was trained on too many high school essays. It’s not very good writing style, but it’s what most people start out writing.


>My reading is that the article is average human prose (not great, not unreadable), not LLM prose.

It's not average human prose either (for one thing, expecting GPT to converge on some average doesn't really make sense in the first place).

Base models with no RLHF or fine-tuning don't talk like that at all. This is specifically an artifact of the post-training fine-tuning/RLHF process.


I never said that I expected ChatGPT to converge on an average, I said that this particular phrase is not strong evidence by itself of GPT authorship.

"It's important to remember" is a phrase that plenty of people used before ChatGPT. I've included three examples below from a quick time-boxed search. Just because ChatGPT says a phrase doesn't mean that every time you see it it came from ChatGPT—someone wrote the training data that ChatGPT was trained on, and someone else wrote the data (or selected the responses) that it was fine tuned on. ChatGPT isn't inventing new phrases out of whole cloth, it has a stereotyped style that is pieced together out of many existing phrases.

A collection of such stereotyped phrases in a single piece would be stronger evidence of GPT authorship, but I see no evidence of that here.

https://old.reddit.com/r/gravityfalls/comments/bgtnez/while_...

https://twitter.com/AsteadWH/status/1050813462673264640

https://stackoverflow.blog/2019/12/19/what-senior-developers...


I think I misunderstood you then. Fair enough.


> My reading is that the article is average human prose.

Which is, of course, exactly what ChatGPT is trained to produce. A lot of people's mental AI detector is actually a mediocrity detector.


No, GPT is trained to spit out anything that falls into its training data distribution.

There's no average. Pre-training incentivizes being able to predict the smartest string of text in the corpus as readily as the dumbest. It doesn't converge on an "average", and it doesn't really make sense that it would, either.

Base models don't talk like GPT. This is strictly an artifact of post-training fine-tuning/RLHF.
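
One way to sanity-check this yourself is to sample the same prompt from a base checkpoint and its instruction-tuned sibling and compare the styles. A rough sketch with Hugging Face transformers (the model names are just one example of a base/instruct pair, and this is an informal eyeball test, not a benchmark):

    # Informal A/B test: same prompt, base model vs. instruct model.
    # Requires: pip install transformers torch accelerate
    from transformers import pipeline

    prompt = "Wikipedia is the free encyclopedia that"

    for name in ("mistralai/Mistral-7B-v0.1",            # base: pre-training only
                 "mistralai/Mistral-7B-Instruct-v0.1"):  # fine-tuned variant
        gen = pipeline("text-generation", model=name, device_map="auto")
        out = gen(prompt, max_new_tokens=80, do_sample=True, temperature=0.8)
        print(f"--- {name} ---")
        print(out[0]["generated_text"])

If the claim above holds, the base model should simply continue the prose, while the "it's important to remember" register shows up mainly in the instruct model's output.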


My theory is that this phrase was baked in during alignment training to ensure the model prints out any necessary warnings.

"It's important to remember to consult a mechanic" for example.

It's just some priming.


If I randomly play GM-level chess moves from Rybka or AlphaZero, on average every 5 moves, could you tell?


Every 5 moves, yes.

But as most Chess Grandmasters say, you only really need to use it in one difficult spot to change the result of a game.


Nah. I play the top engine move like half the time and I'm not a particularly strong player.


Both can be true at the same time. A lot of the time the best move is obvious. However, if you play the top move 20% of the time, you will also play the top move in a lot of cases where it's not obvious. Given enough games, it's detectable.
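
A back-of-the-envelope version of "detectable given enough games" (all numbers below are invented for illustration; real anti-cheat systems also model move difficulty, player rating, and time usage):

    import math

    def p_at_least(k: int, n: int, p: float) -> float:
        """P(X >= k) for X ~ Binomial(n, p): the chance an honest player
        matches the engine's top move at least k times in n moves."""
        return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))

    honest_rate = 0.55  # assumed top-move match rate for a clean player
    n_moves = 200       # non-trivial moves sampled across many games
    observed = 135      # of those, moves matching the engine's top choice

    print(f"match rate: {observed / n_moves:.0%}")
    print(f"p-value under honest play: {p_at_least(observed, n_moves, honest_rate):.1e}")

Over a handful of moves an elevated match rate proves nothing; over hundreds of moves it becomes a vanishingly unlikely coincidence.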


See https://xkcd.com/978/ (Citogenesis) for what I think is the biggest problem with a source that publishes AI-generated articles.

Wikipedia is supposed to use primary sources, and AI-generated articles, by nature, can't be primary sources. In particular, AIs love to use Wikipedia in their training datasets: it is a free, high-quality source of information, but it is not flawless either. So there is a good chance that if Wikipedia cites an AI-generated article, that article has Wikipedia as its source, starting the "citogenesis" process.


Slight nitpick: while Wikipedia allows primary sources in some cases, it generally prefers secondary sources.

https://en.wikipedia.org/wiki/Wikipedia:No_original_research....


That's more than a slight nitpick - it corrects a core misunderstanding to be found everywhere this discussion seems to be happening.

To wit: AI articles are to be found along a dramatic spectrum of quality. If such an article is high-quality, relevant to the subject matter, asserts the fact in question, and makes proper and veracious use of a primary source in support of such an assertion, why isn't it a reasonable source for an encyclopedia?

I can imagine a future with rich educational materials with these layers:

* raw experimental data -=>

* publication ("primary" source) -=>

* AI-generated review of many publications ("secondary" source) -=>

* Human-authored encyclopedia article (with one or two people following the fact pattern all the way back down to the data, and many more people helping to synthesize the higher layers into a rich, readable, considerate, diverse synopsis)


I thought Wikipedia loved secondary sources? Like articles about something, as opposed to a primary source which would be like a person who was present or something (which would mean it’s original research).

An article that says Abraham Lincoln wrote the Emancipation Proclamation would be preferred to asking Abraham Lincoln himself and then having the Wikipedia article cite "my interview with Abe".

(Your general point is valid, I’m just confused about the terminology)


In the historical sense, a primary source comes from the time period under study. A secondary source is later, and interprets, or records, later impressions/recollections of the period in question. So for example, a newspaper article from the 1890s is a primary source about the late 19th century, even if the author had no direct knowledge of the topic about which they were writing.


[flagged]


> you'll have no idea if a brilliant essay or entire magazine is made by a bot.

That doesn't sound like a problem to me. If the ideas being expressed offer value to the readers, who/what created them isn't particularly relevant.

> you'll have no idea if the politician/professor/activist on the video is real.

Tying real-world identity to the expression of ideas on the Internet has already created the problems you describe toward the end of your comment. We already see self-censorship, external censorship, and people just regurgitating a "safe" subset of ideas that they may not even actually believe.

Many of the most honest and insightful comments I've seen here were from throw-away accounts, or "Anonymous Coward" on Slashdot, or otherwise weren't immediately tied to an identifiable person.

The most artificial, blandest, and lowest-quality online content I've seen has consistently been from politicians who are merely repeating their party's vetted talking points, or the mass media, or from other identifiable people.


In the days before "google it" was a synonym for "find it", we had various curated link sites, and even physical magazines with hand-curated lists of links that people interested in a certain topic might find interesting. This still exists today in some forms, such as the "awesome lists" you see for some programming topics, for example https://github.com/sindresorhus/awesome .

Just like there was a time when 90%-99% of all email traffic was viagra spam but e-mail as a concept survived, I imagine in the future most of the internet by volume will be AI-generated trash, and those in the know will still circulate lists of where the other 1% can be found.

An even brighter scenario is that someone, maybe a kid tinkering in their garage, figures out how to make a search engine that finds the good stuff, doesn't immediately die to AI bot farms' SEO efforts, and is financially viable.


It's already been happening. Google has been useless for a few years now due to blog spam. To be fair, they shot themselves in the foot by prioritising content length.

Measuring search ranking by content length is like measuring aircraft building progress by weight.

So now people put 'reddit' or whatever in the search to try to get some authenticity, but even those platforms are full of corrupt mods and astroturfing.

The UK are always ahead of the game when it comes to a dystopian future, chomping at the bit to put in place some kind of internet licence to watch pr0n: https://en.wikipedia.org/wiki/Proposed_UK_Internet_age_verif...


A good read is a good read - whoever writes it.

As for the whole issue of fake media: well, look at social media over the years; I doubt it will make much difference compared to what we have already. People learn, over time, what to trust and whom to trust.

The real issue though, as you touch upon, will be governments using this as another way to gain even more power and control of our daily lives.

Though my thoughts are that, like most new tech that can copy/mimic, it will be the Disneys and movie studios with their lobbyist money that do the most harm to any progress.

Though I am mindful of the potential damage a rogue AI could do in the financial markets.

One upside: postal letters will start to see an increase in usage as people start to trust physical media more than digital. It's certainly a trend I foresee happening.


AI is unique in that the incentive is to write engaging, compelling, perhaps wonderful, ostensibly truthful content that is total nonsense.

Enthusiasts in a subject are often wrong, but usually the goal is at least to convey something useful about the topic to the reader.

The spammer/content farmer doesn't care about that. Users trying to farm attention on social media don't care about that. A manipulator on a certain topic doesn't care about that. They just want SEO and eyeballs. But usually the intersection between high quality topical writers and people with these motivations is very low.

...That is not the case any more.


> state approved

Why? Is a higher "authority" the only solution you can imagine?

Surely you realize that public key cryptography exists, and that the politician/professor/activist has people who know him and can confirm he created the video, and those people know others, until you reach everyone in a few steps.

What we really need is a network of trust and to teach people how trust works.
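
For the mechanical half of that, a bare-bones sketch with the Python cryptography package (the hard part, distributing and endorsing public keys through people you trust, is social and isn't shown):

    # Sketch: a speaker signs a hash of their video; anyone who obtained
    # the public key through their web of trust can verify a copy.
    # Requires: pip install cryptography
    import hashlib
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    private_key = Ed25519PrivateKey.generate()  # kept by the speaker
    public_key = private_key.public_key()       # spread via the trust network

    video = b"...raw video bytes (stand-in payload)..."
    digest = hashlib.sha256(video).digest()
    signature = private_key.sign(digest)        # published alongside the video

    try:
        public_key.verify(signature, digest)    # the viewer's check
        print("signature valid: the keyholder vouches for this video")
    except InvalidSignature:
        print("signature invalid: treat as unverified")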


We'll be lost in a Library of Babel, filled with every possible syntactically valid article, regardless of truth value.


We'll go back to the gatekeepers. Only certain media sources will be deemed trustworthy as sources, and the media industry will revert to corporate power. Now, will news publishers be smart enough to understand this? Maybe, maybe not. It appears CNET wasn't. If they had let AI play itself out and never published generated content, they could, in the long run, have been significantly more profitable and prestigious as a media source.


Maybe a resurgence of local news outlets? "news written by people you can trust to be people" might be what we need to bring that subscription model back.


In some ways I see parallels to space debris in LEO that create growing risk for navigating that space and finding anything up there.


I'm not sure what changes if this brilliant essay is written by a bot, a human, or a martian.


A connection to truth, or objective reality if you still believe in that.


It'll be interesting if AI-generated content ever becomes indistinguishable from human writing in both evidence and prose. I suppose that's the ultimate goal.

The final hurdle at that point will be the de-democratization of writing and the dilution of creativity and novel writing. There will probably always be a market for that, but for things like reporting events, it seems like AI could easily overtake the industry.


Pre-2023 printed books will be the new incunabula.


The more important question is whether it's no longer distinguishable because the AI-generated content has improved, or because human-generated content has regressed.


I'm not sure if you mean this for real, or if you're just doin' it for the joke, but if you _are_ being serious, perhaps you're just reading the wrong things? There are some truly excellent books, articles, blog posts, and more being written by actual humans out there today, probably more than ever before.


Separating signal from noise is going to become much harder.

Here's a spoiler-filled article from a major site (IGN) whose last paragraph (and likely more) has clearly been authored by ChatGPT in order to meet tight publishing deadlines and coincide with a movie's release: https://www.ign.com/articles/dune-part-2-post-credits-scene-...

The indicators are all there: "In short, (vague sentence)... Is (hypothetical from article) or (other hypothetical from article)? Will (vacuous statement) or (inverse vacuous statement)? We'll have to wait for (eventuality) to find out."

There is no attribution to AI for the generated content, though, and this lack of attribution is going to become the norm once LLMs become just another authoring tool like spell-check. Coupled with the race for clicks, the "excellent blog posts" are going to be drowned out.



