
A bit thin on detail, but will this require confidential VMs with encrypted GPUs? (And I wonder how long before someone cracks SEV-SNP and TDX and pirate copies escape into the wild.)


At the pace models improve, the advantage of going the dark route shouldn't really hold for long, unless I'm missing something.


Access to proprietary training data: Search, YouTube, Google Books might give some moat.


We have Common Crawl, which is also scraped web data for training LLMs, provided for free by a non-profit.


Common Crawl is going to become increasingly contaminated with LLM output, and training data that is less likely to contain LLM output will become more valuable.


I see this misconception all the time. Filtering out LLM slop is not much different from filtering out human slop. If anything, LLM-generated output is of higher quality than a lot of the human-written text you'd randomly find on the internet. It's no coincidence that state-of-the-art LLMs increasingly use more and more synthetic data generated by LLMs themselves. So, no, just because training data was produced by a human doesn't make it inherently more valuable; the only thing that matters is the quality of the data, and the internet is full of garbage which you need to filter out one way or another.


But the signals used to filter out human garbage are not the same as the signals needed to filter out LLM garbage. LLMs generate text that looks high-quality at a glance but might be factually inaccurate. For example, an LLM can generate a codebase that is well-formatted and contains docstrings, comments, maybe even tests, but uses a non-existent library or is logically incorrect.
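A toy sketch of the kind of signal that catches the non-existent-library case (the snippet and module name are made up; it just checks imports against whatever is installed locally):

    import ast
    import importlib.util

    def unresolved_imports(source: str) -> list[str]:
        """Return imported top-level modules that don't resolve locally."""
        missing = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                root = name.split(".")[0]
                if importlib.util.find_spec(root) is None:
                    missing.append(root)
        return missing

    # A hallucinated dependency gets flagged here, even though a spam or
    # quality filter tuned for human text would see nothing wrong.
    snippet = "import totally_real_llm_helpers\nprint('hi')\n"
    print(unresolved_imports(snippet))  # ['totally_real_llm_helpers']

Well-formatted but subtly wrong output needs that kind of semantic check; surface-level quality signals don't help.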


LLM output is uniquely harmful because LLMs trained on LLM output are subject to model collapse

https://www.nature.com/articles/s41586-024-07566-y
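A 1-D intuition pump for the effect (not the paper's setup, just a Gaussian being refit to its own samples each generation):

    import random
    import statistics

    mu, sigma = 0.0, 1.0   # generation 0: the "real" data distribution
    n = 50                  # finite sample drawn per generation

    for gen in range(1, 501):
        samples = [random.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(samples)      # each generation is fit...
        sigma = statistics.stdev(samples)   # ...only to the previous one's output
        if gen % 100 == 0:
            print(f"gen {gen}: sigma = {sigma:.3f}")

Estimation error compounds and sigma tends to drift toward zero, with the tails disappearing first; that is roughly the mechanism the paper describes for models trained on their own output.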


The problem with filtering is that LLMs can generate a few orders of magnitude more slop than humans.


Are the differences between Google Books and LibGen documented anywhere? I believe most models outside of Google are trained on the latter.


The number of folks that have the hardware at home to run it is going to be very low, and the risk to companies of leaking it is going to make that unlikely IMHO.


I think home users would be the least of their concerns.


It only takes one company to leak it


Or one company to get hacked and the hackers leak it


Realistically the only people able to run models of this size are large enterprises.

Those enterprises won’t take the risk of being sued for using a model without proper permission.


I don't know – if there's still dumb money being thrown towards AI in non-tech and non-privacy-heavy industries, especially ones traditionally targeted by ransomware, there'll always be a chance of datasets getting leaked. I'm thinking retail and consumer product-oriented companies. (There's always non-Western governments without strong security orgs, too.)


Nations.


Or large government-sponsored entities like Mossad. Air gapping won't protect against spying. Good luck trying to sue them.


They can get "hacked" and whoops.


> I wonder how long before someone cracks SEV-SNP

https://bughunters.google.com/blog/5424842357473280/zen-and-...


I'd expect watermarked model weights plus a lot of liability to disincentivize leaking the model.
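A rough sketch of what a weight watermark could look like (spread-spectrum style; an illustration, not any lab's actual scheme): add a tiny keyed pseudorandom pattern to the weights, then test for it by correlation.

    import numpy as np

    def watermark(weights, key, eps=0.01):
        """Embed a keyed pseudorandom pattern into the weights."""
        pattern = np.random.default_rng(key).standard_normal(weights.shape)
        return weights + eps * pattern

    def detect(weights, key):
        """Correlation with the keyed pattern: ~0 if unmarked, ~eps if marked."""
        pattern = np.random.default_rng(key).standard_normal(weights.shape)
        return float(np.dot(weights.ravel(), pattern.ravel()) / weights.size)

    w = np.random.default_rng(0).standard_normal(1_000_000)  # stand-in "weights"
    print(detect(w, key=42))                      # noise, roughly 0.000
    print(detect(watermark(w, key=42), key=42))   # ~0.01, well above the noise floor

A leaker could fine-tune or add noise to try to wash the pattern out, so this buys attribution for liability purposes rather than prevention.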



