
A bit thin on detail, but will this require confidential VMs with encrypted GPUs? (And I wonder how long before someone cracks SEV-SNP and TDX and pirate copies escape into the wild.)


At the pace models improve, the advantage of going the dark route shouldn't really hold for long, unless I'm missing something.


Access to proprietary training data: Search, YouTube, Google Books might give some moat.


We have Common Crawl, which is also scraped web data for training LLMs, provided for free by a non-profit.


Common Crawl is going to become increasingly contaminated with LLM output, and training data that is less likely to contain LLM output will become more valuable.


I see this misconception all the time. Filtering out LLM slop is not much different from filtering out human slop. If anything, LLM-generated output is of higher quality than a lot of the human-written text you'd randomly find on the internet. It's no coincidence that state-of-the-art LLMs increasingly use more and more synthetic data generated by LLMs themselves. So, no, just because training data was produced by a human doesn't make it inherently more valuable; the only thing that matters is the quality of the data, and the internet is full of garbage which you need to filter out one way or another.


But the signals used to filter out human garbage are not the same as the signals needed to filter out LLM garbage. LLMs generate text that looks high-quality at a glance but might be factually inaccurate. For example, an LLM can generate a codebase that is well-formatted and contains docstrings, comments, maybe even tests, but uses a non-existent library or is logically incorrect.
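A toy sketch of the kind of signal that catches the non-existent-library case (the snippet and module name are made up; it just checks imports against whatever is installed locally):

    import ast
    import importlib.util

    def unresolved_imports(source: str) -> list[str]:
        """Return imported top-level modules that don't resolve locally."""
        missing = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                root = name.split(".")[0]
                if importlib.util.find_spec(root) is None:
                    missing.append(root)
        return missing

    # A hallucinated dependency gets flagged here, even though a spam or
    # quality filter tuned for human text would see nothing wrong.
    snippet = "import totally_real_llm_helpers\nprint('hi')\n"
    print(unresolved_imports(snippet))  # ['totally_real_llm_helpers']

Well-formatted but subtly wrong output needs that kind of semantic check; surface-level quality signals don't help.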


LLM output is uniquely harmful because LLMs trained on LLM output are subject to model collapse

https://www.nature.com/articles/s41586-024-07566-y
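A 1-D intuition pump for the effect (not the paper's setup, just a Gaussian being refit to its own samples each generation):

    import random
    import statistics

    mu, sigma = 0.0, 1.0   # generation 0: the "real" data distribution
    n = 50                  # finite sample drawn per generation

    for gen in range(1, 501):
        samples = [random.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(samples)      # each generation is fit...
        sigma = statistics.stdev(samples)   # ...only to the previous one's output
        if gen % 100 == 0:
            print(f"gen {gen}: sigma = {sigma:.3f}")

Estimation error compounds and sigma tends to drift toward zero, with the tails disappearing first; that is roughly the mechanism the paper describes for models trained on their own output.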


The problem with filtering is that LLMs can generate a few orders of magnitude more slop than humans.


Are the differences between Google Books and LibGen documented anywhere? I believe most models outside of Google are trained on the latter.


The number of folks that have the hardware at home to run it is going to be very low, and the risk to companies of leaking it is going to make that unlikely IMHO.


I think home users would be the least of their concerns.


It only takes one company to leak it


Or one company to get hacked and the hackers leak it


Realistically the only people able to run models of this size are large enterprises.

Those enterprises won’t take the risk of being sued for using a model without proper permission.


I don't know – if there's still dumb money being thrown towards AI in non-tech and non-privacy-heavy industries, especially ones traditionally targeted by ransomware, there'll always be a chance of datasets getting leaked. I'm thinking retail and consumer product-oriented companies. (There's always non-Western governments without strong security orgs, too.)


Nations.


Or large government-sponsored entities like Mossad. Air gapping won't protect against spying. Good luck trying to sue them.


They can get "hacked" and whoops.


> I wonder how long before someone cracks SEV-SNP

https://bughunters.google.com/blog/5424842357473280/zen-and-...


I'd expect watermarked model weights plus a lot of liability to disincentivize leaking the model.
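A rough sketch of what a weight watermark could look like (spread-spectrum style; an illustration, not any lab's actual scheme): add a tiny keyed pseudorandom pattern to the weights, then test for it by correlation.

    import numpy as np

    def watermark(weights, key, eps=0.01):
        """Embed a keyed pseudorandom pattern into the weights."""
        pattern = np.random.default_rng(key).standard_normal(weights.shape)
        return weights + eps * pattern

    def detect(weights, key):
        """Correlation with the keyed pattern: ~0 if unmarked, ~eps if marked."""
        pattern = np.random.default_rng(key).standard_normal(weights.shape)
        return float(np.dot(weights.ravel(), pattern.ravel()) / weights.size)

    w = np.random.default_rng(0).standard_normal(1_000_000)  # stand-in "weights"
    print(detect(w, key=42))                      # noise, roughly 0.000
    print(detect(watermark(w, key=42), key=42))   # ~0.01, well above the noise floor

A leaker could fine-tune or add noise to try to wash the pattern out, so this buys attribution for liability purposes rather than prevention.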



