
or like a curve of model complexity versus results showing it asymptotically approaching some ceiling.

actually there was a great paper from microsoft research from like 2001 on spam filtering where they demonstrated that the model complexity necessary for spam filtering went down as the size of the data set went up. That paper, which i can't seem to find now, had a big impact on me as a researcher because it so clearly demonstrated that small data is usually bad data, and sophisticated models are sometimes solving small-data-set problems instead of problems with the data itself.
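The effect described above can be sketched with a toy experiment (this is not the paper's setup; the data, models, and sizes here are made up for illustration): fit a simple and a more complex classifier at increasing training-set sizes and watch the gap between them shrink.

```python
# Hypothetical sketch: simple vs. complex model as training data grows.
# Synthetic data stands in for a real spam corpus.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=5000, random_state=0)

for n in (100, 1000, 10000):
    # "Simple" model: Naive Bayes; "complex" model: a random forest.
    simple = GaussianNB().fit(X_train[:n], y_train[:n])
    complex_ = RandomForestClassifier(
        n_estimators=100, random_state=0).fit(X_train[:n], y_train[:n])
    print(n,
          round(simple.score(X_test, y_test), 3),
          round(complex_.score(X_test, y_test), 3))
```

With small n the forest's extra capacity matters most; as n grows, the simpler model closes much of the gap.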

of course this paper came out the year friedman published his gradient boosting paper, and i think random forests were also only recently published then (i think there is a paper from 1996 on RF, and Breiman's two cultures paper came out that year, where he discusses RF i believe), and this is a decade before gpu based neural networks. So times are different now. But actually i think the big difference is that these days i probably ask chatgpt to write the boilerplate code for a gradient boosted model that takes data out of a relational database instead of writing it myself.
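That kind of boilerplate might look something like this (a minimal sketch, not anyone's actual code; the table, columns, and toy data are invented, and an in-memory sqlite database stands in for a real one):

```python
# Sketch: pull rows out of a relational database, fit a gradient boosted model.
import sqlite3
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy in-memory database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (f1 REAL, f2 REAL, label INTEGER)")
rng = np.random.default_rng(0)
rows = [(float(a), float(b), int(a + b > 1.0))
        for a, b in rng.random((500, 2))]
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)

# Query the features and labels back out into arrays.
data = conn.execute("SELECT f1, f2, label FROM samples").fetchall()
X = np.array([(r[0], r[1]) for r in data])
y = np.array([r[2] for r in data])

model = GradientBoostingClassifier().fit(X, y)
print(round(model.score(X, y), 3))
```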



I believe this is the paper which you are referring to: https://aclanthology.org/P01-1005.pdf

("Scaling to Very Very Large Corpora for Natural Language Disambiguation" by Michele Banko and Eric Brill, Microsoft Research, 2001)


omg i have been searching forever for this. THANK YOU.


> model complexity necessary for spam filtering went down as the size of the data set went up

My naive conclusion is that this means there are still massive gains to be had, since, for example, something like ChatGPT is just text, and the phrase "a picture is worth a thousand words" seems incredibly accurate, from my perspective. There's an incredible amount of non-text data out there still. Especially technical data.

Is there any merit to this belief?


GPT-4 is actually multi-modal, not text-only. ChatGPT does not yet expose image submission to it, but images were already part of how the model was trained.


Yes. One of the frontiers of current research seems to be multi-modal models.


> "a picture is worth a thousand words"

and it might be the opposite for the GPT models, actually. It's just easier for humans to grasp a bunch of knowledge at a single glance, but usually most useful information can be represented with just a bunch of words, and machines can scan through millions of words in an instant.


Excellent points in your post. You wrote:

    There's an incredible amount of non-text data out there still. Especially technical data.
"Especially technical data." What does this part mean? Initially, I thought you meant things like images and video, but now I am confused.


Schematics (of any sort), block diagrams, general spatial awareness (including anything related to puzzle pieces/packing, like circuit layout), most physics problems involving force diagrams, anything mechanical, etc. The text representation of any of these is ludicrously more complex than simple images.

If you sit down with someone who works in one of these fields, you'll quickly see the limitations. The model will try to represent the concepts as text, with ascii art or some "attempt" at an ascii file format that can be used to draw, and its "reasoning" about these things is much more limited.

I think most people interacting with GPT are in a text-only (and especially programming) bubble.


They might mean numerical data like scientific simulation data, sensor data, polling data, statistics, etc.



