
or like a curve of model complexity versus results showing it asymptotically approaching some ceiling.

actually there was a great paper from microsoft research from like 2001 on spam filtering where they demonstrated that the model complexity necessary for spam filtering went down as the size of the data set went up. That paper, which i can't seem to find now, had a big impact on me as a researcher because it so clearly demonstrated that small data is usually bad data, and sophisticated models are sometimes solving small-data-set problems instead of problems with the data itself.
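The effect described above can be sketched with a toy experiment (this is not the paper's setup; the data, models, and sizes here are made up for illustration): fit a simple and a more complex classifier at increasing training-set sizes and watch the gap between them shrink.

```python
# Hypothetical sketch: simple vs. complex model as training data grows.
# Synthetic data stands in for a real spam corpus.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=5000, random_state=0)

for n in (100, 1000, 10000):
    # "Simple" model: Naive Bayes; "complex" model: a random forest.
    simple = GaussianNB().fit(X_train[:n], y_train[:n])
    complex_ = RandomForestClassifier(
        n_estimators=100, random_state=0).fit(X_train[:n], y_train[:n])
    print(n,
          round(simple.score(X_test, y_test), 3),
          round(complex_.score(X_test, y_test), 3))
```

With small n the forest's extra capacity matters most; as n grows, the simpler model closes much of the gap.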

of course this paper came out the year friedman published his gradient boosting paper, and i think random forests were also only recently published then (i think there is a paper from 1996 on RF, and Breiman's two cultures paper came out that year, where he discusses RF i believe), and this is a decade before gpu based neural networks. So times are different now. But actually i think the big difference is that these days i probably ask chatgpt to write the boilerplate code for a gradient boosted model that takes data out of a relational database instead of writing it myself.
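That kind of boilerplate might look something like this (a minimal sketch, not anyone's actual code; the table, columns, and toy data are invented, and an in-memory sqlite database stands in for a real one):

```python
# Sketch: pull rows out of a relational database, fit a gradient boosted model.
import sqlite3
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy in-memory database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (f1 REAL, f2 REAL, label INTEGER)")
rng = np.random.default_rng(0)
rows = [(float(a), float(b), int(a + b > 1.0))
        for a, b in rng.random((500, 2))]
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)

# Query the features and labels back out into arrays.
data = conn.execute("SELECT f1, f2, label FROM samples").fetchall()
X = np.array([(r[0], r[1]) for r in data])
y = np.array([r[2] for r in data])

model = GradientBoostingClassifier().fit(X, y)
print(round(model.score(X, y), 3))
```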



I believe this is the paper which you are referring to: https://aclanthology.org/P01-1005.pdf

("Scaling to Very Very Large Corpora for Natural Language Disambiguation" by Michele Banko and Eric Brill, Microsoft Research, 2001)


omg i have been searching forever for this. THANK YOU.


> model complexity necessary for spam filtering went down as the size of the data set went up

My naive conclusion is that this means there are still massive gains to be had, since, for example, something like ChatGPT is just text, and the phrase "a picture is worth a thousand words" seems incredibly accurate, from my perspective. There's an incredible amount of non-text data out there still. Especially technical data.

Is there any merit to this belief?


GPT-4 is actually multi-modal, not text-only. ChatGPT does not yet expose image submission to it, but images were already part of how the model was trained.


Yes. One of the frontiers of current research seems to be multi-modal models.


> "a picture is worth a thousand words"

and it might be the opposite for the GPT models, actually. It's just easier for humans to grasp a bunch of knowledge at a single glance, but usually most useful information can be represented with just a bunch of words, and machines can scan through millions of words in an instant.


Excellent points in your post. You wrote:

    There's an incredible amount of non-text data out there still. Especially technical data.
"Especially technical data." What does this part mean? Initially, I thought you meant things like images and video, but now I am confused.


Schematics (of any sort), block diagrams, general spatial awareness (including anything related to puzzle pieces/packing, like circuit layout), most physics problems involving force diagrams, anything mechanical, etc. The text representation of any of these is ludicrously more complex than simple images.

If you sit down with someone who works in one of these fields, you'll quickly see the limitations. The model will try to represent the concepts as text, with ascii art or some "attempt" at an ascii file format that can be used to draw, and its "reasoning" about these things is much more limited.

I think most people interacting with GPT are in a text-only (and especially programming) bubble.


They might mean numerical data like scientific simulation data, sensor data, polling data, statistics, etc.



