A Cookbook of Self-Supervised Learning

cs702 · on April 25, 2023

Nice. It looks like it will be useful... although best practices are likely going to continue to evolve. The biggest question in my mind is whether we can come up, at some point in the future, with a kind of universal self-supervised learning objective that works well in practice for any task.

nmaley · on April 25, 2023

The NFL theorem means nothing if all the learning tasks have a common underlying structure. In the real world, they do. The laws of physics and chemistry create emergent causal relationships. Any SSL learning algorithm that learns to exploit causal relationships will consistently perform well over a variety of real world tasks.

cs702 · on April 26, 2023

Yes, I agree: In practice, all learning tasks have a common underlying structure because our physical reality can behave only in certain ways, determined by the symmetries in our universe. However, as far as I know, no one has formally specified or proved this commonality in learning tasks.

The most persuasive argument I've seen for this view -- our view -- is laid out in this paper by Lin, Tegmark, and Rolnick, written between 2016 and 2017 (although it feels like it was written a century ago!): https://arxiv.org/abs/1608.08225

medo-bear · on April 28, 2023

you say this as if it is trivial. for example, optimization of the lagrangian function is a recurring theme in physics and is pretty much a model of physical causation. once you have the lagrangian you pretty much have the domain knowledge. if you can come up with an ssl that discovers relevant lagrangians from observing the real world, that would be HUGE. however nfl might still stand in your way ;)

medo-bear · on April 25, 2023

no such thing as free lunch - https://en.m.wikipedia.org/wiki/No_free_lunch_theorem

tbalsam · on April 25, 2023

The most common statement I make is about the misapplication of the NFL.

This is not an appropriate use of the NFL.

The original commenter is asking about a general solution that will work well for all problems presented at it. The NFL details fine-grained tradeoffs in _ideal solutions_ for specific traits in certain areas. Additionally, we're not operating in an unbiased space here, so trivially there is a best estimator without bias. So by that very fact alone the NFL is invalidated in terms of the method it is being applied in here.

This is something I feel frustrated seeing a lot of younger people entering the field do (not saying you are young or new, it's just the trend). If this was the case in this kind of a way we never would have gotten beyond MLPs.

Yes, indeed there is in fact a set of general solutions that works roughly well over everything and is biased towards the situations where people will need it the most.

No, it will likely not be the technically best performing solution.

What the OP is looking for I believe is convenience, stability, and reliably good performance.

Hence, the NFL is not applicable or relevant to this matter in the way it is being used.

medo-bear · on April 26, 2023

> The original commenter is asking about a general solution that will work well for all problems presented at it

solution of what? nfl is actually pretty simple. it says that if you have a black box optimization technique then it will perform no better than a random function. a black box function is one that assumes no domain knowledge on the problem it is being applied to. since the op used the term 'any task' im pretty sure it applies to his statement

otherwise, please give me an example of an algorithm that goes against what im saying

tbalsam · on April 26, 2023

The primary poster with a high degree of likelihood means "an reasonable human task".

Unless in the future we need models that make summarized briefs from hyperintelligent dogs that communicate only in a language embedded in the Fourier Domain with an XOR-like mapping and zero redundancy as well as summarize a news article _with exactly equal probability to each other in the same model_, then the NFL theorem does not generally apply. Because there's bias.

If it helps, put another way, the previous poster isn't being literal. They're using vernacular meaning most expected outcomes, i e. Generally whatever minimizes the empirical risk via a log likelihood criterion.

The expected nonliterality of English at all of the wrong times can be quite bizzare.

medo-bear · on April 26, 2023

> Because there's bias

Not as much as you suggest. A lot of people are too dismissive of NFL because they read somewhere that some people use it incorrectly. If I am wrong please provide an example of an algorithm that has success rate greater than 50% on all practical human tasks that have true/false answers

tbalsam · on April 27, 2023

I'm a practicioner with a number of years of experience in the field who has contributed meaningful research to it. I've also spent thousands of hours working with models like this. It's about basic mathematical principles, not something someone read somewhere. That's why a number of people have attempted to explain this in a few different ways here.

> please provide an example of an algorithm that has success rate greater than 50% on all practical human tasks that have true/false answers

Yes, any language model with a true/false output trained for one step with a successful decrease in the loss on the full spectra of text satisfies this criterion. Measuring it will take quite a long while depending upon how low the snr is, but yes, that's how it works.

I do not mean to be condescending but I would really encourage you back to Shannon and some of the statistics underlying this principle. I feel you have something close to the principle but don't feel I can provide the "aha" moment that makes it click. When a number of experienced people in the field are trying to demonstrate a concept to you and explain why a difference in opinion is merited, they could all be wrong (after all -- humanity + statistics), but it does bear taking a solid look at the fundamentals otherwise.

Because if they are all wrong, there are some implications in that as well. Hopefully that helps, I'll probably have to move on from this conversation at this point. Best of luck and care. :thumbsup:

medo-bear · on April 27, 2023

> Yes, any language model with a true/false output trained for one step with a successful decrease in the loss on the full spectra of text satisfies this criterion

I understand that LLMs seem impressive but they still fail on human behaviour that is sufficiently outside of training sample

And to be sure, I agree that we can model human behaviour in various succesful ways. But I think it is wrong to dismiss NFL. Instead the real challange is defining and structuring the problem domain. If you do that successfuly then all talk about NFL becomes irrelevant

tbalsam · on April 27, 2023

Again, you've got to look at the mathematical foundations here. Shannon and frequentist statistics will help in particular. I'd encourage you to go back to the mathematical roots and take a good gander over it. I find it to be a very rewarding journey in and of itself.

medo-bear · on April 28, 2023

I read through few simple explanations that might help you. This seems like a good candidate

https://towardsdatascience.com/what-no-free-lunch-really-mea...

medo-bear · on April 28, 2023

I work in the field

tomrod · on April 25, 2023

We really need to revise this to say "there is no global free lunch."

You can and often do get local free lunches.

medo-bear · on April 25, 2023

the question was about "objective that works well in practice for any task". thats pretty global in my books

karpierz · on April 25, 2023

If you take "any task" to mean literally any conceivable task, then sure.

If you take "any task" to mean "any practical task" or "any task a human would conceivably want to have done", then no free lunch doesn't apply.

tbalsam · on April 25, 2023

Another way of looking at karpierz's comment is through the incompressibility of pure noise at scale.

As soon as some infinitely generated sort of noise is from some subset of possible noise, there is indeed (AFAIK) some kind of an ideal estimator that appropriately compresses that noise source with no bias and less entropy that the full space of possible noise.

I hope this shines an additional alternative light on the topic.

mitthrowaway2 · on April 25, 2023

I deeply suspect that this has more than surface similarities with the observation that you cannot extract work from a uniform heat bath, but you can extract work from a temperature gradient. There's probably not much an intelligence could learn to predict about its future environment living inside the core of the sun, even if it could withstand those temperatures -- just randomness everywhere in every direction. But on Earth, with the sun shining in and the coolness of space behind us, there are a lot of causal relationships that can be compressed to make useful predictions.

tbalsam · on April 26, 2023

The flow of entropy is a marvelous, strange, bizarre, and beautiful thing.

nl · on April 26, 2023

The "in practice" thing is pretty important.

In the no-free-lunch version of the world memory is infinite, all mathematical operations are the same performance wise and things the size of your floating point numbers are ignored.

But "in practice" these things matter a lot.

For practical purposes the no-free-lunch theorem is a distraction, or at best something to encourage people to try alternate approaches.

tbalsam · on April 26, 2023

A bumpy landscape indeedy.