Why Deep Learning Works II: the Renormalization Group (charlesmartin14.wordpress.com)
120 points by miket on July 5, 2015 | 33 comments


Note: This is about unsupervised learning, and mostly about RBMs/DBNs. Most of Deep Learning's success is in supervised learning. In the past, RBMs were used for unsupervised pretraining of the model; nowadays, however, everyone uses supervised pretraining.

And the famous DeepMind work (Atari games, etc.) is mostly about reinforcement learning, which is different again.


I will address the supervised vs unsupervised issue in my next post. Here, I believe the analogy would be that when a field is applied to a spin glass, it does not exhibit a glass transition to a non-self-averaging (highly non-convex) ground state.

As to supervised vs reinforcement learning, it's not that different. See how Vowpal Wabbit incorporates both ideas in how the SGD update is formulated.


Well, if I understood correctly, the DeepMind RL implementation is basically making an RL algorithm work with a supervised model.


This has been done since the 90s. The DeepMind paper adds a few more tricks.


Okay, I confess. I really didn't understand most of that post. It sounds really smart, but someone will have to vouch that it's legit, because the picture of Kadanoff cuddling Cookie Monster triggered my baloney detector https://charlesmartin14.files.wordpress.com/2015/04/kadanoff...


I don't mean to be super negative, but because of the general tone early in the article and some sloppy notation, I never finished reading. I think the goal of an article like this should be to give a high-level intuitive explanation of some technical result, rather than to sound smart or complicated.

First, it is a little weird to me to talk about "old-school ML" as learning maps from inputs to hidden features. That seems neither old, nor very representative of the field of Machine Learning as a whole. It's also weird to say that RBMs and other deep learning algorithms are formulated using classical statistical mechanics. Moreover, implying that this scary-sounding formulation is the reason they are interesting seems like an attempt at sounding smart. Typically there are many ways to motivate and derive different algorithms, and it is /useful/ to acknowledge the multiple viewpoints because they often give different insights.

Second, the section about flow maps and fixed points seems to make a mess out of the notation by either being unclear or disagreeing with standard notation. What is meant by the notation "f(X) -> X"? Presumably this means something like f is a function that maps elements of the set X to elements in the set X. More standard notation for this would be something like "f: X -> X". Perhaps it means that the image of the set X under the function f is again the set X. But does that require that f be a surjective function? Confusingly, it also looks like the function f might be required to be the identity function, but given the context this is clearly not the intended interpretation.

When defining the fixed point, it seems that it would be more natural to say that x is a fixed point of f if f(x) = x. That is, x is fixed or unmoved by the function f. It turns out that for contractions (and some other functions, too), the sequence f(x), f(f(x)), f(f(f(x))), and so on is guaranteed to converge to a unique fixed point of f. The notation f^n typically refers to the function f being applied n times, which is not the usage in the article. In the article, f^1, f^2, and so on are all identical copies of the function f. Using the standard notation, the definition of f_infty would be f_infty(x) = lim_{n -> infty} f^n(x). And, in the case of a contraction, the Banach fixed point theorem gives that f_infty is well-defined, and there exists a unique x_fix in X so that f_infty(x) = x_fix for all x in X (i.e., iterating f repeatedly converges to a unique fixed point x_fix of the function f).
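
To make the standard usage concrete, here is a minimal numerical sketch (my own, not from the article): f^n means f applied n times, and for a contraction the iterates converge to the unique fixed point:

    import numpy as np

    # f(x) = cos(x) is a contraction near its fixed point (|f'| < 1 there).
    f = np.cos

    x = 2.0                # arbitrary starting point
    for n in range(100):
        x = f(x)           # after the loop, x is f^100 of the start: f applied 100 times

    print(x)               # ~0.739085, the unique fixed point x_fix
    print(f(x) - x)        # ~0.0, i.e. f(x_fix) = x_fix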

These things do not necessarily mean that the article is uninteresting or uninformative or even technically incorrect. But if the author didn't take the time to make the simple things clear, then I'm not sure that I want to read the rest.

Sorry for the rant.


Well, there is more.

E.g. abbreviating deep belief nets with DBM, which is the commonly used acronym for deep Boltzmann machines. The two are related, but very different. Calling an RBM an encoder is not far-fetched, but there are many differences between autoencoders and RBMs. He eventually claims an RBM minimises reconstruction error, which is just plain wrong and shows that this guy has absolutely no clue what he is writing about.


'Technically' this is correct--the RBM CD algo is not minimizing this function; that's not the point.

It is known that when training an RBM, the reconstruction error decreases but not monotonically; in fact it fluctuates. In the words of Hinton, 'trust it but don't use it'.

http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf (which is cited in the post as well)

So in a global sense, yes, I would say that the RBM does eventually minimize the reconstruction error even though it fluctuates.
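
As a toy sketch of what I mean (my own numpy toy; biases and proper Gibbs sampling are simplified away): the CD-1 update follows the contrastive divergence statistics, while the reconstruction error is only computed on the side to monitor progress:

    import numpy as np

    rng = np.random.default_rng(0)
    n_vis, n_hid, lr = 6, 4, 0.1
    W = 0.01 * rng.standard_normal((n_vis, n_hid))   # weights; biases omitted for brevity

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    v0 = rng.integers(0, 2, size=(32, n_vis)).astype(float)   # toy binary "data"

    for epoch in range(50):
        h0 = sigmoid(v0 @ W)              # CD-1: one (mean-field) Gibbs step from the data
        v1 = sigmoid(h0 @ W.T)            # reconstruction
        h1 = sigmoid(v1 @ W)
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)   # CD update, NOT a gradient of the error
        recon_err = np.mean((v0 - v1) ** 2)           # monitored on the side; it fluctuates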

I can even offer a conjecture here on why the error fluctuates: in a discrete RG flow map, there could be finite-size effects that would give log-periodic fluctuations. This is a stretch--but it is something that could be tested.

I explain this idea here http://charlesmartin14.wordpress.com/2015/01/16/the-bitcoin-...

As to stacking the RBMs to form a DBN--yeah that's the point. "Hinton showed that RBMs can be stacked and trained in a greedy manner to form so-called Deep Belief Networks (DBN)" http://deeplearning.net/tutorial/DBN.html
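
And a toy sketch of the greedy layer-wise recipe (mine, with the same simplifications as above): train an RBM, feed the data through it, and train the next RBM on the resulting hidden activities:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_rbm(data, n_hid, lr=0.1, epochs=50, seed=0):
        # Same toy CD-1 update as above, packaged as a function.
        rng = np.random.default_rng(seed)
        W = 0.01 * rng.standard_normal((data.shape[1], n_hid))
        for _ in range(epochs):
            h0 = sigmoid(data @ W)
            v1 = sigmoid(h0 @ W.T)
            h1 = sigmoid(v1 @ W)
            W += lr * (data.T @ h0 - v1.T @ h1) / len(data)
        return W

    rng = np.random.default_rng(1)
    v = rng.integers(0, 2, size=(32, 8)).astype(float)   # toy binary data

    W1 = train_rbm(v, n_hid=6)      # greedily train layer 1 on the raw data
    h1 = sigmoid(v @ W1)            # propagate the data up through layer 1...
    W2 = train_rbm(h1, n_hid=4)     # ...and train layer 2 on those activities: a (toy) DBN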


Thanks for the comments. It is helpful to have others read the blog and make suggestions and ask for clarifications. The motivations here are to (1) summarize and clarify the key ideas of the physics paper that observed this connection, and (2) set the stage for my next blog, where I try to connect Deep Learning to my idea around Spin Funnels.

I will review the comments and think about how to update the blog to make it clearer.


I got that feeling too, but I don't understand the subject well enough.

One key concept here is that you want your learning system to have some "free energy" metric which decreases as you run the training set through again and again. This ensures some kind of convergence, rather than just thrashing around. Of course, the other problem is getting stuck at a local minimum, which is why you don't want to converge too fast. (I'm not up to speed on that; when I studied AI years ago, everybody was getting stuck at local minima. Now, that's less of a problem, and some of the old algorithms, run with slow learning rates over and over, get stuck less.)
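
As a made-up one-dimensional illustration of the local-minimum problem (not from the article): plain gradient descent always decreases the energy, but which minimum it finds depends entirely on where it starts:

    import numpy as np

    # A toy non-convex energy with a shallow minimum near x ~ 0.61 and a
    # deeper one near x ~ -0.77.
    def energy(x):
        return x**4 - x**2 + 0.3 * x

    def grad(x):
        return 4 * x**3 - 2 * x + 0.3

    for x0 in (1.2, -1.2):
        x = x0
        for _ in range(2000):
            x -= 0.01 * grad(x)    # each step decreases energy(x)
        print(x0, "->", round(x, 3), "energy:", round(energy(x), 3))
    # Started at 1.2 it settles into the shallow local minimum;
    # started at -1.2 it finds the deeper one. Same rule, different basin.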

The connection to physics seems to come from energy arguments: the free energy of a dissipative system must decrease. For complex systems where energy is sloshing around from one form to another (flow, turbulence, gas compression, heat, maybe combustion), an energy-based analysis is a way of looking at the problem in a simpler way. That is apparently a useful way to look at deep learning systems. I think this is where the author is coming from.

Whether the connection to physical systems is useful, or merely an interesting analogy, isn't clear from the paper. It may be too soon to tell.


I am happy to answer questions. You can ask here or directly on the blog.


I try to lighten things up because the math may be unfamiliar.


I will read the post later, but I can tell you right now that I'm a physicist and when I was first introduced to deep learning my first thought, especially in the context of visual recognition, was "this smells of renormalization group".


Do you happen to know any good introductory texts to the renormalization group?


The standard one in use at Santa Barbara is http://www.amazon.com/Lectures-Transitions-Renormalization-F...

Some others I have sitting beside me at my desk right now are http://www.amazon.com/Theory-Critical-Phenomena-Introduction..., http://www.amazon.com/Renormalization-Introduction-Operator-... and http://www.amazon.com/Renormalization-Methods-Guide-For-Begi...

The most modern treatment is probably http://www.amazon.com/Scaling-Renormalization-Statistical-Ph...

(For God's sake, stay away from anything written by Zinn-Justin, Itzykson, or Zuber unless you know what you're doing)


I think this is similar to the scaling theory of the stock market, which uses scale-invariant geometric objects to represent stock market energy levels.

http://greyenlightenment.com/sornette-vs-taleb-debate/

Sornette’s 2013 TED video, in which he predicts an imminent stock market crash due to some ‘power law’, is also wrong because two years later the stock market has continued to rally.

You write on your blog:

These kinds of crashes are not caused by external events or bad players–they are endemic to all markets and result from the cooperative actions of all participants.

Easier said than done. I don't think the log-periodic theory is a holy grail for making money in the market. There are too many instances where it has failed, but you cherry-picked a single example with bitcoin where it could have worked.


It is easier to apply the Sornette theory to antibubbles.

Bitcoin seemed like a great example.

I gotta go back and see how well the predictions actually worked.


One way to think of it is:

There are connections between Deep Learning and Theoretical Physics because there are (even stronger) connections between Information Theory and Statistical Mechanics.


I don't like the assertion at all because so many techniques are held to be "deep learning" and because even when specific techniques are built on an analogy of this sort (think Simulated Annealing and Genetic Algorithms) they do not work "because" they are "like" the physical processes that served as an inspiration.

Names are useful, but only as an aid to thinking. Does this help us think about these techniques?


Is the "group" in renormalization group the same "group" in group theory?


Almost. The name "group" in renormalization group was inspired by the groups in group theory, but in reality the renormalization "group" isn't a group but a semi-group. A semi-group satisfies the same axioms as a group except for the existence of inverses. And that is the mathematical reason why you can describe big things in terms of smaller things, but you can't describe small things in terms of bigger things: the renormalization (semi-)group flows from the ultraviolet to the infrared, but not the other way around :-)
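
A concrete toy example of the missing inverse (my own sketch, using a crude majority-rule block-spin step):

    import numpy as np

    rng = np.random.default_rng(0)
    spins = rng.choice([-1, 1], size=(8, 8))   # a toy Ising-like configuration

    def coarse_grain(s):
        # Block-spin step: majority rule over 2x2 blocks (ties broken toward +1).
        blocks = s.reshape(s.shape[0] // 2, 2, s.shape[1] // 2, 2).sum(axis=(1, 3))
        return np.where(blocks >= 0, 1, -1)

    once = coarse_grain(spins)     # 8x8 -> 4x4: one RG step
    twice = coarse_grain(once)     # 4x4 -> 2x2: composing steps works fine
    # But there is no inverse: many distinct 8x8 configurations map to the same
    # 4x4 one, so the fine-grained information is gone. Hence semi-group, not group.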


Yes, it is the same as the "group" in group theory. However, I think the name is a misnomer, since I believe it is technically a semigroup: the binary operator is not invertible.


Well, if, say, the elements of the permutation group are permutation functions/mappings and the operator is function composition, then in a renormalization group the elements would be "renormalization functions"? And the operator would be function composition?


Kinda. The 'group' refers to a group of scale and/or conformal transformations. In the context of Deep Learning, the 'scale' transform is akin to adding layers.


oneloop, your comment looks helpful but you are hell-banned, you may want to email HN to be reinstated.

--

Edit - he's been reinstated.


His comment below about a semigroup is absolutely correct, if that helps his case any.

Qualifications: I studied RG in classes at Santa Barbara


what does hell-banned mean?


Comments are marked as dead (they can only be seen by people with 'showdead' set in their profile) but appear visible to the user, so they may not realise why nobody responds to them or upvotes their comments. Normally a punishment for bad behaviour or poor commenting, but it seems inappropriate in this case.


I think a key difference is that the physics renormalization structures use fairly regular or uniform weights, while deep learning plays with the weights a lot. So there are going to be pretty big differences in behavior.


And here I thought the renormalization group had no application outside high energy physics and condensed matter. Maybe I should have stuck with HEP after all.


No MathJax. I am disappoint.


It always depresses me when I read anything with math formulas and esoteric terms, a constant reminder of my lifelong incompetence with math and university calculus courses.


I expect very few people learned from this post; I didn't, and I kinda like math. This was the sort of math writing that makes sense only given nearly the same background as the author. (Someone above posted specific complaints about the unclear notation.)



