You can get a lot of the way (just not to `exp((E_old - E_new) / Temp)`) by explaining the origin of the algorithm:
People who work with metals have a process for making metals more supple, called "annealing". In annealing, you heat the metal up almost to the point where it loses its shape, then let it cool slowly. In the atomic picture, this chunk of metal has knots of stress and tiny mis-aligned crystals, but if we let the atoms jiggle, the knots can work themselves out and the crystals can grow, because atoms want to align with their neighbors. It works the other way too, though: if the atoms are jittering randomly, then at any given moment they are in a somewhat chaotic state, so if you suddenly stop that jitter ("lowering the temperature too fast"), you just freeze in a random state. So there is a sweet spot: you want just enough jitter to undo knots, but not so much jitter that you create new ones. By slowly reducing the temperature, you pass some special points where the neighbors have about the same impact on the crystal as the jitter, and as you slowly pass those, the jitter starts undoing the knots for you and then growing the crystals for you. You still often don't get perfection, but you can get closer.
What does this have to do with optimization? Physicists like to think of this "aligning with neighbors" process as minimizing some "interaction energy". But you can minimize whatever function you want with the same approach: (1) take any state you want, (2) let it randomly jitter to neighboring states, (3) accept the new state when it makes your function smaller, but also (with some probability p) when it makes it larger, and (4) slowly lower p as you allow the system to randomly jitter around.
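To make those four steps concrete, here is a minimal Python sketch of that loop (the `energy` and `random_neighbor` functions are problem-specific placeholders you would supply yourself; the acceptance rule uses the Boltzmann form mentioned at the top):

```python
import math
import random

def simulated_annealing(initial_state, energy, random_neighbor,
                        t_start=1.0, t_end=1e-3, cooling=0.995):
    """Generic simulated annealing loop (illustrative sketch).

    energy(state) is the function to minimize; random_neighbor(state)
    proposes a jittered state. Both are problem-specific placeholders.
    """
    state, e = initial_state, energy(initial_state)
    best_state, best_e = state, e
    temp = t_start
    while temp > t_end:
        candidate = random_neighbor(state)
        e_new = energy(candidate)
        # Always accept improvements; accept worse states with a
        # probability exp((E_old - E_new) / Temp) that shrinks as
        # the temperature drops.
        if e_new < e or random.random() < math.exp((e - e_new) / temp):
            state, e = candidate, e_new
            if e < best_e:
                best_state, best_e = state, e
        temp *= cooling  # slowly lower the temperature, and with it p
    return best_state, best_e
```

With, say, `energy=lambda x: (x - 3) ** 2` and a neighbor that adds a little Gaussian noise, this should settle near x = 3.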
You can also get a good idea of what's going on with a graph:
    \    _b_    /        _
     \__/   \   /       _|-- B
             \_/        _|-- A
Let's explain simulated annealing by looking at the barrier above (b), which has two different energy costs for jumping over it: B from the left and A + B from the right.
Here we understand the jitter as giving the state a typical energy E which lets it jump up the graph, though it still tends to fall into the holes. When E > A + B it will jump from the deeper valley on the right into the left one and back. But as we get into the range B < E < A + B, it's possible to jump from left to right, but not easily from right to left. The more we allow the system to jitter in this region as we slowly reduce E, the more likely the state ends up on the right somewhere, and from that point on the algorithm mirrors your normal gradient descent algorithm.
> You can get a lot of the way (just not to `exp((E_old - E_new) / Temp)`) by explaining the origin of the algorithm
The exp is from the so-called Boltzmann factor [1] which (in somewhat simplified terms) describes the likelihood that a physical system in equilibrium at temperature T will transition from a state with energy E_old to one with energy E_new. So sure, you can derive the whole algorithm from classic "annealing" and physics alone.
I mean, I know where it comes from; I just don't think that there's a nice hand-wavy way to get people to understand what's going on there: you really need to start with how uncertainties multiply across a system, define entropy as log(W) in the NVE ensemble, discover temperature hiding there too, then switch to the NVT ensemble (i.e. connect to an infinite reservoir of energy at a certain temperature) -- then you can get at Boltzmann factors.
Even then, as was pointed out to me recently, it's not clear that the "always accept a lower energy, accept a higher energy proportional to the Boltzmann factor" yields a steady-state Boltzmann distribution -- especially because that would imply that the result is independent of the number of configurations with a given energy (the "density of states"), which seems surprising. If that doesn't hold then it's either a convenient approximation or a bit of magical thinking...
It's very hard to have 'proofs' about convergence of global optimization algorithms, because to do so you have to make assumptions about the smoothness of the fitness landscape, the types of minima, and so on. This means that when someone is interested in publishing a new optimization algorithm, they just cherry-pick a subset of problems and show that their algorithm works better on those, when used with a given set of hyper-parameters.
In theory, the no-free-lunch theorem means that there is no optimal algorithm for an arbitrary fitness landscape, but in practice almost every optimization problem of actual interest is somewhat smooth.
Anyway, here is an interesting set of benchmarks by Andrea Gavana about different optimization algorithms: http://infinity77.net/global_optimization/ampgo.html. In my experience, as long as the fitness landscape is relatively smooth, multi-start gradient based methods tend to perform better, assuming you can get a somewhat decent approximation of the gradient (with AD, it's not too difficult).
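As a rough sketch of that multi-start idea (not Gavana's benchmark code; the test function and parameters are just examples), using scipy's gradient-based local optimizer from many random starting points:

```python
import numpy as np
from scipy.optimize import minimize

def rastrigin(x):
    # Standard multimodal test function: many local minima, global minimum at 0.
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

def multistart(f, dim, n_starts=50, low=-5.12, high=5.12, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(low, high, size=dim)
        # BFGS approximates the gradient by finite differences here;
        # with AD you would pass the exact gradient via jac=.
        result = minimize(f, x0, method="BFGS")
        if best is None or result.fun < best.fun:
            best = result
    return best

best = multistart(rastrigin, dim=5)
print(best.x, best.fun)
```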
For example, the first sentence, "Simulated annealing is a method for finding a good (not necessarily perfect) solution", is incorrect in theory, while correct in practice. I feel that it is this dichotomy that brings the most insight into the strengths and weaknesses of the approach. Further, the post doesn't give any good intuition on why it works, or how it works, or how it compares to alternatives... everything you would want to know. It also gets other details not quite right, like the origin of the Gibbs measure (the name "simulated annealing" comes from analogy to the annealing process in metallurgy, but the Gibbs formulation comes from probability theory, and the application here via the Ising model and ferromagnetism). This isn't very important to a practical intro, I guess, but it makes it harder to find the right places to read deeper if you wish to.
It's fairly easy to prove that simulated annealing done "properly" will converge to a uniform distribution over all global minima. The "properly" here involves a logarithmic cooling schedule and a little thought should convince you that this is in practice not going to help you any more than exhaustive search will.
It's very difficult to prove useful properties about convergence or the resulting distribution under faster cooling: therein lies the rub.
I suspect a big part of the popularity of the approach is that it is extremely easy to implement, but unfortunately naive application often doesn't work very well on complicated energy state spaces.
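For reference, the two schedules being contrasted might look like this (constants are arbitrary; the logarithmic one is what the convergence proofs require, the geometric one is what people actually run):

```python
import math

def logarithmic_temperature(k, c=1.0):
    # Schedule required by the classical convergence results; it decays
    # so slowly that in practice it is close to exhaustive search.
    return c / math.log(k + 2)

def geometric_temperature(k, t0=1.0, alpha=0.99):
    # Typical practical schedule: fast cooling, no convergence guarantee.
    return t0 * alpha ** k
```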
> It's fairly easy to prove that simulated annealing done "properly" will converge to a uniform distribution over all global minima. The "properly" here involves a logarithmic cooling schedule and a little thought should convince you that this is in practice not going to help you any more than exhaustive search will.
Are you sure this is proven? What if ergodicity is broken?
Too late to edit the parent, but you are right to call me out on this, thanks! Clearly, without some ergodicity constraint you cannot have a uniform distribution.
I was sloppy with terminology, sorry. The original approach to this requires weak and strong ergodicity conditions on an inhomogeneous Markov chain (but not stationarity of the chain, i.e. you don't have to run it forever). This was worked out in the mid 80s by Gidas and Mitra, if I recall correctly, and there have been several interesting updates since then. Other probabilistic arguments have been made as well, after that.
The problem then is that proving ergodicity is extremely hard for anything that is not a Markov chain to begin with.
IIRC ergodicity of hard spheres is not yet proved in full generality (n spheres, d dimensions), and that's basically the simplest physical system there is!
I nearly snorted my coffee through my nose at this:
"Simulated annealing's strength is that it
avoids getting caught at local maxima ..."
It's true that SA generally has better local-maximum avoiding properties than simplistic hill-climbing, but I've generally found that shotgun stochastic ballistic hill-climbing works at least as well.
> shotgun stochastic ballistic hill-climbing works at least as well
This is the method that I'd like a lot of these stochastic optimization algorithms compared to, especially genetic algorithms.
At least brute force sampling of the entire space is guaranteed to find the global extremum at some point, an argument I've heard for random sampling & minimising vs. GAs for random structure searching in materials science (a very high dimensional problem). It doesn't encode any assumptions about the surface, so it can't be optimal, but that can be a benefit if you want to find something surprising.
>> shotgun stochastic ballistic hill-climbing works at least as well
> This is the method that I'd like a lot of these stochastic optimization algorithms compared to, especially genetic algorithms.
> At least brute force sampling of the entire space ...
Not sure you meant it to do so, but this sounds like you think that "shotgun stochastic ballistic hill-climbing" is a brute force sampling of the entire space. It's not, it's a stochastic optimization algorithm. Brute force sampling of the entire space, or even reasonable portions of it, is completely infeasible for the problems I deal with, so I'm not sure what you're saying here. You might like to explain a little more.
Now that I've read your other comment with the terminology, I think "shotgun hill-climbing" is the appropriate phrase for the sort of materials science thing I'm talking about - you start atoms in random configurations, with minimal bias, and then minimise using the forces (available analytically) on those atoms with gradient descent/whatever optimiser (usually BFGS) until you find local minima. There's no momentum involved, and the gradient is available, but you still start in random positions and minimise to local minima. That sort of method is in competition with other structure-finding algorithms that use GAs, SA or some other fudged hybrids.
My reference to "brute force" was that, as far as I understand, in the limit of lots of attempts, "shotgun" will approach "brute force" -- imagine a really nasty landscape with lots of tiny basins. I understand "stochastic ballistic" modifies things slightly.
Yes to all - thanks for the clarification. The only comment is that in high-dimensional cases shotgun might theoretically approach brute force, but it never does so in practice because there are just too many places to start.
Discrete gradients are always available - you can step in random directions and re-evaluate giving an approximation. If your space is smooth(ish) (and usually it's "smooth enough") then you can use the local approximations.
Yes, sometimes it's not possible, but usually local random search gives enough to use the concept of "gradient" to help. Mostly I find that people doing the optimisation/search don't know enough about the techniques they're using to adapt them and transform their problem to get the best out of things.
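A sketch of that idea, estimating a gradient purely from function evaluations along random directions (the names and step size are illustrative):

```python
import numpy as np

def sampled_gradient(f, x, n_directions=20, eps=1e-3, rng=None):
    """Approximate the gradient of f at x from finite differences along
    random unit directions; useful when only evaluations are available."""
    rng = rng or np.random.default_rng()
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_directions):
        d = rng.standard_normal(x.shape)
        d /= np.linalg.norm(d)
        slope = (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
        total += slope * d
    # Random unit directions satisfy E[d d^T] = I/n, so rescaling by
    # n/n_directions makes this an (approximately) unbiased gradient estimate.
    return total * (len(x) / n_directions)
```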
Yes, but it still makes no sense to compare GA performance with a gradient-based method on the problem GP is talking about, because an explicit gradient is available.
It depends on the problem you're working on and the number of variables you need to optimize. If you only have <20 variables on a nice smooth surface then yeah, a shotgun approach is probably going to work; then again, pretty much anything is going to work on a problem of that size.
On the other hand, if you're working with >100 variables on a highly diverging surface you're going to want an iterative approach. In particular you need something that is history dependent and avoids regions that have been previously searched. SA is great because it's simple, iterative, insanely stable, has great breadth and depth searching properties and is easily modifiable.
In comparison, approaches like GA might be faster to converge in some cases, but because they're more complex and system dependent it's going to take longer to fine-tune them to your particular problem.
> On the other hand if you're working with >100 variables on a highly diverging surface you're going to want an iterative approach.
I usually am, and I do - you seem to be misunderstanding the use of the word "shotgun" when accompanied by all the other words. The idea is to start in lots of places, randomly chosen, well spread, and then start climbing from all of them. As you do so you gain velocity - perhaps better thought of as minimising an error rather than maximising a fitness or "profit" function. As you drop down the surface you gain speed, changing direction according to a random sampling of the local gradients. Keep doing this and you tend to climb out of the well again, but if you then bleed off the energy you end up stuck in a deep well.
Cross-comparing the behaviours of the different starting points lets you think globally about some of the structure, but that usually only helps if it's got global structure. Usually it's better just to start with lots of points and look at the distribution of error of where they end up.
* Shotgun - start in lots of places and run simultaneously
* Ballistic - have your location include a velocity that changes as you move over the surface
* Stochastic - don't try to analyse your local surroundings deterministically - with 100s or thousands of dimensions you can't, so you sample projections of the local gradients.
* Hill-climbing - or hill descending, if you think of it as minimising error.
In general, I find it faster than either of SA or GA.
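I don't have the commenter's code, but here is one way those four ingredients could fit together, as an illustrative sketch (the objective `f`, the search box, and all parameters are placeholders; the gradient is sampled rather than exact, and the damping grows over time to "bleed off" the energy):

```python
import numpy as np

def shotgun_ballistic_descent(f, dim, n_starts=32, n_steps=500, step=0.05,
                              n_dirs=8, eps=1e-3, low=-5.0, high=5.0, seed=0):
    """Shotgun stochastic ballistic hill-descent, sketched: many random
    starts, each a particle with velocity, driven by a sampled gradient,
    with damping that increases so each particle eventually settles."""
    rng = np.random.default_rng(seed)
    best_x, best_f = None, np.inf
    for _ in range(n_starts):                      # shotgun: many starts
        x = rng.uniform(low, high, size=dim)
        v = np.zeros(dim)
        for k in range(n_steps):
            # stochastic: rough gradient from a few random directions
            g = np.zeros(dim)
            for _ in range(n_dirs):
                d = rng.standard_normal(dim)
                d /= np.linalg.norm(d)
                g += (f(x + eps * d) - f(x - eps * d)) / (2 * eps) * d
            g *= dim / n_dirs
            # ballistic: carry velocity; bleed off energy as time passes
            damping = 0.99 - 0.2 * (k / n_steps)
            v = damping * v - step * g
            x = x + v
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f
```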
Do you have a repo with your source or an example? In my experience conjugate gradient, Powell's method or any other local optimization method is essentially useless with >100 variables or so. Optimizing anything with thousands of variables is a feat; I'd be interested to see more details of how you do it.
Everything I do in this field is commercial-in-confidence, so I can't share any of my existing code like this. I may yet write this up, get clearance, and blog about it, but that won't happen real soon.
I also found Powell's method largely useless in high dimensional spaces, but I think that's because it's semi-static. The search point doesn't gain momentum, so it quickly gets stuck in local minima. If the search point "picks up speed" then it can coast up and back out of minima. Then you bleed off the speed at greater or lesser rates so the search location (which you can think of as a particle) eventually doesn't have enough energy to get out of the minimum it's in. This is very similar to Boltzmann-like behaviour - where you end up depends less on the area (measure) of your catchment, and more on the depth of your local optimum. In this sense it shares some characteristics with SA.
Hence the "ballistic" part of this.
And truth be told, in several hundred dimensions everything is a crap shoot. High-dimensional spheres are spiky, and the terrain you're exploring is effectively a mesa with wells and flagpoles. I've just had more success with SSBHD than with SA, GA, Powell, gradient-descent, or anything else.
"Shotgun" just means "with restarts" (right?) but what do you mean by "ballistic"? "With momentum" maybe? I can't see that term used anywhere in the literature.
Sorry - how does what differ from a "learning rate"? Learning rate as used in what connection? As far as I can see there is no connection between what I've described and anything I will call a learning rate in any of the other methods, so you'll have to be more explicit.
Ah, I meant your notion of velocity/ballistic. I'm thinking of the learning rate parameter you might use to control how quickly a gradient descent algorithm converges for some machine learning algorithm.
As a tangent, I'm really happy to see that Python is slowly but surely becoming the go-to demonstration language of posts like this. Wasn't always this way. Python is so lucidly readable that it serves great as "executable pseudocode". I mean, if you provide pseudocode for an algorithm, might as well write Python - it's mostly as readable and a huge advantage is that you can actually run and test it.
I think the widespread use of scipy/numpy is directly correlated. Python code is generally very easy to port, too, so I'm very happy when it's used for examples even though I don't "know" or write Python.
This method comes from statistical physics. Basically you want to generate a Boltzmann distribution of states as a function of "energy". By defining the Markov chain in a certain way you can achieve this. This is the Metropolis-Hastings algorithm [1].
I believe it's somewhat hard to use this algorithm for optimum finding. You have to tune the temperature, and it works best if you have a good initial temperature and you cool the system down slowly (so you have to choose a reasonable timescale for thermalization). Some physical insight could help with both.
However, this algorithm is also used to actually simulate physical systems at finite temperature where you do not need dynamics. You only simulate the Boltzmann distribution and you measure interesting mean values and correlations. It is mostly used for describing phase transitions.
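To make the distinction concrete, a sketch of plain Metropolis sampling at one fixed temperature, where the output is the whole set of visited states (for averaging observables) rather than a single best state; `energy` and `propose` are problem-specific placeholders:

```python
import math
import random

def metropolis_samples(energy, initial_state, propose, temperature,
                       n_samples=10_000, burn_in=1_000):
    """Draw states with probability proportional to
    exp(-energy(state) / temperature) using the Metropolis rule."""
    state, e = initial_state, energy(initial_state)
    samples = []
    for step in range(n_samples + burn_in):
        candidate = propose(state)
        e_new = energy(candidate)
        # Accept with probability min(1, exp(-(E_new - E_old) / T)).
        if e_new <= e or random.random() < math.exp((e - e_new) / temperature):
            state, e = candidate, e_new
        if step >= burn_in:
            samples.append(state)  # average these later for observables
    return samples
```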
Genetic algorithms and simulated annealing are probably some of my favorite topics which ended up not being applicable in my day job, despite working very heavily in non-linear optimization. Our problems are of the size where gradient-based wins the day, hands down. But GA/SA are still very fun for my side projects.
Simulated annealing can be gradient based - the simulated annealing step refers to the probability of accepting a given minimum as a function of the 'temperature'.
Simulated annealing is gradient based. It's an extension of gradient descent and in the degenerate case (zero temperature) they're the same: it generates random neighbouring states, and if the fitness of that state is better than the current one then it jumps there. That is, it seeks the local minimum.
The key part that simulated annealing adds is that when the temperature is non-zero there's also a chance of jumping to worse states. The probability depends on how much worse the new state is and what the temperature is. It's unlikely to make the jump if it's much worse or if the temperature is too cold. The idea here is to jump out of local minima to (hopefully) find the global minimum.
The temperature is a knob that trades off between exploration (high temperature) and exploitation (low temperature). You start off with a high temperature and jump around a lot, hopefully seeking the general vicinity of the global minimum. As you do so, you gradually decrease the temperature and settle on a locally-optimal solution.
Let's say the fitness landscape looks like this:
    \    B    /
     \A_/ \C /
           \D/
If you are at A and you're seeking the lowest point, then there's a probability that you will jump to B even though it is worse. The hope is that you'll then discover C and finally D, the global minimum. There's no route from A to D with monotonically improving fitness, so gradient descent won't find it. But simulated annealing might.
The local minimization step does not have to be gradient based. You could use something like Nelder-Mead to do the local minimization step, and still use an annealing schedule to decide whether to accept or reject that particular minimum.
There's a whole genre of stochastic optimization algorithms in the same family - parallel tempering in particular has some nice theoretical properties (it strictly dominates simulated annealing, assuming you have computational power to run all the replicas, and has a light guarantee that "eventually" it will converge), and there are adaptive methods to optimize the step sizes, number of replicas, and temperature ranges.
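For flavour, a sketch of the replica-exchange step that distinguishes parallel tempering: several Metropolis chains run at different fixed temperatures and periodically attempt to swap states. The swap acceptance rule below is the standard one; everything else (the placeholders `energy` and `propose`, the sweep counts) is illustrative:

```python
import math
import random

def parallel_tempering(energy, propose, initial_states, temperatures,
                       n_sweeps=5_000, swap_every=10):
    """Minimal replica-exchange sketch: one Metropolis chain per
    temperature, with neighbouring replicas occasionally swapping states."""
    states = list(initial_states)
    energies = [energy(s) for s in states]
    best_state, best_e = min(zip(states, energies), key=lambda p: p[1])
    for sweep in range(n_sweeps):
        # Ordinary Metropolis move within each replica.
        for i, temp in enumerate(temperatures):
            cand = propose(states[i])
            e_new = energy(cand)
            if e_new <= energies[i] or \
               random.random() < math.exp((energies[i] - e_new) / temp):
                states[i], energies[i] = cand, e_new
                if e_new < best_e:
                    best_state, best_e = cand, e_new
        # Occasionally try to swap neighbouring replicas: accepted with
        # probability min(1, exp((1/T_i - 1/T_j) * (E_i - E_j))).
        if sweep % swap_every == 0:
            for i in range(len(temperatures) - 1):
                d_beta = 1.0 / temperatures[i] - 1.0 / temperatures[i + 1]
                d_e = energies[i] - energies[i + 1]
                exponent = d_beta * d_e
                if exponent >= 0 or random.random() < math.exp(exponent):
                    states[i], states[i + 1] = states[i + 1], states[i]
                    energies[i], energies[i + 1] = energies[i + 1], energies[i]
    return best_state, best_e
```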
Aaah, good old simulated annealing. I remember using it for my computational-modelling-based final year project (which led to a workshop paper [1]) where brute forcing wasn't really a viable option.
If only I'd implemented the obvious extension to my graduate dissertation on optimising taxi routes (using a slightly different mechanism, tabu search and squeaky wheel)...
Unfortunately someone has fairly recently done it.
Although she doesn't say so, it appears she is assuming the cost function is normalized (range is 0-1), in this formula: exp((E_old - E_new) / Temp). Is that the case?
Okay, I once looked into simulated annealing, and I have just read all the comments on this thread. So, apparently what I have to contribute is not here yet.
Here I outline non-linear duality theory and Lagrangian relaxation and report a war story where this approach badly beat some efforts with simulated annealing. Basically, non-linear duality theory has some surprising properties commonly just too powerful to ignore.
Basically we're considering a case of optimization or mathematical programming.
So, let R denote the set of real numbers, and suppose positive integers m and n are given. Let R^n be the usual n-dimensional real vector space of n-tuples. To use matrix notation, we regard each n-tuple x in R^n as a matrix with n rows and one column, that is, as n x 1. Similarly for R^m.
We are given a set C, a subset of R^n, and functions
f: R^n --> R
g: R^n --> R^m
We seek x in R^n to solve the non-linear program (NLP)
min z = f(x)
subject to
g(x) <= 0
x in C
where the 0 here is the m x 1 matrix of zeros.
We let the feasible region F, a subset of R^n, be
F = { x | x in C and g(x) <= 0 }
We regard g as m x 1 where, for j = 1, 2, ..., m, component j of g is
g_j: R^n --> R
where g_j, borrowing TeX notation, is g with subscript j.
Okay, given 1 x m l, we let the Lagrangian be
L( x, l ) = f(x) + l g(x)
Then for l >= 0 and x in F we have l g(x) <= 0, so
f(x) >= L( x, l ) >= min_{x' in C} L( x', l )
The dual is to find l >= 0 to maximize
M( l ) = min_{x in C} L( x, l )
Taking the minimum over x in F on the left above, M( l ) <= z, so that M( l ) is a lower bound on z.
In particular (H. Everett), if l >= 0, x attains the minimum in M( l ), and g(x) = 0, then
z >= M( l ) = min_{x' in C} L( x', l ) = f(x) + l g(x) = f(x)
and x in F (since x in C and g(x) = 0), so that f(x) = z and x solves NLP.
A practical special case: essentially, if you spend all the productive resources and do so optimally, then you are optimal.
Of course, the usual notation used for l is lambda, and for a while around DC Everett ran Lambda Corporation, which did resource allocation problems for the US DoD.
Theorem: M is concave.
Proof: For t in the interval [0,1] and 1 x m l_1 and l_2, we have
M( t l_1 + (1 - t) l_2 )
= min_{x in C} L( x, t l_1 + (1 - t) l_2 )
= min_{x in C} f(x) + ( t l_1 + (1 - t) l_2 ) g(x)
= min_{x in C} ( t + (1 - t) ) f(x) + ( t l_1 + (1 - t) l_2 ) g(x)
= min_{x in C} t L( x, l_1 ) + (1 - t) L( x, l_2 )
>= t min_{x in C} L( x, l_1 ) + (1 - t) min_{x in C} L( x, l_2 )
= t M( l_1 ) + (1 - t) M( l_2 )
so that M is concave. Done.
Immediately we have supporting hyperplanes for our concave M: Suppose y and l are given and
M( l ) = L( y, l )
Then for 1 x m u,
M( l + u ) = min_{x in C} L( x, l + u )
= min_{x in C} ( L( x, l ) + u g(x) )
<= L( y, l ) + u g(y)
= M( l ) + u g(y)
so that for 1 x m u
s( u ) = M( l ) + u g(y)
is a supporting hyperplane for the concave function M; that is,
M( l + u ) <= s( u ) = M( l ) + u g(y)
and s( 0 ) = M( l + 0 ) = M( l ).
So, here we have the basic theory of non-linear duality as needed by the iterative algorithmic approach of primal-dual Lagrangian relaxation.
War Story:
Once some guys had a big resource allocation problem having to do with a suddenly legal case of cross selling for banks. They had formulated their problem as a 0-1 integer linear program with ~40,000 constraints and 600,000 variables.
They had tried simulated annealing, ran for days, and quit with no idea how close they were to optimality.
I looked at their formulation and found that, except for 16 of the constraints, the problem was comparatively easy.
So, I wrote out
L( x, l ) = f(x) + l g(x)
where l was 1 x 16 and g represented the 16 troublesome constraints. The set C represented the 0-1 constraints and the rest of the 40,000 constraints.
Then for each l, finding x to solve
M( l ) = min_{x in C} L( x, l )
was easy. Then I just had to find l to solve
max M( l )
subject to
l >= 0
and that was just to maximize a concave function.
So, I did some primal-dual iterations:
Set k = 1 and pick l^k >= 0.
Step 1 (Primal): Find x = x^k to solve
min_{x in C} L( x, l^k )
Then
M( l^k ) = L( x^k, l^k )
Step 2 (Termination): If x^k in F then
M( l^k ) <= z <= f( x^k )
so that if
f( x^k ) - M( l^k )
is sufficiently small, then terminate. Else if k is too large, terminate.
Step 3 (Dual): Find 1 x m u and w in R to solve
max w
subject to
w <= M( l^p ) + ( l^k + u - l^p ) g( x^p ), p = 1, 2, ..., k
l^k + u >= 0
Of course, this is just a linear program. Set
l^(k + 1) = l^k + u
k = k + 1
Go to Step 1.
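This is not the poster's code, but a small runnable sketch of the same primal-dual scheme on a toy 0-1 problem: the "hard" constraints A x <= b get dualized, the minimization over C (here just x in {0,1}^n) is solved coordinate by coordinate, and the dual step is the cutting-plane linear program from Step 3, solved with scipy.optimize.linprog. The box bound on the multipliers is a practical safeguard of this sketch, added to keep that LP bounded:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 60, 4                        # toy sizes: 60 binary variables, 4 dualized constraints
c = rng.normal(size=n)              # objective: minimize c @ x
A = rng.uniform(0, 1, size=(m, n))  # "hard" constraints A @ x <= b
b = 0.3 * A.sum(axis=1)

def solve_subproblem(lam):
    """min over x in {0,1}^n of L(x, lam) = c@x + lam@(A@x - b):
    separable, so x_j = 1 exactly when its reduced cost is negative."""
    reduced = c + A.T @ lam
    x = (reduced < 0).astype(float)
    M = reduced[reduced < 0].sum() - lam @ b      # = M(lam)
    return x, M

lam = np.zeros(m)
cuts = []                                         # (g(x^p), M(lam^p), lam^p)
best_feasible = np.inf
for k in range(100):
    x, M = solve_subproblem(lam)
    g = A @ x - b
    if np.all(g <= 1e-9):                         # x^k feasible: bounds bracket z
        best_feasible = min(best_feasible, c @ x)
        if best_feasible - M < 1e-6:
            break                                 # duality gap small enough
    cuts.append((g, M, lam.copy()))
    # Dual step: max w  s.t.  w <= M(lam^p) + (lam - lam^p) @ g(x^p), lam in [0, 50]^m.
    # Variables are (lam, w); linprog minimizes, so the objective is -w.
    A_ub = np.array([np.concatenate([-gp, [1.0]]) for gp, _, _ in cuts])
    b_ub = np.array([Mp - lp @ gp for gp, Mp, lp in cuts])
    res = linprog(c=np.concatenate([np.zeros(m), [-1.0]]),
                  A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, 50)] * m + [(None, None)])
    lam = res.x[:m]

print("lower bound M(lam):", M, " best feasible value:", best_feasible)
```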
There are refinements: e.g., for the linear program to find l^{k + 1}, one can hope to get better numerical performance by using central cutting planes.
For the real problem, in 905 seconds on a 90 MHz processor, I ran 500 primal-dual iterations and, then, had a feasible solution, from the bound inequalities, guaranteed to be within 0.025% of optimality.
Worked fine! So, simulated annealing is not the only tool in the tool box! Sometimes Lagrangian relaxation can work well, too!
Introducing Lagrangian relaxation quickly in a comment on a simulated annealing post is perhaps not the best way to convey a method. The machinery of mathematical programming you have to introduce quickly becomes an incomprehensible wall of text to most people. Better to write this up in a blog post somewhere and link to that.
Indeed, there are often better solutions than simulated annealing if the problem you are facing has some nice properties. I've been doing optimization on a highly nonlinear mathematical problem, but the objective function is so nice it only has a single maximum in the region I'm looking at. So I just ran Nelder-Mead, which terminates much more quickly.
For anyone interested in simulated annealing, what I wrote could be of more interest to them than simulated annealing and, really, easier to read than a comparably detailed description of simulated annealing.
It's some quite precisely done mathematics, with one theorem with a careful proof and some smaller results also proved.
And there's some impressive computational experience. We're talking a 0-1 integer linear program with 40,000 constraints and 600,000 variables -- an instance of an NP-hard problem class. That we could do anything with such a problem is a bit amazing.
That we can apply non-linear optimization to a problem in discrete or combinatorial optimization, that is, 0-1, is curious, in this case powerful, and, thus, generally welcome.
That we are making good progress with Lagrange multipliers without taking any derivatives, e.g., as in the Kuhn-Tucker conditions, is also curious -- so, this is not your father's Lagrange multipliers.
And I proved Everett's theorem: that's a famous result, and quite useful, much more useful than difficult to prove, in a wide range of resource allocation problems.
And Everett is a famous guy, a physics Ph.D. under J. Wheeler (one of the main guys in relativity theory, in the world) at Princeton and the first guy to expound on the many worlds (parallel universes) interpretation of quantum mechanics. Everett thought that his result was worth publishing, and what I wrote is more general and more powerful.
Simulated annealing is famous; that some non-linear duality theory and Lagrangian relaxation totally blew away simulated annealing on an impressive practical problem should also be of interest.
That it is possible to get better performance using central cutting planes is actually not very well known.
Also, I covered the mathematics in essentially full detail (everything important and not obvious, proved), starting with next to nothing, gave a description of an algorithm that is ready to code (it's basically just what I did code from, successfully), and also the impressive computational experience, all in just that one post.
Simulated annealing is rarely described so carefully, and it doesn't provide bounds on optimality, that is, we don't know when to stop. As I illustrated, my work provides bounds which commonly in practice provide a good stopping criterion.
Simulated annealing is sometimes regarded as part of computer science; well, my post shows something that sometimes is better, and from applied mathematics.
Actually, the mathematics for such optimization problems is a well-developed and deep field; try to do much in optimization, and you really should consider some of the mathematics, e.g., what I posted.
Oh, by the way, my Ph.D. is in stochastic optimal control, and I've published peer-reviewed original research on non-linear optimization, e.g., some deep details about the Kuhn-Tucker conditions.
My professors included some of the best people in the world in optimization. The Chair of my Ph.D. orals committee is a Member, US National Academy of Engineering, and was a student of A. Tucker of the Kuhn-Tucker conditions -- Kuhn and Tucker were both at Princeton.
You might let us know where you found the part with the "opinion"?
Supposedly HN likes only really highly relevant comments and viciously downvotes anything else, especially jokes -- and then everything else, no matter how serious or relevant, by the user they are downvoting.
Here I've been viciously, bitterly attacked by some people apparently totally pissed off at me for next to nothing: a simple contribution follow-up, a famous quote from the movie Casablanca, as a follow-up to a post with the earlier part of that famous quote from Casablanca.
So, my post on Lagrangian relaxation, that took much of an afternoon to write, with careful theorems and proofs, with some material not so easy to find, especially in such a succinct, careful, and complete form, and that stands to be of high interest to anyone interested in simulated annealing, gets downvoted -- a joke about my post does not, and my response to the joke post does.
HN is getting to be a very nasty place, not nearly as nice as, say, Reddit. Sorry HN.
Sam, PG, you've got a problem. You need to look at your log files and find who has decided to hate me and down vote anything I post, no matter what I write.
My post on Lagrangian relaxation is right up there in quality with some of the best expository writing in optimization, and I know enough to know. That I get down voted is the problem of HN, not my post. In the past few days, the bitter, hostile, personal attacks on me have cost me 70+ points, no doubt to the great pleasure of my attackers. I'm being bitterly attacked personally, for no good reason, and now independent of what I post.