You can get a lot of the way (just not to `exp((E_old - E_new) / Temp)`) by explaining the origin of the algorithm:
People who work with metals have a process for making metals more supple, called "annealing". In annealing, you heat the metal up almost to the point where it loses its shape, then let it cool slowly. In the atomic picture, this chunk of metal has knots of stress and tiny mis-aligned crystals, but if we let the atoms jiggle, the knots can work themselves out and the crystals can grow, because atoms want to align with their neighbors. It works the other way too, though: if the atoms are jittering randomly, then at any given moment they are in a somewhat chaotic state, so if you suddenly stop that jitter ("lowering the temperature too fast"), you just freeze in a random state. So there is a sweet spot: you want just enough jitter to undo knots, but not so much jitter that you create new ones. By slowly reducing the temperature, you pass some special points where the neighbors have about the same impact on the crystal as the jitter, and as you slowly pass those, the jitter starts undoing the knots for you and then growing the crystals for you. You still often don't get perfection, but you can get closer.
What does this have to do with optimization? Physicists like to think of this "aligning with neighbors" process as minimizing some "interaction energy". But you can minimize whatever function you want with the same approach: (1) take any state you want, (2) let it randomly jitter to neighboring states, (3) accept the new state when it makes your function smaller, but also (with some probability p) when it makes it larger, and (4) slowly lower p as you allow the system to randomly jitter around.
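To make those four steps concrete, here is a minimal Python sketch of that loop (the `energy` and `random_neighbor` functions are problem-specific placeholders you would supply yourself; the acceptance rule uses the Boltzmann form mentioned at the top):

```python
import math
import random

def simulated_annealing(initial_state, energy, random_neighbor,
                        t_start=1.0, t_end=1e-3, cooling=0.995):
    """Generic simulated annealing loop (illustrative sketch).

    energy(state) is the function to minimize; random_neighbor(state)
    proposes a jittered state. Both are problem-specific placeholders.
    """
    state, e = initial_state, energy(initial_state)
    best_state, best_e = state, e
    temp = t_start
    while temp > t_end:
        candidate = random_neighbor(state)
        e_new = energy(candidate)
        # Always accept improvements; accept worse states with a
        # probability exp((E_old - E_new) / Temp) that shrinks as
        # the temperature drops.
        if e_new < e or random.random() < math.exp((e - e_new) / temp):
            state, e = candidate, e_new
            if e < best_e:
                best_state, best_e = state, e
        temp *= cooling  # slowly lower the temperature, and with it p
    return best_state, best_e
```

With, say, `energy=lambda x: (x - 3) ** 2` and a neighbor that adds a little Gaussian noise, this should settle near x = 3.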
You can also get a good idea of what's going on with a graph:
    \    _b_    /        _
     \__/   \   /       _|-- B
             \_/        _|-- A
Let's explain simulated annealing by looking at the barrier above (b), which has two different energy costs for jumping over it: B from the left and A + B from the right.
Here we understand the jitter as giving the state a typical energy E which lets it jump up the graph, though it still tends to fall into the holes. When E > A + B it will jump from the deeper valley on the right into the left one and back. But as we get into the range B < E < A + B, it's possible to jump from left to right, but not easily from right to left. The more we allow the system to jitter in this region as we slowly reduce E, the more likely the state ends up on the right somewhere, and from that point on the algorithm mirrors your normal gradient descent algorithm.
> You can get a lot of the way (just not to `exp((E_old - E_new) / Temp)`) by explaining the origin of the algorithm
The exp is from the so-called Boltzmann factor [1] which (in somewhat simplified terms) describes the likelihood that a physical system in equilibrium at temperature T will transition from a state with energy E_old to one with energy E_new. So sure, you can derive the whole algorithm from classic "annealing" and physics alone.
I mean, I know where it comes from; I just don't think that there's a nice hand-wavy way to get people to understand what's going on there: you really need to start with how uncertainties multiply across a system, define entropy as log(W) in the NVE ensemble, discover temperature hiding there too, then switch to the NVT ensemble (i.e. connect to an infinite reservoir of energy at a certain temperature) -- then you can get at Boltzmann factors.
Even then, as was pointed out to me recently, it's not clear that the "always accept a lower energy, accept a higher energy proportional to the Boltzmann factor" yields a steady-state Boltzmann distribution -- especially because that would imply that the result is independent of the number of configurations with a given energy (the "density of states"), which seems surprising. If that doesn't hold then it's either a convenient approximation or a bit of magical thinking...
It's very hard to have 'proofs' about convergence of global optimization algorithms, because to do so you have to make assumptions about the smoothness of the fitness landscape, the types of minima, and so on. This means that when someone is interested in publishing a new optimization algorithm, they just cherry-pick a subset of problems and show that their algorithm works better on those, when used with a given set of hyper-parameters.
In theory, the no-free-lunch theorem means that there is no optimal algorithm for an arbitrary fitness landscape, but in practice almost every optimization problem of actual interest is somewhat smooth.
Anyway, here is an interesting set of benchmarks by Andrea Gavana about different optimization algorithms: http://infinity77.net/global_optimization/ampgo.html. In my experience, as long as the fitness landscape is relatively smooth, multi-start gradient based methods tend to perform better, assuming you can get a somewhat decent approximation of the gradient (with AD, it's not too difficult).
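As a rough sketch of that multi-start idea (not Gavana's benchmark code; the test function and parameters are just examples), using scipy's gradient-based local optimizer from many random starting points:

```python
import numpy as np
from scipy.optimize import minimize

def rastrigin(x):
    # Standard multimodal test function: many local minima, global minimum at 0.
    return 10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x))

def multistart(f, dim, n_starts=50, low=-5.12, high=5.12, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(low, high, size=dim)
        # BFGS approximates the gradient by finite differences here;
        # with AD you would pass the exact gradient via jac=.
        result = minimize(f, x0, method="BFGS")
        if best is None or result.fun < best.fun:
            best = result
    return best

best = multistart(rastrigin, dim=5)
print(best.x, best.fun)
```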
For example, the first sentence, "Simulated annealing is a method for finding a good (not necessarily perfect) solution", is incorrect in theory, while correct in practice. I feel that it is this dichotomy that brings the most insight into the strengths and weaknesses of the approach. Further, the post doesn't give any good intuition on why it works, or how it works, or how it compares to alternatives... everything you would want to know. It also gets other details not quite right, like the origin of the Gibbs measure (the name "simulated annealing" comes from analogy to the annealing process in metallurgy, but the Gibbs formulation comes from probability theory, and the application here via the Ising model and ferromagnetism). This isn't very important to a practical intro, I guess, but it makes it harder to find the right places to read deeper if you wish to.
It's fairly easy to prove that simulated annealing done "properly" will converge to a uniform distribution over all global minima. The "properly" here involves a logarithmic cooling schedule and a little thought should convince you that this is in practice not going to help you any more than exhaustive search will.
It's very difficult to prove useful properties about convergence or the resulting distribution under faster cooling: therein lies the rub.
I suspect a big part of the popularity of the approach is that it is extremely easy to implement, but unfortunately naive application often doesn't work very well on complicated energy state spaces.
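For reference, the two schedules being contrasted might look like this (constants are arbitrary; the logarithmic one is what the convergence proofs require, the geometric one is what people actually run):

```python
import math

def logarithmic_temperature(k, c=1.0):
    # Schedule required by the classical convergence results; it decays
    # so slowly that in practice it is close to exhaustive search.
    return c / math.log(k + 2)

def geometric_temperature(k, t0=1.0, alpha=0.99):
    # Typical practical schedule: fast cooling, no convergence guarantee.
    return t0 * alpha ** k
```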
> It's fairly easy to prove that simulated annealing done "properly" will converge to a uniform distribution over all global minima. The "properly" here involves a logarithmic cooling schedule and a little thought should convince you that this is in practice not going to help you any more than exhaustive search will.
Are you sure this is proven? What if ergodicity is broken?
Too late to edit the parent, but you are right to call me out on this, thanks! Clearly, without some ergodicity constraint you cannot have a uniform distribution.
I was sloppy with terminology, sorry. The original approach to this requires weak and strong ergodicity conditions on an inhomogeneous Markov chain (but not stationarity of the chain, i.e. you don't have to run it forever). This was worked out in the mid 80s by Gidas and Mitra, if I recall correctly, and there have been several interesting updates since then. Other probabilistic arguments have been made as well, after that.
The problem then is that proving ergodicity is extremely hard for anything that is not a Markov chain to begin with.
IIRC ergodicity of hard spheres is not yet proved in full generality (n spheres, d dimensions), and that's basically the simplest physical system there is!
I nearly snorted my coffee through my nose at this:
"Simulated annealing's strength is that it
avoids getting caught at local maxima ..."
It's true that SA generally has better local-maximum avoiding properties than simplistic hill-climbing, but I've generally found that shotgun stochastic ballistic hill-climbing works at least as well.
> shotgun stochastic ballistic hill-climbing works at least as well
This is the method that I'd like a lot of these stochastic optimization algorithms compared to, especially genetic algorithms.
At least brute force sampling of the entire space is guaranteed to find the global extremum at some point, an argument I've heard for random sampling & minimising vs. GAs for random structure searching in materials science (a very high dimensional problem). It doesn't encode any assumptions about the surface, so it can't be optimal, but that can be a benefit if you want to find something surprising.
>> shotgun stochastic ballistic hill-climbing works at least as well
> This is the method that I'd like a lot of these stochastic optimization algorithms compared to, especially genetic algorithms.
> At least brute force sampling of the entire space ...
Not sure you meant it to do so, but this sounds like you think that "shotgun stochastic ballistic hill-climbing" is a brute force sampling of the entire space. It's not, it's a stochastic optimization algorithm. Brute force sampling of the entire space, or even reasonable portions of it, is completely infeasible for the problems I deal with, so I'm not sure what you're saying here. You might like to explain a little more.
Now that I've read your other comment with the terminology, I think "shotgun hill-climbing" is the appropriate phrase for the sort of materials science thing I'm talking about - you start atoms in random configurations, with minimal bias, and then minimise using the forces (available analytically) on those atoms with gradient descent/whatever optimiser (usually BFGS) until you find local minima. There's no momentum involved, and the gradient is available, but you still start in random positions and minimise to local minima. That sort of method is in competition with other structure-finding algorithms that use GAs, SA or some other fudged hybrids.
My reference to "brute force" was that, as far as I understand, in the limit of lots of attempts, "shotgun" will approach "brute force" -- imagine a really nasty landscape with lots of tiny basins. I understand "stochastic ballistic" modifies things slightly.
Yes to all - thanks for the clarification. The only comment is that in high-dimensional cases shotgun might theoretically approach brute force, but it never does so in practice because there are just too many places to start.
Discrete gradients are always available - you can step in random directions and re-evaluate giving an approximation. If your space is smooth(ish) (and usually it's "smooth enough") then you can use the local approximations.
Yes, sometimes it's not possible, but usually local random search gives enough to use the concept of "gradient" to help. Mostly I find that people doing the optimisation/search don't know enough about the techniques they're using to adapt them and transform their problem to get the best out of things.
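A sketch of that idea, estimating a gradient purely from function evaluations along random directions (the names and step size are illustrative):

```python
import numpy as np

def sampled_gradient(f, x, n_directions=20, eps=1e-3, rng=None):
    """Approximate the gradient of f at x from finite differences along
    random unit directions; useful when only evaluations are available."""
    rng = rng or np.random.default_rng()
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_directions):
        d = rng.standard_normal(x.shape)
        d /= np.linalg.norm(d)
        slope = (f(x + eps * d) - f(x - eps * d)) / (2 * eps)
        total += slope * d
    # Random unit directions satisfy E[d d^T] = I/n, so rescaling by
    # n/n_directions makes this an (approximately) unbiased gradient estimate.
    return total * (len(x) / n_directions)
```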
Yes, but it still makes no sense to compare GA performance with a gradient-based method on the problem GP is talking about, because an explicit gradient is available.
It depends on the problem you're working on and the number of variables you need to optimize. If you only have <20 variables on a nice smooth surface then yeah, a shotgun approach is probably going to work; then again, pretty much anything is going to work on a problem of that size.
On the other hand, if you're working with >100 variables on a highly diverging surface you're going to want an iterative approach. In particular you need something that is history dependent and avoids regions that have been previously searched. SA is great because it's simple, iterative, insanely stable, has great breadth and depth searching properties and is easily modifiable.
In comparison, approaches like GA might be faster to converge in some cases, but because they're more complex and system dependent it's going to take longer to fine-tune them to your particular problem.
> On the other hand if you're working with >100 variables on a highly diverging surface you're going to want an iterative approach.
I usually am, and I do - you seem to be misunderstanding the use of the word "shotgun" when accompanied by all the other words. The idea is to start in lots of places, randomly chosen, well spread, and then start climbing from all of them. As you do so you gain velocity - perhaps better thought of as minimising an error rather than maximising a fitness or "profit" function. As you drop down the surface you gain speed, changing direction according to a random sampling of the local gradients. Keep doing this and you tend to climb out of the well again, but if you then bleed off the energy you end up stuck in a deep well.
Cross-comparing the behaviours of the different starting points lets you think globally about some of the structure, but that usually only helps if it's got global structure. Usually it's better just to start with lots of points and look at the distribution of error of where they end up.
* Shotgun - start in lots of places and run simultaneously
* Ballistic - have your location include a velocity that changes as you move over the surface
* Stochastic - don't try to analyse your local surroundings deterministically - with 100s or thousands of dimensions you can't, so you sample projections of the local gradients.
* Hill-climbing - or hill descending, if you think of it as minimising error.
In general, I find it faster than either of SA or GA.
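I don't have the commenter's code, but here is one way those four ingredients could fit together, as an illustrative sketch (the objective `f`, the search box, and all parameters are placeholders; the gradient is sampled rather than exact, and the damping grows over time to "bleed off" the energy):

```python
import numpy as np

def shotgun_ballistic_descent(f, dim, n_starts=32, n_steps=500, step=0.05,
                              n_dirs=8, eps=1e-3, low=-5.0, high=5.0, seed=0):
    """Shotgun stochastic ballistic hill-descent, sketched: many random
    starts, each a particle with velocity, driven by a sampled gradient,
    with damping that increases so each particle eventually settles."""
    rng = np.random.default_rng(seed)
    best_x, best_f = None, np.inf
    for _ in range(n_starts):                      # shotgun: many starts
        x = rng.uniform(low, high, size=dim)
        v = np.zeros(dim)
        for k in range(n_steps):
            # stochastic: rough gradient from a few random directions
            g = np.zeros(dim)
            for _ in range(n_dirs):
                d = rng.standard_normal(dim)
                d /= np.linalg.norm(d)
                g += (f(x + eps * d) - f(x - eps * d)) / (2 * eps) * d
            g *= dim / n_dirs
            # ballistic: carry velocity; bleed off energy as time passes
            damping = 0.99 - 0.2 * (k / n_steps)
            v = damping * v - step * g
            x = x + v
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f
```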
Do you have a repo with your source or an example? In my experience conjugate gradient, Powell's method or any other local optimization method is essentially useless with >100 variables or so. Optimizing anything with thousands of variables is a feat; I'd be interested to see more details of how you do it.
Everything I do in this field is commercial-in-confidence, so I can't share any of my existing code like this. I may yet write this up, get clearance, and blog about it, but that won't happen real soon.
I also found Powell's method largely useless in high dimensional spaces, but I think that's because it's semi-static. The search point doesn't gain momentum, so it quickly gets stuck in local minima. If the search point "picks up speed" then it can coast up and back out of minima. Then you bleed off the speed at greater or lesser rates so the search location (which you can think of as a particle) eventually doesn't have enough energy to get out of the minimum it's in. This is very similar to Boltzmann-like behaviour - where you end up depends less on the area (measure) of your catchment, and more on the depth of your local optimum. In this sense it shares some characteristics with SA.
Hence the "ballistic" part of this.
And truth be told, in several hundred dimensions everything is a crap shoot. High-dimensional spheres are spiky, and the terrain you're exploring is effectively a mesa with wells and flagpoles. I've just had more success with SSBHD than with SA, GA, Powell, gradient-descent, or anything else.
"Shotgun" just means "with restarts" (right?) but what do you mean by "ballistic"? "With momentum" maybe? I can't see that term used anywhere in the literature.
Sorry - how does what differ from a "learning rate"? Learning rate as used in what connection? As far as I can see there is no connection between what I've described and anything I will call a learning rate in any of the other methods, so you'll have to be more explicit.
Ah, I meant your notion of velocity/ballistic. I'm thinking of the learning rate parameter you might use to control how quickly a gradient descent algorithm converges for some machine learning algorithm.
As a tangent, I'm really happy to see that Python is slowly but surely becoming the go-to demonstration language of posts like this. Wasn't always this way. Python is so lucidly readable that it serves great as "executable pseudocode". I mean, if you provide pseudocode for an algorithm, might as well write Python - it's mostly as readable and a huge advantage is that you can actually run and test it.
I think the widespread use of scipy/numpy is directly correlated. Python code is generally very easy to port, too, so I'm very happy when it's used for examples even though I don't "know" or write Python.
This method comes from statistical physics. Basically you want to generate a Boltzmann distribution of states as a function of "energy". By defining the Markov chain in a certain way you can achieve this. This is the Metropolis-Hastings algorithm [1].
I believe it's somewhat hard to use this algorithm for optimum finding. You have to tune the temperature, and it works best if you have a good initial temperature and you cool the system down slowly (so you have to choose a reasonable timescale for thermalization). Some physical insight could help with both.
However, this algorithm is also used to actually simulate physical systems at finite temperature where you do not need dynamics. You only simulate the Boltzmann distribution and you measure interesting mean values and correlations. It is mostly used for describing phase transitions.
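To make the distinction concrete, a sketch of plain Metropolis sampling at one fixed temperature, where the output is the whole set of visited states (for averaging observables) rather than a single best state; `energy` and `propose` are problem-specific placeholders:

```python
import math
import random

def metropolis_samples(energy, initial_state, propose, temperature,
                       n_samples=10_000, burn_in=1_000):
    """Draw states with probability proportional to
    exp(-energy(state) / temperature) using the Metropolis rule."""
    state, e = initial_state, energy(initial_state)
    samples = []
    for step in range(n_samples + burn_in):
        candidate = propose(state)
        e_new = energy(candidate)
        # Accept with probability min(1, exp(-(E_new - E_old) / T)).
        if e_new <= e or random.random() < math.exp((e - e_new) / temperature):
            state, e = candidate, e_new
        if step >= burn_in:
            samples.append(state)  # average these later for observables
    return samples
```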
Genetic algorithms and simulated annealing are probably some of my favorite topics which ended up not being applicable in my day job, despite working very heavily in non-linear optimization. Our problems are of the size where gradient-based wins the day, hands down. But GA/SA are still very fun for my side projects.
Simulated annealing can be gradient based - the simulated annealing step refers to the probability of accepting a given minimum as a function of the 'temperature'.
Simulated annealing is gradient based. It's an extension of gradient descent and in the degenerate case (zero temperature) they're the same: it generates random neighbouring states, and if the fitness of that state is better than the current one then it jumps there. That is, it seeks the local minimum.
The key part that simulated annealing adds is that when the temperature is non-zero there's also a chance of jumping to worse states. The probability depends on how much worse the new state is and what the temperature is. It's unlikely to make the jump if it's much worse or if the temperature is too cold. The idea here is to jump out of local minima to (hopefully) find the global minimum.
The temperature is a knob that trades off between exploration (high temperature) and exploitation (low temperature). You start off with a high temperature and jump around a lot, hopefully seeking the general vicinity of the global minimum. As you do so, you gradually decrease the temperature and settle on a locally-optimal solution.
Let's say the fitness landscape looks like this:
    \    B    /
     \A_/ \C /
           \D/
If you are at A and you're seeking the lowest point, then there's a probability that you will jump to B even though it is worse. The hope is that you'll then discover C and finally D, the global minimum. There's no route from A to D with monotonically improving fitness, so gradient descent won't find it. But simulated annealing might.
The local minimization step does not have to be gradient based. You could use something like Nelder-Mead to do the local minimization step, and still use an annealing schedule to decide whether to accept or reject that particular minimum.
There's a whole genre of stochastic optimization algorithms in the same family - parallel tempering in particular has some nice theoretical properties (it strictly dominates simulated annealing, assuming you have computational power to run all the replicas, and has a light guarantee that "eventually" it will converge), and there are adaptive methods to optimize the step sizes, number of replicas, and temperature ranges.
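For flavour, a sketch of the replica-exchange step that distinguishes parallel tempering: several Metropolis chains run at different fixed temperatures and periodically attempt to swap states. The swap acceptance rule below is the standard one; everything else (the placeholders `energy` and `propose`, the sweep counts) is illustrative:

```python
import math
import random

def parallel_tempering(energy, propose, initial_states, temperatures,
                       n_sweeps=5_000, swap_every=10):
    """Minimal replica-exchange sketch: one Metropolis chain per
    temperature, with neighbouring replicas occasionally swapping states."""
    states = list(initial_states)
    energies = [energy(s) for s in states]
    best_state, best_e = min(zip(states, energies), key=lambda p: p[1])
    for sweep in range(n_sweeps):
        # Ordinary Metropolis move within each replica.
        for i, temp in enumerate(temperatures):
            cand = propose(states[i])
            e_new = energy(cand)
            if e_new <= energies[i] or \
               random.random() < math.exp((energies[i] - e_new) / temp):
                states[i], energies[i] = cand, e_new
                if e_new < best_e:
                    best_state, best_e = cand, e_new
        # Occasionally try to swap neighbouring replicas: accepted with
        # probability min(1, exp((1/T_i - 1/T_j) * (E_i - E_j))).
        if sweep % swap_every == 0:
            for i in range(len(temperatures) - 1):
                d_beta = 1.0 / temperatures[i] - 1.0 / temperatures[i + 1]
                d_e = energies[i] - energies[i + 1]
                exponent = d_beta * d_e
                if exponent >= 0 or random.random() < math.exp(exponent):
                    states[i], states[i + 1] = states[i + 1], states[i]
                    energies[i], energies[i + 1] = energies[i + 1], energies[i]
    return best_state, best_e
```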
Aaah, good old simulated annealing. I remember using it for my computational-modelling-based final year project (which led to a workshop paper [1]) where brute forcing wasn't really a viable option.
If only I'd implemented the obvious extension to my graduate dissertation on optimising taxi routes (using a slightly different mechanism, tabu search and squeaky wheel)...
Unfortunately someone has fairly recently done it.
Although she doesn't say so, it appears she is assuming the cost function is normalized (range is 0-1), in this formula: exp((E_old - E_new) / Temp). Is that the case?
Okay, I once looked into simulated annealing, and I have just read all the comments on this thread. So, apparently what I have to contribute is not here yet.
Here I outline non-linear duality theory and Lagrangian relaxation and report a war story where this approach badly beat some efforts with simulated annealing. Basically, non-linear duality theory has some surprising properties commonly just too powerful to ignore.
Basically we're considering a case of optimization or mathematical programming.
So, let R denote the set of real numbers, and suppose positive integers m and n are given. Let R^n be the usual n-dimensional real vector space of n-tuples. To use matrix notation, we regard each n-tuple x in R^n as a matrix with n rows and one column, that is, as n x 1. Similarly for R^m.
We are given a set C, a subset of R^n, and functions
f: R^n --> R
g: R^n --> R^m
We seek x in R^n to solve the non-linear program (NLP)
min z = f(x)
subject to
g(x) <= 0
x in C
where the 0 here is the m x 1 matrix of zeros.
We let the feasible region F, a subset of R^n, be
F = { x | x in C and g(x) <= 0 }
We regard g as m x 1 where, for j = 1, 2, ..., m, component j of g is
g_j: R^n --> R
where g_j, borrowing TeX notation, is g with subscript j.
Okay, given 1 x m l, we let the Lagrangian be
L( x, l ) = f(x) + l g(x)
Then for l >= 0 and x in F we have l g(x) <= 0, so
f(x) >= L( x, l ) >= min_{x' in C} L( x', l )
The dual is to find l >= 0 to maximize
M( l ) = min_{x in C} L( x, l )
Taking the minimum over x in F on the left above, M( l ) <= z, so that M( l ) is a lower bound on z.
In particular (H. Everett), if l >= 0, x attains the minimum in M( l ), and g(x) = 0, then
z >= M( l ) = min_{x' in C} L( x', l ) = f(x) + l g(x) = f(x)
and x in F (since x in C and g(x) = 0), so that f(x) = z and x solves NLP.
A practical special case: essentially, if you spend all the productive resources and do so optimally, then you are optimal.
Of course, the usual notation used for l is lambda, and for a while around DC Everett ran Lambda Corporation, which did resource allocation problems for the US DoD.
Theorem: M is concave.
Proof: For t in the interval [0,1] and 1 x m l_1 and l_2, we have
M( t l_1 + (1 - t) l_2 )
= min_{x in C} L( x, t l_1 + (1 - t) l_2 )
= min_{x in C} f(x) + ( t l_1 + (1 - t) l_2 ) g(x)
= min_{x in C} ( t + (1 - t) ) f(x) + ( t l_1 + (1 - t) l_2 ) g(x)
= min_{x in C} t L( x, l_1 ) + (1 - t) L( x, l_2 )
>= t min_{x in C} L( x, l_1 ) + (1 - t) min_{x in C} L( x, l_2 )
= t M( l_1 ) + (1 - t) M( l_2 )
so that M is concave. Done.
Immediately we have supporting hyperplanes for our concave M: Suppose y and l are given and
M( l ) = L( y, l )
Then for 1 x m u,
M( l + u ) = min_{x in C} L( x, l + u )
= min_{x in C} ( L( x, l ) + u g(x) )
<= L( y, l ) + u g(y)
= M( l ) + u g(y)
so that for 1 x m u
s( u ) = M( l ) + u g(y)
is a supporting hyperplane for the concave function M; that is,
M( l + u ) <= s( u ) = M( l ) + u g(y)
and s( 0 ) = M( l + 0 ) = M( l ).
So, here we have the basic theory of non-linear duality as needed by the iterative algorithmic approach of primal-dual Lagrangian relaxation.
War Story:
Once some guys had a big resource allocation problem having to do with a suddenly legal case of cross selling for banks. They had formulated their problem as a 0-1 integer linear program with ~40,000 constraints and 600,000 variables.
They had tried simulated annealing, ran for days, and quit with no idea how close they were to optimality.
I looked at their formulation and found that, except for 16 of the constraints, the problem was comparatively easy.
So, I wrote out
L( x, l ) = f(x) + l g(x)
where l was 1 x 16 and g represented the 16 troublesome constraints. The set C represented the 0-1 constraints and the rest of the 40,000 constraints.
Then for each l, finding x to solve
M( l ) = min_{x in C} L( x, l )
was easy. Then I just had to find l to solve
max M( l )
subject to
l >= 0
and that was just to maximize a concave function.
So, I did some primal-dual iterations:
Set k = 1 and pick l^k >= 0.
Step 1 (Primal): Find x = x^k to solve
min_{x in C} L( x, l^k )
Then
M( l^k ) = L( x^k, l^k )
Step 2 (Termination): If x^k in F then
M( l^k ) <= z <= f( x^k )
so that if
f( x^k ) - M( l^k )
is sufficiently small, then terminate. Else if k is too large, terminate.
Step 3 (Dual): Find 1 x m u and w in R to solve
max w
subject to
w <= M( l^p ) + ( l^k + u - l^p ) g( x^p ), p = 1, 2, ..., k
l^k + u >= 0
Of course, this is just a linear program. Set
l^(k + 1) = l^k + u
k = k + 1
Go to Step 1.
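This is not the poster's code, but a small runnable sketch of the same primal-dual scheme on a toy 0-1 problem: the "hard" constraints A x <= b get dualized, the minimization over C (here just x in {0,1}^n) is solved coordinate by coordinate, and the dual step is the cutting-plane linear program from Step 3, solved with scipy.optimize.linprog. The box bound on the multipliers is a practical safeguard of this sketch, added to keep that LP bounded:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 60, 4                        # toy sizes: 60 binary variables, 4 dualized constraints
c = rng.normal(size=n)              # objective: minimize c @ x
A = rng.uniform(0, 1, size=(m, n))  # "hard" constraints A @ x <= b
b = 0.3 * A.sum(axis=1)

def solve_subproblem(lam):
    """min over x in {0,1}^n of L(x, lam) = c@x + lam@(A@x - b):
    separable, so x_j = 1 exactly when its reduced cost is negative."""
    reduced = c + A.T @ lam
    x = (reduced < 0).astype(float)
    M = reduced[reduced < 0].sum() - lam @ b      # = M(lam)
    return x, M

lam = np.zeros(m)
cuts = []                                         # (g(x^p), M(lam^p), lam^p)
best_feasible = np.inf
for k in range(100):
    x, M = solve_subproblem(lam)
    g = A @ x - b
    if np.all(g <= 1e-9):                         # x^k feasible: bounds bracket z
        best_feasible = min(best_feasible, c @ x)
        if best_feasible - M < 1e-6:
            break                                 # duality gap small enough
    cuts.append((g, M, lam.copy()))
    # Dual step: max w  s.t.  w <= M(lam^p) + (lam - lam^p) @ g(x^p), lam in [0, 50]^m.
    # Variables are (lam, w); linprog minimizes, so the objective is -w.
    A_ub = np.array([np.concatenate([-gp, [1.0]]) for gp, _, _ in cuts])
    b_ub = np.array([Mp - lp @ gp for gp, Mp, lp in cuts])
    res = linprog(c=np.concatenate([np.zeros(m), [-1.0]]),
                  A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, 50)] * m + [(None, None)])
    lam = res.x[:m]

print("lower bound M(lam):", M, " best feasible value:", best_feasible)
```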
There are refinements: e.g., for the linear program to find l^{k + 1}, one can hope to get better numerical performance by using central cutting planes.
For the real problem, in 905 seconds on a 90 MHz processor, I ran 500 primal-dual iterations and, then, had a feasible solution, from the bound inequalities, guaranteed to be within 0.025% of optimality.
Worked fine! So, simulated annealing is not the only tool in the tool box! Sometimes Lagrangian relaxation can work well, too!
Introducing Lagrangian relaxation quickly in a comment on a simulated annealing post is perhaps not the best way to convey a method. The machinery of mathematical programming you have to introduce quickly becomes an incomprehensible wall of text to most people. Better to write this up in a blog post somewhere and link to that.
Indeed, there are often better solutions than simulated annealing if the problem you are facing has some nice properties. I've been doing optimization on a highly nonlinear mathematical problem, but the objective function is so nice it only has a single maximum in the region I'm looking at. So I just ran Nelder-Mead, which terminates much more quickly.
For anyone interested in simulated annealing, what I wrote could be of more interest to them than simulated annealing and, really, easier to read than a comparably detailed description of simulated annealing.
It's some quite precisely done mathematics, with one theorem with a careful proof and some smaller results also proved.
And there's some impressive computational experience. We're talking a 0-1 integer linear program with 40,000 constraints and 600,000 variables -- an instance of an NP-hard problem class. That we could do anything with such a problem is a bit amazing.
That we can apply non-linear optimization to a problem in discrete or combinatorial optimization, that is, 0-1, is curious, in this case powerful, and, thus, generally welcome.
That we are making good progress with Lagrange multipliers without taking any derivatives, e.g., as in the Kuhn-Tucker conditions, is also curious -- so, this is not your father's Lagrange multipliers.
And I proved Everett's theorem: that's a famous result, and quite useful, much more useful than difficult to prove, in a wide range of resource allocation problems.
And Everett is a famous guy, a physics Ph.D. under J. Wheeler (one of the main guys in relativity theory, in the world) at Princeton and the first guy to expound on the many worlds (parallel universes) interpretation of quantum mechanics. Everett thought that his result was worth publishing, and what I wrote is more general and more powerful.
Simulated annealing is famous; that some non-linear duality theory and Lagrangian relaxation totally blew away simulated annealing on an impressive practical problem should also be of interest.
That it is possible to get better performance using central cutting planes is actually not very well known.
Also, I covered the mathematics in essentially full detail (everything important and not obvious, proved), starting with next to nothing, gave a description of an algorithm that is ready to code (it's basically just what I did code from, successfully), and also the impressive computational experience, all in just that one post.
Simulated annealing is rarely described so carefully, and it doesn't provide bounds on optimality, that is, we don't know when to stop. As I illustrated, my work provides bounds which commonly in practice provide a good stopping criterion.
Simulated annealing is sometimes regarded as part of computer science; well, my post shows something that sometimes is better, and from applied mathematics.
Actually, the mathematics for such optimization problems is a well-developed and deep field; try to do much in optimization, and you really should consider some of the mathematics, e.g., what I posted.
Oh, by the way, my Ph.D. is in stochastic optimal control, and I've published peer-reviewed original research on non-linear optimization, e.g., some deep details about the Kuhn-Tucker conditions.
My professors included some of the best people in the world in optimization. The Chair of my Ph.D. orals committee is a Member, US National Academy of Engineering, and was a student of A. Tucker of the Kuhn-Tucker conditions -- Kuhn and Tucker were both at Princeton.
You might let us know where you found the part with the "opinion"?
Supposedly HN likes only really highly relevant comments and viciously downvotes anything else, especially jokes -- and then everything else, no matter how serious or relevant, by the user they are downvoting.
Here I've been viciously, bitterly attacked by some people apparently totally pissed off at me for next to nothing: a simple contribution follow-up, a famous quote from the movie Casablanca, as a follow-up to a post with the earlier part of that famous quote from Casablanca.
So, my post on Lagrangian relaxation, that took much of an afternoon to write, with careful theorems and proofs, with some material not so easy to find, especially in such a succinct, careful, and complete form, and that stands to be of high interest to anyone interested in simulated annealing, gets downvoted -- a joke about my post does not, and my response to the joke post does.
HN is getting to be a very nasty place, not nearly as nice as, say, Reddit. Sorry HN.
Sam, PG, you've got a problem. You need to look at your log files and find who has decided to hate me and down vote anything I post, no matter what I write.
My post on Lagrangian relaxation is right up there in quality with some of the best expository writing in optimization, and I know enough to know. That I get down voted is the problem of HN, not my post. In the past few days, the bitter, hostile, personal attacks on me have cost me 70+ points, no doubt to the great pleasure of my attackers. I'm being bitterly attacked personally, for no good reason, and now independent of what I post.