The "Soft Maximum" function

mcantor · on Jan 20, 2010

This is my problem with mathematics. The concept of a "soft maximum," the principles behind calculating it, and its startling similarity to a hard maximum, are all fascinating and exciting to me. Look at that! Two completely separate functions, and such magical results. According to this post, it's useful for "convex optimization." I clicked through to the "related post," which was merely a comment about someone else's opinions on "convex optimization," so I looked it up on Wikipedia:

http://en.wikipedia.org/wiki/Convex_optimization

Ah! A technique used to "minimize convex functions." Maybe some of this notation will make sense to me if I understand the underlying concept of whatever a "convex function" is.

http://en.wikipedia.org/wiki/Convex_functions

Great. An entire article that is completely and utterly meaningless to me. I mean, absolutely nothing in that article--oh! "Convex sets?" That looks promising.

http://en.wikipedia.org/wiki/Convex_set

Jackpot! The pretty pictures make the idea of a convex set clear to me. Unfortunately, by now I'm 4 clicks away, and my actual understanding of the subject is clearly just scratching the surface. Connecting my newfound--and obviously still naive--understanding of convexity* to "soft maximums," which initially inspired this search, feels dumbfoundingly impossible.

Am I approaching this all wrong? Am I expecting too much? Thinking too little? I would love to understand more about this subject, and I have tried to learn it the same way I learned how to program: by Googling and working on my own problems. However, the resources simply don't seem to be there in the same way. What's the deal, here?

* Would it be more accurate to say "Euclidian convexity?" What would that mean, exactly?

lmkg · on Jan 20, 2010

I am a mathematician myself, and I have some of the same problems when learning about topics away from my experience, or over my head. Wikipedia is great for cherry-picking topics if you're already familiar with them, but math is a subject that really builds on foundations of previous concepts (arguably, that's all it is). Wikipedia's content organization doesn't lend itself well to "entry points" nor to the partial ordering of topics.

Until you get some foundational knowledge, math is one of the topics where you're better off using traditional textbooks than wikipedia. The information is arranged in a logical manner that builds on previous concepts. Unfortunately, most individual topics are better-explained in wikipedia than in most books. A suitable compromise might be using a textbook's table of contents as the order to follow in navigating wikipedia, but the articles also have the problem of not always being consistent with each other, in style, nomenclature, required knowledge, and point of view.

After about one semester's worth each of algebra and analysis, you would probably have enough working knowledge to at least not stumble around like a blind man on 50% of topics. My personal favorite analysis textbook was Rudin's Principles of Mathematical Analysis, however it's... not for the feint of heart. Seriously. It could kill a man. If you want to get the full value out of that book, you will be spending over an hour per page to make sure you understand every step of every argument.

fgimenez · on Jan 20, 2010

Trying to learn math from Wikipedia is like trying to learn English from the dictionary.

As far as the subject matter goes, a convex function can just be thought of as a function with a global maximum/minimum and no other local maximum/minimum. I.E. The derivative is zero in only one place which is the global max/min. These functions can then be easily optimized by taking their derivatives and finding the zero. It is a really nice way to optimize a problem.

yummyfajitas · on Jan 21, 2010

...a convex function can just be thought of as a function with a global maximum/minimum and no other local maximum/minimum.

This is false. The convex function exp(x) has no global max or min. The function f(x,y)=x^2 has infinitely many global minima.

The nice thing about convex functions is that the set of places where they achieve a minimum is a convex set: the empty set for exp(x) and the line x=0 for f(x,y)=x^2.

ntownsend · on Jan 21, 2010

No, in this case, a convex function is one where you can draw a straight line between any two points on its graph and not have the line intersect the graph at a third point.

pieceofpeace · on Jan 21, 2010

You are not contradicting what fgimenez said, its the same condition stated differently.

edit: if two statements are not essentially same (and I am wrong), could you please explain it.

jamii · on Jan 21, 2010

The logarithm function is convex by this definition but has no maximum/minimum point and never has 0 derivative.

mahmud · on Jan 21, 2010

I have about 5 years of informal/independent mathematical study and routinely use Wikipedia to crack fields open. I usually skim the intro then head straight to the references.

Once I have a vague idea of what it is, I drill down MathWorld using the new keywords.

swombat · on Jan 20, 2010

I would love to understand more about this subject, and I have tried to learn it the same way I learned how to program: by Googling and working on my own problems. However, the resources simply don't seem to be there in the same way. What's the deal, here?

I would suggest the method that works (and is proven to work by the gazillion undegraduate mathematicians and physicists who get it to work every year): get some tutoring in the topic and/or acquire some quality undegraduate (or graduate) textbooks.

Once you have the basic principles down (basically the curriculum of a degree), you will still find it hard to approach a completely new topic like "convex sets/functions/optimisations", but you'll have a method that works.

mcantor · on Jan 20, 2010

What, you mean if I complain hard enough, the knowledge will not zap itself into my brain, or magically appear in my browser window? Preposterous! Shenanigans! Chicanery!

Joking aside... I'll have to start snooping around for some good textbooks. I bet a couple of HNers out there might have some good suggestions, too!

swombat · on Jan 20, 2010

For a good grounding in a lot of maths, I have somewhat fond memories of Boas: http://www.amazon.com/Mathematical-Methods-Physical-Sciences...

Arfken was also tough, but excellent: http://www.amazon.com/Mathematical-Methods-Physicists-Sixth-...

Yes, these books are fucking expensive.

PS: I did a physics degree, so my favourite books are maths for physicists... maybe former Maths students will have "purer" books to recommend. Anyway, between these two, you'll have a good year or two of hard studying before you need to buy another one :-)

nlindig · on Jan 20, 2010

If you're looking for something less pricey... I highly recommend Stephen Boyd's book (free here: http://www.stanford.edu/~boyd/cvxbook/), as well as his cvx matlab toolbox. There are also some video lectures which might be less mind-numbing (http://www.stanford.edu/class/ee364a/index.html).

jules · on Jan 21, 2010

Expensive? I had to buy http://www.amazon.com/Statistical-Mechanics-Thermodynamics-B...

Half the number of pages, double the price.

edw519 · on Jan 20, 2010

This is my problem with mathematics.

"Do not worry about your problems with mathematics, I assure you mine are far greater." - Albert Einstein

mcantor · on Jan 20, 2010

Damn... Mr. Einstein saw me coming from 50 years away. You win this time, Al!

mahmud · on Jan 21, 2010

Einstein was addressing Georg Cantor, not you ;-)

jerf · on Jan 20, 2010

You need textbooks, which are primarily useful for their flattening of some subject into a sequence that can be absorbed. That's pretty much their purpose. Poking around online you can find a surprising array of reasonable quality text books both free and legal, though I have no particular suggestions for this topic.

yummyfajitas · on Jan 20, 2010

I think you are trying to understand too much at once. Rather than trying to understand all of the "convex functions" page, focus on the first sentence or two and the first picture.

Convex functions: if you draw a straight line between the points (x, f(x)) and (y, f(y)), then f(z) is below that straight line if z is between x and y. (The green line is below the red line in the first picture.)

Optimization: find x so that f(x) is as small as possible.

Convex optimization: Same as optimization, but now you know f(x) is a convex function (e.g., the soft max).

tel · on Jan 21, 2010

One of the most frustrating parts about learning math is that it is a body of knowledge driven constantly to generalization. This usually pays off, because once mathematicians connect just a few dots then often whole new problem spaces just collapse into yesterday's math, taken down by powerful, general theories, like convex optimization, applied with just the right slant.

The problem is that often those applications are so wide and varied that nobody bothers to talk very much about them. You're just supposed to understand the underlying purpose and need on your own. For someone with the background, that's ok, because the background is the context that lets you internalize the math. Without the background, applications are necessary.

Of course, this is well known for basic math. That's why you learn how to add apples and subtract oranges instead of number theory and fields. At some point, though, if you want to understand "real" mathematics, you need to somehow build that critical mass and context shift to where you can get by on the general level.

My advice is to take the math you do know and go read the proofs that make it operate. Use the applications you already are comfortable with and understand the reasons why they work. Instead of diving both into generalized math and a complex new subject at once, just use your strengths.

My anecdote is that I liked math but never really had a grasp on it until I got thrown through the ringer on infinite series, linear algebra, abstract vector spaces all at once. I learned to read and understand proofs (sometimes), I learned how to prioritize when reading complex math (i.e. internalizing definitions is often half or more of the battle), I learned how to synthesize my own general mathematical arguments.

I wouldn't call myself anything close to a mathematician, but by jumping into the language and structure of generalized math, I have a lot of the tools to get somewhere with whatever field I want. Wikipedia still often confuses me, but if I want to learn it I know how to take what I learn and use it as the building blocks to a deeper understanding.

jules · on Jan 21, 2010

Maybe I can connect the dots for you. Optimization is the general problem of finding the minimum of some function f in a region. The problem with this is that it's hard. One of the reasons why it's hard is that f could have many local minima. So we look for conditions on f that make the problem easier.

That's where convexity comes in. There are two different things that can be convex: sets and functions. Convexity of sets and convexity of functions is related but not the same. Convexity of sets is well explained by the pictures on wikipedia. A function is convex if for any two points you choose, the value of the function stays below the line that connects these two points.

An example of a convex function is e^x: http://www.wolframalpha.com/input/?i=e^x Take any two points on the graph and connect them with a line. The graph of e^x will stay below that line.

An example of a function that is not convex is sin(x): http://www.wolframalpha.com/input/?i=sin+x You can find two points on the graph such that if you connect them with a line sin(x) will go above that line. For example x=0 and x=pi.

If the function is convex then it will have only 1 minimum. So what you can do to find the minimum is just start at an arbitrary starting position and walk down hill and you will arrive at the minimum of f if f has a minimum. Note that if f had two local minima at different heights you might end up stuck in the higher minimum if you apply this method.

Now there are several techniques to walk down hill. One thing you could do is approximate f by a line. This tells you the slope of f at the point you're standing, so you know which direction is down hill. You walk a small step in this direction and repeat. This is gradient descent.

Another way to do it is to approximate f by a parabola. A parabola has a minimum that can be computed exactly. So you compute the minimum of the parabola and jump to it. Because it was an approximation this minimum will generally not be exactly f's minimum, but close. So you repeat the process to get closer and closer. This is Newton's method.

A complication is that many optimization problems have constraints on the parameters of f. For example f(a,b) could represent the profit you get from investing a dollars in product 1 and b dollars in product 2. But you only have 10 dollars so the constraint is a+b <= 10. If you just start looking for the maximum of f by taking steps or jumping around you will have to keep in mind that a+b cannot go above 10.

Another complication is that f may not be differentiable, which makes it harder to compute a line approximation and even harder to compute a parabola approximation.

There are books written on solving these issues. This one is available online: http://www.stanford.edu/~boyd/cvxbook/

yummyfajitas · on Jan 21, 2010

If the function is convex then it will have only 1 minimum...

f(x,y)=x^2 has an infinite set of minima: {(x,y) : x=0}, but is still convex.

ntownsend · on Jan 21, 2010

In the case of real-valued functions of one variable (which is the case at hand), if the function has a minimum and is convex then the minimum is global. If the function is strictly convex (also this case) then it will have at most one global minimum.

jules · on Jan 21, 2010

True. To keep the explanation simple I glossed over some details. The problem you describe exists even in the 1D case, e.g. the function f(x) = 0. However if a convex function has multiple minima then the minima will be connected and at the same height, so the algorithms still apply.

imurray · on Jan 20, 2010

The soft maximum often comes up when dealing with probabilistic models and in neural networks. I'm surprised that a blog that has been highlighting numerical issues hasn't pointed out how the soft maximum should be computed.

If x is vector of values then the naive code is:

  log(sum(exp(x)))

where exp operates elementwise. However, if even a single item of x is large (1000 say) this will return Inf. If x are log probabilities, where you want the log of the sum of the probabilities (common), the elements of x might all be less than -1000 (also common) and then the function will return -Inf.

Many people have a function called logsumexp() kicking about to compute the softmax robustly. It will do something like:

    y = max(x);
    return y + log(sum(exp(x-y)));

One example, slightly more elaborate, Matlab implementation is in: http://research.microsoft.com/en-us/um/people/minka/software...

dalke · on Jan 21, 2010

See the followup posting

tjic · on Jan 20, 2010

The huge flaw with this article: explaining WHY I would ever want to use a soft maximum.

Yes, it "sands off the corners".

Why do I want that?

flipper · on Jan 20, 2010

The article was possibly aimed at engineers or mathematicians so the author thought the benefits didn't need explaining.

A soft maximum might be useful if you want a differentiable function that closely approximates the max() function. By differentiable I mean you can work out the rate of change of the function at any point. With the hard maximum, the rate of change at the hard edge is not defined. Nicer to have one function to represent the RoC and not have to worry about special cases.

tjic · on Jan 21, 2010

Yep, I do have a CS degree, and I realized that the crisp corners are not differentiable ... but a short paragraph about that and giving a few real world examples would have strongly improved the post.

gchpaco · on Jan 20, 2010

As flipper points out, the problem is that the hard maximum is not differentiable; that is, the differential of max (x, y) is undefined when x == y. Most numerical approximation techniques require a smooth, differentiable function, so this is bad.

A related problem is that of bounding. Suppose I want to minimize the parameters in a set of nonlinear equations; I can numerically differentiate the equations to get the gradient and the Hamiltonian and then I can minimize that. But I may want to impose additional criteria, like x > 0. I could just say the function goes to some preposterously high number when x <= 0, but then we have this hard corner problem again. Instead if you use a continuous function like a logarithm to impose your bound, it affects the solution space minimally, makes for a solution that will not wander outside your bounds, and is differentiable.

snprbob86 · on Jan 20, 2010

These sorts of softened/smoothed functions are particularly useful in games and animation. I've personally used this function for animated sliding board game pieces in a 3D simulation. The piece launches from one square at full speed and then comes to a nice soft stop after some amount of time. Much more attractive.

DougBTX · on Jan 20, 2010

For your googling: "easing function"

ramanujan · on Jan 21, 2010

This is useful analytically, but one tradeoff here is that if this is in an inner loop, the exp/sum/log process will be a lot more expensive than a simple max.

For proofs and stuff it can be invaluable to have a differentiable objective function, but in actual computation it's not always necessary.

roundsquare · on Jan 21, 2010

Depending on your application, a good compromise might be to use soft max if the numbers are nearby and a hard max if they are very distant (though you'd need to find a suitable definition of distant and nearby).

mitko · on Jan 21, 2010

Also known as "smoothed" max/min which in my opinion is better name as it is mathematically more meaningful as the max function [NOT smooth] is approximated with a smooth function.

There is no such thing as soft function in mathematics! Please be consistent when naming...

shalmanese · on Jan 21, 2010

Is there a accepted modification to the function to make it scale invariant? I'm kind of weirded out by something that would spit out a different result if did a unit change.

memetichazard · on Jan 21, 2010

Here's an idea. Take the scaled function: g(x, y; k) = log( exp(kx) + exp(ky) ) / k

Let k be 1/max(x,y) so that you have:

g(x,y) = log(exp(1)+exp(min(x,y)/max(x,y))*max(x,y)

There are some issues if the maximum value can be negative or zero, though.