One technique Prof Gower's mentions (the phenomenon of the
concentration of measure) has many applications in machine
learning/statistics (which may be of more interest to HN readers than
some of the probabilistic combinatorics mentioned in the paper).
A few (non-rigourous) examples in ML/statistics are:
- The use of the LASSO estimator [0] (minimizer of |Y - β X|₂ + λ |β|₁)
for variable selection is justified by Candes & Tao's 2005 results
[1, 2]
that show, under fairly general conditions, that this estimate
exactly recovers the true set of active variables. A key step in the
proof is finding a concentration of measure result for sub-Gaussian
random matrices. A corollary is that (with high probability) we can
exactly solve the non-convex problem |Y - β X|₂ + λ |β|₀ by passing
to the convex relation |Y - β X|₂ + λ |β|₁ and have strong
guarantees that our relaxed solution is optimal.
- The Johnson-Lindenstrauss lemma for dimensionality reduction, etc
[3] can be proven by essentially the same principle (random projections
are "probably" nearly isometries) that Prof Gowers describes in this
note.
I was lucky enough to have Tim Gowers as a lecturer in my first term of university (hmm... 2002) when he gave a lecture course introducing number theory, set theory, combinatorics and axiomatic construction of number systems. It was intoxicating to be taught by someone who was (a) brilliant, and (b) so thoroughly enthusiastic about mathematics. I eventually routed off into applied math and physics, but I look back on those initial lectures fondly, and still have an appreciation for the kind of "problem solving" mathematics that he taught.
The argument can be adapted almost verbatim for hacker type vs theory type programmers. One is in the problem solver camp the other in the fundamental understanding and theory building camp. Silicon Valley is dominated by hackers and they are constantly re-inventing wheels because of lack of theoretical understanding, e.g. NoSQL, and the theory builders are working on problems that will almost surely collect dust in some dark corner of a university library.
Funny… While I personally like beautiful theory, the last bit of deepish theoretical work I am studying was to solve a particular problem.
I'm currently delving into Earley Parsing, so I can write compilers easily, so I can write DSLs economically, so I can write short programs in any language, so I can spend less time writing those programs, so I can generate less bugs from those programs, so these programs ultimately cost less money$$$.
Then there is Bret Victor, who uses very simple principles to solve the long standing problem of understanding in various domains.
I probably stand in the "the best solutions come from the deepest theories" camp. With the caveat that "deep" does not necessarily mean "complex", on the contrary. Lambda calculus is very deep, but quite simple. So, while I believe problems are the centre of our field, I also believe we need to break down those problems into deep and simple math before we can tackle them efficiently. Not doing so is a recipe for bloat.
> I'm currently delving into Earley Parsing, so I can write compilers easily, so I can write DSLs economically, so I can write short programs in any language, so I can spend less time writing those programs, so I can generate less bugs from those programs, so these programs ultimately cost less money$$$.
'Earley parsing' immediately trips a flag in my memory, so I'll just ask—have you seen Jeffrey Kegler's Marpa parser (http://jeffreykegler.github.io/Marpa-web-site, with associated blog)?
(I don't know you, so I apologise in advance if you are Jeffrey Kegler. :-) )
Okay, a cursory look shows this will only work on context-free grammar. I have read another paper showing that you can add some context sensitivity to Earley parsers, I hope you can do that with Marpa parsers as well: ultimately, I want to combine the generality of Earley parsing with the expressive power of Parsing Expression Grammars.
Damn, though. If Marpa parsing is a good as its promises, I may have to jump ship again. (The first time I did was when going from PEGs to Earley.)
I disagree about Bret Victor. There is nothing simple about the principles that he uses. In fact you need a whole lot of very deep theoretical understanding to even get to the point of Bret's abilities. Time traveling debuggers for one are very advanced technology. If anything Bret Victor is applying very advanced theory to mundane domains like programming.
> Time traveling debuggers for one are very advanced technology.
Yes they are. But the underlying principles are dead simple: just keep all the history of the memory, never erase anything!
Of course, the implementation is not that simple, because efficiency, because complex imperative languages, because many other things. But the basic principles are still dead simple. You may be amazed by time travelling with GDB, but OCaml had that years before GNU did it.
I feel the same way about Bret Victor. He does understand things that are hard to grok. Probably very abstract stuff. But I suspect it wouldn't take much space to write those things down.
I gather you have seen his videos, most notably the dead fish and dynamic visualization ones? Learnable Programming also counts. Tell, me, why did you find those presentations so amazing?
The reason I found them so amazing was their simplicity. Bret's demos show incredibly innovative features, but at the same time, those features felt incredibly simple: easy to use, the feedback is obvious, and frankly, most of his particular examples don't seem that hard to implement. The hard part was coming up with the idea in the first place.
Plus, I can see 2 overarching principles in Bret's work: the first one is, using our visual system to see the relevant stuff at a glance. Give me a graph, not a listing. The second one is direct interaction. You could think of it as a REPL on steroids, or even a live image, but it's more than that. In a live environment, you would change a variable, and see the consequences in real time. Bret Victor goes beyond even time, and let us manipulate systems from the outside, as if they were a block universe (where time is just another dimension).
In learnable programming, you can see how he slides the force of the jump, to make it just right so our little Mario can go through that little secret passage. We can visualize the whole path (that's the first principle), and we can see what that path would have been if you change the laws of physics (that's the second principle).
Very deep stuff. Very hard stuff. But as I said, very simple as well. And I expect future insights will make it even simpler.
---
Let us go back to my Earley Parsing for a second, because it's quite illustrative. My journey began with this talk[1] by Ian Piumarta. He mentioned Earley parsing at some point saying it was very general, handles every possible context-free grammars, and is also very simple, very easy to implement: half an hour to translate the algorithm in any language.
Months later, I'm still at it.
But you know what? As I come to understand Earley parsing, I agree with him more and more. This thing is genuinely simple. I wrote a recognizer in less than a hundred lines of Lua code, and the whole parser will likely take less than 200 lines. As I said, simple.
Simple does not mean easy however. Without the proper insights, I could only struggle. But then I stumble upon a sentence, a picture… and then something clicks. And I can see the simplicity, as well as the depth. And then it clicks again. And it becomes even simpler. To the point that now, I understand where Piumarta is coming from. And I still remember why it was so difficult for me back then.
I mean, Earley parsing did use to feel kinda complex, and quite ad-hoc. At the same time, it didn't seem too difficult. I was wrong. Earey parsing proved to be much more difficult, and at the same time much simpler, than I had expected.
I feel the parallel is more sophisticated. In the paper Gowers bemoans that the theoreticians dismiss the depth of combinatorics. Similarly, in modern distributed systems, there is a huge gulf between the theoretically 'interesting' problems of optimality and the difficult problems solved by the practitioners (deployment, fault mitigation and control mechanisms) and similarly a lack of vocabulary to articulate a grand sounding theory from real problems.
I would put myself in the problem solver class. I want to understand the theory, so I can solve the problems more effectively. Certainly the more experience I get, the better I write code, due to a greater depth of understanding. My programming is becoming more declarative / functional, though at the end of the day, sometimes crappy hacks need to be added to get things working.
I am currently reading "The Master and His Emissary: The Divided Brain and the Making of the Western World" by Iain McGilchrist. It provides a good overview of current insight into the roles of the separate hemispheres of the brain (way beyond the usual hippy stuff). One of the distinctions made is that the left hemisphere creates "abstraction" and the right hemisphere works by using "examples". A good course in (pure) mathematics will emphasize both: understand the theorem but also consider these examples that demonstrate the theorem. Now it strikes me that this article by Gowers paints a deeper picture of this left/right brain divide in mathematics, and the tension between them that McGilchrist describes in his book. How remarkable.
This isn't just with Mathematics. In general in any profession, people who start new things are valued more than the people who solve problems with existing things.
On earth day I was speaking to a volunteer, who talked about how people who talk about saving existing forests are taken to be anti industrial progress but people who plant new trees are considered as people who are changing the world.
This applies to any thing you will ever do.
This is also why you must avoid jobs and responsibilities that deal with legacy projects, and seek projects that are doing something new.
Is this something really new? I've always thought that for any field of science, there are always those that research for the sake of understanding, and those that study/ research the field to be use as an abstraction for another, more practical field (as noted in http://xkcd.com/435/ ). Even back hundreds of years ago, Newton developed Calculus for physics, but Euler is studying maths for the sake of understanding.
Computer science is embodied mathematics... It stands to reason that some trends would be reproduced.
The divide discussed in the article is definitely quite real. There's a fairly significant bunch of young mathematicians who develop software and do extensive computer exploration to identify interesting patterns and attack hard problems in combinatorics. But learning to be actually good at this requires a significant investment of time, meaning that our programming problem solvers are even further removed from the trendy world of algebraic geometry. Just yesterday I was hearing a story of an algebraic geometer who worked with the late, great Alain Lascoux. Lascoux did some interesting computations, which the geometer dropped in his talk without mention of where the numbers came from. He was quite resistant to think of the skill of actually computing these numbers as a real skill with merit...
There is a real need for problem solving; it can illuminate new directions when the combinatorics of combining existing theorems becomes exhausting. But it does seem to be rather undervalued in a world dominated by algebraic geometers and such.
Though more subtle, I'd say yes, there are both (1) people who enjoy software development regardless of its impact on the world, and (2) others with more utilitarian motives.
Personally, I'm split. Though I idealize that genuine curiosity of the pure mathematician, I often feel obligated to use my skills 'for good', whatever that means.
I think there are several different axes one can split programmer culture down (which may or may not be orthogonal).
Brian Beckman explained one such axis, as an aside, in "Don't fear the Monad" [1], about two differing views that born in the '70s and split two programmers into two camps:
The bottom-up people, and the top-down people:
- The bottom-up people start with the hardware and only add abstractions and trade performance where necessary (fortran, c, java)
- The top-down people started with perfect abstraction/logic and reduced/removed abstraction to get access to the hardware and performance where necessary (lisp, ml, haskell)
What is interesting these days, is that these two camps are trying to steal more and more ideas from each other [2]
Computer Science vs. Software Engineering; although usually the label "Computer Science" is applied to both.
Paraphrasing the PDF:
Loosely speaking, I mean the distinction between programmers who regard their central aim as being to solve problems, and those who are more concerned with building and
understanding theories... If you are unsure to
which class you belong, then consider the following two statements.
(i) The point of solving problems is to understand computation better.
(ii) The point of understanding computation is to become better able to solve problems.
Here (i) corresponds to Computer Science and (ii) to Software Engineering.
> If you are unsure to which class you belong, then consider the following two statements. (i) The point of solving problems is to understand computation better. (ii) The point of understanding computation is to become better able to solve problems.
Like the original statement in the article, your paraphrasing strikes me as being egg and chicken. I think most mathematicians and hackers would sit on the fence.
Can someone please expand on this statement, "if one is counting solutions, inside a given set, to a linear equation, then it is enough, and usually easier, to estimate Fourier coefficients of the characteristic function of the set"
The characteristic function of a set is one in the set and zero outside [1]. Take fourier transform of this function. Fourier transform is a linear operator. Turn handle, bingo.
A few (non-rigourous) examples in ML/statistics are:
- The use of the LASSO estimator [0] (minimizer of |Y - β X|₂ + λ |β|₁) for variable selection is justified by Candes & Tao's 2005 results [1, 2] that show, under fairly general conditions, that this estimate exactly recovers the true set of active variables. A key step in the proof is finding a concentration of measure result for sub-Gaussian random matrices. A corollary is that (with high probability) we can exactly solve the non-convex problem |Y - β X|₂ + λ |β|₀ by passing to the convex relation |Y - β X|₂ + λ |β|₁ and have strong guarantees that our relaxed solution is optimal.
- The Johnson-Lindenstrauss lemma for dimensionality reduction, etc [3] can be proven by essentially the same principle (random projections are "probably" nearly isometries) that Prof Gowers describes in this note.
[0]: http://en.wikipedia.org/wiki/Lasso_regression#Lasso_method
[1]: http://authors.library.caltech.edu/10092/1/CANieeespm08.pdf
[2]: http://arxiv.org/pdf/math/0410542.pdf
[3]: http://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_l...