
Statistical natural language processing is a prominent example of a field that has developed effective methods for coping with "black swans", or the occurrence of improbable events that have never been seen before.

If you take any large corpus of text and start counting how often different words occur, you will quickly notice that a huge fraction of the words have never been seen before or occur only once. Biblical scholars called these words hapax legomena instead of black swans. Methods for estimating the probabilities of these events go back to Alan Turing and his codebreaking work at Bletchley Park. One need not assign zero probability to unseen events a la maximum likelihood.
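To make that concrete, here is a minimal sketch of the simplest Good-Turing idea: estimate the total probability of all unseen words as the share of tokens that are words seen exactly once. The function names and the toy corpus are made up for illustration, and practical smoothing schemes refine this considerably.

```python
from collections import Counter


def good_turing_missing_mass(tokens):
    """Estimate the total probability of all unseen word types as N1 / N.

    N1 is the number of word types seen exactly once (hapax legomena),
    N is the total number of tokens.
    """
    counts = Counter(tokens)
    n_total = sum(counts.values())
    if n_total == 0:
        raise ValueError("need at least one token")
    n_once = sum(1 for c in counts.values() if c == 1)
    return n_once / n_total


def mle_probability(tokens, word):
    """Maximum-likelihood estimate: relative frequency, zero for unseen words."""
    counts = Counter(tokens)
    return counts[word] / len(tokens)


# Toy corpus, purely illustrative.
toy_corpus = "the cat sat on the mat and the dog sat on the log".split()

print(mle_probability(toy_corpus, "swan"))   # an unseen word gets probability zero
print(good_turing_missing_mass(toy_corpus))  # mass reserved for all unseen words
```

In the toy corpus 5 of the 13 tokens are hapaxes, so the estimate reserves 5/13 of the probability mass for words not yet seen, where maximum likelihood reserves none.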

Taleb's rants against VaR and Mediocristan have always seemed like borderline straw-man bashing to me. Sure, there are many fools who believe in the Gaussian or lognormal returns model, but the best knowledge in the field doesn't make these assumptions. Why doesn't he give authors like Bouchard and Potters, who have built on Mandelbrot's work, their due? Or does he?




I'm assuming you're referring to Jean-Philippe Bouchaud and Marc Potters.

Taleb does give them credit for understanding the presence of fat tails and for focussing on the failure of the Gaussian. However, both Bouchaud and Potters, coming from physics backgrounds, believe that the tail exponent can be calibrated accurately from a finite sample. In fact, estimating the tail exponent (the 'alpha' of a power-law process) is fraught with small-sample effects, and one cannot reliably make decisions even with an estimate of the tail exponent in hand. This is Taleb's main beef with the breed of physicists with power-law models.
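A rough way to see the small-sample problem for yourself: draw samples from an exact Pareto distribution with a known alpha and look at how widely a standard tail estimator scatters. The sketch below uses the Hill estimator, which is my choice for illustration and not necessarily what Bouchaud and Potters or Taleb use; the helper names, sample sizes and seed are arbitrary assumptions.

```python
import math
import random


def hill_tail_exponent(sample, k):
    """Hill estimate of the tail exponent alpha from the k largest observations."""
    if not 0 < k < len(sample):
        raise ValueError("k must satisfy 0 < k < len(sample)")
    ordered = sorted(sample, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest value
    log_excess = sum(math.log(x / threshold) for x in ordered[:k])
    return k / log_excess


def estimate_spread(alpha, n, k, trials, seed=0):
    """Return (min, max) of Hill estimates over repeated Pareto(alpha) samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample = [rng.paretovariate(alpha) for _ in range(n)]
        estimates.append(hill_tail_exponent(sample, k))
    return min(estimates), max(estimates)


# True alpha is 1.5 in both cases; only the amount of data differs.
small = estimate_spread(1.5, n=50, k=10, trials=200)
large = estimate_spread(1.5, n=5000, k=500, trials=50)
print("n=50:   estimates range over", small)
print("n=5000: estimates range over", large)
```

Even in this best case, where the data really are Pareto, the small-sample estimates scatter far more widely than the large-sample ones. Real return series are messier, and the estimate also depends on the choice of k.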

The best paper for understanding this is Weron 2001: "Levy-stable distributions revisited: Tail index > 2 does not exclude the Levy-stable regime" http://citeseer.ist.psu.edu/448515.html

More on my website if you're interested: navanitarakeri.com

A Black Swan is not just a rare event - it is defined as a rare event with _high impact_.

Statistical NLP may handle rare occurrences well (are you talking about smoothing?), but, by its very nature, a rare event in a block of text is not going to wipe out a library now, is it? So the appearance of a rare word in a block of text is hardly in the same class of problems as epidemics, wars, market crashes, etc.


Thanks for the interesting reply. Perhaps I'm guilty of not reading Taleb as carefully as I should have! I definitely respect the guy's technical chops - he was Mandelbrot's student.

I'll think a bit more critically the next time I see a straight line drawn on a log-log plot. I remember reading a sermon by Cosma Shalizi chastising physicists for making basic errors when estimating the exponent of power laws. I'm sure there are a lot of suspect results in the literature due to the error that you described :) maybe even a few bank failures.

Regarding the impact of rare words: misinterpreting certain rare words like "teratogenic" or "mesothelioma" could conceivably have a pretty high impact!



