
If you have a background in IT, and are modestly comfortable with statistics, the following foundational paper on algorithms to reconstruct the tree of life from genetic data should actually be understandable: http://science.sciencemag.org/content/311/5765/1283

The statistics really only ever amount to Occam’s Razor, i.e. fewer differences in the genome mean a closer relationship.

Edit: actually free full-text link: http://bioinformatics.bio.uu.nl/pdf/Ciccarelli.s06-311.pdf




I am a statistician (although as I confessed nowhere near computational biology), so, uh, yeah this is up my alley.


One more thing: Jared Diamond (of “Guns, Germs, and Steel” fame) transferred the exact same method to linguistics, using it to trace the ancestry of Pacific Islanders from features of their respective languages instead of letters of DNA. The result is both a tree of their (cultural) ancestry and a map of their migration/expansion from island to island.


Yet another thing: Julien d’Huy uses these methods to analyze folk literature and myths. Very, very interesting research.


Excellent! I believe I remember a far longer version of the paper, and interactive widgets to explore the tree. Couldn’t find those right now and am on the road, unfortunately.

In any case, the process consists of basically two steps:

“Align” the DNA of all the species you have sequenced. That’s done using algorithms such as Smith-Waterman, which find the best-scoring alignment, roughly the one that minimizes the number of edits needed to go from one species’ version of a gene to another.
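To make the "number of edits" idea concrete, here is a minimal sketch of plain edit distance computed by dynamic programming. It is not Smith-Waterman itself (which scores local alignments), just the simplest relative of it, and the function name `edit_distance` and the toy sequences are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Fewest single-letter insertions, deletions, and substitutions turning a into b."""
    # prev[j] = distance between the letters of a processed so far and the first j letters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i letters of a into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # keep a match or substitute
            ))
        prev = curr
    return prev[-1]


# Toy sequences, not real genes.
print(edit_distance("GATTACA", "GACTATA"))  # two substitutions, so this prints 2
```

Real aligners work on the same table-filling principle but use richer scoring (gap penalties, substitution scores) and heuristics to cope with genome-sized inputs.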

That number defines a metric that measures the distance between every pair of species. So orangutan to Homo sapiens may be, say, “450”, while Homo sapiens to Ficus benjamina is, say, “12321”.
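As a rough illustration of how pairwise numbers like these end up in one table, the sketch below counts mismatched columns between sequences that are assumed to be aligned already. The sequences are invented toy strings, and `mismatches` and `distance_matrix` are hypothetical helper names.

```python
from itertools import combinations

# Hypothetical, already-aligned toy sequences ('-' marks a gap); not real genomic data.
aligned = {
    "human":     "ACGTTGCA-A",
    "orangutan": "ACGTTGCATA",
    "ficus":     "TCGAAGC--G",
}


def mismatches(x: str, y: str) -> int:
    """Count alignment columns where the two sequences differ (gap vs. letter counts too)."""
    return sum(1 for cx, cy in zip(x, y) if cx != cy)


def distance_matrix(seqs: dict) -> dict:
    """Symmetric dict-of-dicts with one distance per pair of species."""
    dist = {name: {name: 0} for name in seqs}
    for a, b in combinations(seqs, 2):
        dist[a][b] = dist[b][a] = mismatches(seqs[a], seqs[b])
    return dist


dist = distance_matrix(aligned)
for name, row in dist.items():
    print(name, row)
```

Only half of the table carries information, since the distance from A to B equals the distance from B to A.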

(The difference may also be measured on the level of protein sequences. The process is basically the same, except that edits may be assigned different distance scores, because some result in more functional differences, while others are functionally “silent”. Evolutionary pressure would make the former rarer than the latter.)
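A small sketch of what "different distance scores" could look like: swapping an amino acid for a chemically similar one costs less than swapping it for a dissimilar one. The grouping and the costs 0.5 and 1.0 are illustrative assumptions, not a published substitution matrix such as BLOSUM or PAM, and the peptides are made up.

```python
# Toy grouping of amino acids by broad chemical character (an assumption for illustration).
GROUPS = [set("AVLIM"), set("FWY"), set("ST"), set("DE"), set("KRH"), set("NQ")]


def substitution_cost(x: str, y: str) -> float:
    """0 for identical residues, 0.5 within a group, 1.0 across groups."""
    if x == y:
        return 0.0
    if any(x in group and y in group for group in GROUPS):
        return 0.5
    return 1.0


def weighted_distance(p: str, q: str) -> float:
    """Sum of substitution costs over two equal-length, already-aligned protein sequences."""
    if len(p) != len(q):
        raise ValueError("sequences must be aligned to the same length")
    return sum(substitution_cost(x, y) for x, y in zip(p, q))


# Three conservative swaps (K/R, L/I, D/E) cost less than two radical ones (K/D, D/K).
print(weighted_distance("MKVLD", "MRVIE"))  # 1.5
print(weighted_distance("MKVLD", "MDVLK"))  # 2.0
```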

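The metric is unitless, i.e. it doesn’t allow translation into, say, “years since species diverged”. But it should fulfill the triangle inequality: the distance from A to C is never larger than the distance from A to B plus the distance from B to C.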
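A quick way to sanity-check that property on a small table of distances is to test every triple. In this sketch the 450 and 12321 come from the example numbers above, while the orangutan-to-ficus values (12400 and 13000) are invented, and `triangle_violations` and `symmetric` are hypothetical helper names.

```python
from itertools import permutations


def triangle_violations(dist: dict) -> list:
    """List every (a, b, c) where the direct distance a-c exceeds the detour a-b-c."""
    return [
        (a, b, c)
        for a, b, c in permutations(dist, 3)
        if dist[a][c] > dist[a][b] + dist[b][c]
    ]


def symmetric(pairs: dict) -> dict:
    """Build a symmetric dict-of-dicts from {(a, b): distance}."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {a: 0})[b] = d
        dist.setdefault(b, {b: 0})[a] = d
    return dist


ok = symmetric({
    ("orangutan", "human"): 450,
    ("human", "ficus"): 12321,
    ("orangutan", "ficus"): 12400,  # invented value
})
bad = symmetric({
    ("orangutan", "human"): 450,
    ("human", "ficus"): 12321,
    ("orangutan", "ficus"): 13000,  # invented value, larger than 450 + 12321
})

print(triangle_violations(ok))   # []
print(triangle_violations(bad))  # the orangutan-ficus distance beats the detour via human
```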

Once you have a (triangular) matrix with all the differences, all that’s needed is to reconstruct a binary tree that maximizes plausibility, i.e. minimizes the sum of differences at each branching point.
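One of the simplest ways to turn such a matrix into a binary tree is UPGMA: repeatedly merge the two closest clusters and average their distances to everything else. This is a greedy heuristic chosen here for brevity and is not claimed to be the method of the linked paper; the function name `upgma` and the distances between the hypothetical taxa A to D are made up.

```python
def upgma(names: list, dist: list):
    """Build a binary tree (nested tuples) from a symmetric distance matrix.

    names: leaf labels; dist[i][j]: distance between names[i] and names[j].
    """
    # Active clusters: id -> (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            # Distance to the merged cluster is the size-weighted average.
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            d[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del d[(i, j)]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical taxa: A and B are close, C is further out, D is the most distant.
names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, dist))  # ('D', ('C', ('A', 'B')))
```

Methods used in practice (neighbor joining, maximum likelihood, Bayesian inference) drop UPGMA's assumption that all lineages change at the same rate, but the input and output have the same shape: distances or aligned sequences in, a binary tree out.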



