I wish everyone would stop saying that PCA is about eigenvectors and eigenvalues... it's about singular vectors and singular values. The best way to compute PCA is NOT to compute the eigenvalues of A'A. There are specialised SVD algorithms that should be used instead:

https://en.wikipedia.org/wiki/Singular_value_decomposition#C...

Randomised methods have also become popular in recent years:

https://code.google.com/p/redsvd/

http://arxiv.org/abs/0909.4061
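For the curious, the randomised algorithm from the arXiv paper above fits in a few lines of numpy. This is a minimal sketch, with the oversampling and power-iteration counts picked as illustrative defaults rather than tuned values:

    import numpy as np

    def randomized_svd(A, k, oversample=10, n_power_iter=2, seed=0):
        """Approximate rank-k SVD via random range sampling (Halko et al. style)."""
        rng = np.random.default_rng(seed)
        # A Gaussian test matrix samples the column space of A.
        Y = A @ rng.standard_normal((A.shape[1], k + oversample))
        # A few power iterations sharpen the basis when the singular values
        # decay slowly; re-orthonormalise each round for stability.
        for _ in range(n_power_iter):
            Q, _ = np.linalg.qr(Y)
            Y = A @ (A.T @ Q)
        Q, _ = np.linalg.qr(Y)          # orthonormal basis for the range of A
        B = Q.T @ A                     # small (k+oversample) x n problem
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        return (Q @ Ub)[:, :k], s[:k], Vt[:k]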
This is one case where the machine learning community takes a numerical analysis idea and, by giving it a different name, loses some of the insights associated with the original name. Normally I don't care if they call it "training" instead of "line search" or "features" instead of "dimensions", but PCA should use better algorithms. Instead of telling everyone to use the inferior approach of running a generic eigenvalue solver on A'A, point them at the specialised methods.
I know that finding the eigenvalues of A'A is "good enough", and I guess "eigenvalue" is already a big enough word for most programmers, but there's no reason not to put the better methods in specialised libraries and make those libraries available. If nothing else, LAPACK and ARPACK have bindings for almost every language.
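To make it concrete, here's a sketch of truncated PCA through scipy's ARPACK-backed svds, so A'A is never formed; the data and the choice of k are just placeholders:

    import numpy as np
    from scipy.sparse.linalg import svds  # ARPACK-backed truncated SVD

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 50))   # toy data, one sample per row
    Xc = X - X.mean(axis=0)               # centre the columns first

    # Top-5 singular triplets of the data matrix itself.
    _, s, Vt = svds(Xc, k=5)
    order = np.argsort(s)[::-1]           # ARPACK returns ascending order
    s, Vt = s[order], Vt[order]

    components = Vt                       # principal axes, one per row
    scores = Xc @ Vt.T                    # data projected onto those axes
    explained_variance = s**2 / (Xc.shape[0] - 1)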
N-grams are also regularly used in bioinformatics. For example, you can split a genome sequence into words of length k, called 'k-mers'. A k-mer frequency table works as a heuristic wherever a full sequence analysis is too computationally expensive: if the k-mer tables of genomes A and B have a smaller Euclidean distance than those of A and C, you might assume that A and B are more closely related in evolutionary terms. That is just one example; k-mers are widely used throughout bioinformatics. For instance, the Wikipedia entry for Velvet describes how they can be used for genome assembly.
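A toy sketch of that distance heuristic in Python (plain dicts and made-up sequences; a real pipeline would use canonical k-mers and proper sparse counters):

    from collections import Counter
    from math import dist  # Euclidean distance, Python 3.8+

    def kmer_table(seq, k=4):
        """Frequencies of all length-k substrings, normalised to sum to 1."""
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        total = sum(counts.values())
        return {kmer: c / total for kmer, c in counts.items()}

    def kmer_distance(a, b, k=4):
        """Euclidean distance between two sequences' k-mer tables."""
        ta, tb = kmer_table(a, k), kmer_table(b, k)
        keys = set(ta) | set(tb)
        return dist([ta.get(x, 0.0) for x in keys],
                    [tb.get(x, 0.0) for x in keys])

    # A and B share composition, C does not, so A-B comes out closer.
    A = "ACGTACGTACGTTGCA" * 10
    B = "ACGTACGTTGCAACGT" * 10
    C = "GGGGCCCCGGGGCCCC" * 10
    print(kmer_distance(A, B), kmer_distance(A, C))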
Have you by any chance looked at applying NLP techniques, like generating sentence vectors with recurrent neural networks, to this type of data? Mappings over two or more adjacent tokens (n-grams) have been used successfully in NLP for a while, but they know nothing about context outside the current n-gram.
Recent work has involved creating word and sentence vectors that capture much richer dynamics, and this is where a lot of the excitement around deep learning for text is coming from (text of any language, and maybe even computer code, as it turns out).
I've been wanting to try applying these techniques to this type of problem, whether for clustering (which was your example), classification, regression, or sample generation (e.g. of novel but plausible sequences).
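To sketch what I mean by folding a whole sequence into one vector, here's a toy numpy RNN encoder; the weights are random stand-ins for what would actually be trained end to end, and tokenising by k-mer is just one plausible choice:

    import numpy as np

    def rnn_sequence_vector(tokens, vocab, hidden=16, seed=0):
        """Toy RNN encoder: fold a token sequence into one fixed-size vector."""
        rng = np.random.default_rng(seed)
        E = rng.standard_normal((len(vocab), hidden)) * 0.1  # token embeddings
        W = rng.standard_normal((hidden, hidden)) * 0.1      # recurrent weights
        h = np.zeros(hidden)
        for t in tokens:
            # Unlike an n-gram table, the state h carries context from
            # everything seen so far, not just the current window.
            h = np.tanh(E[vocab[t]] + W @ h)
        return h  # the "sentence vector" for the whole sequence

    seq = "ACGTACGTTGCA"
    k = 3
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    vocab = {km: i for i, km in enumerate(dict.fromkeys(kmers))}
    vec = rnn_sequence_vector(kmers, vocab)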
If you or anyone here has an interest in this sort of thing on the bio side, please contact me (email in profile).
Coming from OCaml, I'm jealous of what I see of the F# standard library. It looks like there is a lot of nice built-in stuff. The different case conventions make it a bit jarring, though.