
When I was working on a recommender for television shows, I ran SVD on a large User/Item matrix to create a low rank approximation, essentially reducing thousands of user features (TV show preferences) to user vectors representing twenty or thirty abstract "features". Then I looked at the actual item preferences of users who expressed each feature at the greatest and least magnitude. The features, in some cases, mapped to recognizable constructs. There were distinct masculine and feminine features, several obvious Hispanic / Latino elements, and strong liberal versus conservative indicators. Others were less explainable using common labels.
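
In case it helps make the setup concrete, here's a toy sketch of that kind of analysis (random placeholder data, an illustrative rank of 30, and an arbitrary feature index; not the actual system or data):

    import numpy as np
    from scipy.sparse.linalg import svds

    rng = np.random.default_rng(0)
    n_users, n_shows, k = 5000, 1000, 30

    # Stand-in for the real user/show preference matrix (e.g. viewing counts).
    X = rng.poisson(0.2, size=(n_users, n_shows)).astype(float)

    # svds returns the k largest singular triplets; columns of U are per-user
    # loadings on each latent feature, rows of Vt are per-show loadings.
    U, s, Vt = svds(X, k=k)
    order = np.argsort(s)[::-1]          # svds returns singular values ascending
    U, s, Vt = U[:, order], s[order], Vt[order, :]

    feature = 12                         # pick one latent feature to inspect
    top_users = np.argsort(U[:, feature])[-20:]        # strongest positive expression
    bottom_users = np.argsort(U[:, feature])[:20]       # strongest negative expression
    top_shows = np.argsort(np.abs(Vt[feature]))[-10:]   # shows that define the feature

Looking at the actual viewing histories of the top_users and bottom_users groups for each feature is what surfaced the recognizable constructs.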

It struck me at the time that the qualities that were expressed most strongly were the ones that ended up having names in our language. But there were others for which I would say to myself, there is something about this group (e.g. those with the greatest expressed value of F124) that I recognize, but can't quite put my finger on.

Of course, I was looking at people through a keyhole, their TV viewing preferences being the only information I had.

Also, I noticed that these "came into focus" most clearly at a certain level of compression (rank).

FWIW




I started reading the essay not knowing what to think, and it turned out to be more relevant to my work than I thought.

The issues discussed in the essay have been central to some areas of psychology and the behavioral sciences for a long time: how to interpret components such as these.

One thought about your "coming into focus at a certain level of compression" comment: I've done some analyses of these vectors as applied to text samples, and one thing that struck me was how poorly some of them replicated across datasets that are ostensibly similar (but not identical). Others, in contrast, reappeared across multiple corpora. To the extent that some of these components represent "real" features, they should reappear consistently across different datasets where you'd expect them to. That is, they should be robust to the idiosyncratic features of any one dataset.
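
For what it's worth, here's a rough sketch of how one might check that, assuming you have the k x n_features loading matrices (Vt) from two decompositions over a shared feature space; the function name and setup are just illustrative:

    import numpy as np

    def match_components(Vt_a, Vt_b):
        # Normalize each component, then match by absolute cosine similarity,
        # since the sign of a singular vector is arbitrary.
        A = Vt_a / np.linalg.norm(Vt_a, axis=1, keepdims=True)
        B = Vt_b / np.linalg.norm(Vt_b, axis=1, keepdims=True)
        sim = np.abs(A @ B.T)          # pairwise |cosine| between components
        best = sim.argmax(axis=1)      # best match in B for each component of A
        return best, sim[np.arange(len(best)), best]

    # Components whose best-match similarity is near 1.0 look like they
    # replicate; components that match nothing well may be idiosyncrasies
    # of one dataset.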


Did you ever compare that focus against a plot of the singular values?


It's a good question. FWIW, I would expect a reasonably sharp "L"-shaped curve in the focus. The assumption there, I guess, is that this metric of 'focus' is something well characterized by the low-frequency-type basis vectors given by the leading columns of the SVD's U and V.
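
Something like the following is what I have in mind, with a random stand-in for the user/item matrix and an illustrative rank marker; the elbow of the curve is roughly where I'd expect the 'focus' to peak:

    import numpy as np
    import matplotlib.pyplot as plt

    # Random stand-in for the user/item matrix discussed upthread.
    X = np.random.default_rng(0).poisson(0.2, size=(2000, 500)).astype(float)

    s = np.linalg.svd(X, compute_uv=False)   # full spectrum, descending order

    plt.semilogy(np.arange(1, len(s) + 1), s, marker=".")
    plt.axvline(30, linestyle="--", label="illustrative rank cutoff")
    plt.xlabel("component index")
    plt.ylabel("singular value (log scale)")
    plt.legend()
    plt.show()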


Exactly what I saw. Your expectation is correct.


Computers have us figured out in a way that we ourselves don't.


A question: is this much better / different than a principal component analysis (or a factor analysis)?


It's a bit apples-and-oranges to compare SVD to PCA. SVD is a numerical technique, whereas PCA is a method for analyzing a dataset. You can use SVD to perform PCA (although there are ways to perform PCA without explicitly doing an SVD). I'm guessing that the GP performed PCA using SVD. There's a good Stack Exchange answer to exactly this question here:

http://stats.stackexchange.com/questions/121162/is-there-any...


One way to do PCA is to use SVD to find the eigenvectors of the covariance matrix and project your data onto them, so the two are closely related.
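
A minimal sketch of that equivalence, with random placeholder data: center the columns, take the SVD, and the right singular vectors are the principal axes.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))            # placeholder data matrix

    Xc = X - X.mean(axis=0)                   # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    explained_variance = s**2 / (len(X) - 1)  # eigenvalues of the covariance matrix
    scores = Xc @ Vt.T                        # principal component scores (= U * s)

    # Cross-check against an explicit eigendecomposition of the covariance matrix.
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
    assert np.allclose(explained_variance, eigvals)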



