I think your question is oriented towards X being a business problem.

Netflix has users (say 100M) who have liked some movies (out of, say, 100k). The question is: for every user, find movies they would like but have not seen yet.

The dataset in question is large, and answering this question means considering every user-movie pair (100M x 100k = 1e13 pairs). A problem of this size needs to be distributed across a cluster.
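To make the scale concrete, here is a back-of-the-envelope calculation. The user and movie counts are the illustrative ones from above; the 4 bytes per score (one float32 per pair) is my own assumption, not a figure from Netflix.

```python
# Illustrative counts from the comment above, not real Netflix figures.
users = 100_000_000   # 100M users
movies = 100_000      # 100k movies

# One candidate score per user-movie pair.
pairs = users * movies

# Assumption: each score is stored as a 4-byte float32.
bytes_per_score = 4
total_tb = pairs * bytes_per_score / 1e12  # decimal terabytes

print(f"{pairs:.0e} pairs, about {total_tb:.0f} TB if stored densely")
```

Even before any modelling, materialising a dense score matrix of that size is beyond a single machine, which is why the work has to be split across a cluster.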

Spark lets you express computations across this cluster, which lets you explore the problem. Spark also provides a fairly rich machine learning toolset [1]. It includes ALS-WR [2], which was developed specifically for a competition organised by Netflix and got great results [3].
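As a rough sketch of what this could look like, the snippet below uses the ALS estimator from PySpark's DataFrame API (pyspark.ml.recommendation.ALS) with implicit feedback, then drops movies the user has already liked. The toy LIKES data, the column names, the hyperparameters and the helper names (seen_by_user, drop_seen, recommend_unseen) are all hypothetical, and this is not a claim about how Netflix does it.

```python
from collections import defaultdict

# Hypothetical toy data: (user_id, movie_id) pairs meaning "user liked movie".
LIKES = [(0, 10), (0, 11), (1, 10), (1, 12), (2, 11), (2, 12), (2, 13)]


def seen_by_user(likes):
    """Map each user id to the set of movie ids that user already liked."""
    seen = defaultdict(set)
    for user, movie in likes:
        seen[user].add(movie)
    return dict(seen)


def drop_seen(ranked, seen, n):
    """Keep the first n movie ids in `ranked` that are not in `seen`."""
    return [m for m in ranked if m not in seen][:n]


def recommend_unseen(spark, likes, n=2):
    """Fit implicit-feedback ALS on `likes`; return {user: [unseen movie ids]}."""
    # Imported here so the pure helpers above work without Spark installed.
    from pyspark.ml.recommendation import ALS

    # Every observed like becomes a row with a constant preference of 1.0.
    df = spark.createDataFrame(
        [(u, m, 1.0) for u, m in likes], ["user", "movie", "liked"]
    )
    # Hyperparameters are arbitrary placeholders for a toy run.
    als = ALS(userCol="user", itemCol="movie", ratingCol="liked",
              implicitPrefs=True, rank=4, maxIter=5, regParam=0.1, seed=42)
    model = als.fit(df)

    seen = seen_by_user(likes)
    # Ask for enough candidates that n remain after removing liked movies.
    k = n + max(len(s) for s in seen.values())
    out = {}
    # collect() is fine for toy data only; at 100M users the filtering
    # would have to stay distributed (e.g. as a join) instead.
    for row in model.recommendForAllUsers(k).collect():
        ranked = [rec["movie"] for rec in row["recommendations"]]
        out[row["user"]] = drop_seen(ranked, seen.get(row["user"], set()), n)
    return out
```

With a SparkSession available as spark, calling recommend_unseen(spark, LIKES) should return a dict of up to two unseen movie ids per user. The filtering step is there because recommendForAllUsers ranks all movies, including ones the user has already liked.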

[1] http://spark.apache.org/docs/latest/mllib-guide.html

[2] http://spark.apache.org/docs/latest/mllib-collaborative-filt...

[3] http://www.grappa.univ-lille3.fr/~mary/cours/stats/centrale/...