My concern was that the blog post ran on real-life, genuinely big data and outperformed everything on the market. That got me thinking: are we still not at the point on the problem-size/computing-power curve where distributed computing is justified?
Of course, distributed makes sense in a lot of places, especially on the web and wherever the problem is embarrassingly parallel, but running graph algorithms is a big chunk of the market, and it seems that for most firms a laptop will do.
I would wager that an appropriately-sized Spark, GraphLab, or Naiad cluster could outperform a laptop when running PageRank on this graph. Distributed graph processing research got stuck in a rut when the gold standards for evaluation were the Twitter crawl [1] and uk-2007-05 web graph [2]. Now that the Common Crawl has made a much bigger graph available, I hope that we'll see it used in more performance evaluations—with a COST baseline, of course!
I'd also wager that those Spark, GraphLab, or Naiad programs would be far from optimal, by at least an order of magnitude. My ideal would be to see someone attack the distributed graph processing problem with the same eye for systems performance issues that the TritonSort folks brought to the distributed sort problem [3] and MapReduce [4].
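For anyone unfamiliar with what a "COST baseline" means in practice, here is a rough sketch of one: a single-threaded PageRank over an in-memory edge list. This is my own toy illustration in Rust (function names, parameters, and the tiny graph are mine), not the blog's actual code:

    // A minimal sketch of a single-threaded, COST-style PageRank baseline.
    // Assumptions: the edge list of (src, dst) pairs fits in memory, and we
    // run a fixed number of iterations with the usual 0.85 damping factor.
    fn pagerank(edges: &[(u32, u32)], num_nodes: usize, iterations: usize) -> Vec<f32> {
        let alpha = 0.85f32;

        // Out-degree of each node, needed to split a node's rank among its edges.
        let mut degree = vec![0u32; num_nodes];
        for &(src, _) in edges {
            degree[src as usize] += 1;
        }

        let mut ranks = vec![1.0f32 / num_nodes as f32; num_nodes];
        for _ in 0..iterations {
            let mut next = vec![(1.0 - alpha) / num_nodes as f32; num_nodes];
            for &(src, dst) in edges {
                // Each node sends an equal share of its damped rank along each edge.
                next[dst as usize] += alpha * ranks[src as usize] / degree[src as usize] as f32;
            }
            ranks = next;
        }
        ranks
    }

    fn main() {
        // Tiny illustrative graph; a real run would stream edges from disk.
        let edges = vec![(0u32, 1u32), (1, 2), (2, 0), (2, 1)];
        let ranks = pagerank(&edges, 3, 20);
        println!("{:?}", ranks);
    }

The whole thing is a few dozen lines that scan memory sequentially, which is exactly why it is so hard for a cluster to beat on any graph that fits on one machine.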
Again, thanks for sharing.