Isn't computing on a single machine generally faster? Clustering is for handling more data than a single machine can hold, which means paying network costs. There are parallelization benefits, but CPUs have so many cores now.
No. At least, not in the case of graph computation. That was the conventional wisdom among some people, but when we did some measurements a year or so ago, the answer was no.
Edit: Reading the paper, they cite these measurements and present their work as "only [..] 8.8× slower than a distributed in-memory" system (GraM, a different system than timely dataflow). The quote is on the first page.
This. It can be faster per node because of removed IO and abstraction overheads, but perhaps not faster overall (and they allude to that fact). There are other side effects too, though - you're limited by how big the hardware in a single node can get, and by the cost/performance ratio of highly specialized hardware like the above versus commodity machines.
People also might want distributed systems for other reasons - fault tolerance is one. So as your SLA requirements go higher, you sometimes end up with little choice but to pay that cost. Other times you don't.
They do cite challenges with scaling this up, but I think quite a bit of their innovation is around just leveraging hardware better, which is a big enough deal on its own.
In general, with current systems, it's all a partitioning problem. Speed-up is a trade-off between the granularity of the compute kernel placed on a resource, the communication between said resources, latency at each level of the memory hierarchy (cache line utilization and reuse have a direct impact on this), and also (eek) protocol overheads between all the abstraction layers. Current systems are largely built around single-CPU abstractions with multi-CPU abstractions pasted on top. Working with accelerators, RDMA, etc. is quite a challenge, both for the programmer manually making things work and for systems developers beating the abstractions into submission.
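To make the cache-line point concrete, here's a minimal sketch in Rust (the array size and the multiplicative shuffle are arbitrary choices for illustration, not from any of the systems discussed): the same summation over the same data gets dramatically slower once the access order stops reusing cache lines.

    use std::time::Instant;

    fn main() {
        let n: usize = 1 << 24; // ~16M u64s, bigger than a typical last-level cache
        let data: Vec<u64> = (0..n as u64).collect();

        // Sequential order: each 64-byte cache line serves 8 consecutive u64s.
        let seq: Vec<usize> = (0..n).collect();
        // Pseudo-random order (an odd multiplier mod a power of two is a permutation):
        // most accesses land on a cold cache line.
        let rnd: Vec<usize> = (0..n).map(|i| i.wrapping_mul(2654435761) % n).collect();

        for (name, idx) in [("sequential", &seq), ("random", &rnd)] {
            let t = Instant::now();
            let mut sum = 0u64;
            for &i in idx {
                sum = sum.wrapping_add(data[i]);
            }
            println!("{name}: sum = {sum}, took {:?}", t.elapsed());
        }
    }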
But can we have a cluster of computers that looks like a single computer sharing the underlying hardware? That's basically just linking a remote device. I suppose a special kernel would have to be written as an overlay.
That's not quite the issue - SSI (which I didn't know about!) sounds cool, but it's a programming model that solves similar issues at a different abstraction layer. The question that remains unsolved, regardless of what abstraction you use, is whether the overhead of communication (latency, throughput) between nodes is large enough to make things annoying, and whether it's possible to work around it effectively. Stuff like InfiniBand, along with tech like RDMA, helps a lot with this, and hardcore database installations do use it.
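For a rough sense of scale, here's a back-of-envelope sketch; the latency figures are assumed ballpark numbers for illustration, not measurements of any particular fabric.

    fn main() {
        // Assumed, order-of-magnitude latencies in nanoseconds - not measured values.
        let dram_ns = 100.0;     // local DRAM access
        let rdma_ns = 2_000.0;   // one-sided RDMA read within a rack
        let tcp_ns = 100_000.0;  // kernel TCP round trip within a datacenter

        println!("local DRAM accesses per RDMA read:      ~{:.0}", rdma_ns / dram_ns);
        println!("local DRAM accesses per TCP round trip: ~{:.0}", tcp_ns / dram_ns);
    }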
"Efficient processing of large graphs is challenging. Graph
algorithms often exhibit poor locality of memory access, very
little work per vertex, and a changing degree of parallelism
over the course of execution [31, 39]. Distribution over many
machines exacerbates the locality issue, and increases the
probability that a machine will fail during computation."
Many graph algorithms benefit hugely from locality of reference, because graphs have an inherent notion of things being affected by the things near them. Taking advantage of locality of reference can make a difference of many orders of magnitude in speed, in my experience.
If you're trying to divide up the work by splitting the graph into parts that run on different nodes, where do you cut it? Even finding a good cut is a graph algorithm in itself. And any information that flows across that cut has to be serialized and deserialized, which is comparatively slow.
Meanwhile, a single-node algorithm is usually hitting results that are in RAM or even in cache.
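As a concrete illustration of that last point, here's a minimal sketch of a CSR (compressed sparse row) layout: a pass over a vertex's neighbors is a contiguous read out of one array rather than pointer chasing or a message across a cut. The tiny graph is made up for illustration.

    // CSR: one offsets array plus one packed array of neighbor ids.
    struct Csr {
        offsets: Vec<usize>, // offsets[v]..offsets[v+1] indexes into targets
        targets: Vec<u32>,   // neighbor ids, stored contiguously per vertex
    }

    impl Csr {
        fn neighbors(&self, v: usize) -> &[u32] {
            &self.targets[self.offsets[v]..self.offsets[v + 1]]
        }
    }

    fn main() {
        // Toy graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
        let g = Csr {
            offsets: vec![0, 2, 3, 4],
            targets: vec![1, 2, 2, 0],
        };
        // One pass over all vertices is a linear scan of two arrays - cache-friendly.
        for v in 0..g.offsets.len() - 1 {
            println!("vertex {} has {} out-neighbors", v, g.neighbors(v).len());
        }
    }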
From what I remember, the graph algorithms that usually run on supercomputers can be parallelized well... just not on GPUs, because they tend to be very branchy.
Very interesting paper, but the experimental section confuses me. They compare their system, which preprocesses the graph into the tile structure (a compressed representation), against in-memory systems, none of which use a compressed representation. Is this comparison really apples-to-apples? If you can reduce the in-memory size of the graph by 60%, then I would expect an algorithm running on the compressed version to be faster than the same algorithm running on the uncompressed version (assuming that decoding the compressed adjacency list is significantly cheaper than the cost of a cache miss).
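For context on what the compression costs to undo, here's a generic delta-plus-varint sketch of adjacency-list compression (my own illustration, not the paper's tile format): decoding is a handful of ALU operations per neighbor, which is usually much cheaper than a single cache miss.

    // Delta + varint coding of a sorted adjacency list (generic sketch, not the paper's format).
    fn encode(sorted_neighbors: &[u32]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut prev = 0u32;
        for &v in sorted_neighbors {
            let mut delta = v - prev; // gaps are small under locality-friendly vertex orderings
            prev = v;
            loop {
                let byte = (delta & 0x7f) as u8;
                delta >>= 7;
                if delta == 0 { out.push(byte); break; }
                out.push(byte | 0x80); // set the continuation bit
            }
        }
        out
    }

    fn decode(bytes: &[u8]) -> Vec<u32> {
        let (mut out, mut prev, mut cur, mut shift) = (Vec::new(), 0u32, 0u32, 0);
        for &b in bytes {
            cur |= ((b & 0x7f) as u32) << shift;
            if b & 0x80 == 0 {
                prev += cur; // undo the delta
                out.push(prev);
                cur = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        out
    }

    fn main() {
        let neighbors: Vec<u32> = vec![3, 7, 8, 150, 151, 4096];
        let packed = encode(&neighbors);
        assert_eq!(decode(&packed), neighbors);
        println!("{} neighbors in {} bytes (vs {} bytes uncompressed)",
                 neighbors.len(), packed.len(), neighbors.len() * 4);
    }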