Isn't computing on a single machine generally faster? Clustering is for handling more data than a single machine can hold, which means paying network costs. There are parallelization benefits, but CPUs have so many cores now.
No. At least, not in the case of graph computation. That was the conventional wisdom among some people, but when we did some measurements a year or so ago, the answer was no.
Edit: Reading the paper, they cite these measurements and present their work as "only [..] 8.8× slower than a distributed in-memory" system (GraM, a different system than timely dataflow). The quote is on the first page.
This. It can be faster per node because of removed IO and abstraction overheads, but perhaps not faster overall (and they allude to that fact). There are other side effects too, though - you're limited by how big the hardware in a single node can get, and by the cost/performance ratio of highly specialized hardware like the above versus commodity machines.
People also might want distributed systems for other reasons - fault tolerance is one. So as your SLA requirements go higher, you sometimes end up with little choice but to pay that cost. Other times you don't.
They do cite challenges with scaling this up, but I think quite a bit of their innovation is around just leveraging hardware better, which is a big enough deal on its own.
In general, with current systems, it's all a partitioning problem. Speed-up is a trade-off between the granularity of the compute kernel placed on a resource, the communication between said resources, latency at each level of the memory hierarchy (cache line utilization and reuse have a direct impact on this), and also (eek) protocol overheads between all the abstraction layers. Current systems are largely built around single-CPU abstractions with multi-CPU abstractions pasted on top. Working with accelerators, RDMA, etc. is quite a challenge, both for the programmer manually making things work and for systems developers beating the abstractions into submission.
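To make the cache-line point concrete, here's a minimal sketch in Rust (the array size and the multiplicative shuffle are arbitrary choices for illustration, not from any of the systems discussed): the same summation over the same data gets dramatically slower once the access order stops reusing cache lines.

    use std::time::Instant;

    fn main() {
        let n: usize = 1 << 24; // ~16M u64s, bigger than a typical last-level cache
        let data: Vec<u64> = (0..n as u64).collect();

        // Sequential order: each 64-byte cache line serves 8 consecutive u64s.
        let seq: Vec<usize> = (0..n).collect();
        // Pseudo-random order (an odd multiplier mod a power of two is a permutation):
        // most accesses land on a cold cache line.
        let rnd: Vec<usize> = (0..n).map(|i| i.wrapping_mul(2654435761) % n).collect();

        for (name, idx) in [("sequential", &seq), ("random", &rnd)] {
            let t = Instant::now();
            let mut sum = 0u64;
            for &i in idx {
                sum = sum.wrapping_add(data[i]);
            }
            println!("{name}: sum = {sum}, took {:?}", t.elapsed());
        }
    }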
But can we have a cluster of computers that looks like a single computer sharing the underlying hardware? That's basically just linking a remote device. I suppose a special kernel would have to be written as an overlay.
That's not quite the issue - SSI (which I didn't know about!) sounds cool, but it's a programming model that solves similar issues at a different abstraction layer. The question that remains unsolved, regardless of what abstraction you use, is whether the overhead of communication (latency, throughput) between nodes is large enough to make things annoying, and whether it's possible to work around it effectively. Stuff like InfiniBand, along with tech like RDMA, helps a lot with this, and hardcore database installations do use it.
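For a rough sense of scale, here's a back-of-envelope sketch; the latency figures are assumed ballpark numbers for illustration, not measurements of any particular fabric.

    fn main() {
        // Assumed, order-of-magnitude latencies in nanoseconds - not measured values.
        let dram_ns = 100.0;     // local DRAM access
        let rdma_ns = 2_000.0;   // one-sided RDMA read within a rack
        let tcp_ns = 100_000.0;  // kernel TCP round trip within a datacenter

        println!("local DRAM accesses per RDMA read:      ~{:.0}", rdma_ns / dram_ns);
        println!("local DRAM accesses per TCP round trip: ~{:.0}", tcp_ns / dram_ns);
    }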
"Efficient processing of large graphs is challenging. Graph
algorithms often exhibit poor locality of memory access, very
little work per vertex, and a changing degree of parallelism
over the course of execution [31, 39]. Distribution over many
machines exacerbates the locality issue, and increases the
probability that a machine will fail during computation."
Many graph algorithms benefit hugely from locality of reference, because graphs have an inherent notion of things being affected by the things near them. Taking advantage of locality of reference can make a difference of many orders of magnitude in speed, in my experience.
If you're trying to divide up the work by splitting the graph into parts that run on different nodes, where do you cut it? Even finding a good cut is a graph algorithm in itself. And any information that flows across that cut has to be serialized and deserialized, which is comparatively slow.
Meanwhile, a single-node algorithm is usually hitting results that are in RAM or even in cache.
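As a concrete illustration of that last point, here's a minimal sketch of a CSR (compressed sparse row) layout: a pass over a vertex's neighbors is a contiguous read out of one array rather than pointer chasing or a message across a cut. The tiny graph is made up for illustration.

    // CSR: one offsets array plus one packed array of neighbor ids.
    struct Csr {
        offsets: Vec<usize>, // offsets[v]..offsets[v+1] indexes into targets
        targets: Vec<u32>,   // neighbor ids, stored contiguously per vertex
    }

    impl Csr {
        fn neighbors(&self, v: usize) -> &[u32] {
            &self.targets[self.offsets[v]..self.offsets[v + 1]]
        }
    }

    fn main() {
        // Toy graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
        let g = Csr {
            offsets: vec![0, 2, 3, 4],
            targets: vec![1, 2, 2, 0],
        };
        // One pass over all vertices is a linear scan of two arrays - cache-friendly.
        for v in 0..g.offsets.len() - 1 {
            println!("vertex {} has {} out-neighbors", v, g.neighbors(v).len());
        }
    }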
From what I remember, the graph algorithms that usually run on supercomputers can be parallelized well... just not on GPUs, because they tend to be very branchy.
Very interesting paper, but the experimental section confuses me. They compare their system, which preprocesses the graph into the tile structure (a compressed representation), against in-memory systems, none of which use a compressed representation. Is this comparison really apples-to-apples? If you can reduce the in-memory size of the graph by 60%, then I would expect an algorithm running on the compressed version to be faster than the same algorithm running on the uncompressed version (assuming that decoding the compressed adjacency list is significantly cheaper than the cost of a cache miss).
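For context on what the compression costs to undo, here's a generic delta-plus-varint sketch of adjacency-list compression (my own illustration, not the paper's tile format): decoding is a handful of ALU operations per neighbor, which is usually much cheaper than a single cache miss.

    // Delta + varint coding of a sorted adjacency list (generic sketch, not the paper's format).
    fn encode(sorted_neighbors: &[u32]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut prev = 0u32;
        for &v in sorted_neighbors {
            let mut delta = v - prev; // gaps are small under locality-friendly vertex orderings
            prev = v;
            loop {
                let byte = (delta & 0x7f) as u8;
                delta >>= 7;
                if delta == 0 { out.push(byte); break; }
                out.push(byte | 0x80); // set the continuation bit
            }
        }
        out
    }

    fn decode(bytes: &[u8]) -> Vec<u32> {
        let (mut out, mut prev, mut cur, mut shift) = (Vec::new(), 0u32, 0u32, 0);
        for &b in bytes {
            cur |= ((b & 0x7f) as u32) << shift;
            if b & 0x80 == 0 {
                prev += cur; // undo the delta
                out.push(prev);
                cur = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        out
    }

    fn main() {
        let neighbors: Vec<u32> = vec![3, 7, 8, 150, 151, 4096];
        let packed = encode(&neighbors);
        assert_eq!(decode(&packed), neighbors);
        println!("{} neighbors in {} bytes (vs {} bytes uncompressed)",
                 neighbors.len(), packed.len(), neighbors.len() * 4);
    }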