I don't know much about graph neural networks although it is a topic that I want to study in the next few months.
But what bothered me in your article is what you wrote about graph data structures.
NetworkX is indeed very slow, and this is due to two facts:
- NetworkX is a pure Python implementation and does not rely on routines written in a faster language like C.
- It uses dictionaries to represent graphs, which may have some advantages when mutating a graph, but they of course have much worse locality than an adjacency list or some sparse matrix format (a rough sketch of the difference is below).
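To make the locality point concrete, here is a minimal sketch (ordinary NetworkX and SciPy calls on a made-up toy edge list) contrasting the dict-of-dicts adjacency NetworkX exposes with a contiguous CSR matrix built from the same edges:

```python
import networkx as nx
import numpy as np
from scipy.sparse import csr_matrix

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# NetworkX: adjacency is a dict of dicts, so each neighbor lookup
# chases pointers through hash tables scattered around the heap.
G = nx.Graph(edges)
print(dict(G.adj[2]))                        # {0: {}, 1: {}, 3: {}}

# CSR: the same structure packed into contiguous arrays, so scanning
# a row (one vertex's neighborhood) is cache friendly.
row, col = np.array(edges).T
n = 4
A = csr_matrix((np.ones(len(edges)), (row, col)), shape=(n, n))
A = (A + A.T).tocsr()                        # symmetrize for an undirected graph
print(A.indices[A.indptr[2]:A.indptr[3]])    # neighbors of vertex 2 -> [0 1 3]
```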
But even for Python there are much faster libraries, such as igraph. The underlying data structure in igraph is an edge list.
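For instance, with python-igraph (same toy edges; the exact ordering of the neighbor list may differ):

```python
import igraph as ig

# igraph keeps the graph as a flat edge list in C, so building and
# querying it avoids Python-level per-node dictionaries entirely.
g = ig.Graph(edges=[(0, 1), (0, 2), (1, 2), (2, 3)])
print(g.neighbors(2))    # e.g. [0, 1, 3]
print(g.degree())        # [2, 2, 3, 1]
```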
A lot of single-core graph libraries use an adjacency list internally, and while it is true that the list of neighbor indices for each vertex can live somewhere arbitrary in memory, they usually do not behave like a linked list unless your graph is really sparse.
One of the most used operations in graph algorithms is to iterate over the neighbors of a vertex, and for this, adjacency lists are very good.
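To be concrete about what "adjacency list" means here, a plain Python sketch of the arrays-of-arrays layout (not a linked structure):

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Each vertex owns one contiguous block of neighbor IDs.
adj = [[] for _ in range(n)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)
adj = [np.asarray(nbrs) for nbrs in adj]   # pack each neighbor list contiguously

for w in adj[2]:       # the hot loop in most graph algorithms: a linear scan
    pass               # visit / relax / accumulate over neighbor w
```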
They also have a small advantage over CSR matrices for adding or removing edges, and they might use slightly less memory, as their index type only needs to be able to index all vertices and not all edges, so the indices need half the space, which is better for caches.
The index type for CSR need only index nodes. It's the offsets (one per row) that need to index edges.
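In SciPy terms, with a toy directed graph (nothing here is specific to any library beyond scipy.sparse):

```python
import numpy as np
from scipy.sparse import csr_matrix

edges = np.array([(0, 1), (0, 2), (1, 2), (2, 3)])
n = 4
A = csr_matrix((np.ones(len(edges)), (edges[:, 0], edges[:, 1])), shape=(n, n))

# indptr: n + 1 row offsets into the edge arrays -> must be able to index edges.
# indices: one entry per edge holding a column (vertex) ID -> vertex-sized type.
print(A.indptr)     # [0 2 3 4 4]
print(A.indices)    # [1 2 2 3]
print(A.indptr.dtype, A.indices.dtype)   # typically int32 for a graph this small
```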
The blog's portrayal of NVMe vs distributed memory is a bit off:
> With modern NVMe drives random seeks aren’t slow anymore, much faster than distributed network calls like you do when scaling the linked list-based graph.
Distributed-memory machines have ~1 microsecond off-node latency and 15-20 microsecond reductions (across thousands of nodes). These latencies have been pretty flat for the past 20 years in supercomputing, and have been widely available at cloud providers for the past few years. NVMe is not as fast as it's made out to be. https://twitter.com/SwiftOnSecurity/status/99079794835421184...
The other issue is that NVMe is more expensive than DRAM or HBM when you care about bandwidth (and graph/network analysis is almost always memory bandwidth limited, even on HBM). Suppose you need to do 1000 traversals of a 1 TB data structure. This takes about 10 seconds with 100 HBM devices ($1 at on-demand pricing, assuming amortized startup), but days of saturating NVMe bandwidth.
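Roughly the arithmetic behind that comparison (assuming ballpark figures of ~1 TB/s per HBM device and ~3.2 GB/s per NVMe drive, not measurements):

```python
# Back-of-the-envelope bandwidth arithmetic.
total_bytes = 1000 * 1e12            # 1000 traversals of a 1 TB structure = 1 PB read

hbm_bw  = 100 * 1e12                 # 100 HBM devices at ~1 TB/s each
nvme_bw = 3.2e9                      # a single NVMe drive at ~3.2 GB/s

print(total_bytes / hbm_bw)          # ~10 seconds
print(total_bytes / nvme_bw / 86400) # ~3.6 days on one saturated drive
```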
Thanks for the correction. I was aware of the speed of distributed cluster reads, but I might be overestimating the speed of NVMe random seeks.
> The other issue is that NVMe is more expensive than DRAM
How are you calculating that?
When I upgraded my home server the other week, 16GB of DDR3-1600 cost nearly as much as 1TB of decently fast NVMe (HP SX920).
Would you still think an on-disk mmap'ed CSR graph library would be useful for medium-scale use cases given modern SSDs? Or do you think such latencies kill any use?
Unless your data is at rest, you're paying for bandwidth, not capacity. SRAM > HBM > DRAM > NVMe when you're buying bandwidth. Seriously, when we have a low-latency network, the most cost-effective way to run some simulations and graph algorithms is out of L3 on EPYC (where mid-tier parts have over 1 TB/s at a fraction of the energy and cost of an A100/MI100).
Your NVMe quotes 3.2 GB/s, versus 25 GB/s (theoretical) for a stick of DDR4-3200 (which is in the same price range). A standard 2-socket EPYC server can give you 300 GB/s from DRAM, but you'd need several servers packed with NVMe to provide that bandwidth.
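Taking those headline numbers at face value:

```python
# How many of those NVMe drives match one 2-socket EPYC's DRAM bandwidth?
dram_bw = 300e9    # ~300 GB/s from DRAM on a 2-socket EPYC server
nvme_bw = 3.2e9    # ~3.2 GB/s per drive (the drive quoted above)
print(dram_bw / nvme_bw)   # ~94 drives, i.e. several fully packed servers
```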
The rationale for persistent storage (like NVMe) as an algorithmic device is either (a) you need it to be persistent or (b) you have physical or algorithmic constraints that prevent you from using more parallelism and you're willing to pay 5-10x for the privilege of executing from NVMe.
Cloud pricing is competitive and thus a good proxy for relative on-prem costs if you keep the on-prem hardware busy. If you want to buy the capability to run the job on-prem, but the hardware sits idle most of the time, then NVMe might be optimal. However, you should be aware that these constraints have greatly increased the cost of the job itself.
I've found sparse formats good for many-column data (10K+ columns). However, for Graphistry's graph workloads, we do most SciPy-size graphs at much higher speeds via cudf + cugraph. Similar APIs, more speed.
Ex: preprocess many cols with scikit sparse (genomes, many-featured fraud, ...) -> cudf (= GPU dataframes) for most of the workload -> infer relationships with GPU UMAP, then GPU k-NN -> Graphistry for visual analysis.
We're starting to experiment with bigger-than-memory (TB+) scale via the GPU-enabled Dask libs, and I definitely recommend trying them. If you're interested in adding visual analytics here, these will be hitting our free + self-hosted tiers soon; drop a contact method if you'd like early access. Exciting times!
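A rough sketch of the middle of that pipeline (cudf + cugraph; the file name and column names are placeholders, not our production code):

```python
import cudf
import cugraph

# Edge list straight into GPU memory; "edges.csv" and its columns are placeholders.
edges = cudf.read_csv("edges.csv")            # columns: src, dst

G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Same kind of call you'd make in networkx, but executed on the GPU.
pr = cugraph.pagerank(G)                      # cudf DataFrame: vertex, pagerank
print(pr.sort_values("pagerank", ascending=False).head(10))
```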
Anyone who is serious about graph analysis in Python using linear algebra should check out GraphBLAS (graphblas.org). The inclusion of semirings makes things like SSSP and BFS extremely elegant on sparse adjacency matrices.
The main implementation is SuiteSparse::GraphBLAS, a C library which has two Python bindings (search for grblas or pygraphblas). Disclosure: I'm the author of grblas. Both are available on conda-forge for easy installation.
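For flavor, here is the semiring idea itself; this is a dense NumPy imitation of a (min, +) vector-matrix product, not the grblas or pygraphblas API, just to show why SSSP falls out of swapping the semiring:

```python
import numpy as np

INF = np.inf
# Weighted adjacency matrix; A[i, j] = weight of edge i -> j, inf if absent.
A = np.array([[INF, 2.0, 5.0, INF],
              [INF, INF, 1.0, 4.0],
              [INF, INF, INF, 1.0],
              [INF, INF, INF, INF]])

def min_plus_vxm(d, A):
    # (min, +) semiring: "multiply" is +, "add" is min.
    return np.min(d[:, None] + A, axis=0)

d = np.array([0.0, INF, INF, INF])      # distances from source vertex 0
for _ in range(len(d) - 1):             # Bellman-Ford style relaxation
    d = np.minimum(d, min_plus_vxm(d, A))
print(d)                                # [0. 2. 3. 4.]
```

In GraphBLAS the same relaxation is one masked mxv per step over a sparse adjacency matrix, which is what makes SSSP and BFS so compact there.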
NetworkX isn't a bad library, nor is it only suitable for "babies". Yeesh. Actual people work on this stuff, you know, and they may have different goals, requirements, etc than you. This whole post reeks of someone who's so in love with being a maverick speaking uncomfortable truths that they've lost sight of this human element.
If you're offended by (say) research papers that fail to break enough new ground to satisfy you, sorry about your luck, but again, not everyone's optimizing the same function as you. You'll probably do better bringing people around when you're not implying (or outright saying) their work is shit.
I have to say I agree. NX is totally fine for the applications it was built for. I appreciate the simple API and the flexibility of using pure Python. There's a reason it's so popular.
I'm not an expert in graph neural networks, so I can't really say much about the novelty of Node2Vec. But I do think it's often misguided to judge scientific work as trivial or incremental in retrospect, especially in a relatively young field like deep learning, where four years (Node2Vec is from 2016) is a long time.
I never said NX is bad, but it is for "baby" graphs. NX can't scale past a few hundred thousand nodes.
I appreciate NX's place in the ecosystem, but its implementation leaves a large gap to be filled by an intermediate library: a Pandas analogue for graphs.
Has the author looked at how GNNs are used in say predicting molecular properties[1] or molecule generation[2]? GNNs are the state of the art on those tasks and those problems are certainly much harder than the Cora/Citeseer/etc semisupervised node label prediction tasks.
I need to handle (for work) a graph with 40 million nodes and more than 130 million edges.
As expected, networkx couldn't handle more than a million nodes, so I had to search for Python libs that could handle that much data.
Having a graph with several million nodes isn't just some edge case; social graphs, for instance, grow way faster than anyone could expect.
So I totally agree that there is a gap between popular graph libs, which handle very small graphs, and libs for real-life use, which need to handle way bigger graphs.
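At that scale even the container matters; a sketch of how I'd load such an edge list into a CSR matrix with SciPy ("edges.bin" and the dtypes are illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix

# ~130M edges as two int32 columns fit comfortably in RAM (~1 GB),
# unlike a dict-of-dicts representation of the same graph.
edges = np.fromfile("edges.bin", dtype=np.int32).reshape(-1, 2)
n = int(edges.max()) + 1

A = coo_matrix(
    (np.ones(len(edges), dtype=np.int8), (edges[:, 0], edges[:, 1])),
    shape=(n, n),
).tocsr()
print(A.shape, A.nnz)
```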
Btw, thanks for the work you've done with 'nodevectors'.
PS: I'm not criticizing either networkx which is very handy and quite good when prototyping a solution.
nodevectors implements many more algorithms than fastnode2vec, which focuses solely on node2vec.
I'm pretty sure fastnode2vec is the fastest node2vec implementation because it uses the CSR format and JIT compilation, and it also supports multiprocessing. The sampling algorithm has been improved compared to the paper (and to all other implementations) and allows truly linear memory consumption.
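To illustrate why CSR plus JIT makes walk sampling fast, here is a generic sketch of uniform random walks over a CSR graph with Numba; it is not fastnode2vec's actual code, which as I understand it also handles the p/q biases:

```python
import numpy as np
from numba import njit

@njit
def random_walks(indptr, indices, starts, walk_length, seed=0):
    """Uniform random walks over a CSR graph given by (indptr, indices)."""
    np.random.seed(seed)
    walks = np.empty((len(starts), walk_length), dtype=np.int64)
    for i, v in enumerate(starts):
        walks[i, 0] = v
        for t in range(1, walk_length):
            lo, hi = indptr[v], indptr[v + 1]
            if lo == hi:              # dead end: stay put
                walks[i, t] = v
                continue
            v = indices[lo + np.random.randint(0, hi - lo)]
            walks[i, t] = v
    return walks
```

With `indptr` and `indices` taken straight from a `scipy.sparse.csr_matrix`, the inner loop is pure array indexing, which is exactly what Numba compiles well.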
On the other hand, nodevectors has a lot of very cool methods like ProNE, so you should definitely try it on your data. However, the original ProNE code is probably faster, as it is written in C++ and uses multiple cores, while nodevectors just uses the installed BLAS (you can probably get a massive speedup by installing OpenBLAS).
I suspect we will not see a revolution in deep learning from GNNs without a corresponding advance in hardware (FPGAs?) or optimization methods (HSIC? HTMs?).
GNNs trained by backprop are making many of the same mistakes that LSTMs did: they solve one important problem (exploding/vanishing gradients), but introduce a bunch of hyperparameters that bring along their own set of problems. GRUs are successful, in my opinion, because they remove some of those tunable parameters.
Of course, as the post suggested, not being able to tune something often loses out against a more tuned and curated solution, and GNNs, being the newest iteration, have tons of parameters to tweak. In the end, do we get a better, ultimately more generalizable solution? Or do we just get more hyperparameters to tune, and spend more time and money for a modest, unrepeatable gain?
By no means am I an expert in deep learning, but lately I've been considering graph NNs the snake oil of neural networks. There are no impressive results on any common tasks, and that picture is supported by a recent paper claiming transformers "contain a graph network inside".
In my opinion there aren't many deep learning applications that are fundamentally snake oil; it's more that people who don't quite understand the limitations and advantages of particular methods end up being (witting or unwitting) snake-oil salespeople.
Deep graph NNs are very useful for lots of data that are inherently graph-structured (another commenter above mentioned chemistry applications, and there are lots of other examples). Whether or not they give SOTA results on common datasets for the time being, being able to work directly with graph-structured data is quite appealing in many cases.
The post can't be accessed; I get an Internal Server Error. Or is it just my ISP? I also often mix these up with other representation learning methods such as graph kernels: https://en.wikipedia.org/wiki/Graph_kernel
This beauty got an outstanding paper award at COLING 2020.
GNNs are not applied to "common tasks" because they are primarily suited for environments where graph structure is present. In this case, predicting how the Indian congress will vote.
A bunch of fully connected layers also "contains a CNN inside", but empirically CNNs lead to better performance on image classification tasks. When you can cut away an enormous, and mostly irrelevant part of the parameter space, that has a lot of value.
You should look at label propagation. Combining Label Propagation and Simple Models Out-performs Graph Neural Networks: https://arxiv.org/abs/2010.13993
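The "simple" part really is simple; here is a minimal sketch of a generic label-spreading iteration of the kind that paper builds on (symmetric-normalized propagation, dense NumPy for brevity, not the exact Correct & Smooth pipeline):

```python
import numpy as np

def label_propagation(A, Y, alpha=0.9, iters=50):
    """Propagate label distributions Y (n x classes) over adjacency A (n x n)."""
    d = A.sum(axis=1)
    d[d == 0] = 1.0
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = Dinv_sqrt @ A @ Dinv_sqrt            # symmetric normalization
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F

# Toy example: a path graph 0-1-2-3, labels known for the endpoints only.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.zeros((4, 2))
Y[0, 0] = 1.0      # vertex 0 is class 0
Y[3, 1] = 1.0      # vertex 3 is class 1
print(label_propagation(A, Y).argmax(axis=1))   # [0 0 1 1]
```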
This is like someone going "I installed NLTK and it's slow and doesn't support transformers or word2vec!". In fact, I think the author of the article may have actually done exactly that.