Cayley – An open-source graph database (github.com/google)
167 points by indatawetrust on June 13, 2016 | 56 comments



"Not a Google project, but created and maintained by a Googler"

Interesting location, being in the main Google organization, given this disclaimer. I've seen other projects with this disclaimer under Googlers' personal pages before, and I always figured a /google/ URL probably meant it was officially a Google project.


Nah, I think that's the default place that Googlers' code goes if they spend their 20% time on it (so it's the company's code).

There was a discussion about it a couple of months ago [0] when that Rust text editor came up on HN.

[0] https://news.ycombinator.com/item?id=11576703


For people who are concerned about the lack of development on Cayley: it looks like there is a fork that is actively merging PRs, if nothing else:

https://github.com/dennwc/cayley

Found from the mailing list:

https://groups.google.com/forum/?hl=en#!topic/cayley-users/D...


The project has had no activity in the last six months. I would like to see an open-source graph database that's being actively developed and maintained.


http://github.com/google/badwolf

Graph store currently being used by Google's Spam & Abuse Team. Last commit last week.


I'm pretty excited about https://github.com/dgraph-io/dgraph

Distributed (yes, really!) graph database, from a guy who was on Google's knowledge graph team.




Neo4j, like all graph databases I've tried, is only okay with small data.

Suppose I want to import a medium-sized graph into Neo4j. Medium-sized as in "fits on a hard disk and doesn't quite fit in RAM". One example would be importing DBPedia.

Some people have come up with not-very-supported hacks for loading DBPedia into Neo4j. Some StackOverflow comments such as [1] will point you to them, and the GitHub pages will generally make it clear that this is not a process that really generalizes, just something that worked once for the developer.

[1] http://stackoverflow.com/questions/12212015/how-to-setup-neo...

Now suppose you want to load a different medium-sized, graph-structured dataset into Neo4j. You're basically going to have to reinvent these hacks for your data.

And the last time I tried to load my multi-million-edge dataset into Neo4j through its documented API, I estimated that it would have taken several weeks to finish.

Don't tell me that I need some sort of enterprise distributed system to import a few million edges. Right now I keep these edges in a hashtable that I wrote myself, in not-very-optimized Python, that shells out to "sort" for the computationally expensive part of indexing. It's not a very good DB but it gets the job done for now. It takes about 2 hours to import.
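
Roughly the shape of it, as a simplified sketch (the tab-separated, source-keyed file layout here is made up for illustration):

    import subprocess

    def build_index(edges_path, sorted_path):
        # GNU sort does the expensive part and copes with data bigger than RAM.
        with open(sorted_path, "w") as out:
            subprocess.check_call(["sort", edges_path], stdout=out)
        # One pass to record the byte offset where each node's edge block starts.
        index = {}
        offset = 0
        with open(sorted_path, "rb") as f:
            for line in f:
                index.setdefault(line.split(b"\t", 1)[0], offset)
                offset += len(line)
        return index

    def edges_for(node, index, sorted_path):
        # node is bytes; yields the tab-separated fields of each matching edge.
        if node not in index:
            return
        with open(sorted_path, "rb") as f:
            f.seek(index[node])
            for line in f:
                if not line.startswith(node + b"\t"):
                    break
                yield line.rstrip(b"\n").split(b"\t")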


> And the last time I tried to load my multi-million-edge dataset into Neo4j through its documented API, I estimated that it would have taken several weeks to finish.

Use the import tool; it can do a million writes a second. Here is how to import Hacker News into Neo4j using it: https://maxdemarzi.com/2015/04/14/importing-the-hacker-news-...
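
The whole invocation is roughly this (file names and paths are placeholders; see the post for the exact CSV header format):

    import subprocess

    # Offline batch import with the 2.x-era neo4j-import tool.
    # The target directory must not already contain a database.
    subprocess.check_call([
        "neo4j-import",
        "--into", "/var/lib/neo4j/data/graph.db",
        "--nodes", "nodes.csv",          # header like: id:ID,name
        "--relationships", "rels.csv",   # header like: :START_ID,:END_ID,:TYPE
    ])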


Thanks, I'll put that on my list of things to try, although I've spent more than enough time banging my head against graph DBs for today.

This isn't insurmountable, but I'm just going to gripe about it: I'm annoyed by the idea that I need to make a table of nodes and load it in first. Every graph DB tutorial seems to do this, because it looks like what you'd do if you were moving your relational data into a graph DB. But I have RDF-like data where nodes are just inconsequential string IDs.

Hm, this indicates that I should definitely be looking at Cayley, which directly supports loading RDF quads.
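
For what it's worth, deriving that node table mechanically isn't hard; it's just an extra step I resent. Something like this (the split is naive; literals with spaces would need a real N-Triples parser):

    import csv

    def triples_to_csv(triples_path, nodes_path, rels_path):
        # Derive the "table of nodes" an importer wants from triples whose
        # nodes are plain string IDs; headers follow neo4j-import conventions.
        nodes = set()
        with open(triples_path) as src, open(rels_path, "w", newline="") as f:
            rels = csv.writer(f)
            rels.writerow([":START_ID", ":END_ID", ":TYPE"])
            for line in src:
                s, p, o = line.rstrip(" .\n").split(" ", 2)  # naive split
                nodes.update([s, o])
                rels.writerow([s, o, p])
        with open(nodes_path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(["id:ID"])
            for n in sorted(nodes):
                w.writerow([n])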


Let me preface this by saying that I have not used Neo4j in about a year.

But before then I used it all the time. I imported the whole US patent DB into it. Its performance was very solid, both for importing and for querying with its Cypher query language, which is well tailored to graph representations and graph algorithms. It definitely had millions of edges, with ~8M nodes.

That said, if you are having import issues, you must not be using batch mode. If that's the case, you are creating significant overhead in your write process (every write acquires certain locks and syncs the DB to avoid concurrent-modification issues).


Very interesting. I think the reality in a lot of situations is that most people don't really need the full feature-set that graph databases provide.

I ran into a similar problem trying to explore Wikidata's json dumps. It's a lot simpler to load it into MongoDB and create indices where you need them, rather than figuring out how to interface to a proprietary system that you may or may not end up using in the long run.

I'm still having trouble keeping my indices in memory though, and would be keen to know what sort of latency you encounter hitting an on-disk hashtable.


What I end up with is a 4 GB index, whose contents are byte offsets into an 8 GB file containing the properties of the edges.

When I mmap these, an in-memory lookup takes about 1 ms, but I can have unfortunate delays of like 100 ms per lookup if I'm starting cold or hitting things that are swapped out of memory.
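
The lookup itself is nothing fancy. Schematically (the fixed-width record layout here is a simplification of what I actually do):

    import mmap, struct

    RECORD = struct.Struct("<QQ")  # (key hash, byte offset into the edges file)

    def lookup(key_hash, index_mm, edges_f, n_records):
        # Binary search over hash-sorted fixed-width records in the mmap'd
        # index, then one seek+read into the (much larger) edges file.
        lo, hi = 0, n_records
        while lo < hi:
            mid = (lo + hi) // 2
            h, off = RECORD.unpack_from(index_mm, mid * RECORD.size)
            if h < key_hash:
                lo = mid + 1
            elif h > key_hash:
                hi = mid
            else:
                edges_f.seek(off)
                return edges_f.readline()  # one edge record per line, say
        return None

    # Usage: index_mm = mmap.mmap(open("index.bin", "rb").fileno(), 0,
    #                             access=mmap.ACCESS_READ)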


Also, yeah, I don't really need most of the things that graph DBs are offering. They seem to focus a lot on making computationally unreasonable things possible -- such as arbitrary SPARQL queries.

I'm not the kind of madman who wants to do arbitrary SPARQL queries. I just want to be able to load data quickly, look up the edges around a node quickly, and then once I can do that, I'd also like to be able to run the occasional algorithm that propagates across the entire graph, like finding the degree-k core.
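
For concreteness, the degree-k core is about the only whole-graph pass I need, and it's simple once you can hold adjacency (a plain-dict sketch; my real storage differs):

    from collections import deque

    def k_core(adj, k):
        # adj: dict mapping node -> set of neighbours; mutated in place.
        # Repeatedly strip nodes of degree < k until the graph stabilizes.
        queue = deque(n for n, nbrs in adj.items() if len(nbrs) < k)
        while queue:
            n = queue.popleft()
            for m in adj.pop(n, ()):
                nbrs = adj.get(m)
                if nbrs is not None:
                    nbrs.discard(n)
                    if len(nbrs) < k:
                        queue.append(m)
        return adj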


I use MongoDB for a Wikidata replica, and index performance is quite good. I use some hacks to keep the size of indexed values low (see https://github.com/ProjetPP/WikibaseEntityStore/blob/master/... ). It helps a lot with keeping the indexes in memory.

It powers https://askplatyp.us quite well.
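
The gist of the size trick, simplified (a guess at the general idea; the linked file has the real scheme):

    import hashlib
    from pymongo import MongoClient, ASCENDING

    # Index a short digest of each long value instead of the value itself,
    # so the index stays small enough to keep in memory.
    coll = MongoClient().wikidata.entities
    coll.create_index([("label_hash", ASCENDING)])

    def digest(value):
        return hashlib.md5(value.encode("utf-8")).digest()[:8]

    def insert(entity):
        entity["label_hash"] = digest(entity["label"])
        coll.insert_one(entity)

    def find_by_label(label):
        # Filter on the tiny indexed digest, then confirm on the full value.
        return coll.find({"label_hash": digest(label), "label": label})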


I loaded the full Freebase dump into Cayley a little over a year ago (Freebase probably already counts as big). It took around one week on a pretty beefy machine, the problem being mainly the way the data was structured in the dump.

But after the import, it worked pretty well, without issues and with decent performance. Decent meaning: not good enough for a front end, but good enough for running analysis in a back end with a little bit of caching.


FYI, we have had extremely good performance with the LOAD CSV feature of Cypher.
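
For anyone curious, it's a single Cypher statement; here is a minimal sketch run through the Python driver (the CSV layout and credentials are made up, and the file has to live in Neo4j's import directory):

    from neo4j import GraphDatabase

    LOAD_EDGES = """
    USING PERIODIC COMMIT 10000
    LOAD CSV WITH HEADERS FROM 'file:///edges.csv' AS row
    MERGE (a:Node {id: row.src})
    MERGE (b:Node {id: row.dst})
    MERGE (a)-[:LINKS_TO]->(b)
    """

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "secret"))
    with driver.session() as session:
        session.run(LOAD_EDGES)  # auto-commit, as PERIODIC COMMIT requires
    driver.close()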


There is also https://github.com/dgraph-io/dgraph, also a Go project (as the title mentions), if that matters to anyone.


Is there a benchmark of a DBPedia import in dgraph?


Their demo is very pretty... and they seem pretty keen on performance.


Looks promising. GraphQL support!



Have you used it at non-trivial scales? Did it turn out OK for you? Any major outstanding issues?

Asking because I looked at it a couple of years back and decided to go with SQL, but the project looked really interesting back then, and it still might fit for new development.



Thanks!


I tried OrientDB. It's not clear how to use their fast data importer on data that's actually structured as a graph (instead of "hey, I've got a SQL database that I want to put into a graph database for some reason"). A couple of their employees have responded to me once but haven't actually answered the question.

I also tried it before they had a fast data importer and... well, you need a fast data importer.


What kind of data did you try to import into OrientDB? CSV? Or from any other GraphDB (GraphML format)?

Have you already tried http://orientdb.com/docs/last/Graph-Batch-Insert.html?

If you're importing data from an RDBMS, you can use Teleporter: http://orientdb.com/docs/last/Teleporter-Home.html


The data is a list of triples. They can be in .nt format, for example. They can also be in a CSV that looks like .nt format without all the angle brackets and escaping, if that would be better.

Contrary to the assumptions of the OrientDB CSV tutorial, my edges are not being exported from a SQL database. The nodes aren't, for example, foreign keys into a SQL table. They don't have sequential IDs. They are just strings that identify the things that the edges connect. This is typical in N-Triples.

The triples are not currently in any kind of relational database, which I think is what Teleporter is about.

With the link to http://orientdb.com/docs/last/Graph-Batch-Insert.html, you seem to be asking me to write my own importer in Java. I'd rather not.


Neo4j had activity an hour ago.



https://github.com/amark/gun

Open-source graph database, P2P/decentralized, 1800+ stars, browser/JavaScript friendly, active community, does realtime updates like Firebase.

With some INSANE performance specs:

https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2...


Can you explain the concern over this maybe not being maintained any more? My perspective (for reference) is that either it meets your requirements for a project or it does not. Some of my favorite libraries haven't been updated in over a decade. They are in C, though, and I recognize that this removes a lot of the worries about environmental churn someone might naturally have if they come from (say) JavaScript.


Blazegraph, as used for Wikidata, among others. It has lots of property-graph support, not just semantic graphs.


While I've been pointing out the problems I've encountered with Neo4J and OrientDB, I should say as a counterpoint that I just tried Blazegraph based on this recommendation, and so far, it works.

Its importer was not the most intuitive thing to use -- I had to dig up configuration items that were scattered across its documentation and Stack Overflow -- but I got it to work: it imported 25 million edges in less than 8 minutes, and it's providing reasonably quick access to those edges now.


Yes, the dev activity seems to be declining - https://go.libhunt.com/project/cayley


OTOH, I've been using it in production for several months with good results. Small dataset though, only 9M quads.

Cayley's codebase is small and very clean.


I know Neo4j is the 800 lb. gorilla in this space, but it's interesting that Cayley has way more stars (almost 8K) than Neo4j's almost 3K.


Latest commit is March 23rd.


What are the main reasons someone would use this over Neo4j? Is being open source the primary differentiator?


I've used it in the past (and also contributed some patches). It has some rough edges, but it works far better than Neo4j.

It handles far bigger datasets than Neo4j (even if performance is not great), and it is easier to use and maintain.


It uses little memory and needs no JVM. It's also interesting for Go shops, since its codebase is relatively small and clean.


Neo4J is open source.

> The Neo4j Community Edition is licensed under the free GNU General Public License (GPL) v3 while the Neo4j Enterprise Edition is dual licensed under Neo4j commercial license as well as under the free Affero General Public License (AGPL) v3.

http://neo4j.com/open-source-project/


You don't have to pollute your environment with Java.


How is your environment "polluted" when running something on the JVM?


It is hard to explain: having the JVM on the server is like having empty beer cans lying around in the living room.


Clair, the vulnerability database for Docker images, uses Cayley, which is how I found out about it. Both are neat projects and worth a look:

https://github.com/coreos/clair


I looked at the history of that project a while back. My recollection is that Cayley was pulled out last year. Barak Michener works at CoreOS now, but he may be quite busy.



I've been looking at Cayley recently, for a project of ours. We want to ingest millions of Hearthstone replays (simple XML documents describing thousands of key/value deltas per game) and analyze the game state for every turn, etc.

Cayley seems uniquely suited for that, but its development activity is concerning. If anyone has any other suggestions, I'm all ears.
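
The rough ingestion plan is to flatten each replay into N-Quads that Cayley can load, with one named graph per turn. A sketch (the XML element and attribute names are invented for illustration; the real replay schema differs):

    import xml.etree.ElementTree as ET

    def replay_to_nquads(xml_path, game_id, out):
        # One named graph per turn, one quad per key/value delta.
        root = ET.parse(xml_path).getroot()
        for turn in root.iter("Turn"):          # hypothetical element name
            graph = "<game:%s/turn:%s>" % (game_id, turn.get("number"))
            for tag in turn.iter("TagChange"):  # hypothetical element name
                out.write('<entity:%s> <tag:%s> "%s" %s .\n' % (
                    tag.get("entity"), tag.get("tag"), tag.get("value"),
                    graph))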



TitanDB is dead in the water. They have been unresponsive since getting acquired by DataStax; it looks like the team is now working on their commercial offering. Everyone I know is looking for alternatives.


The new DataStax product may be built on top of Titan: http://intsantglobalnews.com/datastax-adds-graph-databases-t...


Dgraph looks pretty great.


I like https://github.com/GraphChi/graphchi-cpp quite a bit. Handles out-of-core graphs really well. Fast.


What exactly is an open source graph?



