"Not a Google project, but created and maintained by a Googler"
Interesting location, being in the main Google organization, given this disclaimer. I've seen other projects with this disclaimer under Googlers' personal pages before, and I always figured a /google/ URL probably meant it was officially a Google project.
Neo4j, like all graph databases I've tried, is only okay with small data.
Suppose I want to import a medium-sized graph into Neo4j. Medium-sized as in "fits on a hard disk and doesn't quite fit in RAM". One example would be importing DBPedia.
Some people have come up with not-very-supported hacks for loading DBPedia into Neo4j. Some StackOverflow comments such as [1] will point you to them, and the GitHub pages will generally make it clear that this is not a process that really generalizes, just something that worked once for the developer.
Now suppose you want to load different medium-sized graph-structured data into Neo4j. You're basically going to have to reinvent these hacks for your data.
And the last time I tried to load my multi-million-edge dataset into Neo4j through its documented API, I estimated that it would have taken several weeks to finish.
Don't tell me that I need some sort of enterprise distributed system to import a few million edges. Right now I keep these edges in a hashtable that I wrote myself, in not-very-optimized Python, that shells out to "sort" for the computationally expensive part of indexing. It's not a very good DB but it gets the job done for now. It takes about 2 hours to import.
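Roughly, the shape of that home-grown store is something like this (a simplified sketch; the tab-separated layout and file names are illustrative, not the actual code):

    # Sketch of a "shell out to sort" edge index: order the edges by source
    # node so that everything around a node ends up contiguous on disk.
    import subprocess

    def build_edge_index(triples_path, sorted_path):
        # GNU sort does the computationally expensive part.
        subprocess.check_call(
            ["sort", "-t", "\t", "-k", "1,1", "-o", sorted_path, triples_path]
        )

    def edges_for_node(sorted_path, node):
        # Linear-scan stand-in; the real store keeps byte offsets per node
        # so a lookup doesn't have to read the whole file.
        with open(sorted_path, encoding="utf-8") as f:
            for line in f:
                s, p, o = line.rstrip("\n").split("\t")
                if s == node:
                    yield (s, p, o)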
>>And the last time I tried to load my multi-million-edge dataset into Neo4j through its documented API, I estimated that it would have taken several weeks to finish.
Thanks, I'll put that on my list of things to try, although I've spent more than enough time banging my head against graph DBs for today.
This isn't insurmountable, but I'm just going to gripe about it: I'm annoyed by the idea that I need to make a table of nodes and load it in first. Every graph DB tutorial seems to do this, because it looks like what you'd do if you were moving your relational data into a graph DB. But I have RDF-like data where nodes are just inconsequential string IDs.
Hm, this indicates that I should definitely be looking at Cayley, which directly supports loading RDF quads.
Let me preface this by saying that I have not used Neo4j in about a year.
But before then I used it all the time. I imported the whole US patent DB into it. Its performance was very solid, both for importing and for querying with their Cypher query language, which is well suited to graph representations and graph algorithms. The dataset definitely had millions of edges, with ~8M nodes.
That said, if you are having import issues, you must not be using batched writes. If that's the case, you are creating significant overhead in your write process (every write acquires certain locks and syncs the DB to avoid concurrent-modification issues).
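For reference, batched writes with the official Python driver look roughly like this (connection details, the label, and the property names are assumptions; the point is one transaction per few thousand edges instead of one per edge):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    # Assumes edge_rows yields dicts like {"src": ..., "dst": ..., "label": ...}.
    # A uniqueness constraint or index on :Node(id) makes the MERGEs much faster.
    EDGE_QUERY = """
    UNWIND $rows AS row
    MERGE (a:Node {id: row.src})
    MERGE (b:Node {id: row.dst})
    MERGE (a)-[:REL {label: row.label}]->(b)
    """

    def load_edges(edge_rows, batch_size=10000):
        with driver.session() as session:
            batch = []
            for row in edge_rows:
                batch.append(row)
                if len(batch) >= batch_size:
                    session.run(EDGE_QUERY, rows=batch)
                    batch = []
            if batch:
                session.run(EDGE_QUERY, rows=batch)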
Very interesting. I think the reality in a lot of situations is that most people don't really need the full feature-set that graph databases provide.
I ran into a similar problem trying to explore Wikidata's json dumps. It's a lot simpler to load it into MongoDB and create indices where you need them, rather than figuring out how to interface to a proprietary system that you may or may not end up using in the long run.
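Something along these lines (a rough sketch; the collection name and the index are placeholders, and the line-by-line handling relies on the dump being one big JSON array with one entity per line):

    import json
    from pymongo import MongoClient

    def load_dump(path, batch_size=1000):
        coll = MongoClient().wikidata.entities
        batch = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip().rstrip(",")
                if line in ("[", "]", ""):
                    continue  # skip the array brackets and blank lines
                batch.append(json.loads(line))
                if len(batch) >= batch_size:
                    coll.insert_many(batch)
                    batch = []
        if batch:
            coll.insert_many(batch)
        coll.create_index("id")  # index only what you actually query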
I'm still having trouble keeping my indices in memory though, and would be keen to know what sort of latency you encounter hitting an on-disk hashtable.
What I end up with is a 4 GB index, whose contents are byte offsets into an 8 GB file containing the properties of the edges.
When I mmap these, an in-memory lookup takes about 1 ms, but I can have unfortunate delays of like 100 ms per lookup if I'm starting cold or hitting things that are swapped out of memory.
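The lookup side is essentially just this (simplified sketch, assuming newline-terminated records; the offsets dict is the 4 GB node-to-byte-offset map):

    import mmap

    class EdgeFile:
        def __init__(self, path, offsets):
            self._f = open(path, "rb")
            self._mm = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)
            self._offsets = offsets  # {node_id: byte offset into the file}

        def lookup(self, node_id):
            start = self._offsets[node_id]
            end = self._mm.find(b"\n", start)
            if end == -1:
                end = len(self._mm)
            return self._mm[start:end].decode("utf-8")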
Also, yeah, I don't really need most of the things that graph DBs are offering. They seem to focus a lot on making computationally unreasonable things possible -- such as arbitrary SPARQL queries.
I'm not the kind of madman who wants to do arbitrary SPARQL queries. I just want to be able to load data quickly, look up the edges around a node quickly, and then once I can do that, I'd also like to be able to run the occasional algorithm that propagates across the entire graph, like finding the degree-k core.
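For concreteness, the degree-k core is just the usual peeling loop, which doesn't need anything a graph DB provides (in-memory sketch):

    from collections import defaultdict

    def k_core(edges, k):
        # Build an undirected adjacency map.
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        adj = dict(adj)
        # Repeatedly remove nodes of degree < k until none are left.
        queue = [n for n, nbrs in adj.items() if len(nbrs) < k]
        while queue:
            n = queue.pop()
            if n not in adj:
                continue
            for nbr in adj.pop(n):
                if nbr in adj:
                    adj[nbr].discard(n)
                    if len(adj[nbr]) < k:
                        queue.append(nbr)
        return set(adj)  # nodes remaining in the k-core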
I use MongoDB for a Wikidata replica and index performance is quite good. I use some hacks to keep the size of indexed values low (see https://github.com/ProjetPP/WikibaseEntityStore/blob/master/... ). That helps a lot with keeping the indexes in memory.
I loaded the full Freebase dump into Cayley a year and something ago (Freebase probably already counts as big-sized). It took around a week on a pretty beefy machine, the problem being mainly the way the data was structured in the dump.
But after the import, it worked pretty well, without issues and with decent performance. By decent I mean: not good enough for a front end, but good enough for running analysis in a back end with a little bit of caching.
have you used it on non-trivial scales? did it turn out ok for you? any major outstanding issues?
asking because i've looked at it a couple of years back and decided to go with SQL but the project looked reeeaally interesting back then and it still might fit for new development.
I tried OrientDB. It's not clear how to use their fast data importer on data that's actually structured as a graph (instead of "hey, I've got a SQL database that I want to put into a graph database for some reason"). A couple of their employees have responded to me once but haven't actually answered the question.
I also tried it before they had a fast data importer and... well, you need a fast data importer.
The data is a list of triples. They can be in .nt format, for example. They can also be in a CSV that looks like .nt format without all the angle brackets and escaping, if that would be better.
Contrary to the assumptions of the OrientDB CSV tutorial, my edges are not being exported from a SQL database. The nodes aren't, for example, foreign keys into a SQL table. They don't have sequential IDs. They are just strings that identify the things that the edges connect. This is typical in N-Triples.
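To make that concrete, the conversion I have in mind is roughly this (naive parsing that ignores literals and escaping, just to show the shape of the data):

    import csv

    def nt_to_csv(nt_path, csv_path):
        # "<http://ex.org/a> <http://ex.org/knows> <http://ex.org/b> ."
        # becomes the row: http://ex.org/a,http://ex.org/knows,http://ex.org/b
        with open(nt_path, encoding="utf-8") as src, \
             open(csv_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            for line in src:
                parts = line.strip().rstrip(" .").split(" ", 2)
                if len(parts) == 3:
                    writer.writerow([p.strip("<>") for p in parts])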
The triples are not currently in any kind of relational database, which I think is what Teleporter is about.
Can you explain the concern over this maybe not being maintained any more? My perspective (for reference) is that either it meets your requirements for a project or it does not. Some of my favorite libraries haven't been updated in over a decade. They are in C, though, and I recognize that this removes a lot of the worries about environmental churn someone might naturally have if they come from (say) JavaScript.
While I've been pointing out the problems I've encountered with Neo4J and OrientDB, I should say as a counterpoint that I just tried Blazegraph based on this recommendation, and so far, it works.
Its importer was not the most intuitive thing to use -- I had to dig up configuration items that were scattered across its documentation and Stack Overflow -- but I got it to work, it imported 25 million edges in less than 8 minutes, and it's providing reasonably quick access to those edges now.
> The Neo4j Community Edition is licensed under the free GNU General Public License (GPL) v3 while the Neo4j Enterprise Edition is dual licensed under Neo4j commercial license as well as under the free Affero General Public License (AGPL) v3.
I looked at the history of that project a while back. My recollection is that Cayley was pulled out last year. Barak Michener works at CoreOS now, but he may be quite busy.
I've been looking at Cayley recently, for a project of ours. We want to ingest millions of Hearthstone replays (simple XML documents describing thousands of key/value deltas per game) and analyze the game state for every turn, etc.
Cayley seems uniquely suited for that, but its development activity is concerning. If anyone has any other suggestions, I'm all ears.
TitanDB is dead in the water. They have been unresponsive since getting acquired by DataStax; it looks like the team is now working on their commercial offering. Everyone I know is looking for alternatives.