Graph databases and Python (slideshare.net)
87 points by bjerun on Dec 7, 2013 | hide | past | favorite | 36 comments


Does anyone have a good list of books, tutorials, howtos and whatnot about graph databases in general, preferably with python examples (but any language is good really...)?

I've used graphdbs in the past but a nice collection of patterns and best practices would be nice - upping my game on this topic is a current interest of mine!


For general graph DB tutorials, see...

* TinkerPop Book "Resources" section: http://www.tinkerpopbook.com/

* Marko's blog: http://markorodriguez.com/ -- start with the "On Graph Computing" post (http://markorodriguez.com/2013/01/09/on-graph-computing/).

* Aurelius Blog: http://thinkaurelius.com/blog/

For Python, see the Bulbs Docs: http://bulbflow.com/docs/

I've been meaning to update the Bulbs docs for Bulbs/Titan. It's essentially the same as Bulbs/Neo4jServer and Bulbs/Rexster, except Titan does indexing a bit differently.

Here are a few pointers...

* Boutique Graph Data with Titan: http://thinkaurelius.com/2013/11/24/boutique-graph-data-with...

* Titan Overview: https://github.com/thinkaurelius/titan/wiki

* Download: https://github.com/thinkaurelius/titan/wiki/Downloads

* Titan Server: https://github.com/thinkaurelius/titan/wiki/Rexster-Graph-Se...

* Bulbs Titan Example: https://gist.github.com/espeed/3938820
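
To give a feel for the Bulbs API, here is a minimal sketch along the lines of the quickstart in the Bulbs docs, assuming a Neo4j Server running locally on its default port (the Rexster and Titan flavors look essentially the same); method names are from the Bulbs docs as I remember them, so double-check them against http://bulbflow.com/docs/ ...

    from bulbs.neo4jserver import Graph  # or the bulbs.rexster flavor

    g = Graph()  # connects to the default local server URL

    # create two vertices and a "knows" edge between them
    james = g.vertices.create(name="James")
    julie = g.vertices.create(name="Julie")
    g.edges.create(james, "knows", julie)

    # run a parameterized Gremlin script on the server and iterate the results
    friends = g.gremlin.query("g.v(vid).out('knows')", dict(vid=james.eid))
    for friend in friends:
        print(friend.name)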


Awesome, thanks for this list!


I blog about Neo4j and the things you can do with it => http://maxdemarzi.com/

Also, all the code you see on the blog and then some is open sourced and available on github => https://github.com/maxdemarzi?tab=repositories


Very cool, thanks for the pointer!


1) read this https://python-graph-lovestory.readthedocs.org/en/latest/blu...

2) make sure you have the API in mind

3) choose a problem

4) take a pen

5) take sheet of paper

6) solve the problem

7) ???

8) profit!

Most of the time (not to say always) there are several solutions to a problem, but only one particular solution will be the best...

The thing is, there are still no FOSS projects in the wild using graphdbs from which you can copy the designs, but is that really what you want?


Good resource. Thanks for the pointer!


http://shop.oreilly.com/product/0636920028246.do

This is a good book for a quick intro to graph dbs for anyone interested.


The book is free for download at http://graphdatabases.com/


Good resource, thanks!


Could anyone explain to me what "native" versus "non-native" graph processing means in that slide show? Ditto for "native" versus "non-native" graph storage. I simply have no idea what I'm supposed to picture when I see that.

Also, on the neo4j.org page, the claim that the "graph data model['s] expressiveness supersedes the relational model" seems a little spurious. As I understand it, the relational model and graph data are both anchored in first-order predicate logic, and therefore should essentially be able to do the same things (although a Codd-style RDBMS needs a little more fuss regarding the necessary schemas).


>Could anyone explain to me what "native" versus "non-native" graph processing means in that slide show?

One of the leading native graph processing engines is GraphLab (http://graphlab.org/); however, the creator of GraphLab, Dr. Joey Gonzalez, is now focused on GraphX, which is essentially GraphLab built on Spark (http://spark.incubator.apache.org), which is a non-native analytics platform.

Building a graph-processing engine on a general processing system like Spark makes pre-processing and post-processing much easier.

See "Introduction to GraphX - Presented by Joseph Gonzalez, Reynold Xin - UC Berkeley AmpLab 2013" (http://www.youtube.com/watch?v=mKEn9C5bRck)

Also, a bunch of advancements in graph processing are coming down the pipe, which will be released in a few months (see https://news.ycombinator.com/item?id=6786563).

>Ditto for "native" versus "non-native" graph storage.

See this post by Dr. Matthias Broecheler, the creator of Titan (https://github.com/thinkaurelius/titan/wiki)...

"A Letter Regarding Native Graph Databases" (http://thinkaurelius.com/2013/11/01/a-letter-regarding-nativ...)


So essentially, it's totally meaningless marketing bullshit? As much as I favor memory optimizations, I think that merely trying to linearize the access patterns is completely futile in the case of graph databases. With that level of brute-force approach to speeding things up, you'll most likely gain more performance by using lower-latency memory modules, or simply by using different data structures to accommodate your specific cache line sizes and latencies, than by trying to linearize generic graphs.


You might appreciate this landscape distinction: http://www.slideshare.net/slidarko/titan-the-rise-of-big-gra...

Furthermore, please have a look at "On Graph Computing" for a breakdown of three different categories of graph computing systems -- toolkit, database, analytics. http://markorodriguez.com/2013/01/09/on-graph-computing/

Finally, yes -- there are no theoretical expressivity gains between RDBMSs and property graphs (and RDF graphs). Nor is SQL (Turing Complete versions) any less expressive than Gremlin (Turing Complete path recognition). The only argument you can make is that graphs are more (or less) effective in terms of conciseness of expression and speed of execution on particular problems. Typically (as expected), it's the difference between problem datasets that look like networks (graphs) and those that look like spreadsheets (tables).


I just joined a project using Neo4j. They're using the latest version, so we've had to build our own Python tooling. It's still very immature, but hopefully we'll get it into an open-sourceable state.

Modelling in graphs is new to me so I was wondering if anyone had any tips or pointers.


It shouldn't take too much to update Bulbs/Neo4j to Neo4j 2.0 -- add the Gremlin Plugin on Neo4j Server (which isn't installed by default anymore) or swap out the Bulbs built-in Gremlin scripts for Cypher equivalents, if Cypher will let you do everything you need...

Bulbs Python Client: https://github.com/espeed/bulbs
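
For example, here is a rough sketch of what such a swap might look like, expressed as plain Python strings (illustrative only, not the actual scripts shipped with Bulbs):

    # Gremlin-style traversal that Bulbs would send to the Gremlin Plugin
    gremlin_script = "g.v(vertex_id).out('knows')"

    # a roughly equivalent Cypher 2.0 statement, using a query parameter
    cypher_query = """
    MATCH (n)-[:knows]->(friend)
    WHERE id(n) = {vertex_id}
    RETURN friend
    """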


I'm working on a project that requires a tree data structure (basically a graph, but with only single-direction parent-child relationships), and the number of nodes will stay under a thousand. I could have chosen a graph database, but for my level of complexity I just used Postgres and a table that has foreign-key relationships to itself.

Then I made a Rails front end using the acts-as-sane-tree gem, which is designed to use this Postgres data model and recursive queries: https://github.com/chrisroberts/acts_as_sane_tree
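
For anyone curious what that looks like without the gem, here is a minimal sketch of the same idea in plain Python with psycopg2 (the table and column names are hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
    cur = conn.cursor()

    root_id = 1  # hypothetical root node id

    # nodes(id, parent_id) where parent_id is a foreign key back into nodes;
    # fetch the whole subtree under a given root with a recursive CTE
    cur.execute("""
        WITH RECURSIVE subtree AS (
            SELECT id, parent_id FROM nodes WHERE id = %s
            UNION ALL
            SELECT n.id, n.parent_id
            FROM nodes n
            JOIN subtree s ON n.parent_id = s.id
        )
        SELECT id FROM subtree;
    """, (root_id,))

    descendant_ids = [row[0] for row in cur.fetchall()]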


Gremlin has a custom Tree step: https://github.com/tinkerpop/gremlin/wiki/Tree-Pattern

You can use Gremlin with any TinkerPop/Blueprints (https://github.com/tinkerpop/blueprints/wiki) enabled graph database (which means almost all graph DBs).


>I could have chosen a graph database, but for my level of complexity I just used Postgres and a table that has foreign-key relationships to itself.

Good point. Relational databases have been used for BOM (Bill-Of-Material) modelling (a manufacturing application) for ages. DB records representing a manufactured product or component can have fields that point to other record(s) in the same table, which can be child components of the product. E.g. airplane -> engine, wings. Engine -> engine parts. Wings -> wing parts. Etc. And this can be recursive.

Another such example is when you want to model an employee entity, where a manager (who has employees - or reports) is also an employee.


Have you looked at the ltree extension for postgres? http://www.postgresql.org/docs/9.3/static/ltree.html

It's quite fast.


No, but it looks cool. That would impact deployability to Heroku though, yes?


It's on their list of approved extensions: https://postgres.heroku.com/blog/past/2012/8/2/announcing_su...

You should be able to run this in the database you want it in:

    CREATE EXTENSION ltree;
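
Once the extension is installed, here is a quick sketch of what querying it from Python might look like (hypothetical table and column names; the <@ operator is described in the ltree docs):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
    cur = conn.cursor()

    # paths are stored in an ltree column, e.g. 'Top.Science.Astronomy';
    # path <@ 'Top.Science' matches Top.Science and everything beneath it
    cur.execute("SELECT path FROM nodes WHERE path <@ %s::ltree;", ("Top.Science",))
    for (path,) in cur.fetchall():
        print(path)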


I had a look into the Python client library for Neo4j a year ago, and couldn't find a way to perform multiple graph writes in a single transaction, because the only API available was the HTTP one. Has that changed since?


You can use client-side or server-side Gremlin scripts for this...

http://stackoverflow.com/questions/16759606/is-there-a-equiv...

Here's how to use server-side Gremlin scripts in Python with Rexster, which is TinkerPop's open-source server that runs multiple graph databases, including Neo4j...

https://groups.google.com/d/topic/gremlin-users/Up3JQUwrq-A/...
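
If you just want to see the shape of the HTTP call, here is a rough sketch using requests against Rexster's Gremlin extension (assuming Rexster/Titan Server is running locally on the default REST port 8182 and the graph is named "graph"; see the thread above for the proper server-side script approach):

    import requests

    # run an ad-hoc Gremlin script via the Gremlin extension
    r = requests.get(
        "http://localhost:8182/graphs/graph/tp/gremlin",
        params={"script": "g.V.count()"},
    )
    print(r.json())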


As @espeed said, Gremlin will work (or just Groovy + the Java API). Cypher can handle this as of Neo4j 2.0 using the transactional endpoint. I'm not sure whether the old(er) batch HTTP endpoint kept the writes in one tx - I believe it did, though batching them in one HTTP call was frustrating.
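
For reference, here is a bare-bones sketch of hitting the Neo4j 2.0 transactional Cypher endpoint from Python with requests, batching two writes into a single transaction (assuming a default local server):

    import json
    import requests

    payload = {"statements": [
        {"statement": "CREATE (n:Person {name: {name}})",
         "parameters": {"name": "Alice"}},
        {"statement": "CREATE (n:Person {name: {name}})",
         "parameters": {"name": "Bob"}},
    ]}

    # POSTing to /transaction/commit runs all statements in one transaction
    r = requests.post(
        "http://localhost:7474/db/data/transaction/commit",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    print(r.json())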


Page 10: "Я из Одессы я просто бухаю." Translation: "I'm from Odessa, I just drink." Meaning he's drinking a lot of vodka ^_^


This is a local meme, used when someone asks you a question and you would look stupid if you don't have an answer.


Many things in the graphdb space are broken.

The TinkerPop people are pushing the Gremlin DSL/API/whatever too hard; AFAIK it is only useful in somewhat complex situations, and is more or less a nice way to write some common queries. In simple situations any language with the raw Graph API can do the job. And there are still no Python drivers for Rexster. I tried, but it was too complicated. Rexster itself is too complicated.

Neo4j, with its own query language, made things even more complicated. Instead of a “graph that can be queried with your preferred language” you get a “graph that can be queried with something that looks like SQL but is not”.

ArangoDB is nice for people who want to do full-stack JavaScript, which is not the case for people doing Python.

Also, nobody marketing graphdbs just says “it solves the general problem”. Period.

The only thing that may hold you back from using graphdbs is performance, but in a lot of situations you don't care, especially in situations where you want to be flexible and move fast. That's where graphdbs shine. Of course there is also the graph/tree problem-solving space, but that is taken for granted.

GraphDB vendors market the specialized-database aspect of graphdbs a lot; nonetheless, graphdbs are good even for solving generic webdev problems.

Also, if you are looking for a graph database server that does just that, and where you can query the graph in Python 2.7 (or Scheme), have a look at: https://github.com/python-graph-lovestory/Java-GraphitiDB


please explain the downvote.


The last graph database that I used for a large project was Virtuoso, which wasn't mentioned in the slides. It's worth a look:

http://www.openlinksw.com/dataspace/doc/dav/wiki/Main/


graph-tool is a Python library for graph handling:

http://jugad2.blogspot.com/2013/01/graph-tool-python-module-...



The one thing I thought was missing from the Python tooling for Neo4j was that, because *.cyp files are so new, they aren't yet handled by the standard documentation toolchain.


I tried Neo4j, and I find it handy using Python and py2neo. Since my laptop is limited in memory, I couldn't visualize graphs properly from the web interface.


What is the secret ingredient of graph databases? The presentation linked from this one mentions physical addresses instead of IDs. I get that that would be a speedup, but I would expect it to be more like a constant factor?

Then maybe you can save all links from a node in the node itself, so you can get all the links with one read access. Fine. But as soon as you get to the second or third level, I would expect the magic to be gone. Say every node has 100 links. OK, so the first 100 links you get in constant time c. But to get the second level, you already need 100 requests (one for each node and its attached link list). So 100c time. For the third level you need 10000 reads, 10000c time. The next level would be 1000000 requests.
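
Just to spell out that arithmetic (a toy calculation, nothing database-specific):

    # node reads needed per traversal level with a branching factor of 100
    branching_factor = 100
    for level in range(1, 5):
        reads = branching_factor ** (level - 1)
        print("level %d: %d node reads" % (level, reads))
    # level 1: 1, level 2: 100, level 3: 10000, level 4: 1000000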

Just saying I'd expect things to get ugly with a graph database pretty fast, too (not as fast as with a relational db, but still).

I haven't really coded a big graph-based app, but my expectation would be that to get really good performance, a hand-coded solution would always be required: for example, trying to squeeze as much of the relevant data as possible into memory in a compressed way. Am I wrong?

Oh, and I'm also not sure how good relational DBs are at query optimization. Just because the visible model is "one row per link" doesn't mean the DB couldn't do some intelligent caching internally.


Data models that use deep "JOIN"s are way faster on a graph database. You're right about the branching factor: if you always traverse all relations, any database will be slow. In most cases, however, you don't.



