New Query Language for Graph Databases to Become International Standard (neo4j.com)
290 points by Anon84 on Sept 18, 2019 | 166 comments



There is already an ISO-standardized graph query language: Prolog, and its decidable fragment Datalog, both in (relatively) wide use for decades. Will the new language be based on them?

Another question is whether the model that drove ISO standardization in the past (software vendors working together to create a large, visible, and diverse market) is still relevant in post-standard cloud times. I sure hope it is, but we haven't seen public demand for standards (with the exception of web standards) for well over a decade now.


Kind of a tangent, but there are some folks working on a "model-driven graph database" written in Prolog, TerminusDB:

https://github.com/terminusdb

https://medium.com/terminusdb

See also: Categorical Query Language (CQL) https://www.categoricaldata.net/

Not Prolog, but it's a mathematical treatment of DBs with Category Theory.

A paper by these folks was mentioned in a sibling comment ( https://news.ycombinator.com/item?id=21005452 ): "Algebraic Property Graphs" by Joshua Shinavier and Ryan Wisnesky (submitted on 11 Sep 2019). Last week!

> In this paper, we use algebraic data types to define a formal basis for the property graph data models supported by popular open source and commercial graph databases. Developed as a kind of inter-lingua for enterprise data integration, algebraic property graphs encode the binary edges and key-value pairs typical of property graphs, and also provide a well-defined notion of schema and support straightforward mappings to and from non-graph datasets, including relational, streaming, and microservice data commonly encountered in enterprise environments. We propose algebraic property graphs as a simple but mathematically rigorous bridge between graph and non-graph data models, broadening the scope of graph computing by removing obstacles to the construction of virtual graphs.


I think the cloud times renew the need for standards. Ten years ago, devs in the Linux/BSD world had the vast majority of the code that would make up our stack, from the OS kernel to the front end (with perhaps some proprietary driver or external API call here or there).

The code was the standard, which traded easy readability for complete accuracy and transparency.


> The code was the standard

That's emphatically not what a standard is about. Standards in the field you described were developed over many years by proprietary Unix vendors (including BSD as descendants of the original TCP/IP stack), enshrined in POSIX/SUS (also published as Open Group and ISO standards), and then implemented on Linux. Apart from early RedHat stuff such as PAM and things like LSB, it is only since about 2005 or so that Linux has dominated the market through de-facto implementations rather than standards (such as bash, Docker-style containers based on Linux namespaces, etc.)


Of course that's not what a standard is about, but I believe it's what happened for technologies during the time that standards fell off.

I identified 2009 as about when you expected de facto open source implementations instead of any talk of standards, but I would buy 2005 as well. Obviously it's a fuzzy boundary.


Very true. Any graph query language that's not clearly derived from Prolog is unworthy of the title. Prolog is a lousy way to store data, but it's by far the most elegant way to query a database, and the only sensible way to query a graph.


Standards happen when the time is right. Property graphs have been in the making and maturing for the last 10+ years, driven by Neo4j, other vendors, and the community, and they are here to stay. It's a sign of success that the SQL standards committee recognized this development by starting the GQL project.


Ha. When I was reading Prolog books this year, I couldn't help but think how cool it would be to hook it up to a Neo4j backend. With your comment, I guess I wasn't the only one thinking this.


I spent many years working on a Neo4j Ruby gem and really grew to love Cypher, their query language. I always found it phenomenally expressive, one of those things that seems simple (almost trivial) at first but is extremely flexible and powerful when you need more from it. It's highly readable, easy to teach, and intuitive in a way that I never found SQL to be.

It's been years since I've worked with the product and while I don't miss Neo4j, I do miss the query language. It's a little unclear to me how GQL will incorporate Cypher but I hope the initiative is successful if for no other reason than a selfish one: I'd love Cypher to be around if I ever wind up using a GraphDB again.


> It's been years since I've worked with the product and while I don't miss Neo4j, I do miss the query language

Would you mind expanding on what you don't miss about Neo4j?


I was trying to use it instead of Postgres as a general database and it just didn't shine in that role. The value proposition made sense on paper: it's a more natural way to reason about data than RDBMS, relationships as first-class citizens offer some really great benefits, avoiding the schema seems reasonable when the rest of your stack is dynamic, Cypher is more pleasant than SQL. While some of that might be true, I encountered the following:

- Using a somewhat esoteric database increased my devops responsibilities.

- Open source license didn't allow hot backups.

- Ruby is just slow. I was using it before the binary protocol was implemented so performance was often lousy compared to Postgres.

- Writes were slow. Neo4j seemed to excel at reads and, in particular, the kinds of reads that are best handled by a graph. Shocking, I know.

- I really didn't have enough data or the need for the kinds of queries that would really let it shine.

- A big part of the culture at meetups and tutorials was based on "DON'T YOU JUST HATE JOINS!?" It felt culty. I was a big part of that problem for a long time. The last meetup I went to just felt like people bashing Postgres.

So I think that a lot of the things that I don't miss are because I wasn't right for it, I should have just been using Postgres.

All that said, there was so much about it to love. Cypher, as already mentioned. While the culture at meetups might have kind of relied on bashing SQL and RDBMS, the company itself was full of brilliant, generous, creative, wonderful people. Their CEO (Emil) in particular is a total sweetheart. The organization was extremely supportive of open source projects, very inclusive, very eager to get people involved, and it made a big impact on me.

So, again, while Postgres is the tool I always reach for first, I think that Neo4j as an organization does great work and as a product has a lot to offer if you're the right fit for it. I'm glad to see they're doing well.


I've built an app using Neo4j and have the same impression. Cypher is such a good query language. I wish I could use it over SQL in pretty much every situation.


So, is there a draft spec yet? I can't find anything.

Also, the name is of course justified, but it will be a mess to search for due to (Facebook) GraphQL.


Seriously, people should not choose names which are already taken in the same general area of technology. Even if it "makes sense". The whole point of naming things is to refer to things more or less unambiguously.


> Also, the name is of course justified, but it will be a mess to search for due to (Facebook) GraphQL.

Google's GQL [0], which is older than either, doesn't help searchability either.

[0] https://cloud.google.com/datastore/docs/reference/gql_refere...


The GQL project just started. Since it is going to be an ISO standard, the specification is only available to members while under development, but may be purchased from ISO once final (same as for SQL, though you can find copies on the internet).

GQL will be a declarative language in the spirit of existing property graph query languages like Cypher, so that gives you an idea. I'm sure as the project proceeds, various artefacts (software or otherwise) will become freely available.

If you want to dig deeper, gqlstandards.org links to some documents that are copyright Neo4j and have been submitted to the standards process.

https://drive.google.com/drive/folders/16CUhVI1PQ4hBlhD80_Ys...


"Purchased"? Really? Hmmm.

However, ISO does have a store that prominently features "best selling standards": https://www.iso.org/store.html


That's nothing new. Draft versions are usually released for free, but the final ISO standards have always required fees for access, and the copyright for the documents belongs to ISO. A standard for the C programming language, for example, will run about $200. Average developers don't usually need access to the standard, as they will be working with an implementation of the standard anyway (which should provide its own documentation).


Many standards are also available via the national standards body of your locale (e.g. ANSI), often at a cheaper rate.


Also, Google has a query language with that name for some of its cloud data products: https://cloud.google.com/datastore/docs/reference/gql_refere...


Or even a single code sample? It sounds like they've simply decided that "the international standard will be called 'GQL'" but have no idea what the language will be (or they're keeping it a secret, which would be even more troubling).



Is GQL a new name for Cypher?


Sort of. ISO GQL is largely about taking the best ideas from Cypher (and other graph query languages) into a language overseen by an international standards body backed by multiple companies and implementations.


Can we have a search uniqueness tag, like 'GQL19'?

We have search terms, metadata keywords, and URLs, but no short, human-friendly tag for search uniqueness?

Seems like there should be some standard, or at least a convention, given that it could be opt-in. Any site or content that referenced the tag would benefit, and anything that didn't would still work as well as usual.


Just say ISO GQL


Is GQL already used elsewhere? As long as nobody calls it GraphQL we should be fine, right? Nobody says structured QL or even structured query language; it's just SQL.


Yes, it's used by a lot of GraphQL libraries:

    graphql-python/gql: A GraphQL client in Python

    graphql-python/gql-next: A Python GraphQL Client

    grooviter/gql: Groovy GraphQL library
 
    99designs/gqlgen: go generate based graphql server


I read the entire article and came away mistakenly thinking this was the same thing as GraphQL.


.gql extensions are already associated with GraphQL in some IDEs.


> Is GQL already used elsewhere?

GQL (for “Google Query Language”) is the SQL-like query language for Google Cloud Datastore, and has been around since it was just the App Engine datastore.


That's quite an unfortunate name clash with the existing GraphQL language in a similar domain.


I think GQL from Neo4J has been around for quite a while before GraphQL.


> I think GQL from Neo4J has been around for quite a while before GraphQL.

Cypher has been, GQL has not.

GQL from Google has, but that's another problem with the (EDIT: naming of the) new GQL.


What do you think anyone should call a graph query language done by the SQL committee?


SGQL? SQL-GRAPH? G?

Or, heck, Cypher. (With SQL itself, and most other languages, ISO kept the name of the existing language it was standardizing, even though the adopted spec didn't necessarily match exactly what existed previously.)


What's wrong with SPARQL? What advantages does this have over SPARQL?


Very few real advantages on a technical level.

Practically, the property graph model has many syntactic advantages over the equally expressive RDF+reification. (See the excitement over RDF*, which can be pure syntactic sugar.) Syntax is important for usability.

I don't believe that property graphs in Cypher or GQL will be significantly more expressive than SPARQL 1.1. And in any case both are quite different from the TinkerPop model.

I believe it is essential for Neo4j as a growing company that they move beyond their own Cypher to something that is more defined and, critically, allows them to check a "we are a standard" box on big deals. openCypher has solid adoption but lacks coherence between implementations, i.e. same data, same query, different result.

Still, a more grounded GQL will allow Neo4j competitors to gain on them.


> Still, a more grounded GQL will allow Neo4j competitors to gain on them.

Can you expand on this? What do you mean?


Currently it is relatively difficult to move data from one (open)Cypher implementation to another. Also, since feature support is uneven, it is not so simple to get started on Neo4j and then evaluate e.g. TigerGraph if you find that Neo4j is not ideal for your use case.

If your application only uses GQL, then you could start on Neo4j and, after two years in production, cheaply move to a competitor: you just switch your backend and run your test cases. Your data does not change and your queries remain the same, so evaluation is relatively straightforward.

I see this quite often in the SPARQL world. The first few years of a product are with engine A, then some annoyances show up in production. Engines B-D are evaluated and a different one is chosen. Or sometimes both are run at once for different query workloads on the same data, which is relatively cheap in the SPARQL environment. But in the property graph world, the custom engineering needed to move from engine 1 to engine 2 makes it much more expensive in engineering hours.


Basically, Neo4j thrives on vendor lock-in, and standards lower the lock-in.


It would be nice if there were a layered language approach, with a Datalog/SPARQL-like core language specified first, then the sugary languages that target it.


SPARQL is purpose-built for the RDF world, where you're mixing and matching a zillion different vocabularies, all of which have to be painstakingly declared and namespaced every time.

For most of us not working on the "semantic web", we typically have only 1-2 vocabularies (our data model), and SPARQL is super clunky to use.


I don't suppose you could give me an example that demonstrates the clunkiness? I have not worked with SPARQL myself, but I'm very curious as someone who's been leveraging a graph database (dgraph)


I don’t suppose either, because an equivalent example in SPARQL would not be more clunky.


I can read Cypher having almost zero experience with it; I have no idea what a slightly-more-than-trivial SPARQL query does.


SPARQL requires buying into the world of the semantic web even when all you want to do is store and query graph data.

Also, property graphs wouldn't have managed to get the traction they have if SPARQL had been sufficient. SPARQL simply suffers from being designed in a way that does not sufficiently address the needs of application developers, in expressivity and ease of use, let alone in allowing easy migration of existing relational data by sharing the same type system with SQL.


Not true at all. You can query an RDF dataset with SPARQL and not have any RDFS/OWL schema. A schema/ontology/vocabulary gives you a domain model, but it’s optional.


Property graphs are mostly popular due to Neo4J marketing enabled by VC dollars. W3C has no such backing.


SPARQL doesn't allow unbounded recursive queries (property paths only cover transitive steps over fixed predicates).


SPARQL suffers from Not Invented Here Syndrome.


SPARQL is for RDF data, which not every graph database conforms to.


And every graph database conforms to Cypher or GQL? These are not even standards, at least not yet.


SPARQL is not as easy to read.

I can show SQL or Cypher to some of my product managers who have experience with Excel, and they actually sort of get it. That’s not the case with SPARQL.


That is more a question of how you format your queries. I can show SPARQL to our data analysts and biologists and they get it. Equivalent Cypher is super confusing because the queries tend to be quite large, and the separation between the MATCH and WHERE sections, as done in the queries I have seen, makes them hard to follow.

  MATCH (p:Person)
  WHERE 3 <= p.yearsExp <= 7
  RETURN p
or

  PREFIX :<my_business_vocab>
  SELECT ?p
  WHERE { ?p a :Person ;
             :yearsExp ?ye .
          FILTER (3 <= ?ye && ?ye <= 7)
  }
Yes, there is a prefix. But if well used, prefixes can be super useful. Filters interspersed with the graph pattern help once the queries get very complicated.

I have never seen a real study showing Cypher to be easier to read than equivalently well-formatted SPARQL.

  MATCH (j:Person {name: 'Jennifer'})-[:LIKES]->(:Technology {type: 'Graphs'})<-[:LIKES]-(p:Person),
      (j)-[:IS_FRIENDS_WITH]-(p)
  RETURN p.name
or

  PREFIX :<my_business_vocab>
  SELECT ?pName
  WHERE {
   ?j a :Person ;
      :name          'Jennifer' ;
      :likes         ?graphTech .
   ?p a :Person ; 
      :name          ?pName ;
      :likes         ?graphTech ;
      :isFriendsWith ?s .

   ?graphTech a :Technology ; 
      :type          'Graphs' .
  }
The Cypher is difficult in this example because it goes in both directions in the linkage: the first part is read left to right, the second right to left. That makes it hard to decompose, and the < or > in the directions is easily missed once the queries get larger.

  PREFIX :<my_business_vocab>
  SELECT ?pName
  WHERE { ?j a :Person ; :name 'Jennifer' ; :likes [ a :Technology ; :type 'Graphs'] ; ^:likes ?p . 
          ?p a :Person ; :isFriendsWith ?j }
Would be a very close version, layout-wise, to the earlier example from the Cypher tutorial. But I don't think anyone would write that without spending some serious thought on obfuscation ;)


FWIW, in your second example I was able to understand the Cypher query without so much as a double take, and I still don't quite get the SPARQL query.

I really have major difficulties imagining a query that would be easier to read in SPARQL than in Cypher.


Ok, fair enough. I suspect the typo in the SPARQL query doing the key join isn't helping :(

The SPARQL is more paragraph-based. We are looking for ?j, a :Person named 'Jennifer', and ?p, a person who is a friend of ?j. Both need to be interested in graph technology.

This kind of section-by-section query definition ends up being easier when there is a multitude of relations between the nodes, e.g. friends, colleagues, and past non-identical lunch preferences.


I have never seen either of those languages and Cypher is much easier to understand.


There is something really special about the graph database space. For as long as the space has been around (15 or so years), every vendor and dedicated practitioner has taken solid jabs at trying to realize "the best way" to think about graph traversals.

This behavior seems particular to the graph space (vs. document, wide-column, relational, key/value, etc.). While this speaks to the complexity of the type of problems you can solve with graphs, thinking back, I believe this was a cultural anomaly. When it was Neo4j, OrientDB, and TinkerPop, the language trifurcation occurred.

I'm excited that Neo4j is continuing to take the query language seriously. In an age when software development is about making it easy for the 90% of developers out there with REST APIs, GraphQL, and overly SQL'd embeddings, ... graph is still searching for "that best way."

I, personally, have moved on from the language level. However, our new work is going to help my fellow data system colleagues get their languages exposed to as many developers as possible, regardless of data model. It is important to me that people come to respect the numerous ways in which we think about data, as the language we use is so important. The difference between living in Plato's Cave or not.

In an effort to support query languages in general, I'll be working on mm-ADT, designing a new cluster-oriented virtual machine architecture for storage, processing, and query language developers. I see a veritable Tower of Babel on the horizon!

Congrats, Neo4j, on reaping the benefits of your hard work. I hope our work will converge for a positive collaboration in 2020.


For some reason you neglect to mention the RDF graph data model and its SPARQL query language, which have been W3C standards since 1997 and 2008, respectively.

They have a healthy ecosystem of both open-source and commercial software, and unlike property graphs were designed with data interchange in mind from the outset.

SPARQL was the first and still is the only standard NoSQL query language.


> , which have been W3C standards since 1997 and 2008, respectively.

While RDF forms a graph, it is not a general graph model. In particular, two things that are enabled by graph databases (having data in the node itself, and associating data with an edge) are not directly possible and need tricks such as relationship reification.

On the social side, the sectarianism and bikeshedding tendencies of the community can be another turn-off.


RDF is a directed graph model. Nowhere does it say that edges have to have properties in a graph model. You can always model them as intermediary nodes.


Using intermediate nodes changes the structure of the graph. RDF is not a general graph model. The fact that a node cannot hold data by itself (a link to a literal is required) is risible.

Naming nodes with IRIs is also ridiculous, especially in Linked Data, where changing your hosting domain (or even the protocol used to access the data) requires changing the data itself.


Are you kidding? IRIs are what set RDF above and beyond other data models. Global identifiers are crucial if you want to work with data interchange on a web scale.

Show me a definition of a "general graph model"? Nothing shows up on Wikipedia.

There is on the other hand a directed graph model: https://en.wikipedia.org/wiki/Directed_graph This is exactly what RDF is, with labeled vertices and edges.


You can have global identifiers without IRIs, for example by using GUIDs (that's the solution used for COM components, for instance).

The thing is, you can't represent "A"->"B" directly in RDF. You need to stretch the representation to at least use a resource node as the subject, and for the object too if that node could be used as a subject in another predicate. And of course both of these resources must be encoded as IRIs and then linked to the literals that contain the data. That's four nodes and three links instead of two and one. It's a lot of boilerplate for a simple problem. And as said before, putting a key:value on an edge requires reification of the edge... which undermines the simple SPO model because it's now SP(O as P)PO.

You may think it's the same but it's not. Usability matters, and that's why the SemWeb stack has almost zero adoption since its inception while graph databases are trendy.


> SemWeb stack has almost zero adoption since its inception while graph databases are trendy

Where is my LOL GIF :D

RDF knowledge graphs are used by Elsevier, Bloomberg, Thomson Reuters/Refinitiv, Uber, Zalando, Microsoft, Apple, Amazon, HSBC... do I need to continue?

If you mean "mainstream developer" adoption, that is a poor and irrelevant indicator, because mainstream is not where the innovation happens.


It just wasn't "the scene" at the time. RDF/SPARQL had its world -- permeated throughout academia and some enterprise deployments with quad/triple-stores like AllegroGraph. But it wasn't going to make a PoW in the software industry because they didn't ride the razor-edge well enough.

OWL was a foolish mistake. The triple is a clean and simple idea, but in practice, reification and URI character hell become mind-numbing. The RDF guys did nail it with SPARQL. That is such a pretty query language. Simple, intuitive. I haven't studied the recent path expression advances. I should.

But yea, lots of ways of fugglin' with the data.


Why did TinkerPop's Gremlin not work out? Anyone have a summary of the discussion from a language design perspective?

A lot of the Google-able references talk about how Gremlin is more optimizable than Cypher, etc.


Gremlin was written by a genius-level developer to be used by other genius-level developers. There are maybe a handful of Gremlin experts in the entire world and fewer than 100 people who are any good at it.

It is extremely powerful, but after a few lines, the mental acrobatics needed to understand what the query does is beyond your average developer.

My first paid Neo4j gig 7 years ago was writing a rules engine in Gremlin. It was about 25 lines of code. If you were to ask me today what each of those lines did, I would be at a loss. So would anyone who didn't live in those specific queries day in and day out.

Graph adoption was severely limited by this. Cypher can be learned in a day, and "business people" can look at a Cypher query and understand what is going on, for the most part.

It takes about a week to "bolt on" Gremlin to any database. I've done it myself; that's why you see it so often. It takes months to be any good at it.


Far from a genius myself, I've quite enjoyed using Gremlin and the Apache TinkerPop project. If you start thinking of it as a functional language, I believe that helps clarify quite a bit of what is going on.

Unfortunately when using it with a distributed backend (Cassandra, for example), having to write query templates to take advantage of bit-wise comparisons for parameterizing, all in Groovy, was extremely painful mostly because I found Groovy to be very awkward.

But there are language-native protocol implementations of Gremlin now.

The little self-contained Apache TinkerPop project was always fun to play with toy graphs in.

And there was an ambitious project to implement the TinkerPop engine on top of a Redis backend (akin to RedisGraph, which is now an official module using Cypher), but it is far from a straightforward project. There is even a TinkerPop implementation on top of PostgreSQL which looks interesting.

I'm not sure this is a death blow for TinkerPop so much as a marketing coup de grâce for Neo4j. They have a strong product (despite the index-free adjacency!) and an even stronger branding behind it. Anything that brings graph DBs/technologies into the mainstream is always nice, though. Not that everything in the world is a graph problem... but surely there are plenty that can be classified as such.


The Gremlin traversal language is one piece of a complete database query as run by TinkerPop's Rexster database. You can see it as a lazy sequence or stream API (think SRFI-41 or R7RS Scheme generators) with sugar syntax optimized for property graphs.

To take complete advantage of TinkerPop Rexster you really need to embed the Gremlin DSL inside a Turing-complete language (like Groovy) and execute that.

I think Gremlin failed because of a) the SQL-like look of Cypher queries, b) the long-running and massive marketing campaign by the company behind Cypher, and c) the TinkerPop developers being hired by the company behind Cassandra, after which TinkerPop (and JanusGraph) lost momentum.

All these narrow data expert systems that persist data on-disk (!) are doomed to fail! The future is ordered key-value stores and multi-model databases with ACID transactions.


Gremlin is not a failure; it is supported by far more databases (e.g. the latest one from Microsoft https://docs.microsoft.com/en-us/azure/cosmos-db/graph-intro... ) and has far more users than openCypher. It is far faster on average.

It's imperative while Cypher is declarative. Mostly: if you want the most performant and expressive language, choose Gremlin. If you want the easiest one, and what you implement is standard and not very complex, then use Cypher.


Gremlin is everywhere; it's a horrible name and sort of difficult to reason with at first, but it works, it's fast, and there are a lot of databases that have support for it.

I do prefer the declarative approach that cypher uses, but for most things gremlin got the job done easier and faster.


I have a 101 level of experience with both. Cypher is amazingly intuitive and simple; Gremlin, not so much. Just from a dumb user perspective (like mine), Cypher felt more like Python, Gremlin more like C++. Both are great, just with different learning curves and entry bars.


I have used Gremlex (Gremlin in Elixir) for querying a DB that supports Gremlin (Neptune) and found it really pleasant.

https://github.com/Revmaker/gremlex


I'm rather uninitiated on this... what's the difference between a graph database and a traditional relational database that makes them need different query languages?


SQL generally doesn't have primitives that let you traverse arbitrary relationships. You can join multiple tables together, but the nature of those relationships is expressed directly in the SQL statement you're writing.

Some RDBs provide extensions to the SQL standard to do some basic graph traversals (Oracle's CONNECT BY, for example), but these cover only the most basic sorts of graph traversals.
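
For a flavor, a minimal CONNECT BY sketch (over a hypothetical employees(employee_id, manager_id) table, not from this thread):

    -- Oracle hierarchical query: walk the reporting chain from the root down
    SELECT employee_id, manager_id, LEVEL AS depth
    FROM employees
    START WITH manager_id IS NULL
    CONNECT BY PRIOR employee_id = manager_id;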

In a graph database, the primary notions are nodes and edges and properties associated with them, whereas in a relational database, the primary notions are tables (aka "relations"), foreign keys that allow you to connect together multiple tables, and rows in those tables. Those differences in primary notions surface in the query languages as well.


> Some RDBs provide extensions to the SQL standard to do some basic graph traversals

This has been standardized since SQL:1999.


Are you referring to recursive common table expressions? These do allow graph queries. They are quite different to the proprietary extensions, though, and much more awkward.

I would love to see SQL grow a concise navigation/join syntax like JPQL's [1]; imagine writing:

  select e.name, e.department.head.assistant.name from employees e
Instead of:

  select e.name, a.name
  from employees e
  join departments d using (department_id)
  join employees h on d.head_id = h.employee_id
  join employees a on h.assistant_id = a.employee_id
Given that dots already mean something in SQL, maybe it would have to be:

  select e.name, e->department->head->assistant.name from employees
Once you've got that, it would not be a huge leap to add a recursive version of this operator.
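
For instance (purely invented syntax, assuming a hypothetical self-referencing manager FK):

  select e.name, e->manager*.name from employees e  -- all transitive managers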

[1] https://www.objectdb.com/java/jpa/query/jpql/path#Navigation...


> I would love to see SQL grow a concise navigation/join syntax

Yeah, what you need is a specification of a name to use for the referenced table in implicit joins: when you specify an FK and later reference the linked table, it performs the join. (To bring in a different subthread, such an implicit join specification could also be the mechanism for opting in to index-free adjacency on that FK, so you'd combine syntactic support with the supporting query optimization.)

Maybe something like:

  REFERENCES <table> [(<key columns>)] AS <link-name>
So, your example would be supported by (assuming all the FKs point to PKs, so you don't need columns specified on the FKs):

  FOREIGN KEY department_id REFERENCES departments AS department

  FOREIGN KEY head_id REFERENCES employees AS head

  FOREIGN KEY assistant_id REFERENCES employees AS assistant
...on the relevant tables.


Yet SQL still hasn’t standardized “CREATE INDEX” yet, has it? How many databases have implemented graph traversals, and do they even use the same syntax?

The various implementations of SQL are so different from each other, and from the spec, that I no longer find it useful to consider them the same language.

Is there a caniuse.com for SQL?


> How many databases have implemented graph traversals, and do they even use the same syntax?

Recursive CTEs with standard syntax are in virtually every current-version notable SQL-based RDBMS, including SQLite. I think MySQL 8.0 was the last of the major ones to add support.


CREATE INDEX is fundamentally an implementation detail of the RDBMS engine; there can be no meaningful standardization there.


> Is there a caniuse.com for SQL?

Not exactly the same, but: https://modern-sql.com/


Thanks for the response. I'm having trouble seeing what kinds of queries come up of this nature that SQL wouldn't be fit for, though. What would be an example of a realistic graph query one would want to do that would be too difficult to do in SQL?


Anything where you need transitive closure, that is, you need all the nodes that are reachable from a given starting node.

You can do it in SQL dialects that support recursive common table expressions.

In a previous job, I dealt with Bill of Materials data, which is basically a parts list for a hardware product. Parts contain other parts, which contain other parts, and so on, recursively.

The data was in SQL Server. I could have moved the data to Neo4j and gotten a much more concise and convenient syntax for traversing the parent-child relationships. But I decided it wasn't worth the trouble of moving data, so I bit the bullet and just wrote the recursive CTE in SQL.
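
A hedged sketch of what such a CTE can look like, over a hypothetical bom(parent_part_id, child_part_id) table (assuming the BOM is acyclic; SQL Server spells WITH RECURSIVE as just WITH):

    WITH RECURSIVE parts_below AS (
        -- direct children of the root part (42 is a made-up id)
        SELECT child_part_id FROM bom WHERE parent_part_id = 42
        UNION ALL
        -- children of anything already found
        SELECT b.child_part_id
        FROM bom b
        JOIN parts_below p ON b.parent_part_id = p.child_part_id
    )
    SELECT DISTINCT child_part_id FROM parts_below;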

That's the biggest problem with graph databases: if your data's already in a SQL database, it often isn't worth it to _move all your data_ just to get a better query syntax.


Huh, interesting, thanks!


Let's assume we're playing the Kevin Bacon game. We've got actors, movies, and roles. Here's some sample data:

    INSERT INTO actors (id, name) VALUES (1, 'Kevin Bacon')
    INSERT INTO actors (id, name) VALUES (2, 'Frank Langella')
    INSERT INTO actors (id, name) VALUES (3, 'Cameron Diaz')
    
    INSERT INTO movies (id, name) VALUES (1, 'The Box')
    INSERT INTO movies (id, name) VALUES (2, 'Frost/Nixon')
    
    INSERT INTO roles (movie_id, actor_id) VALUES (1, 2)
    INSERT INTO roles (movie_id, actor_id) VALUES (1, 3)
    
    INSERT INTO roles (movie_id, actor_id) VALUES (2, 1)
    INSERT INTO roles (movie_id, actor_id) VALUES (2, 2)
Given the above data model, it's not possible to select all the actors that are within 6 degrees of Kevin Bacon with a SQL statement.

(Data from the incredible Oracle of Bacon, which uses custom graph traversal to serve up results: https://oracleofbacon.org/how.php)


This is an elegant and simple formulation of the problem and shows how bad SQL is at this. However, you can do this in SQL.

    with recursive bacon_movies as (
        select
            a.id as actor_id
            , 0 as bacon_degree
        from
            actors a
        where
            a.name = 'Kevin Bacon'
        union
        select
            r1.actor_id
            , b.bacon_degree + 1 as bacon_degree
        from
            bacon_movies b
        join roles r on r.actor_id = b.actor_id
        join roles r1 on r.movie_id = r1.movie_id and r1.actor_id <> b.actor_id
        where b.bacon_degree < 6
    )
    select
        a.name
        , min(b1.bacon_degree) as bacon_degree
    from bacon_movies b1
    join actors a on a.id = b1.actor_id
    group by a.name
    order by bacon_degree asc;
...not saying it's a good way to do this


> ...not saying it's a good way to do this

It really depends. If you do these kinds of queries all the time, it's a good idea to use a graph DB, but if these queries are rare, it often doesn't make sense to add a new technology to your project.


Maybe I have the brainworms from too much SQL, but that seems pretty fine? It's an idiomatic CTE that more or less directly translates back to the problem statement. (Nitpick: I'd exclude already-found rows from the union to do shortest-path in the CTE instead of doing it later.)


jfc... SQL :(

I've written thousands of lines of Cypher queries. I just can't do SQL though; it's not like I refuse either... it's that my brain just doesn't get it.

Practice would surely help. I took an intro class once, and that improved my skills a lot (i.e. 0 -> 0.1)... it taught me about the math of what's being done and gave me some ways to think about it... but it all just seems so incredibly tedious.


It may not be obvious, but it should be possible. It's relatively easy to get those within 1 degree of separation; put that in a view, then repeat, but instead of hardcoding Kevin Bacon do "within 1 degree of those within 1 degree of Kevin Bacon". Repeat a few more times and you have your query, as sketched below.

Of course, if you want to know the degree between two random actors, or all actors within n degrees of Kevin Bacon (where n is unknown ahead of time), then it becomes really difficult (though potentially still possible using common table expressions).
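
A hedged sketch of that fixed-depth approach against the actors/roles tables from upthread (the view names are made up; each additional degree needs another view):

    CREATE VIEW within_1 AS
    SELECT DISTINCT r2.actor_id
    FROM actors kb
    JOIN roles r1 ON r1.actor_id = kb.id
    JOIN roles r2 ON r2.movie_id = r1.movie_id AND r2.actor_id <> kb.id
    WHERE kb.name = 'Kevin Bacon';

    CREATE VIEW within_2 AS
    SELECT DISTINCT r2.actor_id
    FROM within_1 w
    JOIN roles r1 ON r1.actor_id = w.actor_id
    JOIN roles r2 ON r2.movie_id = r1.movie_id;
    -- ...and so on, one view per degree; an unknown n is where this breaks down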


Yes -- put more clearly, it's not possible in the relational calculus to fetch the number of degrees of separation between two actors.


It’s possible with recursive CTEs, but very slow.

https://stackoverflow.com/questions/52674380/improving-postg...


There's nothing essentially, inherently slower about recursive CTEs in SQL databases compared to graph database queries; it depends on how they're implemented, particularly whether they take advantage of indexing and caching. They're both going to be doing fundamentally the same type of algorithm under the hood.

I've gotten recursive CTEs in SQL Server to perform quite well just by indexing the join key columns.
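
Concretely, something along these lines (a hypothetical edges(src_id, dst_id) table, not my actual schema; INCLUDE makes the index covering in SQL Server and Postgres 11+):

    -- cover both directions the recursive join seeks on
    CREATE INDEX ix_edges_src ON edges (src_id) INCLUDE (dst_id);
    CREATE INDEX ix_edges_dst ON edges (dst_id) INCLUDE (src_id);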


It's inherently slower because of the explosion in join table size as your depth increases and the slowness of using b-tree indexes instead of direct pointers to neighbor nodes ("index-free adjacency") or the newer sparse matrix transforms (GraphBLAS) that graph databases use.

What kind of database sizes have you used in your SQL Server experience? The linked StackOverflow question uses Postgres with 1M records, 50M relationships, and does index the join key columns. Maybe I'll have to replicate it on SQL Server in case there's some secret sauce but I'm highly skeptical.


Here's a super-basic test using MySQL CTEs w/1.6M nodes and 30M edges that I did a few years ago - https://intertubes.wordpress.com/2017/11/28/benchmarketing-n...


There is also this [1] paper claiming that

> [...] solutions using relational database technology [...] can offer performance superior to that of the dedicated graph databases.

[1] https://event.cwi.nl/grades2013/07-welc.pdf


Note that this is an old paper. They are using Neo4j 1.8.2 for this (and the community version at that), which was released in February of 2013.

Neo4j is at 3.5.9 currently, with 4.0 on the horizon, so there have been 2 major releases, nearly 3, since the paper, and dozens of minor releases and patch releases.

I'm sure Oracle has been making improvements as well (all of the authors are from Oracle), so in any case the paper itself is too outdated to be useful.


The benchmark [1] that Neo4j uses in its advertising [2] is from 2014, so not much newer either.

[1]: https://github.com/opencredo/neo4j-in-action

[2]: https://neo4j.com/business-edge/connected-data-cripples-rela...


"the slowness of using b-tree indexes instead of direct pointers to neighbor nodes"

If you don't use a b-tree, you still need some other in-memory data structure like a hash table to hold the pointers. "Direct pointers" don't help you if you have to read a node from disk before you can dereference its pointers.

In my case, I was working with a graph consisting of about 10M nodes and 35M rows of edges (parent-child relationships, where a parent can have many children and a child can have many parents). Many of those 10M parts weren't actively used anymore, so a minority of nodes had a majority of the relationships. I don't have access to the data anymore, but I recall being able to retrieve the transitive closure on a subgraph of ~10k nodes in 5-10 seconds, and the entire graph in about 7-8 minutes. This was on a small, possibly underpowered SQL Server instance. It's possible that Neo4j would do it faster, but I decided it was not worth moving all my data into a different database and introducing that ETL latency, to improve upon performance that was already acceptable for the business need.

I have heard anecdotally that SQL Server's CTEs are more performant than PostgreSQL's, but I don't have evidence of this, and it could be outdated if Postgres has made improvements since I heard that.


In Neo4j at least the node ids are offsets into the node and relationship stores, so you are literally pointer hopping through the store files from node structure to relationship structure to node structure. No need for a hash table or b-tree index (excepting finding your starting nodes in the graph before beginning traversal.)


Wouldn't that make deletes super expensive, similar to deleting nodes from a doubly-linked list on disk in terms of complexity? How could it delete a node with millions of relationships if it needs to read all those blocks from disk to traverse all the pointers?

And doesn't it cause more cache misses when the on-disk pointers refer to nodes that are spread out across different file blocks?

And why does Neo4j have indexes if it claims to have no need for them? https://neo4j.com/docs/cypher-manual/current/schema/index/

I don't know how Neo4j is implemented, but I'm skeptical that it's purely index-free adjacency, I suspect there is some hybrid data structure backing it.


Deletes do have an extra cost, as relationships need to be deleted first, and there are some batching approaches for handling this case. For graph databases there's not much choice here, unless you want to deal with dangling relationships (and the resulting inconsistencies).

Note that deleting a node does not have to create new relationships between the adjacent nodes, so it's not quite like deleting nodes from the middle of a doubly-linked list.

A large pagecache is recommended for optimal speed, and SSDs are also recommended. Hardware continues to become cheaper.

Relational databases and Neo4j use indexes differently, which I think is part of your confusion here. We both use indexes for looking up nodes, true, but Neo4j only uses this for finding certain starting (or end) nodes in the graph. The important (and more complicated and costly) part of a query isn't finding your starting nodes...it's expanding and traversing from these nodes through your graph.

Neo4j uses index-free adjacency for traversing the graph. Relational dbs need to use table joins. One of these is only dependent on the relationships present on the nodes traversed (or rather only the relationships you're interested in, if you've specified restrictions on the relationship type and/or direction). Table joins are dependent on the size of the tables joined (then of course you must consider how many joins you must perform...and how to do these joins if there's nothing restricting which tables to join in the course of traversal).

Again, index-free adjacency does not mean that we must adhere to this in the most literal sense. Ideological purity is not the point. Graph traversals are the most complex part of a graph query, and this is where index-free adjacency is used to the advantage of native graph dbs.

And just to note, we certainly can join nodes based on property values, just like a relational database, and yes we can even use an index to speed that up, in the same manner as relational dbs. In fact you may indeed need to do this in order to create the relationships that you'll use later in your queries. Graph dbs are optimized such that if you do need to use joins, you'll perform them early, and once, so that you can take advantage of index-free adjacency during traversal in your read queries. Traversal speed and efficiency is the point of index-free adjacency.


Graph databases like Neo4j have a very important performance characteristic called "index-free adjacency".

This means that a traversal from node to node (similar to a JOIN in a relational database) does NOT use an index. Instead, it's more like chasing pointers, jumping directly to an offset. Whereas relational databases do use an index to perform JOINS - it's essentially a set comparison, using an index to see where two sets overlap.

What this means is that the performance of joins in relational databases is dependent on the overall size of the tables, while the performance of traversals in a graph database that implements index-free adjacency is not dependent on the overall size of the data (rather just the connectedness of the nodes being traversed), because an index is not used for traversals.


I don't buy the index-free hype. Leaving aside the problem of a distributed implementation, imagine you have a node with lots of labeled edges going out, and you're only interested in following some of those edges. You have to find the right ones somehow: a full scan of all the edges, or an index. For a more detailed comment (not mine), see e.g. https://www.arangodb.com/2016/04/index-free-adjacency-hybrid... .


There are of course certain structures you can use to ease the selection of the relationships you want to follow, and Neo4j does use some of its own, breaking down relationships on a node first by type, and then by incoming or outgoing, allowing a quick selection of relevant relationships to follow depending on what's desired.

Whether additional structures are used at this level or not, the point of index-free adjacency is you do not need to utilize an index at the table/label level. The cost of performing each expansion is not dependent upon the total number of relationships or nodes in the graph, as it would be for table joins. You are only ever considering the relationships on each specific node at a time as you traverse.

Seems to me as long as you have that then you've got (table) index-free adjacency.


There's no reason that a relational database couldn't also have index-free adjacency. Alongside every foreign key, you store a file pointer to the row with that key.

I assume that the reason relational databases don't do this is that the speedup isn't worth the complexity and performance cost of keeping those pointers up to date for the kinds of queries that relational databases usually do.


There is an older, but seemingly relevant, question/answer for this on Stack Overflow: https://stackoverflow.com/a/5611541/92359

> if you shrink a table, or update a partitioned table (causing a row to move to another partition) or if you are rebuilding a table, or export/import a table, or... or... or... the rowid will change.

If the rowids are not stable across db operations, it wouldn't make sense to use them for implementing index-free adjacency. Do any alternatives remain? If not you're back to joins.

One of the reasons why Neo4j can use index-free adjacency is that the ids used for nodes and relationships are pointers to the location of the nodes and relationships in the relevant store files. Those are stable across updates and deletes of other data, and when you delete a node, all its relationships must be deleted first so there are no hanging relationships.


Interviewed with a know-it-all type sr. engineer once, and he asked me an incredibly defensive question like, "why would I use a graph database instead of PostgreSQL, it can't do anything that can't be done in Postgres!?!?"

I'm no expert on any of this, but my response was still sufficient to get hired: you don't need a graph db, but it's easier to understand what Cypher is doing than what SQL is doing... and that means more productivity from your lower-cost junior developers.


OTOH, if you could have it opt-in on a per-FK basis in an RDBMS, the cost could be avoided where not needed, but the performance benefit available where needed.


Ahhh, this is the kind of thing I was looking for. Thanks!


Here's the classic and humorous letter critical of relational databases (RDBMS) penned in 1991 by Henry Baker and published in the Communications of the ACM:

http://home.pipeline.com/~hbaker1/letters/CACM-RelationalDat...

Almost an "RDBMS Considered Harmful" screed, Baker's letter may seem humorous in today's environment and even ludicrous to those who did not live through the transition from older file-systems/DBMS to RDBMS. To those of us who did experience that transition, Baker was painfully on-point and his letter served as a scourge to improve systems performance for more than a decade beyond that period.

More recently the major RDBMS vendors have complemented SQL with directives that address the storage and handling of hierarchical data, e.g., in POSTGRESQL:

https://www.technobytz.com/closure_table_store_hierarchical_...
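
The closure-table idea from that link, in a hedged sketch (hypothetical table and part ids): materialize one row per ancestor/descendant pair, so hierarchy reads become plain indexed lookups instead of recursion:

    CREATE TABLE part_closure (
        ancestor_id   INT NOT NULL,
        descendant_id INT NOT NULL,
        depth         INT NOT NULL,
        PRIMARY KEY (ancestor_id, descendant_id)
    );

    -- all descendants of part 42, no recursion at query time
    SELECT descendant_id FROM part_closure
    WHERE ancestor_id = 42 AND depth > 0;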

Baker was no prophet: his letter showed high hopes for object-oriented databases, another technology that noisily rose and quietly fell in popularity.


> More recently the major RDBMS vendors have complemented SQL with directives that address the storage and handling of hierarchical data

Oracle could do that 30 years ago, and recursive CTEs are part of SQL:1999.


Graphs have arbitrary depth. As a result, some queries need an unknown number of joins on an unknown number of keys. Representing this in plain SQL would be impossible. You could do it in a stored procedure, but that is a superset of SQL, not vanilla SQL.

You can also do it by just making SQL calls over and over in your code as you get each row, but that's not going to be very optimized.


The theoretical basis for SQL is relational algebra. I wonder if there is a similar calculus underlying graph query languages?


I don't know much about graph databases but there is a whole branch of mathematics called algebraic graph theory [1] which allows for the treatment of graphs algebraically. I took intro graph theory in second year so I have only seen combinatorial approaches but from what I gather it's a pretty powerful field.

[1] https://en.wikipedia.org/wiki/Algebraic_graph_theory


There's a recent category-theoretic framework for them:

https://arxiv.org/abs/1909.04881


In theory, a graph database is a relational database + recursion. In practice, some relational databases support recursion but SQL is not designed for it and the facility is both limited and often very awkward for many elementary graph operations. Nothing about the design and structure of SQL is optimized for this type of (ab)use case.

Consequently, managing a large-scale, complex graph data model in SQL ends up being an unmaintainable mess and very difficult to optimize for performance. SQL is a select-centric query language, whereas graph data models strongly recommend a query language that is join-centric.


One really clear-cut one is that a graph database might need to compute the transitive closure, which is impossible with plain SQL/relational algebra (though technically possible with recursive SQL). One example: you have some relation of parents to children and you want to select the ancestors of some child, where an ancestor is either a parent or an ancestor of a parent.


Relational databases stink at trees, DAGs, etc. Anything that involves heavy n-depth self-joining in a SQL server is excruciating.


Tables are abstract and man-made; graphs and networks describe the real world much more naturally.

Also schema is optional in graph DBs, which gives them flexibility.

And in the case of RDF, global identifiers (URIs) and zero-cost dataset merge operation make it a unique fit for data interchange. It’s the only data model designed to be web-native.


That's a pretty vast question. Graph DBs and Relational DBs are obviously somewhat different, and therefore require different query semantics? As in: try querying SQL for "anything with an in-degree >= 4" without a four-way join.


Confused, why can't you just do something like this?

  SELECT u.name FROM user u
  JOIN friend f ON f.id2 = u.id
  GROUP BY u.id, u.name
  HAVING COUNT(*) >= 4


That can only query for a depth of 1. It doesn't let you query for friends of friends at an arbitrary depth.


Should GQL be pronounced with a hard or soft G?

Is "geequel" going to take off akin to "sequel" for SQL?


Oh good. I wonder if they realize developers frequently refer to GraphQL as gql.


So will Neo4j switch to GQL? And will Spark support switch to GQL?


Yup, you can think of ISO GQL as the future of Cypher.


So if I want to learn this GQL, where do I even start? I'm also confused about the naming, is there more than one language that could be called GQL?


While the ISO GQL standards draft is being prepared, the best way for a developer to learn and prepare for ISO GQL is to learn the Cypher query language.[1] Many of the concepts from Cypher will be carried over into ISO GQL.

[1]: https://neo4j.com/developer/cypher-basics-i/


Isn't XQuery basically a graph query language similar to GraphQL? If so, why are we not using XQuery to query objects and sub-objects?


I'm not familiar with XQuery; does it handle general graphs or only DAGs? Your mention of "objects and sub-objects" makes me think you're expecting a different use case than actual graph databases, which deal more with peer-to-peer connections (like knowledge graphs) than hierarchical models.


XQuery is for the XML data model, which is a tree and not a graph.


Anyone know the difference from ArangoDB's AQL? (I haven't used Neo4j, as I use ArangoDB.)


AQL is a vendor-specific multi-model language that has lately picked up some ideas from openCypher, like pattern matching.

GQL is a project to create an International Standard language, developed by experts from many countries who represent the interests of multiple vendors.


I haven't had a look at GQL yet, but the obvious difference would likely be that, as ArangoDB aims to support native multi-model, AQL is not just a graph query language.

Graph databases generally only consider edges between vertices (identified by IDs) and those edges typically only have a handful of attributes. In ArangoDB edges and vertices are stored as documents, so they can be represented as arbitrary JSON objects.

So if you want to work with a mix of data or want to access your data in different ways, I wouldn't expect GQL to support that whereas with AQL you can mix and match your queries even within the same request.

Full disclosure: I'm an external contributor to ArangoDB but this is my personal take.


Huh.

Now if only we could do this for configuration management, service mapping/scheduling/coordination, resource allocation, monitoring, alerting, logging, access control, artifact packaging, and execution pipelines.


What advantages/limitations does it have compared to SHACL? https://en.m.wikipedia.org/wiki/SHACL


Graph query languages are nice and all, but what about Linked Data here? Queries of schemaless graphs miss lots of data because, without a schema, this graph calls it "color" and that graph calls it "colour" and that graph calls it "色" or "カラー". (Of course this is also an issue even when there is a defined schema; but it's hardly possible to just happen to have comprehensible inter- or even intra-organizational cohesion without e.g. RDFS and/or OWL and/or SHACL for describing (and changing) the shape of the data.)

So, the task is then to compile schema-aware SPARQL to GQL or GraphQL or SQL or interminable recursive SQL queries or whatever it is.

For GraphQL, there's GraphQL-LD (which somewhat unfortunately contains a hashtag-indeterminate dash). I cite this in full here because it's very relevant to the GQL task at hand:

"GraphQL-LD: Linked Data Querying with GraphQL" (2018) https://comunica.github.io/Article-ISWC2018-Demo-GraphQlLD/

> GraphQL is a query language that has proven to be a popular among developers. In 2015, the GraphQL framework [3] was introduced by Facebook as an alternative way of querying data through interfaces. Since then, GraphQL has been gaining increasing attention among developers, partly due to its simplicity in usage, and its large collection of supporting tools. One major disadvantage of GraphQL compared to SPARQL is the fact that it has no notion of semantics, i.e., it requires an interface-specific schema. This therefore makes it difficult to combine GraphQL data that originates from different sources. This is then further complicated by the fact that GraphQL has no notion of global identifiers, which is possible in RDF through the use of URIs. Furthermore, GraphQL is however not as expressive as SPARQL, as GraphQL queries represent trees [4], and not full graphs as in SPARQL.

> In this work, we introduce GraphQL-LD, an approach for extending GraphQL queries with a JSON-LD context [5], so that they can be used to evaluate queries over RDF data. This results in a query language that is less expressive than SPARQL, but can still achieve many of the typical data retrieval tasks in applications. Our approach consists of an algorithm that translates GraphQL-LD queries to SPARQL algebra [6]. This allows such queries to be used as an alternative input to SPARQL engines, and thereby opens up the world of RDF data to the large amount of people that already know GraphQL. Furthermore, results can be translated into the GraphQL-prescribed shapes. The only additional requirement is their queries would now also need a JSON-LD context, which could be provided by external domain experts.

> In related work, HyperGraphQL [7] was introduced as a way to expose access to RDF sources through GraphQL queries and emit results as JSON-LD. The difference with our approach is that HyperGraphQL requires a service to be set up that acts as an intermediary between the GraphQL client and the RDF sources. Instead, our approach enables agents to directly query RDF sources by translating GraphQL queries client-side.

All of these RDFS vocabularies and OWL ontologies provide structure that minimizes the costs of merging and/or querying multiple datasets: https://lov.linkeddata.es/dataset/lov/

All of these schema.org/Dataset s in the "Linked Open Data Cloud" are easier to query than a schemaless graph: https://lod-cloud.net/ . Though one can query schemaless graphs with SPARQL, as well.

For reference, RDFLib has a bunch of RDF graph implementations over various key/value and SQL store backends. RDFLib-SQLAlchemy does query parametrization correctly in order to minimize the risk of query injection. For the record, SQL injection is the #1 most prevalent security weakness in the CWE Top 25; which is something any new spec and implementation should really consider before launching anything other than an e.g. overly-verbose JSON-based query language that people end up bolting a micro-DSL onto. https://github.com/RDFLib/rdflib-sqlalchemy
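A minimal sketch of what that parametrization looks like with RDFLib (the predicate and data are made up): bind untrusted input as a term via initBindings rather than splicing it into the query string.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.car1, EX.color, Literal("red")))

    untrusted = "red"  # imagine this arrived from a web form
    # The binding is passed as a typed term and never interpolated into
    # the query text, so there is nothing to inject into.
    rows = g.query(
        "SELECT ?s WHERE { ?s ex:color ?c }",
        initNs={"ex": EX},
        initBindings={"c": Literal(untrusted)},
    )
    for row in rows:
        print(row.s)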

Most practically, I frequently want to read a graph of objects into RAM; update, extend, and interlink them; and then transactionally save the delta back to the store. This requires a few things: (1) an efficient binary serialization protocol like Apache Arrow (SIMD), Parquet, or any of the BSON binary JSONs; (2) a transactional local store that can be manually synchronized with the remote store until it's consistent.

SPARQL Update was somewhat of an out-of-scope afterthought. Here's SPARQL 1.1 Update: https://www.w3.org/TR/sparql11-update/
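A minimal sketch of that read-modify-delta workflow with RDFLib, assuming the remote state is mirrored locally: rdflib graphs support set difference, and the delta renders directly as a SPARQL 1.1 Update request.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")

    remote = Graph()  # stand-in for the store's current state
    remote.add((EX.doc1, EX.title, Literal("Draft")))

    local = Graph()
    local += remote   # work on an in-RAM copy
    local.remove((EX.doc1, EX.title, Literal("Draft")))
    local.add((EX.doc1, EX.title, Literal("Final")))

    added, removed = local - remote, remote - local  # graph set difference

    def data_block(g):
        return " ".join("%s %s %s ." % (s.n3(), p.n3(), o.n3()) for s, p, o in g)

    # The delta, ready to POST to a SPARQL 1.1 Update endpoint.
    print("DELETE DATA { %s };\nINSERT DATA { %s }"
          % (data_block(removed), data_block(added)))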

Here's SOLID, which could be implemented with SPARQL on GQL, too; though all the re-serialization really shouldn't be necessary for EAV triples with a named graph URI identifier: https://solidproject.org/

5 star data: PDF -> XLS -> CSV -> RDF (GQL, AFAIU (but with no URIs(!?))) -> LOD https://5stardata.info/en/


Linked Data tends to live in a semantic web world that has a lot of open world assumptions. While there are a few systems like this out there, there aren't many. More practically focused systems collapse this worldview down into a much simpler model, and property graphs suit just fine.

There's nothing wrong with enabling linked data use cases, but you don't need RDF+SPARQL+OWL and the like to do that.

The "semantic web stack" I think has been shown by time and implementation experience to be an elegant set of standards and solutions for problems that very few real world systems want to tackle. In the intervening 2 full generations of tech development that have happened since a lot of those standards were born, some of the underlying stuff too (most particularly XML and XML-NS) went from indispensable to just plain irritating.


> Linked Data tends to live in a semantic web world that has a lot of open world assumptions. While there are a few systems like this out there, there aren't many. More practically focused systems collapse this worldview down into a much simpler model, and property graphs suit just fine.

Data integration is cost-prohibitive. In n years' time, the task becomes: "Let's move all of these data silos into a data lake housed in our singular data warehouse, and then synchronize and copy data around to efficiently query it in one form or another."

Linked data enables data integration from day one: it enables linking tragically siloed records across disparate databases.

There are very very many systems that share linked data. Some only label some of the properties with URIs in templates. Some enable federated online querying.

When you develop a schema for only one application implementation, you're tragically limiting the future value of the data.

> There's nothing wrong with enabling linked data use cases, but you don't need RDF+SPARQL+OWL and the like to do that.

Can you name a property graph use case that cannot be solved with RDFS and SPARQL?

> The "semantic web stack" I think has been shown by time and implementation experience to be an elegant set of standards and solutions for problems that very few real world systems want to tackle.

TBH, I think the problem is that people don't understand the value in linking our data silos through URIs, and so they don't take the time to learn RDFS or JSON-LD (which is pretty simple, and useful for very important things like SEO: search-engine result cards come from linked data embedded in HTML attributes (RDFa, Microdata) or in JSON-LD).

The action buttons to 'RSVP', 'Track Package', and 'View Issue' in Gmail are schema.org JSON-LD.
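For instance, the 'View Issue' button comes from markup along these lines embedded in the email (a sketch based on the schema.org ViewAction vocabulary; the target URL is made up):

    {
      "@context": "http://schema.org",
      "@type": "EmailMessage",
      "potentialAction": {
        "@type": "ViewAction",
        "target": "https://example.com/issues/42",
        "name": "View Issue"
      }
    }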

Applications can use linked data in any part of the stack: the database, the messages on the message queue, in the UI.

You might take a look at all of the use cases that SOLID solves for and realize how much unnecessary re-work has gone into indexing structs and forms validation. These are all the same app with UIs for interlinked subclasses of https://schema.org/Thing with unique inferred properties and aggregations thereof.

> In the intervening 2 full generations of tech development that have happened since a lot of those standards were born, some of the underlying stuff too (most particularly XML and XML-NS) went from indispensable to just plain irritating.

Without XSD datatypes, for example, we have no portable way to share even things like exact decimal fractions.

There's a compact representation of JSON-LD that minimizes record schema overhead (which gzip or lzma generally handle anyway)

https://lod-cloud.net is not a trivial or insignificant amount of linked data: there's real value in structuring property graphs with standard semantics.

Are our brains URI-labeled graphs? Nope, and we spend a ton of time talking to share data. Eventually, it's "well let's just get a spreadsheet and define some columns" for these property graph objects. And then, the other teams' spreadsheets have very similar columns with different labels and no portable datatypes (instead of URIs)


> Can you name a property graph use case that cannot be solved with RDFS and SPARQL?

No - that's not the point. Of course you can do it with RDFS + SPARQL. For that matter, you could do it with Redis. Fully beside the point.

What's important is what the more fluent and easy way to do things is. People vote with their feet, and property graphs are demonstrably easier to work with for most use cases.


“Easier” is completely subjective; there's no way you can demonstrate that.

RDF solves a much larger problem than just graph data model and query. It addresses data interchange on the web scale, using URIs, zero-cost merge, Linked Data etc.


> “Easier” is completely subjective, no way you can demonstrate that.

I agree it's subjective. While there's no exact measurement for this sort of thing, the proxy measure people usually use is adoption; and if you look at, for example, Cypher vs. SPARQL adoption, or Neo4j vs. RDF store adoption, people are basically voting with their feet.

From my personal experiences developing software with both, I've found property graphs much simpler and a better map for how people think of data.

It's true that RDF tries to solve data interchange on the web scale. That's what it was designed for. But the original design vision, in my view, hasn't come to fruition. There are bits and pieces that have been adopted to great effect (things like RDF microformats for tagging HTML docs) but nothing like what the vision was.


What was the vision?

The RDFJS "Comparison of RDFJS libraries" wiki page lists a number of implementations; though none for React or AngularJS yet, unfortunately. https://www.w3.org/community/rdfjs/wiki/Comparison_of_RDFJS_...

There's extra work to build general purpose frameworks for Linked Data. It may have been hard for any firm with limited resources to justify doing it the harder way (for collective returns)

Dokieli (SOLID (LDP,), WebID, W3C Web Annotations,) is a pretty cool - if deceptively simple-looking - showcase of what's possible with Linked Data; it just needs some CSS and a revenue model to pay for moderation. https://dokie.li/


> property graphs are demonstrably easier to work with for most use cases.

How do you see property graphs as distinct from RDF?

People build terrible apps without schema or validation and leave others to clean that up.


> How do you see property graphs as distinct from RDF?

This is the full answer: https://stackoverflow.com/a/30167732/2920686


I added an answer in context to the comments on the answer you've linked but didn't add a link from the comments to the answer. Here's that answer:

> (in reply to the comments on this answer: https://stackoverflow.com/a/30167732 )

> When an owl:inverseOf production rule is defined, the inverse property triple is inferred by the reasoner either when adding or updating the store, or when selecting from the store. This is a "materialized relation"

> Schema.org - an RDFS vocabulary - defines, for example, https://schema.org/isPartOf as the inverse property of hasPart. If both are specified, it's not necessary to run another graph pattern query to traverse a directed relation in the other direction. (:book1 schema:hasPart ?o), (?o schema:isPartOf :book1), (?s schema:hasPart :chapter2)

> It's certainly possible to use RDFS and OWL to describe schema for and within neo4j property graphs; but there's no reasoner to e.g. infer inverse properties or do schema validation.

> Is there any RDF graph that neo4j cannot store? RDF has datatypes and languages for objects: you'd need to reify properties where datatypes and/or languages are specified (and you'd be re-implementing well-defined semantics)

> Can every neo4j graph be represented with RDF? Yes.

> RDF is a representation for graphs for which there are very many store implementations that are optimized for various use cases like insert and query performance.

> Comparing neo4j to a particular triplestore (with reasoning support) might be a more useful comparison given that all neo4j graphs can be expressed as RDF.
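As a sketch of the materialization described above, here's what the inference looks like in Python with rdflib plus the owlrl reasoner (the book/chapter IRIs are made up):

    import owlrl
    from rdflib import Graph, Namespace
    from rdflib.namespace import OWL

    SCHEMA = Namespace("https://schema.org/")
    EX = Namespace("http://example.org/")

    g = Graph()
    g.add((SCHEMA.isPartOf, OWL.inverseOf, SCHEMA.hasPart))
    g.add((EX.book1, SCHEMA.hasPart, EX.chapter2))  # assert one direction only

    # Materialize the OWL RL closure; the inverse triple then becomes
    # queryable without a second graph-pattern traversal.
    owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)
    assert (EX.chapter2, SCHEMA.isPartOf, EX.book1) in g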


And then, some time later, I realize that I want/need to: (3) apply production rules to do inference at INSERT/UPDATE/DELETE time or SELECT time (and indicate which properties were inferred (x is a :Shape and a :Square, so x is also a :Rectangle; x is a :Rectangle and :width and :height are defined, so x has an :area)); (4) run triggers (that execute code written in a different language) when data is inserted, updated, modified, or linked to; (5) asynchronously yield streaming results to message queue subscribers who were disconnected when the cached pages were updated
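A toy sketch of the kind of production rule meant in (3), over an rdflib graph, keeping inferred triples in a separate graph so they can be flagged as inferred (all IRIs made up):

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")
    g, inferred = Graph(), Graph()  # asserted vs. inferred triples

    g.add((EX.x, RDF.type, EX.Rectangle))
    g.add((EX.x, EX.width, Literal(3)))
    g.add((EX.x, EX.height, Literal(4)))

    # Production rule: a Rectangle with width and height gets an area.
    for s in g.subjects(RDF.type, EX.Rectangle):
        w, h = g.value(s, EX.width), g.value(s, EX.height)
        if w is not None and h is not None:
            inferred.add((s, EX.area, Literal(w.toPython() * h.toPython())))

    print(list(inferred))  # [(EX.x, EX.area, Literal(12))]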


Neo4j would become the standard while being far less used than Gremlin? This is nonsense, isn't it?


Neo4j is by far the most used graph database out there. Regardless, the standard is not Neo4j-specific. It was a collaboration among many companies in the space. They all agree it's a good move.


Well, I have no statistics about TinkerPop vs. Neo4j usage, so I may be wrong. But I know for sure that far more graph databases support Gremlin than support openCypher. E.g. the latest graph database from Microsoft: https://docs.microsoft.com/fr-fr/azure/cosmos-db/graph-intro... Also, Cypher has performance and expressivity issues.


Just looking at https://db-engines.com/en/ranking/graph+dbms clearly shows Neo4j as the most prominent graph db.

I think even Cosmos at number 2 might be misleading here, as Cosmos is a multi-model product and DB-Engines doesn't distinguish between graph and non-graph use in the ranking. (Same for ArangoDB.)


> E.g the latest graph database from Microsoft

As a contrast, the recently introduced RedisGraph only supports Cypher.


Hmm... There's this big, great, 'perfect', heavyweight graph query language (GQL) in the process of being standardized, while an alternative language (GraphQL) is more readable (much more lightweight syntax, IMO), has gained much more traction, etc.

While GQL and GraphQL have different targets (one is for interacting with graph DBs, the other for interacting with backends), there is a lot of ongoing overlap, and I just can't shake the feeling of XML vs. JSON all over again (where, while XML was more 'perfect', JSON won the war).

Edit: Ok, GraphQL is insufficient for GraphDB Querying. Thanks for everyone's clarification.


> while an alternative (GraphQL) language is more readable

It's also entirely insufficient. GraphQL is nice for querying APIs but almost pointless for querying databases unless you map an entire DSL on top of it.


GraphQL has nothing to do with graph databases.


Why? GraphQL is about modeling business domain objects in graphs [0] and querying them. Isn't that basically what Graph DB languages do too?

[0] https://graphql.org/learn/thinking-in-graphs/


It is a query language for APIs; a name like RestQL or JsonQL would be much more suitable.

The expressiveness of GraphQL is severely limited: it doesn't allow you to express even simple patterns like triangles, depth-n traversals, paths, etc.
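For contrast, here's the kind of pattern a property-graph language handles in one line, shown via the official neo4j Python driver (connection details and the :KNOWS relationship are hypothetical); a GraphQL query, being a tree, cannot express the cycle:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))  # hypothetical
    with driver.session() as session:
        # Triangle: three people who all know each other -- a cyclic
        # pattern. Variable-length paths ([:KNOWS*1..5]) are similar.
        result = session.run(
            "MATCH (a:Person)-[:KNOWS]->(b:Person)-[:KNOWS]->"
            "(c:Person)-[:KNOWS]->(a) "
            "RETURN a.name, b.name, c.name"
        )
        for record in result:
            print(record["a.name"], record["b.name"], record["c.name"])
    driver.close()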


Exactly. GraphQL solves the same problem as REST. It is not a graph query language despite its unfortunate name.


aside from google search results for GQL :)



I think linking to Dgraph may actually support "has nothing to do with". Dgraph's query language was inspired by GraphQL, but it also explicitly diverged from it in order to make a more suitable DB language.


Not to muddy the waters here, but you can use GraphQL to build APIs with graph databases like Neo4j. There are integrations that use GraphQL type definitions to define the database data model, translate GraphQL queries to Cypher, and extend the expressiveness of GraphQL by mapping Cypher queries to GraphQL fields:

https://grandstack.io/docs/guide-graphql-schema-design.html
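For example, with neo4j-graphql type definitions, a field can be backed by a Cypher statement via the @cypher schema directive (a sketch; the type and relationship names are made up):

    type Person {
      name: String
      # Resolved by running the Cypher statement, with `this` bound to the
      # current Person node -- beyond what plain GraphQL can express.
      friendsOfFriends: [Person] @cypher(statement:
        "MATCH (this)-[:KNOWS*2]->(p:Person) RETURN DISTINCT p")
    }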



