How did you decide to use an SQL-like query language rather than a declarative query language like Neo4j uses (Cypher query language)? What do you see as the pros and cons of that decision?
Do you have any plans currently to design an ETL tool to make extracting data from a RDBMS and loading into NebulaGraph easier?
I didn't see this in any of the documentation (though I did admittedly skim), but is there any sort of visual front-end built in? If not, do you have any plans to make one?
SQL is a declarative language. Cypher is somewhat SQL-like and is also a declarative language; it exists to make graph queries with predicates over nodes and edges easier to express.
Thank you for the correction. You are correct, SQL is a declarative language. It would've been more accurate for me to refer to Cypher as 'a more expressive declarative query language' (specifically for the graph database paradigm) than a typical SQL-like query language.
Looks interesting. Good luck! A couple of questions: 1) why did you decide to create your own graph query language instead of following the recent trend toward standardizing graph query languages (e.g., see https://www.tigergraph.com/2019/02/25/the-road-to-standardiz... and https://www.tigergraph.com/2019/03/15/the-road-to-a-standard...); 2) do you plan to support semantic features [i.e., RDF/S, Gremlin, SPARQL, inference]; 3) do you plan to have SDKs for popular languages [e.g., Python, TypeScript]; 4) did you benchmark (or plan to do so) Nebula Graph against the competition in terms of performance [i.e., data import, query throughput and latency, inference speed, if/when supported]; 5) have you figured out which features will remain open source and which ones will be enterprise-only (I assume that you plan to follow the Open Core business model)? Thanks!
If anyone involved would appreciate access to Graphistry to experiment with a gpu client/cloud visual analytics side (e.g., jupyter & react), let me know (leo@....com). Would love to see them together!
The submitted title ("Show HN: An open-source distributed graph database written in C++") led to lots of arguments, so we changed it in accordance with HN's rules about baity titles. That's in https://news.ycombinator.com/newsguidelines.html.
I gave the tutorial using docker a try. The SQL-like query language is OK until it gets to doing queries with where clauses on the data. Some form of help and auto completion would be great.
It looks really neat! What's your plan for making this project viable in the long run? Do you envisage monetizing hosting, or maybe creating a community vs. paid edition?
Glad you liked the project!
Hosting service would be the main monetization method. In addition, we will be providing consulting, training and all sorts of enterprise services.
We help bring gpu visual analytics & investigation automation to users of all sorts of graph DBs (think tableau & servicenow for graph), so based on our enterprise/big tech/gov/startup interactions:
1. Shortlist (and in no order): Neo4j, AWS Neptune, Datastax Graph, TigerGraph, Azure CosmosDB, and JanusGraph (Titan fork) are the ones we see the most in practice, and not in production but rumor-mill, Dgraph, RedisGraph, & ArangoDB. The three-and-four-letter types seem to roll their own, for better or worse. There are also some super cool ones that don't get visibility outside of the HPC+DoD world, like Stinger & Gunrock. Interestingly, the reality is a ton of our graph users aren't even on graph DBs (think Splunk/ELK/SQL), and for data scientists, just do ephemeral Pandas/Spark. As someone from the early days of the end-to-end GPU computing movement, we're incorporating cuGraph (part of nvidia rapids.ai) into our middle tier, so you get to transparently benefit from it while looking at data in any of the above.
2. I now slice graph DB's more in terms of OLTP (neo4j, janus, neptune, maybe tiger) vs OLAP (spark graphx, cugraph) vs batch (janus, tiger) vs friendly BI/data science (neo4j) vs friendly app dev / multi-modal add-on (CosmosDB, Neo4j, Arango, Redis). Curious to see how this goes -- given the number of contributors, I'm guessing it's doing well in at least one of these. +1 to hearing reports from others!
Thanks, I really appreciate the comprehensive write-up of what your team is seeing. Any chance of a longer blog post that expands on this, especially pros and cons and performance?
For someone who just wants to run some (intensive) OLAP graph queries on the “graph formulation” of a relational or hierarchical dataset every once in a while (maybe batch, maybe user-initiated, but either way <1QPS), but doesn’t yet have a graph DB and doesn’t really want to maintain their data in a canonical graph formulation, which type of graph DB would you recommend as the simplest-to-maintain, simplest-to-scale “adjunct” to their existing infra?
I.e. what’s the graph DB that best fits the use-case equivalent to “having your data in an RDBMS and then running an indexer agent to feed ElasticSearch for searching”?
My default nowadays is to minimize work via "no graph db": csv/parquet extract -> jupyter notebook of pandas/cugraph/graphistry, and if that isn't enough, then dockerized (=throwaway) neo4j, or if the env has it, spark+graphistry. The answers to some questions can easily switch the answer to say "kafka -> tigergraph/janusgraph/neptune", or some push button neo4j/cosmosdb stuff:
* Primary DB: type / scale, and how fresh do the extracts need to be (daily, last minute?)
* Are queries more search-centric ("entities 4 hops out") or analytics ("personalized pagerank")?
* Graph size: 10M relations, or 10B? Document heavy, or mostly ints & short strings?
* Is the client consuming the graph via a graph UI, or API-only?
* Licensing and $ cost restrictions?
* Push-button or inhouse-developer-managed?
The result of (valid) engineering trade-offs by graph db dev teams means that, currently, adding a graph db as a second system can be tricky. The above represent potential mismatches between source db / graph stack / workload and team burden. Feels like this needs a flow chart!
Happy to answer based on the above, and you can see why I'm curious which areas Nebula will help straddle :)
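To make the "no graph db" default above concrete, here is a minimal sketch of a search-centric query ("entities N hops out") run directly on an edge-list extract. It uses only the Python standard library rather than pandas/cugraph/graphistry, and the CSV contents, column names, and function names are all hypothetical.

```python
import csv
import io
from collections import defaultdict, deque

# Hypothetical extract from a relational table of relationships (src, dst).
EXTRACT = """src,dst
alice,bob
bob,carol
carol,dave
dave,erin
alice,frank
"""

def load_edges(text):
    """Read (src, dst) pairs from CSV text into an undirected adjacency map."""
    adjacency = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text)):
        adjacency[row["src"]].add(row["dst"])
        adjacency[row["dst"]].add(row["src"])
    return adjacency

def within_hops(adjacency, start, max_hops):
    """Breadth-first search returning {entity: hop distance} up to max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

graph = load_edges(EXTRACT)
reachable = within_hops(graph, "alice", 2)
```

Something like this is probably fine for small extracts; at the larger sizes mentioned in the questions above, an in-memory Python dict is unlikely to be enough.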
(Sherman here. I'm the founder of Nebula) Nice to meet you here, Manish. Nebula is actually inspired by the Facebook internal project Dragon (https://engineering.fb.com/data-infrastructure/dragon-a-dist...). Fortunately I was one of the founding members of the project. The project was started in 2012. We had never heard of Dgraph at that time. So I'm not sure who was inspired :-)
The goal of Nebula is to be a general graph database, not just a knowledge graph database. There are some fundamental differences between the two.
We welcome any positive feedback and technical discussion. We would love to learn from the community and to provide a product which truly satisfies customers' needs.
Yes, Nebula Graph supports multiple backend storages by design. So theoretically you are able to use whatever storage you want for whichever graph space in Nebula Graph.
Thanks for asking! Sorry I missed this question earlier.
Nebula doesn't store data multiple times for indexes.
And here's how the indexing works in Nebula Graph:
You are allowed to create multiple indexes for one tag or edge type in Nebula Graph. For example, if a vertex has 5 properties attached to it, then you can create one index for each of them if you need to. Both the indexes and the raw data are stored in the same partition, each with its own data structure, so that query statements can scan them quickly. Whenever a query contains a "where" clause, the index optimizer decides which index file should be traversed.
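The idea described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Nebula Graph's actual storage layout or optimizer: the `Partition` class, its method names, and the "first indexed property wins" rule are all hypothetical, and only equality predicates are handled.

```python
from collections import defaultdict

class Partition:
    """Toy partition holding raw vertex rows plus per-property indexes."""

    def __init__(self):
        self.rows = {}       # vertex id -> property dict (the raw data)
        self.indexes = {}    # property name -> {value -> set of vertex ids}

    def create_index(self, prop):
        """Build an index on one property from the rows already stored."""
        index = defaultdict(set)
        for vid, props in self.rows.items():
            if prop in props:
                index[props[prop]].add(vid)
        self.indexes[prop] = index

    def insert(self, vid, props):
        """Store a row once; assumes each vertex id is inserted only once."""
        self.rows[vid] = props
        # Index entries hold vertex ids only, so the row is not duplicated.
        for prop, index in self.indexes.items():
            if prop in props:
                index[props[prop]].add(vid)

    def choose_index(self, where):
        """Pick the first property in the predicates that has an index."""
        for prop in where:
            if prop in self.indexes:
                return prop
        return None

    def lookup(self, where):
        """Return sorted ids of vertices matching all equality predicates."""
        prop = self.choose_index(where)
        if prop is None:
            candidates = self.rows.keys()  # no usable index: full scan
        else:
            candidates = self.indexes[prop].get(where[prop], set())
        return sorted(
            vid for vid in candidates
            if all(self.rows[vid].get(k) == v for k, v in where.items())
        )

part = Partition()
part.create_index("name")
part.create_index("age")
part.insert(1, {"name": "Ann", "age": 30})
part.insert(2, {"name": "Bob", "age": 30})
part.insert(3, {"name": "Ann", "age": 41})
matches = part.lookup({"age": 30, "name": "Ann"})
```

A real optimizer would presumably weigh selectivity rather than take the first indexed property, but the shape is the same: narrow the candidates through one index, then check the remaining predicates against the raw rows.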
One of the most interesting picks: RDF4j (Java-based). It can connect to a lot of different SPARQL servers, but the rdf4j Native Store should be good enough for data sets on the order of "100 million triples", according to the docs.
I don't know much about it, but not long ago they announced integrated support for "federated queries", which means that if your data set can't fit in a single node, they have a solution to query different servers in the same query [2].
I'm slowly finding my way through the forest of related technologies. One of the most useful is SHACL [3], which is a language to validate and extract pieces of the graph that match a pattern (very loosely, think a "schema" for graphs).
Before using RDF for graphs, one should learn the differences between labeled property graphs and triple stores, and choose the model that best fits their use case.
Good point. Funny you mention that article: I remember encountering both that article and another one that provides some counterpoints! [1]
Also, both those articles are a bit old: RDF* ([2],[3]) is a new extension for RDF that makes it easier to accomplish the same kind of things you can do with property graphs. RDF4j has support for RDF* in the roadmap! [4].
To me, the fact that RDF is 1) a simpler and more general model and 2) an open standard with multiple free and commercial implementations makes RDF a more attractive option than locking into a single proprietary implementation like Neo4j.
RDF is an interoperability mechanism, it has nothing to do with the architecture you use internally for your database. You can have a PostgreSQL database and offer an endpoint for querying it via RDF.
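As a rough illustration of that point, relational rows can be turned into RDF triples without changing how the data is stored. The sketch below covers only the mapping step, not a query endpoint; the `http://example.org/` base IRI, the table and column names, and the `row_to_triples` helper are hypothetical, and every value is emitted as a plain string literal for simplicity.

```python
def row_to_triples(table, key, row, base="http://example.org/"):
    """Map one relational row (a dict) to N-Triples-style lines.

    The primary key becomes the subject IRI, and each other non-null
    column becomes one predicate/literal pair.
    """
    subject = f"<{base}{table}/{row[key]}>"
    triples = []
    for column, value in row.items():
        if column == key or value is None:
            continue  # the key is in the subject; NULLs produce no triple
        predicate = f"<{base}{table}#{column}>"
        # Escape backslashes and double quotes inside the literal.
        literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} {predicate} "{literal}" .')
    return triples

# Hypothetical rows, as they might come back from a SQL query.
rows = [
    {"id": 1, "name": "Ann", "city": None},
    {"id": 2, "name": "Bob", "city": "Oslo"},
]
triples = [t for row in rows for t in row_to_triples("person", "id", row)]
```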
Currently the project has been deployed in multiple leading internet companies in China, including Tencent, MeiTuan (Chinese Yelp), Red (Chinese Pinterest), Vivo, and so on.
That's pretty impressive. I'd love to see some detailed blog posts about setting it up, or using it in production (things to watch out for, good practices for provisioning hardware, etc.).
Glad you loved the query language. Simplicity and versatility are our design goals for the language.
Honestly, I don't see what's wrong with expecting payment for your work if someone else decides to sell it. Why should 'open source' get conflated with free (as in gratis)?
For me, open source has been an incredible way to learn software - its syntax, its architecture, its control flow, its gotchas.
From my understanding of the license [1], you can see the code, learn from it, do whatever you want with it, modify it if you so please, improve on it, whatever. The only thing you cannot do is sell it. Because you've taken someone else's idea in the first place.
I see this happening all the freakin' time and it pisses me off no end. If I suggest a piece of software to someone, the first thing they ask is 'Is it open source?' What they really mean is 'Is it free?' Why? If someone is expecting to get paid for creating software for others, why is the feeling not reciprocated towards the person who's created the software in the first place?
From what I've seen, most managers and software engineers expect to get paid for their work, but all the software which helps them make that money, they expect for free.
I find that attitude extremely hypocritical, honestly.
Why should open source get conflated with things that are NOT open source? Putting restrictions around "commercial" use (which is notoriously hard to define) is not open source. Discriminating against fields of endeavor is not open source.
If you want to get paid for developing genuine open source software, there are things you can do to that effect. Get paid for support (even maintaining the code is support). Offer to highlight companies that support your software (even if the highlighting is quite trivial, this is enough to unlock 'marketing' expenses and make it easier for business-oriented entities to support you). Start a Patreon page. There are lots of things that can be done without adding any licensing restrictions.
That would imply public domain. Every license has some licensing restrictions. MIT, BSD, and associated ones are closest to that, but still have restrictions. "Open source" in the literal sense in English is where the source is open to be looked at by everyone. Lots of software is like that, even fully commercial offerings. AGPL, GPL, and co have pretty drastic limitations on commercial usage (much more than the Commons Clause), but are obviously open source. The author should decide licensing, and if the source is available to be perused-- the English language would tend to call that, "open source". I think "OSI Approved Open Source License" would be a better phrase than the linguistically vague "open source". English has proper nouns for that sort of thing, and if we can go around writing "GNU/Linux", I think specifying the _type_ of open source license really isn't too much to ask for.
There are some licenses effectively like public domain, such as zero-clause BSD, CC0, WTFPL, Unlicense, etc.
GPL does not restrict commercial use any more than non-commercial use. What it does restrict is adding additional restrictions, it requires source code to be distributed, and it does not allow disallowing the user to substitute their own version.
If the source is available to be perused I think it is called "shared source" (or "source available"); "open source" is a subset of that, and is according to the OSI definition. "Free software" is also a subset of "source available". "OSI approved" is a subset of "open source" because OSI approved does not include public domain, even if it is still open source (which in some cases it is) (also some stuff that meets the OSI definition (by both words and intention) might not be OSI approved because OSI has not looked at it yet). And then there is also "FOSS".
> Cloud servic-ization is the virtualization of hardware modification locks.
Hence why the FSF advocates the AGPL for software that's designed to be performed "as a service" over a computer network. But "no tivoization" and AGPL clauses do not deny these uses; they simply enable the end user of the software to exercise her rights with respect to it.
I get that Commons Clause isn't "open source" but I really love the concept. If a company wishes to productize a creator's work it seems reasonable to pay the creator to alternatively license it (if the company doesn't want to put their secrets out to the public).
Meanwhile the creator gets to share their work freely with anyone who wishes to use it as a component of their own product/software in the spirit of open source.
As far as I know, Commons Clause can be attached to any open-source project. The main purpose is to prevent cloud providers from monetizing the project without contributing back. So Nebula Graph's main license is Apache 2.0, meaning that to most users it is open source, no different from any other open source project. :)
The real question that has to be answered, and this is the hard one, when does the product begin to be monetized?
Let’s say it’s a full DB option as part of AWS RDS (or whatever that graph DB equivalent is). That probably is clearly monetizing the product. But what if they completely abstract the API and not expose the original one, it’s just the backing engine for a graph DB product?
Now moving away from a direct product, what if it’s just the backing DB AWS uses for managing all of their infrastructure? It’s not being directly monetized at that point but it might be the most critical component for the AWS operations, which means that it is helping them monetize other products. Do they owe in this case? (I’m speaking about the license here, not whether or not they should or should not based on goodness or feature improvements they want to pay to see).
As the DB moves further away from profit centers in an organization, at what point is it no longer being monetized?
Personally, I’d like to see a model where the OSS developers can and are paid in all of these cases for their work, but I’m not always sure there is anything better than a contract to support and build new features (classic OSS support model).
Discriminating by field of endeavor is contrary to the definition of open source software, and has been since before the term even existed. It's not open source, it's effectively Shared Source and developers who care about open source should stay away from this.
There's a subtle difference between AGPL and the Commons Clause licenses.
AGPL requires network-accessible code to be disclosed & licensed under an AGPL-compatible license.
The Commons Clause license outright prohibits SaaS-style offerings of the licensed code.
A lot of startups licensing their code under AGPL might still have AWS et al. eat their lunch, because all Amazon needs to do to remain compliant is to publish any modifications made to the AGPL-ed code.
AGPL is super-banned at all companies because lawyers deem it a huge risk. In particular it's banned at Amazon. IIRC you aren't even allowed to have AGPL software on your laptop at all.
It seems to me the difference is not so subtle. As I understand from https://commonsclause.com/ you can use Commons Clause licensed code-library as part of your commercial application without having to make your source-code available whereas with AGPL you would have to. But I may be wrong?
Commons Clause is added on top of an existing FOSS license. So the result is whatever the base license requires, plus an anti-commercial restriction preventing others from offering SaaS services.
Another approach I have seen is BSL (Business Source License), which is kind of like Commons Clause in that it prohibits commercial offerings of the software, but after a rolling time limit, converts to an open source license. I might be wrong, so please correct me.
Yes, if you're choosing between Commons Clause and BSL please choose the latter. Because (1) it has a way less confusing name and mechanism of action, and (2) it acknowledges that some people may care about an actual OSS license for your software, and makes it clear how that might be achieved.
Oh wow, this really is just source available, isn't it?
> For purposes of the foregoing, "Sell" means practicing any or all of the rights granted to you under the License to provide to third parties, for a fee or other consideration (including without limitation fees for hosting or consulting/support services related to the Software), a product or service whose value derives, entirely or substantially, from the functionality of the Software.
So, you cannot pay a contractor to set this up, because they can't deliver to you if they charge for setup or hosting?
Really appreciate the feedback and discussion! We will seriously consider the license issue.
Please DO let us know if you have any better license options than Commons Clause that can help provide an open-source project for the community while stopping cloud vendors from monetizing it without contributing back.
I follow a few licensing blogs, and I know people are talking about and working on alternatives to Commons Clause. As far as I can tell, all will disappoint those who believe the OSI's definition of "Open Source" is the one true open source.
Nice to meet everyone here. As a newcomer, I would like to introduce ourselves a little bit. Nebula is inspired by the Facebook internal project Dragon (https://engineering.fb.com/data-infrastructure/dragon-a-dist...). Fortunately I was one of the founding members of the project. The project was started in 2012. Since then I've spent all my time working on graph databases.
The goal of Nebula is to be a general-purpose, distributed graph database. We welcome any positive feedback and technical discussion. We would love to learn from the community and to provide a product which truly satisfies customers' needs.
While our original intention is to provide a real open-source graph database project for the community, we also want to prevent cloud vendors from monetizing the project without contributing back to the community. Exactly like what's explained in this TechCrunch article: https://techcrunch.com/2018/09/07/commons-clause-stops-open-...
That being said, Commons Clause seems to be the only license that can be used. To quote the article: "Academics, hobbyists or developers wishing to use a popular open-source project to power a component of their application can still do so."
However, we will seriously consider the license issue. Please do let us know if you know any better licenses that can be used.
I just thought it seemed a bit odd to me to have the project tied to large organizations in China so much. I’m not saying to “buy American” but I do think it’s reasonable to be perplexed.
I don't want to defend that company or the product, or the country they operate from, but the source code is all on github under a permissive license and thus can easily be audited for government backdoors. Where's the problem?