This is awesome. I have experience running a CitusDB cluster, and it pretty much solved a lot of the scaling problems I was having at the time. Having it go open source now is a huge benefit for my future projects.
> With the release of newly open sourced Citus v5.0, pg_shard's codebase has been merged into Citus...
This is fantastic, sounds like the setup process is much simpler.
I wonder if they have introduced the Active/Active Master solution they were working on? As I understand it, there used to be one master and multiple worker nodes, and the solution was to have a passive backup of the master.
If, say, they release the Active/Active Master later this year, that's huge. I can pretty much think of my DB solution as done at that point.
We're working on making Citus masterless. In all openness, we evaluated two different approaches to this in the past six months, and wrapped up the design for one. This design works well on the cloud, and we already demonstrated a working version: https://youtu.be/_nun2S6EdWo?t=411
Since the Citus master executes distributed queries by sending regular SQL queries to the Citus workers, you could already use BDR servers as workers and replicate the data between pairs of workers in different data centers and copy over the metadata on the master manually. However, some distributed joins and data loading features wouldn't work.
For all features to work, and to replicate the master, you would have to compile Citus against BDR, which probably requires a few code changes.
You could always compute a geohash and use that as a shard key... I'm not familiar enough with Citus' specific approach here, but using a limited geohash would give you close to what you're looking for.
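To make the geohash idea a bit more concrete, here is a minimal Python sketch. It implements the standard geohash encoding and truncates it to a short prefix, so that nearby points share a key. The helper name `shard_key_for` and the 4-character prefix length are illustrative assumptions, not anything Citus provides.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # geohash interleaves bits, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def shard_key_for(lat: float, lon: float, prefix_len: int = 4) -> str:
    """Hypothetical helper: a truncated geohash used as the shard key."""
    return geohash(lat, lon, precision=prefix_len)
```

A shorter prefix means larger cells and fewer distinct keys; a longer prefix spreads data more evenly but keeps fewer neighbours together.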
On PostgreSQL language support, we're updating our FAQ to have more information: https://www.citusdata.com/frequently-asked-questions Since the PostgreSQL manual (and its feature set) spans over 4K+ pages, we found that the best way to think about Citus' capabilities is from a use-case standpoint. If your workload needs distributed transactions that span across machines, or large ETL jobs, Citus currently isn't the best fit.
Citus supports sharding and replication out of the box (#4, #5). On #6, reads go through a master node (metadata server) and you see what you write.
We don't have #7. The way in which we implement this also has implications on your other questions. Multi-master (no single metadata server) is by far the biggest feature request that we receive: https://news.ycombinator.com/item?id=11353866
If we go with the approach in https://github.com/citusdata/citus/issues/389, you will be able to configure #3, #6, #7 through PostgreSQL's streaming replication settings. We still won't support distributed transactions that span across multiple machines.
On #8, could you elaborate a bit more? Do you mean a logical identifier for the node?
Also, it's hard to write a concise reply on a topic that requires so much context. I'd love to grab coffee with anyone who's interested in diving deep into distributed databases. Feel free to shoot me an email at ozgun@citusdata.com
As soon as they're built in PGDG! Our Docker image just builds on the PostgreSQL 9.5.1 image, then installs a .deb we built.
I've been wrapping up all our packaging work during the past week, but not having an OSS release yet was the final blocker for getting into well-known repos. We'll probably have a post about this in the near future.
That's correct, and with such a significant license change I think the term 'unfork' is being used inappropriately in the title.
Edit: the PostGIS extension is GPL, and that license choice has been very successful. Hopefully the AGPL works out at least as well for Citus, I'm just not familiar enough to know what the implications will be in this context.
But then it doesn't really belong or aim to be in PostgreSQL proper, at least IMO, so I think that's fine.
The functionality (i.e. distributed processing and horizontal scaling) that Citus has done is something that I predict will eventually be part of standard PostgreSQL, but it will not be Citus's code (unless they change the license, of course).
Citus owns the copyrights, and the CLA ensures that they will continue to. That means that they can reassign it under a different license if they so choose, and they could at some point choose a liberal license.
And a CLA in place also means many contributors will shy away from doing so because they have no idea what happens with their contribution in the next version. Either it's a community project owned by everyone under an agreed upon license or it's more like a record deal for new artists.
Not only that: running the app requires copying it into memory; editing files requires making copies; deploying it to your production servers requires making copies.
I remember that exception, and it was very explicit that you needed to have already obtained copyright permission to use the software. The legal theory behind the "copy to memory" exception is that if an end user has already been granted permission to use a copyrighted work, then it makes sense that they also have permission to "copy" it to their computer's memory.
If you have a copyright license that adds a condition to the grant of permission, it's a hard sell to argue that the exception trumps the condition.
It says, explicitly, that making copies for the purpose of executing it is not an infringement. I do not need permission if it is not an infringing act.
"Notwithstanding the provisions of section 106, it is not an infringement for the "owner" of a copy of a computer program"
If you do not have licensed permission from the author, you now have to prove that you are the legal owner of the copy. It could be said that they intended to address people who pirated a copy from a torrent site (or a BBS, since this was before the time of torrents), but it's a major claim that someone can own a copyrighted work without (1) purchasing it or (2) having a copyright license to it.
When it comes to the GPL in particular, the question is also whether the recipient is an owner or a licensee. The license text specifically defines "you" as equivalent to "licensee", which makes the claim of ownership even more dubious.
I don't think acquiring a copy is an act of distribution by the acquirer. Besides, the copy you received while acquiring it was not modified by you, so that would not affect the modified version which was never published. AGPL cannot change the definition of the word publish to be different than that of the U.S. Copyright Code.
You don't own the copyright. You don't have the right to download it on to your hard drive (aka "reproduce the copyrighted work in copies"), except to the extent AGPL allows you to.
As far as I know it means it is contagious over the network.
If not I have misread and quite a few other people have misunderstood as well I guess.
See for example MongoDB and others that distribute the core under AGPL and the drivers under MIT or something so that you can actually use it for something without having to make your product AGPL licensed as well.
Edit: did some quick googling and found this in another HN thread about some other nice but AGPL-licensed software I've never heard about again:
"""The Opa web site says that I cannot: "Firstly, 'if I'm using AGPL Opa to develop an app does it need to be AGPL, too?'. Long story short: yes, it does." """ [0]
IANAL, but being a nerd I have read a lot of licenses, and it seems pretty clear to me that that's what companies have in mind when they license something under the AGPL.
Unless someone can prove me wrong I hope we can stop spreading the misconception that AGPL somehow isn't viral.
Based on what GNU says in the first link it seems kind of reasonable, although I would not dare to use it without checking carefully with lawyers, as this obviously isn't what a lot of people think.
Unfortunately very few of those people have actually read the license. As another commenter said, I don't see how you can read the license and come to the conclusion that a user of an AGPL program over the network would have to AGPL their code. It just doesn't make any rational sense.
That might be what some companies have in mind, but those companies are wrong to use the AGPL because that isn't what the license says. I don't really know how one could read the AGPL and think that it requires clients to be AGPLed.
The only difference between the AGPL and GPL, other than the name, is that the AGPL requires you to offer to provide the source of your modified version of the software to all of its users. In short: network interaction equals distribution.
It says nothing about what licenses may apply to any software that interacts with the AGPLed software over a network. Anyone who suggests otherwise is misleading you or is misled themselves.
Edit:
Note that MongoDB's drivers are MIT-licensed because they are made "part of" the client software. (A)GPL virality would come into play for drivers incorporated into clients.
> Note that MongoDB's drivers are MIT-licensed because they are made "part of" the client software. (A)GPL virality would come into play for drivers incorporated into clients.
OK. Makes sense. I might have misunderstood in which case I feel I almost owe GNU some kind of apology. If I become convinced I can at least try to add that fact to later discussions about AGPL.
That said, it seems more than one company thinks it is viral over the net (Odoo and Opa come to mind), and I would be really interested if someone who really understands licenses would explain.
Odoo thinks it is viral over the net? I've never seen that argument. It was only viral to modules running on the Odoo server. If you used XML-RPC to talk to it over the net, I don't think Odoo S.A. would claim you had to distribute your code.
Possibly I'm wrong again. Actually happy about it.
If the network part of AGPL only affects the original AGPL software then it even kind of makes sense although I still wouldn't use it myself except as an extortion scheme :-P
Well, I can't say I agree with your position on the AGPL. But as a side note, Odoo has recently been re-licensed as LGPL, so one can now distribute proprietary modules for it (and even sell them on the official Store).
Since citus is just an extension, the normal drivers for postgres just work. And those are BSD licensed (well, it's the postgres license, but the difference is essentially nonexistent).
No license is contagious, or viral. It's a license, not a law, you are under no obligations to honor it. It is not an implicit contract between you, or anybody else. It simply states what rights are granted to you, and under which conditions. If you fail to comply you simply don't have those rights, and it's just normal copyright violation.
Should you redistribute a modified copy and not license your modification accordingly, the recipients do not get an implicit license to your modifications.
The only way your code will get licensed to others is when you explicitly say so.
Opa is a really bad example, since every program that uses the standard library will be utterly dependent on that library. The program cannot work at all without the copyrighted work of the library authors and, as such, does not even have a basic property like independence.
If I write a program that connects to a database, my program will happily work fine without that specific database software. I could replace the database with a different one, and the program would work identically. Outside of some database-specific quirks, programs that connect to a database do so as independent works.
Another aspect is that if you change the standard library of a programming language, the programs that use it will change too. If you change the database software, most well-written software won't be affected and will run identically, as if it were connected to an older version of the database.
And last, software that connects to a database won't share memory with the database software. Programs that use a standard library will share memory with the standard library. The connection between a program and its standard library is significantly closer than the connection between a program and a database.
If anyone from Citus is reading this: how does this affect your business model? I remember when I asked at Strata conf a couple of years ago why isn't your stuff Open Source, the answer then was "because revenue". So what changed since then?
Several things have changed over the last two years that allowed us to make this happen: Most importantly, we've continued building out the product for a more broad user base, grown with more customers and users, received further funding as validation, and expanded both our team and product to offer additional revenue generating services. All put together, open sourcing Citus is something we've always wanted to do, and we are excited to continue building on it for many years to come, with the help of the community and our enterprise customers.
Hi, Craig from Citus here. We have some premium features in our Enterprise edition. Many of these are features that larger enterprises will want to pay for such as security features around roles, a tool for automated cluster resizing, and enhanced load balancing tools, and of course support.
Beyond that, we have a few other things in the works for the future that will cover other revenue models.
Companies of any appreciable size will be happy to pay for support if they choose to make Citus a part of their critical infrastructure. And the industry has reached an inflection point where enough companies want as much of their infrastructure as possible to be open source that you can run a company where most of your stuff is open source while still making a ton of money (like Red Hat, CoreOS, Docker, etc.).
AGPL only requires open sourcing any modifications you make to the software when you give users direct access to a server running the software, which seems like something you would never want to do in case of a database.
This is awesome! Tebrikler (congrats) on the release of 5.0 and going OS, definitely great news.
Can you publish competitive positioning of Citus vs Actian Matrix (nee ParAccel) and Vertica? I'd love to compare them side by side - even if it's just from your point of view :-)
Second the request for comparison to Vertica. I've recentishly become a user at work, and I wonder how this compares. A quick googling didn't yield anything too informative.
(Part 3/3 - please see two comments below as the starting point) Aside from the different use-cases they address, there is one other, important difference between Citus and Redshift (and any other distributed database in the world, for that matter). Citus does not fork the underlying database, PostgreSQL. Instead, Citus extends PostgreSQL to transform it into a parallel processing, distributed database. We use PostgreSQL's powerful extension APIs to accomplish this (you simply CREATE EXTENSION Citus on PostgreSQL's latest version, 9.5, to get your distributed PostgreSQL database).
While this might appear to be an implementation detail at first, it has important product implications, and might even impact how you think about your database stack. By not forking the core database, you are choosing to always stay with the core PostgreSQL product. For starters, you get the uber-cool (and uber-fast) JSONB type that came with 9.4, or the recently checked in UPSERTs, or the popular PostGIS extension for geospatial capabilities. More philosophically, the moment you use a fork of a database, you know you'll be diverging over time. And when you introduce new databases and/or piece together many different ones to build one application, your development cycles will only get costlier and more complex over time.
This was a long answer to a short question, but hopefully useful. Let me know if you have questions, or any feedback using Citus – would love to hear your thoughts!
I think that it would be better for you to position CitusDB by comparing it to other products in terms of use cases.
If the data is big and I need to run analytic queries, then I think I have to use a columnar storage format, because row-oriented formats cause too much overhead for aggregation queries that usually need to process a single column efficiently. If I use CitusDB as an analytical database, then it's comparable with Redshift, Hive etc. As you said, they're suitable for offline data, but can I use cstore_fdw in CitusDB and still take advantage of the real-time nature of PostgreSQL? Maybe I can push hot data to a table that uses a row-oriented format, move the data periodically to another table that uses cstore_fdw, and execute queries that fetch data from both the cold storage and hot storage tables? If CitusDB makes this easy for me, then I think this is huge.
I guess another use case is using CitusDB as distributed data store and executing filter queries such as "SELECT * FROM table WHERE partition_key = x and predicate1 = y ...". Instead of using multiple Postgresql instances and routing the queries in application level, I can just use CitusDB that takes care of replication && query routing && sharding etc. I think it can also be comparable to databases such as Cassandra, Mongo (using jsonb) since they also have similar use-cases.
Or should I think of CitusDB as distributed PostgreSQL?
> If I use CitusDB as an analytical database, then it's comparable with Redshift, Hive etc.
A particular difference is in response times and concurrency. Data warehouses and Hive are great for reporting use-cases, but not for use-cases that require fast responses and have many users like analytical dashboards. This is a use-case for which Citus is particularly well-suited (see for example the CloudFlare dashboard).
> Can I use cstore_fdw in CitusDB and able to take advantage of real-time nature of Postgresql?
Yes, since cstore_fdw and Citus are both developed by Citus Data we made sure they're fully integrated. We've even seen some deployments that use a mixture of columnar- and row-based storage in a single distributed table.
We find that row-based storage generally has better ingestion performance and more indexing possibilities. Citus can do very fast execution of analytical queries by parallelizing over row-based shards and using the indexes on each of them. However, if you only need a small number of columns and have analytical queries that are not very selective, you can use columnar storage just as easily and even mix and match (might require some support).
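As a rough illustration of the hot/cold pattern discussed here, the Python sketch below moves old rows from a "hot" table to a "cold" table in bulk and queries both together. It uses the standard-library `sqlite3` module purely as a runnable stand-in; the table names `events_hot` and `events_cold` are made up, and in a real deployment the cold table would presumably be a cstore_fdw foreign table on PostgreSQL or Citus.

```python
import sqlite3


def archive_old_events(conn: sqlite3.Connection, cutoff: int) -> None:
    """Move rows older than `cutoff` from the hot table to the cold table in bulk."""
    with conn:  # one transaction: copy first, then delete
        conn.execute(
            "INSERT INTO events_cold SELECT * FROM events_hot WHERE ts < ?",
            (cutoff,),
        )
        conn.execute("DELETE FROM events_hot WHERE ts < ?", (cutoff,))


def count_events(conn: sqlite3.Connection) -> int:
    """Count events across hot and cold storage with a single query."""
    row = conn.execute(
        "SELECT COUNT(*) FROM ("
        "SELECT ts FROM events_hot UNION ALL SELECT ts FROM events_cold)"
    ).fetchone()
    return row[0]


# In-memory stand-in for the real database (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events_hot (ts INTEGER, payload TEXT)")
conn.execute("CREATE TABLE events_cold (ts INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events_hot VALUES (?, ?)",
    [(t, f"event-{t}") for t in range(10)],
)
archive_old_events(conn, cutoff=7)
```

The bulk `INSERT ... SELECT` matters because the cold side is meant for batch appends rather than single-row writes.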
> I guess another use case is using CitusDB as distributed data store
Yep, Citus can definitely be used for that by using hash-partitioned tables.
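For readers who haven't seen hash partitioning before, here is a small Python sketch of the general idea: hash the partition key, split the hash space into equal ranges, and send the query to whichever worker holds that range. CRC32 is only a stand-in hash, and `shard_for_key`, `worker_for_key`, and the default of 32 shards are illustrative assumptions rather than Citus internals.

```python
import zlib


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index by splitting the hash space into equal ranges."""
    hash_value = zlib.crc32(partition_key.encode("utf-8"))  # 0 .. 2**32 - 1
    range_size = 2**32 // shard_count
    # The last range absorbs the remainder when shard_count does not divide 2**32.
    return min(hash_value // range_size, shard_count - 1)


def worker_for_key(partition_key: str, workers: list, shard_count: int = 32) -> str:
    """Pick the worker that holds the shard for this key (round-robin shard placement)."""
    shard = shard_for_key(partition_key, shard_count)
    return workers[shard % len(workers)]
```

A filter query on `partition_key = x` only needs to touch the one shard that `x` hashes to, which is what makes this kind of lookup cheap.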
(Part 2) Now comes the storage engine, and cstore_fdw as it relates to PostgreSQL. Built by the Citus Data team, cstore_fdw is entirely a separate component from the Citus product above. It enables columnar storage for your vanilla, single-node PostgreSQL to provide data compression for faster analytics. As such, cstore_fdw does not come with any of the parallelism I've described above that Citus (or Redshift, Vertica etc.) provides.
Precisely because cstore_fdw is built for PostgreSQL, and Citus is PostgreSQL (see Part 3), however, you can still choose to use cstore_fdw as the storage engine for your Citus cluster. Citus will still parallelize the queries as you'd expect it to, but instead of hitting row-based tables, they will hit columnar ones. cstore_fdw has certain limitations, importantly it is not updatable; so we don't consider it as an alternative to a data warehouse. Rather, it is useful if you are archiving your quickly growing timeseries / event data on PostgreSQL or Citus.
To load or append data into a cstore table, you have two options:
You can use the COPY command to load or append data from a file, a program, or STDIN.
You can use the INSERT INTO cstore_table SELECT ... syntax to load or append data from another table.
Note: We currently don't support updating table using DELETE, and UPDATE commands. We also don't support single row inserts.
So I think you can certainly mutate tables, but the focus is on bulk-inserts, rather than individual append actions.
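Since single-row inserts aren't supported, an application that produces rows one at a time would need to batch them up. A minimal Python sketch of that buffering follows; the `BulkAppender` class and its `flush` callback are hypothetical, and in practice the callback would issue a COPY or an INSERT ... SELECT.

```python
class BulkAppender:
    """Hypothetical buffer that turns single-row appends into bulk loads."""

    def __init__(self, flush, batch_size=1000):
        self._flush = flush            # callable that receives a list of rows
        self._batch_size = batch_size
        self._rows = []

    def add(self, row):
        """Buffer one row; trigger a bulk load once the batch is full."""
        self._rows.append(row)
        if len(self._rows) >= self._batch_size:
            self.flush()

    def flush(self):
        """Hand any buffered rows to the bulk loader and start a new batch."""
        if self._rows:
            self._flush(self._rows)
            self._rows = []
```

Callers should invoke `flush()` once at shutdown so a partially filled batch is not lost.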
Umur from Citus here. For purposes of this question, I’ll bucket traditional data warehousing (DWH) solutions like Redshift, Vertica, Greenplum together, although there are many nuances among each of them of course.
First, Citus is not a traditional data warehouse. We position Citus as the real-time, scalable database that serves your application under a mix of high-concurrency short requests and ad-hoc SQL analytics (i.e. think both random and sequential scans for a customer-facing analytics app). The default storage engine for Citus is the PostgreSQL storage engine, which is row-based. This is in contrast to many data warehouses, which often use a column store and/or batch data loads, and are focused purely on analytics. The trade-offs you get are:
- Citus vs. DWH performance: DWH and Citus both have a similar parallelization for analytics queries (multi-core, multi-machine), but most data warehouses typically use a columnar storage engine instead of a row-based one. Columnar storage is designed for faster analytics queries, so that makes columnar DWH generally faster on longer running analytics queries. However, this comes at the expense of (1) concurrency and (2) short-request performance (think simple lookups, updates, real-time data ingest) vs. Citus' row-based storage. If you've tried having 10s of concurrent connections to Redshift for short lookups, or performing 100s/1000s of inserts/updates to power your application, these limitations will be familiar. This is to be expected, as Redshift is not designed as a real-time operational database, but an offline data warehouse.
In essence, the two classes of products are more complementary than substitutes, even though they have some overlap in their analytic capabilities. Something like Redshift will give you fast offline analytics after you move your data in batch (via S3); Citus will directly power your analytic apps in real time, without ETL'ing your event/user data back and forth between separate OLTP and OLAP databases. Both can be extremely fast: Redshift can run complex data warehousing queries that would otherwise take an hour in a few minutes; Citus can scan and aggregate 100 million records in a few seconds while simultaneously ingesting your events in real time.
I hope that provides some clarification on the workloads. There is a lot more, including columnar storage and product approach (re: implications of extending Postgres 9.5 vs. forking Postgres 8.x), and I’ll dive into those in separate comments as well.
Thank you for the answers, Umur. I've used both Vertica and ParAccel in production environments for the traditional Data Warehousing projects and have come to appreciate both good and the bad that analytic RDBMS engines bring to the table.
Currently, my favorite is Vertica, but I do have concerns about its future under the stewardship of HPE.
I'm quite interested in what Citus brings to the market and will be following its progress closely. Once you have a more rounded story for the traditional Data Warehousing purposes, I can recommend it to my clients for evaluation purposes.
In terms of a sweet spot for you, here's a free tip for your sales: target customers of Unica (well, IBM Unica now). That's one application that would definitely benefit from your Operational Analytics positioning - lots of data ingested throughout the day, lots of queries to run for the analytics.
"Currently, my favorite is Vertica, but I do have concerns about its future under the stewardship of HPE."
HPE employee here (Not a vertica team member, though)
Many teams within HP are very excited to use Vertica. Many more teams are looking to use Vertica for our own product offerings. There's no reason, in the short term, for HPE to shift away from Vertica. On the other hand, I'd say that HPE will invest more into Vertica.
Unforking is a very smart decision. Postgres also has gained a lot of favour since MySQL was bought by Oracle. Altogether Citus has earned a lot of kudos for that move, at least with me, for all that may count!
That depends on your setup. For the master instance, you'd run it just as you would in other setups; streaming replication is common there. For the sharded instances, Citus lets you set your replication factor. Citus is then aware of when a node fails and will automatically redistribute the data to a new node, essentially taking care of that for you.
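To illustrate what a replication factor and re-replication after a node failure mean in general, here is a toy Python model. It is not Citus's actual placement logic; `place_shards`, `handle_node_failure`, and the least-loaded selection rule are all assumptions made for the sketch.

```python
def place_shards(shard_ids, nodes, replication_factor=2):
    """Assign each shard to `replication_factor` distinct nodes, round-robin."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        shard: [nodes[(i + r) % len(nodes)] for r in range(replication_factor)]
        for i, shard in enumerate(shard_ids)
    }


def handle_node_failure(placements, failed, live_nodes):
    """Drop the failed node and copy each affected shard to another live node."""
    for replicas in placements.values():
        if failed not in replicas:
            continue
        replicas.remove(failed)
        candidates = [n for n in live_nodes if n not in replicas]
        if candidates:
            # Prefer the live node that currently holds the fewest shard replicas.
            def load(node):
                return sum(node in reps for reps in placements.values())

            replicas.append(min(candidates, key=load))
    return placements
```

With three nodes and a replication factor of 2, losing one node still leaves every shard with one surviving copy to re-replicate from.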
Since the move to open source, more recent upstream changes have slowly been merged into the code base, though they still seem to be on an 8.3 base, with a couple of years' worth of code still to go through.
Greenplum is a fork of the Postgres codebase; Citus is not. It's an extension that leverages community Postgres's extensibility APIs. This point seems to be highlighted in their post.
Citus can parallelize SQL queries across a cluster and across multiple CPU cores. How does it compare with the upcoming 9.6 version of PostgreSQL, which will support parallel sequential scans, parallel joins and parallel aggregates?
AFAIK all the parallel work done in 9.6 refers to parallel operations on a single node (but multiple cores).
This would be complementary to what Citus does, which is distributing the load across multiple shard instances (each with their own cores, benefiting from the parallel work in 9.6).
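The general scatter/gather pattern behind distributing an aggregate across shards can be sketched in a few lines of Python. Each "shard" here is just an in-memory list and the threads stand in for worker nodes; `partial_aggregate` and `distributed_average` are hypothetical names, and this is not a description of how Citus plans queries.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(shard_rows):
    """Runs on each shard: return a partial (sum, count) instead of raw rows."""
    return sum(shard_rows), len(shard_rows)


def distributed_average(shards):
    """Scatter the aggregate to all shards in parallel, then combine the partials."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(partial_aggregate, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count
```

Note that an average has to be shipped as a (sum, count) pair: averaging the per-shard averages would give the wrong answer when shards differ in size.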
Yes, but Citus can also parallelize on multiple cores when used on a single machine ("If you’re running Citus on a single machine, this will scale queries across multiple CPU cores and create the impression of sharding across databases."). Will this functionality become obsolete with 9.6?
Fair point - I assume this will be merged together in some way (i.e. the Citus stuff building on top of the parallel scan infrastructure), but probably a better question to ask on #citus IRC / open a Github issue.
Some Postgres committers work on Citus as well (e.g. Andres Freund), so I'm sure this has been thought through before.
Yikes, I keep calling it citrus. The word citrus is so ingrained that I'm actually having a hard time pronouncing citus when I see the word. I didn't even notice I was doing it wrong until I saw your comment.
This is fantastic news! Postgres does not have a terribly strong High Availability story so far and of course it also does not scale out vertically.
I have looked at CitusDB in the past, but was always put off by its closed-source nature. Opening it up seems like a great move for them and for all Postgres users. I can imagine that a very active open-source community will develop around it.
Are there any limitations you have run into?
E.g. can you still use all index types that Postgres offers or are there any special distributed index types that CitusDB adds perhaps?
Correct. CTEs are pretty much a no-go, and you also can't do things like join between distributed and non-distributed tables.
Your queries tend to break down into ones where you're hitting a small number of shards (such as when we serve data for the analytics page), or else ones where you tend to aggregate up into temporary, non-distributed tables for further analysis (such as you'd do for infrequent business reporting).
All told, it's been one of the easier tools for us to operationalize, largely thanks to the fact that it's "just PostgreSQL." The one thing I wish we had was better documentation of what kind of consistency guarantees to expect, although for an append-only store like our current use case, that's less of a concern. And to be fair, Ozgun and the other guys at Citus have always been really happy to answer any questions we have.
With this code now going open source, it should be pretty easy to look into these sorts of internals.
My recollection from the last time I played with it, some bits from core postgres are unsupported; things like sequences and recursive CTEs. Maybe all CTEs?
It doesn't scale vertically either. Postgres runs each query in a single process, meaning a single query can't make use of multiple CPU cores on the same machine. There have been some slow improvements to this, and 9.6 seems to hint at some parallel aggregation changes, but overall Postgres is strong in features but weak in scaling (in any direction).
It looks good, will have to wait and see what actually makes it into 9.6 and how well it supports all the possible queries. This is something that many commercial databases have solved so Postgres is pretty behind in vertical scaling.
British English vs. American English? I dunno if the poster is British, but I am. I read it as saying that they haven't done enough to call pg vertically scalable, but recent work suggests (hints) that they are working toward it. The expression is more metaphorical than literal, an Atlantic divide my American colleagues explicitly warn me about when I am writing my perf reviews.
I'd very much like to see what algorithm these systems are using to enable transactions in a distributed environment. Are they just using straight two-phase commit, and letting the whole transaction fail if a single server goes down? Or are they getting fancy and doing some kind of replication with consensus?
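For anyone unfamiliar with the "straight two-phase commit" option mentioned in the question, here is a minimal in-memory Python sketch of it. It only illustrates the textbook protocol, where one failed participant aborts the whole transaction; it says nothing about what Citus does (elsewhere in this thread the Citus team states that distributed transactions spanning machines are not supported). The `Participant` class and the `healthy` flag are made up for the example.

```python
class Participant:
    """Hypothetical stand-in for one database server in the transaction."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.state = "idle"

    def prepare(self):
        """Phase 1: vote yes only if this server can guarantee a later commit."""
        if not self.healthy:
            return False
        self.state = "prepared"
        return True

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit on all participants, or abort everywhere if any vote fails."""
    # Phase 1: collect votes; a single "no" (or a down server) aborts everything.
    for participant in participants:
        if not participant.prepare():
            for p in participants:
                p.rollback()
            return False
    # Phase 2: everyone voted yes, so tell everyone to commit.
    for participant in participants:
        participant.commit()
    return True
```

The sketch leaves out the hard part of real 2PC, which is recovering when the coordinator itself fails between the two phases.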
One thing I'm having trouble with is finding information about transactional semantics. If I make several updates (to differently sharded keys) in a single transaction, will the transaction boundaries be preserved (committed "locally" first, then replicated atomically to shards)? Or will they fan out to different shards with separate begin/commit statements? Or without transactional boundaries at all?
In fact, I can't really find any information on how CitusDB achieves its transparent sharding for queries and writes. Does it add triggers to distributed tables to rewrite inserts, updates and deletes? Or are tables renamed and replaced with foreign tables? I wish the documentation was a bit more extensive.
My guess is that Citus is making enough money from consulting that they don't need to keep this code closed source when they can profit from free community-driven growth while they are expanding their sales pipeline through consulting.
Hi, Craig from Citus here. In addition to the open source Citus, we have some premium features in our Enterprise edition. Many of these are ones that larger enterprises will want to pay for, such as security features around roles, a tool for automated cluster resizing, enhanced load balancing tools, and of course support. Beyond that, we have a few other things in the works that will speak to various revenue models for the future.
Since I heard last year at PgConfSV that you will be releasing CitusDB 5.0 as open source, I've been waiting for this moment to come.
It augments 9.5's awesome capabilities with sharding and distributed queries. While this targets real-time analytics and OLAP scenarios, being an open source extension to 9.5 means that a whole lot of users will benefit from this, even under more OLTP-like scenarios.
Now that Citus is open source, ToroDB will add a new CitusDB backend soon, to scale-out the Citus way, rather than in a Mongo way :)
I don't have a ton of experience scaling out and using different flavors of PostgreSQL but I had run across Postgres-XL not long ago; does anyone know how this compares to that?
The BSDL does not make much economic sense to the company open sourcing their code; a new competitor would fork the code, make closed improvements, and merge any changes from the open source code. That means that the competitor is always gaining by a one-way flow of improvements.
To use open source code, the more permissive the license the better. But to actually open your own code, BSDL is a very tough sell.
That's also why they use the AGPL. With database systems, even if they were under the GPL, some competitors could just modify the system and run it on their own server with improvements, and offer just the service to their clients. Again, the improvements go one way only: since the competitor would not distribute the modified system, as it's running on their servers, they would not need to distribute source changes. With the AGPL, that loophole is closed.
>> The BSDL does not make much economic sense to the company open sourcing their code
If this were true then Cloudera, Horton and a whole bunch of other companies would be out of business, yet in reality they are doing really well. All that AGPL is doing for Citus is:
1. Turning away people (customers) who are religious about licenses.
2. Eliminating any possibility of this code ever being integrated into PostgreSQL
I hope that at least Cloudera will profit and thrive. Having used plain Apache components, then Hortonworks, then Cloudera... They are far and away the superior distribution, regardless of whether or not you are an "enterprise" customer.
My $employer paid Hortonworks for a support contract and I have no qualms declaring publicly that it was a total and utter scam. (We are a Java shop and know our shit)
If I go into too much detail I'll write countless pages like my internal report on why we needed to switch, but the bottom line is that Cloudera's products are well and honestly documented, while, as of last year, Hortonworks' products are simply one land-mine after another.
Their management platform (Cloudera manager) being closed-source is barely a mark on the comparison analysis when you compare it to Ambari in practice. Ambari is a bad joke and I'd rather do without it after spending significant time using it and trying to extend it.
And I could go into excruciating detail as to how Hortonworks abuses the Apache License to try and force lock-in. It's disgusting and pathetic.
I have no experience with Hortonworks, but Cloudera certainly tries hard. It too has plenty of issues, but then the whole "Hadoop ecosystem" is in such fast-moving flux that it's understandable.
Cloudera also makes (Apache Licensed) Impala, which IMHO is a pretty cool product.
Another company worth mentioning is Databricks, which leads Apache Spark development.
So you take a BSDL codebase, fork it, close it, make proprietary changes, profiting from the BSDL codebase, then slap the PG community in the face by open sourcing it under a more restrictive license hoping to benefit from the community you just slapped in the face but restricting competition.
They are of course free to release their code under any license they wish. I just think releasing code under the *GPL when you profited from a liberal BSDL is a douche nozzle thing to do. But knock yourself out! This tells me all I need to know about the company.
I don't agree with you. The PG community is, I think, fine with that: that's exactly what the BSDL allows that the GPL doesn't, and they chose the BSDL. If the PG community doesn't like that, I really don't understand why they chose the BSDL.
If the pg community did not intend their database to be used that way, they would've chosen a copyleft license. Have you considered that this is part of their intention?
Would CitusDB have been created at all had they not been able to sell a proprietary fork, as they have been doing in the past?
I also have qualms about distributing changes to non-copyleft license code under a copyleft license, but it seems strange to make this the "slap in the face" moment - wasn't distributing them under a proprietary license even worse?
No that IS what the BSDL allows for. I wasn't arguing that they shouldn't have made a commercial product on BSDL code. That also was within their rights using said BSDL code. It's just my opinion that to do that, then release your proprietary "bits" under the GPL or variation thereof with more restrictions than what you started with instead of the same BSDL you used to start your business is a dickheaded douche nozzle thing to do. But again they are free to do that, it's within their rights! Just don't be shocked when people like me call you out for what you are. Keep on rocking Citus! Stay classy.
I understand you weren't saying they don't have the right to do so, I just find it weird that you consider distributing it under the AGPL to be douchy, while not considering the distribution under a proprietary licence to be (even more) douchy.
A relicense is a relicense. Imposing new rules on a product, especially if you weren't the original author, is rude.
I'm a GPL zealot, to the point that I've used the GPL as a weapon and as a shield against others in multiple capacities ("You own all my code when you employ me? Sure, as long as I get to dictate the license"). However, I would never take someone's 2 or 3-clause BSD-licensed product and relicense it. Those of us who value software freedom respect the rights granted by other licenses that value the same.
We all see why it was done in this case, however: in order to ensure software freedom in the cloud (someone else's computer), there isn't another license to use. The BSD license completely breaks down in this usage scenario, and the best we have is the AGPL.
I'd say cloud usage is to 3-clause BSD what Tivo was to GPLv2.
Does CitusDB fit OLAP workloads that aggregate hundreds of millions of records over dimensions of varying order and size (e.g. Druid), with a maximum response time of 3 seconds and as few boxes as possible? Or do other techniques have to be used along with CitusDB? Can you shed some light on your experience with CloudFlare in terms of cluster size and query performance?
Hundreds of millions of records in <=3 seconds is not really a big challenge with a good data model and proper indexing on even a single server.
I work for a BI consultancy and we don't even bat an eye until we hit billions of records in a primary fact table.
Certainly the DB server does need to scale vertically to some extent as you pass through the orders of magnitude > 10M. A good columnstore engine is also worthwhile to consider.
Great product - it would be nice to have an admin interface like RethinkDB's, where you can clearly define your replication and sharding settings.
Is there any documentation on how to do this from the command line?
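Not official documentation, but here is a rough sketch of what setting the shard count and replication factor from a psql session might look like. It assumes the Citus 5.0 functions `master_create_distributed_table` and `master_create_worker_shards`; the table name `events`, the column `user_id`, and the helper `distribution_statements` are made up for illustration. The Python below only builds the SQL strings, it does not connect to a database.

```python
def distribution_statements(table, column, shard_count, replication_factor):
    """Build the SQL to hash-distribute a table (hypothetical helper).

    Assumes the Citus 5.0 function names; check the docs for your version.
    Names are only sanity-checked here, not properly quoted or escaped.
    """
    if not (table.isidentifier() and column.isidentifier()):
        raise ValueError("table and column must be plain identifiers")
    if shard_count < 1 or replication_factor < 1:
        raise ValueError("shard_count and replication_factor must be >= 1")
    return [
        # Citus is installed as a regular PostgreSQL extension.
        "CREATE EXTENSION IF NOT EXISTS citus;",
        # Declare the distribution column and the partitioning method.
        f"SELECT master_create_distributed_table('{table}', '{column}', 'hash');",
        # Create the shards and place the requested number of replicas on workers.
        f"SELECT master_create_worker_shards('{table}', {shard_count}, {replication_factor});",
    ]


if __name__ == "__main__":
    # Example: 16 shards, each stored on 2 workers.
    for statement in distribution_statements("events", "user_id", 16, 2):
        print(statement)
```

You would paste the printed statements into psql on the master node after creating the table itself; treat the exact function names and arguments as something to confirm against the Citus docs.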
I recently switched back to MariaDB because I didn't see a clear/easy path for Postgres scalability in case the project I am working on takes off. I am under the assumption there are at least two fairly simple approaches to scale MySQL: master-master replication using Galera, and Aurora from AWS. What do you guys think? Am I right in thinking MySQL is easier to scale, given I want to spend the least amount of time maintaining it?
Having been burned before, I will never use an OSS infrastructure project that has enterprise features you need to pay for. They always try to move you to paid and make the OSS version unpleasant to use over time, as soon as the bean counters take over to milk you.
"For customers with large production deployments, we also offer an enterprise edition that comes with additional functionality"
This is the model used by many companies backing OSS. The fact that you have been burned before means the actor in that case (or cases) acted badly, not that the model is wrong.
Software isn't free to produce, and the need to make money off software isn't something companies should be ashamed of. In fact, nowadays I'm leaning towards trusting OSS with clear financial sustainability over software whose long term existence seems shaky.
Do you often make big decisions based on extrapolation from so few data points?
I use a major open source system with enterprise features and support but don't pay for any of those options. I've used it for 3 years and it's been invaluable. No pressure to start paying for anything. Some of its premium features have actually become free over that period. But I wouldn't decide that all open source systems with premium features are safe based on that experience.
If this cost roughly a million dollars, then yes. Especially if you're locked in like with a DB. I use nginx because even though it has this model it would be easy to replace with something else.
I definitely view "open core" products with greater skepticism than truly open source ones, but I think it comes down to the community surrounding (and engendered by) the sponsoring company/foundation. These Citus guys seem to be really enthusiastic about contributing their work to the community. That attitude mitigates any concerns I have, because to me it seems that they are really a part of the larger PostgreSQL community---not just trying to take advantage of it like certain companies whose names we won't mention.
What are some alternatives for paying their employees to develop these products that are 100% free? There are some excellent commercial organizations that drive some of the tech that many or even most of us have come to rely on. This goes all the way down to Linux and BSD itself.
One must thank them for open sourcing this, and cannot blame them for using a different license, but using a different license makes me think calling this "unfork" is bending the truth a little bit.
The "unfork" part is primarily about not forking the postgres codebase anymore, as done before citus 5.0 (i.e. we modified parts of postgres, to make it citus). Citus now entirely works as an extension to postgres, using the extension facilities postgres provides.
Perhaps I'm missing something, but this is just an extension that works with standard postgres, there are no code changes in postgres itself, so it doesn't look like it ever was a fork.
Yes, that's what you're seeing right now, but in the past Citus (used to be "CitusDB") was a superset of the entire PostgreSQL codebase. During the lead-up to the open source release, we removed the use of any static methods or internal machinery and rewrote the installation process to use the PostgreSQL CREATE EXTENSION command. Additionally, we moved all of pg_shard's DML functionality into Citus to unify the product line.
So ultimately CitusDB was a fork but is now entirely an extension.