We (Pivotal, for whom I don't speak in any official capacity) have also opensourced Apache HAWQ (incubating)[1], which is an SQL front-end for Hadoop that was extracted from Greenplum, as well as Apache Geode (incubating)[2] which was based on GemFire.
This was part of a general announcement we made in February that our intention was to opensource our data products as it became possible to do so.
Incidentally, we are hiring engineers across our entire data suite, including PostgreSQL specialists. The Data division solves seriously heavyweight distributed problems. You can look up the listings on http://pivotal.io/careers, or if you like helping a fellow engineer collect a referral, email me: jchester@pivotal.io and I'll work out which office and division to send you to.
[1] http://hawq.incubator.apache.org/
[2] http://geode.incubator.apache.org/
Oh man this is huge. Are you guys opening Chorus and the Pivotal HD stuff too? I'd be shorting HP stock right now, because Vertica just lost all of its appeal. Teradata and its Hadoop H-SQL or whatever must be shaking in their boots too. Are you guys going to attempt to upstream this or is it forked to the point of no return?
I'd love to work on a project like this, but I've got no PGSQL experience (though I have worked on HFT, both with KDB+ and by writing a fairly decent functional knock-off that ran in production, and I bet the distribution and data set problems were similar). How is EMC to work for as a parent company?
"Greenplum Database is based on PostgreSQL 8.2 with a few features added in from the 8.3 release. To support the distributed nature and typical workload of a Greenplum Database system, some SQL commands have been added or modified, and there are a few PostgreSQL features that are not supported. Greenplum has also added features not found in PostgreSQL, such as physical data distribution, parallel query optimization, external tables, resource queues for workload management and enhanced table partitioning."
Are there plans to eventually have full feature parity with ongoing PostgreSQL development (as in adding features from versions of PostgreSQL newer than 8.2), or is Greenplum going to mostly be taking its own trajectory?
> Are there plans to eventually have full feature parity with ongoing PostgreSQL development, or is Greenplum going to mostly be taking its own trajectory?
I asked about this today while I was at work.
The gist was: it will depend on a lot of factors, so it'd be unwise to nail any documents to any doors. But watch this space for further developments.
In the meantime, Greenplum speaks the PostgreSQL wire protocol. All the tools that can speak to PostgreSQL can speak seamlessly to Greenplum.
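As a rough illustration, any stock Postgres driver should work unchanged; here's a minimal sketch with psycopg2 (the hostname, database and credentials are invented):

```python
# Minimal sketch: talking to Greenplum with an ordinary PostgreSQL driver.
# Host, port, database and credentials are invented for illustration.
import psycopg2

conn = psycopg2.connect(host="gp-master.example.com", port=5432,
                        dbname="warehouse", user="analyst", password="secret")
with conn.cursor() as cur:
    cur.execute("SELECT version();")  # plain Postgres wire protocol end to end
    print(cur.fetchone()[0])
conn.close()
```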
Disclaimer: While I work for Pivotal, I work in Pivotal Labs. I'm just an engineer and I don't have any input into product direction for our data products, which is done in another division. If pain persists, consult your doctor.
Looks like a worthy competitor to Elasticsearch for analytics: it has custom partitioning strategies, true parallel querying, support for columnar table storage, and much of Postgres' rich SQL implementation.
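For anyone curious what those options look like, here's a sketch of the DDL as I understand it from the docs (table and column names are invented, and I haven't run this myself):

```python
# Sketch only: Greenplum DDL showing hash distribution and columnar storage.
# Table and column names are invented; syntax is as described in the Greenplum docs.
CREATE_SALES = """
CREATE TABLE sales (
    sale_id     bigint,
    customer_id bigint,
    amount      numeric(10,2),
    sale_date   date
)
WITH (appendonly=true, orientation=column)  -- append-only, column-oriented storage
DISTRIBUTED BY (customer_id);               -- rows hashed across segments by customer_id
"""
```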
The parallel loading looks good, but I'm concerned that the single master means it's a bottleneck for writes -- is this the case, or is there a way to distribute writes across segments without involving the master?
The other concern is that this isn't just "horizontally partitioned Postgres". It's forked from a very old version (8.2) of Postgres, and so doesn't have things like hot standby, streaming replication, the JSON datatype, GIN indexes (does it have GiST?), arrays, etc. It looks like it's optimized for parallel workloads, not for general use.
Anyone here with any experience with Greenplum who can perhaps speak about their work and the things that Greenplum is good at?
The analytics it is built for aren't the kind that ElasticSearch is typically used for.
Greenplum/data warehouse-style analytics are things like: what is the average spend of female customers with 2 children who live in postcode AAAA or BBBB? Break it down by day of week and group by their marital status. Now for the top cohort of buyers on Mondays, give me a breakdown of the 3 most popular products and our profit margin on each one. This is often called Business Intelligence (BI).
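To make that concrete, the first of those questions might look something like this in SQL (the schema is entirely invented):

```python
# Sketch of the first question above as plain SQL; the schema is invented.
AVG_SPEND_QUERY = """
SELECT d.day_of_week,
       c.marital_status,
       AVG(o.order_total) AS avg_spend
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
JOIN   dates d     ON d.date_key    = o.date_key
WHERE  c.gender = 'F'
  AND  c.num_children = 2
  AND  c.postcode IN ('AAAA', 'BBBB')
GROUP BY d.day_of_week, c.marital_status;
"""
```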
You could make ElasticSearch do that of course, but it wouldn't be much fun.
Given that context, the single master limitation turns out not to be a huge problem. Typically Greenplum is set up to load data from the online (OLTP-style) systems, and the amount of data loaded is very predictable.
Same with things like hot standby and replication. In the data warehouse world it isn't uncommon for there to be nightly periods where a batch data load occurs and/or nightly reporting is done, and the availability of external interfaces to the DB during that time may be restricted.
Thanks, that makes a lot of sense. We have those types of queries, too, which is where ES definitely breaks down. We wouldn't mind switching those parts (the "BI") of the analytics into something with higher latency.
You can then write SQL, Scala, Python, R to interact with ElasticSearch. I can't recall the performance but against Cassandra, HDFS, HBase, MongoDB etc. it is very fast.
Hey, forgive my ignorance, I'm the guy who uses this stuff, not the guy who builds it.
My understanding was that, typically, the DW is where the data gets ETLed from various places and lives in a normalized environment. Then from that big pool, departments get columnar or OLAP databases built for the specific analyses they want to do, and I connect not to the DW but to the OLAP instance (we call it a Datamart here). Was I living a lie? At first I thought that your master-segment system was the same, but then I read your comment. I looked into Greenplum a while back and wanted it because of MADlib, but my guys said "nope" and went with MS.
It can be done with a dimensional schema or with a normalised schema. Ralph Kimball argues for the dimensional approach, whilst Bill Inmon argues for the normalised approach.
For myself, I tend to go with Kimball, but there are advantages and disadvantages to both approaches. Wikipedia has a decent summary of this here:
> The parallel loading looks good, but I'm concerned that the single master means it's a bottleneck for writes -- is this the case, or is there a way to distribute writes across segments without involving the master?
Yes, this is what gpfdist[1] is for. Used properly, data can be bulk loaded in parallel across worker nodes without a master bottleneck. I've never used it, but our field engineers tell stories of heroism and victory that typically starred gpfdist as the magic sword.
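Going by the docs rather than first-hand use, the pattern is roughly: run gpfdist on the ETL hosts, point an external table at it, and INSERT ... SELECT so the segments pull their slices in parallel while the master only coordinates. A sketch (hosts, ports, file names and table definitions are all invented):

```python
# Rough sketch of a gpfdist bulk load, following the documented pattern.
# The gpfdist host/port, file glob and table definitions are all invented.
import psycopg2

conn = psycopg2.connect(host="gp-master.example.com", dbname="warehouse",
                        user="etl", password="secret")
conn.autocommit = True
with conn.cursor() as cur:
    # External table pointing at files served by `gpfdist -d /data -p 8081` on an ETL box.
    cur.execute("""
        CREATE EXTERNAL TABLE ext_sales (
            sale_id bigint, customer_id bigint, amount numeric(10,2), sale_date date
        )
        LOCATION ('gpfdist://etl-host.example.com:8081/sales_*.csv')
        FORMAT 'CSV' (HEADER);
    """)
    # Segments fetch rows from gpfdist directly; the master coordinates rather than funnels data.
    cur.execute("INSERT INTO sales SELECT * FROM ext_sales;")
conn.close()
```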
In terms of what it's good at: really big queries on really big data sets. Note in particular the integration with the Apache MADlib (incubating) machine learning toolkit[2], which was opensourced and donated to the ASF earlier this year.
I guess I was too subtle in my comment; what I mean is, this looks like it's geared towards long, periodic bulk loads and not granular OLTP/webapp-type workloads where lots of clients write small transactions.
We're using ElasticSearch for analytics, and being able to write to any node is really nice. We stream events in real time, and only do bulk loads when we need to change the schema or reprocess the data.
But ES has its share of problems, and something like Greenplum would be preferable, if it can handle the write pattern.
I used greenplum until about three years ago when that employer replaced it. There is no way to describe it other than brittle and unsuited for production work. My peers discovered multiple ways to kill the db, all with data loss. Before you use it you should find a current user. It is one of a handful of techs on my personal "never again" list.
Licensing was also ludicrously expensive, including an attempt to multiply our license cost by 10 one year. I'd be very leery of lock-in given the owners. I also suspect this is the result of a failed attempt to sell it and is an open-source hail mary.
The owner is now Pivotal (it used to belong to EMC). Almost all our product code is opensource. The culture comes from Pivotal Labs' DNA, and Labs is a resolutely opensource shop.
> I also suspect this is the result of a failed attempt to sell it and is an open-source hail mary.
we ported Dremel to our heavily customized Hadoop (it is, even now, at least 10x more performant than the public Hadoop code, drawing from the experience of running on 5k+ physical nodes; plus Yahoo is where software engineering goes to die)
we also ended up using Spark
While neither of the above is as convenient as Greenplum could be, they also didn't have 2+ day outages every other month while hundreds of TB of data had to be regenerated and reloaded for the reporting powered by Greenplum. They also didn't cost the millions of dollars a year that EMC wanted. Overall, I think we were very happy to move off GP.
Gpfdist is really just a binary format over HTTP, so it does handle streaming use cases well.
Spring XD implemented a Java reactive version of the GPFdist protocol, so you could use that to wire it up to your existing app pretty easily (or just copy the Sink code which is pretty straightforward).
I've attempted to load huge amounts of data into ElasticSearch. It is a bit fiddly to disable indexing, and the raw write performance is poor compared with dedicated databases. It is a testament to how good ElasticSearch is that people do use it as a database.
ES performs well, but we don't use it as a primary data source, and I would advise heavily against using it as one.
Things are getting better with the new transaction log, but it's still extremely prone to falling over and to getting consistency issues, and some of the built-in automation can bite you in the ass. Recently we had a node — fortunately an expendable testing box — run out of disk space. ES stopped working, which is fine, but when we restarted it, several of the shards were corrupt, and there was no way to get it working again without dropping the index and recreating it.
In another instance, we started getting client downtime because ES had suddenly detected low disk space on one node (its "high watermark" setting is extremely conservative and not based on a statistical regression of when it's going to run out), and had started relocating shards. When it was done, one of the shards was stuck in "RELOCATING" but marked as done; only restarting ES fixed that issue. When we freed up some disk space, it gave us more downtime when it decided to move the shards back. In theory, ES was behaving correctly, but the automation behaviour that kicked in was unnecessary and not desirable in that use case.
ES has lots of good parts, but it also has a surprising number of bad parts — ugly stuff you just won't find in Postgres, for example.
My issues with ES are entirely about ergonomics. My pet peeve is the mapping system. ES is often described as schemaless, but this is of course untrue; ES has a schema, and a very strict one, but its default mode is to automatically learn the schema from the data you feed it.
By learn I mean that the first time it receives a document, it will guess the shape of the data and create a schema from it. You can't ever change mappings, only add new ones, which means that running in "dynamic" mode can (and, mostly, will) result in bad mappings.
This comes from three things: (1) ES' guesswork is by definition incomplete with regard to data types; (2) ES can't, by definition, guess things like text analysis settings; and (3) sometimes the input data is just inconsistent with itself.
(2) means that even if your data is completely consistent, if you submit a text field, it will get the default index settings. (1) means that, for example, if you have something you think is a date, which ES doesn't recognize as a date, then it will be assumed to be a string. Dates are among ES' weakest areas (though 2.x makes some improvements here).
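A quick way to see the guessing in action (this assumes a local ES 1.x/2.x node; the index, type and field names are made up):

```python
# Demonstrates ES guessing a mapping from the first document it sees.
# Assumes a local Elasticsearch 1.x/2.x node; index/type/field names are invented.
import json
import requests

ES = "http://localhost:9200"

# No explicit mapping: ES infers one from this first document.
requests.put(ES + "/events/event/1", data=json.dumps({
    "title": "Launch day",
    "happened_at": "05/11/2015",  # not a format ES auto-detects as a date
}))

# The guessed mapping: "happened_at" comes back as an analyzed string, not a date,
# and "title" got the default analyzer. Neither can be changed in place afterwards.
print(json.dumps(requests.get(ES + "/events/_mapping").json(), indent=2))
```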
You can mitigate some of these problems with the "_default_" mapping, which can catch fields and assign defaults based on name patterns. But it's not a complete solution.
Another problem is that for historical reasons, ES indexes share field names across all document types, and don't allow conflicting mappings. For example, if you have "articles" and "sections" and both have a "title" field, ES requires that the mapping be the same — because internally, they are stored in the same Lucene index. This is partly a feature (it allows you to search across all types on the field "title" and guarantee that it will work consistently) but really a principle-of-least-surprise-violating misfeature (you can accomplish the same thing with a synthetic field and "copy_to", which is more explicit and less surprising).
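For reference, the copy_to version looks roughly like this (index, type and field names are invented; assumes ES 1.x/2.x string mappings):

```python
# Sketch of the copy_to approach: each type keeps its own "title", and both copy
# their values into a synthetic "any_title" field that can be searched across types.
# Index, type and field names are invented.
import json
import requests

body = {
    "mappings": {
        "articles": {"properties": {"title": {"type": "string", "copy_to": "any_title"}}},
        "sections": {"properties": {"title": {"type": "string", "copy_to": "any_title"}}},
    }
}
requests.put("http://localhost:9200/content", data=json.dumps(body))
```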
Another problem: Posting a conflicting mapping to ES does nothing; it will silently ignore it. There's no way to determine if your own mapping changes will have any effect or not. Here's why it's a problem: Say you made some changes to your app. You have an ES index in production. Now you want to determine if your new mappings can be applied without an index migration, or whether they will conflict. Turns out you can't. Some mapping properties (e.g. name, type, analyzer, store flag) cannot be changed and should conflict. Some don't. Since there's no set of rules encoded anywhere about what conflicts and what doesn't, you can't programmatically diff your mappings.
This is exacerbated by the fact that ES will "collapse" mapping properties that equal the default. For example, let's say you post a mapping with the properties {"type": "object", "store": true}. If you read that back afterwards, it will be {"type": "object"}. Why? Because "store" is true by default. So if you try to "diff" a mapping, it will look like ES has different mappings than you have. ES tends to prefer implicit over explicit, which is a bad, bad design philosophy.
You can only give up and assume that anything that isn't an add (a new document type mapping) will require a migration. Which leads to the next problem: ES can't migrate indexes. You're totally on your own. Which is ironic, because by default ES stores the original source (the "_source" field) of every document you index. There's no reason ES couldn't create a whole new index, based on new mappings, from a different index. But no, you have to write a client and run the data from ES into the client and then back again, just to do a mapping migration.
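In practice the client-side dance looks roughly like this (index names and mappings are invented; assumes the elasticsearch-py client, whose reindex helper does the scan/scroll and bulk writes for you):

```python
# Rough sketch of a client-side "mapping migration": create a new index with the
# corrected mappings, then pump every document out of the old index and back in.
# Index names and mappings are invented; assumes the elasticsearch-py client.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex

es = Elasticsearch()  # defaults to localhost:9200

es.indices.create(index="events_v2", body={
    "mappings": {"event": {"properties": {"happened_at": {"type": "date"}}}}
})
# Streams documents out via scan/scroll and bulk-indexes them into events_v2.
reindex(es, source_index="events_v1", target_index="events_v2")
# Afterwards, repoint an alias (or the app) at events_v2 and drop events_v1.
```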
If I were to summarize this essay, it's that there's an icky impedance mismatch between mappings, logical ES indexes and physical Lucene indexes that has existed since the beginning, and whose consequences are increasingly felt. Lucene is used both for indexing and for document storage, and the distinction isn't adequately abstracted.
Here's an example of where an abstraction could have helped immensely. In ES, unlike relational databases, there's no such thing as "creating an index on a field". A field either has an index or it doesn't, and field names map directly to Lucene field names. There's no indirection. So if you set up a field "name", that also creates a Lucene inverted index called "name", and because the field is now tied to a physical index encoding (e.g. tokenized strings), it can never be changed. You can't create more indexes on the same field, and you can't drop the index or change it.
The better solution would be to map it to an internal field name, and let each field support multiple indexes. For example: Say I have a field "name". Sometimes I want to search as tokenized text. Sometimes I want to treat the whole thing as a single token (e.g. when aggregating into buckets). If there were an indirection between document fields and indexes, I could accomplish this by creating two indexes on "name". ES could figure out which index to use from the query operators I used, and it could also offer a way to explicitly specify which index to use.
This type of indirection would solve the mapping strictness because ES would no longer be tied to a single type representation for every value.
I have other complaints about ES, but fortunately, some of them are being resolved in 2.0 (beta) and 2.1 — they are removing tons of cruft in the APIs, and seem to have realized that a lot of the design choices (especially the ones that promote implicit over explicit information) were a bad idea.
It's been a long time coming for this database niche to reach open source; it's the tech behind the analytical prowess of countless companies (before and after Hadoop). Teradata, the first massively parallel database, was released in 1984.
Hmm; this might make it more attractive to investigate Chorus, the Rails-based analytics application that Pivotal also open-sourced (https://github.com/Chorus/chorus).
It was always awkward to figure out what components you needed to build a development environment for it. It depended on having a virtual machine with a specific version of Greenplum Database running in it, but that version of the database wasn't easy to find. An open source Greenplum Database might make it easier to get started.
What's the story behind Greenplum? Is it an old startup that was bought by Pivotal some time ago? I found an article about Greenplum raising $20 million in 2008 [1]
I was wondering whether anyone has used Greenplum in production and with what kind of loads. With today's in-memory fad, I am also interested in whether the GP model supports loading everything into memory in a MOLAP fashion.
This is exciting news. Greenplum is one of the few Postgres-based MPP databases that I've had good experiences with.
The experience and tuning in the MPP query-plan re-writer might be helpful in rolling parallel operations support into the PostgreSQL base. Not necessarily the "how", since the GP strategy won't necessarily mesh, but the "what".
I am looking for a new reporting solution for my company, but it seems you guys have a different set of users in mind. Still, your dashboard-building solution looks great; I wish the BI suites out there were that easy.
That's probably why it was open-sourced. It'd take a huge effort to 'move' something as entrenched as Greenplum onto PostgreSQL 9.x, and the fact that so many features and enhancements are missing from 8.2 puts off potential users.
What do you mean blackbox? It has explain analyze, it has rich query meta info, it has a web interface to query stats. You even know on what hardware it runs. Genuinely not sure what's blackbox about Redshift.