One of the things that is continually great about PostgreSQL is how the team consistently delivers a solid v1 of new features: reliable, well engineered, and semantically sane. I'm sure partitioning will be another example; in a few years you'll have all the bells and whistles as well as firm footing in solid RDBMS engineering.
As a nit, Postgres has had inheritance-based partitioning for some time, but this feature adds much more control. Not sure if that was your meaning or not.
As a long time user of MySQL, I've switched to PostgreSQL for my most recent startup precisely because the PostgreSQL team continue to innovate in a sensible way.
One of the core requirements is a stable, fast, and reliable initial import of existing data from third-party services; throttling concerns led us to a batched, retryable solution, and with it a requirement for simple, fast job control.
We've built this around a parallelised solution using the FOR UPDATE/SKIP LOCKED feature from PostgreSQL 9.5 onwards.
It's not bleeding edge, but for us it was a great example of how the pg team pick up the useful features from e.g. Oracle, and incorporate them into the product in a solid way, without all the enterprisey marketing crap that Oracle make you swallow before you can understand the actual feature-set.
[the startup is a SaaS search tool that helps you index across Trello, email, GitHub, Slack, Drive - https://getctx.io if you're interested.]
FOR UPDATE ... SKIP LOCKED solved a problem for us recently as well. It's a great feature for an ad hoc sort of queue situation without needing a dedicated queue / workers.
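For anyone curious, a minimal sketch of the claim-a-job pattern (table and column names here are made up, not from either of our systems):

    -- Hypothetical job table for illustration.
    CREATE TABLE import_jobs (
        id         bigserial PRIMARY KEY,
        payload    jsonb NOT NULL,
        state      text  NOT NULL DEFAULT 'pending',
        updated_at timestamptz NOT NULL DEFAULT now()
    );

    -- Each worker claims one pending job; rows already locked by another worker
    -- are skipped, so concurrent workers neither block nor double-process.
    UPDATE import_jobs
    SET    state = 'running', updated_at = now()
    WHERE  id = (
        SELECT id
        FROM   import_jobs
        WHERE  state = 'pending'
        ORDER  BY id
        LIMIT  1
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload;

If a worker dies, flipping its row back to 'pending' makes the job retryable, which is roughly the batched/retryable setup described above.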
It might be a good choice to switch to postgres, but it isn't like MySQL isn't innovating too. 8.0 will have a ton of features. This includes a bunch of "catch up" things like CTEs and SKIP LOCKED, but also stuff like synchronous replication built in.
I know the defaults have been changed to be safe, but I will forever feel like my trust was broken with MySQL the day I learned how bad the defaults are. Not just inefficient/generic, but actually bad.
Additionally, I feel that PostgreSQL is better engineered, more featureful, and has support for a much larger variety of use-cases (e.g. hstore, JSON, PostGIS, pgRouting, UUIDs, trigram indices, &c). So between the engineering perks and the loss of trust, I only use MySQL in legacy applications that I cannot port.
For a product like yours, don't you think a NoSQL solution would be a better fit? If yes, why didn't you opt for it? If not, then why (especially considering that your schemas are always in a state of flux)?
PS: This is under the assumption that you have used Postgres for storing all the data from different sources.
Not OP, but having structure to a datastore is important. It decreases development and maintenance time and doesn't let vague bugs fester for years. NoSQL is a premature optimization that many people make without understanding the downsides.
Completely agree. But in the above case, the structure of the third-party service APIs will always be changing, and each change may require a migration. I worked on a similar problem in my last stint and used Postgres; there I opted for predefined fields and handled the extra logic in application code. In the above case, I just wanted to know if there is a better way to solve the problem in Postgres.
I get a lot of mileage out of using real columns for the data with fixed structure, and Postgres JSONB columns for the parts of the data whose structure varies.
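A rough sketch of that shape, with illustrative names only:

    -- Fixed, relational fields as real columns; per-source fields in JSONB.
    CREATE TABLE documents (
        id         bigserial PRIMARY KEY,
        account_id bigint      NOT NULL,
        source     text        NOT NULL,   -- e.g. 'trello', 'slack', 'github'
        created_at timestamptz NOT NULL DEFAULT now(),
        attrs      jsonb       NOT NULL DEFAULT '{}'
    );

    -- A GIN index keeps containment queries on the variable part fast.
    CREATE INDEX documents_attrs_idx ON documents USING gin (attrs);

    -- Queries can mix the structured columns with the schemaless part.
    SELECT id, source, attrs->>'title'
    FROM   documents
    WHERE  account_id = 42
    AND    attrs @> '{"archived": false}';

When an upstream API adds a field, it just lands in attrs; only the fields you rely on relationally need a migration.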
Hi devj,
You're absolutely correct that an RDBMS would be an odd solution for a search index *
Anyway, on CTX I use PostgreSQL for storing things that deserve to be relational - job control, batching, users, accounts, billing, invitation codes...
All the indexed content is in an Elasticsearch cluster (http://elastic.co if you're unfamiliar - it's a specialised search-indexing data store built atop Apache Lucene).
* though actually PostgreSQL has a really good capability as a JSON store that means people do sometimes use it as a NoSQL solution
> Its text search solutions come close to ElasticSearch, with - surprisingly - even better performance.
PG search may be good enough for certain use cases, but it doesn't come close to the power of Lucene/Solr/ES. PG only recently added support for phrase search, which Lucene has had for years. Lucene has an extremely flexible analysis pipeline, BM25, "more like this" queries, simple custom ranking, great language support, "did you mean?", autocomplete, etc. ES in particular is built from the ground up as a clustered solution.
As I said PG may be good enough for some use cases, but to claim it is as good as ES is laughable. And PG certainly doesn't make sense if your product is literally a search product.
All the things you mentioned can be done in PGSQL with similar performance, actually.
I had to build a search product in PGSQL, and I've done exactly that (and compared it with ES, Solr and Lucene).
For datasets below 50GB the performance is basically the same, you get about the same features, and it works well enough.
But you are right, as soon as you want to build more complicated search products, or as soon as you get larger amounts of data, you want a dedicated solution.
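For reference, a minimal sketch of the Postgres side of that comparison (illustrative table/column names, not from a real system):

    CREATE TABLE articles (
        id   bigserial PRIMARY KEY,
        body text NOT NULL
    );

    -- Expression index over the stemmed document (the 'english' config does the stemming).
    CREATE INDEX articles_fts_idx
        ON articles USING gin (to_tsvector('english', body));

    -- Phrase search (9.6+) plus ranking.
    SELECT id,
           ts_rank(to_tsvector('english', body),
                   phraseto_tsquery('english', 'sensible defaults')) AS rank
    FROM   articles
    WHERE  to_tsvector('english', body) @@ phraseto_tsquery('english', 'sensible defaults')
    ORDER  BY rank DESC
    LIMIT  20;

Ranking here is ts_rank rather than BM25, and things like "did you mean?" or faceting need extra work, which is where the gap to ES shows up.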
tbh I haven't looked into PostgreSQL as a direct competitor to ES in the text search space, mostly because I have quite a lot of ES experience of things like aggregation for faceting, stemming and textual analysis, and I had a (perhaps incorrect) assumption that PostgreSQL was less capable in those more specialised areas.
One thing I would like to understand is how the PostgreSQL search experience scales with data volume.
My main cluster for CTX will potentially get (very) big and I don't know enough about PostgreSQL, multi-master and full-text search across large amounts of data.
> tbh I haven't looked into PostgreSQL as a direct competitor to ES in the text search space, mostly because I have quite a lot of ES experience of things like aggregation for faceting, stemming and textual analysis, and I had a (perhaps incorrect) assumption that PostgreSQL was less capable in those more specialised areas.
I think your assumption is/was correct. And I'm a postgres dev, so I'm biased as hell...
PostgreSQL is very capable at stemming and textual analysis, but as you likely know, the issues start with large amounts of data, because then you suddenly need multiple replicas.
You might want to look at https://github.com/zombodb/zombodb. It extends PG to use ES as an external source for indexes, so you can do an ES-based text search and return data stored in PG, along with doing joins to other tables. ZomboDB even works with Citus to handle PG-level sharding.
Thanks @rorygibson. Yes, that design will definitely work. Just one doubt: Are you syncing data between ES and PSQL for common entities like users? If yes, how are you doing it?
The direction being taken with data partitioning is great. I truly believe this will solve the issues many companies have when it comes to "big data"... This, to me, is the middle ground we have been missing between RDBMS and NoSQL.
Congrats to the postgres team, what an interesting product.
On the other hand, PostgreSQL has had some notion of partitioning essentially since its inception. Ten years ago, PostgreSQL's rules- and table-inheritance-based partitioning was probably the only DBMS-integrated solution that was actually usable.
This looks very good but after using Cassandra I'd be scared of bumping into features that are not supported or suddenly very slow if used. For example I'm sure sorted aggregations and group by queries across partitions will take a long time.
I think there should be a list of Postgres partitioning gotchas somewhere to accompany this, e.g. partitioning goes hand in hand with denormalisation (due to the effects of the chosen partition key being so important). With Cassandra I can simply add more machines as I increase the data duplication, I'm not sure how easy it is to rebalance data when adding new Postgres instances?
I'm not sure what you mean by "adding new Postgres instances." Postgres is not a distributed database, at least not on its own. There are ways to scale using e.g. hot standbys (streaming replicas for read-only queries) and multiple projects that implement distributed databases on top of Postgres (Citus, Postgres-XL, ...). But then it depends mostly on that project, not on Postgres.
Sure, partitioning does negatively affect some queries. But the thing is, it's a significant improvement compared to the previous partitioning implementation, which gave the planner almost no insight into the partitioning rules, so the optimizer could not really do advanced tricks (e.g. partition-wise joins), etc.
I'm not sure what the point of partitioning would be if the data isn't distributed between nodes? I'm probably not understanding.
All I'm saying is I can go and read about the pain points in say Cassandra for this type of stuff because of where it comes from and it tries to protect you from doing things that are slow. With Postgres I think it's going to be harder to know when I can't use a feature that's suddenly going to kill my performance.
It would be great if there was a resource for how to avoid such issues in partitioned Postgres but it'd take a lot of work to write such a guide - I'm guessing things like JSONB queries, Geo, Transactions, Aggregations across partitions and many extensions are all going to cause pain points?
Partitioning means your data can be spread across more physical areas (or FS partitions, or whatever) at a lower level, giving the database (and you) more control over performance by optimising what gets stored where.
- Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
- When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
- Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE.
- Seldom-used data can be migrated to cheaper and slower storage media.
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
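In the v10 declarative syntax, the bulk-delete and cheap-storage points from that list look roughly like this (illustrative names; DETACH PARTITION replaces the old NO INHERIT trick):

    CREATE TABLE measurements (
        logged_at timestamptz NOT NULL,
        reading   numeric     NOT NULL
    ) PARTITION BY RANGE (logged_at);

    CREATE TABLE measurements_2016 PARTITION OF measurements
        FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
    CREATE TABLE measurements_2017 PARTITION OF measurements
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

    -- "Bulk delete" a whole year by detaching and dropping its partition,
    -- instead of running a huge DELETE and paying for the VACUUM afterwards.
    ALTER TABLE measurements DETACH PARTITION measurements_2016;
    DROP TABLE measurements_2016;

    -- Seldom-used data onto cheaper storage: move an old partition to another
    -- tablespace (assumes a 'slow_disks' tablespace has been created).
    ALTER TABLE measurements_2017 SET TABLESPACE slow_disks;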
Okay - so the partitioning gives better per machine performance by allowing you to parallelise certain things and spread out on disk. I thought this was similar to Cassandra's ring based partitioning that allows data to be distributed around many machines. Thanks for clarifying.
That will eventually be coming in some form. In the OP link, there was a mention of foreign tables participating in partitioning:
> While foreign tables can be partitions in v10, they are currently subject to an unfortunate limitation: tuple inserts that route to a foreign table will fail, since this feature has not been implemented yet. Etsuro Fujita (NTT) is working to address this limitation by revising code originally written by Amit Langote.
So there is work on getting this to scale on multiple machines, rather than just single machine. We're just not there yet.
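To sketch what v10 already allows (hypothetical server/table names; reads and joins work, but as the quote says, inserts routed to the foreign partition fail):

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER archive_node
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'archive.internal', dbname 'app');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER archive_node
        OPTIONS (user 'app', password 'secret');

    CREATE TABLE metrics (
        logged_at timestamptz NOT NULL,
        value     double precision
    ) PARTITION BY RANGE (logged_at);

    -- Old data lives on the remote node; assumes a matching table exists there.
    CREATE FOREIGN TABLE metrics_2016 PARTITION OF metrics
        FOR VALUES FROM ('2016-01-01') TO ('2017-01-01')
        SERVER archive_node
        OPTIONS (table_name 'metrics_2016');

    -- Current data stays local.
    CREATE TABLE metrics_2017 PARTITION OF metrics
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');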
Partitioning is not about distribution, it's about carving up the data to improve performance.
For example, let's say you store a table "events", where each event has a timestamp. You could partition this by day, for example. Every day would get its own physical table. If you have a year's worth of data and have a query that only needs data from a single day, it would only need to look at 1/365 of the complete dataset. Both tables and indexes would only need to capture the subset of data for each partition. Row selectivity tends to become problematic for large datasets.
You can of course do this stuff manually (create one table per day, make sure you insert into the right table, always select from the right tables based on which dates you're looking at), and people do this. What partitioning brings to the table is automation; the partitions are still "normal" tables, but Postgres manages them for you. You just define the partitioning keys/functions, and Postgres handles both insertion (you insert into the "parent" table) and querying (you just include your partitioning column in your query, and it will figure out which tables need to be accessed).
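A minimal sketch of that, in the v10 declarative syntax (names made up):

    CREATE TABLE events (
        occurred_at timestamptz NOT NULL,
        kind        text        NOT NULL,
        payload     jsonb
    ) PARTITION BY RANGE (occurred_at);

    CREATE TABLE events_2017_06_26 PARTITION OF events
        FOR VALUES FROM ('2017-06-26') TO ('2017-06-27');
    CREATE TABLE events_2017_06_27 PARTITION OF events
        FOR VALUES FROM ('2017-06-27') TO ('2017-06-28');

    -- Insert into the parent; Postgres routes each row to the right partition.
    INSERT INTO events VALUES ('2017-06-26 10:15+00', 'login', '{"user": 1}');

    -- A predicate on the partition key means only the matching partition is scanned.
    SELECT count(*)
    FROM   events
    WHERE  occurred_at >= '2017-06-26' AND occurred_at < '2017-06-27';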
There are solutions (such as Citus) that allow you to partition across multiple Postgres nodes.
This is a brilliant example and explanation. I can now see why you would want to partition when you have lots and lots of data, even without multiple machines. I know more about using Cassandra or Riak for this stuff than an SQL database.
Hopefully the work on better indexing of partitioned tables will also apply to inherited tables. I regularly work with tables with over 1000 child tables, doing single-value lookups which would be trivial in an index. But having to query 1000 indexes makes the inheritance useless, and we have to jump through hoops to query the right child table instead.
That this is a serious limitation has been recognized for many releases. But the limitation is still there.
> commercial RDBMS comes packaged with a lot more than just the db engine
It may be a strength of Postgres and not a weakness: they are focused on one specific niche, which actually covers 95% of business use cases, and they try to do it very well.
For all other cases, there are other products. Most of the data science world talks R and SQL nowadays, not MDX.
While Postgres is indeed fantastic, Oracle and Microsoft SQL Server are both ahead of Postgres in many areas. For example, the feature that this article is talking about — table partitioning — has been in Oracle for at least 20 years.
Oracle's replication support is also extensive, and supports many scenarios such as multimaster, not to mention sophisticated solutions such as RAC and Dataguard. Then there's materialized views, flashback queries, clustered tables (this is manual in Postgres), parallel queries, etc.
So Postgres is still mostly catching up to other databases. It still has features others don't, of course, such as transactional DDL statements. And it's open source.
It's catching up fast, which was the point :) If at present it's suitable for 85% (or whatever the proper estimate is) of use cases, then at this rate it will reach 99% in the not-so-distant future.
I would say it's catching up extremely slowly, but that it's also way past the "sweet spot" where it's as good as the commercial offerings for the stuff people use databases for.
Well, it's pretty much all about support and integration at some point. Postgres is like MySQL: they have integrations with open-source apps but nothing in the enterprise world.
In my enterprise niche (logistics) it seems that PostgreSQL is the go-to solution for essentially every new project (and surprisingly often it is proposed by the customers themselves).