One of the things that is continually great about PostgreSQL is how the team consistently delivers a solid v1 of new features: reliable, well engineered, and semantically sane. I'm sure partitioning will be another example; in a few years you'll have all the bells and whistles as well as firm footing in solid RDBMS engineering.
As a nit, Postgres has had inheritance-based partitioning for some time, but this feature adds much more control. Not sure if that was your meaning or not.
As a long time user of MySQL, I've switched to PostgreSQL for my most recent startup precisely because the PostgreSQL team continue to innovate in a sensible way.
One of the core requirements is a stable, fast, and reliable initial import of existing data from third-party services; throttling concerns led us to a batched, retryable solution, and with it a requirement for simple, fast job control.
We've built this around a parallelised solution using the FOR UPDATE/SKIP LOCKED feature from PostgreSQL 9.5 onwards.
It's not bleeding edge, but for us it was a great example of how the pg team pick up the useful features from e.g. Oracle, and incorporate them into the product in a solid way, without all the enterprisey marketing crap that Oracle make you swallow before you can understand the actual feature-set.
[the startup is a SaaS search tool that helps you index across Trello, email, GitHub, Slack, Drive - https://getctx.io if you're interested.]
FOR UPDATE ... SKIP LOCKED solved a problem for us recently as well. It's a great feature for an ad hoc sort of queue situation without needing a dedicated queue / workers.
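For anyone curious, a minimal sketch of the claim-a-job pattern (table and column names here are made up, not from either of our systems):

    -- Hypothetical job table for illustration.
    CREATE TABLE import_jobs (
        id         bigserial PRIMARY KEY,
        payload    jsonb NOT NULL,
        state      text  NOT NULL DEFAULT 'pending',
        updated_at timestamptz NOT NULL DEFAULT now()
    );

    -- Each worker claims one pending job; rows already locked by another worker
    -- are skipped, so concurrent workers neither block nor double-process.
    UPDATE import_jobs
    SET    state = 'running', updated_at = now()
    WHERE  id = (
        SELECT id
        FROM   import_jobs
        WHERE  state = 'pending'
        ORDER  BY id
        LIMIT  1
        FOR UPDATE SKIP LOCKED
    )
    RETURNING id, payload;

If a worker dies, flipping its row back to 'pending' makes the job retryable, which is roughly the batched/retryable setup described above.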
It might be a good choice to switch to postgres, but it isn't like MySQL isn't innovating too. 8.0 will have a ton of features. This includes a bunch of "catch up" things like CTEs and SKIP LOCKED, but also stuff like synchronous replication built in.
I know the defaults have been changed to be safe, but I will forever feel like my trust was broken with MySQL the day I learned how bad the defaults are. Not just inefficient/generic, but actually bad.
Additionally, I feel that PostgreSQL is better engineered, more featureful, and has support for a much larger variety of use-cases (e.g. hstore, JSON, PostGIS, pgRouting, UUIDs, trigram indices, &c). So between the engineering perks and the loss of trust, I only use MySQL in legacy applications that I cannot port.
For a product like yours, don't you think a NoSQL solution would be a better fit? If yes, why didn't you opt for it? If not, then why (especially considering that your schemas are always in a state of flux)?
PS: This is under the assumption that you have used Postgres for storing all the data from different sources.
Not OP, but having structure to a datastore is important. It decreases development and maintenance time and doesn't let vague bugs fester for years. NoSQL is a premature optimization that many people make without understanding the downsides.
Completely agree. But in the above case, the structure of the third-party service APIs will always be changing, and each change may require a migration. I worked on a similar problem in my last stint and used Postgres; there I opted for predefined fields and handled the extra logic in application code. In the above case, I just wanted to know if there is a better way to solve the problem in Postgres.
I get a lot of mileage out of using real columns for the data with fixed structure, and Postgres JSONB columns for the parts of the data whose structure varies.
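A rough sketch of that shape, with illustrative names only:

    -- Fixed, relational fields as real columns; per-source fields in JSONB.
    CREATE TABLE documents (
        id         bigserial PRIMARY KEY,
        account_id bigint      NOT NULL,
        source     text        NOT NULL,   -- e.g. 'trello', 'slack', 'github'
        created_at timestamptz NOT NULL DEFAULT now(),
        attrs      jsonb       NOT NULL DEFAULT '{}'
    );

    -- A GIN index keeps containment queries on the variable part fast.
    CREATE INDEX documents_attrs_idx ON documents USING gin (attrs);

    -- Queries can mix the structured columns with the schemaless part.
    SELECT id, source, attrs->>'title'
    FROM   documents
    WHERE  account_id = 42
    AND    attrs @> '{"archived": false}';

When an upstream API adds a field, it just lands in attrs; only the fields you rely on relationally need a migration.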
Hi devj,
You're absolutely correct that an RDBMS would be an odd solution for a search index *
Anyway, on CTX I use PostgreSQL for storing things that deserve to be relational - job control, batching, users, accounts, billing, invitation codes...
All the indexed content is in an Elasticsearch cluster (http://elastic.co if you're unfamiliar - it's a specialised search-indexing data store built atop Apache Lucene).
* though actually PostgreSQL has a really good capability as a JSON store that means people do sometimes use it as a NoSQL solution
> Its text search solutions come close to ElasticSearch, with - surprisingly - even better performance.
PG search may be good enough for certain use cases, but it doesn't come close to the power of Lucene/Solr/ES. PG only recently added support for phrase search, which Lucene has had for years. Lucene has an extremely flexible analysis pipeline, BM25, "more like this" queries, simple custom ranking, great language support, "did you mean?", autocomplete, etc. ES in particular is built from the ground up as a clustered solution.
As I said PG may be good enough for some use cases, but to claim it is as good as ES is laughable. And PG certainly doesn't make sense if your product is literally a search product.
All the things you mentioned can be done in PGSQL with similar performance, actually.
I had to build a search product in PGSQL, and I've done exactly that (and compared it with ES, Solr and Lucene).
For datasets below 50GB the performance is basically the same, you get about the same features, and it works well enough.
But you are right, as soon as you want to build more complicated search products, or as soon as you get larger amounts of data, you want a dedicated solution.
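For reference, a minimal sketch of the Postgres side of that comparison (illustrative table/column names, not from a real system):

    CREATE TABLE articles (
        id   bigserial PRIMARY KEY,
        body text NOT NULL
    );

    -- Expression index over the stemmed document (the 'english' config does the stemming).
    CREATE INDEX articles_fts_idx
        ON articles USING gin (to_tsvector('english', body));

    -- Phrase search (9.6+) plus ranking.
    SELECT id,
           ts_rank(to_tsvector('english', body),
                   phraseto_tsquery('english', 'sensible defaults')) AS rank
    FROM   articles
    WHERE  to_tsvector('english', body) @@ phraseto_tsquery('english', 'sensible defaults')
    ORDER  BY rank DESC
    LIMIT  20;

Ranking here is ts_rank rather than BM25, and things like "did you mean?" or faceting need extra work, which is where the gap to ES shows up.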
tbh I haven't looked into PostgreSQL as a direct competitor to ES in the text search space, mostly because I have quite a lot of ES experience of things like aggregation for faceting, stemming and textual analysis, and I had a (perhaps incorrect) assumption that PostgreSQL was less capable in those more specialised areas.
One thing I would like to understand is how the PostgreSQL search experience scales with data volume.
My main cluster for CTX will potentially get (very) big and I don't know enough about PostgreSQL, multi-master and full-text search across large amounts of data.
> tbh I haven't looked into PostgreSQL as a direct competitor to ES in the text search space, mostly because I have quite a lot of ES experience of things like aggregation for faceting, stemming and textual analysis, and I had a (perhaps incorrect) assumption that PostgreSQL was less capable in those more specialised areas.
I think your assumption is/was correct. And I'm a postgres dev, so I'm biased as hell...
PostgreSQL is very capable at stemming and textual analysis, but as you likely know, the issues start with large amounts of data, because then you suddenly need multiple replicas.
You might want to look at https://github.com/zombodb/zombodb. It extends PG to use ES as an external source for indexes, so you can do an ES-based text search and return data stored in PG, along with doing joins to other tables. ZomboDB even works with Citus to handle PG-level sharding.
Thanks @rorygibson. Yes, that design will definitely work. Just one doubt: Are you syncing data between ES and PSQL for common entities like users? If yes, how are you doing it?
The direction being taken with data partitioning is great. I truly believe this will solve the issues many companies have when it comes to "big data"... This, to me, is the middle ground we have been missing between RDBMS and NoSQL.
Congrats to the postgres team, what an interesting product.
On the other hand, PostgreSQL has had some notion of partitioning essentially since its inception. Ten years ago, PostgreSQL's rules- and table-inheritance-based partitioning was probably the only DBMS-integrated solution that was actually usable.
This looks very good but after using Cassandra I'd be scared of bumping into features that are not supported or suddenly very slow if used. For example I'm sure sorted aggregations and group by queries across partitions will take a long time.
I think there should be a list of Postgres partitioning gotchas somewhere to accompany this, e.g. partitioning goes hand in hand with denormalisation (due to the effects of the chosen partition key being so important). With Cassandra I can simply add more machines as I increase the data duplication, I'm not sure how easy it is to rebalance data when adding new Postgres instances?
I'm not sure what you mean by "adding new Postgres instances." Postgres is not a distributed database, at least not on its own. There are ways to scale using e.g. hot standbys (streaming replicas for read-only queries) and multiple projects that implement distributed databases on top of Postgres (Citus, Postgres-XL, ...). But then it depends mostly on that project, not on Postgres.
Sure, partitioning does negatively affect some queries. But the thing is, it's a significant improvement compared to the previous partitioning implementation, which gave the planner almost no insight into the partitioning rules, so the optimizer could not really do advanced tricks (e.g. partition-wise joins), etc.
I'm not sure what the point of partitioning would be if the data isn't distributed between nodes? I'm probably not understanding.
All I'm saying is I can go and read about the pain points in say Cassandra for this type of stuff because of where it comes from and it tries to protect you from doing things that are slow. With Postgres I think it's going to be harder to know when I can't use a feature that's suddenly going to kill my performance.
It would be great if there was a resource for how to avoid such issues in partitioned Postgres but it'd take a lot of work to write such a guide - I'm guessing things like JSONB queries, Geo, Transactions, Aggregations across partitions and many extensions are all going to cause pain points?
Partitioning means your data can be spread across more physical areas (or FS partitions, or whatever) at a lower level, giving the database (and you) more control over performance by optimising what gets stored where.
- Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
- When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
- Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE.
- Seldom-used data can be migrated to cheaper and slower storage media.
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
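In the v10 declarative syntax, the bulk-delete and cheap-storage points from that list look roughly like this (illustrative names; DETACH PARTITION replaces the old NO INHERIT trick):

    CREATE TABLE measurements (
        logged_at timestamptz NOT NULL,
        reading   numeric     NOT NULL
    ) PARTITION BY RANGE (logged_at);

    CREATE TABLE measurements_2016 PARTITION OF measurements
        FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
    CREATE TABLE measurements_2017 PARTITION OF measurements
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

    -- "Bulk delete" a whole year by detaching and dropping its partition,
    -- instead of running a huge DELETE and paying for the VACUUM afterwards.
    ALTER TABLE measurements DETACH PARTITION measurements_2016;
    DROP TABLE measurements_2016;

    -- Seldom-used data onto cheaper storage: move an old partition to another
    -- tablespace (assumes a 'slow_disks' tablespace has been created).
    ALTER TABLE measurements_2017 SET TABLESPACE slow_disks;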
Okay - so the partitioning gives better per machine performance by allowing you to parallelise certain things and spread out on disk. I thought this was similar to Cassandra's ring based partitioning that allows data to be distributed around many machines. Thanks for clarifying.
That will eventually be coming in some form. In the OP link, there was a mention of foreign tables participating in partitioning:
> While foreign tables can be partitions in v10, they are currently subject to an unfortunate limitation: tuple inserts that route to a foreign table will fail, since this feature has not been implemented yet. Etsuro Fujita (NTT) is working to address this limitation by revising code originally written by Amit Langote.
So there is work on getting this to scale on multiple machines, rather than just single machine. We're just not there yet.
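To sketch what v10 already allows (hypothetical server/table names; reads and joins work, but as the quote says, inserts routed to the foreign partition fail):

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER archive_node
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'archive.internal', dbname 'app');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER archive_node
        OPTIONS (user 'app', password 'secret');

    CREATE TABLE metrics (
        logged_at timestamptz NOT NULL,
        value     double precision
    ) PARTITION BY RANGE (logged_at);

    -- Old data lives on the remote node; assumes a matching table exists there.
    CREATE FOREIGN TABLE metrics_2016 PARTITION OF metrics
        FOR VALUES FROM ('2016-01-01') TO ('2017-01-01')
        SERVER archive_node
        OPTIONS (table_name 'metrics_2016');

    -- Current data stays local.
    CREATE TABLE metrics_2017 PARTITION OF metrics
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');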
Partitioning is not about distribution, it's about carving up the data to improve performance.
For example, let's say you store a table "events", where each event has a timestamp. You could partition this by day, for example. Every day would get its own physical table. If you have a year's worth of data and have a query that only needs data from a single day, it would only need to look at 1/365 of the complete dataset. Both tables and indexes would only need to capture the subset of data for each partition. Row selectivity tends to become problematic for large datasets.
You can of course do this stuff manually (create one table per day, make sure you insert into the right table, always select from the right tables based on which dates you're looking at), and people do this. What partitioning brings to the table is automation; the partitions are still "normal" tables, but Postgres manages them for you. You just define the partitioning keys/functions, and Postgres handles both insertion (you insert into the "parent" table) and querying (you just include your partitioning column in your query, and it will figure out which tables need to be accessed).
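A minimal sketch of that, in the v10 declarative syntax (names made up):

    CREATE TABLE events (
        occurred_at timestamptz NOT NULL,
        kind        text        NOT NULL,
        payload     jsonb
    ) PARTITION BY RANGE (occurred_at);

    CREATE TABLE events_2017_06_26 PARTITION OF events
        FOR VALUES FROM ('2017-06-26') TO ('2017-06-27');
    CREATE TABLE events_2017_06_27 PARTITION OF events
        FOR VALUES FROM ('2017-06-27') TO ('2017-06-28');

    -- Insert into the parent; Postgres routes each row to the right partition.
    INSERT INTO events VALUES ('2017-06-26 10:15+00', 'login', '{"user": 1}');

    -- A predicate on the partition key means only the matching partition is scanned.
    SELECT count(*)
    FROM   events
    WHERE  occurred_at >= '2017-06-26' AND occurred_at < '2017-06-27';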
There are solutions (such as Citus) that allow you to partition across multiple Postgres nodes.
This is a brilliant example and explanation. I can now see why you would want to partition when you have lots and lots of data, even without multiple machines. I know more about using Cassandra or Riak for this stuff than an SQL database.
Hopefully the work on better indexing of partitioned tables will also apply to inherited tables. I regularly work with tables with over 1000 child tables, doing single-value lookups which would be trivial in an index. But having to query 1000 indexes makes the inheritance useless, and we have to jump through hoops to query the right child table instead.
That this is a serious limitation has been recognized for many releases. But the limitation is still there.
> commercial RDBMS comes packaged with a lot more than just the db engine
It may be a strength of Postgres and not a weakness: they are focused on one specific niche, which actually covers 95% of business use cases, and they try to do it very well.
For all other cases, there are other products. Most of the data science world talks R and SQL nowadays, not MDX.
While Postgres is indeed fantastic, Oracle and Microsoft SQL Server are both ahead of Postgres in many areas. For example, the feature that this article is talking about — table partitioning — has been in Oracle for at least 20 years.
Oracle's replication support is also extensive, and supports many scenarios such as multimaster, not to mention sophisticated solutions such as RAC and Dataguard. Then there's materialized views, flashback queries, clustered tables (this is manual in Postgres), parallel queries, etc.
So Postgres is still mostly catching up to other databases. It still has features others don't, of course, such as transactional DDL statements. And it's open source.
It's catching up fast, which was the point :) If at present it's suitable for 85% (or whatever the proper estimate is) of use cases, then at this rate it will reach 99% in the not-so-distant future.
I would say it's catching up extremely slowly, but that it's also way past the "sweet spot" where it's as good as the commercial offerings for the stuff people use databases for.
Well, it's pretty much all about support and integration at some point. Postgres is like MySQL: they have integrations with open-source apps but nothing in the enterprise world.
In my enterprise niche (logistics) it seems that PostgreSQL is the go-to solution for essentially every new project (and surprisingly often it is proposed by the customers themselves).