I look forward to trying this out and seeing how this architecture change works in practice. For large Postgres databases (tens of TB, in my experience), the old stats collector architecture has long been a reliable source of operational bugs with no real fix. If these changes start addressing those issues, that is a huge boon for people running large Postgres instances and will significantly improve its scalability story.
This could be a really important change, especially for people with large instances.
To give some anecdata: I have been bitten twice by going to a .0 release with Postgres, once by index corruption and once by some subqueries returning wrong results in some cases.
I have since decided to always wait for a .1 with Postgres before updating.
The good news is that the .1 comes only a few weeks after the initial release.
Now if only more people upgraded during the RC phase, then everybody could go straight to a .0 release.
And conversely, if everybody follows my (and your) advice, then .2 will be the new .1.
That was a nasty one, but my recommendation would still be to use `pg_repack` over REINDEX CONCURRENTLY: it also gets you concurrent clustering, and it was not affected by this bug.
I'm not downplaying the issue; index corruption is really bad. But I would wager that admins who need concurrent reindexing are also aware of `pg_repack` and would prefer it anyway because of the other benefits it provides.
This is probably why it took 6 months for the issue to be reported and fixed.
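For reference, this is roughly what the two approaches look like side by side; the database, table, and index names are placeholders, and the `pg_repack` flags are from memory, so check `pg_repack --help` before copying:

```python
import subprocess

import psycopg2

# Built-in approach (PostgreSQL 12+): rebuild a single index online.
# REINDEX CONCURRENTLY cannot run inside a transaction block, hence autocommit.
conn = psycopg2.connect("dbname=mydb")                  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("REINDEX INDEX CONCURRENTLY my_index")  # placeholder index name
conn.close()

# pg_repack approach: rebuilds the whole table and its indexes online,
# which is also where the concurrent clustering mentioned above comes from.
subprocess.run(["pg_repack", "--dbname=mydb", "--table=my_table"], check=True)
```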
pg_repack has some significant downsides in its implementation; I question whether it’s really a default over re-indexing concurrently. I’ve certainly not gotten that impression, and we maintain many very large Postgres clusters.
It was in the 2012/2013 time frame, so I can't find the relevant release note anymore, but it was reliably returning wrong results for a specific subquery pattern.
Not all subqueries were broken, but the broken pattern returned wrong results 100% of the time.
Index corruption shows the same symptoms, though of course the cause is different.
Yes, PostgreSQL needs more tests, but no, it is not really a coverage issue. Most of the serious bugs have been related to concurrency or other things which simple test coverage cannot find. Finding these bugs is usually not trivial.
That is exactly the kind of tool PostgreSQL needs more of. Some such tools exist, but more are needed. More plain old test coverage will not help much, if at all.
Major version jumps are always fun. Recently I discovered that RDS recommends pglogical over built-in logical replication for a reason: the latter doesn't work well in RDS when the replica logs grow larger than RAM.
I wasn't using Aurora at the time, and I don't expect non-Aurora RDS to be exactly the same as vanilla Postgres either. Still, it was surprising that the older solution (pglogical) was supported while the newer one wasn't yet, at least for my version jump.
So I was using their guide for minimal-downtime upgrades [0] (see option 'D'), replicating Pg 11 to 14. Built-in logical replication appeared to work in that case, yet big tables would never catch up or would appear to get truncated. If you're replicating among nodes at the same Pg version, built-in replication may work for you.
Once I switched to pglogical, jumping major versions worked as intended. It's possible that in the rush I used the wrong setting somewhere and built-in logical replication can work between major versions, but I lost enough time experimenting that I'll keep using pglogical until AWS officially recommends something else.
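For anyone weighing the two, the built-in publication/subscription setup looks roughly like this (hosts, names, and credentials are placeholders, and I'm driving it through psycopg2 just for illustration); pglogical uses its own `pglogical.create_node` / `pglogical.create_subscription` functions instead:

```python
import psycopg2

# On the source cluster: publish the tables to replicate.
src = psycopg2.connect("host=old-primary dbname=app")   # placeholder DSN
src.autocommit = True
with src.cursor() as cur:
    cur.execute("CREATE PUBLICATION upgrade_pub FOR ALL TABLES")
src.close()

# On the target cluster (the schema must already exist there): the
# subscription copies the initial data and then streams changes.
# CREATE SUBSCRIPTION cannot run inside a transaction block, hence autocommit.
dst = psycopg2.connect("host=new-primary dbname=app")   # placeholder DSN
dst.autocommit = True
with dst.cursor() as cur:
    cur.execute(
        "CREATE SUBSCRIPTION upgrade_sub "
        "CONNECTION 'host=old-primary dbname=app user=repl' "  # placeholder
        "PUBLICATION upgrade_pub"
    )
dst.close()
```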
We don’t have huge constant loads. Over-provisioned on massive bare metal so we can take a lot.
We also take a daily pg_basebackup, which does not eat too many resources; a standard pg_dump is impossible for us.
I'm about to add log shipping as well when I upgrade. We don't do super-critical things like banking or anything like that, but I do want to get to the point where, at worst, we only lose a few minutes of data.
Right now we can lose up to 24 hours of data if all the replicas die at once.
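For context, the daily backup boils down to a scheduled `pg_basebackup` call along these lines; the host, role, and backup path are placeholders:

```python
import datetime
import subprocess

# Hypothetical nightly job: take a compressed, tar-format base backup and
# stream the WAL needed to make it consistent on restore.
target = f"/backups/base-{datetime.date.today():%Y%m%d}"  # placeholder path
subprocess.run(
    [
        "pg_basebackup",
        "-h", "db-primary.internal",  # placeholder host
        "-U", "replicator",           # placeholder replication role
        "-D", target,
        "-F", "tar",                  # tar-format output
        "-z",                         # gzip compression
        "-X", "stream",               # stream WAL alongside the backup
        "-P",                         # progress reporting
    ],
    check=True,
)
```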
So that's 3 nodes total, one permanently the master and the other two permanently read-only slaves? Is there any kind of automatic failover if the master goes down, where one of the slaves automatically promotes itself?
Not necessary. As I said, we are not at all a critical service. If we are down an hour, it's not a big deal (like 95% of the rest of the inter-webs). We are not banksters.
If it goes down, we get a page, and we can have it switched by hand in a few minutes.
I guess if our service were super-critical I would do that, but since it's not, the three of us who work as developers and sysadmins can deal with it very quickly.
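For the curious, the promotion half of that manual switch can be as small as the sketch below (PostgreSQL 12+; the host is a placeholder), with the application repointed at the new primary afterwards:

```python
import psycopg2

# Connect to the replica that should take over and promote it to primary.
# pg_promote() waits up to 60 seconds by default for the promotion to finish.
conn = psycopg2.connect("host=replica-1.internal dbname=postgres")  # placeholder
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("SELECT pg_promote()")
conn.close()
```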
I'm sure Postgres is full of these inefficiencies and suboptimal system designs. The process model is known to be pretty horrible.
Considering the huge engineering teams SQL Server and Oracle have, I'm always amazed at how well Postgres works despite its tiny number of full-time developers.
Oracle is 25 million lines of code vs. PostgreSQL's 1.3 million.
Still, Oracle doesn't have all the basic isolation levels; only Read Committed works properly. And DDL operations are still not transactional. We should ask which design is really the inefficient one. The process model may not be as "efficient" as the thread model, but it is more "stable" and more "secure", and the benchmark results are not bad either.
Oracle also used a multi-process model on Linux in the past. It looks like the multi-threaded model became an optional mode at some point around Oracle 12.
To get around the inefficiencies of spawning many processes, I put pgbouncer in front of PostgreSQL.
Postgres forks off a new process for each connection.
This introduces a lot of overhead across the board:
* Cross-process communication is a lot more expensive (it's done via shared memory or, as here, via the file system).
* Switching between processes is a lot more expensive because each process has its own address space, so a switch flushes the TLB, and there is more bookkeeping for the OS. This is especially bad for a DB, which usually spends most of its time waiting on IO and therefore switches execution context all the time.
* Each process also has a distinct set of file descriptors, so those need to be cloned as well.
* A DB needs lots of locks, and cross-process locks are more expensive.
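From the application side, a minimal sketch of that setup, assuming pgbouncer is listening on its default port 6432 (the DSN and names are placeholders):

```python
import psycopg2

# The app connects to pgbouncer (port 6432 by default) instead of PostgreSQL
# directly (port 5432). pgbouncer hands queries to a small, pre-established
# pool of server connections, so PostgreSQL does not have to fork a new
# backend process for every client connection.
conn = psycopg2.connect("host=127.0.0.1 port=6432 dbname=app user=app")
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()
```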
The biggest one I'm aware of is that connections aren't threaded, and Pg preallocates resources at connection time. Together these make connections expensive compared to MySQL's threaded connections. Many folks run a connection pooler in front of Pg for this reason.
One pro of the non-threaded model is simplicity: there's no need for thread safety everywhere.