PostgreSQL Bloat: origins, monitoring and managing (compose.io)
146 points by craigkerstiens on April 4, 2016 | 34 comments



Does anyone know of a good blog comparing pg against the other major RDBMSs (Oracle, MSSQL, MySQL), especially how they deal with transactions and replication (and why, for example, pg is the only one I know of that has this vacuum issue)?


> why, for example, pg is the only one I know of that has this vacuum issue

Because other DBMSs have different MVCC implementations. Basically, they keep those "snapshot" row versions somewhere else, not in the table itself. For example, in Oracle they go to the undo segment; in MSSQL to tempdb (or you can just disable MVCC); and in MySQL with InnoDB to rollback segments, similar to Oracle (not surprising :).
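
For illustration, in Postgres those old row versions stay in the table itself until vacuum reclaims them; a rough way to watch them accumulate, using the standard pg_stat_user_tables statistics view:

  -- Tables with the most dead (not yet vacuumed) row versions.
  SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 10;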

Other MVCC implementations have their own drawbacks of course.


> in MSSQL to tempdb (or you can just disable MVCC)

In fact, you have to explicitly turn it on in MS SQL Server, twice: by enabling ALLOW_SNAPSHOT_ISOLATION at the database level and by setting the transaction isolation level to SNAPSHOT for your connection/transaction (READ_COMMITTED_SNAPSHOT is the related database-level option that makes plain READ COMMITTED use row versioning).

In MS SQL Server it is an optional enhancement (added in SQL Server 2005) that you choose to use if your use case fits it, rather than being the primary method as it is in Postgres. A great many developers using MS SQL Server don't even know the option exists.
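
For completeness, a minimal sketch of turning it on (the database name is a placeholder):

  -- Enable optional row versioning; MyDb is a placeholder name.
  ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;

  -- Then, per connection/transaction:
  SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

  -- Or make plain READ COMMITTED use row versioning database-wide:
  ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON;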


> A great many developers using MS SQL Server don't even know the option exists.

Yes, the default behavior is so prone to locking - I learned that the hard way. Switching READ_COMMITTED_SNAPSHOT ON brought new life to our DB.


You really want to ensure you've set up your disks very carefully if you have a database that you are using snapshot isolation on.

TempDB serves the following purposes across all databases on a SQL Server instance. I'll just quote Microsoft directly here:

  The tempdb system database is a global resource that is available to all
  users connected to the instance of SQL Server and is used to hold the
  following:

  Temporary user objects that are explicitly created, such as: global or
  local temporary tables, temporary stored procedures, table variables, or
  cursors.

  Internal objects that are created by the SQL Server Database Engine, for
  example, work tables to store intermediate results for spools or
  sorting.

  Row versions that are generated by data modification transactions in a
  database that uses read-committed using row versioning isolation or
  snapshot isolation transactions.

  Row versions that are generated by data modification transactions for
  features, such as: online index operations, Multiple Active Result Sets
  (MARS), and AFTER triggers. [1]
...and the following article [2] notes that tempdb also handles "Materialized static cursors". The "internal objects" include:

  Work tables for cursor or spool operations and temporary large object
  (LOB) storage.

  Work files for hash join or hash aggregate operations.

  Intermediate sort results for operations such as creating or rebuilding
  indexes (if SORT_IN_TEMPDB is specified), or certain GROUP BY, ORDER BY,
  or UNION queries. [3]
... and yet another article [4] explains that tempdb is used:

  To store intermediate runs for sort.

  To store intermediate results for hash joins and hash aggregates.

  To store XML variables or other large object (LOB) data type variables.
  The LOB data type includes all of the large object types: text, image,
  ntext, varchar(max), varbinary(max), and all others.

  By queries that need a spool to store intermediate results.

  By keyset cursors to store the keys.

  By static cursors to store a query result.

  By Service Broker to store messages in transit.

  By INSTEAD OF triggers to store data for internal processing.
In other words, if you run snapshot isolation, it's possible that someone running a query that uses a largish temporary table or cursor can cause disk contention that will affect snapshot isolation. Similarly, if you run a largish query - or many queries, for that matter - whose plan shows a sort or spool (several join operators can cause this), it can affect snapshot isolation as well.

SQL Server is honestly the only database I know that puts all these operations into a single shared resource database. Oracle allows you to hive this sort of stuff off to other tablespaces, and you can reconfigure and tune your disks to your heart's content.

This has been a known issue for a long time, acknowledged by Microsoft and pretty much any serious SQL Server DBA. You don't have to take my word for it; take a look at the following articles, which go into a lot of detail about how to handle tempdb:

* Optimizing tempdb Performance (MSDN) explains some strategies for configuring tempdb, as "the size and physical placement of the tempdb database can affect the performance of a system" [5] - I strongly recommend reading this article when you set up a new database server or have an opportunity to do serious database maintenance that allows you to reconfigure your disk setup

* Capacity Planning for tempdb [3] - actually, definitely read this one as it gives a comprehensive list of things done in the tempdb

* Working with tempdb in SQL Server 2005 [4] - yeah, it mentions SQL Server 2005, but I think a lot of it still applies

* Recommendations to reduce allocation contention in SQL Server tempdb database [6] - the symptom is:

  You observe severe blocking when the SQL Server is experiencing heavy
  load. When you examine the Dynamic Management Views [sys.dm_exec_requests
  or sys.dm_os_waiting_tasks], you observe that these requests or tasks are
  waiting for tempdb resources. You will notice that the wait type and wait
  resource point to LATCH waits on pages in tempdb. These pages might be of
  the format 2:1:1, 2:1:3, etc.
And the cause is:

  When the tempdb database is heavily used, SQL Server may experience
  contention when it tries to allocate pages. Depending on the degree of
  contention, this may cause queries and requests that involve tempdb to be
  unresponsive for short periods of time.
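
A quick sketch of how to check for that symptom, using the DMV and wait-resource format mentioned above (tempdb is always database_id 2):

  -- Look for latch waits on tempdb pages (resource_description of the form 2:<file>:<page>).
  SELECT session_id, wait_type, wait_duration_ms, resource_description
  FROM sys.dm_os_waiting_tasks
  WHERE wait_type LIKE 'PAGELATCH%'
    AND resource_description LIKE '2:%';
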
1. https://msdn.microsoft.com/en-us/library/ms190768.aspx

2. https://support.microsoft.com/en-us/kb/307487

3. https://technet.microsoft.com/en-us/library/ms345368(v=sql.1...

4. https://technet.microsoft.com/en-us/library/cc966545.aspx

5. https://technet.microsoft.com/en-us/library/ms175527(v=sql.1...

6. https://support.microsoft.com/en-us/kb/2154845


""Other MVCC implementations have their own drawbacks of course.""

Thanks for the info. Can you elaborate on the drawbacks of the other techniques?


I can for Oracle, which I'm most familiar with. Its approach adds a lot of complexity, both in implementation and in configuration/administration. The DB engine must ensure the integrity and consistency of data across different DB files in scenarios like backup, restore, crash recovery, DDL, etc. DBAs have to monitor and tune undo space according to the database's growth rate and workload. If that's not done properly, a long-running SELECT can end up receiving a "Snapshot too old" error, which means the DB engine could not reconstruct all the row data as it appeared at a certain point in time.


Rollbacks (canceling a transaction) on other systems usually take longer because the engine physically needs to write the "undo log" back to the data pages.


Yes, but usually this is not a problem, because rollbacks are rare in production.


Even a single rare rollback can kill a prod DB. We had a situation where a bug in our code caused a series of huge UPDATEs in an Oracle DB and locked some critical tables. After ~2 hours we found the session and killed it.

The problem: we had to wait 2 more hours with a dead prod while Oracle was rolling everything back. In Oracle, killed sessions still hold their locks while rolling back.

In Postgres rollback is almost instant.


The upcoming 9.6 has a great feature to speed up vacuum (thus keeping bloat under control).


Not really. What's much faster is the wraparound-protection vacuum, a special, more thorough vacuum that runs at a configured interval, which defaults to every 200M transactions against a database.

Only that one is sped up, so it now runs at roughly the same speed as a normal vacuum.
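
For reference, that interval is the autovacuum_freeze_max_age setting (default 200000000 transactions); the current value can be checked with:

  SHOW autovacuum_freeze_max_age;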


It looked to me like the freeze map allows vacuum to skip pages not touched since the last run, avoiding a lot of work between runs.


That's not new, though - the visibility map has done that for years (since 8.4, IIRC).


Was looking for a mention of this in the article.


The second problem the author mentioned, where a normal (non-FULL) vacuum has added the space from dead tuples to a free list but the free regions are still scattered throughout the database files, seems like basically the same phenomenon as fragmentation in a filesystem. The symptoms are the same: extra I/O operations required due to inefficient packing of files/rows in the partition/data file, and inability to shrink the partition/file to the amount of actual data being stored inside (in the filesystem case, if the user explicitly requested a shrink; in the database case, automatically). pgcompact seems analogous to a typical explicit defragmentation tool, and VACUUM FULL probably has filesystem equivalents, though I haven't heard of them.

But I've heard that in "modern filesystems", fragmentation is basically a solved problem, with defragmentation handled incrementally behind the scenes. Why do we need some random hacky script to do it with modern PostgreSQL?


1. Scale. A 10 GB file has a few million 4kB blocks; a 10 GB database may have a billion records. Worse, moving a block takes a read, typically a seek, and a write; moving a record takes the same, because the disk works at block scale (this could change with flash memory). Also, part of the strategy for fighting fragmentation is to grow a file by a cluster of several blocks at a time, wasting quite a bit of space if the file stays small. That would be too much overhead for a database with small records.

2. MVCC is similar to a log-structured file system, and I don't think fragmentation is a solved issue there. Certainly, Wikipedia doesn't think so (https://sarwiki.informatik.hu-berlin.de/Log-Structured_Files..., https://en.m.wikipedia.org/wiki/Log-structured_File_System_(...) (reading the LFS page makes me think somebody should implement a generational garbage collector for it)


VACUUM is often not the thing that cleans up most bloat in Postgres. There is also HOT pruning:

https://github.com/postgres/postgres/blob/master/src/backend...

Separately, certain index access methods, most prominently B-Tree, support on-the-fly deletion of index tuples:

https://github.com/postgres/postgres/blob/master/src/backend...

Neither of these things need VACUUM to run. They can happen entirely dynamically.
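
As a rough illustration, the standard pg_stat_user_tables view exposes n_tup_hot_upd, the number of updates done as HOT updates (whose dead versions can be pruned on the fly, without VACUUM):

  -- Share of updates that went through HOT, per table.
  SELECT relname, n_tup_upd, n_tup_hot_upd
  FROM pg_stat_user_tables
  ORDER BY n_tup_upd DESC
  LIMIT 10;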

An important point is that there are fine distinctions between the kinds of bloat that exist - distinctions that matter if you want to account for how things work for the purposes of a technical deep-dive, but that often matter a lot less in the real world. For example, PostgreSQL isn't overly concerned about making free space reclaimable by the operating system, particularly in the shortest possible timeframe.


Heh. I was surprised that somebody mentioned/linked to source code details of hot pruning etc.. Then I noticed it was you Peter ;)


PostgreSQL supports doing all this automatically in the background. Doing it explicitly is an option when you want that level of control.


PostgreSQL is a great modern database when it comes to features and functionality, but it is not very modern at the infrastructure level. Stuff like backups, replication, scaling, data file maintenance, etc. is still not well developed.

The answer to your question is that the work just hasn't been done yet. It seems like the pace has been picking up recently, and I certainly like the new features, but I do wish the team would focus more on the behind-the-scenes infrastructure stuff.


What's the reference point here? To my mind, a lot of the advantages of Postgres are its mature tooling for backups and replication; autovacuum is more than sufficient for fairly complex workloads when it comes to data file maintenance. Really the only somewhat-sore point is horizontal scaling - but even there, a lot of the pain comes from there being multiple viable solutions in the wild, with none having decisive buy-in across the userbase, and as such no out-of-the-box solution in the standard DB (which is, happily, recognized by the developers and very much a focus of current development work).

But I'm comparing mostly with MySQL (and its myriad descendants) and the whole NoSQL ecosystem; for all I know the tools available for PG are hopelessly primitive for people coming from Oracle/IBM/SQL Server/Ingres/Teradata (or others? I pretty regularly learn of new DBMS engines with shockingly large companies and engineering orgs behind them that I've just never heard of; I'm sure there are plenty more).


You have it covered in the last sentence. There are tons of 3rd-party tools, scripts, and packages, but nothing seems mature and ready for production use. It all feels like hacking things together, especially compared to the bigger commercial databases, which is a major issue given that's the market Postgres is targeting now.


PostgreSQL does backups & replication pretty well these days... as for scaling... not sure what you're getting at there... it does quite well. It took them so long to get there because they were trying very hard to focus on doing the behind-the-scenes infrastructure stuff correctly.


Excellent article - it reminds me of this comment:

https://news.ycombinator.com/item?id=11322244

In Oracle you can set the amount of MVCC storage. Can any experienced PostgreSQL DBAs confirm whether bloat is a big issue they must deal with?


We provide PostgreSQL support/DBA services for a lot of customers, and while I don't have any detailed stats, I can probably draw some general conclusions...

I'd say that for most deployments bloat is not a big issue. But of course, once in a while a customer gets bitten by it and we have to intervene.

There's a variety of reasons for that:

1. disabled autovacuum

Hey, we've disabled autovacuum because we can't risk it affecting production! Hey, we do know our workload patterns much better! Hey!

2. default autovacuum configuration (or bad tuning)

The defaults are very conservative and generally require tuning (cranking up) on large databases - see the example settings after this list. But people tend to do exactly the opposite, driven by the misconception that it will make autovacuum less intrusive.

3. doing things that are known to break things

Like, keeping transactions open for a long time.

4. pathological workload patterns

There are a few workload patterns that result in bloat (particularly in indexes), and autovacuum can't really fix that. For example, large bulk deletes may cause this.

This is probably the one thing that can't be fixed by configuration changes, etc.
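
To make point 2 concrete, this is the kind of postgresql.conf tuning I mean. The values are examples only - the right numbers depend entirely on the workload (defaults are noted in the comments):

  autovacuum_vacuum_scale_factor = 0.02   # default 0.2, i.e. vacuum after 20% of a table changes
  autovacuum_analyze_scale_factor = 0.01  # default 0.1
  autovacuum_vacuum_cost_limit = 2000     # default -1, which falls back to vacuum_cost_limit (200)
  autovacuum_max_workers = 6              # default 3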


Re 4) Bulk deletes causing index fragmentation isn't really related to MVCC, though; it's certainly something non-MVCC engines also have trouble with.


It certainly exists, and in unexpected places, like the system catalogues. It's especially visible if a (wrong) management decision has been made to switch off autovacuum - then everything tends to blow up, and fast.

In short: explicit bloat management and autovacuum tuning are a necessity for all high transaction rate workloads in PostgreSQL, and the stats need to be monitored closely to detect workload changes.
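
One way to get exact numbers rather than estimates is the pgstattuple extension (it ships in contrib); the table name here is just a placeholder:

  CREATE EXTENSION IF NOT EXISTS pgstattuple;

  -- Reports table_len, dead_tuple_percent, free_percent, etc. for one table.
  SELECT * FROM pgstattuple('some_table');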


To the author: those code snippets would be 10x more readable if you got rid of the useless "admin@aws-eu-west-1-portal:compose (session 1)" preceding each input line.


And/or get rid of the insanely narrow fixed-width centered column and let the user resize it along with the window. So many programming blogs put code samples in a fixed-width unresizable column.


Not sure about this suggestion. Long lines of text are difficult for the eyes to track. That's why you'll see tools like "Clearly" format text into narrower columns. Apparently there is an optimal number of words per line.

Here is one write up on it: http://baymard.com/blog/line-length-readability


Short lines are definitely good for prose (book makers have known this for centuries), but I find that code is different. I tend to find that code with longer lines is much easier to read than code that rigidly adheres to (say) an 80-column length limit.

I think it's because I tend not to read code in order. I like to see the shape of it, look at just the beginning of a line to see what might be going on in it, and only look at the whole line if it seems relevant to what I'm doing. Normally just seeing the first 20 or so characters (or pulling some keywords out from the syntax highlighting) is enough to understand the broad strokes.


Doesn't help me when I'm reading on an iPad or on the train on an iPhone. Just sayin'.


Good point, and I've removed the user/host part, but the session part is relevant. With Compose's Write Stuff programme (compose.io/write-stuff) we try to retain the contributing author's content and styling as far as possible.



