Parallelism in PostgreSQL (percona.com)
173 points by fanf2 on Aug 10, 2019 | 26 comments



The section on parallel joins is a bit out of date. Shameless plug: if you are looking for more information, I've written and spoken about this subject extensively, with further links and pointers to other blogs and academic papers.

Parallelism in PostgreSQL in general:

https://speakerdeck.com/macdice/parallelism-in-postgresql-11

Deeper dives on hash joins:

https://write-skew.blogspot.com/2018/01/parallel-hash-for-po...

https://speakerdeck.com/macdice/hash-joins-past-present-and-...


It is not outdated, it just has less information. Parallel joins are a bigger topic in themselves. In this blog, I want to give a brief introduction to parallelism in PostgreSQL.

I will definitely read what you suggested. But stay tuned: I will post more blogs for each sub-topic to cover it all.


Cool. I was just reacting to the fact that you highlighted merge joins in parallel queries with a partial plan on the outer side only, which was a new feature back in 2017 when 10 came out (partial plans on the outer side were already supported for hash joins and nested loops since 9.6), without mentioning that in 2018, when 11 came out, we got parallel hash joins that can use a partial plan on both sides of the join.
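For the curious, a minimal sketch of what that looks like (table and column names are made up; assumes 11+ with the default enable_parallel_hash = on):

  SET max_parallel_workers_per_gather = 4;
  EXPLAIN
  SELECT count(*)
  FROM big_orders o
  JOIN big_customers c USING (customer_id);
  -- In 11+ the plan can contain a "Parallel Hash Join" over a "Parallel Hash"
  -- node, i.e. partial scans feeding both sides of the join; in 10 the inner
  -- side had to be a full (non-partial) scan built separately by each worker.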


I'm not sure why the performance benefits hit a limit with 10 workers on a 64-core machine. It doesn't look like there should be too much communication complexity given largely disjoint workloads.

Are we hitting memory/IO bottlenecks?


It’s not really because of Amdahl’s law but because the concurrency control used by Postgres is not infinitely scalable. IIRC, Postgres uses MVCC with some modifications to give full serializability if desired, and some of this is basically run optimistically, such that it may have to abort a transaction if it cannot be committed in a serializable way. A consequence of this is that the number of aborts will grow very rapidly as you increase concurrency.

A particularly nasty behavior occurs when you get into a situation where given a set of concurrent transactions, all of them must be aborted. While 2PL is not very popular these days, it does have the advantage that you can often entirely eliminate deadlock via carefully written queries. Then you just have to deal with Amdahl’s law and you won’t run into quite as catastrophic performance given high concurrency.
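A minimal sketch of how that surfaces in Postgres under SERIALIZABLE isolation (the table is hypothetical; the point is the error code and the retry):

  BEGIN ISOLATION LEVEL SERIALIZABLE;
  -- reads and writes that overlap with another serializable transaction
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- hypothetical table
  COMMIT;
  -- may fail with SQLSTATE 40001 (serialization_failure); the application is
  -- expected to retry the whole transaction, and under high concurrency that
  -- retry rate is what grows so rapidly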

Andy Pavlo wrote a paper looking at this comparing different concurrency control strategies under very high levels of concurrency.


The limitations of your concurrency primitives are Amdahl's law. We call them "concurrency primitives" but they're really about shared state. They are the sequential parts of the operation.

The commit phase of optimistic locking is cheap except when it's a do-over, and a do-over by necessity makes the second action use the resource twice, the second time serialized behind the first action.


Are you sure about this? I'm not aware of any parts of postgres which use optimistic locking?


> Andy Pavlo wrote a paper looking at this comparing different concurrency control strategies under very high levels of concurrency.

I think you mean his paper "This is the Best Paper Ever on In-Memory Multi-Version Concurrency Control" which is available here: http://www.cs.cmu.edu/~pavlo/papers/p781-wu.pdf

By the way, his courses are really excellent. I highly recommend subscribing to the CMU Database Group channel on YouTube.


As you subdivide the workload you get to a point where the time it takes to issue work to a thread and collect/coordinate the results is greater than the time it takes for the thread to perform the work itself.

Back in the Core 2 Quad Q6600 days, in my early experimentation in multi-threading I found that my MT quicksort routine hit this limit at about 50k elements. At that point further subdivision just made it go slower. The exact limit depends on how expensive the comparison operator is and the synchronization overhead for the thread pool/work queue, but there will be a limit.

Not sure if this is what's going on here, but it is one aspect at least.
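For what it's worth, the Postgres planner models that hand-off overhead explicitly; a quick way to look at the relevant knobs (just a sketch, not tuning advice):

  SHOW parallel_setup_cost;            -- planner cost charged for launching workers
  SHOW parallel_tuple_cost;            -- cost per tuple shipped from worker to leader
  SHOW min_parallel_table_scan_size;   -- below this, a parallel scan isn't considered

Raising the first two makes the planner less eager to parallelize small amounts of work, which is the same trade-off as the quicksort cutoff.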


It’ll be massively I/O bottlenecked on a single spindle, using a table that’s double the size of the available RAM.

I don’t know what the blog post was trying to prove, to be honest. If I wanted to display the benefits of parallelism, I’d have at least tried to make the test heavily CPU constrained, given that that’s the entire point of being able to go parallel. I’d also have shown I/O performance and waits, along with CPU utilisation.
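If anyone wants to reproduce that kind of measurement, a rough sketch of checking waits from inside Postgres (wait events of type IO are reported in 10+; the query below is hypothetical):

  SELECT pid, wait_event_type, wait_event, state
  FROM pg_stat_activity
  WHERE state = 'active';
  -- an I/O-bound parallel scan will mostly show wait_event_type = 'IO'
  -- (e.g. DataFileRead) on the leader and its workers

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT count(*) FROM big_table;   -- hypothetical table; shows shared read/hit per node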


I think it is a combination of factors, but notably: 1) a 426GB table on a 7200rpm HDD, 2) a SeqScan of 8 rows, which is not computationally expensive. It basically goes through a very big file on the HDD, probably for the purpose of presentation, not to show some impressive per-core efficiency.


Agreed: Table: 426GB, RAM: 256GB, and then he says "and it is clear that having more than 10 workers will not give any significant performance boost". Quite expected if it's IO bottlenecked. It doesn't sound like it's about presentation. Maybe I'm missing something.


IO bottlenecking shouldn't prevent parallelism helping.

One theoretical solution is to have millions of parallel queries, and rather than doing random reads to satisfy them, just read the whole table sequentially.

Query latency is high, but throughput is then unconstrained.


> and rather than doing random reads to satisfy them, just read the whole table sequentially

OK, good point, I believe this is called a carousel read (spent 5 mins looking for a link, sorry no luck. I know MSSQL does a slightly fancier version of it).

I think the original article was unclear in what it was trying for, actual performance or exposition.

Edit: could have utterly sworn it was a carousel read, but no cigar. Here it is https://www.mssqltips.com/sqlservertip/4472/sql-server-enter...


Postgres does this, too; at least to some degree. They call it "synchronized sequential scans".
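It's controlled by a GUC that is on by default (just a pointer, not a recommendation to change it):

  SHOW synchronize_seqscans;   -- 'on' by default; lets concurrent sequential scans
                               -- piggyback on each other's current position in the table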



We've been investigating using these features for some compute intensive workloads to see if we can keep some things inside the database that would normally have to break out to a separate compute service. The biggest problem really is this:

> .... if a parallel plan is not produced, refer to the PostgreSQL documentation on parallelism for details.

That is, there are a number of incidental features of queries that can silently disable parallelism. It becomes an art of knowing what will trip it off and another thing you have to test and think about while working with your queries.
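One crude way to test it is simply to look at the plan (a sketch; the query is hypothetical):

  EXPLAIN
  SELECT sum(amount) FROM payments;
  -- look for a Gather or Gather Merge node; if it's missing, something in the
  -- query (a PARALLEL UNSAFE function, FOR UPDATE, a cursor, etc.) has
  -- suppressed parallelism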

Other than that however, it's extremely powerful to be able to use these features and I'm hopeful we'll be able to avoid a whole lot of bespoke compute services we'd otherwise need to make just to do certain compute operations in a timely manner.


I would try to think of it less as a matter of "incidental features... that can silently disable parallelism", than one of "operations that are incompatible with parallel execution".

Writing anything [0], locking rows (e.g., "SELECT ... FOR UPDATE"), running in the context of a cursor or loop, invoking "PARALLEL UNSAFE" functions, and being a subquery in a query that is already parallelized are all cases where it is not meaningful for the planner to try to parallelize, so it punts.

Remember, also, that this is a young implementation. The PostgreSQL community's first priority is taking great care not to eat your data. They'll find more ways to safely parallelize things.

[0] In 11+, you can get a parallel plan on, e.g., "CREATE TABLE ... AS" and "SELECT INTO", if the query is otherwise parallelizable.
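One concrete knob worth knowing from that list: user-defined functions default to PARALLEL UNSAFE, so marking the ones that really are safe lets the planner keep a parallel plan for queries that call them (hypothetical function; sketch only):

  CREATE FUNCTION order_total(price numeric, qty int) RETURNS numeric
    LANGUAGE sql IMMUTABLE PARALLEL SAFE
    AS 'SELECT $1 * $2';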


Thanks for the submission. This looks quite interesting, showing the benefits of parallel execution in a single connection compared to versions before 10.


Thanks for reading that.


The cookie acceptance banner for this site is one of the most obnoxious that I’ve seen. Basically accept them or leave (no problem with this as it’s their site). Thing is, the website had already stored data in my browser when I checked after deciding to leave. What’s the point of the banner then?


[flagged]


Can you cite a source for the claim that "choice can’t be accept or leave"? I would be quite surprised to learn that GDPR forces sites to accommodate users in this manner.


Whether this applies for Google Analytics may depend on how/why it's being used. But this does seem to be the view for advertising-related tracking.

https://ico.org.uk/for-organisations/guide-to-data-protectio...

"People must be able to refuse consent without detriment,"

"The GDPR is clear that consent should not be bundled up as a condition of service unless it is necessary for that service:"

The ICO recently published an opinion on adtech which clarifies some things: https://ico.org.uk/media/about-the-ico/documents/2615156/adt...

"due to the use of cookies to process this information, consent (to the GDPR standard) is still required under PECR at the initial point of processing."

"cookies used for advertising purposes are not ‘strictly necessary' "

The Article 29 working party guidance expresses similar things

https://iapp.org/media/pdf/resource_center/20180416_Article2...

"If consent is bundled up as a non-negotiable part of terms and conditions it is presumed not to have been freely given. "


"let us store cookies, pay us $10 to compensate us for lost and revenue, or leave" would still be allowed though?


Not necessarily. From the Article 29 working party guidance linked above.

"If the customer’s refusal to consent to this processing purpose would lead to the denial of banking services, closure of the bank account, or, depending on the case, an increase of the fee, consent cannot be freely given."


A company that doesn't care enough to have a copy editor go over its blog posts does not inspire confidence. What other errors have crept into this post that aren't obvious?




