Just from reading the documentation, the full text search features on Postgres already look pretty powerful. And it is encouraging that they are actively being worked on. I'm wondering how this compares to a dedicated search engine like Solr or Elasticsearch.
Are there huge differences in performance, features or search quality? At which scale does using Postgres for full text search still make sense?
Having used all 3, Postgres search is my go-to for most use cases, simply because I don't have to deal with managing deltas to an outside system and keeping things in sync. The search features are powerful and fast, and PG's ability to combine multiple indexes in search results makes it trivially easy to include a bit of full text search in a query right next to geographic distance filters or other conditions. You can also combine multiple types of searches on the fly if you're feeling whimsical.
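For example, a rough sketch with made-up table and column names, assuming PostGIS alongside a GIN-indexed tsvector column:

    -- Full text match and a geographic distance filter in one query;
    -- the planner can combine the GIN index on tsv with the GiST index on geom.
    SELECT name
    FROM   places
    WHERE  tsv @@ to_tsquery('english', 'coffee & roaster')
    AND    ST_DWithin(geom, ST_MakePoint(-122.42, 37.77)::geography, 2000)
    ORDER  BY ts_rank(tsv, to_tsquery('english', 'coffee & roaster')) DESC
    LIMIT  20;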
IMO, the only time to reach for an outside system is when the data isn't being written to PG first (like log ingestion with elastic search) or when search is such a central part of your app that it mandates a separate dedicated system.
Are there any good options to support logic (and/or) and facets/fields with Postgres? We started using ES basically just for the "free" query language. (Obviously we would want something that is safe from sql injection.)
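For the and/or part, tsquery already has boolean operators, and simple facet counts are just a GROUP BY. A hedged sketch with invented table/column names (the boolean expression is hardcoded here for illustration; pass user-supplied terms as bind parameters to to_tsquery/plainto_tsquery rather than interpolating them, which keeps it safe from SQL injection):

    -- & (AND), | (OR) and ! (NOT) inside the tsquery...
    SELECT category, count(*) AS matches        -- ...and a simple facet count
    FROM   products
    WHERE  tsv @@ to_tsquery('english', 'shoe & (running | trail) & !kids')
    GROUP  BY category
    ORDER  BY matches DESC;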
What about when you have different PG database instances that have data you want to join on? Would you still use PG as an aggregated read-only copy of the databases, or would you use, for example, ES?
It's situational. When you need a search across multiple data sources, standing up a dedicated search engine makes a lot more sense. Then again...PG Foreign Data wrappers would make that scenario pretty simple without the need for an aggregate.
I can't speak to performance in that situation though.
I used it at 9.4 for a document management system with thousands, not millions, of PDFs that got indexed on upload, and it worked extremely well at that scale--fast, and with all the basic text search features well-covered (tokenization, stemming, etc.). A big win for me was that doing it well in Postgres meant the site could stay a simple Django site rather than adding another service.
You're right, PostgreSQL needs the plain text to highlight it with ts_headline. It's similar to Elasticsearch keeping the original document in the _source attribute. Thanks!
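For reference, a minimal ts_headline call over a stored plain-text column (table and column names invented):

    -- ts_headline works on the original text, not the tsvector,
    -- so the plain text has to be kept around somewhere.
    SELECT ts_headline('english', body,
                       to_tsquery('english', 'search & engine'),
                       'StartSel=<b>, StopSel=</b>, MaxWords=35') AS snippet
    FROM   documents
    WHERE  tsv @@ to_tsquery('english', 'search & engine');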
Curious to know since you mentioned that it was fast for thousands of PDFs... any rough timing information on some of your queries for that kind of dataset?
I'm really reaching here to recall, but the short version is that actual searches never took more than a second. All I really cared about was how noticeable a delay to expect, and it was never more than that.
On a bulk import of 1,000+, it took a couple minutes to ingest them. This was all on a $20/month VPS.
I have found Postgres to be good enough for search.
As in... it works well enough, and the advantage of not having to add other tech makes it a no-brainer. I've had zero support issues or customer complaints, and most of my applications use full text search heavily.
The big advantage over other approaches: because it's SQL and it lives in the same database where I also store users and permissions, I can permission-limit my full text searches.
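Something along these lines, with hypothetical tables:

    -- The permission check is just another join in the same query as the search.
    SELECT d.id, d.title
    FROM   documents d
    JOIN   document_permissions p ON p.document_id = d.id
    WHERE  p.user_id = $1
    AND    d.tsv @@ plainto_tsquery('english', $2);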
We have been using Postgres Full Text Search for about 3 years now in production. The app is an analytics dashboard over a set of structured and unstructured data. We have about 20M documents, with hierarchies, dimensions, but also free text elements. It works extremely well, and having the possibility to GROUP BY as one would do in SQL is a godsend for tabular or graph-based data. Performance is really good, in particular due to the parallel aggregations.
We recently tested loading our data into an Elasticsearch index for one particular use case (a weighted sum over the 20M rows based on an FTS criterion) where we felt Postgres was underperforming. On the same hardware, using all available RAM and CPUs, ES took 6s and PG took 0.7s.
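The query was roughly of this shape (names changed/invented), and parallel aggregation in 9.6 helps a lot with it:

    -- Aggregate a weight over every row whose tsvector matches, per dimension.
    SELECT dimension, sum(weight) AS total
    FROM   facts
    WHERE  tsv @@ to_tsquery('english', 'term')
    GROUP  BY dimension;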
So far, on the 30+ queries of our dashboard tool, we have yet to find a use case that Postgres didn't handle better than Lucene based solutions.
Mind sharing a table structure from your db? I'm using ES for a project and would prefer to keep things simple (already use postgres in another part of the system).
We use Xapian to search over millions of documents. We are thinking of switching to PostgreSQL's built-in FTS to simplify our system. We ran an internal benchmark which showed that PostgreSQL can be competitive with Xapian, except when you need to rank results (in that case the performance is bad).
You'll be interested in ongoing work in this area, then. Oleg & Teodor are working on a new index type (RUM indexes, no less) which will speed up ranking operations considerably.
"ZomboDB is a Postgres extension that enables efficient full-text searching via the use of indexes backed by Elasticsearch. In order to achieve this, ZomboDB implements Postgres' Access Method API.
In practical terms, a ZomboDB index appears to Postgres as no different than a standard btree index. As such, standard SQL commands are fully supported, including SELECT, BEGIN, COMMIT, ABORT, INSERT, UPDATE, DELETE, COPY, and VACUUM."
I had used PostgreSQL for a decade, including full-text search, but just within apps that were already storing their data in Postgres.
The time came to replace our website search (tens of thousands of pages), and we decided to try rolling our own. Someone suggested ElasticSearch, and as I read through it, it seemed to do less than PostgreSQL. I still had the hard problems of (1) spidering the site and (2) converting all the file formats (.doc, .xls, .pdf, etc.).
I ended up just putting wget on a daily cron job to spider the site. Then I ran the saved files through a hodgepodge of scripts to extract the plain text and put it into PostgreSQL.
Once it's there, it's far easier to do the rest. Postgres has its own functions to search for matches, rank the matches, give you snippets, and even highlight the search words in the snippets. It's amazing.
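A sketch of that whole pipeline in one statement, with made-up table and column names:

    -- Match, rank, and highlight; assumes pages(url, title, body, tsv tsvector)
    -- with a GIN index on tsv.
    SELECT url, title,
           ts_rank(tsv, q)                 AS rank,
           ts_headline('english', body, q) AS snippet
    FROM   pages, to_tsquery('english', 'annual & report') AS q
    WHERE  tsv @@ q
    ORDER  BY rank DESC
    LIMIT  10;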
Searches run in a split second. Well, at first, when I was testing, they often took a few seconds. But the weird thing is that after go-live it ran faster. My best guess is that so many users caused Postgres to cache more and more of itself into RAM. The whole server is still using less than 1 GB though, and it's running Apache and Postgres for the website and all its apps.
42 MB for the table, which has columns for the address, title, plain-text body, and computed text vectors for 43,000 pages (web pages and office documents of average length). Then another 100 MB for the GIN index on the text-vector column.
Yes there are huge differences between quality and performance. Apache Lucene (powers Solr and ES) is still the best by far. However if Postgres search works well enough for your use case then great. As others have said, it is one less dependency to manage.
In my experience performance is great if you're just doing text search, but if you combine that with other operators in the same SELECT it can be much slower than Elasticsearch since in many of those cases Postgres needs to fall back to a full table scan.
If you create indexes for all the columns used as filters, which is essentially what Elasticsearch does, then PostgreSQL is able to combine them (generating a bitmap for each used index and ANDing them) and you should get decent performance, shouldn't you?
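For example (hypothetical table), this lets the planner bitmap-AND the separate indexes rather than fall back to a full table scan:

    CREATE INDEX ON items USING gin (tsv);
    CREATE INDEX ON items (category_id);
    CREATE INDEX ON items (price);

    EXPLAIN
    SELECT *
    FROM   items
    WHERE  tsv @@ to_tsquery('english', 'widget')
    AND    category_id = 42
    AND    price < 100;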
While it's OK for our purposes, I would wish for a bit more customisability of the text parser, and it definitely needs better support for compound words to be perfect.
> At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications
and it really means it - changing the behaviour of this component is not possible unless you write a completely different parser in C, which, while possible, is not a fun experience.
We're using the full text feature over product data, and we have to work around the parser sometimes too eagerly detecting email addresses and URLs, which interferes with properly detecting brand names that contain some of these special characters.
The other problem is the compound support. A lot of our data is in German which like other languages likes to concatenate nouns.
For example, you'd absolutely want to find the term "Weisswürste" for the query "wurst" (note the compounding and the umlaut added for the plural of "Wurst").
Traditionally, you do this using a dictionary and while Postgres has support for ispell and hunspell dictionaries, only hunspell has acceptable compound support, which in turn isn't supported by Postgres.
So we've ended up using a hacked ispell dictionary where we have to mark all known compounds which is annoying and error-prone.
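For anyone curious, wiring up such a dictionary looks roughly like this (the dictionary name and files are placeholders for our hacked one; the .dict/.affix/.stop files live under $SHAREDIR/tsearch_data):

    CREATE TEXT SEARCH DICTIONARY german_ispell (
        TEMPLATE  = ispell,
        DictFile  = german_custom,
        AffFile   = german_custom,
        StopWords = german
    );

    CREATE TEXT SEARCH CONFIGURATION german_compound (COPY = german);

    ALTER TEXT SEARCH CONFIGURATION german_compound
        ALTER MAPPING FOR asciiword, word
        WITH german_ispell, german_stem;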
Also, once you have to use a dictionary, you end up with a further issue: loading the dictionary takes time and, due to the way Postgres currently works, it has to happen per connection. In our case, with the 20MB hacked German ispell dictionary, this takes ~0.5s, which is way too long.
The solution for this is to use a connection pooler in front of Postgres. This works fine but, of course, adds more overhead.
The other solution is http://pgxn.org/dist/shared_ispell/, but I've had multiple postmaster crashes due to corrupted shared memory (thank you, Postgres, for crashing instead of corrupting data) related to that extension, so I would not recommend this for production use.
Lucene, and by extension Elasticsearch, has much better built-in text analysis features, so we could probably fix the parser and compound issues there. But that would of course mean even more additional infrastructure, and probably some performance issues too: unfortunately we absolutely cannot just return all the FTS matches, but have to check them for other reasons why they must not be shown, which of course hits the database again, and I'm wary of somehow putting all that logic into ES as well.
This is why we currently live with the Postgres tsearch limitations. But sooner or later we'll probably have to bite the bullet and go with a dedicated solution.
I'd like to know more about your case, because my own experience is that ordering by ts_rank causes a big slowdown.
PostgreSQL documentation says: "Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches."
Some PostgreSQL developers are working on improving this by using indexes only to compute the ranking, but the related patches are not done yet.
What is the size of your data set (number of rows and size on disk) and the average response time?
I do some searching with pgsql and ts_rank on a fairly small dataset: 10GB and 11 million rows (mostly chat data), and get response times for ranking over all of it of around 10-100ms on a cheap 5€ OVH VPS.
Sometimes queries end up even a lot faster, for example the same as above, but searching for "c plus plus", runs in this plan + runtime: https://explain.depesz.com/s/NPOc (11ms)
Last time I tried, it was on a machine with a spinning disk... It looks like I should try again with an SSD, which is a lot better with regard to random access.
Your search term is "Quassel". What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.
> What happens if you search for a term that matches a lot of rows? This is the case where ts_rank is very expensive. I'd be curious to look at the explain of such a low-selectivity query.
That’s actually quite unproblematic, if you have the tsvector as its own column (not just as index).
It’s far more problematic to actually load that data from disk.
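For reference, the usual 9.6-era setup for keeping the tsvector in its own column (names invented; the built-in trigger keeps it in sync on writes):

    ALTER TABLE messages ADD COLUMN tsv tsvector;

    -- Backfill existing rows.
    UPDATE messages SET tsv = to_tsvector('english', body);

    -- Keep it up to date on INSERT/UPDATE.
    CREATE TRIGGER messages_tsv_update
        BEFORE INSERT OR UPDATE ON messages
        FOR EACH ROW
        EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', body);

    CREATE INDEX ON messages USING gin (tsv);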
> That’s actually quite unproblematic, if you have the tsvector as its own column (not just as index).
Yes, it works, but it is slow because the tsvector is usually big enough to be stored in a TOAST table, and this produces a lot of random access reads.
This is why there is a project to store additional information (term positions) in the GIN index, so that the index contains everything necessary for the ranking:
> usually big enough to be stored in a TOAST table
Ah, luckily, in my case, that can’t happen – each row’s message contains one IRC message, so at most 512 bytes. That also automatically ensures we’ll never run into TOAST issues.
Yeah, if your vectors are in TOAST, you really have a huge issue with ranking. There’s no simple way to get around that, except with customized solutions like Lucene/Solr/ES
Ordering can get expensive no matter what, just based on how many things you're actually sorting. Ideally, if you can find a way to limit the size of the data set before the ranking sort, you'll see a big improvement.
Currently, the only way to make ranking tolerable is to limit the size of the data set before the ranking sort. This is what I did in my tests.
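Roughly this shape (names invented); the inner LIMIT keeps the expensive ts_rank sort small, at the cost of only ranking a subset of the matches:

    SELECT id, title
    FROM  (SELECT id, title, tsv
           FROM   documents
           WHERE  tsv @@ to_tsquery('english', 'common & term')
           LIMIT  1000) AS candidates
    ORDER  BY ts_rank(tsv, to_tsquery('english', 'common & term')) DESC
    LIMIT  20;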
But it looks like it could be possible to massively improve ranking performance by storing all necessary information directly in the GIN index, as proposed here:
Usually an RDBMS like PostgreSQL is used in an environment that has a different usage pattern than search. SOLR can take advantage of specializing for that type of usage. However, Russia's largest search engine, Yandex, seems to like PG http://momjian.us/main/blogs/pgblog/2016.html#September_28_2...
I doubt they use it for actual search, though. Given that the company's core product for end users is a search engine, Yandex probably has some in-house system that the Mail team can leverage for their search needs.
For the small data use cases I've seen, Solr always returns results quickly. You're limited in how you can query it - it's modeled as one giant table, and the query syntax is much more idiosyncratic than SQL.
Postgres is really reliable, and I think a lot of the performance difference comes from robust transactions. For some use cases you can use both and replicate data or query one + the other in sequence.
It looks like Postgres Professional is working on improving FTS. Here is a relevant presentation from Oleg Bartunov about the new RUM index and its benefits for FTS:
Can pgsql fts do stemming or more complex lemmatisation for languages other than English? Or ranking of results based on Okapi BM25 or similar? I was looking into this about two years ago and those were the features in favor of Lucene (basis of ES and Solr).
Pgsql can do stemming, and everything it can do in English it can also do in several dozen other languages, including German, French, Spanish, and any language for which you install dictionaries. It's quite useful.
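The language is just the configuration name passed to the FTS functions (output omitted, since the exact lexemes depend on the installed stemmers/dictionaries):

    -- Configurations shipped with the server:
    SELECT cfgname FROM pg_ts_config;

    -- Pick the configuration per call, or set default_text_search_config:
    SELECT to_tsvector('german', 'Die Bücher wurden gelesen')
           @@ to_tsquery('german', 'Buch');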
Please can the Postgres team put some major focus on completing logical replication [1]. It's the missing piece to making upgrading across major versions painless and quick on large databases, so that we can take advantage of all these nice new features. We're on Heroku's hosted Postgres service, so we can't install the pglogical extension.
I don't think you would be able to take advantage of logical replication on Heroku Postgres regardless--they don't allow you to replicate to your own instances, only other Heroku-hosted instances.
This makes migrating off Heroku for Postgres a PITA that requires downtime.
True, it would be an extra bonus if Heroku started allowing replication to non-Heroku instances. But as long as they support logical replication to a Heroku follower instance, you can upgrade to new major versions with near-zero downtime: set up a logical follower running the latest Postgres version and promote it to master once it has caught up. Currently you can't have a follower that is a different version from the master, which means an upgrade requires either a full backup and restore to the new version (significant downtime if you have a large database) or the pg_upgrade utility, which is generally not recommended as it is not guaranteed to work.
Everyone speaks about InnoDB and how performant and reliable it is... and multiple firms (Uber/Pinterest/AWS) even use it as a KV store, bypassing MySQL entirely. I have never heard much about storage engines in Postgres; why could this be?
Wikipedia has a (stub) article on InnoDB, but nothing on Postgres' storage engines... just wondering why that is.
The PG storage engine is not particularly awesome. It is basically COW (with exceptions), and compaction (called vacuum) has been quite painful for a long time. Every release it is supposedly fixed, but people keep complaining. This is not to say PG sucks: its optimizer knows far more about the data than InnoDB's, and PG can perform far more types of execution plans.
We (Pinterest, I wrote most of the MySQL automation) make heavy use of MySQL replication which is vastly simpler to manage than PG. All queries still flow through SQL and unlike PG, we can force whatever execution plan we need. We do lots of PK lookups, and InnoDB is really good at that. In InnoDB all the data is stored in the PK while in PG it is just a pointer.
>>In InnoDB all the data is stored in the PK while in PG it is just a pointer.
This is just a consequence of the PK being a clustered index in InnoDB which has both pros and cons. One of the big cons is that all of the columns of the PK are implicitly added to every secondary index as the row identifier. That isn't a big problem if your PK is a single column int, but if it's multiple columns, that often results in unnecessary bloat in your secondary indexes. Ideally (as in, dare I say, MS SQL Server), you'd have the option of a clustered or non-clustered PK for your table so you could choose the optimal index structure for your workload on a per-table basis.
Yes, you can, but it doesn't change the fact that you still have a clustered index (an index-organized table), which is great for PK lookups but bad if you do a secondary index lookup (because you need to look up through 2 B-trees instead of 1). There is a real, and well-known, tradeoff here.
You missed the point of my post. You are going to have one of the two issues: either looking through two indexes, or having every secondary index include a large PK. At least with InnoDB you can make the choice. The strategy I suggested gets you the desired outcome of not including a large PK in all secondary indexes.
> The strategy I suggested gets you the desired outcome of not including a large PK in all secondary indexes.
For an application in which most queries need a secondary index lookup, using heap organized tables is more efficient because the database needs to traverse only one B-tree (for the secondary index) that gives the physical position of the row in the heap. When using index organized tables, the database needs to traverse 2 B-trees (the secondary index first, then the primary index). Making the primary key short by using an auto incrementing integer helps, but doesn't remove this overhead.
The other part of the tradeoff is that inserts and many other write operations are less expensive in heap tables. A Big Table in InnoDB, measured in "when do I start having to spend a lot of time troubleshooting this table's performance" is about 1% the size of a Big Table in Postgres. TokuDB was introduced for MySQL for a reason.
Heap vs. Index organization is a classic tradeoff of database design.
Now, if you're saying "it would be really nice if Postgres had the option of index-organized tables" I'd agree with you. I'd love to have that, as an option.
> All queries still flow through SQL and unlike PG, we can force whatever execution plan we need. We do lots of PK lookups, and InnoDB is really good at that.
Being able to force the execution plan is more useful in MySQL than PostgreSQL because MySQL's optimizer is not very good at planning queries.
If you do a lot of PK lookups, then you don't need to force the execution plan.
FDWs are a great way to access external data sources, but they lack proper support for visibility, transactions, and so on. Also, the FDW API follows the "tuple at a time" execution model, which prevents a lot of optimizations in the upper part of the stack (vectorized execution etc.).
There are several products using FDWs to change storage, but I'd call it a misuse of a feature designed for very different purpose.
IMHO it's hardly a way forward without significant changes/improvements (which may happen, I don't know).
Having "CREATE ACCESS METHOD [...] ON STORAGE|TABLE" to create a custom access method, or storage engine, and extending CREATE TABLE to be able to pass a storage method with the table definition could become a quite powerful combination. The main challenge is to come up with an interface solid enough to be able to handle problems related to MVCC, like VACUUM cleanup.
> I have never heard much about storage engines in Postgres, why could this be so?
Because PG isn't designed around pluggable storage engines, so it's not really as practical to take a storage engine out and use it separately, and it doesn't make much sense to talk about the storage engine separately from the whole system.
FWIW, these solutions rarely bypass MySQL entirely or at all. Although there are ways to access InnoDB without making SQL queries (Memcached API; Handler Socket), the MySQL server is still involved. It just skips the normal protocol, auth, SQL parsing, etc.
Even then, there aren't a lot of published cases of people using these alternative access methods at scale yet. AFAIK, all of the large kv use-cases you've mentioned still go through traditional SQL queries. Despite the overhead of SQL parsing, it provides more control and visibility. The ecosystem around alternative access methods isn't nearly as mature.
You may be confusing InnoDB with MyISAM (which is prone to corruption, especially upon crashes) or with running MySQL without a strict SQL mode (which causes bad things like silent truncation of overflowing values).
InnoDB is, and always has been, a very reliable and durable storage engine with solid performance characteristics.
This one is huge for my company. Almost every single query of ours could use an index-only scan, but the planner would never choose to perform one because of the weirdness around partial indexes. We're expecting a several-x speedup once we upgrade to 9.6. All they need to improve now is a way to keep the visibility map up to date without relying on vacuums.
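A minimal illustration of the 9.6 change, with a hypothetical table: a column that appears only in the partial index's predicate no longer prevents an index-only scan:

    CREATE INDEX orders_open_idx ON orders (customer_id)
        WHERE NOT deleted;

    -- On 9.6 this can use an index-only scan on orders_open_idx even though
    -- "deleted" is not one of the indexed columns.
    EXPLAIN
    SELECT count(*)
    FROM   orders
    WHERE  customer_id = 42
    AND    NOT deleted;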
I don't see ever going away from using vacuum to maintain the visibility map, but hopefully the changes in 9.6 will make it a non-issue on large tables.
> I don't see ever going away from using vacuum to maintain the visibility map
I don't think that's so unlikely to change. There are two major avenues: write it during HOT pruning (which is done on page accesses), and perform a "lower impact" vacuum on insert-only tables more regularly.
> but hopefully the changes in 9.6 will make it a non-issue on large tables.
You mean the freeze map? That doesn't really change the picture for regular vacuums, it changes how bad anti-wraparound vacuums are. The impact of the table vacuum itself is most of the time not that bad these days (due to the visibility map), what's annoying is usually the corresponding index scans. They have to scan the whole index, which is quite expensive.
I think there is already a patch for Postgres 10 that runs the vacuum on insert only tables. While not completely solving the problem, that will be helpful.
I tested parallel queries on PostgreSQL 9.6 on a few TBs of data, 5 billion rows, on an older dual Xeon E5620 server. I also striped 4 Intel S3500 800GB drives with ZFS and enabled LZ4 compression, which has a compression ratio of 4x.
For a sequential full table scan I could process about 2000MB/s of data (only 125MB/s was read from each SSD); I was limited by CPU power.
Anyway, the same query took about 25 minutes on PostgreSQL 9.5 and now it was down to 2 minutes and 30 seconds. For comparison, SQL Server 2012 spent 7 minutes on the same dataset on the same hardware.
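For anyone trying something similar: parallelism in 9.6 is gated by a single setting, which has to be above zero for the planner to use parallel workers at all (table and column names invented):

    SET max_parallel_workers_per_gather = 8;

    EXPLAIN (ANALYZE)
    SELECT count(*), avg(amount)
    FROM   big_fact_table
    WHERE  created_at >= '2016-01-01';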
Would you be willing to re-run that with SQL Server 2016? A Dev license is free, and there's been a lot of relational engine optimization since 2012. I'd be curious to see what the latest release can do compared to Postgres' latest.
I realize I'm asking a stranger on the internet to do something for free for me. If you don't have time or inclination to do this, no worries, but it seems like you've got a nice setup to be able to play with this. I'm sure I'm not the only one curious to see such a comparison.
I'm a fan of both PostgreSQL and SQL Server, but I think these numbers are very workload-specific. I've gotten 1GB/s throughput on SQL Server 2012 on spinning disks and CPUs older than the E5620, so I've no doubt that same workload would exceed 2GB/s on your hardware. The apples-to-apples comparison here is between the two versions of PG where the performance improvement is clear. It's harder to do an apples-to-apples comparison between PG and SQL Server because the optimal schema and queries for a particular workload are likely to differ for each of them.
With SQL Server I don't get 2000MB/s on the same hardware, more like 600-800MB/s. This is most likely because of LZ4 compression and large block sizes (64k-128k) on ZFS, that results in a lot less IO. Because with SQL Server, IO was the bottleneck.
So yes, it is very workload specific. For random read/write they are probably more similar. But for reading a lot of data that can be read sequentially, PostgreSQL seems to win hands down, because it can get a lot of help from ZFS compression.
I would love to run the same test when SQL Server is available on Linux. ZFS also delivers slightly better throughput and slightly more IOPS on the FreeBSD platform, which I ran this benchmark on. And SQL Server probably demands a 4k block size, which is so small that LZ4 compression has little effect; I've already tried running SQL Server on ZFS via iSCSI.
Hmm, I've never run SQL Server on anything other than NTFS where 64KB was definitely the recommended block size. In either case, it's great to have choices. When the license fee is coming out of my pocket, I'm definitely not choosing SQL Server.
There's also a limited amount of memory-level parallelism available... with 4-DIMM sockets you might need an 8-socket machine to get a 32x improvement on large (memory-bound) sequential scans, which I'd guess you can get on top-end Power machines.
(You can probably get more memory level parallelism with random access, but your overall bandwidth will likely be lower... fully exploiting memory bandwidth is complicated and difficult to do for real applications).
It's very difficult to get a 32x speedup from 32 cores as there are always parts that are inherently serial, so it's more likely they tested it on a 64 core machine or something like that.
Congrats on great release. With availability of E7-8800 v4 based servers (up to 192 cores in a single box) PG can cover a huge number of use cases without complicated setups.
On that topic - what's the general feeling about RDS?
I'm running pg on ec2 with a hot standby slave. I need the postgis extension but am not doing anything particularly esoteric. Ideally I'd like to have the certainty of aws handling backups for me.
I was researching moving to RDS today and would love to hear thoughts on whether it's a good general solution or not. What happens about downtime during upgrades or swapping instance sizes?
> What happens about downtime during upgrades or swapping instance sizes?
This is one of my favorite features of RDS: You can set a maintenance window and have the option to not have changes take effect until that window. So if I want to upgrade Postgres or change the instance size, I set it up and the downtime happens when I'm fast asleep and nobody is using the site.
I also think (but not 100% sure) that if you have Multi-AZ enabled, changes are done by upgrading the slave, failing over, and then upgrading the ex-master, so downtime is limited to the failover period.
Ah ok. That's useful info about the multiAZ setup - I'll have a look into that.
In my case, we now have customers around the world so we don't get the "night time" luxury. Part of the work I'm now completing is to split the system into an accounts db and customer data db. I was thinking to dip my toe in the water by just moving the account db to RDS to see how it goes.
Downtime is limited more by your application, i.e. if there is a failure your connection/database pool just needs to reconnect.
BTW, we have only had psycopg2 pointing at it so far, and that worked without downtime.
However, I'd guess Java's HikariCP handles it just as well.
Only "failures" will have a "small" downtime; planned downtime is pretty much zero.
Yes, I've noticed this too. The AWS documentation and the console both say that the changes may take a long time to apply, but in fact the database is up for most of that time. I've done several big upgrades that had only a few seconds of downtime.
Postgres RDS is solid and it supports PostGIS. If you have multi-AZ enabled, then downtime is typically measured in seconds even when upgrading versions or changing instance sizes. It will update one instance, automatically switch to it, and update the other one. It automatically handles backups, syncing read slaves, etc too. It is awesome in my experience.
I've just read about this and there are a lot of people saying they had significant downtime during upgrades ([1] one such story, but there were quite a few on stackoverflow etc).
I'm going to run some tests myself to see how well it works on my existing data (only 30GB at the moment, but it was only 20GB a month ago and is growing fairly rapidly).
Compared to others, my experience with RDS is bad, although I last used it about a year ago, so perhaps things have improved.
One major issue is that you are restricted in what you can do with it; not all options are available, and you can only use the extensions that they provide. I'm a bit fuzzy about this, but changing the disk size made the service unavailable for ~30 minutes (proportional to the new disk size). You weren't able to configure replication yourself; replication only happens to the backup node, and you weren't even able to set up replication across regions.
The replication limits are kind of a bummer, because if you ever want to move your data (perhaps to a VM or outside of AWS) you need to have an outage. Also, if I remember correctly, there was no way to do a major version upgrade in place.
There was also another incident (it was caused by a bug, so hopefully it was fixed and won't happen to anyone else). We had a cluster set up with a backup. One day, out of nowhere, the service stopped working and was unavailable for 1.5 hours. That was quite a big issue because we used it for monitoring (Zabbix), so any outage makes us blind to issues. It turned out that, due to a bug, their backup routine made a mistake and started running the backup on the master server (normally it's supposed to run on the slave).
Probably in 3-4 months. AWS has historically had a 3 month gap time for postgres. Their policy (from what they have said on the forums at least) is they wait for at least x.x.1 release before they start working on it.
> Is it just selection bias from posted links on HN, or has the PostgreSQL team been doing many (feature) releases lately?
I think more like the former -- as I recall, the recent articles have mostly been about specific work going on for the 9.6 release, prereleases of 9.6, and now the actual release of 9.6.
There's still only one major PostgreSQL release per year. There were a few posts about cool stuff built on top of PostgreSQL, a few posts about progress of the 9.6 development (e.g. when the parallel query got committed) etc.
Which is a form of conflict resolution. It requires the application to be aware of the datastore.
I wonder if, since BDR is just a plugin now, a plugin that used strong consistency guarantees could be built using the same changes that were required for BDR.