The state of full text search in PostgreSQL 12 (fosdem.org)
229 points by fanf2 on Feb 3, 2020 | hide | past | favorite | 50 comments



In the later part of the slides, I learned that Postgres 12 now supports case and accent insensitive collations. https://www.postgresql.org/docs/current/collation.html#COLLA...

For the many simple cases where a grep-like search is enough, my little experience with pg was far from satisfying. MySQL is simple to use for this thanks to the awfully named `utf8mb4_unicode_ci`. With Postgres, searching for "étranger" won't find "Etranger" or "Étranger" unless you put some hard work into your database search system.

Unfortunately, it seems that PG12 won't change much in that aspect. From the link above, you have to create the collation (which is already a heavy restriction) and then "certain operations are not possible with nondeterministic collations, such as pattern matching operations." Looks like uncharted territory.

On a side note, I've always seen Solr, SphinxSearch and such specialized tools used instead of the DB's FTS. The main drawback is keeping the data in sync, but it has many pros.


CREATE EXTENSION unaccent;

begin;

CREATE OR REPLACE FUNCTION public.insensitive_query(text) RETURNS text AS $func$
  SELECT lower(public.unaccent('public.unaccent', CAST($1 AS text)))
$func$ LANGUAGE sql IMMUTABLE;

ALTER FUNCTION public.insensitive_query(text) OWNER TO [your user that has create rights on the database];

commit;

To use in a query:

WHERE insensitive_query(some_table.name) LIKE insensitive_query('%Étranger%')


If you use a function like this, it will need to be run for every entry; doesn't that invalidate the whole point of having indexes ?

What you can do is store the unaccented, simplified version of a word, which I'm sure pg already does in a way.


You can also create an index on the result of the function[0].

[0] https://www.postgresql.org/docs/12/indexes-expressional.html


I forgot to include the index and it is too late to edit my comment.

CREATE INDEX index_name ON some_table (public.insensitive_query(name));


> If you use a function like this, [each query] it will need to be run for every entry


No; if you create an index on the expression `insensitive_query(some_table.name)` and you then run: `SELECT * FROM some_table WHERE insensitive_query(name) LIKE insensitive_query('Étranger');`, the engine runs the `insensitive_query` function once on the string `'Étranger'`, and then looks up that result in the index directly, which is O(log n) or faster. If the table has a million rows, that'd run the function only once (on `Étranger`) and do about 20 comparisons in the index before finding you the qualifying row(s).
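The expression-index idea above can be sketched in plain Python: the normalizing function runs once per row at index-build time and once per query, and the lookup itself is a binary search. The `insensitive` helper here is a hypothetical stand-in for the SQL `insensitive_query` function, and the data is made up for illustration.

```python
import bisect
import unicodedata

def insensitive(s: str) -> str:
    # Hypothetical stand-in for insensitive_query(): lowercase + strip accents.
    nfd = unicodedata.normalize("NFD", s.lower())
    return "".join(c for c in nfd if not unicodedata.combining(c))

rows = ["Étranger", "Zola", "étranger", "Camus"]

# The "index": the function applied once per row at build time, kept sorted.
index = sorted((insensitive(name), name) for name in rows)

# At query time the function runs once on the search term;
# the lookup in the sorted structure is O(log n).
key = insensitive("Étranger")
i = bisect.bisect_left(index, (key, ""))
matches = [orig for k, orig in index[i:] if k == key]
print(matches)  # ['Étranger', 'étranger']
```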


Cheers. I did not get the index-on-a-derived-value bit.


This is a lot more graceful than some of the other workarounds I've seen/tried before.

Thank you and I feel a little sheepish I didn't do it first!


I'm getting the impression that recreating functionality from Lucene on top of RDBMS abstractions is certainly possible, but requires a lot of effort from devs and you still need to understand stuff like tokenization, stemming, inverted indexes etc. if you want to use it. At this point one may as well just learn Solr or Elasticsearch. And these are still better at searching at the low level, although of course you sacrifice consistency guarantees. This is a legitimately different problem arena, not NoSQL for the sake of it.


I see your point, but I also see why RDBMS are appealing. We just built a documentation portal for our customers, and we initially used Elasticsearch for as-you-type full text search. I ended up redoing the implementation with SQLite (using fts5) and a simple Express API on top of it. The resource consumption is 3 orders of magnitude lower, the speed is the same and the quality of the results comparable. Elastic was crashing on us due to excessive heap memory consumption, even though 2 GB of RAM was assigned to it and we had two nodes. I thought the resource consumption was crazy for a light search use case, and found out that SQLite can do the same with 20 MB of RAM.
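The FTS5 approach described above is only a few lines. A minimal sketch via Python's bundled sqlite3 (assuming a build with FTS5 enabled, which is the norm in modern CPython; table and documents are made up). Prefix queries like `searc*` are what make as-you-type suggestions cheap:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table: tokenization and the inverted index come for free.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Install guide", "How to install the agent on Linux hosts"),
        ("Search tips", "Prefix queries power as-you-type search boxes"),
    ],
)

# A prefix query, as an as-you-type search box would issue it.
rows = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank",
    ("searc*",),
).fetchall()
print(rows)  # [('Search tips',)]
```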


> recreating functionality from Lucene on top of RDBMS

Strange mix of assumptions there, since searching text strings has been Job One of computer search since 'forever'. If Lucene has done some (search) things well, there is no reason that others do not have a chance to also do those things well. In fact, encourage it, foster it, engage with the process that brings new software capabilities.


Sure, I meant all the structures and algorithmic stuff we do to avoid, as much as possible, straight-up scanning strings. I came to much appreciate the work accumulated, layer by layer, in Lucene. Of course you can redo it elsewhere and it's a worthy programming enterprise, but often not directly related to what you want to accomplish with your application.


PG could really do most of it if they got serious about https://github.com/postgrespro/rum with TF/IDF ranking.


For MySQL, most of the time it is better to not use a real collation at all and just use utf8mb4_bin. This disables weird features like case insensitivity. You do need to make sure you properly encode strings, especially if you are working with combining code points. But it is much simpler and a little more efficient.


http://www.public-software-group.org/pg_collkey exists, first release in 2006 for PostgreSQL 8.

Those who wanted Unicode-compliant collation support had already installed it years ago.


The actual slide deck (that is easy to miss) is here:

https://fosdem.org/2020/schedule/event/postgresql_the_state_...


Really well put together - I learned several more things about PG text searching from this.

From my personal experience working with dozens of California tech companies of varying sizes, I'd say around 80% of the time I see Elasticsearch being used when the product could have just used its existing PG cluster (and existing team experience). That is, many companies pick up Elasticsearch only because they're unaware of PG's full-text search capabilities. Of course, ES does have its advantages, but I feel they apply in very specific use cases.


PostgreSQL is awesome, but I don't see any advantage when you need to add like 40 aggregations plus joins across 10 tables (easy if you denormalize into Elastic).


I can think of a bunch off the top of my head that mostly stem from a serious reduction in complexity:

- having one data store instead of two

- no data store synchronization and related consistency headaches

- the ability to easily join/restrict based on business logic (especially access control)

- easier deployments

- easier qa/local setups

- fewer failure scenarios

- although non-standard, you're dealing with SQL and can EXPLAIN/ANALYZE your queries with tools you already know if you understand Postgres

- the ability to extend my existing (Spring Boot) integration tests to test new searching logic without having to worry about a separate test db.

I've been building a travel blogging platform as a side project that has about ~250 users. Through a couple of choice materialized views to create search documents from a number of different tables, Postgres FTS, and some GIN indices, I was able to make a lightning fast search with intelligent suggestions (i.e. using user-specific restrictions such as follower/following, places a user has posted, etc).

Is my search perfect? No. I could do a better job normalizing diacritics, stemming could be improved, etc. But it was dead simple to implement well enough in a very short amount of time. I'm very happy with my current solution. There's a time and place for a dedicated search solution but it's certainly not in fledgling systems without massive load.


Ever consider doing a write up of how you implemented your Postgres FTS? Sounds like you’ve had a good experience with it.


Missed the part about serving the response below 50ms :) Not too much traffic, between 40-50 req/s, and a lot of other more important things to be stored in Postgres (transactions, stocks..)


How do you denormalize to Elastic without performing joins on your normalized data?


Haha, just joining them ..


Looks like they’ve taken it down now? Damn that’s unfortunate


Correct link is on the main page.


This sounds like it's about the problem of: I have a column filled with the last names of lots of people and I want a text search that will find these names, but, which takes into account all the various ways people end up writing that name. So, a search for 'mueller' should also find 'Müller'.

However, that problem sounds nearly intractable to me.

In essence it sounds like you just 'canonicalize' everything. You store, alongside 'Müller', also the string 'mueller' (the rule being: first transform all characters into their asciification, with the asciification of ü defined as 'ue', then lowercase that).

The problem is, of course.. that searching for 'MULLER' should also find this person. And now we're more into trying to get this done by first deconstructing the unicode into ascii + accent marks and then just wiping out the accent marks.

But once we state that 'mueller', 'muller', and 'müller' need all be 'equal' to each other, where does it end? A last name can be from any of a thousand languages, each with their own unique conventions on how these things are asciified, and any given asciification must match any other. Sounds like a combinatorial explosion.

One could disregard this notion and say that 'muller' does not match 'müller'. However, note that generally, Sjögren's syndrome tends to be asciified as 'Sjogren's syndrome', probably because Sjögren was Norwegian and I assume that the Norwegian asciification of ö is o. The German asciification of ö is oe however; and I can't tell from a last name whether it is Germanic in origin, Norwegian in origin, or from the land of fairytales and flying pandas, such as Häagen-Dazs.

Just by thinking about this for a while I have concluded that it's hopeless, but perhaps I'm missing some fancy trick to at least make this problem tractable. So far my attempts to look for solutions tend to come down to advice to use Unicode normalization, then get rid of all accents, then lowercase (which obviously doesn't work at all: with that strategy, mueller would not find müller), or heuristic solutions (if a lot matches, good enough), which are their own can of worms.

Is there a solution to this to me seemingly impossible problem?


The solution is to stop trying so hard.

As you said, the problem is fuzzy, so you do... a fuzzy search, return a bunch of results, and let the user pick. If the dataset is too large, then you just let the search include more details. "Mueller from Berlin" or "Muller born on July 7th" are much better queries to find a specific person than a perfectly written "Müller".

In fact, if you were to crack the problem of searching by considering all applicable rules, the system would still be useless because people make mistakes all the time and the Müller guy actually wrote "Miller" while signing up ;P


Yes, for me the solution was to stop thinking about search problems as deterministic problems with a solution that can be (easily) found. It's a very complex domain space; Google is still improving their search, with many people having worked on the problem for more than 20 years.

So it is unreasonable for me to think that I can do a name search (I have exactly the same feature in my product) perfectly from scratch just by thinking hard, just because name search is a narrower, hence simpler, domain (lol).

So I implemented something, got a round of feedback, and implemented the other forms that users used. My search will display Sjögurd even if the user typed in Sjoegurd, despite the name being Norwegian? Probably, and what's the downside?

My search will work imperfectly for some lesser known languages, like Georgian? Most certainly, and users from these cultures expect that, they will sigh and deal with it. I know, because I too come from a country with non-latin alphabet and we have lots of different transcriptions into it. It’s ugly, yes. Because the world is ugly :)


This is, if not a solved problem, then certainly an addressed one.

Your specific example of:

> So, a search for 'mueller' should also find 'Müller'.

Is discussed here:

https://discuss.elastic.co/t/u-umlaut-search-indexing-user-n...

If your question is specifically "How do I use a feature that PostgreSQL 12 doesn't have?" then you've got a different problem.

[EDIT] Refer to the modules mentioned on slide 26 - these may help with your indexing challenges within PostgreSQL.


Have you looked into Soundex or Metaphone? I myself have not used them. I just saw them mentioned in PHP's function list and thought they might be useful some day.

- https://www.php.net/manual/en/function.soundex.php

- https://www.php.net/manual/en/function.metaphone.php

PostgreSQL has an extension, fuzzystrmatch, that uses the same algorithms:

- https://www.postgresql.org/docs/12/fuzzystrmatch.html

But it warns that the functions "do not work well with multibyte encodings (such as UTF-8)." So maybe first translate any such characters to ASCII?
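Soundex itself is simple enough to sketch in a few lines. Below is a simplified version of the classic American Soundex (it treats h/w like vowels, which the full rules don't), written in Python for illustration; fuzzystrmatch's `soundex()` implements the same family of algorithm server-side. Note that it collapses 'Mueller' and 'Muller' to the same code, which is exactly the property being discussed:

```python
def soundex(name: str) -> str:
    # Map consonants to their Soundex digit classes.
    mapping = str.maketrans("bfpvcgjkqsxzdtlmnr", "111122222222334556")
    encoded = name.lower().translate(mapping)

    # Collapse runs of the same digit (e.g. the 'll' in Mueller -> one '4').
    out = [encoded[0]]
    for prev, cur in zip(encoded, encoded[1:]):
        if cur != prev:
            out.append(cur)

    # Keep only digits; the first letter keeps its alphabetic form,
    # so its own code (if any) is dropped.
    digits = [c for c in out if c.isdigit()]
    if out[0].isdigit():
        digits = digits[1:]

    # Pad/truncate to the usual letter + 3 digits.
    return (name[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Mueller"), soundex("Muller"), soundex("Robert"))
# M460 M460 R163
```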


These only work with English, so not a general solution. There are more modern, better tools for the problem.


I think a practical solution to this is to return all words that are within a Levenshtein distance[1] of 2*number_accented_characters of the original word, ranked by the distance.

Levenshtein distance can be calculated using a trie[2] so it's fast enough for most purposes.

If you wanted to get fancy you could modify the Levenshtein to reduce the distance penalty on accented characters, and/or only apply the Levenshtein calculation within 1 character of an accented character or the equivalent ASCII substitution sequence.

[1] https://en.wikipedia.org/wiki/Levenshtein_distance

[2] http://stevehanov.ca/blog/?id=114
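The plain Levenshtein distance the comment builds on is a short dynamic program; a sketch in Python (the trie-based pruning and the reduced penalty for accented characters suggested above are left out). Note 'mueller' vs 'müller' comes out at distance 2 (one substitution, one deletion), which is what the accent-aware scaling of the threshold is meant to absorb:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic edit-distance DP, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                 # deletion
                cur[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),    # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # 3
print(levenshtein("mueller", "müller"))   # 2
```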


There are various metaphone algorithms that produce an encoding that handles these cases, at least for English and some of the other European languages from which words and names have made their way into English.

The "fancy trick" is basically that search still works pretty well if you just throw away all the vowel sounds - your examples of "mueller" and "Müller" both encode as: MLR

https://en.toolpage.org/tool/metaphone


I'm not convinced these aren't real problems. Someone has heard a name. They type what they think it looks like. Postgres finds the right answer.


> Is there a solution to this to me seemingly impossible problem?

Not at all impossible, you're just not familiar with Unicode. Just use software that implements UTS #10/UCA. Reply if you want to see code examples for the problems you stated.


The video should be uploaded soon. See the FOSDEM video status page: https://review.video.fosdem.org/overview


Thanks. I was just about to comment that the article seems to be missing a video!



I hooked it up and had a quick play around with it for a proof-of-concept project at work. Unfortunately that project never went ahead in the form that would have used ZomboDB, but during testing it was a breeze to set up and seemed to do everything it promised, more or less straight out of the box.

I'd definitely give it another, proper go if I got the chance.


Thanks for the positive feedback. I'm sorry your project didn't go in a direction where ZDB would be useful, but I'm thrilled you had a positive experience with ZomboDB.


Thanks! Looks like I’ll invest some time into setting it up. Doing the current tsvector with the English dictionary for stems and zombodb seems closer to what I want.


ZomboDB developer here. With ZDB you don't need tsvector at all. With a properly defined mapping, Elasticsearch can do the stemming for dozens of languages, including English.

ZDB exposes darn near everything ES supports.


Are you? Never heard of that, but it looks pretty cool.


Still no relevance algorithm support?

BM25 or TF-IDF would be welcome.


Would like to see this work integrated: https://github.com/postgrespro/rum


This looks great! Been considering moving from SQL Server to Postgres for our .NET Core apps' storage and search, rather than use a separate service like Manticore or Elastic. I wish Microsoft SQL Server had easier and better full-text support, now that their cross-platform development framework and tooling has improved so much.


[flagged]


Ad for what? The free software conference or the free software itself, both free as in both beer and freedom as well as well established and common topics on HN?


The slide deck on that page is very useful. It has a lot of examples.



