For the many simple cases where a grep-like search is enough, my limited experience with Postgres was far from satisfying. MySQL makes this simple thanks to the awfully named `utf8mb4_unicode_ci` collation. With Postgres, searching for "étranger" won't find "Etranger" or "Étranger" unless you put some hard work into your database search system.
Unfortunately, it seems that PG12 won't change much in that respect. From the link above, you have to create the collation yourself (which is already a heavy restriction), and then "certain operations are not possible with nondeterministic collations, such as pattern matching operations." Looks like uncharted territory.
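For reference, this is roughly what PG12's nondeterministic ICU collations look like; a minimal sketch, assuming an ICU-enabled build (the collation, table, and column names are made up):

```sql
-- PG 12+: a nondeterministic ICU collation at strength level 1
-- compares strings while ignoring both case and accents.
CREATE COLLATION ignore_case_accents (
    provider = icu,
    locale = 'und-u-ks-level1',
    deterministic = false
);

CREATE TABLE books (title text COLLATE ignore_case_accents);
INSERT INTO books VALUES ('Étranger');

-- Equality matches across case and accents:
SELECT * FROM books WHERE title = 'etranger';

-- But pattern matching is rejected, as the quoted docs say:
-- SELECT * FROM books WHERE title LIKE 'etrang%';
-- ERROR:  nondeterministic collations are not supported for LIKE
```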
On a side note, I've always seen Solr, SphinxSearch and such specialized tools used instead of the DB's FTS. The main drawback is keeping the data in sync, but they have many pros.
BEGIN;

-- Wrap unaccent() so the function can be declared IMMUTABLE (and
-- therefore used in expression indexes): the two-argument form pins
-- the dictionary explicitly, so the result no longer depends on the
-- session's search_path.
CREATE OR REPLACE FUNCTION public.insensitive_query(text) RETURNS text AS $func$
SELECT lower(public.unaccent('public.unaccent', $1))
$func$ LANGUAGE sql IMMUTABLE;

ALTER FUNCTION public.insensitive_query(text)
OWNER TO [your user that has create rights on the database];

COMMIT;
To use in a query:
WHERE insensitive_query(some_table.name) LIKE insensitive_query('%Étranger%')
No; if you create an index on the expression `insensitive_query(some_table.name)` and then run `SELECT * FROM some_table WHERE insensitive_query(name) LIKE insensitive_query('Étranger');`, the engine runs the `insensitive_query` function once on the string `'Étranger'` and then looks that result up in the index directly, which is O(log n) or faster. With a million rows, that's one function call (on `Étranger`) and about 20 index lookups before finding you the qualifying row(s).
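A sketch of that expression index in SQL, assuming the `insensitive_query` function defined above (index names are made up). Note that a B-tree expression index covers equality and prefix matches; a pattern with a leading `%` needs a trigram index instead:

```sql
-- Expression index over the function result (the function must be IMMUTABLE).
CREATE INDEX idx_some_table_name_ci
    ON some_table (insensitive_query(name));

-- Uses the B-tree index: one function call, then an O(log n) lookup.
SELECT * FROM some_table
 WHERE insensitive_query(name) = insensitive_query('Étranger');

-- For '%...%' patterns a B-tree can't help; a trigram GIN index can:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_some_table_name_trgm
    ON some_table USING gin (insensitive_query(name) gin_trgm_ops);

SELECT * FROM some_table
 WHERE insensitive_query(name) LIKE insensitive_query('%Étranger%');
```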
I'm getting the impression that recreating functionality from Lucene on top of RDBMS abstractions is certainly possible, but it requires a lot of effort from devs, and you still need to understand things like tokenization, stemming, inverted indexes, etc. to use it. At that point one may as well just learn Solr or Elasticsearch. And those are still better at searching at the low level, although of course you sacrifice consistency guarantees. This is a legitimately different problem arena, not NoSQL for the sake of it.
I see your point, but I also see why RDBMS are appealing. We just built a documentation portal for our customers, and we initially used Elasticsearch for as-you-type full-text search. I ended up redoing the implementation later with SQLite (using FTS5) and a simple Express API on top of it. The resource consumption is three orders of magnitude lower, the speed is the same, and the quality of the results is comparable. Elastic was crashing on us due to excessive heap memory consumption, even with 2 GB of RAM assigned to it and two nodes. I thought the resource consumption was crazy for a light search use case, and found out that SQLite can do the same with 20 MB of RAM.
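For the curious, the SQLite side of such a setup is tiny. A minimal FTS5 sketch with made-up names (requires an SQLite build compiled with FTS5):

```sql
-- Virtual table with Porter stemming and Unicode-aware tokenization.
CREATE VIRTUAL TABLE docs USING fts5(title, body, tokenize = 'porter unicode61');

INSERT INTO docs (title, body) VALUES
  ('Getting started', 'Install the package and run the server'),
  ('Search guide',    'Full text search with ranking');

-- Prefix query for as-you-type search, best match first.
SELECT title FROM docs WHERE docs MATCH 'sear*' ORDER BY rank;
```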
> recreating functionality from Lucene on top of RDBMS
Strange mix of assumptions there, since searching text strings has been Job One of computer search since 'forever'. If Lucene has done some (search) things well, there is no reason others don't also have a chance to do those things well. In fact, encourage it, foster it, engage with the process that brings new software capabilities.
Sure, I meant all the structures and algorithmic work we do to avoid, as much as possible, straight-up scanning of strings. I've come to greatly appreciate the work accumulated, layer by layer, in Lucene. Of course you can redo it elsewhere, and it's a worthy programming enterprise, but it's often not directly related to what you want to accomplish with your application.
For MySQL, most of the time it is better not to use a real collation at all and just use `utf8mb4_bin`. This disables weird features like case insensitivity. You do need to make sure you properly encode strings, especially if you are working with combining code points. But it is much simpler and a little more efficient.
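A sketch of what that looks like in MySQL DDL (table and column names made up):

```sql
-- Binary collation: strings compare by code point, so matching is
-- case- and accent-sensitive with no collation surprises.
CREATE TABLE users (
    name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin
);

-- 'Müller' and 'müller' are now distinct. Normalize in the application
-- (e.g. to NFC) before writing, so combining code points such as
-- u + U+0308 vs. the precomposed ü compare as expected.
```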
Really well put together. I learned several more things about PG text searching from this.
From my personal experience working with dozens of California tech companies of varying sizes, I'd say around 80% of the time I see Elasticsearch being used when the product could have just used its existing PG cluster (and existing team experience). That is, many companies pick up Elasticsearch only because they aren't aware of PG's full-text search capabilities. Of course, ES does have its advantages, but I feel they apply in very specific use cases.
PostgreSQL is awesome, but I don't see any advantage when you need to add something like 40 aggregations plus joins across 10 tables (easy if you denormalize into Elastic).
I can think of a bunch off the top of my head that mostly stem from a serious reduction in complexity:
- having one data store instead of two
- no data store synchronization and related consistency headaches
- the ability to easily join/restrict based on business logic (especially access control)
- easier deployments
- easier qa/local setups
- fewer failure scenarios
- although non-standard, you're dealing with SQL and can EXPLAIN/ANALYZE your queries with tools you already know if you understand Postgres
- the ability to extend my existing (Spring Boot) integration tests to test new searching logic without having to worry about a separate test db.
I've been building a travel blogging platform as a side project that has about ~250 users. Through a couple of choice materialized views to create search documents from a number of different tables, Postgres FTS, and some GIN indices, I was able to make a lightning-fast search with intelligent suggestions (i.e. using user-specific restrictions such as follower/following, places a user has posted, etc.).
Is my search perfect? No. I could do a better job normalizing diacritics, stemming could be improved, etc. But it was dead simple to implement well enough in a very short amount of time. I'm very happy with my current solution. There's a time and place for a dedicated search solution but it's certainly not in fledgling systems without massive load.
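A sketch of the kind of setup described (a materialized view as the search document, a weighted tsvector, and a GIN index), with all table and column names made up:

```sql
-- Hypothetical "search document" materialized view over several tables.
CREATE MATERIALIZED VIEW post_search AS
SELECT p.id,
       p.author_id,
       setweight(to_tsvector('english', p.title), 'A') ||
       setweight(to_tsvector('english', coalesce(p.body, '')), 'B') ||
       setweight(to_tsvector('simple', coalesce(pl.place_name, '')), 'C')
           AS document
  FROM posts p
  LEFT JOIN places pl ON pl.id = p.place_id;

CREATE INDEX idx_post_search ON post_search USING gin (document);

-- Ranked query (websearch_to_tsquery needs PG 11+):
SELECT id
  FROM post_search
 WHERE document @@ websearch_to_tsquery('english', 'hiking patagonia')
 ORDER BY ts_rank(document, websearch_to_tsquery('english', 'hiking patagonia')) DESC;

-- Refresh periodically; CONCURRENTLY requires a unique index on the view.
-- REFRESH MATERIALIZED VIEW CONCURRENTLY post_search;
```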
Missed the part about serving the response below 50ms :)
Not too much traffic, between 40 and 50 req/s, and a lot of other more important things to be stored in Postgres (transactions, stocks, ...).
This sounds like it's about the problem of: I have a column filled with the last names of lots of people and I want a text search that will find these names, but, which takes into account all the various ways people end up writing that name. So, a search for 'mueller' should also find 'Müller'.
However, that problem sounds nearly intractable to me.
Basically it sounds like you just 'canonicalize' everything. You store, alongside 'Müller', also the string 'mueller' (the rule being: first transform all characters into their asciification, with the asciification of ü defined as 'ue', then lowercase that).
The problem is, of course.. that searching for 'MULLER' should also find this person. And now we're more into trying to get this done by first deconstructing the unicode into ascii + accent marks and then just wiping out the accent marks.
But once we state that 'mueller', 'muller', and 'müller' need all be 'equal' to each other, where does it end? A last name can be from any of a thousand languages, each with their own unique conventions on how these things are asciified, and any given asciification must match any other. Sounds like a combinatorial explosion.
One could disregard this notion and say that 'muller' does not match 'müller'. However, note that generally, Sjögren's syndrome tends to be asciified as 'Sjogren's syndrome', probably because Sjögren was Norwegian, and I assume that the Norwegian asciification of ö is o. The German asciification of ö is oe, however; and I can't tell from a last name whether it is Germanic in origin, Norwegian in origin, or from the land of fairytales and flying pandas, such as Häagen-Dazs.
Just by thinking about this for a while I have concluded that it's hopeless, but perhaps I'm missing some fancy trick to at least make this problem tractable. So far my attempts to look for solutions tend to come down either to advice to use Unicode normalization, strip all accents, then lowercase, which obviously doesn't work in full (with that strategy, 'mueller' would not find 'müller'), or to heuristic solutions (if a lot matches, good enough), which are their own can of worms.
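That "normalize, strip accents, lowercase" strategy is easy to try in Postgres with the bundled unaccent extension, and it shows exactly the failure described:

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

SELECT lower(unaccent('Müller'));   -- 'muller'
SELECT lower(unaccent('MULLER'));   -- 'muller'  (matches)
SELECT lower(unaccent('Mueller'));  -- 'mueller' (does not match)

-- The German 'ue' transliteration is a language-specific rule that the
-- generic accent-stripping step knows nothing about. unaccent's rules
-- file is customizable, so you could add 'ü -> ue' for a German-only
-- dataset, but then you over-match for every other language.
```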
Is there a solution to this to me seemingly impossible problem?
As you said, the problem is fuzzy, so you do... a fuzzy search, return a bunch of results, and let the user pick. If the dataset is too large, then you just let the search include more details. "Mueller from Berlin" or "Muller born on july 7th" are much better queries to find a specific person than a perfectly written "Müller".
In fact, if you were to crack the problem of searching by considering all applicable rules, the system would still be useless because people make mistakes all the time and the Müller guy actually wrote "Miller" while signing up ;P
Yes, for me the solution was to stop thinking about search problems as deterministic problems with a solution that can be (easily) found. It's a very complex domain; Google is still improving its search, with many people working on the problem for more than 20 years.
So it is unreasonable for me to think that I can build a perfect name search (I have exactly the same feature in my product) from scratch just by thinking hard, just because name search is a narrower, hence simpler, domain (lol).
So I implemented something, got a round of feedback, and implemented the other forms that users actually used. Will my search display Sjögurd even if the user typed Sjoegurd, despite the name being Norwegian? Probably, and what's the downside?
Will my search work imperfectly for some lesser-known languages, like Georgian? Most certainly, and users from those cultures expect that; they will sigh and deal with it. I know, because I too come from a country with a non-Latin alphabet and we have lots of different transcriptions into it. It's ugly, yes. Because the world is ugly :)
Have you looked into Soundex or Metaphone? I myself have not used them. I just saw them mentioned in PHP's function list and thought they might be useful some day.
I think a practical solution to this is to return all words that are within a Levenshtein distance[1] of 2 * number_of_accented_characters of the original word, ranked by the distance.
Levenshtein distances can be calculated over a trie[2], so it's fast enough for most purposes.
If you wanted to get fancy you could modify the Levenshtein to reduce the distance penalty on accented characters, and/or only apply the Levenshtein calculation within 1 character of an accented character or the equivalent ASCII substitution sequence.
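In Postgres you don't even have to implement the distance yourself: the bundled fuzzystrmatch extension provides `levenshtein()`, which is character-based under a UTF-8 encoding. A sketch with made-up table names (the WHERE clause below is a sequential scan, fine for modest tables):

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT levenshtein('muller',  'müller');   -- 1 (one substitution)
SELECT levenshtein('mueller', 'müller');   -- 2

-- Return candidates within distance 2, closest first:
SELECT name
  FROM people
 WHERE levenshtein(lower(name), lower('mueller')) <= 2
 ORDER BY levenshtein(lower(name), lower('mueller'));
```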
There are various metaphone algorithms that produce an encoding that handles these cases, at least for English and some of the other European languages from which words and names have made their way into English.
The "fancy trick" is basically that search still works pretty well if you just throw away all the vowel sounds - your examples of "mueller" and "Müller" both encode as: MLR
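Postgres ships both Soundex and Metaphone in the fuzzystrmatch extension; they are ASCII-oriented, so running unaccent first helps. A sketch:

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Both spellings collapse to the same phonetic code:
SELECT metaphone('mueller', 10);            -- 'MLR'
SELECT metaphone(unaccent('Müller'), 10);   -- 'MLR'

-- Store the code in an indexed column to match phonetically, e.g.:
-- WHERE metaphone(unaccent(name), 10) = metaphone(unaccent('mueller'), 10)
```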
> Is there a solution to this to me seemingly impossible problem?
Not at all impossible; you're just not familiar with Unicode. Just use software that implements UTS #10, the Unicode Collation Algorithm. Reply if you want to see code examples for the problems you stated.
I hooked it up and had a quick play around with it for a proof-of-concept project at work. Unfortunately, that project never went ahead in the form that would have used Zombo, but during testing it was a breeze to set up and seemed to do everything it promised more or less straight out of the box.
I'd definitely give it another, proper go if I got the chance.
Thanks for the positive feedback. I'm sorry your project didn't go in a direction where ZDB would be useful, but I'm thrilled you had a positive experience with ZomboDB.
Thanks! Looks like I’ll invest some time into setting it up. Doing the current tsvector with the English dictionary for stems and zombodb seems closer to what I want.
ZomboDB developer here. With ZDB you don't need tsvector at all. With a properly defined mapping, Elasticsearch can do the stemming for dozens of languages, including English.
This looks great! Been considering moving from SQL Server to Postgres for our .NET Core apps' storage and search, rather than use a separate service like Manticore or Elastic. I wish Microsoft SQL Server had easier and better full-text support, now that their cross-platform development framework and tooling has improved so much.
Ad for what? The free-software conference, or the free software itself? Both are free as in beer and in freedom, as well as well-established, common topics on HN.
Slides link: https://fosdem.org/2020/schedule/event/postgresql_the_state_...
Slideshare: https://www.slideshare.net/vyruss000/the-state-of-full-text-...
Video will appear under these URLs after mirrors are synced:
MP4: https://video.fosdem.org/2020/H.2214/postgresql_the_state_of...
WebM: https://video.fosdem.org/2020/H.2214/postgresql_the_state_of...