SQLite FTS5 Extension

simonw · 2024-08-09T06:02:25.000000Z

This is hugely underrated in my opinion: it’s a very competent search engine.

It also ships as part of the Python standard library, so if your machine has Python installed you have a high quality search engine ready to use without installing anything else.

I have a CLI tool (and Python library) for working with it here: https://sqlite-utils.datasette.io/en/stable/cli.html#configu...

anitil · 2024-08-09T06:37:25.000000Z

On a lark I used this to build a local-first code search tool for my organisation. Of course I used datasette for running the queries.

Other than getting all my git access revoked and receiving a very concerned email from our security team, it was a great project.

marvel_boy · 2024-08-09T11:21:13.000000Z

Wait, why they revoked git access?

Retr0id · 2024-08-09T11:27:16.000000Z

Someone unexpectedly ingesting all git repos at once would look a lot like a compromised workstation being used to harvest company data.

I assume everything was fine once they explained what was going on.

anitil · 2024-08-12T01:14:58.000000Z

Yes you're exactly correct. They were worried I'd had my laptop compromised or stolen but after a little bit of back and forth I think they just thought it was funny and restored my access

rcarmo · 2024-08-09T07:02:51.000000Z

I have been using it as my only searching engine for my blog in years and it’s much better than anything else I’ve thrown at it. For bonus points, peewee ORM supports it directly.

barrenko · 2024-08-09T09:24:03.000000Z

I've got half of my mind to use this as a search for legal documents and "pretend" it's an LLM.

wahnfrieden · 2024-08-09T06:33:01.000000Z

though useless for many non english languages

epcoa · 2024-08-09T07:08:43.000000Z

Eh? The tokenizers are fully pluggable, the Unicode and ICU tokenizers will be good for many purposes and you can always drop something specific in.

Eg https://github.com/wangfenjin/simple

wahnfrieden · 2024-08-09T07:24:06.000000Z

I know it’s extensible, with original work. ICU isn’t helpful for some languages

AlexeyBelov · 2024-08-10T06:33:16.000000Z

For what languages? Why ICU isn't helpful? My naive guess would be "it works for anything supported by Unicode".

wahnfrieden · 2024-08-10T15:55:10.000000Z

languages that don't use spaces like Japanese

troupo · 2024-08-09T12:58:24.000000Z

Of all/many opensource projects I looked at, only meilisearch is good for non-English languages: https://www.meilisearch.com/docs/learn/resources/language

djhn · 2024-08-10T08:54:15.000000Z

Unfortunately it still works very poorly for languages with compound words (German, Swedish, Finnish) where you would usually want to surface search results if someone uses the second or third part of the compound word, otherwise the user has to enumerate the (often too many) variants. Meilisearch doesn’t support suffix search and their tokenizer can’t handle these languages.

For instance searching for ‘___missilestrike’ in Kyiv would ideally surface all incidents, regardless of launch mode (surface-to-surface, air-to-surface…), range, propulsion, warhead, or guidance system. In these languages a number of these would form the prefix of that compound word.

wahnfrieden · 2024-08-09T13:08:20.000000Z

Japanese support looks good. But it looks like I can’t deploy this inside an iOS app :(

plq · 2024-08-09T16:56:13.000000Z

Why not?

wahnfrieden · 2024-08-09T20:47:57.000000Z

I need an offline client side db/ fts (for this language learning app: https://reader.manabi.io )

plq · 2024-08-10T16:39:07.000000Z

To my knowledge, meili is free software (MIT license) so theoretically, it should be possible to embed it to an IOS app.

wahnfrieden · 2024-08-17T03:35:00.000000Z

it's more about the tech it uses... it's oriented toward docker based deploy and is made in rust so maybe it can be ported to iOS, I'd have to spend time forking the packaging. looks like it's geared toward server environments

keithalewis · 2024-08-09T06:59:28.000000Z

[flagged]

zarathustreal · 2024-08-09T10:30:09.000000Z

My brother in Christ, releasing a tool for free does not exempt it from criticism

nxicvyvy · 2024-08-09T08:20:12.000000Z

Shh, it's a secret, don't tell people.

iansinnott · 2024-08-09T05:54:18.000000Z

It's a very useful feature of sqlite and it also works great in-browser using wa-sqlite[0]. Example, if anyone's curious [1].

[0]: https://github.com/rhashimoto/wa-sqlite [1]: https://github.com/iansinnott/prompta/blob/master/src/lib/mi...

jbaiter · 2024-08-09T06:41:48.000000Z

I'd go as far as claiming that the WASM port of SQLite with FTS5 enabled makes SQLite the best general purpose client-side search engine currently on the market. And with the option to further customize it with custom tokenizers and auxilliary functions, you can do pretty advanced stuff with it, rivalling Lucene (not completely though, e.g. FTS5 does not store offsets, so highlighting is more expensive than in Lucene).

ThatPlayer · 2024-08-09T08:40:13.000000Z

I've been trying something similar, but using https://github.com/mmomtchev/sqlite-wasm-http to stream the database over http for a SPA without a backend. It's actually able to do searches (for queries that aren't super short) without downloading the entire FTS table.

iansinnott · 2024-08-09T15:10:42.000000Z

Interesting, thanks for sharing. I didn't realize you could use search features when streaming the db over a network.

gyf304 · 2024-08-09T07:33:55.000000Z

If you want to index HTML documents using SQLite FTS5, I have a HTML (pseudo) tokenizer plugin https://github.com/gyf304/sqlite3-fts5-html

gudzpoz · 2024-08-09T09:21:28.000000Z

I once tried to use sql.js [1] on a static site for full text search. It worked, but the resulting database size for that site was too large for the web, even with things like detail=none and content='' applied, and requiring the user to download a database each time was just no go. (I guess things should work better for sites with less content or those not requiring a trigram tokenizer.)

I switched to Pagefind [2] afterwards before finding out a sql.js-httpvfs [3] fork of sql.js that removes exactly the need to fully download a database (with HTTP range requests). I haven't got the chance to test sql.js-httpvfs out though, but it looks pretty sound and could be much more flexible than Pagefind. (Previously discussed at https://news.ycombinator.com/item?id=27016630 .)

[1] https://github.com/sql-js/sql.js/

[2] https://pagefind.app/

[3] https://github.com/phiresky/sql.js-httpvfs

qwertox · 2024-08-09T07:02:53.000000Z

FTS5 got added to SQLite in October 2015 (3.9.0). But it's nice seeing it trending here on HN.

mcbetz · 2024-08-09T06:14:44.000000Z

Does anyone have successfully worked with Non-English text with FTS5 in Sqlite? I could not find any reference for German, e.g. and the default stemming does not seem to work properly (given some short tests).

sgbeal · 2024-08-09T11:23:23.000000Z

> Does anyone have successfully worked with Non-English text with FTS5 in Sqlite? I could not find any reference for German, e.g.

We use it in the Fossil SCM project and users have reported success with Chinese and Russian, so it presumably works fine with any European/Germanic language.

> and the default stemming does not seem to work properly (given some short tests).

The Porter Stemmer is documented as only being useful for English.

djhn · 2024-08-10T09:01:08.000000Z

It has pretty much the same support for other languages as most text mining tools and Elasticsearch via the snowball stemmer: https://github.com/abiliojr/fts5-snowball

Should work well for German, I’m using it with Nordic languages.

atoav · 2024-08-09T06:35:49.000000Z

Not that I tried it, but this problem seems to be related: https://stackoverflow.com/questions/45681645/how-to-enable-f...

Just that you would need to tokenize the right characters for your target language (e.g. ÜüÖöÄäßẞ¹), maybe those are already included in Unicode61.

¹: Yeah there is now a capital Eszett

freeqaz · 2024-08-09T06:59:42.000000Z

Does anybody know how to make it work with UUID as primary keys? I got stuck on that a while back and it randomly corrupts when using rowid. Couldn't figure out a solution other than migrate my schema to use a sequential integer instead.

mischa_u · 2024-08-09T07:25:40.000000Z

Use an UNINDEXED column, see https://www.sqlite.org/fts5.html#the_unindexed_column_option

bomewish · 2024-08-10T03:25:41.000000Z

We are running FTS5 with a bunch of sqlite databases for an internal search tool, one with 15m records (we sharded that one, one shard per core), others with 200k-2m records. Speed is very fast -- the biggest dbs are hundreds of ms response time, but others are sub 10 or tens of ms. By _far_ the simplest search option for a small organization where a lot of the data is already in a SQL db, and we don't have the resources to stand up and maintain an ES cluster.

djhn · 2024-08-10T09:13:23.000000Z

I’m interested in the sharding approach myself but those hundreds of ms sound a bit slow.

Do you have to do anything special to guarantee each sqlite instance executes in parallel with the others on its own core? Does the language you’re calling them from support concurrency or parallelism?

bomewish · 2024-08-11T00:26:30.000000Z

I just did some tests and the perf is actually worse than that — but it’s actually still perfectly fine for our use cases.

- 15m row db will return a search for an extremely common term in ~3s (9m rows have that term);

- When adding terms, so that only ~200k rows have the combination, response is ~2s

- That 15m db was in 16 shards. (I’m actually not sure why it should be so slow; doesn’t actually make sense given that a single 1m row db is not that slow, might look into it but not a huge priority)

- For a ~200k row db, search is ~150ms for three terms that appear in ~1.5k entries

- Obviously the network io is the slower piece anyway

This perf is fine for our use case — a research system for internal users. Certainly worth the trade-off of not having to deal with a more complex database like elasticsearch, or even PG with the tantivy extension (which we might switch to some day).

For sharding — we only shard the huge one. To make sure bm25 rankings are still _roughly_ the same as they would be if we did not shard, we just randomised the rows assigned to the num_core shards. It’s a 16 core machine so 16 shards.

We know/ensure each db file gets its own core because we use aioprocessing in python. It handles both multicore and async. Running a query with htop shows all the cores light up (unlike with unsharded).

It’s all about trade-offs in the end. We cache the first search then pre cache the next page (and so on) so the user only has to wait 2-3s when searching the big db for that initial search and after that it’s pretty snappy. Most searches are on much smaller dbs (thousands to hundreds of thousands of rows) and results there are often 10ms or something. No user complaints so far.

jezek2 · 2024-08-09T07:51:54.000000Z

FTS5 works great. The only issue I had is that the syntax for the queries is quite irregular. Had problems both with understanding it exactly and translating to it from the usual format of space delimited keywords, exact strings in quotes, etc.

I wish it would have an API instead where I could put the parts of the query exactly (as nodes in a tree) without the need to translate to this irregular syntax.

outcoldman · 2024-08-09T10:57:49.000000Z

I use it in one of my macOS apps (ShellHistory) for storing and searching history from shell (zsh, bash, fish). Works very well.

https://loshadki.app/shellhistory/

kevinak · 2024-08-09T10:06:18.000000Z

Using this for our new Svelte community website. Very pleased with it so far.

thunderbong · 2024-08-09T10:35:21.000000Z

What's the url for the Svelte community website?

kevinak · 2024-08-09T10:48:30.000000Z

The old one lives at https://sveltesociety.dev - be aware, lots of very old content, but components and packages pages are up-to-date.

The new one lives here but is under heavy construction: https://v2.sveltesociety.dev

Ironically, the Search doesn't work on the deployed version at the moment.

Edit: Looks like the search actually works if you're logged in.

m_fayer · 2024-08-09T07:12:18.000000Z

Has anyone tried this out on mobile? I have a list of app ideas that all center around shipping a big database to a device and allowing complex querying while offline.

wahnfrieden · 2024-08-09T07:28:55.000000Z

m_fayer · 2024-08-09T09:28:18.000000Z

How’d it go?

wahnfrieden · 2024-08-09T13:03:28.000000Z

GRDB on iOS and macOS

It’s easy and works well

It is slower than Realm in some benchmarking (I forget if it’s reads or writes) and might be less memory friendly (Realm does lazy evaluation). You need to use some third party open source to make sqlite faster by precompiling model definitions, which I didn’t bother with. I currently only use it for fts5 and Realm for other needs, but would like to try using it for more.

I also recommend looking at https://skip.tools which has cross platform SQLite for iOS and Android (you write your app in swift and SwiftUI and it generates the Android project and kotlin code)

Btw I use GRDB in my iOS/macOS app here: https://reader.manabi.io Manabi Reader, a Japanese learning app. I use SQLite for dictionary searches which works ok for Japanese only because I’m only searching dictionary expressions and not sentences

m_fayer · 2024-08-10T10:47:16.000000Z

Thanks for the input! Did you need to ship your own binaries of sqlite that include fts5 or does skip do that for you?

wahnfrieden · 2024-08-10T15:54:39.000000Z

I haven't used Skip yet, but the two founders are very responsive on twitter / elsewhere if you want to ask about that. I used GRDB for FTS5

philippta · 2024-08-09T11:46:15.000000Z

Not sure if I'm holding it wrong, but when I played with it I nowhere got results that are anything remotely like a proper search engine (thinking of Algolia, Elasticsearch, etc.).

It was good at finding the text I was searching for but in terms of ranking them, it felt like it needed a lot of post processing.

meindnoch · 2024-08-09T12:09:32.000000Z

Have you used ORDER BY bm25(...)?

spa5k · 2024-08-09T06:51:42.000000Z

Can you do fuzzy search with it?

jbaiter · 2024-08-09T07:18:03.000000Z

Not by default, no. But you could kind of implement it by providing a custom tokenizer that emits multiple terms for the same position in the document, with different variants of the same token. This would not be "proper" fuzzy search, but might be enough depending on the use case.

See https://www.sqlite.org/fts5.html#synonym_support for more details on the different approaches for implementing synonyms in custom tokenizers.

spa5k · 2024-08-09T07:33:14.000000Z

I don't think that will work for me, since I needed something that can handle mistakes in the words, like Du'ha to duha etc, Rahman to rehman, basically whatever looks closest.

jbaiter · 2024-08-09T08:37:09.000000Z

One thing you could do: FTS5 has the `fts5vocab` virtual table [1] that has all the terms. You could provide a user-defined function that computes the levenshtein distance between your query terms and the terms in that table, obtain candidate terms that way and build a big query that searches for all those lexically close terms.

[1] https://www.sqlite.org/fts5.html#the_fts5vocab_virtual_table...

jerrygenser · 2024-08-09T12:07:32.000000Z

I can confirm an approach like this works in practice.

Although instead of levenshtein I use spellfix (maybe it uses that under the covers? not sure). If there is no match from the first search, I use the sqlite spellfix extension [0] to find matches. Then feed those candidates into the terms.

https://www.sqlite.org/spellfix1.html

jbaiter · 2024-08-09T12:39:16.000000Z

Ah, that's great, didn't know about that :-) It seems to use a similar edit-distance algorithm under the hood, and the docs explicitely mention integration with the FTS extension, so this is probably the way to go!