This is neat. I notice that you do similar reverse lemmatisation as my Wiktionar...

hiAndrewQuinn · 2025-05-06T19:41:26 1746560486

Ha, what an honor to finally be noticed by Nuenki guy. Big fan of the project.

I did consider doing something Wiktionary-centric, and in fact have a Wiktionary JSONL scrape lying around courtesy of https://kaikki.org/ (from the same guy who created SSH!). `tsk` does something similar to your dereferencing when it hits a "go deeper" phrase.
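
For anyone wondering what that kind of dereferencing looks like, here is a rough Python sketch. It is not `tsk`'s actual code: the `form_of` key, the toy entries, and the `dereference` helper are all made up for illustration. The idea is just to keep following "form of X" pointers until you land on an entry that is not a form of anything else.

```python
# Toy stand-in for a Wiktionary-derived dictionary. The entries and the
# "form_of" key are illustrative assumptions, not tsk's real schema.
ENTRIES = {
    "talo": {"gloss": "house"},
    "talossa": {"form_of": "talo", "gloss": "inessive singular of talo"},
    "talossani": {"form_of": "talossa", "gloss": "possessive form of talossa"},
}


def dereference(word, entries, max_hops=5):
    """Follow form_of links until reaching an entry that is not a form of another.

    Returns None if the word (or any link target) has no entry at all.
    """
    seen = set()
    for _ in range(max_hops):
        entry = entries.get(word)
        if entry is None:
            return None  # nothing catalogued for this word
        target = entry.get("form_of")
        if target is None or target in seen:
            return word  # reached a base entry, or hit a cycle
        seen.add(word)
        word = target
    return word  # gave up after max_hops; return where we ended


print(dereference("talossani", ENTRIES))
print(dereference("junttihenkinen", ENTRIES))
```

The second call returns `None` because the toy dictionary has no entry for that word, which is the failure mode a purely dictionary-driven lookup runs into.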

I decided against that approach in favor of the libvoikko spell checker because Finnish lies in this interesting zone of being an agglutinative language with a really, really regularized orthography. People love their neologisms here, and unfortunately most of them aren't catalogued in Wiktionary quite yet. I've found the mechanistic approach covers a lot of those edge cases well.

Take the word junttihenkiseni - the root form is junttihenkinen, but as of 2025-05-06 https://en.wiktionary.org/wiki/junttihenkinen does not actually exist. So `tsk` will have no data for it, but `finstem` with its mechanical approach works just fine. The word was originally coined to refer to Finland's unique spin on rock music in the 1970s and 80s.

On a broader level, if I can avoid hitting the network with small personal projects like this, I do try to. For example, `tsk` comes bundled with every Finnish word that has an English dictionary entry in Wiktionary, as a ~25 MB embedded JSONL file, and that allows us to build the randomly pruning trie that lets us get instantaneous prefix search across such a large space of things. I have met a lot of people who want to move to Finland from places where Internet is a sparse and valuable commodity, and I think their lives are much improved by having a tool they can just download one time and then use any place their laptop can be powered on.
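
To make the prefix search idea concrete, here is a minimal Python sketch of a plain trie built from JSONL lines. It is an assumption-heavy toy, not how `tsk` is implemented: the sample lines, `build_trie`, and `prefix_search` are invented here, and it is an ordinary trie with no pruning strategy at all. The only thing borrowed from the real data is that each JSONL line is an object with a `word` key.

```python
import json


class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False  # True if a word ends at this node


def build_trie(words):
    """Insert every word into a character trie and return the root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def prefix_search(root, prefix, limit=10):
    """Return up to `limit` words starting with `prefix`, in alphabetical order."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []  # no word has this prefix
    results = []
    stack = [(node, prefix)]
    while stack and len(results) < limit:
        current, text = stack.pop()
        if current.is_word:
            results.append(text)
        # Push children in reverse order so the smallest is popped first.
        for ch in sorted(current.children, reverse=True):
            stack.append((current.children[ch], text + ch))
    return results


# Hypothetical sample standing in for the embedded JSONL file.
SAMPLE_JSONL = "\n".join([
    '{"word": "talo"}',
    '{"word": "talvi"}',
    '{"word": "tali"}',
    '{"word": "koira"}',
])

words = [json.loads(line)["word"] for line in SAMPLE_JSONL.splitlines()]
root = build_trie(words)
print(prefix_search(root, "tal"))
```

Lookup cost depends on the prefix length and the number of results returned rather than on the total number of words, which is why this kind of structure feels instantaneous even over a large word list.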

Alex-Programs · 2025-05-09T11:21:14 1746789674

I'm the "Nuenki guy"!?!? Thanks!

And yeah, I used the same data source, just heavily processed. It's a great project!

I like your approach. I wish English used neologisms more; I use them occasionally, and it's just quite fun to create a new, lexically valid word to describe something novel.

The Nuenki dictionary database is a bit over a gigabyte, albeit with ~30 languages in it, so yeah that's definitely a bonus! JSONL probably compresses quite well, too. I added compression to Nuenki's (serialised-struct) dictionary entries a while back and it reduced the size by about 30%, iirc.
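
If anyone wants to check how well their own JSONL compresses, a quick Python sketch like this would do it. The entries are synthetic placeholders I made up (real dictionary data will compress differently), and gzip is just a convenient standard-library choice, not what Nuenki uses.

```python
import gzip
import json

# Synthetic placeholder entries; swap in real JSONL lines to get a
# meaningful number.
entries = [
    {"word": f"sana{i}", "pos": "noun", "glosses": [f"example gloss number {i}"]}
    for i in range(1000)
]

# One JSON object per line, encoded as UTF-8 bytes.
raw = "\n".join(json.dumps(e, ensure_ascii=False) for e in entries).encode("utf-8")
packed = gzip.compress(raw)

ratio = len(packed) / len(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes, ratio: {ratio:.2f}")

# Round trip to confirm nothing was lost.
restored = gzip.decompress(packed)
```

Repeated keys like `"word"` and `"pos"` on every line are exactly the kind of redundancy general-purpose compressors handle well, so JSONL tends to shrink more than a compact binary encoding of the same records.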