
How far do other big languages (Chinese, Arabic, Spanish, Hindi) lag behind English when it comes to full-text search?



The problem here is Unicode normalisation, a standard procedure that replaces equivalent representations of the same glyph (the "character unit" a user sees) until only a canonical sequence of code points remains (grapheme clusters are the groups of code points that render as one user-perceived character). You can do this with libicu before you index text or send query text to sqlite3. There's also an official ICU extension for sqlite3 that does this.
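A minimal sketch of what normalisation buys you, using Python's stdlib `unicodedata` module instead of libicu (the sample strings are invented for the demo):

```python
import unicodedata

# Two visually identical strings: "é" as one precomposed code point
# vs. "e" followed by a combining acute accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# The raw code-point sequences differ, so a naive index would
# treat them as different words.
assert composed != decomposed

# Normalising both to NFC makes them compare equal, so the same
# search term matches either spelling.
assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)

# NFKC additionally folds compatibility variants, e.g. the
# full-width Latin letters common in CJK text.
assert unicodedata.normalize("NFKC", "ＡＢＣ") == "ABC"
```

The same idea applies whichever library does the work: normalise once before indexing and once per query, so both sides agree on one canonical form.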

Other than that, there's also tokenizing (splitting text into words), which is also Unicode-defined, and stemming (reducing tokens to a base stem, like "likes" -> "lik-" in English).
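A toy sketch of both steps; the regex tokenizer and the suffix list are simplifications I'm inventing for illustration (real engines follow the Unicode UAX #29 word-boundary rules and a proper stemming algorithm such as Porter/Snowball):

```python
import re

def tokenize(text):
    # Naive split on letter/digit runs; a real tokenizer would
    # implement Unicode word-boundary rules instead.
    return re.findall(r"\w+", text.lower())

# Toy English suffix list, longest first -- illustration only.
SUFFIXES = ("ing", "es", "ed", "s")

def stem(token):
    # Strip the first matching suffix, keeping at least 3 chars,
    # e.g. "likes" -> "lik" as in the comment above.
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) - len(suf) >= 3:
            return token[: -len(suf)]
    return token

tokens = tokenize("She likes hiking")
stems = [stem(t) for t in tokens]
# tokens == ["she", "likes", "hiking"], stems == ["she", "lik", "hik"]
```

The point is that both the query and the indexed text must pass through the same tokenize+stem pipeline, otherwise "likes" in a query never matches "liked" in the index.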


I believe there are tokenizers for the Chinese language, but I haven't tried them since they aren't available by default on Android or iOS, for example. In our app we've ended up with two modes: one for alphabetic languages and one for Thai, Chinese, Hebrew and Arabic; in the latter case it's not actually using FTS but a plain SQL query.
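A sketch of the fallback mode described above, using Python's stdlib `sqlite3`; the table name and sample rows are invented. Since written Chinese has no spaces between words, a default whitespace tokenizer can't index it usefully, but a substring `LIKE` scan still finds matches:

```python
import sqlite3

# In-memory DB with a hypothetical notes table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE notes (body TEXT)")
con.executemany(
    "INSERT INTO notes VALUES (?)",
    [("今天天气很好",), ("hello world",)],
)

# Plain substring search: no tokenizer needed, at the cost of a
# full table scan instead of an index lookup.
pattern = "%天气%"  # "weather"
rows = con.execute(
    "SELECT body FROM notes WHERE body LIKE ?", (pattern,)
).fetchall()
# rows == [("今天天气很好",)]
```

This trades FTS index speed for correctness on unsegmented scripts, which is a reasonable deal for small per-user datasets like a mobile app's local store.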




