> the prefix of many of your trigrams is just going to be surrogate pair bytes I...

bastawhiz · on Aug 20, 2021

> Not at all. The index would be a function of character pairs.

If it's a three-byte trigram and it covers a three-byte character, please explain how it could be anything other than a trigram containing a single character. If you want to cover a pair of characters, you need six bytes, which is also simply worse than the 42 bits you need to encode two Unicode code points as integers (as the author is doing now).
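For reference, the 42-bit figure follows from the largest code point, U+10FFFF, fitting in 21 bits. Below is a minimal sketch of packing a character pair into one integer key. The names `pack_bigram` and `unpack_bigram` and the bit layout are hypothetical, chosen for illustration, and not necessarily how the author's index stores pairs.

```python
CODE_POINT_BITS = 21  # U+10FFFF, the largest code point, fits in 21 bits


def pack_bigram(a: str, b: str) -> int:
    """Pack two single characters into one integer of at most 42 bits.

    Hypothetical layout: first code point in the high 21 bits,
    second code point in the low 21 bits.
    """
    return (ord(a) << CODE_POINT_BITS) | ord(b)


def unpack_bigram(key: int) -> str:
    """Recover the two-character string from a packed key."""
    mask = (1 << CODE_POINT_BITS) - 1
    return chr(key >> CODE_POINT_BITS) + chr(key & mask)


key = pack_bigram("日", "本")
utf8_bits = len("日本".encode("utf-8")) * 8  # six bytes for the same pair
```

Under these assumptions the packed key never exceeds 42 bits, while the same pair takes 48 bits when each character encodes to three UTF-8 bytes.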

roca · on Aug 20, 2021

Because there are trigrams crossing the boundaries between characters.

E.g. given the bytes ABCXYZ, the trigrams are not just ABC and XYZ but also BCX and CXY. Then the question is, assuming ABC, XYZ, BCX and CXY are all present in a document, what is the probability that ABCXYZ is present in the document? If it's close to 1, then the byte trigram index is about as powerful as a character bigram index.
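To make the boundary-crossing case concrete, here is a small sketch that lists every byte trigram of a two-character string in which each character encodes to three UTF-8 bytes. The helper name `byte_trigrams` is made up for this illustration and does not come from any particular index implementation.

```python
def byte_trigrams(text: str) -> list[bytes]:
    """Return every overlapping 3-byte window of the UTF-8 encoding."""
    data = text.encode("utf-8")
    return [data[i:i + 3] for i in range(len(data) - 2)]


pair = "日本"  # two characters, three UTF-8 bytes each (ABC and XYZ)
data = pair.encode("utf-8")
trigrams = byte_trigrams(pair)

# The first and last windows are the two whole characters (ABC and XYZ);
# the two windows in between straddle the character boundary (BCX and CXY).
whole = [t for t in trigrams if t in ("日".encode("utf-8"), "本".encode("utf-8"))]
crossing = [t for t in trigrams if t not in whole]
```

Six bytes yield four trigrams, two of which cross the boundary. Whether the presence of all four in a document implies the full six-byte sequence is the empirical question raised above, and this sketch does not answer it.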

Of course it's also true that bigrams might be inadequately selective, but the underlying alphabet is probably larger than ASCII if it encodes to 3 bytes in UTF-8.

Again I'm not arguing that trigrams-over-bytes is necessarily the best option here, just that it's simpler and a decision to do trigrams over Unicode characters needs data to justify it, and I'd love to know what that data is.

Scaevolus · on Aug 20, 2021

Trigrams have an approximately 30% false positive rate when used as a filter for whether to skip repos.

roca · on Aug 20, 2021

That's interesting, and doesn't surprise me. When I wrote an n-gram full-text search for email 20 years ago, I used 4-grams to get the required selectivity. I imagine code could be worse because of more restricted syntax and alphabet.

However, this doesn't bear directly on the question of trigrams-over-UTF8-bytes vs trigrams-over-Unicode-chars.