
> the prefix of many of your trigrams is just going to be surrogate pair bytes

I'm not sure what you mean by this. Surrogate pairs are a UTF16 thing, not relevant here.

> you've essentially eliminated Unicode support entirely.

That's not obvious to me at all. In text containing non-ASCII characters, the trigrams that span UTF8 character boundaries are likely to give you the selectivity you need.

> And for languages that use characters with three bytes, you're essentially indexing individual characters

Not at all. The index would be a function of character pairs.

It certainly could be worse, but how much worse would depend on the characteristics of the source texts, which is why I think it would be really interesting to see whatever data Sourcegraph has on this.




> Not at all. The index would be a function of character pairs.

If it's a three-byte trigram and it covers a three-byte character, please explain how it could be anything other than a trigram containing a single character? If you want to cover a pair of characters, you need six bytes, which is also simply worse than the 42 bits you need to encode two Unicode code points as integers (as the author is doing now).
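
For reference, the 42-bit figure comes from 21 bits per code point, since the largest code point, U+10FFFF, fits in 21 bits. Here is a minimal sketch of that packing; the function names are made up for illustration, and this is not necessarily how the author's index lays out its keys.

```python
CODE_POINT_BITS = 21  # U+10FFFF, the largest Unicode code point, fits in 21 bits


def pack_bigram(a: str, b: str) -> int:
    """Pack two single-character strings into one integer key of at most 42 bits."""
    return (ord(a) << CODE_POINT_BITS) | ord(b)


def unpack_bigram(key: int) -> tuple[str, str]:
    """Recover the two characters from a packed key."""
    mask = (1 << CODE_POINT_BITS) - 1
    return chr(key >> CODE_POINT_BITS), chr(key & mask)


# Worst case: both code points are U+10FFFF, which uses all 42 bits.
worst_case_bits = pack_bigram(chr(0x10FFFF), chr(0x10FFFF)).bit_length()

# The same pair of three-byte characters takes 6 bytes (48 bits) as raw UTF-8.
utf8_bits = len("日本".encode("utf-8")) * 8
```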


Because there are trigrams crossing the boundaries between characters.

E.g. given the bytes ABCXYZ, the trigrams are not just ABC and XYZ but also BCX and CXY. Then the question is, assuming ABC, XYZ, BCX and CXY are all present in a document, what is the probability that ABCXYZ is present in the document? If it's close to 1, then the byte trigram index is about as powerful as a character bigram index.
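
To make that concrete, here is a rough sketch of sliding byte trigrams over UTF-8 text, plus the usual "all query trigrams present" candidate check. The helper names are hypothetical and this is not taken from any real indexer.

```python
def byte_trigrams(text: str) -> list[bytes]:
    """Return every 3-byte window over the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    return [data[i:i + 3] for i in range(len(data) - 2)]


def may_contain(document: str, query: str) -> bool:
    """Trigram filter: True if every query trigram occurs somewhere in the document.

    This can return True for documents that do not contain the query
    (a false positive), but never False for documents that do.
    """
    return set(byte_trigrams(query)) <= set(byte_trigrams(document))


# Two three-byte characters give six bytes, hence four trigrams:
# one per whole character, plus two that straddle the boundary.
grams = byte_trigrams("日本")
boundary_grams = grams[1:3]
```

Whether the boundary-spanning trigrams give enough selectivity is exactly the empirical question; the sketch only shows that they exist and do depend on both characters.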

Of course it's also true that bigrams might be inadequately selective, but the underlying alphabet is probably larger than ASCII if it encodes to 3 bytes in UTF8.

Again I'm not arguing that trigrams-over-bytes is necessarily the best option here, just that it's simpler and a decision to do trigrams over Unicode characters needs data to justify it, and I'd love to know what that data is.


Trigrams have an approximately 30% false positive rate when used as a filter for whether to skip repos.


That's interesting, and doesn't surprise me. When I wrote an n-gram full-text search for email 20 years ago, I used 4-grams to get the required selectivity. I imagine code could be worse because of more restricted syntax and alphabet.

However, this doesn't bear directly on the question of trigrams-over-UTF8-bytes vs trigrams-over-Unicode-chars.



