A 5x reduction in RAM usage with Zoekt memory optimizations (sourcegraph.com)
129 points by janisz on Aug 19, 2021 | 37 comments



I'm surprised that they are storing Unicode characters instead of bytes. For example, Rust's regex library works on bytes: Unicode patterns are compiled into byte patterns, which means it doesn't need to worry about Unicode and variable-length characters when matching.

You'd think that for code, where the vast majority is ASCII, this would be a huge improvement. I guess the downside is that searches for emoji and other "long" characters would need to look up more index entries, but given how rare those are, I'd expect it to be beneficial overall.


Hi, Zoekt author here.

There are a bunch of interesting replies, but they all miss the point.

The reason you can't use bytes is a combination of two things:

(1) code search supports case-insensitive matching.

(2) the search algorithm uses distance between trigram occurrences to reduce the number of matches.

With UTF-8, different case flavors of a character have different widths. For example: k (ASCII k, 1 byte), K (ASCII K, 1 byte) and K (Unicode U+212A, the Kelvin sign, 3 bytes). This messes up (2).

The same consideration holds if you want to support searches where "a" also matches "á" and "ã".
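To make the width mismatch concrete, here is a tiny, self-contained Go snippet (illustrative only, not Zoekt code) that prints the UTF-8 byte widths of the case and accent variants mentioned above:

    package main

    import "fmt"

    func main() {
        // "k" and "K" are 1 byte each in UTF-8, but U+212A (KELVIN SIGN) is
        // 3 bytes; similarly "a" is 1 byte, precomposed "á" (U+00E1) is 2,
        // and decomposed "a" + U+0301 is 3.
        for _, s := range []string{"k", "K", "\u212A", "a", "\u00E1", "a\u0301"} {
            fmt.Printf("%q encodes to %d byte(s)\n", s, len(s))
        }
    }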

I did a talk on exactly this subject in 2017, which goes into more detail. The bit that explains it is here, https://youtu.be/_-KTAvgJYdI?t=602 . If you have more time, you can watch the entire thing, as it is entertaining, if I may say so :)


If you actually want to be UCA-compliant, though, it won't really help you to use codepoints. Different Unicode normalizations, contractions and ignorables really mess things up. (In general, you can't search by case-folding if you want all the edge cases right; you have to apply the full Unicode collation algorithm.) I wonder if it's possible to store UCA weights instead of code points in the trigrams, but I haven't thought deeply about it, and it would surely affect indexing speed.

In plocate, where I have this exact same issue, I solve it half-heartedly by using 3-byte trigrams and then constructing a whole bunch of alternatives, even if they differ in width. (I don't support accent-insensitive search and I don't support contractions, so locate ß won't find SS.) See https://git.sesse.net/?p=plocate;a=blob;f=parse_trigrams.cpp... — but it doesn't support the Kelvin case if it's in the filename and you give K on the command line.


> (1) code search supports case-insensitive matching.

I'm not quite sure why this matters. Either you case-fold before you store, so both cases map to the same trigram, or you expand the query to consider all cases, and again both cases are covered. You are effectively storing UTF-8 anyway; it is just a question of whether you store 3 bytes or 3 characters.

> (2) the search algorithm uses distance between trigram occurrences to reduce the number of matches.

Could you not still measure this distance in codepoints? Or is there a benefit to tying this to the length of the trigrams?

> The same consideration holds if you want to support searches where "a" also matches "á" and "ã".

Well, IIRC á can be two codepoints as well (if it isn't normalized), so don't you need to handle this case anyway?


The source code itself is stored as UTF-8, but the trigrams were represented as Unicode codepoints in the index. The last optimization packs the ASCII trigrams for efficiency.


But that's my point. Why not just store bytes? Then you don't have to worry about packing; it is always packed.

If I understand correctly, they are using 8 bytes for 3 codepoints. They could instead use 3 bytes for 3 bytes. This would use significantly less memory and would rarely be less selective. (If that were a concern, they could probably consider 4-grams instead of trigrams and still use less memory.)
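For a sense of scale, here is a rough sketch of the two representations being compared. The 21-bits-per-codepoint packing is an assumption for illustration, not necessarily the exact layout Zoekt uses:

    package main

    import "fmt"

    // packTrigram packs three codepoints into one uint64 key,
    // 21 bits per rune (enough for any Unicode scalar value).
    func packTrigram(a, b, c rune) uint64 {
        return uint64(a)<<42 | uint64(b)<<21 | uint64(c)
    }

    func main() {
        fmt.Printf("codepoint key: %#x (stored in 8 bytes)\n", packTrigram('f', 'o', 'o'))
        fmt.Printf("byte key:      % x (3 bytes)\n", []byte("foo"))
    }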

This also doesn't preclude splitting into the first 2 characters and the last 1.


Oh, certainly, you could change the architecture to have byte trigrams. These optimizations work without any large-scale architecture changes, making them safer to test and deploy: no data migrations or reindexing required!


I think the real question is: why not use trigrams of bytes (from the UTF8 representation) instead of Unicode codepoints? Seems like that would simplify everything.

If they have data showing that doesn't work as well in practice, I'd love to know what it is, because I can't think of a reason off the top of my head why it wouldn't work well. Obviously it could only be worse in non-ASCII situations, which are already super-rare.


Because then you've potentially got a lot of trigrams for 4-byte Unicode characters that only contain the first three bytes of the character. Even if they're rare, you're effectively hamstringing anyone who uses a language that includes 4-byte characters or emoji. And for languages that use 3-byte characters, you're essentially indexing individual characters. Even in the case where you're using 2-byte characters, which is very common, your trigrams become one and a half characters.

But moreover, the prefix of many of your trigrams is just going to be surrogate pair bytes. At this point, the whole point of building a reverse index is somewhat moot if you're indexing over parts of characters, byte sequences that are not useful on their own, and other such garbage. Yes, the ASCII happy path is fine, but you've essentially eliminated Unicode support entirely.


> Even if they're rare, you're effectively hamstringing anyone who uses a language that includes four byte characters or emoji.

I don't think so. Suppose I am looking for some code that I remember has a comment with the shark emoji U+1F988 in double quotes.

So the trigrams this backend is searching for include 0x22F09F and 0xA68822.

The first trigram is only going to appear in documents with a double quote immediately followed by a character in the U+1F000–U+1FFFF range (mostly emoji). Yes, it will match kangaroos or indeed crabs in a few Rust crates, but it will entirely skip every program that doesn't quote an emoji. The second trigram only matches documents with certain 3- or 4-byte Unicode scalars (such as an emoji) at the end of a quote. It will not, for example, match most of those Rust documents talking about crabs, since crabs do not end with the bytes 0xA6 0x88.

So chances are you went from "oh no, a 4-byte character, surely the worst case" to only the documents you actually wanted before the main algorithm even began running.
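If you want to check those byte trigrams yourself, here is a quick hypothetical sketch that just enumerates the overlapping 3-byte windows:

    package main

    import "fmt"

    func main() {
        s := "\"\U0001F988\"" // a double-quoted shark emoji: 22 F0 9F A6 88 22
        for i := 0; i+3 <= len(s); i++ {
            fmt.Printf("0x%X\n", s[i:i+3])
        }
        // Prints 0x22F09F, 0xF09FA6, 0x9FA688 and 0xA68822; the first and
        // last are the quote-adjacent trigrams discussed above.
    }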

> But moreover, the prefix of many of your trigrams is just going to be surrogate pair bytes.

I assume this is just a brain misfiring; this is obviously not how UTF-8 works.


That works great for emoji and terribly for Japanese, Chinese, Korean, Thai, etc., where each character encodes to three bytes. Then you're literally just building a reverse index of individual characters. If that were effective, they wouldn't be using trigrams in the first place: it's a much simpler system to implement. Sure, most programming languages use Latin characters in keywords, but that's unhelpful for comments.


> the prefix of many of your trigrams is just going to be surrogate pair bytes

I'm not sure what you mean by this. Surrogate pairs are a UTF16 thing, not relevant here.

> you've essentially eliminated Unicode support entirely.

That's not obvious to me at all. In text containing non-ASCII characters, the trigrams that span UTF8 character boundaries are likely to give you the selectivity you need.

> And for languages that use characters with three bytes, you're essentially indexing individual characters

Not at all. The index would be a function of character pairs.

It certainly could be worse but how much worse would depend on the characteristics of the source texts, which is why I think it would be really interesting to see whatever data Sourcegraph has on this.


> Not at all. The index would be a function of character pairs.

If it's a three-byte trigram and it covers a three-byte character, please explain how it could be anything other than a trigram containing a single character. If you want to cover a pair of characters, you need six bytes, which is also simply worse than the 42 bits you need to encode two Unicode code points as integers (as the author is doing now).


Because there are trigrams crossing the boundaries between characters.

E.g. given the bytes ABCXYZ, the trigrams are not just ABC and XYZ but also BCX and CXY. Then the question is, assuming ABC, XYZ, BCX and CXY are all present in a document, what is the probability that ABCXYZ is present in the document? If it's close to 1, then the byte trigram index is about as powerful as a character bigram index.
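A small sketch of that effect with two three-byte characters (illustrative only):

    package main

    import "fmt"

    func main() {
        s := "漢字" // two CJK characters, 3 UTF-8 bytes each
        for i := 0; i+3 <= len(s); i++ {
            fmt.Printf("% x\n", s[i:i+3])
        }
        // Four trigrams: the first and last each cover a single character,
        // while the middle two straddle the boundary and therefore constrain
        // the character pair, much like a character bigram would.
    }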

Of course it's also true that bigrams might be inadequately selective, but the underlying alphabet is probably larger than ASCII if it encodes to 3 bytes in UTF8.

Again, I'm not arguing that trigrams-over-bytes is necessarily the best option here, just that it's simpler, and a decision to do trigrams over Unicode characters needs data to justify it. I'd love to know what that data is.


Trigrams have an approximately 30% false positive rate when used as a filter for whether to skip repos.


That's interesting, and doesn't surprise me. When I wrote an n-gram full-text search for email 20 years ago, I used 4-grams to get the required selectivity. I imagine code could be worse because of more restricted syntax and alphabet.

However, this doesn't bear directly on the question of trigrams-over-UTF8-bytes vs trigrams-over-Unicode-chars.


It's entirely possible to do so, and I do it in plocate, which solves a similar problem in pretty much the same way. But it's non-ideal for case-insensitive matching (see previous comment).


I use another regex library that does that, and I don't like it at all.

Something like \w makes the regex complexity explode. It spends more time building the regex than actually matching it.


You can't get around the Unicode problem for free. If your state machine operates on codepoints, then your state machine is almost certainly very slow. Not only because you now need to do UTF-8 decoding, but you also probably need a sparse representation (e.g., a hash map) to deal with the large space of Unicode codepoints since that's what your transitions are defined over.

You're right that Unicode-aware \w is huge if you build a full byte-based DFA for it. But that's why Rust's regex crate doesn't build full DFAs. It only builds the part of the DFA it needs (at search time). So you get the best of both worlds: reasonably fast compilation combined with fast search. (A big enough state machine will still ultimately overwhelm it and result in slower searches or large memory growth, depending on which trade off you pick. But a simple \w is handled just fine.)


In case anyone else was wondering where the name comes from: Zoekt means "seek".

Context: "Zoekt, en gij zult spinazie eten" - Jan Eertink

("seek, and ye shall eat spinach" - My primary school teacher) https://github.com/google/zoekt


Related: the last time I checked, a standard Ubuntu install does not have RAM compression (zram) enabled by default (unlike current Windows and macOS, which have had it for years). It helps a lot with programs like browsers.


I understand why it's opt-in, though; it's not a clear win in all cases.


Your point about opt-in makes me wonder if my smaller VMs and VPSs would benefit. This recent one-pager covers the "compression can help, but not always" caveat you've noted.

https://benad.me/blog/2021/02/23/ram-compression-on-linux/

> Still, unless you frequently use the swap device on your Linux system, it may not be worthwhile to reserve a large amount of memory to a zram device.

I know how to view current swap usage and will check if there’s historical usage in syslog or the like.


As always with performance things, you have to benchmark. But that’s always easier said than done. With the insane complexity of memory systems in modern CPUs anything can happen.


zram is amazing. I love it. Makes my 16GB machine feel like a 24GB one, with no noticeable downsides. I've totally turned off regular swap, even.


Author here, ask me anything! :-)


Is this available in the self-hosted version as well now? I am getting RAM issues on the AWS-hosted cluster.


Yes, all these optimizations have been in the last few releases.

Reducing RAM allocations in the cluster can be tricky. Zoekt memory-maps its index files and will absorb all available RAM as page cache, and the standard cAdvisor metrics will simply declare that your container is using almost all of its RAM allocation. Looking at Go's heap statistics for the process will give you a better idea of how much overhead you actually need.
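If it helps, here is a minimal illustration of the kind of number to look at, assuming you can get at the process's Go runtime statistics (e.g. via a metrics endpoint); this is not Zoekt's own instrumentation:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        // HeapInuse reflects memory actually held by Go objects; the
        // container's reported usage will look much larger because the
        // memory-mapped index files are counted as page cache against it.
        fmt.Printf("HeapInuse: %d MiB\n", m.HeapInuse/(1<<20))
    }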


Not directly related to the linked content, but "N times less" never really made that much sense to me. In this case, I'm guessing "5x reduction" means "80% less"? Or "20% of the previous usage"?


N times less usually just means it uses 1/N. So this would mean it uses only 20% of the memory it used originally.


Yeah, it doesn't make sense to me: I would have thought the x in 5x represents a multiplication rather than a division.

It would make more sense to me to say a fifth, or 1/5.


    log(5x)  =  log(5) + log(x)
    log(x/5) = -log(5) + log(x)
So the difference is just a sign change on a log scale. And it's natural to use a log scale when working with percentage changes.


Natural for whom?

Perhaps better expressed as "5-fold".


"We went from 1400KB of RAM per repo to 310KB with no measurable latency changes."

So, not exactly, but close. ~22% of previous usage.


Yes, that is correct. It's not always the most intuitive language.

"5 times less" than 20 would be 4, i.e. 20 * 1/5, just as 20 is 5 times more than 4.


Yes, it is the marketing speak for "80% less", and generally speaking it works much better than a percentage.


I saw an advertisement saying "now 20% cheaper!" when the price change (original -> new) was 30€ -> 25€.

Marketing always finds a way... 5/30 = 16.7%, but 5/25 = 20%.




