It is interesting that Japanese, Russian and Thai benefit more (30%) from bro...

niftich · on April 7, 2017

I think the feature in question is declared in the source code here [1]. The RFC goes into some detail about what this means [2] and how it's used [3]. I'd love a whitepaper but the RFC is fairly descriptive and is the best source I can find.

[1] https://github.com/google/brotli/blob/master/dec/context.h

[2] https://tools.ietf.org/html/rfc7932#section-2

[3] https://tools.ietf.org/html/rfc7932#section-7

SaveTheRbtz · on April 10, 2017

The first draft of the article actually had that reason, but there is also a strong correlation between the size of the dict (these dicts are almost 1 MB, while other languages are closer to 500 kB) and compression ratio improvements. Therefore I played it safe and attributed it to the window size.

Though for languages like Korean and Chinese (whose size is more in line with Latin languages) we see a 27.5% improvement, which is most likely due to context modeling.

Therefore I assume the ratio improvement is split ~50/50 between these two. It would have been easy to verify that by compressing the data with `brotli --window 15` and comparing ratios, but I was lazy. I'm sorry.
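For anyone who wants to try that kind of check, here is a rough sketch of the idea: compress the same data with a small and a large window and compare sizes. It uses Python's stdlib `zlib` as a stand-in (not brotli), and the helper names `compressed_size` and `make_sample` plus the synthetic data are made up for illustration, so it only demonstrates the window effect, not the numbers from the article.

```python
import random
import zlib


def compressed_size(data: bytes, wbits: int) -> int:
    """Size of `data` after DEFLATE with a 2**wbits-byte window (wbits in 9..15)."""
    comp = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return len(comp.compress(data) + comp.flush())


def make_sample(block_size: int = 20000, seed: int = 0) -> bytes:
    """Hypothetical test data: one random block repeated twice.

    The only redundancy is a repeat 20000 bytes back, so it is visible
    to a 32 KiB window but not to a 512-byte one.
    """
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_size))
    return block + block


if __name__ == "__main__":
    data = make_sample()
    for wbits in (9, 15):
        size = compressed_size(data, wbits)
        print(f"window 2**{wbits}: {size} bytes, ratio {len(data) / size:.2f}")
```

On real data the gap between the two window sizes would give a rough estimate of how much of the improvement comes from the window alone.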

PS. I also skipped the NFC/NFD part of the post, which is very interesting for Korean, where NFC-normalized text occupies 30% less space. It also gives an additional 5% ratio improvement for brotli and 15% for gzip.
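To illustrate why normalization matters for Korean, here is a small sketch using Python's stdlib `unicodedata`: NFD splits each Hangul syllable into separate jamo, each of which takes 3 bytes in UTF-8. The helper `utf8_len` and the three-syllable sample are assumptions for illustration only; the 30% figure above comes from real text, which also contains spaces, punctuation and non-Hangul characters.

```python
import unicodedata


def utf8_len(text: str, form: str) -> int:
    """UTF-8 byte length of `text` after Unicode normalization to `form`."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))


# Three precomposed Hangul syllables (U+D55C, U+AD6D, U+C5B4), chosen arbitrarily.
sample = "\ud55c\uad6d\uc5b4"

if __name__ == "__main__":
    nfc = utf8_len(sample, "NFC")  # syllables stay precomposed
    nfd = utf8_len(sample, "NFD")  # syllables are decomposed into jamo
    print(f"NFC: {nfc} bytes, NFD: {nfd} bytes")
```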

JyrkiAlakuijala · on April 10, 2017

There is no Thai or Korean in the dict. The total size of the dict (including all languages) is 120 kB.

SaveTheRbtz · on April 11, 2017

By "dict" I meant the data we are compressing: these are basically dictionaries for "English to X" translation.

What I was saying is that there is a strong correlation between the size of the data I was compressing and the compression ratio improvements over gzip.