
It is interesting that Japanese, Russian and Thai benefit more (30%) from brotli than Latin-script languages (25%). This is because of the UTF-8 context modeling in brotli.


I think the feature in question is declared in the source code here [1]. The RFC goes into some detail about what this means [2] and how it's used [3]. I'd love a whitepaper but the RFC is fairly descriptive and is the best source I can find.

[1] https://github.com/google/brotli/blob/master/dec/context.h

[2] https://tools.ietf.org/html/rfc7932#section-2

[3] https://tools.ietf.org/html/rfc7932#section-7


The first draft of the article actually gave that reason, but there is also a strong correlation between the size of the dict (these dicts are almost 1 MB, while other languages are closer to 500 KB) and the compression ratio improvement. Therefore I played it safe and attributed it to the window size.

Though for languages like Korean and Chinese (whose size is more in line with Latin languages) we see a 27.5% improvement, which is most likely due to context modeling.

Therefore I assume the ratio improvement is split roughly 50/50 between these two. It would have been easy to verify that by compressing the data with `brotli --window 15` and comparing ratios, but I was lazy. I'm sorry.
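
For anyone who wants to try that kind of check, here is a minimal sketch of a window-size experiment. It uses Python's zlib (deflate) as a stand-in because it ships with the standard library; it is not brotli, and the numbers say nothing about the article's data. The sample data and the helper names (`compressed_size`, `make_sample`) are made up for illustration.

```python
import random
import zlib


def compressed_size(data: bytes, wbits: int) -> int:
    """Return the deflate-compressed size of data with a 2**wbits byte window."""
    comp = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return len(comp.compress(data) + comp.flush())


def make_sample(block_size: int = 16 * 1024, repeats: int = 8, seed: int = 0) -> bytes:
    """Hypothetical test data: one random block repeated several times.

    The repeats can only be found as matches when the window is larger
    than the block, so the window size dominates the result.
    """
    rng = random.Random(seed)
    block = bytes(rng.randrange(256) for _ in range(block_size))
    return block * repeats


sample = make_sample()
small = compressed_size(sample, wbits=9)   # 512-byte window
large = compressed_size(sample, wbits=15)  # 32 KB window, the deflate maximum

print(f"input: {len(sample)} bytes")
print(f"wbits=9:  {small} bytes")
print(f"wbits=15: {large} bytes")
```

Running the same comparison on the real dictionaries, once with a window capped near gzip's 32 KB and once with the default, would show how much of the gain comes from the larger window and how much is left for context modeling.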

PS. I've also skipped the NFC/NFD part of the post, which is very interesting for Korean, where NFC-normalized text occupies 30% less space. It also gives an additional ratio improvement of 5% for brotli and 15% for gzip.
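
To illustrate the NFC/NFD point, here is a rough sketch using only the Python standard library. The sample string is made up and zlib stands in for gzip, so the sizes it prints are not the figures from the post; it only shows why the two forms differ. NFD splits each Hangul syllable into its individual jamo, and each jamo takes 3 bytes in UTF-8, the same as a whole precomposed syllable.

```python
import unicodedata
import zlib

# Hypothetical sample text; a real test would use the dictionary data.
text = "한국어 사전 " * 200

nfc = unicodedata.normalize("NFC", text).encode("utf-8")
nfd = unicodedata.normalize("NFD", text).encode("utf-8")

for label, raw in (("NFC", nfc), ("NFD", nfd)):
    packed = zlib.compress(raw, 9)
    print(f"{label}: {len(raw)} bytes raw, {len(packed)} bytes deflated")
```

Both forms decode to canonically equivalent text, so normalizing to NFC before compressing loses nothing for display or comparison purposes.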


There is no Thai or Korean in the dict. The total size of the dict (including all languages) is 120 kB.


By "dict" I meant the data we are compressing: these are basically dictionaries for "English to X" translation.

What I was saying is that there is a strong correlation between the size of the data I was compressing and the compression ratio improvement over gzip.



