
It's 7 TB compressed. If it were all text you'd need about 70 TB to store it decompressed. It's probably mostly images, though, so not quite that bad.



I've tried to do lossy compression of epubs with a few lines of bash script, i.e. removing the images and fonts that weren't needed. Many epubs could be downsized to a third of their size, but then I found a book that actually needed the supplied fonts and gave up. Lossy compression can't afford that kind of bug.
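
The stripping step amounts to something like the following sketch (in Python rather than bash, since an epub is just a zip container; the droppable file types are guessed purely from their extensions, which is exactly the assumption that broke on the book that needed its fonts):

    import sys
    import zipfile

    # Extensions assumed safe to drop; this is the lossy (and fragile) part.
    DROP = (".jpg", ".jpeg", ".png", ".gif", ".ttf", ".otf", ".woff", ".woff2")

    def strip_epub(src, dst):
        """Copy an epub, omitting image and font entries."""
        with zipfile.ZipFile(src) as zin, \
             zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
            for info in zin.infolist():
                if info.filename.lower().endswith(DROP):
                    continue  # drop the entry entirely
                data = zin.read(info.filename)
                if info.filename == "mimetype":
                    # the epub spec wants this entry stored uncompressed
                    zout.writestr(info, data, zipfile.ZIP_STORED)
                else:
                    zout.writestr(info, data)

    if __name__ == "__main__":
        strip_epub(sys.argv[1], sys.argv[2])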

What I also found was that many of the images in the epubs were already unusable and nothing like their counterparts in the physical books.


I don’t understand this. Are they epubs of comics or something? Epubs are already compressed (zip).


Compressing lots of epub files together can likely be far more efficient than zip's per-file compression, as deduplication/compression algorithms can run across many books at once. Especially so with a good shared dictionary.
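
As a rough sketch of the dictionary idea (assuming the Python zstandard bindings; the sizes and level here are placeholders): train one dictionary over many books, then compress each book against it, so text that repeats across books only gets encoded once.

    import zstandard as zstd

    def compress_with_shared_dictionary(books, dict_size=112_640):
        """books: list of bytes, one per book. Returns (dictionary, compressed blobs)."""
        dictionary = zstd.train_dictionary(dict_size, books)
        compressor = zstd.ZstdCompressor(dict_data=dictionary, level=19)
        return dictionary, [compressor.compress(b) for b in books]

    # Decompression needs the same dictionary:
    #   zstd.ZstdDecompressor(dict_data=dictionary).decompress(blob)

Dictionary training is tuned for many small samples; for whole books, simply compressing them together in one solid stream (e.g. tar piped through a strong compressor) gets much of the same cross-book deduplication.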


It's not terribly uncommon to find an epub with several megabytes of cover art and a few hundred kilobytes of text.
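
Easy to check, since an epub is a zip. A quick sketch that tallies uncompressed bytes per file extension:

    import sys
    import zipfile
    from collections import Counter

    sizes = Counter()
    with zipfile.ZipFile(sys.argv[1]) as z:
        for info in z.infolist():
            ext = info.filename.rsplit(".", 1)[-1].lower()
            sizes[ext] += info.file_size  # uncompressed size of the entry

    for ext, total in sizes.most_common():
        print(f"{ext:10s} {total / 1024:10.1f} KiB")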


Things can be rendered from compressed container files. For HTML with images, even slow-but-strong compression like LZMA is already fast enough to render pages as fast as you can click through them, even on fairly old hardware.
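
A toy illustration of that point (each page compressed separately with Python's stdlib lzma, so any single page can be decoded on demand):

    import lzma
    import time

    # Compress each page separately so any one page can be fetched on its own.
    pages = {f"page{i}": "<html><body>" + "lorem ipsum " * 5000 + "</body></html>"
             for i in range(100)}
    archive = {name: lzma.compress(html.encode()) for name, html in pages.items()}

    start = time.perf_counter()
    html = lzma.decompress(archive["page42"]).decode()
    print(f"decompressed {len(html)} bytes in "
          f"{(time.perf_counter() - start) * 1000:.1f} ms")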

Kiwix's .ZIM file format is a good example. The entire Project Gutenberg library is a single ~65 GB file, and you can read any book from it without unpacking anything.
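
Reading straight out of a ZIM file looks roughly like this with the python-libzim bindings (class and method names as I recall them, and the file and entry paths are made up, so treat all of it as an assumption):

    from libzim.reader import Archive  # python-libzim bindings (API assumed)

    zim = Archive("gutenberg_en_all.zim")               # hypothetical filename
    entry = zim.get_entry_by_path("A/some_book.html")   # hypothetical entry path
    html = bytes(entry.get_item().content).decode("utf-8")
    # the book is decoded on demand; nothing is unpacked to disk
    print(html[:200])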



