
I guess I've never really understood the need for ePub when PDF is now a standard. I'm sure I'm missing something, the question is: what is that thing?



PDFs bake in a page size. For 99% of them, 8.5 x 11.

Your typical e-ink device (a Kindle) has a 6-inch diagonal, but 6.8, 7.8, and 9.7-inch screens are also in somewhat common use, along with rarer sizes. Add three sizes of iPads, dozens of sizes of Android tablets, phones and laptops, etc.

You can't make a single page size display well on all of this -- at 8.5 x 11, a page will be too small to read when shrunk down to fit a 6in screen. Bake it at 6in and the pages are ridiculous on larger screens, but still too large for phones. You need a format that shapes the number and size of pages to fit the display -- enter ePub, which is based on HTML.


The big thing is that PDF is page oriented and ePub is flow oriented. An ePub book can be reformatted for different page and font sizes. A PDF book is difficult to read if the screen is not large enough.
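To make the page-oriented vs flow-oriented difference concrete, here is a rough sketch in Python. It is only an analogy: `reflow` and `fixed_page` are made-up helper names, and character columns stand in for real screen widths and font metrics.

```python
import textwrap

TEXT = ("An ePub book can be reformatted for different page and font sizes. "
        "A PDF book is difficult to read if the screen is not large enough.")


def reflow(text: str, columns: int) -> list[str]:
    """Flow-oriented: re-break the text for whatever width the screen has."""
    return textwrap.wrap(text, width=columns)


def fixed_page(text: str, baked_columns: int, screen_columns: int) -> list[str]:
    """Page-oriented: line breaks are baked in once at `baked_columns`.

    A narrower screen only sees the left part of each line unless the
    reader pans sideways, which is the awkward case on small devices.
    """
    baked = textwrap.wrap(text, width=baked_columns)
    return [line[:screen_columns] for line in baked]


if __name__ == "__main__":
    print("\n".join(reflow(TEXT, 30)))         # every word still visible
    print("\n".join(fixed_page(TEXT, 80, 30)))  # text is cut off on the right
```

With the flow version nothing is lost at any width; with the baked version the narrow screen has to pan or shrink.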


If you are reading an ePub format book on your Kobo and zoom in then it'll work as expected. If you are reading a PDF format book then after zooming in you will need to move the page around making reading incredibly awkward / impossible.

In other words, ePub cares about where the chapters are and such, it is structured text, while PDF is a page description and what's on a single page will always be on a single page.


Aside from zooming, most ereaders offer the option to use a bigger font. That will result in a different text flow, which usually works. But it can break up text flow within a paragraph, within a sentence and even within a word. I've seen words break up into single letters, each letter one line.


One thing I can tell you:

ePub is one of the best formats for blind people, they can read all contents of most books very easily. You can (like HTML) basically just dump out plain text by ignoring all the tags.
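As a minimal sketch of the "ignore all the tags" idea, assuming you already have one chapter's XHTML as a string (`TextDumper` and `dump_text` are illustrative names, not part of any ePub tool):

```python
from html.parser import HTMLParser


class TextDumper(HTMLParser):
    """Collects only the character data and drops every tag."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def dump_text(xhtml: str) -> str:
    parser = TextDumper()
    parser.feed(xhtml)
    parser.close()
    # Collapse runs of whitespace left behind by the markup's layout.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    chapter = "<h1>Chapter 1</h1>\n<p>It was a <em>dark</em> night.</p>"
    print(dump_text(chapter))
```

A real screen reader does much more (headings, navigation, alt text), and this sketch would also dump the contents of `<style>` or `<script>` elements, but the text comes out in reading order with almost no work.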

PDFs are the worst format that is commonly available, it isn't much better than a jpg of text. There are many programs that try to convert PDFs back into plain text, they all get confused by sufficiently complex PDFs. When you look inside a PDF, it is formatted as "put characters ABC at page location X,Y", you then have to try to extract the text from that by manually tracing the flow / columns / etc.


> When you look inside a PDF, it is formatted as "put characters ABC at page location X,Y", you then have to try to extract the text from that by manually tracing the flow / columns / etc.

And as you want to do more and more complex things with letter spacing you end up with ever shorter flows of characters; there are certainly some tools that end up creating every character individually so even extracting a single word in Latin script is hard.


When a PDF is properly formatted, when sentences are structured like sentences, and when it's a simple PDF like a book without tables and pictures, it works.

Sometimes you see a PDF where the end of a line of text means the end of a paragraph structure wise. And if you choose to use a bigger font, the text flow changes and sentences break halfway, leaving a half empty line, etc. When this happens once every two or three pages, not a problem, but if this happens all the time it makes the book unreadable. This happens quite a lot at the ending of a page as well.

If you have something like a sidenote or footnote, it is very likely that it breaks up the textflow, and not in a nice way, after a paragraph, but it breaks up a sentence. This can be very confusing, especially if you're reading in a different language with complex sentences.

Then there is another level of breaking up a sentence halfway, where words are not formatted as words, but as letters, and where each line breaks up words. I've seen this with an O'Reilly book. They fixed this within a week, so that was excellent service, but it shows how this can work.

Images can be a problem, depending on their size. Most ereaders are underpowered and can have difficulty processing large images.

Tables will be a mess in 99% of the cases, because they consist of text and lines, and do not have any structure. It's always a surprise how they will show up. It could mean that half of the table is not even shown, or column one is shown, below that column 2, and you don't see the relation anymore. Sometimes column 2 is partly shown before column 1.

Some PDFs simply crash the ereader program.


PDF sucks on mobile and ebook readers. Seriously sucks.

I apologize for the language but I find no other words that work.


They're both document formats, but they come at it from opposite directions. PDFs contain low-level drawing instructions for displaying their content, but usually know far less about the content itself. ePub is the other extreme: it contains structured content, but allows more leeway in the presentation of that content.

Both formats have evolved over the years to build on their strengths and shore up their weaknesses. For example, you can now embed structured content in a PDF to make it more ePub-like (and potentially reflow text in a viewer), but in practice few PDFs take advantage of this. I don't know nearly as much about ePub, but I would imagine that over the years it has evolved to allow for more control over the visual presentation of the content.

Ultimately PDF is a more universal format but ePub is a better fit for most digital books. Both formats will probably coexist for a long time.


My thought has been more like, "I guess I've never really understood the need for ePub when HTML is now a standard." … which is exactly what this announcement is about.


Ideally, ePub would be "reflowable HTML plus baked-in things you would obviously want in a book made dead simple for author-editor-publisher workflows that will never involve a programmer in a million years". Proper footnotes that appear on the same page as the thing they're footnoting, for example.

In practice though this is still a shitshow even in ePub 3: 99% of publishers just ship a pile of endnotes, and consequently certain authors like Terry Pratchett are a god-awful experience in ebook format. Jumping back and forth every 30 seconds sucks.

There are workarounds that can be done with JS, or certain reader-specific markup you can use for iBooks, to at least get pop-up footnotes, but most publishing houses don't have the technical know-how or desire to invest that much time in doing the right thing. There needs to be dedicated markup and a better standard for how readers should treat its display whenever possible.

A real failure of the standardization committee, and it will likely only get worse with a group that cares even less about books specifically. The standard continues to be driven by people more interested in gee-whiz features than solving longstanding problems in replicating table-stakes features of print.


Readers complain that ebooks cost too much, and then readers complain that publishers don't want to spend money on technical stuff.

But some of the problem is that there's not a lot of payoff for technical development, because only some platforms will support it. The Kindle doesn't handle JavaScript or math at all, and has very poor support for tables, SVG art, and video. So there's half your market gone right there.


> Readers complain that ebooks cost too much, and then readers complain that publishers don't want to spend money on technical stuff.

To be clear, I'm complaining that the standard overcomplicated certain desirable features to the point where publishers would have to spend money on technical stuff, when they shouldn't have to.

I fully get why publishers don't want to have to engage in fiddly, expensive bespoke development per-title.


> A real failure of the standardization committee, and it will likely only get worse with a group that cares even less about books specifically.

W3C WGs have hugely varied membership, and I wouldn't expect the composition of the WG to change much just because of a change of SDO.


One thing is packaging: there's a desire to have a single file to ship to users to put on their devices instead of a whole folder (and changing that at this point is probably too painful to be worth it), and then you need to address how to deal with addressing (in URLs) other things within the package.

Of course, being packaged also makes it much easier to apply DRM (and, perhaps thankfully, means we don't actually need to address DRMing of HTML content directly, which helps fight against it being in browsers).


HTML can contain arbitrary wacky things; EPUB is a subset of HTML for low-powered readers, with standard bits like chapterization and footnoting.


Well, that was the plan, and it was largely even true for EPUB 1 and EPUB 2.

EPUB 3 wound up being so baroque and overblown that there's still not even one fully-compliant implementation, more than 5 years after the standard was published. They had to back off some time ago and bless efforts that implemented only subsets of the full spec, but the damage had been done.

It's the second-system effect from hell (even though it was technically the third system).


I think this is actually a perfectly reasonable question.

ePub allows content to reflow in order to fit the screen size, while PDFs are a fixed size.

It's not typically a problem if you're reading the book on a desktop or laptop, but it becomes frustrating when you're on a tablet or mobile device.


More generally, PDF is a vector graphic format, not a text format. A bunch of things we absolutely take for granted, like selection and copying, simply don't work reliably. (I've hit a number of glitches with the JS-implemented PDF readers that make me pretty sure they're just OCRing the rendered page, which is a ridiculous state of affairs.)


> they're just OCRing the rendered page

Not quite. Usually the PDF specifies each character (although the reader still has to do a slightly wacky conversion from glyph name to Unicode character) but the position is specified as an (x,y) position, so the reader has to reconstruct the order that they come in, add spaces and newlines, etc.
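A toy version of that reconstruction step might look like the sketch below. It is not how any particular PDF reader does it: the `Glyph` tuple, the fixed `char_width`, and the `line_tolerance` threshold are all assumptions made for illustration, and it only handles a single column of left-to-right text.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # left edge, in points
    y: float  # baseline, in points (PDF's y axis grows upward)


def reconstruct(glyphs: list[Glyph], char_width: float = 6.0,
                line_tolerance: float = 2.0) -> str:
    """Guess reading order from positions alone."""
    # 1. Group glyphs into lines: top of the page first, similar baselines together.
    lines: list[list[Glyph]] = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    # 2. Within each line, sort left to right and infer spaces from wide gaps.
    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            if cur.x - prev.x > 1.5 * char_width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)


if __name__ == "__main__":
    # Glyphs arrive in arbitrary order, as they may in a content stream.
    page = [Glyph("k", 6, 686), Glyph("y", 18, 700), Glyph("H", 0, 700),
            Glyph("o", 0, 686), Glyph("i", 6, 700), Glyph("o", 24, 700)]
    print(reconstruct(page))
```

Every threshold here is a guess, which is why two columns, a sidenote, or unusual letter spacing can scramble the output.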


EPUBs have reflowable text (HTML).


ePub isn't for the same use case as the one PDF is optimized for; it's much closer to the HTML use case (hypermedia that is independent of device size and aspect ratio).

Which is why ePub is basically HTML + some media formats + packaging.
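You can see the packaging part with nothing but a zip library. A rough sketch: the toy archive built here is not a valid book (the package document is a stub, and `make_toy_epub` and `find_package_document` are made-up names), but the `mimetype` entry and `META-INF/container.xml` are the pieces a reader uses to find the content.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = (
    '<?xml version="1.0"?>'
    '<container version="1.0" '
    'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
    '<rootfiles><rootfile full-path="OEBPS/content.opf" '
    'media-type="application/oebps-package+xml"/></rootfiles>'
    '</container>'
)


def make_toy_epub() -> bytes:
    """Build a stub archive with the container layout (not a complete book)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry comes first and is stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", "<package/>")  # stub, illustration only
        zf.writestr("OEBPS/ch1.xhtml",
                    "<html><body><p>Hello</p></body></html>")
    return buf.getvalue()


def find_package_document(epub_bytes: bytes) -> str:
    """Return the path of the package document named by container.xml."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        if zf.read("mimetype").decode("ascii").strip() != "application/epub+zip":
            raise ValueError("not an EPUB container")
        root = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml has no rootfile")
        return rootfile.get("full-path")


if __name__ == "__main__":
    print(find_package_document(make_toy_epub()))
```

Everything else in the archive is ordinary XHTML, CSS, and images addressed by paths relative to the package.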



