IDPF, EPUB Standardizing Body, Has Combined with W3C (idpf.org)
131 points by rhythmvs on Feb 5, 2017 | 46 comments



I believe it is a good thing that the EPUB standard will henceforth be further developed by and within the same body that develops the standards on which EPUB relies. After all, EPUB is “stripped down html5” anyway. As a developer of Web-based html5 books, I can certainly see the benefit of being able to re-use my static html and css ‘as-is’ and repackage them into EPUB-based books for offline consumption by e-readers.

But there’s some fierce objection to the merger of IDPF into the W3C [1][2], the key (?) argument being formulated as:

> “The W3C is focused on promoting the Web, but eBooks are not websites. When the IDPF is gone, who will advocate for readers?”

Maybe a fair point, but I don’t think I can agree. While it is indeed true that reading long-form content like books requires enduring focus, and that reading them in a browser, where linkbait is always luring you to click away into a never-ending feed of distraction, works against that, the problem is not the underlying technology.

[1] http://www.publishersweekly.com/pw/by-topic/digital/content-...

[2] http://futureofebooks.info/


I thought that WhatWG were actually the ones developing HTML5?


WhatWG is just rubberstamping "whatever Chrome implements" (sometimes with feedback from Firefox) as HTML5.

The actual development is entirely controlled by the browser vendors, causing pain for everyone trying to parse HTML programmatically.


Because everything went so well when the W3C was left to their own devices...


Certainly better than bullying cURL into accepting their idiotic URL specification.

Especially considering that what they want would be better specified as a parser for turning user input into URLs, rather than a demand that every tool interacting with URLs be able to parse malformed URLs.


WHATWG didn't bully cURL into adopting its URL spec; the people complaining that cURL didn't match browsers did that.

The WHATWG made a decision long ago that its standards would be descriptive (describe how people parse it), not prescriptive (describe how people should write it). Anyone who attempted to write a web browser would have had to reverse-engineer how other browsers treated crap, because it was the only way to get websites to work. If you think the definition of URL was stupid, you should see what they had to do to support document.all: define a new concept in JS to represent the notion of "this looks and acts like undefined but you can actually use it as an object."
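For anyone who hasn't run into it: the quirk is observable straight from a browser devtools console (a small sketch; browser-only, since Node has no document object):

    // document.all is deliberately "falsy" so that ancient `if (document.all)`
    // IE-detection branches get skipped, yet it still works as a collection
    // the moment you actually use it.
    console.log(typeof document.all);       // "undefined"
    console.log(Boolean(document.all));     // false
    console.log(document.all.length > 0);   // true -- still usable
    console.log(document.all[0] instanceof HTMLElement); // true (the <html> element)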


And a descriptive standard is entirely useless.

The entire point of a standard is that implementors agree on a design definition, then implement it, and it stays consistent, forever.

If you look at standards that work and standards that fail, you’ll quickly notice a pattern. The prescriptive ones include metric units, the A-series of paper sizes, the entire SI system, most open standards, etc. The descriptive ones include imperial / US customary units, Letter paper, Microsoft "Open" XML, etc.

And before you complain that prescriptive standards are useless because you can never change legacy systems: several countries have prescriptive language regulation, with legal authorities defining how the language has to be used, and they manage to deal with centuries of legacy data.


The email RFCs are prescriptive, and totally useless. I actually have evidence, for example, that RFC 2047 is more often violated than not, and I wonder if message/global will ever see usage as a Content-Type. Prescriptive attempts at tackling memory models in languages have generally failed.

Also, your delineation of prescriptive/descriptive is laughable. The imperial standard, letter paper, and OOXML are all prescriptive standards (albeit OOXML is a very badly written one). Prescriptive language standards aren't necessarily well-applied--ask how many people follow the 1996 German spelling reform, or how many use «le hashtag» instead of the "official" «le mot-dièse» (hint: look at what the Wikipedia page is named).

OOXML did poorly not because it was descriptive but because it wasn't precise. It was an XML rendering of internal Office file formats, and its descriptions of terms were no better than internal documentation. Something like the TNEF format is much closer to a descriptive document, since it spends a lot of time discussing the differences between Outlook 2007, Outlook 2010, and Outlook 2013 at various steps.


Considering I’m German, the 1996 spelling reform was exactly what I was referring to – I’ve only found a single document this decade which wasn’t in the new spelling.

All other documents I read have been updated in the meantime.

If an entire country can update centuries of material in a few years, why is it so hard to update some simple websites or, in case that’s not possible, ship a polyfill as an add-on?


If a new version of, say, Chrome were to drop the Mozilla/5.0 from its UA and break a website because it relied on Mozilla/* in its UA detection (there are STILL sites that do this), who would users blame? Chrome, obviously.
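As a hypothetical sketch of the kind of UA sniffing being described (the function and the UA strings are made up for illustration):

    // Naive legacy check: only the historical "Mozilla/" prefix is accepted,
    // so a browser that dropped the prefix would "break" the site even though
    // nothing else about it changed.
    function looksLikeSupportedBrowser(ua: string): boolean {
      return /^Mozilla\/\d/.test(ua);
    }
    looksLikeSupportedBrowser("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/58.0 Safari/537.36"); // true
    looksLikeSupportedBrowser("Chrome/58.0"); // false -- the site refuses to work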

If you're trying to make a new web browser, it's even worse--people won't use it if it breaks the sites that use it. And the web developer would say "it works in all the major browsers, what's wrong with it and why should I spend the time to fix it for your shitty new browser?"

The problem is that the blame for broken sites is universally attributed to browsers, not website developers.


That’s why you create a new version that is specifically not backwards-compatible with the old one, ensure appropriate linters already exist for all tools, have browsers in developer mode or beta/dev versions fail on such sites, etc.; and then let websites opt in to the new version with a special header.


That has been tried before, and that has failed. One of the things that HTML5 fixed was the DOCTYPE mess (in fact, the <!DOCTYPE html> represents the minimal string that enabled standards mode in every browser, including IE). In addition to standards/quirks mode (and the concurrent but separate HTML/XHTML issues), Mozilla tried versioning JS (that was ripped out), and IE tried the compatibility mode switches.

It should be noted that later versions of IE eventually gave up and joined the crowd by having its UA string pretend to be Chrome, which pretends to be Safari, which pretends to be Firefox, which pretends to be Netscape.


I know that this is just a minor example, but since 2006 (after a few minor revisions) the new German orthography is pretty widely used, especially by people whose work involves writing.


> The WHATWG made a decision long ago that its standards would be descriptive (describe how people parse it), not prescriptive (describe how people should write it)

WHATWG standards are prescriptive in the usual sense; where they differ from other prescriptive approaches is in being grounded in implementation commitments. A standard that is written by a committee where no one is committing to actually implement it is useless, after all.


> Especially considering that what they want would be better specified as a parser for turning user input into URLs, rather than a demand that every tool interacting with URLs be able to parse malformed URLs.

The goal of the URL standard isn't to generate URLs from user input (for example, the browser address bar is out of scope), but instead to define how <a href="foo"> in HTML (or, rather, more generally, {http://www.w3.org/1999/xhtml}a@href from the DOM), url("foo") in CSS, Location: foo in HTTP, and similar get parsed. Large parts of the web rely on this otherwise-undefined error handling, which makes it worth standardising somewhere (as browsers cannot practically drop it, and most web content is targeted primarily at browsers, hence if you want to be compatible with the web you need to be compatible with browsers).
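To make that concrete, here is roughly the kind of fixing-up the parser is required to do. This is a console-style sketch using the WHATWG URL API (new URL), which implements the spec in browsers and in Node; the example inputs are mine:

    new URL("http:\\\\example.com\\a\\b").href;
    // "http://example.com/a/b"     -- backslashes treated as slashes for http(s)
    new URL("HTTP://EXAMPLE.COM:80/./a/../b").href;
    // "http://example.com/b"       -- case, default port and dot-segments normalised
    new URL("b?q=1", "http://example.com/a/").href;
    // "http://example.com/a/b?q=1" -- relative resolution, as in <a href="b?q=1">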

I'll also point out that the WHATWG scarcely exists: as a venue there's almost no formal organisation or high-level plan, it largely works from some shared values, and as a result there's a fair bit of variety between different groups of people working on different specifications, with some taking those values to further extremes than others.


> The goal of the URL standard isn't to generate URLs from user input […], but instead to define how <a href="foo"> in HTML

Ehm, that’s exactly user input. If it were machine input, there would be a normalized definition, and you’d use that internally.

And you wouldn’t define a URL spec this way, but you’d define a strict URL spec that browsers and all tools should use, and a legacy tool for converting URLs.

Then any tool that doesn't directly deal with user input can always be sure that the URLs it gets strictly follow the standard, and only the first tool in the pipeline has to deal with malformed input (which is what this is).
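A rough sketch of that split (the names and the canonical-form check here are purely illustrative, not from any spec):

    // First tool in the pipeline: lenient, repairs legacy/user input once.
    function normalizeLegacyUrl(input: string): string {
      return new URL(input.replace(/\\/g, "/")).href; // e.g. repair backslashes
    }

    // Every later tool: strict, fails early and loudly instead of guessing.
    function parseStrictUrl(input: string): URL {
      const url = new URL(input);
      if (url.href !== input) {
        throw new Error("URL is not in canonical form: " + input);
      }
      return url;
    }

    const canonical = normalizeLegacyUrl("http:\\\\example.com\\a"); // "http://example.com/a"
    parseStrictUrl(canonical);               // ok
    // parseStrictUrl("http:\\\\example.com") would throw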


> Ehm, that’s exactly user input. If it were machine input, there would be a normalized definition, and you’d use that internally.

That depends on the definition of "user"; most systems I'm used to treat only the current user of the device as the user, and everything else as untrusted input.

The problem with having a formal definition of the strict subset is that you end up with bugs (often security critical) in almost every implementation because of some case where the conversion produces something not in the strict subset. That's something that's happened with way too many formats.


Not really.

Usually, in such a situation, you can fail early, and you can log a warning.

The alternative, of trying to fix the developer’s error with heuristics, almost always ends with worse security-critical bugs.

There’s a reason people advocate for strict typing and proper errors, and not PHP’s "any undefined constant is a string whose value is its own name".


When I was making a simple EPUB app for my own use [1], I found it surprising that many publishers don't follow IDPF standards. Most do, but there is a significant number that mix and match all sorts of rules.

[1] http://jathu.me/bisheng/


Any pattern to it or anything in particular that struck you?


The problem is twofold. The first is that most authors and publishers are not very technical. The second is that tools like InDesign created EPUBs that would fail the IDPF epubcheck tool with errors [1]. Adobe has fixed some of these issues in later versions, but it's expensive for a publisher to re-export all their books.

What this leads to is basically fixing random, one-off issues depending on publisher and book. I would definitely suggest not writing your own reader and instead looking at something like Readium (and contributing if you have time!).

[1] http://mademers.com/two-more-indesign-cs5-export-to-epub-bug...


I oversee the epub production for an academic publisher, and the epub conversions are created by our typesetter at the end of production. They do indeed do a bit of custom programming (and sometimes manual labor) to make the files come out decent. They're supposed to run epubcheck on every file before delivery. I'm the guy who ends up fixing all those little random errors, so I have to insist that the code be reasonably clean and orderly (they work to a spec document that I prepared). Plus, accessibility is a concern. Poorly coded and disorganized files don't play well for people who need assistive reading systems.


This is good news. Making fully baked ePubs for the first few versions of the B&N Nook was more art than science due to some of the problems with ePubs.


I guess I've never really understood the need for ePub when PDF is now a standard. I'm sure I'm missing something; the question is: what is that thing?


PDFs bake in a page size. For 99% of them, 8.5 x 11.

Your typical e-ink device (a Kindle) is 6 inches diagonal, but there are also 6.8, 7.8, and 9.7 inch screens in somewhat common use, and more uncommon ones. Three sizes of iPads, dozens of sizes of Android tablets, phones and laptops, etc.

You can't make a single page size display well on all of this -- at 8.5 x 11, the text will be too small to read when the page is shrunk down to fit a 6in screen. Bake it at 6in and the pages are ridiculous on larger screens, but still too large for phones. You need a format that shapes the number and size of pages to fit the display -- enter ePub, which is based on HTML.
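Rough numbers behind the "too small to read" claim (back-of-the-envelope; this ignores margins and the slight aspect-ratio mismatch):

    const pageDiagonal = Math.hypot(8.5, 11); // US Letter diagonal, ~13.9 in
    const scale = 6 / pageDiagonal;           // fit the whole page on a 6 in screen: ~0.43
    console.log((12 * scale).toFixed(1));     // "5.2" -- 12pt body text ends up around 5pt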


The big thing is that PDF is page oriented and ePub is flow oriented. An ePub book can be reformatted for different page and font sizes. A PDF book is difficult to read if the screen is not large enough.


If you are reading an ePub format book on your Kobo and zoom in, then it'll work as expected. If you are reading a PDF format book, then after zooming in you will need to move the page around, making reading incredibly awkward / impossible.

In other words, ePub cares about where the chapters are and such, it is structured text, while PDF is a page description and what's on a single page will always be on a single page.


Aside from zooming, most ereaders offer the option to use a bigger font. That will result in a different text flow, which usually works. But it can break up text flow within a paragraph, within a sentence, and even within a word. I've seen words break up into single letters, each letter on its own line.


One thing I can tell you:

ePub is one of the best formats for blind people; they can read all the contents of most books very easily. As with HTML, you can basically just dump out plain text by ignoring all the tags.

PDF is the worst format that is commonly available; it isn't much better than a JPEG of text. There are many programs that try to convert PDFs back into plain text, and they all get confused by sufficiently complex PDFs. When you look inside a PDF, it is formatted as "put characters ABC at page location X,Y", and you then have to try to extract the text from that by manually tracing the flow / columns / etc.
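A toy sketch of what that looks like in practice. The "content stream" below is heavily simplified and the extraction heuristic is made up, but it shows the shape of the problem: positions are all you get, so reading order has to be guessed:

    // Two text runs painted at explicit (x, y) positions, deliberately out of order.
    const contentStream = `
      BT /F1 12 Tf 300 700 Td (world) Tj ET
      BT /F1 12 Tf 72 700 Td (Hello) Tj ET
    `;

    // Pull out "x y Td (text) Tj" triples...
    const runs: { x: number; y: number; text: string }[] = [];
    for (const m of contentStream.matchAll(/(-?\d+)\s+(-?\d+)\s+Td\s*\((.*?)\)\s*Tj/g)) {
      runs.push({ x: Number(m[1]), y: Number(m[2]), text: m[3] });
    }

    // ...then guess a reading order: top-to-bottom, then left-to-right.
    runs.sort((a, b) => b.y - a.y || a.x - b.x);
    console.log(runs.map(r => r.text).join(" ")); // "Hello world"

Real PDFs are much harder: runs can be single glyphs, columns interleave, and spacing and hyphenation have to be inferred from coordinates.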


> When you look inside a PDF, it is formatted as "put characters ABC at page location X,Y", and you then have to try to extract the text from that by manually tracing the flow / columns / etc.

And as you want to do more and more complex things with letter spacing you end up with ever shorter flows of characters; there are certainly some tools that end up creating every character individually so even extracting a single word in Latin script is hard.


When a PDF is properly formatted, when sentences are structured like sentences, and when it's a simple PDF like a book without tables and pictures, it works.

Sometimes you see a PDF where the end of a line of text means the end of a paragraph, structure-wise. And if you choose to use a bigger font, the text flow changes and sentences break halfway, leaving a half-empty line, etc. When this happens once every two or three pages, it's not a problem, but if it happens all the time it makes the book unreadable. This happens quite a lot at the end of a page as well.

If you have something like a sidenote or footnote, it is very likely that it breaks up the text flow, and not in a nice way after a paragraph, but mid-sentence. This can be very confusing, especially if you're reading in a different language with complex sentences.

Then there is another level of breaking up a sentence halfway, where words are not formatted as words but as letters, and where each line breaks up words. I've seen this with an O'Reilly book. They fixed it within a week, so that was excellent service, but it shows how this can happen.

Images can be a problem, depending on their size. Most ereaders are underpowered and can have difficulty processing large images.

Tables will be a mess in 99% of cases, because they consist of text and lines and do not have any structure. It's always a surprise how they will show up. It could mean that half of the table is not even shown, or that column 1 is shown, then column 2 below it, and you don't see the relation anymore. Sometimes column 2 is partly shown before column 1.

Some PDFs simply crash the ereader program.


PDF sucks on mobile and ebook readers. Seriously sucks.

I apologize for the language but I find no other words that work.


They're both document formats, but they come at it from opposite directions. PDFs contain low-level drawing instructions for displaying their content, but usually know far less about the content itself. ePub is the other extreme -- it contains structured content, but allows more leeway in the presentation of that content.

Both formats have evolved over the years to build on their strengths and shore up their weaknesses -- for example, you can now embed structured content in a PDF to make it more ePub-like (and potentially reflow text in a viewer), but in practice few PDFs take advantage of this. I don't know nearly as much about ePub, but I would imagine that over the years it has evolved to allow for more control over the visual presentation of the content.

Ultimately PDF is a more universal format but ePub is a better fit for most digital books. Both formats will probably coexist for a long time.


My thought has been more like, "I guess I've never really understood the need for ePub when HTML is now a standard." … which is exactly what this announcement is about.


Ideally, ePub would be "reflowable HTML plus baked-in things you would obviously want in a book, made dead simple for author-editor-publisher workflows that will never involve a programmer in a million years". Proper footnotes that appear on the same page as the thing they're footnoting, for example.

In practice, though, this is still a shitshow even in ePub 3: 99% of publishers just ship a pile of endnotes, and consequently certain authors like Terry Pratchett are a god-awful experience in ebook format. Jumping back and forth every 30 seconds sucks.

There are work-arounds that can be done with JS, or certain reader-specific markup you can use for iBooks to at least get pop-up footnotes, but most publishing houses don't have the technical know-how or desire to invest that much time in doing the right thing. There needs to be dedicated markup and a better standard for how readers should display it whenever possible.

A real failure of the standardization committee, and it will likely only get worse with a group that cares even less about books specifically. The standard continues to be driven by people more interested in gee-whiz features than solving longstanding problems in replicating table-stakes features of print.


Readers complain that ebooks cost too much, and then readers complain that publishers don't want to spend money on technical stuff.

But some of the problem is that there's not a lot of payoff for technical development, because only some platforms will support it. The Kindle doesn't handle Javascript or math at all, and has very poor support for tables, SVG art, and video. So there's half your market gone right there.


> Readers complain that ebooks cost too much, and then readers complain that publishers don't want to spend money on technical stuff.

To be clear, I'm complaining that the standard overcomplicated certain desirable features to the point where publishers would have to spend money on technical stuff, when they shouldn't have to.

I fully get why publishers don't want to have to engage in fiddly, expensive bespoke development per-title.


> A real failure of the standardization committee, and it will likely only get worse with a group that cares even less about books specifically.

W3C WGs have hugely varied membership, and I wouldn't expect the composition of the WG to change much just because of a change of SDO.


One thing is packaging: there's a desire to have a single file to ship to users to put on their devices instead of a whole folder (and changing that at this point is probably too painful to be worth it), and then you need to work out how to address (in URLs) other things within the package.

Of course, being packaged also makes it much easier to apply DRM (and, perhaps thankfully, means we don't actually need to address DRM for HTML content directly, which helps the fight against it being in browsers).
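For context, an EPUB container is just a ZIP with a conventional layout along these lines (the OEBPS directory name and the file names here are a common convention, not a requirement):

    mimetype                  <- "application/epub+zip", stored uncompressed, first entry
    META-INF/container.xml    <- points at the package document below
    OEBPS/content.opf         <- manifest of resources + spine (reading order)
    OEBPS/chapter1.xhtml
    OEBPS/styles.css
    OEBPS/cover.jpg

The relative URLs between those pieces are exactly the addressing question mentioned above.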


html can contain arbitrary wacky things; epub is a subset of html for low-powered readers, with standard bits like chapterization and footnoting.


Well, that was the plan, and it was largely even true for EPUB 1 and EPUB 2.

EPUB 3 wound up being so baroque and overblown that there's still not even one fully-compliant implementation, more than 5 years after the standard was published. They had to back off some time ago and bless efforts that implemented only subsets of the full spec, but the damage had been done.

It's the second-system effect from hell (even though it was technically the third system).


I think this is actually a perfectly reasonable question.

ePub allows content to reflow in order to fit the screen size, while PDFs are a fixed size.

It's not typically a problem if you're reading the book on a desktop or laptop, but it becomes frustrating when you're on a tablet or mobile device.


More generally, PDF is a vector graphic format, not a text format. A bunch of things we absolutely take for granted, like selection and copying, simply don't work reliably. (I've hit a number of glitches with the JS-implemented PDF readers that make me pretty sure they're just OCRing the rendered page, which is a ridiculous state of affairs.)


> they're just OCRing the rendered page

Not quite. Usually the PDF specifies each character (although the reader still has to do a slightly wacky conversion from glyph name to unicode character) but the position is specified as an (x,y) position, so the reader has to reconstruct the order that they come in, add spaces and newlines, etc.


EPUBs have reflowable text (HTML).


ePub isn't for the same use case as the one PDF is optimized for; it's much closer to the HTML use case (device size and aspect independent hypermedia).

Which is why ePub is basically HTML + some media formats + packaging.



