
As a long-time PGDP volunteer and a sometime Standard Ebooks one, I would say the connection isn't close. The "distributed proofreaders" at the wonderful PGDP put zillions of hours into cleaning up and formatting books, which are then fed to Project Gutenberg for distribution. Standard Ebooks picks up the PG books and re-formats them to its own standards.

Back in the day I was the "post-processor" for a number of PGDP books. This meant I received the page scan files, which had already been through five (5!) separate passes by volunteer proofers, and compiled them into a single etext, initially as HTML and later as an ebook.

The fact that Standard Ebooks finds typos in PG books (and they do, and kudos to them for their work) simply underscores the huge difficulty of cleaning OCR'd text. In the example on the linked Standard Ebooks front page, the typo of "tne" for "the" is a very typical "scanno" as they are called at PGDP. Both the software and the wetware have overlooked the missing vertical stem of the letter "h".

However, that particular scanno should never have reached distribution at PG, because the last two volunteer passes at PGDP _require_ the volunteer to apply spellcheck before committing a page as complete, plus the post-processor should use spellcheck on the finalized book. That example typo must have come into the PG library at least 20 years ago, or else it didn't come through PGDP.

From experience I can say that as an organization Standard Ebooks are much more tightly managed than most open-source volunteer outfits, and if you can fit into their system, you can put in very satisfying hours building books there. (Despite having formatted some (I thought) handsome works for PGDP, I couldn't meet the standards of Standard Ebooks, or maybe I was burned out, and didn't stay with them.)



