Nice to provide hardware hints and designs but geez that is almost the least of it. Cleverest hardware still only gets you a thumb drive full of page images. Now what? There needs to be a software workflow ending with a readable book in PDF, EBOOK or MOBI format, and there are many, many choices to be made along that path.
Back in 2012, there was a guy who started an open source project that did exactly this - he wrote it specifically for the DIY Book scanner. It had a local Django project as the interface. I don't remember the details, but it did a decent job of taking the images, OCR'ing them and creating an output PDF.
I believe he abandoned the project some years later as life got busy and he never found enough volunteers to help him.
Would have to go through my email records to find the name of the project.
Spreads[0] is probably what you are referring to. Some backstory:
I saw the diybookscanner community - which at that point mostly had Daniel Reetz [1] as its active contributor - struggle with mechanical contraptions for triggering cameras, with very little software experience to draw on. I built a simple proof of concept to reliably trigger cheap consumer cameras using software. I built it on CHDK [2], the Canon Hack Development Kit, an alternative firmware for cheap consumer cameras. The proof of concept worked.
I then had a fairly large number of book scanner kits built and shipped, mostly around the EU [3]. More a labor of love than a business really, even if it was formally under an LLC umbrella. Johannes was initially just a customer. He wanted to build a better software solution and, in the spirit of the project, did so as free software. I tried to support him in this as well as I could: setting up build infrastructure, trying to reel in more people, getting him some cameras to test, getting the amazing CHDK people to port to new camera models, ...
Then real life intervened indeed.
Johannes, if you read this, I'm still grateful for the experience of having worked with a great developer like you!
[EDIT] And of course, I should also mention Dan Reetz' incredibly inspiring work bootstrapping an incredible open hardware project! Hats off!
Hi Mark! We had electronic and USB triggering working with SDM and CHDK before you joined, but no image transfer or control of settings by USB. We deliberately pursued mechanical triggering for places where a computer and crazy firmware weren't an option. I donated quite a few scanners to projects and people who simply couldn't use that stuff at the digitization site.
Johannes (spreads) was one of the most inspiring people I've ever worked with. I'm so thankful for the energy and intellect he brought to the project, and for the software he built. I donated a pair of DSLRs to him as a thank-you. Last I heard he was still working in a related space, but at a higher level.
Personally, I left the project to join Apple (they refused to let me continue any work on the open-source project while I was employed there), and gave Jonathon (tenrec) control. He redesigned the scanner again and sold kits, as well as producing a Raspberry Pi-based controller with nice software. It seems he has closed the store.
Hey Mark, hey Daniel, Johannes here, thanks for the praise, you're making me blush :D. The inspiration was mutual, I remember the time working on spreads and with you guys very fondly, learned a lot from it.
I'm still active in the "digitizing books" sector, albeit now officially employed at a library and more concerned with what we can do with the books after they're scanned :-)
> (they refused to let me continue any work on the open-source project while I was employed there)
I somehow thought that was illegal for residents of California, and IMHO it should be illegal nationwide on general "not indentured servitude" grounds. Then again, I guess who wants to go up against Apple's legal team to find out?
Rereading my previous comment, I realise that part of it could be misinterpreted. The diybookscanner project was wonderful to be part of, to contribute to.
One thing I realise was particularly impressive about diybookscanner.org was how much time you spent making it an environment friendly to broad experimentation and tinkering. Exploring broadly was absolutely necessary for a project like this. Mechanical triggering, the SDM experiments I had forgotten about, lighting, glass experiments, and more.
You sowed some powerful seeds. Your effort nursed diybookscanner.org into something that still speaks to the imagination of so many people. I feel privileged to have been part of that, and I'm more than happy to give you full credit.
I don't know if you can find the code through there, but I'm pretty sure he had made it free. I think spreads is a bit newer.
Edit: Found some more info. It did indeed use Scantailor in the backend. His SW was more of a Web based frontend to all the parts. You can see a video demo of it here:
Paper Upgrade was an awesome project and the author changed my life. I met my future wife on the plane while flying to visit him and donate a book scanner.
Not the same as Scan Tailor [1,2]? That was referenced from the Instructables link cited earlier. It was apparently a comprehensive toolkit in C++ and Qt, now archived.
Right, step 1 -> get page images, step 2 -> author images into book file. While OCR is obviously useful for search, a rotated phone screen will let you comfortably read a PDF book just fine, unless you are talking about something like a textbook, in which case you probably wanted a tablet anyway.
I wrote up a guide on the authoring process using FOSS tools for some Digital Humanities folks a couple years ago: https://github.com/wikey/bookscan
It gives some background on the problem and covers a Scantailor (page crop, rotate, deskew), pdfbeads (compression, book metadata) authoring workflow, with pdftk for some general odds and ends.
I scan heavily from academic libraries in order to contribute to LibGen, but even with Scantailor it is very time-consuming. For example, if you are scanning scientific literature from the Eastern Bloc, it was often printed on low-quality, speckled paper, which means Scantailor often identifies too much of the scan as the page block, and then you have to manually tweak the rectangle.
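To illustrate why speckle inflates the detected page block, here is a minimal sketch. It is not how Scantailor works internally (I have not checked); it assumes a hypothetical binarized scan given as a list of rows with 1 for dark and 0 for light, and a made-up `min_dark` threshold. A naive bounding box stretches to every stray speck, while requiring a few dark pixels per row and column keeps the box on the text block.

```python
def content_box(binary, min_dark=2):
    """Return (top, left, bottom, right) around rows/columns that contain
    at least min_dark dark pixels, or None if nothing qualifies.

    binary: list of rows, 1 = dark pixel, 0 = light pixel (assumed input).
    """
    row_counts = [sum(row) for row in binary]
    col_counts = [sum(col) for col in zip(*binary)]
    rows = [i for i, c in enumerate(row_counts) if c >= min_dark]
    cols = [j for j, c in enumerate(col_counts) if c >= min_dark]
    if not rows or not cols:
        return None
    # bottom and right are exclusive, like Python slices
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


# Toy 6x6 scan: a 2x3 text block plus two isolated specks in the corners.
scan = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
naive_box = content_box(scan, min_dark=1)     # specks count as content
filtered_box = content_box(scan, min_dark=2)  # specks are ignored
```

On real speckled paper a fixed threshold like this would probably need tuning per book, which is roughly the manual tweaking described above.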
The simplest option would be to turn the images into a multi-page raster PDF, using freely licensed Linux command-line tools for PDF generation. That will of course result in a rather large file compared with doing OCR, but it might be the best preservation method for books with illustrations, unusual fonts, catalogs, mixed text and photos, etc.
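For a sense of what such a raster PDF contains, here is a rough sketch in plain Python. It is not one of the command-line tools mentioned, just an illustration using only the standard library. Assumptions: each page is a hypothetical in-memory grayscale image (a list of rows of 0-255 values), and one pixel is mapped to one PDF point, which real tools would replace with a proper DPI setting. `build_raster_pdf` is a name made up for this example.

```python
import zlib


def build_raster_pdf(pages):
    """Return PDF bytes with one page per grayscale image.

    pages: list of images, each a list of rows of 0-255 gray values.
    Assumption: 1 pixel = 1 PDF point (72 dpi).
    """
    def stream(dict_body, data):
        # A PDF stream object: a dictionary followed by the raw bytes.
        return (b"<< " + dict_body + b" /Length %d >>\nstream\n" % len(data)
                + data + b"\nendstream")

    n = len(pages)
    # Objects 1 and 2 are the catalog and page tree; each page then uses
    # three objects: the page itself, its content stream, and its image.
    kids = b" ".join(b"%d 0 R" % (3 + 3 * i) for i in range(n))
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [" + kids + b"] /Count %d >>" % n,
    ]
    for i, page in enumerate(pages):
        h, w = len(page), len(page[0])
        first = 3 + 3 * i
        objects.append(
            b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] "
            b"/Resources << /XObject << /Im0 %d 0 R >> >> /Contents %d 0 R >>"
            % (w, h, first + 2, first + 1))
        # Scale the unit-square image to fill the page, then paint it.
        objects.append(stream(b"", b"q %d 0 0 %d 0 0 cm /Im0 Do Q" % (w, h)))
        pixels = zlib.compress(bytes(v for row in page for v in row))
        objects.append(stream(
            b"/Type /XObject /Subtype /Image /Width %d /Height %d "
            b"/ColorSpace /DeviceGray /BitsPerComponent 8 /Filter /FlateDecode"
            % (w, h), pixels))

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(out)
```

Writing the result out is then just `open("book.pdf", "wb").write(build_raster_pdf(pages))`. Every pixel is stored (losslessly compressed, but still a full image per page), which is where the large file size comes from.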
I am not clear to what extent the existing workflow de-skews the camera images to deal with page curvature towards the spine.
I think I recall the Internet Archive having an open source design for something similar to this? And other projects which accomplish generally the same idea.
Just page images? No, the Czur software with its OCR generates searchable PDFs and Word or Excel files with no further input. With careful attention to the scanned area, it's easy to get .xlsx files needing zero or minimal editing. The other advantage of the Czur is the automatic correction for curvature when scanning books with narrow margins on either side of the spine.
No, I have no connection with Czur - just an enthusiastic user!
Edit: "Finishing a book" is discussed at a very superficial level here: https://vimeo.com/user33752051 at about 1:00:
"In order to turn these raw images into an ebook, the very minimum you need to do is A, you need to rotate them, B you need to crop them down to use the page [?], and C you need to combine them into one document like a PDF... You can do OCR to make it searchable ... color correction... de-skewing, de-warping ..."