DIY Book Scanner (diybookscanner.org)
340 points by bcaa7f3a8bbc on June 1, 2021 | 124 comments



My first job out of college was scanning books for the Internet Archive down in the basement of the Library of Congress. Their scanning machines used a foot pedal to raise and lower the glass platen, so I'd use one hand to flip the page and wiggle the cradle to get things nice and flat and the other would snap the photo. You can get pretty fast after a while, but boy is it mindless. Older books that had been rebound a couple times already were the hardest to work with as you have the least amount of margin. There were a bunch of different-sized dowels that we would put under the spine in the cradle so the glass could gain a couple millimeters of margin, just enough to avoid cutting off text. Worst case scenario, the book had to be unbound in order to capture it. I did get to flip through a lot of cool old illustrated catalogues like this: https://archive.org/details/illustratedcatal00keil/page/14/m...


Later on I worked briefly for the Archive. The scanner I designed later became their "ttscribe". It was fascinating to see their process up close.


(I'm the founder of DIY Book Scanner, ran it from 2009 to 2015)


I currently work for the University I study at in the (biomedical) library. I scan lots of old journal articles and periodicals for academics who need them for their research. A typical job might be scanning an article from 1960 on Potato Research for an agriculturalist, or a graph of human energy expenditure, or an analysis of fibres for forensic medicine. We get researchers from all around the world requesting articles on all sorts of topics from our archives. It's a Sandstone university, so we have some very old collections that are definitely getting crumbly!

Other than locating the books, by far the most tedious aspect is the scanning. We only have a terrible flatbed scanner, that is completely unforgiving - it only has a 25 page email limit, otherwise you have to split it into separate emails. And if you mis-scan a page accidentally (some of the book margins are super tight), then you have to restart the entire scan - there's no delete page button!


> And if you mis-scan a page accidentally (some of the book margins are super tight), then you have to restart the entire scan - there's no delete page button!

Sounds like you need a license for Adobe Acrobat Pro or some other application that will let you reshuffle/insert pages.


Oh, I understand you! I would have loved to have design hints like that for one summer job I had. I was tasked with scanning medical books with thousands of pages. I had few time constraints and could do it wherever I wanted. I was paid $80 per book, which I found crazy huge before starting. Even by optimizing every parameter of the hardware, software and my workspace, I couldn’t do more than a few 2-hour sessions per day and it took me several days per book. A boring job if there ever was one. I would certainly use it as a zen practice today but as a teenager, not really needing the money, I couldn’t find any value in it (after optimization).


Books that aren't rare, ie aren't valuable as artefacts, you would surely cut off the spine and run through an automated scanner?

But then medical texts probably cost way more than $80. How much was your boss making from those scans? Were they taking account of copyright law?


Wow that's awesome! I take it you're responsible for a chunk of the books available now on openlibrary.org?

When scanning books like that did you ever see anything interesting or are you so zoned out you don't really pay attention?


A very small chunk, I only lasted a couple months. Most of it was pretty boring, think volume after volume of copyright records or issues of the national stamp collector's magazine. Eventually I started working on some of the contract work they did for other agencies, e.g. declassified FBI case reports. The best was the stuff for the Smithsonian, which often included beautiful naturalist illustrations. I'm not sure how much of that stuff was public domain though.



Back when this was a more popular problem, I saw a number of projects that used a rubber tipped stick to automate page turning.

I wonder why this never took off?


I'd really like to see this too.

Building a scanner would be interesting, but the mind-numbing idea of turning pages manually isn't very enticing


Were you using gloves or something like that?


Unsure about IA, but gloves are typically advised against[0] unless you have a suspicion that the book will be dangerous (arsenic ink in bindings[1], dust, mold or frass (sadly)[2]). Hand-washing before is typical advice but YMMV

[0] https://www.nationaltrust.org.uk/features/why-wearing-gloves...

[1] https://daily.jstor.org/some-books-can-kill/

[2] https://www.ifla.org/node/93094


I get the points against gloves in your links, but from my experience my hands will constantly leave fingerprints and oil on the books no matter how much I wash them.

How do they avoid this issue? Or just don't bother?


The advice I was given when handling some of the books at the University of Toronto's rare book collection was to simply not directly touch the parts that matter if it can be avoided. Turn pages via their margins.


Ripping is usually more of a problem than some finger oil, and gloves make you less sensitive and more likely to rip or destroy stuff.


> "Ripping is usually more of a problem..."

Accidentally ripping a page while turning the pages too quickly.


Nope. I'm not sure if this is the right way to put it, but we were basically a scanning factory. I think the really sensitive documents got routed elsewhere. Books and folios that had large format foldouts got more specialized, ahem, white glove treatment. Many scanners wore these little textured rubber finger tips: https://rubber_finger_tips.jpg.so/


That sounds incredible


I built one of these out of pine 2x4s and plywood. I thought it would be cheaper than buying one (I was wrong) but I'm also not a skilled woodworker and had to buy most of the tools.

It works quite well and I digitised dozens of textbooks I'd purchased and needed to reference but couldn't carry around every day while finishing my masters. My one had 2 Nikon mirrorless cameras controlled via Pi-Scan. https://github.com/Tenrec-Builders/pi-scan

I had a smaller toggle switch wired to the GPIO pins so I could click the scan next button without having to take my hands off the book. Once I got used to the workflow I could scan about 1000 pages per hour while watching Netflix.

I replaced it with a Czur scanner that isn't as good, but is a lot smaller and is good enough for my less demanding needs now that I'm not doing a masters degree :D


The dual camera is a design choice I hadn't thought of. I've thought of scanning a couple books, and that's probably the trick for me. Though maybe I'll rotate the book and scan / rotate images separately.


It let me capture the pages with the correct orientation and the cameras have a fixed focus on the platen so it works really well. Then Scantailor can crop automagically and deal with the rest.


What was the purpose? Did the digitized books go to the public domain/your university, or something, or was it purely for personal research?


Just for personal use as all the books were still in copyright and I own the paper versions, it was a (probably legal) fair use of them purely for reference while studying.

I often needed to find information in the books and couldn't reasonably carry them all with me every day between work and uni.


In my opinion, buying a book (or other media) should give one right to a digital copy from any source.


I'm a heavy believer in more rights when you buy a digital asset, but I don't think it's reasonable that buying something in one medium should give you automatic freebies in another medium. That other medium still needs to be produced, at a cost to the publisher. Does buying the book give you the rights to the audiobook as well?

That said, there should certainly be no restrictions to creating your own versions of things you own in other media.


Print houses are increasingly using the same source and layout files for printed books as they are for ebooks. This means that there is little to no difference or extra cost to produce the alternate media version here.

I totally agree that an audio book has an entirely different production path and its own entirely different staff and company that needs to make its own costs back. But that argument is becoming less and less relevant for physical vs digital books.

Many non-traditional/smaller publishers and printers, such as No Starch Press, offer free copies of the ebook version when you purchase the physical copy and offer the digital copy at a reduced price compared to print.


I've not explained well, I don't expect the producer to have an obligation to create a copy, just to exhaust their right to sue you for having a copy. But, if the same work _is_ available in another format then I think it would be reasonable to expect the seller to provide it (it costs them bandwidth, nothing else over what they're already spending).

Audiobooks I'd say are new works. If you buy it on a medium you should be allowed to rip it and vice-versa, however (currently not allowed in UK).


> But, if the same work _is_ available in another format then I think it would be reasonable to expect the seller to provide it (it costs them bandwidth, nothing else over what they're already spending).

Sure, but it's worth something to you.

I just downloaded a productivity app. The free version is great, so I don't think I need to upgrade, but I noticed that the paid version includes a Pomodoro timer. Would it cost anything for the company to turn on that feature for me? No, of course not. It's clearly just an attempt to make more money. But an attempt to make more money is exactly what they have a right to do. That's why they're in the business.


I see where you're coming from, but format-shifting is different: it's a different class of product from a locked extension of a digital good.

The only inhibition to me making or acquiring a copy of a book in my preferred format is copyright law, I already have a copy on paper; I can download or make a copy easily. That should be allowed as I have paid for a licence to the work (and I think is allowed under Fair Use in USA). As the copyright holder already has a digital copy the restriction of it is silly -- for the demos, who supposedly give the rights in copyright, the benefit lies in making that copy of the work available rather than forcing the extra work of having someone scan it and upload it and make it available. This latter route also enables copyright infringement far more readily than the former does.


Meanwhile in Japan, people in the book scanning community (it exists) often just cut off the book's spine, scan the pages using a normal scanner, and throw the book away once all the pages are scanned.

People there (rightly) value room space more than books. It's called "Ji-sui" (scanning by oneself) and gear recommendation sites like [1] are abundant. Another reason for the prevalence of "Ji-sui" was the poor availability of ebooks, although that reason is less relevant today.

[1] http://monomania.sblo.jp/article/60578693.html


One business I kinda want to exist is a book warehouse/scanning operation. I send them boxes of books; they give me an app with access to digital versions of every book I send them. The whole operation is somewhere in, say, Nebraska, so storage cost is very low.


Similar scanning services exist in Japan as well (e.g. [1]). The difference is that they kindly discard the books for you once they have been scanned!

[1] https://www.bookscan.co.jp/


Something like this?

https://1dollarscan.com/sp/


$.01 per page and they are based in Fremont, CA.

This looks like a cheaper way to get most ebooks and you can ship the books direct from Amazon. If enough people did this, maybe ebook pricing will come down to something rational.


>maybe ebook pricing will come down to something rational.

Why don't you think it's rational? Note that the cost to print and distribute a printed book is only about $2 of its cost. Or don't you think authors, editors, etc. should be able to earn money?


I frequently see Kindle pricing as higher than print. If ebook prices were $2 less, I'd be more inclined to buy some. If they were 50% less, I'd buy 10x more. If the author got a higher cut, I'd also be more inclined to give these middlemen my money. As it is, I buy used print books most of the time...


Book pricing can be weird at times but Kindle is usually cheaper than new print. But, yes, more than used. Though I'll often buy Kindle for fiction anyway because I don't really want physical books. (And, to be honest, I read a lot fewer books than I used to in any case. I'm definitely time/attention limited rather than money. Ebooks could be free and I wouldn't read a lot more.)


For a while there existed services that did this with your CD collection. You'd send them a crate of your music CDs and they'd send back all the music ripped as mp3s.

Obviously, they didn't re-rip music they'd already ripped so technically for popular music, you got 'someone else's' mp3s.

---

My multi function laser printer has a duplex scanner on it. It can scan pages at quite a rate. The problem is not the scanning, but the accurate OCRing, and for things like magazines, the storage of all those high resolution pages. Right now, I cut out the articles I want to keep from my monthly mags, and just scan those. It seems like a fair compromise right now.


Also, in UK with our strict "fair-dealing" (as opposed to USA's Fair Use) none of this is lawful - including ripping CDs for personal use (though that format-shifting was briefly allowed for a couple of years).

Disclaimer: This is my personal opinion; not legal advice.


When I was in college the iPad had just come out. I was determined to save money so I snagged an iPad to use as my omnitextbook and built a scanner based upon one of the schematics on this site with a friend.

I would usually be the guy that made an email group for everyone to share notes and questions for classes pre all of the blackboard garbage, so I started leveraging those connections and would ask if anyone would let me borrow their book for a scanned version in return. My friends and I would have a book scanning party and would help to scan each others’ books. We’d grab some drinks, find some favorite albums and hang out all night until the wee hours taking turns scanning texts.

After one semester the setup paid for itself. I would supplement some texts with learning trackers like bitme before amazing resources came around like libgen. Good times.


Thanks for sharing your story, I'm so glad to hear it was useful to you and your crew. Were you ever active on the forums?


Nice to provide hardware hints and designs but geez that is almost the least of it. Cleverest hardware still only gets you a thumb drive full of page images. Now what? There needs to be a software workflow ending with a readable book in PDF, EBOOK or MOBI format, and there are many, many choices to be made along that path.

Edit: "Finishing a book" is discussed at a very superficial level here: https://vimeo.com/user33752051 at about 1:00:

"In order to turn these raw images into an ebook, the very minimum you need to do is A, you need to rotate them, B you need to crop them down to use the page [?], and C you need to combine them into one document like a PDF... You can do OCR to make it searchable ... color correction... de-skewing, de-warping ..."


Back in 2012, there was a guy who started an open source project that did exactly this - he wrote it specifically for the DIY Book scanner. It had a local Django project as the interface. I don't remember the details, but it did a decent job of taking the images, OCR'ing them and creating an output PDF.

I believe he abandoned the project some years later as life got busy and he never found enough volunteers to help him.

Would have to go through my email records to find the name of the project.


Spreads[0] is probably what you are referring to. Some backstory:

I saw the diybookscanner community - which at that point mostly had Daniel Reetz [1] as its active contributor - struggle with mechanical contraptions for triggering cameras and very little software experience. I built a simple proof of concept to reliably trigger cheap consumer cameras using software. I built it on CHDK[2], the Canon Hack Development Kit, alternative firmware for cheap consumer cameras. The proof of concept worked.

I then had a fairly large number of book scanner kits built and shipped mostly around the EU [3]. More of a work of love than a business really, even if it was formally under an llc umbrella. Johannes initially was just a customer. He wanted to build a better software solution, and within the spirit of the project did so as free software. I tried to support him at this as well as I could, setting up build infrastructure, trying to reel in more people, getting him some cameras to test, get the amazing CHDK people to port to new camera models, ...

Then real life intervened indeed.

Johannes, if you read this, I'm still grateful for the experience of having worked with a great developer like you!

[EDIT] And of course, I should also mention Dan Reetz' incredibly inspiring work bootstrapping an incredible open hardware project! Hats off!

[0] https://github.com/DIYBookScanner/spreads

[1] https://danreetz.com/

[2] https://chdk.fandom.com/wiki/CHDK

[3] http://diybookscanner.eu


Hi Mark! We had electronic and USB triggering working with SDM and CHDK before you joined. But no image transfer or control of settings by USB. We deliberately pursued mechanical triggering for places where a computer and crazy firmware wasn't an option. I donated quite a few scanners to projects and people who simply couldn't use that stuff at the digitization site.

Johannes (spreads) was one of the most inspiring people I've ever worked with, so thankful for the energy and intellect he brought to the project- and the software he built. I donated a pair of DSLRs to him as a thank-you. Last I heard he was still working in a related space, but at a higher level.

Personally, I left the project to join Apple (they refused to let me continue any work on the open-source project while I was employed there), and gave Jonathon (tenrec) control. He redesigned the scanner again and sold kits as well as produced a Raspberry Pi based controller with nice software. Seems he has closed the store.


Hey Mark, hey Daniel, Johannes here, thanks for the praise, you're making me blush :D. The inspiration was mutual, I remember the time working on spreads and with you guys very fondly, learned a lot from it.

I'm still active in the "digitizing books" sector, albeit now officially employed at a library and more concerned with what we can do with the books after they're scanned :-)


> (they refused to let me continue any work on the open-source project while I was employed there)

I somehow thought that was illegal for residents of California, and IMHO should be illegal nationwide on general "not indentured servitude" grounds. Then again, I guess who wants to go up against Apple's legal team to find out


Hello Dan! Nice to see you here!

Rereading my previous comment, I realise that part of it could be misinterpreted. The diybookscanner project was wonderful to be part of, to contribute to.

One thing I realise was particularly impressive about diybookscanner.org was how much time you spent making it an environment friendly to broad experimentation and tinkering. Exploring broadly was absolutely necessary for a project like this. Mechanical triggering, the SDM experiments I had forgotten about, lighting, glass experiments, and more.

You sowed some powerful seeds. Your effort nursed diybookscanner.org into something that still speaks to the imagination of so many people. I feel privileged to have been part of that, and I'm more than happy to give you full credit.


Actually, it was a project called Paper Upgrade. Here is an old archive link:

http://web.archive.org/web/20140101000000*/http://www.paperu...

I don't know if you can find the code through there, but I'm pretty sure he had made it free. I think spreads is a bit newer.

Edit: Found some more info. It did indeed use Scantailor in the backend. His SW was more of a Web based frontend to all the parts. You can see a video demo of it here:

https://www.youtube.com/watch?v=Ad7aFYdbDos

Start at about 4:40.

The source is here:

https://code.google.com/archive/p/diy-ebook-creator/


Paper Upgrade was an awesome project and the author changed my life. I met my future wife on the plane while flying to visit him and donate a book scanner.


Wow. He spoke highly of you when I met him. My guess is you had not yet married, though - it was shortly after your visit - perhaps a month or two.

And yes, Paper Upgrade was awesome (especially if he did all the work on it). I was sad to hear he had shut it down.

Oh, and I'll make a note to myself never to work for Apple ;-)


Not the same as Scan Tailor[1,2] ? Which was referenced from the Instructables link cited earlier. That apparently was a comprehensive toolkit in C++ and Qt, now archived.

[1] https://scantailor.org/

[2] https://web.archive.org/web/20210304015939/https://github.co...


There are a couple folks that forked scantailor. I'm not sure the status of those. Here are a couple: https://github.com/4lex4/scantailor-advanced https://github.com/trufanov-nok/scantailor-universal


Right, step 1 -> get page images, step 2 -> author images into book file. While OCR is obviously useful for search, a rotated phone screen will let you comfortably read a pdf book just fine unless you are talking about something like a textbook, in which case you probably wanted a tablet anyway.

I wrote up a guide on the authoring process using FOSS tools for some Digital Humanities folks a couple years ago: https://github.com/wikey/bookscan

It gives some background on the problem and covers a Scantailor (page crop, rotate, deskew), pdfbeads (compression, book metadata) authoring workflow, with pdftk for some general odds and ends.


scantailor will get you most of the way there. the original project is dead but there are a few forks on github. It has been a while since I did any serious scanning so I can't remember which version I used. https://github.com/4lex4/scantailor-advanced https://github.com/trufanov-nok/scantailor-universal


I scan heavily from academic libraries in order to contribute to LibGen, but even with Scantailor it is very time-consuming. For example, if you are scanning scientific literature from the Eastern Bloc, it was often printed on low-quality, speckled paper, which means Scantailor often identifies too much of the scan as the page block, and then you have to manually tweak the rectangle.


The simplest of which would be to turn the images into a multi-page raster PDF, using freely licensed Linux-based command-line tools for PDF generation. That will of course result in a rather large file size vs doing OCR, but might be the best preservation method for books with illustrations, unusual fonts, catalogs, mixed text and photos, etc.
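For what it's worth, here is a minimal sketch of that raster-PDF step, assuming the pages are JPEGs in one directory and that img2pdf and a sort that supports -V (GNU coreutils) are installed. The function names list_pages and build_pdf are made up for illustration. The natural sort is there because a plain *.jpg glob would put k1000.jpg before k102.jpg.

```shell
#!/bin/sh
# Hypothetical helper: print the .jpg pages in DIR in natural (numeric-aware)
# order, one per line. Assumes file names contain no newlines.
list_pages() {
    for f in "$1"/*.jpg; do
        # Skip the literal pattern when nothing matches.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done | sort -V
}

# Hypothetical helper: combine the pages into one raster PDF with img2pdf,
# which embeds the JPEG data as-is instead of re-encoding it.
build_pdf() {
    dir=$1
    out=$2
    # Collect the sorted page names as positional parameters.
    set --
    while IFS= read -r page; do
        if [ -n "$page" ]; then
            set -- "$@" "$page"
        fi
    done <<EOF
$(list_pages "$dir")
EOF
    if [ "$#" -eq 0 ]; then
        echo "no .jpg pages found in $dir" >&2
        return 1
    fi
    img2pdf "$@" -o "$out"
}

# Usage (not run here): build_pdf kaksi TheBook.pdf
```

Since the images are embedded unchanged, the PDF ends up roughly as large as the sum of the page images, which is the file-size tradeoff mentioned above.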

I am not clear on to what extent the existing workflow does a de-skew of the camera images to deal with page curvature towards the spine.

I think I recall the Internet Archive having an open source design for something similar to this? And other projects which accomplish generally the same idea.


Just page images? No, Czur software with its OCR generates searchable pdfs and Word or Excel files with no further input. With careful attention to the scanned area, it's easy to get .xlsx files needing zero or minimal editing. The other advantage of the Czur is the automatic correction for curvature when scanning books with narrow margins on either side of the spine.

No, I have no connection with Czur - just an enthusiastic user!


This is an old article, so maybe some software isn't the best option nowadays, but you can get the idea of postprocessing: http://natecraun.net/articles/linux-guide-to-book-scanning.h...


https://www.instructables.com/Bargain-Price-Book-Scanner-Fro...

"Step 10 - Post-processing" has some steps


Video looks interesting, I'll check it out!


I'm very interested in getting into archival (getting started this month after a few more conversations).

Your buy button[0] is broken. You're potentially missing out on a few sales due to this.

Is 2x 4GB SD card sufficient for your purposes? I've been quoted 50MB TIFF images as a standard, and a lot of books wouldn't fit without swapping SDs at that size.
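A rough back-of-the-envelope check of that, assuming decimal units (4 GB = 4000 MB), no filesystem overhead, one card per camera, and the pages split evenly between the two cameras. The function names are just for illustration.

```shell
#!/bin/sh
# images_per_card CARD_MB IMAGE_MB: whole images that fit on one card.
images_per_card() {
    echo $(( $1 / $2 ))
}

# pages_before_swap CARDS CARD_MB IMAGE_MB: pages captured before the cards
# are full, assuming the pages are split evenly across the cards.
pages_before_swap() {
    echo $(( $1 * ($2 / $3) ))
}

images_per_card 4000 50       # prints 80
pages_before_swap 2 4000 50   # prints 160
```

So under those assumptions, anything longer than about 160 pages at 50MB per image would need a card swap partway through.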

[0] http://store.diybookscanner.org/


If you use pi-scan the images are saved to a USB drive instead of the SDcards.


archiving what? just curious.


I want to digitize the entire linguistic and spoken corpus of a critically endangered language[0] and convert it to a searchable format to aid in language revival, academic research, and ensuring that an informed debate can occur when the modern usages of the language differ from traditional usages of the language.

Most of the printed books are scattered, but available, but it's akin to an iceberg: there's a significant amount of 'submerged' knowledge about the language in written manuscripts and recorded audio, and this is where a lot of the value comes from. Printed texts are primarily religious, and getting the colloquial usages of words and phrases is very useful.

Many manuscripts aren't digitized at all, or are available and need transcription.

The language is relatively well-recorded (dating back to at least the late 16th century in written form), and yet small enough that a comprehensive reference is viable: estimates of about 5MM words crop up, but even 3x could easily fit in memory on a Digital Ocean droplet, even if fully POS tagged[1]. Texts are also mostly in the public domain, and there's a lot of bilingual texts (which act as a Rosetta Stone).

[0] https://en.wikipedia.org/wiki/Manx_language#Revival

[1] https://en.wikipedia.org/wiki/Part-of-speech_tagging

EDIT: More than happy to talk in depth about this if anyone wants, via comments, or email on my profile.


If you have a fast flatbed scanner, you can scan 300 pages in thirty minutes. Not worth the effort to build automation. The bigger problem was sorting out all the errors and missed pages afterwards. A real-time display (from ImageMagick) solved this problem:

    # Every 5 seconds, cycle through the scans; the last file in glob order stays on screen.
    while true ; do
     for x in *.pnm ; do
      killall display
      display -rotate 90 "$x" &
     done
     sleep 5
    done


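Nobody asked, but for the record, this is how to make a real one-page-per-sheet PDF from two-page scans. (gm = GraphicsMagick)

    mkdir -p kaksi
    rm -f kaksi/*
    scale=600
    size=500x730
    yla=27   # vertical offset of the crop window
    # First half of each spread (crop at x offset 20) gets even numbers.
    # Starting at 102 keeps names three digits wide (up to 999), so *.jpg sorts in page order.
    j=102
    for x in *.pnm ; do
      echo "$x"
      gm convert "$x" -rotate 90 -crop "$size+20+$yla" -resize "$scale" -normalize "kaksi/k$j.jpg"
      j=$((j+2))
    done
    # Second half of each spread (crop at x offset 530) gets odd numbers, interleaving with the first.
    j=103
    for x in *.pnm ; do
      echo "$x"
      gm convert "$x" -rotate 90 -crop "$size+530+$yla" -resize "$scale" -normalize "kaksi/k$j.jpg"
      j=$((j+2))
    done
    cd kaksi
    gm convert *.jpg -format pdf TheBook.pdf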


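If I read this right then it means "every 5 seconds, open the last scanned page (and nothing else / close the previous one)". But this seems like an inefficient way to do it, opening and killing all irrelevant pages all the time. This will be more efficient and react more quickly:

    lastfile=
    viewer_pid=
    while true; do
        # Newest scan = last name in glob order (assumes sequential file names).
        newestfile=$(ls -- *.pnm 2>/dev/null | tail -n 1)
        if [ "$newestfile" != "$lastfile" ]; then
            # Close the previous viewer, if any, before opening the new page.
            [ -n "$viewer_pid" ] && kill "$viewer_pid" 2>/dev/null
            display -rotate 90 "$newestfile" &
            viewer_pid=$!
            lastfile=$newestfile
        fi
        sleep 0.3
    done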


"Saving precious bits like it is 1969".

This would make an excellent song title.


Sorry but I find your answer disappointing and crossing over into offending. I spent some time first trying to understand how your code makes sense, then to write up a better solution and posted it, and you don't seem to be thankful at all and are instead dissing my effort. Sure, if it works for you, fine, I was under the impression that you didn't know better. You could have saved me time by indicating that you know your solution is hacky but you don't care.


Yes it is inefficient and weird. And thank you for saving the world. Not only Terran energy crisis, but we are also running out of useful data bits as universal entropy goes to infinity. https://en.wikipedia.org/wiki/Heat_death_of_the_universe


Please check the HN guidelines. "Be kind. Don't be snarky."


Don't take offense, I am certain that if you called each other up you would both get along terrifically. Text is a horrible communication medium!


Thanks. Yes, sometimes communication falls short and down, and one's own state of mind is time dependent and doesn't see everything involved. I was in an eager "would like to see positive feedback" state of mind just as timonoko may have been when originally posting their solution, and then interpreted the rejection in an overly negative light. My apologies to them and for the noise.


This really needs to be redesigned for ergonomics.

- Lever should have a button for capture

- Display should be visible while looking down

But now I see why destructive scanning (slicing the binding off and using a sheet feeding scanner) is so attractive. For any non-rare books, this is just too tedious and time consuming to go through for more than a few books.


Display should be visible indeed.

Your capture triggering suggestion is not as great though. The systems that I shipped with http://diybookscanner.eu actually used a USB foot pedal for triggering the cameras. That's by far a superior user experience to pressing a button while both hands are busy moving a cradle...

Destructive scanning feels incredibly cruel to the books. A non-destructive system like this actually works fairly well. You can expect to get up to about 1000-1200 pages an hour with it.


> Destructive scanning feels incredibly cruel to the books.

I suppose it depends on whether it has sentimental value. When I was young, I'd treat my books like treasures, putting covers on them (even paperbacks), making sure I didn't crease the spine when I read them. Now I consider books to be a temporary store of knowledge as the contents pass to my brain. I fold pages, underline, scribble notes in them. There are thousands more copies out there, I don't feel any need to baby my copy.


> For any non-rare books, this is just too tedious and time consuming to go through for more than a few books.

If your goal is to scan a whole bunch, it's tedious. If you want to do it once in a while, it's not really a problem.


One past thread, a long time ago:

DIY book scanning - https://news.ycombinator.com/item?id=991897 - Dec 2009 (7 comments)



For anyone interested, there is also the https://libreflip.org/ website about a similar device.


> While there are some computer algorithms that can help dewarp the pages after capture, it is always more reliable to just capture flat pages in the first place.

I’m sure this is technically true, but curious how much it matters in practice today? Reading Google’s book scanning patents I found a description of a de-warper based on capturing a 3d depth scan of the book, which I assumed they were using in order to achieve the scale of scanning all books on earth. Capturing and de-warping a 3d depth scan would also be leagues more reliable than trying to do a purely 2d image based de-warp.

> The lights must also be positioned to minimize glare and reflections.

For my personal photo scanning and archiving project, I used a polarizing filter on the light and on the camera in order to eliminate specular glare, it works amazingly well. Would that be impractical, and/or not work as well on books for some reason?


These kinds of discussions need more real examples to accurately depict the tradeoffs of destructive vs non-destructive scanning, so I'll add scans I personally made.

Here are two pages from Cracking the Coding Interview, 6th Edition. I preferred my own scans over the digital versions I found online, whose black-and-white scans were hard on the eyes. Feel free to ask me about details of the process.

https://imgur.com/2ZQFZ5p

It's entirely possible to accomplish post-processing without writing code if you have Adobe Photoshop.

I used a free-to-the-public bookscanner built by the Digital Archivists at Noisebridge in San Francisco to take pictures of all pages in my textbooks (it took a while). In Photoshop, you can record a macro to automatically crop to a rectangular region determined by just one or more points that are guaranteed to be on the page in every photo. The selection is made by the quick selection tool (selects similar pixels to the page color in the same region). With this macro recorded, you can run it in bulk through all files.
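For anyone who would rather script that crop step, here is a rough analogue in plain Python. It is not Photoshop's quick selection algorithm; it is a simple flood fill, assuming the photo has already been loaded as a grid of grayscale values (0-255) and that the seed point is known to lie on the page. The function name and the `tolerance` parameter are made up for illustration.

```python
from collections import deque


def page_bounding_box(pixels, seed, tolerance=30):
    """Find the crop rectangle around the region similar to the seed pixel.

    pixels    -- list of rows, each a list of grayscale ints (0-255)
    seed      -- (row, col) of a point guaranteed to be on the page
    tolerance -- how far a pixel may differ from the seed and still count
    Returns (top, left, bottom, right), all inclusive.
    """
    rows, cols = len(pixels), len(pixels[0])
    seed_row, seed_col = seed
    target = pixels[seed_row][seed_col]
    seen = {(seed_row, seed_col)}
    queue = deque([(seed_row, seed_col)])
    top = bottom = seed_row
    left = right = seed_col
    while queue:
        r, c = queue.popleft()
        top, bottom = min(top, r), max(bottom, r)
        left, right = min(left, c), max(right, c)
        # Spread to the four direct neighbours that look like paper.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if (nr, nc) in seen:
                continue
            if abs(pixels[nr][nc] - target) <= tolerance:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return top, left, bottom, right
```

Because only the bounding box is kept, dark text inside the page does not matter as long as the paper around it stays connected to the seed point.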

The textbook size was still large digitally (a gigabyte) because I wanted the highest quality possible for studying, but it beat having to carry heavy textbooks for sure. I also shared these files with friends and we were able to study without any physical textbooks for books that were not available digitally—it was amazing.

Personally I avoided all the deskewing technologies and kept plain pictures, all in color, as close to the real thing as possible, because Noisebridge's scanner used two DSLRs and the pictures were high quality. That was more enjoyable to read than a black-and-white conversion. I did the OCR with ABBYY FineReader.

Overall it gets more annoying the thicker the textbook is. If destructive scanning is acceptable, one can just buy the book, go to FedEx and ask them to cut the spine off for $4 to convert it to loose-leaf, then run it through a document scanner such as the ScanSnap iX500, which is much faster at around 25 pages/min at its slowest.
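To put the throughput figures quoted in this thread side by side (1000-1200 pages an hour for a non-destructive rig, around 25 pages/min for the sheet-fed scanner), here is a back-of-the-envelope sketch. The 600-page book is a made-up example, and prep time such as cutting the spine, clearing jams, or re-shooting pages is ignored.

```python
def minutes_to_scan(pages, pages_per_hour):
    """Pure capture time in minutes at a steady rate."""
    return pages * 60 / pages_per_hour


book_pages = 600  # hypothetical textbook

# Non-destructive rig at the low end of the quoted range.
print(minutes_to_scan(book_pages, 1000))     # 36.0
# Sheet-fed scanner at 25 pages/min, i.e. 1500 pages/hour.
print(minutes_to_scan(book_pages, 25 * 60))  # 24.0
```

On raw capture time alone the gap is smaller than it feels; the difference in practice is mostly that the sheet-fed scanner runs unattended.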

One really cool feature of Noisebridge's scanner (picture below) was that you could see the camera's viewfinder live, which sped up iteration and made errors much faster to catch.

https://imgur.com/4Pkdp1j


Are any of you part of book scanner clubs that might have a database of word counts of famous fiction books? I've found several lists online but it's not a wide selection of books - I'd imagine book scanners might have more. I'd be happy to share the database I've cobbled together.


I briefly participated in an eBookz scene group at the turn of the century, although we didn't keep track of any word counts (nor did we OCR) and we focused on non-fiction, mainly automotive repair manuals. I doubt it's a statistic that the scanners (people/groups) pay attention to.


Last year I bought a Czur book scanner, the kind that looks like a lamp, to try to archive some 100-year-old books I had limited access to. The resolution of the camera was so low that I ended up balancing my phone on top and getting better images, using the Czur only as a light.


I find the latest Czur works well enough for non-glossy stuff. The broader scanning problem is that, beyond the small must-scan category, I have so much stuff that scanning it all is generally not very practical.


The store seems to be down, any idea how much it costs?


Anyone have thoughts on the Easy Book Scanner design by David Landin?

https://www.instructables.com/Book-Scanner-Low-cost-easy-to-...


Looks incredibly serviceable and well engineered. I would expect reasonable and consistent results from the rig. The biggest question would come down to the cameras.

With these kinds of rigs (two cameras, not computer controlled, no computer display) your big potential sources of error are either accidentally failing to trigger one camera or cameras losing focus on the page (especially if you are at something like the end of a chapter, where there is often empty space in the middle of the page where the camera's auto-focus area is). His solution of using the IR remote should significantly reduce the issue of failing to capture on one camera. Cameras exist with manual focus settings, but they are often pricier or too old to reliably find one worth recommending to others. The CHDK alternative firmware for certain cheap Canon cameras generally adds a manual focus option for the less expensive cameras (though the individual features depend on who is making the firmware build you get).
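A missed trigger on one camera is cheap to catch after the fact. The sketch below is one hypothetical way to do it, assuming you end up with one ordered list of shots of left-hand pages and one of right-hand pages (the function and variable names are made up): it refuses to merge the two lists if the counts differ, and otherwise interleaves them into reading order.

```python
def interleave_pages(left_page_shots, right_page_shots):
    """Merge two cameras' shots into reading order.

    Raises ValueError when the counts differ, which usually means one
    camera failed to fire at some point during the session.
    """
    if len(left_page_shots) != len(right_page_shots):
        raise ValueError(
            f"camera mismatch: {len(left_page_shots)} left-page shots "
            f"vs {len(right_page_shots)} right-page shots"
        )
    pages = []
    for left, right in zip(left_page_shots, right_page_shots):
        pages.extend([left, right])
    return pages


print(interleave_pages(["L1.jpg", "L2.jpg"], ["R1.jpg", "R2.jpg"]))
```

A count check only tells you that a shot is missing, not where, so it is worth running every few dozen pages rather than once at the end of the book.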

Another option worth investigating is the newest Raspberry Pi camera modules with external lenses. Those should give you manual focus and the ability to build up an automated workflow you like around things like moving files around and any pre-processing you need. A roughly 9 megapixel camera gets you 300dpi resolution on a full sheet of A4 paper, which is a lot more than most books need.
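The 9 megapixel figure can be checked with a little arithmetic. This sketch assumes the page fills the frame exactly and ignores the mismatch between the sensor's aspect ratio and the page's, so treat the result as a lower bound.

```python
MM_PER_INCH = 25.4


def megapixels_needed(width_mm, height_mm, dpi):
    """Pixels required to cover a page at the given resolution, in megapixels."""
    width_px = width_mm / MM_PER_INCH * dpi
    height_px = height_mm / MM_PER_INCH * dpi
    return width_px * height_px / 1_000_000


# A4 is 210 mm x 297 mm.
print(round(megapixels_needed(210, 297, 300), 1))  # 8.7
```

Since most sensors are 4:3 or 3:2 and A4 is narrower than that, some pixels always fall outside the page, which is one reason to aim a bit above the computed number.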


A lot of people on the DIY Book Scanner forums got a lot of value out of this design and David's presence on the forum. In my opinion it's an excellent starting point.


Here is a homemade way to digitize a book with a compact camera: https://www.ikkaro.com/en/como-digitalizar-libro/


Hello, if anyone is in the Bay Area and has a book scanner, I’d love to scan my copy of this book, which was only printed in India in 2001 and seems relatively rare:

https://www.abebooks.com/9780140298246/Patents-Myths-Reality...

I did fill out the form for the Internet Archive, but it talked about scanning a library and I’m not sure they want to deal with just one book.


NoiseBridge maintained a DIY Book Scanner for a long time.


Oooh good tip thank you


Are the kits back in stock?

I fooled around with the DIY option, but realized I was incompetent. Ended up buying a cheap Czur scanner, which works surprisingly well.

For it, you hold the book open on the black mat on a table. The scanner uses a laser to measure and correct page curvature, and takes a picture of both pages.

It produces decent PDFs (I'm not sure about the comparative resolution) with (bad) OCR'ed text. (The IA re-OCR's the book after upload, right?)


I'd love to see something like this made out of entirely recycled phones and their cameras instead of going with discrete components... any leads?


I've just built a basic overhead 'rostrum' type rig using some wood, screws, and an older Android phone.

The phone's camera points down at a height of about 20cm/8" and can see an area on the base plate big enough for the object I'm capturing (in this case, 3.5" floppy disks).

I use an app called IP Camera (it's on the Play store for free) to serve the image via HTTP. I then remotely grab it, process/crop it and store it. The project is in its early stages, but is working quite well so far.
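The grab-and-store half of that loop needs nothing beyond the Python standard library. This is a minimal sketch, not the poster's actual code: the snapshot URL is a placeholder (the real address and path depend on the app and your network), the file naming scheme is made up, and cropping is left out because it needs an image library.

```python
import urllib.request
from pathlib import Path


def grab_frame(snapshot_url, out_dir, index, timeout=10):
    """Download one still image and save it as capture_NNNN.jpg.

    snapshot_url -- URL that returns a single JPEG (placeholder, see above)
    out_dir      -- directory to store captures in (created if missing)
    index        -- sequence number used in the file name
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"capture_{index:04d}.jpg"
    with urllib.request.urlopen(snapshot_url, timeout=timeout) as response:
        target.write_bytes(response.read())
    return target


# Hypothetical usage; replace the URL with whatever your phone app serves.
# grab_frame("http://192.168.1.50:8080/snapshot.jpg", "captures", 1)
```

Numbering the files at capture time keeps the objects in order without relying on file timestamps.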


I have most of the collection of hardback National Geographics from 1930 to 1970. Wonder how legal it would be to scan them. Always wondered.


Unless you really want to do it for fun, they have almost certainly been scanned and put online by someone else. Archive.org has several: https://archive.org/search.php?query=national%20geographic


> from 1930

Good news: It's very likely that the copyright has expired. If you were to scan them, remember to upload them to archive.org for everyone else to see.

Bad news: It's only the case if the copyright hasn't been renewed by the owner. Usually most owners don't renew them, but to determine whether or not this is the case, you need to go through huge catalogs of registered entries from the U.S. copyright office.


Incorrect. It is almost certainly still in copyright in the United States. Anything published after 1925 will be in copyright, except for works published without notice in 1926-1977 or not renewed in 1926-1963. The exceptions almost certainly don't apply to NatGeo.

Scanning typically falls under fair use, so copyright only applies to distribution of the scans.

edit: Maybe you edited or maybe I'm just dumb. Anyway, the problem with relying on a lack of renewal is that you have to prove a negative. NYPL among others have been doing interesting work on this problem: https://www.nypl.org/blog/2019/09/01/historical-copyright-re...
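The rule of thumb at the top of this comment can be written as a small decision function. This only restates the comment's rule as given; the 1925 cutoff reflects when the comment was written and moves forward each year, and the function name and parameters are made up for illustration.

```python
def likely_in_us_copyright(year, had_notice=True, was_renewed=True):
    """Apply the thread's rule of thumb for US publications.

    year        -- year of first publication
    had_notice  -- whether the work carried a copyright notice
    was_renewed -- whether the copyright was renewed
    """
    if year <= 1925:
        return False  # published 1925 or earlier
    if year <= 1977 and not had_notice:
        return False  # exception: published without notice, 1926-1977
    if year <= 1963 and not was_renewed:
        return False  # exception: not renewed, 1926-1963
    return True


# A 1930 issue that carried a notice and was renewed.
print(likely_in_us_copyright(1930))                     # True
# The same issue if the renewal had never been filed.
print(likely_in_us_copyright(1930, was_renewed=False))  # False
```

The hard part is not the rule but the inputs: `was_renewed` is exactly the negative that has to be proven from the renewal records.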


Under US law, between 1925 and 1964, it was necessary to renew copyright every 28 years.[1] There are many works which have fallen out of copyright for nonrenewal, and there are projects which are now going through copyright renewal records to determine which of these are in fact now public domain in the US. Roughly 3/4 of potentially copyrighted works (published since 1924 and initially registered) have proved to be public domain.

Once copyright has lapsed, it cannot be reinstated.

https://www.crummy.com/2019/07/22/0

https://www.nypl.org/blog/2019/05/31/us-copyright-history-19...

________________________________

Notes:

1. Technically, the obligation existed until the 1976 act, at which time copyright status was automatic, but was retroactively waived for works never having lapsed, back to 1964, in the 1992 act. It's complicated. (NYPL link above.)


It's just as legal as ripping your CDs into MP3s.


I've used Scan Tailor in the past to convert an outboard motor manual to PDF; it's pretty powerful. I didn't have a proper setup, but my results still came out decently.

https://github.com/4lex4/scantailor-advanced


I remember reading about Larry Page spending time developing a book scanner using a scanner and a hoover to turn pages


I built a similar one of these from a kit that Dan Reetz made. (Technology has improved since I built mine.)

I have eliminated most printed books. I had to pass a "psychological barrier" before I was able to discard the books I scanned.

The last holdout was music scores, but I now use an iPad for music at the piano.


Thanks for buying and building a kit. I appreciated everyone who did that so much.


I also want this.


The link http://store.diybookscanner.org/ goes to a shop page that is not configured yet.


I remember reading about Google's book scanner that operated automatically using vacuum pressure to gently flip pages. I'd love to see an open source variety of that.


I think about Google's scanner project a lot. I wonder if they are still scanning?

I would love if they found a way to offer those books (authors would have to agree) to the world.

If they did, I would forgive them for tracking me all these years.

I'd even put up with ads in the books.


They do if the books are out of copyright, and they're often copied to archive.org (and then transcribed via Wikisource) when this happens.

sample: https://books.google.im/books?id=Me8CAAAAQAAJ&printsec=front...


And you can order a bound paperback book made from the scans of just about any out-of-copyright google book. From the Harvard Bookstore (or anyone else that has an Espresso Book Machine):

https://www.ondemandbooks.com/as/?t=Hamlet&c=google


I don't know where to find a picture of it, but I remember seeing a video of one of their book scanners. Or maybe it was just one version of them.

Picture a wedge (shaped like the V of a book lying open on a table) that could move up and down. The wedge would descend and insert itself into the V of the book's pages, then rising up would suck the 2 left and right sheets against the sides of the wedge, scanning as it went.

Then it would blow those 2 pages to the left (or right I forget) and descend again and do the next set.

I thought it was pretty cool. Have never seen something like it since.


Automated page turning is an incredibly complex problem to tackle. It won't gain you as much as you'd think either. "Book" and "page" are surprisingly difficult categories to define.

If my experience helping to build the diybookscanner.org project taught me anything, it's that picking the low hanging fruit of small efficiency improvements to the (semi-)manual process is so much more effective...


Which is probably why some of the scanning locations used something very similar to this DIY effort (I'm familiar with the Oxford, UK location that existed in the mid 00s). Humans turned the pages with the finger tips mentioned in another comment.


Google Linear Book Scanner

https://yewtu.be/watch?v=7MNqINDm1lk



UCL-CS had one of these which was deployed in conjunction with the British Library. This is when high pixel count CCDs were super expensive back in the 1980s. Amazing device.


Nice! A foot pedal can improve overall efficiency and reduce lower back and neck pain.


I need this for some vintage IBM/PC-compatible programming books that are a zillion pages long.



