
Can you please explain what makes this utility different from other OCR solutions? I've seen quite a few coming out recently. What is the secret sauce that makes this more than just a frontend for tesseract?



The quick and dirty: OCR solutions exist, but to work well they generally need a little hand-holding. You have to give your OCR software a clean image if you want clean results (this goes for tesseract, ocropus, etc.). The problem is that scans are rarely so clean...they are crooked, there is a hand in the frame, half of another page creeps in, and so on, and common OCR software doesn't correct for any of this well out of the box.

doc2text bridges the gap between the raw scan and the scan you should actually feed to your OCR engine, which greatly improves OCR accuracy. It takes that dirty scan, identifies the text region, fixes the skew, performs a few pre-processing steps that help with common OCR binarization, and BOOM...data that was inaccessible is now accessible.
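To make those steps concrete, here is a minimal sketch of that kind of cleanup pass using OpenCV. This is my own illustration of the deskew/binarize/crop-to-text idea, not doc2text's actual code; the function name and the specific parameters are assumptions.

    import cv2
    import numpy as np

    def clean_for_ocr(path):
        # Load the scan and convert to grayscale.
        gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)

        # Otsu threshold (inverted) so text pixels are white for the skew estimate.
        mask = cv2.threshold(gray, 0, 255,
                             cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

        # Estimate the skew from the minimum-area rectangle around the text
        # pixels (classic recipe; OpenCV changed its angle convention in
        # 4.5+, so the sign handling may need adjusting for your version).
        coords = np.column_stack(np.where(mask > 0)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        angle = -(90 + angle) if angle < -45 else -angle

        # Rotate the page to undo the skew.
        h, w = gray.shape
        rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        deskewed = cv2.warpAffine(gray, rot, (w, h), flags=cv2.INTER_CUBIC,
                                  borderMode=cv2.BORDER_REPLICATE)

        # Re-binarize the deskewed page (dark text on white) and crop to
        # the bounding box of the detected text region.
        binarized = cv2.threshold(deskewed, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
        ys, xs = np.where(binarized < 128)
        return binarized[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

Even this much preprocessing tends to make a visible difference in what tesseract can pull out of a bad scan.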

Try running tesseract or ocropus on a bad document scan before and after using doc2text...you'll see what I mean!
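If you want to do that comparison from Python rather than the command line, here is a quick sketch using pytesseract as the tesseract wrapper. The file names are hypothetical, and the word count is only a crude proxy for how much text the engine could recover.

    import pytesseract
    from PIL import Image

    # Hypothetical file names: the raw scan, and the same page after a
    # doc2text-style cleanup pass has been applied to it.
    before = pytesseract.image_to_string(Image.open("dirty_scan.jpg"))
    after = pytesseract.image_to_string(Image.open("cleaned_scan.png"))

    print(len(before.split()), "words recognised from the raw scan")
    print(len(after.split()), "words recognised from the cleaned scan")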

P.S. I should add...the target end-user is also a little different from that of strict OCR packages/wrappers. Users might be admin staff or academics (or kids like my RAs) who want a simple, straightforward API to extract the text they need from poorly scanned documents. doc2text is built with this need in mind.
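For a sense of what that looks like in practice, here is a rough usage sketch. The method names follow my reading of the project README and may not match the current API exactly, so check the repo before copying this.

    import doc2text

    # Method names below are taken from my reading of the project README
    # and may differ from the current API; treat this as a sketch.
    doc = doc2text.Document()
    doc.read("./scans/meeting_minutes.pdf")  # hypothetical input file
    doc.process()        # crop to the text region, deskew, prep for OCR
    doc.extract_text()   # run OCR on the processed pages
    print(doc.get_text())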


Do you have a comparison with unpaper, which seems to do almost the same thing?


As someone who has written a similar front-end[1] for tesseract, I'm equally curious :)

While mine is designed specifically for document archival and also plugs into SANE, I'd love to know if this tool does anything obvious that I could add to mine to improve the quality of the results.

[1] https://github.com/josteink/autoarchiver



