Doc2text – Detect text blocks and OCR poorly scanned PDFs in bulk

onetwotree · on Aug 30, 2016

I worked on a PDF text extraction project once, with scientific articles as the primary target.

That stuff is really hard, even when the text is ostensibly present in the PDF (as opposed to the PDF being an image of text). Thing is, it's all just "draw text" commands in the content streams (basically postscript programs). The text commands appear in no particular order and you generally have to compute the layout to see even where the spaces are (which is still a guessing game, because the PDF generator will vary the space width to achieve a visually pleasing format).

OCR is an approach that wasn't quite ready for prime time back then, so it's cool to see people working on it!

ldenoue · on Aug 30, 2016

Also look at Ocropy from Tom Breuel who has a page segmenter that identifies columns. https://github.com/tmbdev/ocropy

jlsutherland · on Aug 30, 2016

Ocropus is an incredible tool. I highly recommend it!

zaphar · on Aug 30, 2016

This looks awesome. I've got a ghetto full text search indexer I've written that uses OCR as a fallback if it can't extract text from a pdf but as you say many times the quality is so bad it's a lost cause. I wonder if I can leverage this to improve the indexing.

JoshTriplett · on Aug 30, 2016

> I've got a ghetto full text search indexer I've written that uses OCR as a fallback if it can't extract text from a pdf but as you say many times the quality is so bad it's a lost cause.

OCR for indexing seems like an easier problem than perfectly accurate OCR. You could do a fuzzy search that can match similar characters (1/I/l, A/4, 0/O).

malux85 · on Aug 30, 2016

Tesseract is sadly, quite out of date. If you would like help implementing Deep Learning models for OCR let me know.

redwards510 · on Aug 30, 2016

I would love to see a link to a tutorial or project that shows how to do this. As someone who has sampled many OCR products, I have been wondering why people are not using deep learning for this. It's really a match made in heaven. Or maybe the vendors are just training a system and then releasing it without updates?

The difficult part of OCR'ing forms is parsing text in a variety of word-wrapped panels and boxes and converting checkboxes to text. Is that something deep learning could be trained to handle? For example, imagine parsing the huge receipt you get when you buy a car. The text itself isn't always the challenge.

marai2 · on Aug 30, 2016

Would you mind commenting why Tesseract is out of date? I see developers are still active on it:

https://github.com/tesseract-ocr/tesseract/commits/master

malux85 · on Aug 30, 2016

There are people still working on VAX systems - are VAX systems not out of date by the same logic?

Spend 1 afternoon with tesseract, and 1 afternoon with Googles text recognition API. The quality of the results is night and day.

I would love there to be an open source one that can complete, which is why I said "sadly". But if you're interested in quality of results, Deep Learning is the way to go.

acdha · on Aug 30, 2016

Google has been one of the biggest contributors to Tesseract – has that changed? My understanding – which could be years out of date – was that Google Books used Tesseract but that most of their effort had gone into either advanced image preprocessing or large-scale training.

malux85 · on Aug 30, 2016

Yes you're right - they were one of the biggest contributors until a 12-18 months ago (roughly)

Now it's deep learning, it's at the point now where there's no point in spending ages manually 'feature engineering', just throw some GPU and soon to be TPU processing power at it

acdha · on Sept 1, 2016

Thanks for the update – has any of that been described in public?

redwards510 · on Aug 30, 2016

Can you please explain what makes this utility different than other OCR solutions? I've seen quite a few coming out recently. What is the secret sauce that makes this more than just a frontend for tesseract?

jlsutherland · on Aug 30, 2016

The quick and dirty: OCR solutions exist, but to work well they generally need a little hand-holding. You have to give your OCR software a clean image if you want clean results (this goes for tesseract, ocropus, etc). The problem is that scans are rarely so clean...they are crooked, there is a hand in it, there is half of another page in it, etc. etc.---and common OCR software doesn't correct for this too well out-of-the-box.

doc2text bridges the gap between the initial scan and the scan you should pass through your OCR to greatly increase OCR ability. It takes that dirty scan, identifies the text region, fixes skew, performs a few pre-processing operations that help with common OCR binarization, and BOOM...data that was inaccessible, now accessible.

Try running tesseract or ocropus on a bad document scan before and after using doc2text...you'll see what I mean!

P.S. I should add...the end-user is also a little different from strict OCR packages/wrappers. Users might be admin staff or academics (or kids like my RA's) who want a simple, straightforward API to extract the text we need from poorly scanned documents. doc2text is built with this need in mind.

ashkulz · on Aug 31, 2016

Do you have a comparison with unpaper, which seems to do almost the same thing?

josteink · on Aug 30, 2016

As someone who has written a similar front-end[1] for tesseract, I'm equally curious :)

While mine is specifically designed towards document-archival means and also plugs into SANE, I'd love to know if this thing contains anything obvious I can add to mine to improve the quality of the results.

[1] https://github.com/josteink/autoarchiver

placeybordeaux · on Aug 31, 2016

Some examples would be really informative as to how well it works.

cpr · on Aug 30, 2016

Pretty impressive leverage here: only a few dozen lines of code in total, using other OSS libraries.

jlsutherland · on Aug 30, 2016

Thanks!