Doc2text – Detect text blocks and OCR poorly scanned PDFs in bulk (github.com/jlsutherland)
161 points by jlsutherland on Aug 30, 2016 | 19 comments



I worked on a PDF text extraction project once, with scientific articles as the primary target.

That stuff is really hard, even when the text is ostensibly present in the PDF (as opposed to the PDF being an image of text). Thing is, it's all just "draw text" commands in the content streams (basically postscript programs). The text commands appear in no particular order and you generally have to compute the layout to see even where the spaces are (which is still a guessing game, because the PDF generator will vary the space width to achieve a visually pleasing format).
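
To make the "where are the spaces" problem concrete, here's a rough sketch assuming pdfminer.six (my choice of library, not something mentioned above); the 30% gap threshold and the file name are purely illustrative:

    # Glyphs arrive as positioned draw-text ops; "spaces" have to be inferred
    # from the horizontal gaps between consecutive glyphs.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

    def line_text(line, gap_ratio=0.3):
        out, prev = [], None
        for ch in line:
            if not isinstance(ch, LTChar):      # skip pdfminer's own guessed spaces
                continue
            if prev is not None and (ch.x0 - prev.x1) > gap_ratio * ch.size:
                out.append(" ")                 # gap wider than ~30% of the font size: call it a space
            out.append(ch.get_text())
            prev = ch
        return "".join(out)

    for page in extract_pages("article.pdf"):
        for box in page:
            if isinstance(box, LTTextContainer):
                for line in box:
                    if isinstance(line, LTTextLine):
                        print(line_text(line))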

OCR is an approach that wasn't quite ready for prime time back then, so it's cool to see people working on it!


Also look at Ocropy from Tom Breuel, which has a page segmenter that identifies columns. https://github.com/tmbdev/ocropy


Ocropus is an incredible tool. I highly recommend it!


This looks awesome. I've got a ghetto full-text search indexer I've written that uses OCR as a fallback when it can't extract text from a PDF, but as you say, many times the quality is so bad it's a lost cause. I wonder if I can leverage this to improve the indexing.


> I've got a ghetto full-text search indexer I've written that uses OCR as a fallback when it can't extract text from a PDF, but as you say, many times the quality is so bad it's a lost cause.

OCR for indexing seems like an easier problem than perfectly accurate OCR. You could do a fuzzy search that can match similar characters (1/I/l, A/4, 0/O).
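
A rough sketch of that idea (the confusion table and the tiny in-memory index are just illustrative): fold the characters OCR commonly confuses into one canonical form before indexing and again at query time, so e.g. "lNV0ICE" still matches a search for "invoice".

    CONFUSABLE = str.maketrans({"1": "i", "l": "i", "0": "o", "4": "a", "5": "s", "8": "b"})

    def normalize(token):
        return token.lower().translate(CONFUSABLE)

    index = {}                                   # normalized token -> set of document ids

    def add_document(doc_id, text):
        for token in text.split():
            index.setdefault(normalize(token), set()).add(doc_id)

    def search(term):
        return index.get(normalize(term), set())

    add_document("scan-042.pdf", "TOTAL DUE lNV0ICE 1234")
    print(search("invoice"))                     # {'scan-042.pdf'}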


Tesseract is, sadly, quite out of date. If you would like help implementing Deep Learning models for OCR, let me know.


I would love to see a link to a tutorial or project that shows how to do this. As someone who has sampled many OCR products, I have been wondering why people are not using deep learning for this. It's really a match made in heaven. Or maybe the vendors are just training a system and then releasing it without updates?

The difficult part of OCR'ing forms is parsing text in a variety of word-wrapped panels and boxes and converting checkboxes to text. Is that something deep learning could be trained to handle? For example, imagine parsing the huge receipt you get when you buy a car. The text itself isn't always the challenge.
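
For context, here's a rough sketch of the classical, non-deep-learning way to read a checkbox once you already know where it sits on the form (the file name, coordinates, and ink threshold are all made up): crop the box region and measure ink density.

    import cv2

    def checkbox_checked(page_gray, box, ink_threshold=0.2):
        x, y, w, h = box                       # pixel coordinates of the box on the scan
        roi = page_gray[y:y + h, x:x + w]
        # Otsu binarization: ink -> 1, paper -> 0.
        _, binary = cv2.threshold(roi, 0, 1, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        # An empty box still has its printed border, so the threshold has to sit
        # above the border's share of the pixels.
        return binary.mean() > ink_threshold

    page = cv2.imread("car_receipt_scan.png", cv2.IMREAD_GRAYSCALE)
    print("checked" if checkbox_checked(page, (120, 340, 24, 24)) else "unchecked")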


Would you mind commenting on why Tesseract is out of date? I see developers are still active on it:

https://github.com/tesseract-ocr/tesseract/commits/master


There are people still working on VAX systems - are VAX systems not out of date by the same logic?

Spend one afternoon with Tesseract and one afternoon with Google's text recognition API. The quality of the results is night and day.

I would love there to be an open-source one that can compete, which is why I said "sadly". But if you're interested in the quality of the results, Deep Learning is the way to go.


Google has been one of the biggest contributors to Tesseract – has that changed? My understanding – which could be years out of date – was that Google Books used Tesseract but that most of their effort had gone into either advanced image preprocessing or large-scale training.


Yes, you're right - they were one of the biggest contributors until 12-18 months ago (roughly).

Now it's deep learning. It's at the point where there's no point in spending ages manually 'feature engineering'; just throw some GPU (and soon, TPU) processing power at it.


Thanks for the update – has any of that been described in public?


Can you please explain what makes this utility different than other OCR solutions? I've seen quite a few coming out recently. What is the secret sauce that makes this more than just a frontend for tesseract?


The quick and dirty: OCR solutions exist, but to work well they generally need a little hand-holding. You have to give your OCR software a clean image if you want clean results (this goes for tesseract, ocropus, etc.). The problem is that scans are rarely so clean: they are crooked, there is a hand in it, there is half of another page in it, and so on, and common OCR software doesn't correct for this too well out of the box.

doc2text bridges the gap between the initial scan and the scan you should be feeding your OCR engine, which greatly improves recognition. It takes that dirty scan, identifies the text region, fixes the skew, performs a few pre-processing operations that help with common OCR binarization, and BOOM...data that was inaccessible is now accessible.
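
For the curious, something along these lines in plain OpenCV. This is NOT doc2text's actual code, just the general shape of the pipeline, with placeholder file names and a simple projection-profile deskew standing in for whatever doc2text really does:

    import cv2
    import numpy as np

    def estimate_skew(ink, angles=np.arange(-5, 5.5, 0.5)):
        # Projection-profile method: the rotation that best aligns text lines
        # with pixel rows maximizes the variance of the row sums.
        h, w = ink.shape
        best_angle, best_score = 0.0, -1.0
        for a in angles:
            M = cv2.getRotationMatrix2D((w / 2, h / 2), a, 1.0)
            rotated = cv2.warpAffine(ink, M, (w, h), flags=cv2.INTER_NEAREST)
            score = rotated.sum(axis=1).var()
            if score > best_score:
                best_angle, best_score = a, score
        return best_angle

    img = cv2.imread("dirty_scan.png", cv2.IMREAD_GRAYSCALE)

    # Ink mask: dark pixels -> 255, paper -> 0.
    ink = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    # 1. Deskew the page.
    angle = estimate_skew(ink)
    h, w = img.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)

    # 2. Crop to the bounding box of the ink (a real tool would pick the
    #    dominant text block so hands and neighbouring pages get dropped too).
    ink = cv2.threshold(deskewed, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    x, y, bw, bh = cv2.boundingRect(cv2.findNonZero(ink))
    cropped = deskewed[y:y + bh, x:x + bw]

    # 3. Adaptive binarization so the OCR engine sees crisp black-on-white text.
    clean = cv2.adaptiveThreshold(cropped, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                  cv2.THRESH_BINARY, 35, 15)
    cv2.imwrite("clean_scan.png", clean)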

Try running tesseract or ocropus on a bad document scan before and after using doc2text...you'll see what I mean!

P.S. I should add...the intended end-user is also a little different from strict OCR packages/wrappers. Users might be admin staff or academics (or kids like my RAs) who want a simple, straightforward API to extract the text they need from poorly scanned documents. doc2text is built with this need in mind.


Do you have a comparison with unpaper, which seems to do almost the same thing?


As someone who has written a similar front-end[1] for tesseract, I'm equally curious :)

While mine is specifically designed for document archival and also plugs into SANE, I'd love to know if this thing contains anything obvious I can add to mine to improve the quality of the results.

[1] https://github.com/josteink/autoarchiver


Some examples would be really informative as to how well it works.


Pretty impressive leverage here: only a few dozen lines of code in total, using other OSS libraries.


Thanks!




