This looks awesome. I've got a ghetto full text search indexer I've written that...

JoshTriplett · on Aug 30, 2016

> I've got a ghetto full text search indexer I've written that uses OCR as a fallback if it can't extract text from a pdf but as you say many times the quality is so bad it's a lost cause.

OCR for indexing seems like an easier problem than perfectly accurate OCR. You could do a fuzzy search that can match similar characters (1/I/l, A/4, 0/O).

malux85 · on Aug 30, 2016

Tesseract is sadly, quite out of date. If you would like help implementing Deep Learning models for OCR let me know.

redwards510 · on Aug 30, 2016

I would love to see a link to a tutorial or project that shows how to do this. As someone who has sampled many OCR products, I have been wondering why people are not using deep learning for this. It's really a match made in heaven. Or maybe the vendors are just training a system and then releasing it without updates?

The difficult part of OCR'ing forms is parsing text in a variety of word-wrapped panels and boxes and converting checkboxes to text. Is that something deep learning could be trained to handle? For example, imagine parsing the huge receipt you get when you buy a car. The text itself isn't always the challenge.

marai2 · on Aug 30, 2016

Would you mind commenting why Tesseract is out of date? I see developers are still active on it:

https://github.com/tesseract-ocr/tesseract/commits/master

malux85 · on Aug 30, 2016

There are people still working on VAX systems - are VAX systems not out of date by the same logic?

Spend 1 afternoon with tesseract, and 1 afternoon with Googles text recognition API. The quality of the results is night and day.

I would love there to be an open source one that can complete, which is why I said "sadly". But if you're interested in quality of results, Deep Learning is the way to go.

acdha · on Aug 30, 2016

Google has been one of the biggest contributors to Tesseract – has that changed? My understanding – which could be years out of date – was that Google Books used Tesseract but that most of their effort had gone into either advanced image preprocessing or large-scale training.

malux85 · on Aug 30, 2016

Yes you're right - they were one of the biggest contributors until a 12-18 months ago (roughly)

Now it's deep learning, it's at the point now where there's no point in spending ages manually 'feature engineering', just throw some GPU and soon to be TPU processing power at it

acdha · on Sept 1, 2016

Thanks for the update – has any of that been described in public?