
This looks awesome. I've got a ghetto full text search indexer I've written that uses OCR as a fallback if it can't extract text from a pdf but as you say many times the quality is so bad it's a lost cause. I wonder if I can leverage this to improve the indexing.



> I've got a ghetto full text search indexer I've written that uses OCR as a fallback if it can't extract text from a pdf but as you say many times the quality is so bad it's a lost cause.

OCR for indexing seems like an easier problem than perfectly accurate OCR. You could do a fuzzy search that can match similar characters (1/I/l, A/4, 0/O).
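A minimal sketch of that idea in Python, assuming a simple in-memory inverted index. The confusable groups are the three mentioned above; the function names and the sample documents are hypothetical, and the matching is case-sensitive to keep it short. Both indexed tokens and queries are folded to one canonical character per group, so an OCR misread of "O" as "0" still matches.

```python
from collections import defaultdict

# Groups of characters OCR commonly confuses (from the comment above).
# The first character of each group is used as the canonical form.
CONFUSABLE_GROUPS = ["1Il", "A4", "0O"]


def build_fold_table(groups):
    """Map every character in a group to the group's first character."""
    table = {}
    for group in groups:
        canonical = group[0]
        for ch in group:
            table[ord(ch)] = canonical
    return table


FOLD_TABLE = build_fold_table(CONFUSABLE_GROUPS)


def fold(text):
    """Collapse confusable characters so look-alike strings compare equal."""
    return text.translate(FOLD_TABLE)


def build_index(docs):
    """Build an inverted index keyed by folded, whitespace-split tokens."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[fold(token)].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing a token that folds like the query."""
    return index.get(fold(query), set())


# Hypothetical OCR output with typical misreads.
docs = {
    "scan-a": "INVOICE T0TAL 4MOUNT",
    "scan-b": "lNVOICE date",
}
index = build_index(docs)
print(search(index, "INVOICE"))
print(search(index, "TOTAL"))
```

The trade-off is precision: folding also makes genuinely different strings (say "A4" and "AA") collide, which may be acceptable for search recall but not for exact lookups.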


Tesseract is, sadly, quite out of date. If you would like help implementing deep learning models for OCR, let me know.


I would love to see a link to a tutorial or project that shows how to do this. As someone who has sampled many OCR products, I have been wondering why people are not using deep learning for this. It's really a match made in heaven. Or maybe the vendors are just training a system and then releasing it without updates?

The difficult part of OCR'ing forms is parsing text in a variety of word-wrapped panels and boxes and converting checkboxes to text. Is that something deep learning could be trained to handle? For example, imagine parsing the huge receipt you get when you buy a car. The text itself isn't always the challenge.


Would you mind commenting on why Tesseract is out of date? I see developers are still active on it:

https://github.com/tesseract-ocr/tesseract/commits/master


There are people still working on VAX systems - are VAX systems not out of date by the same logic?

Spend one afternoon with Tesseract and one afternoon with Google's text recognition API. The difference in the quality of the results is night and day.

I would love there to be an open source one that can compete, which is why I said "sadly". But if you're interested in quality of results, deep learning is the way to go.


Google has been one of the biggest contributors to Tesseract – has that changed? My understanding – which could be years out of date – was that Google Books used Tesseract but that most of their effort had gone into either advanced image preprocessing or large-scale training.


Yes, you're right - they were one of the biggest contributors until roughly 12-18 months ago.

Now it's deep learning. It's at the point where there's no point in spending ages manually 'feature engineering'; just throw some GPU, and soon TPU, processing power at it.


Thanks for the update – has any of that been described in public?



