
Just tested with a random scanned page (http://www.hpl.hp.com/research/info_theory/ShannonWeb/fullsi...); the result is almost garbage. It seems as bad as most OCR software I have encountered. This was to be expected, as it is based on ocrad.



Almost garbage? This is the OCR result for the 2nd paragraph. Almost perfect, although the last word in each line gets joined to the first one in the next line:

"The fundamental problem of communication is that of reproducing atone point either exactly or approximately a message selected at anotherpoint. Frequently the messages have meamlng; that is they refer to or arecorrelated according to some system with certain physical or conceptualentities. These semantic aspects of communication are irrelevant to theengineering problem. The significant aspect is that the actual message isone selected from a set of possible messages. The system must be designedto operate for each possible selection, not just the one which will actuallybe chosen since this is unknown at the time of design."
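The joined words ("atone", "anotherpoint") look like the recognized lines being concatenated without a separator. A minimal sketch of the fix, assuming the OCR step hands back one string per line (the `join_ocr_lines` helper and the line breaks in the sample are hypothetical, not taken from the tool itself):

```python
def join_ocr_lines(lines):
    """Join recognized text lines into one paragraph, separated by single spaces."""
    # Strip each line and drop empty ones so blank rows don't add extra spaces.
    cleaned = [line.strip() for line in lines if line.strip()]
    return " ".join(cleaned)


# Hypothetical per-line output; the real line breaks of the scan are assumed.
sample_lines = [
    "The fundamental problem of communication is that of reproducing at",
    "one point either exactly or approximately a message selected at another",
    "point.",
]
print(join_ocr_lines(sample_lines))
```

This doesn't handle words hyphenated across a line break; those would need a separate rule.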


I tried it in both ocrad and tesseract modes. Indeed, the ocrad mode produces garbage, while the tesseract mode produces a really good result but takes longer (mainly the time it takes to upload the entire thing and get the result back).

That seems to make sense to me, at least. Use ocrad mode by default; if it doesn't perform well, switch to tesseract and you'll hopefully get a better result.
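A rough sketch of that fallback idea, for what it's worth. Everything here is an assumption: the engines are passed in as plain callables (nothing below calls ocrad or tesseract directly), and `looks_reasonable` is a made-up heuristic that just checks how many tokens look like ordinary words.

```python
import re


def looks_reasonable(text, min_word_ratio=0.7):
    """Hypothetical quality check: share of tokens that look like plain words."""
    tokens = text.split()
    if not tokens:
        return False
    # A token counts as word-like if it is letters with optional trailing punctuation.
    wordlike = sum(1 for t in tokens if re.fullmatch(r"[A-Za-z]+[.,;:!?]?", t))
    return wordlike / len(tokens) >= min_word_ratio


def ocr_with_fallback(image, fast_engine, slow_engine, is_good=looks_reasonable):
    """Try the fast engine first; fall back to the slow one if the text looks bad."""
    text = fast_engine(image)
    if is_good(text):
        return text, "fast"
    # The fast result failed the check, so pay the extra cost of the slow engine.
    return slow_engine(image), "slow"


# Stand-in engines for illustration only; real ones would wrap the actual OCR calls.
def fake_fast(image):
    return "t~e f|_|n d@m 3ntal pr0b1em"


def fake_slow(image):
    return "The fundamental problem"


print(ocr_with_fallback(b"scan", fake_fast, fake_slow))
```

The threshold is arbitrary, and a dictionary lookup would be a better check than a regex, but it shows the shape: only pay the upload round trip when the cheap pass fails.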


When I did the test, it was garbage. Since your answer, I have repeated my test with results similar to yours.


Thanks! I wanted to try it, not sure it would fare better than the usual OCR, but was denied as I'm not a Google product.



