Using Pytesseract to Convert Images into an HTML Site

markdown · on March 8, 2020

First of all, great work, and thank you for sharing.

The video only shows this working with an image of a text-only page. What happens when there are photos embedded in the image?

armaizadenwala · on March 8, 2020

Hi! Thank you!

Tesseract is trained to recognize only text in images. I haven't looked into image detection yet, though.

This project fits the situation where you need to digitize a bunch of physical copies / scans of documents. Sometimes these documents have images, like company logos, which would be useful to include in the final HTML page.

I'll try to look into it; it's a wonderful idea for a second part. The current post is geared towards helping others transition into the world of data science with OCR by describing every step of the way.
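
On the embedded-photo question, here is a minimal sketch (not the method from the post) of how word-level OCR output could be turned into HTML paragraphs while skipping entries with no recognized text. It assumes a dict shaped like the one `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns, with parallel lists under keys such as `block_num`, `par_num`, and `text`. The function name `ocr_dict_to_html` and the sample data are hypothetical.

```python
from html import escape
from itertools import groupby


def ocr_dict_to_html(data):
    """Turn an image_to_data-style dict into HTML paragraphs.

    Rows with empty text (page/block/line rows, or regions where no
    words were recognized) are skipped.
    """
    words = [
        (data["block_num"][i], data["par_num"][i], data["text"][i].strip())
        for i in range(len(data["text"]))
        if data["text"][i].strip()
    ]
    paragraphs = []
    # Consecutive words sharing (block_num, par_num) form one paragraph.
    for _, group in groupby(words, key=lambda w: (w[0], w[1])):
        line = " ".join(w[2] for w in group)
        paragraphs.append("<p>" + escape(line) + "</p>")
    return "\n".join(paragraphs)


# Hand-written sample in the assumed shape, not real OCR output.
# Block 2 stands in for a region with no recognized words.
sample = {
    "block_num": [1, 1, 1, 2, 3, 3],
    "par_num": [1, 1, 1, 1, 1, 1],
    "text": ["", "Hello", "world", "", "Second", "block"],
}
print(ocr_dict_to_html(sample))
```

With real input, `data` would come from the `image_to_data` call. Tesseract may also emit noise for photo regions rather than nothing, so a filter on the `conf` column might be needed as well.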

riedel · on March 8, 2020

Nice. But why are you attributing Tesseract solely to Google when it was initially developed by HP? Does it help marketing nowadays?

netgusto · on March 8, 2020

I'd argue that one can refer to Tesseract as a Google product without being deceptive, as it's been developed by Google since 2006 [1].

[1] https://github.com/tesseract-ocr/tesseract#brief-history