Using Pytesseract to Convert Images into a HTML Site (armaizadenwala.com)
73 points by armaizadenwala on March 8, 2020 | 4 comments



First of all, great work, and thank you for sharing.

The video only shows this working with an image of a text-only page. What happens when there are photos embedded in the image?


Hi! Thank you!

Tesseract is trained to recognize only text in images. I haven't looked into image detection yet, though.

This project fits the situation where you need to digitize a bunch of physical copies or scans of documents. Sometimes these documents contain images, such as company logos, that would be useful to include in the final HTML page.

I'll try to look into it; it's a wonderful idea for a second part. The current post is geared towards helping others transition into the world of data science with OCR by describing every step along the way.
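
For readers following along, the text-only flow discussed here can be sketched roughly as below. This is not the article's code: `text_to_html` and `ocr_image_to_html` are hypothetical helper names, and the sketch assumes pytesseract and Pillow are installed alongside the Tesseract binary. It also assumes paragraphs in the OCR output are separated by blank lines, which may not hold for every scan.

```python
import html


def text_to_html(text: str, title: str = "Scanned document") -> str:
    """Wrap plain OCR text in a bare-bones HTML page (hypothetical helper)."""
    # Treat blank-line-separated blocks as paragraphs (an assumption about the OCR output).
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    # Collapse line breaks inside a block and escape characters such as & and <.
    paragraphs = "\n".join(
        "<p>" + html.escape(" ".join(b.split())) + "</p>" for b in blocks
    )
    return (
        "<!DOCTYPE html>\n<html>\n<head><title>"
        + html.escape(title)
        + "</title></head>\n<body>\n"
        + paragraphs
        + "\n</body>\n</html>\n"
    )


def ocr_image_to_html(image_path: str) -> str:
    """Run Tesseract on one image and return an HTML page (hypothetical helper)."""
    # Imported lazily so text_to_html works without the OCR dependencies.
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open(image_path))
    return text_to_html(text)
```

Only recognized text flows through this path, so photos or logos in the scan are not carried over as images; handling them would need a separate detection and cropping step.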


Nice. But why are you attributing Tesseract solely to Google when it was initially developed by HP? Does it help marketing nowadays?


I'd argue that one can refer to Tesseract as a Google product without being deceptive, as it's been developed by Google since 2006 [1].

[1] https://github.com/tesseract-ocr/tesseract#brief-history



