
The Node module of the same name (https://github.com/dbashford/textract) also supports image OCR (via Tesseract), Excel files, RTF, and other formats.



I also assumed that it was some kind of Python wrapper or implementation of Tesseract OCR when I saw that name. One would think so, given that Tesseract is (one of?) the best-performing OCR programs out there.


Thanks for pointing this out. I've been working on a text extractor in Go at work and tried for a long time to get UnRTF working with RTF files containing Japanese characters to no avail. This lib lists catdoc as the extractor they use for RTF, so I'm going to give that a try.

Edit: Looks like catdoc doesn't work with RTF files containing Japanese characters either. Might end up having to use LibreOffice or something like that.


For what it's worth, textract (Python) also has ambitions of including OCR through the tesseract-ocr project: https://github.com/deanmalmgren/textract/issues/16



