The node module by the same name (https://github.com/dbashford/textract) also su...

wangman · on Aug 4, 2014

I also assumed it was some kind of Python wrapper or implementation of Tesseract OCR when I saw the name. One would think so, given that Tesseract is (one of?) the best-performing OCR programs out there.

shawnps · on Aug 4, 2014

Thanks for pointing this out. I've been working on a text extractor in Go at work and tried for a long time to get UnRTF working with RTF files containing Japanese characters to no avail. This lib lists catdoc as the extractor they use for RTF, so I'm going to give that a try.

Edit: Looks like catdoc doesn't work with RTF files containing Japanese characters either. I might end up having to use LibreOffice or something like that.
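
For anyone going the LibreOffice route, here is a rough sketch of shelling out to it in headless mode. It is in Python for brevity, but the same command line should carry over to Go's os/exec. The helper names `build_convert_command` and `rtf_to_text` are made up for illustration. The binary name `soffice` and the `txt:Text` filter are assumptions about a typical install, and the output encoding depends on the filter and locale, so Japanese text may need a different filter option or decoding step.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_convert_command(rtf_path, out_dir, binary="soffice"):
    """Build the argv list for a headless LibreOffice conversion to plain text.

    The binary name and the "txt:Text" filter are assumptions; adjust them
    for the local install.
    """
    return [
        binary,
        "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(out_dir),
        str(rtf_path),
    ]


def rtf_to_text(rtf_path, binary="soffice", encoding="utf-8"):
    """Convert an RTF file to text by running LibreOffice in a temp directory."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")

    with tempfile.TemporaryDirectory() as out_dir:
        subprocess.run(
            build_convert_command(rtf_path, out_dir, binary),
            check=True,
            capture_output=True,
        )
        # LibreOffice names the output after the input file's stem.
        out_file = Path(out_dir) / (Path(rtf_path).stem + ".txt")
        # The encoding here is a guess; undecodable bytes are replaced
        # rather than raising an error.
        return out_file.read_text(encoding=encoding, errors="replace")
```

Starting LibreOffice per file is slow, so for a large batch it is probably worth passing several input files to a single invocation instead.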

deanmalmgren · on Aug 4, 2014

For what it's worth, textract (Python) also has ambitions of including OCR through the tesseract-ocr project: https://github.com/deanmalmgren/textract/issues/16