Tesseract.js wraps an Emscripten port of the Tesseract OCR Engine (github.com/naptha)
146 points by modinfo on May 9, 2022 | 60 comments



I'm surprised that there are seemingly no other major FOSS OCR engines besides Tesseract, and Tesseract is quite frankly horrible. I once tried to use it on a high-resolution screenshot of a Discord message containing only the characters "0" and "1". I cropped it to only have the text, restricted the character set, tried fiddling with the image's contrast and whatnot, and the result was still quite poor, with many characters mistaken or straight up ignored.

I have little expertise in ML, but from my limited understanding, OCR is the bread and butter of the field. I've read exactly one "Intro to ML" article and it was about recognising digits. And yet, we have an abundance of high quality proprietary OCRs that can recognise printed or even hand-written text and the single open source one is having trouble with perfectly formatted text with a readable font.

Could anyone with more expertise shine some light on this current state of affairs?
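For reference, the setup described above (cropped image, character set restricted to "0" and "1") can be sketched as a Tesseract CLI call built from Python. This is only a sketch under assumptions: the function names are hypothetical, the image is assumed to be a single uniform block of text (`--psm 6`), and whether the whitelist is honored depends on the Tesseract version and engine in use.

```python
import shutil
import subprocess


def build_binary_digit_command(image_path: str) -> list[str]:
    """Build a Tesseract CLI call restricted to the characters 0 and 1."""
    return [
        "tesseract",
        image_path,
        "stdout",                            # print the text instead of writing a file
        "--psm", "6",                        # assumption: one uniform block of text
        "-c", "tessedit_char_whitelist=01",  # only allow the characters "0" and "1"
    ]


def recognize_binary_digits(image_path: str) -> str:
    """Run Tesseract on the image; requires the tesseract binary on PATH."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_binary_digit_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Keeping the command construction separate from the subprocess call makes it easy to experiment with other page segmentation modes without touching the rest.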


> I once tried to use it on a high-resolution screenshot of a Discord message containing only the characters "0" and "1". I cropped it to only have the text, restricted character sets, tried fiddling with the images contrast and what not and the result was still quite poor, with many characters mistaken or straight up ignored.

I had the opposite experience.

My partner was doing a project for the Army Corps of Engineers and they only provided information via some system called ProjNet that, best I can tell, exported PDFs of web pages in pure vector format, so they were unsearchable. Of course they needed to search 10,000 pages of documents to answer questions for the ACoE.

I was able to feed the PDFs into Tesseract, produce one text document per PDF page, and then marry each one back up to the PDF so they could search the PDFs. It worked astonishingly well and took about half an hour using the cringiest of shell scripts.
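The original shell script isn't shown, so the following is only a sketch of what such a pipeline might look like, not the actual script. It assumes `pdftoppm` (from poppler-utils) is used to render pages to PNG and that the `tesseract` CLI is installed; the function names and directory layout are hypothetical.

```python
import os
from pathlib import Path


def rasterize_command(pdf_path: str, out_prefix: str, dpi: int = 300) -> list[str]:
    """pdftoppm renders every PDF page to a PNG named <out_prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def ocr_command(image_path: str, searchable_pdf: bool = False) -> list[str]:
    """Tesseract writes <output base>.txt by default; the 'pdf' config
    asks for a searchable PDF instead."""
    output_base = os.path.splitext(image_path)[0]
    command = ["tesseract", image_path, output_base]
    if searchable_pdf:
        command.append("pdf")
    return command


def plan_page_commands(image_dir: str) -> list[list[str]]:
    """One OCR command per rendered page image, in sorted filename order."""
    return [ocr_command(str(p)) for p in sorted(Path(image_dir).glob("*.png"))]
```

Each returned list can be passed to `subprocess.run(..., check=True)`. Producing one output per page keeps the mapping between text and PDF page trivial.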

I did something similar with SDGE's published rate tables to convert their screenshots of XLS files back into tabular data. It didn't work as well but still got the job done.


As I mentioned in another comment, EasyOCR and PaddleOCR.


I've had good results with EasyOCR, much better than Tesseract. I agree with you, Tesseract has performed very poorly in my experience.

https://github.com/JaidedAI/EasyOCR


It's amazing to me that there's so little in the OSS world about handwriting recognition. From an OCR perspective, I understand it's much harder than printed text, but there's not really anything for "online" handwriting recognition either (written on a screen/vectorized strokes). From my understanding online recognition should be easier than scanning printed text, and yet there aren't any tools out there that I can find.


Related:

Tesseract.js – A Javascript port of the Tesseract OCR engine - https://news.ycombinator.com/item?id=28105850 - Aug 2021 (37 comments)

Tesseract OCR - https://news.ycombinator.com/item?id=27876383 - July 2021 (65 comments)

Tesseract Teaser - https://news.ycombinator.com/item?id=26400168 - March 2021 (7 comments)

Tesseract.js: Pure JavaScript OCR for 100 Languages - https://news.ycombinator.com/item?id=21843713 - Dec 2019 (77 comments)

A guide to OCR with Tesseract, OpenCV and Python - https://news.ycombinator.com/item?id=21843342 - Dec 2019 (12 comments)

Using Tesseract OCR with Python - https://news.ycombinator.com/item?id=14741124 - July 2017 (47 comments)

Show HN: Tesseract.js – Pure JavaScript OCR for 60 Languages - https://news.ycombinator.com/item?id=12694004 - Oct 2016 (97 comments)


I made a utility that cleans up your Mac desktop and uses Tesseract to extract text from screenshots. This makes it really easy to find screenshots by searching for a line of text you remember.

https://gitlab.com/bearjaws/cluttr#readme


This is awesome, you should post this as its own thing


Tesseract is the ideal OCR SDK for reading simple black-and-white documents. If you aim to read “text in the wild” or scene text, then Firebase offers a much better alternative. I had a lot of hope for Tesseract 4.0, which is supposed to be based on NNs, but so far it has performed just marginally better than 3.0.


For “text in the wild” or scene text, the last time I checked, EasyOCR and PaddleOCR were both good.


I expected these to still be pretty low quality, but surprisingly some quick tests show that EasyOCR seems to be doing relatively decently at pulling text out of smartphone pics of documents.

Thanks for sharing these -- it's maybe just my very bad searching skills but I had been trying to set some stuff up with Tesseract and had come to the conclusion that I just couldn't use it for document photos and would either need to abandon that effort and buy a faster scanner, or hook into some proprietary service like Google/Apple.

Both of these look really promising, so now I'm excited again about the potential of setting up a fast Open Source way to digitize my documents.


Just IMHO, Apple's Vision framework has been great too, and very easy to get started with.


Vision's rectangle detection and document scanner have worked well for us, but Vision pales in comparison to what Google's MLKit OCR offers. MLKit OCR also does language detection and supports more languages out of the box.

EasyOCR is definitely interesting and something that's worked well for us at a prototyping level.


This was my experience too... I tried to use Tesseract for a mobile app that scanned food labels in real time using the camera video feed. I found that Google's ML library text recognition was much faster and more reliable.


And what are you supposed to use if you're not doing it on Android and therefore can't use Firebase?


It's part of the Cloud Vision API which supports (g)RPC and REST. Used it in a trading bot to detect if a tweet image (from Elon Musk) contained any mention of the text Doge or Dogecoin, or even a real dog.


I tried to use Tesseract for a personal hobby project and found it very lacking. The OCR was not very accurate. I ended up switching to Azure Vision services which gives you 500 free OCR API calls a day (or some similar limit). This was perfect for my needs.


Sadly, one of my small hobbies is converting my own movies (Blu-ray) into digital files for my home server, and the subtitles are all image-based. The app I use (SubtitleEdit) relies on Tesseract for conversion, but it's far from perfect. :(

Sure, I could use someone else's subtitle file from the Internet, but that's not as fun as doing it yourself.


I'm not sure what engine it uses for OCR, but I recently had to convert some image-based subtitles to text-based `.srt` format, and had quite good success using a tool named `subtitlecomposer` to do the conversion. During the initial import of the original image-based subtitles, the software stops on every character it does not immediately recognize and asks you what symbol it's looking at. It then builds up a "symbols" map by doing this over and over again, each time it does not recognize the image of a letter, number, or other symbol. This map file can then be renamed to match other subtitle files, in order to process other subtitles in the same font/typestyle. Eventually the symbol map ends up so complete that entire subtitles get converted automagically on import, without ever needing to ask the user any further questions.
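The symbol-map workflow described above can be sketched roughly like this. It is not how `subtitlecomposer` is actually implemented (I don't know its internals); it assumes glyphs have already been segmented into small bitmaps and that lookup is by exact match, whereas a real tool would likely tolerate small pixel differences. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A glyph bitmap as rows of '#' (ink) and '.' (background) characters.
Glyph = Tuple[str, ...]


def recognize(
    glyphs: List[Glyph],
    symbol_map: Dict[Glyph, str],
    ask_user: Callable[[Glyph], str],
) -> str:
    """Convert glyph bitmaps to text, asking the user only about unknown glyphs.

    Answers are stored in symbol_map, so the same map can be reused for
    other subtitle files rendered in the same font.
    """
    text = []
    for glyph in glyphs:
        if glyph not in symbol_map:
            # First time this shape is seen: ask once and remember the answer.
            symbol_map[glyph] = ask_user(glyph)
        text.append(symbol_map[glyph])
    return "".join(text)
```

Once the map covers every shape in the font, `ask_user` is never called and the conversion runs unattended, which matches the behavior described.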


It's annoying to find out, after looking through the entire thing, that the actual code that does the OCR is not in this repo. It's just a bunch of scheduling and worker logic, and for some reason the JS is written twice, once for the browser and once for Node.

The actual code that does the OCR is wrapped and included via this package [0], which just wraps the original Tesseract in C++ [1] using wasm. Shameful title.

[0] https://github.com/naptha/tesseract.js-core

[1] https://github.com/jeromewu/tesseract


Can it be considered "pure" when the project uses WASM?


In my experience "pure JS" is normally used to differentiate projects from those using NodeJS FFI - the important part is the target / executing runtime (which would be the JS runtime for WASM), rather than the project source.

For my own purposes, the priority for me when reading "pure" is that the core runtime I'm using (the JS runtime) is the only runtime dependency - I'm not depending on external binaries and execution environments like an FFI implementation would.

It also opens up codebases to browser compat, where FFI would typically not be available.


WASM is a form of FFI.


Wait, how so?

WASM isn't an interface or a wrapper, it's a language/format. Having trouble understanding what you mean by this, unless you're arguing that the WASM VM itself is the FFI?


I can kinda see what they mean (albeit it's somewhat of a stretch).

WASM "embeds" modules within the JS runtime, in a similar way that traditional FFI "embeds" native bindings compiled separately & externally. It's still quite different insofar as the VM is a part of the runtime, but there are vague parallels.

For me though, the practical runtime-environment problems that one encounters with traditional FFI bindings calling dynamically linked native libraries are rarely present with a WASM library, as the support within the runtime is explicit (the only real exception here is architecture, which is always an issue regardless).


It is FFI from JavaScript's point of view: a way to call multiple native languages from JavaScript that requires import and export definitions. A wasm file is no different from a .o, .a, .obj, or .lib, other than not using the instructions of a real CPU on the market.


I can sort of see this, in the sense that the style of code I'm writing when I use WASM is similar to the style of code when calling into an FFI.

I think the implications of that code are different, but yeah, I see your point and I think it's fairly reasonable.


This is actually a decent way of framing it.

> A foreign function interface (FFI) is a mechanism by which a program written in one programming language can call routines or make use of services written in another.

https://en.m.wikipedia.org/wiki/Foreign_function_interface
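As a concrete point of comparison, here is what a traditional FFI call looks like, using Python's `ctypes` to call the C routine `strlen`. This is a sketch that assumes a POSIX system; it illustrates the definition quoted above and the "import and export definitions" mentioned earlier in the thread, not WASM itself.

```python
import ctypes
import ctypes.util

# Load the C standard library. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the foreign signature up front, comparable in spirit to the
# import/export definitions a WASM module exposes to its host.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Call the C routine strlen from Python."""
    return libc.strlen(data)
```

The practical difference raised in the thread still applies: this depends on a native library being present on the host, while a WASM module ships with the package and runs inside the JS runtime.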


I think people just mean 'no server side ML' when they say pure JS in this context.


So pure browser or client side implementation then.


I imagine this is celebrating the V2 version release with WebAssembly? I believe it _used_ to be pure JS?


Any idea how the performance compares to native code from the original Tesseract?


Much much worse, unfortunately. Though no fault of the maintainer.

I use it in an Electron project, and for a document that takes about 1.5 sec per page with the Tesseract CLI, the best I can get down to with Tesseract.js is about 15 sec, even with parallelization.


Much worse. The accuracy is also worse.


Can I do OCR all-in-browser with this, without involving any backend? I'm not very familiar with OCR accuracy metrics; how accurate is Tesseract?


Tesseract works well with clean input, though in my experience it suffers greatly as soon as anything gets noisy.


> Can I do OCR all-in-browser with this

Literally third sentence of the description:

> It works in the browser using webpack or plain script tags with a CDN and on the server with Node.js.


To be honest, it is a confusing sentence. If it requires a Node.js server then the answer should be no. Also, it's confusing to me why it requires a Node.js server ...


But this literally says it doesn't? It's "or".


I cannot give you numbers (it would be nice to have a benchmark), but I can tell you the results are quite satisfactory. Compared to some OCR results you find from mainstream commercial products, it can be much better if you pre-process the input.

It may be missing a few features (some that I needed I had to code in myself).


Yes, but Tesseract is very inaccurate. Think "early 2000s speech recognition" accuracy.


Yes, this runs in your browser.


Why are OCR tools (Tesseract and Paddle) written in Python? Even this one is in JS.

Is there any single-binary static OCR tool comparable to these two?


This is not written in JS. It's C++ code compiled into something runnable under JS. If you want to run Tesseract from Python, there is PyTesseract, which is a wrapper around the Tesseract CLI. Also, I'm sure there are Python bindings to invoke the Tesseract libs without going through the CLI, but I've never looked it up.


For people who are forbidden to compile C++?


This is a wrapper around a C++ codebase compiled with Emscripten, so "pure" in the title doesn't really make sense.


> so "pure" in the title doesn't really make sense.

I can see where you're coming from, but I've never used or heard anyone in the web world use "pure" to mean only "written entirely in Javascript without transpilation or other tools."

If it hits the parts of "pure JS" that most people care about:

- it's running entirely in Javascript.

- it has no native dependencies.

- it can run entirely clientside.

- it can be embedded in a normal web page.

then I think most people will be fine with using "pure" to describe it.

----

I wouldn't even have that many quibbles with their phrasing even if they were compiling to WASM. Sure, at that point it wouldn't be running as pure javascript, but it would still hit 3 of the 4 points above.


“Pure” does definitely connote that you’ll be able to read all the code in the given language.

That’s exactly what “pure” means: “pure rust”, “pure go”, etc. IMO you can’t say the heart of all the work is a C++ lib and call it a “pure JS” anything.

A more accurate / correct / useful description would be “JS wrapper over a C++ library cross-compiled to JS”. Or maybe “All JS at runtime” or some other qualifier.

Want to see how to cross-compile a non-trivial C++ lib? Check out here!

Want to see a great JS wrapper library where we had to make cross-language & cross memory-management API decisions? Check this out!

But - want to read some awesome high performance image processing algorithms in JS? Not this.


> “Pure” does definitely connote that you’ll be able to read all the code in the given language.

I'm not sure where the line here is supposed to be drawn, but I don't personally think looking at asm.js code is significantly harder than looking at something like compiled Typescript, and certainly it's not any harder than looking at minified Javascript. If I'm going to be debugging code, they're both going to be annoying to look at. We're quibbling over definitions so I'm not going to say that you're wrong, "pure" can mean whatever you want it to mean. I'm just saying that most JS devs I know consider (for example) JSX code to still be pure Javascript when it's compiled.

It seems a little odd to me to look at something where every single line of code is Javascript, being run entirely in a Javascript interpreter, and say that isn't actually completely real Javascript, but if the programming circles you frequent are different and think about this differently, :shrug: more power to you.

> But - want to read some awesome high performance image processing algorithms in JS? Not this.

I also wouldn't necessarily assume that every Open Source program written in only JS without compilation is going to be well suited for reading or learning from. But again, sort of splitting hairs here -- my only concern is that you're probably going to be disappointed a lot if you equate "pure Javascript" with "readable".


I tend to agree. When I see "pure" I don't think about the code being hand-written in JS; I think more about the potential to run in the browser OR that no native modules for Node are required.

That being said perhaps a poll is needed to find out what most people think.


From a practical standpoint, WASM is "pure JS" insofar as if I am browsing libraries and see one advertising itself as "pure JS", by convention that to me means "no FFI" or "potentially works in browser".

The only place I've seen "pure JS" used in the JS world is differentiating e.g. a Postgres client implementation that does or doesn't depend on the specific version & configuration of libpq you have on your current system. That's about runtime deps, not about source language.


That's funny, when someone says "pure JS" I expect exactly that, something written in JS and only JS. Language matters.


If this is compiling to asm.js then every single line of the program is Javascript and it's running entirely in a Javascript interpreter. If it's not Javascript, then what is it?

I mean, if someone compiles a Markdown document and sticks the result on their website, do you say, "this isn't HTML"? People are free to use words however they want I guess, but I don't understand the perspective where the way a project was written suddenly means that the compiled result isn't Javascript -- it feels like it's taking the word "pure" in a metaphysical direction that I just don't really grok.


Can emscripten compile to JS? I thought it could only compile to WASM.


I'm not sure which they're targeting, but early emscripten targeted asm.js, it predated WASM. I would consider asm.js to still be Javascript, it's just an optimized subset of the language.

I'll concede though that if they're targeting WASM it's not technically pure Javascript, but it still feels a bit to me like splitting hairs since it's always been explained to me that WASM and the Javascript runtime under the hood have a lot of overlap.


Emscripten can target both WebAssembly and JavaScript. The JavaScript option uses wasm2js - it compiles first to wasm, then compiles that to JS.

https://github.com/WebAssembly/binaryen#wasm2js

The emcc flag -sWASM=0 disables the wasm final output and emits JS instead.


Emscripten started as an asm.js compiler, IIRC? Or even "plain" JS.


Yes, historically Emscripten began before asm.js, targeting pure JS. That JS backend was replaced by an asm.js backend, which was later replaced by the current wasm backend (the wasm backend in upstream LLVM).

(But as already mentioned, JS is still supported today, using wasm2js.)


IIRC Emscripten was around before browser support for WASM was really a thing.


Ok, we've replaced "Tesseract.js – Pure JavaScript OCR" with a more specific sentence from the OP.



