I'm surprised that there seem to be no major FOSS OCR engines other than Tesseract, and Tesseract is quite frankly horrible. I once tried to use it on a high-resolution screenshot of a Discord message containing only the characters "0" and "1". I cropped it to contain only the text, restricted the character set, tried fiddling with the image's contrast and whatnot, and the result was still quite poor, with many characters mistaken or straight up ignored.
I have little expertise in ML, but from my limited understanding, OCR is the bread and butter of the field. I've read exactly one "Intro to ML" article and it was about recognising digits. And yet, we have an abundance of high-quality proprietary OCRs that can recognise printed or even hand-written text, while the single major open source one has trouble with perfectly formatted text in a readable font.
Could anyone with more expertise shine some light on this current state of affairs?
> I once tried to use it on a high-resolution screenshot of a Discord message containing only the characters "0" and "1". I cropped it to contain only the text, restricted the character set, tried fiddling with the image's contrast and whatnot, and the result was still quite poor, with many characters mistaken or straight up ignored.
I had the opposite experience.
My partner was doing a project for the Army Corps of Engineers, and they only provided information via some system called ProjNet that, best I can tell, exported PDFs of web pages in pure vector format, so they were unsearchable. Of course they needed to search 10,000 pages of documents to answer questions for the ACoE.
I was able to feed the PDFs into Tesseract and produce a 1:1 text document per page of PDF, then marry it back up to the PDF so they could search the PDFs. It worked astonishingly well and took about half an hour using the cringiest of shell scripts.
I did something similar with SDGE's published rate tables to convert their screenshots of XLS files back into tabular data. It didn't work as well but still got the job done.
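Not my actual script, but a minimal sketch of that kind of pipeline in Node, assuming poppler's `pdftoppm` and the `tesseract` CLI are on the PATH (file names are illustrative):

```js
// Render each PDF page to a PNG, OCR it, keep one .txt file per page.
// Assumes `pdftoppm` (poppler-utils) and `tesseract` are installed.
const { execSync } = require("child_process");
const fs = require("fs");

execSync("pdftoppm -r 300 -png input.pdf page"); // page-1.png, page-2.png, ...

for (const png of fs.readdirSync(".").filter(f => /^page-\d+\.png$/.test(f))) {
  // `tesseract input.png outputbase` writes outputbase.txt
  execSync(`tesseract ${png} ${png.replace(/\.png$/, "")}`);
}
```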
It's amazing to me that there's so little in the OSS world about handwriting recognition. From an OCR perspective, I understand it's much harder than printed text, but there's not really anything for "online" handwriting recognition either (written on a screen/vectorized strokes). From my understanding online recognition should be easier than scanning printed text, and yet there aren't any tools out there that I can find.
I made a utility that cleans up your Mac desktop and uses Tesseract to extract text from screenshots. This makes it really easy to find screenshots by searching for a line of text you remember.
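The extraction step itself is only a few lines; with tesseract.js it looks roughly like this (a sketch, with a made-up file name):

```js
// Sketch: pull searchable text out of a screenshot with tesseract.js.
const Tesseract = require("tesseract.js");

Tesseract.recognize("Screen Shot 2021-01-01.png", "eng")
  .then(({ data: { text } }) => console.log(text));
```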
Tesseract is an ideal OCR SDK for reading simple black-and-white documents. If you aim to read “text in the wild” or scene text, then Firebase offers a much better alternative. I had a lot of hope for Tesseract 4.0, which is supposed to be based on NNs, but so far it has performed only marginally better than 3.0.
I expected these to still be pretty low quality, but surprisingly some quick tests show that EasyOCR seems to be doing relatively decently at pulling text out of smartphone pics of documents.
Thanks for sharing these -- it may just be my very bad searching skills, but I had been trying to set some stuff up with Tesseract and had come to the conclusion that I just couldn't use it for document photos, and would either need to abandon that effort and buy a faster scanner, or hook into some proprietary service like Google/Apple.
Both of these look really promising, so now I'm excited again about the potential of setting up a fast Open Source way to digitize my documents.
Vision's rectangle detection or document scanner has worked well for us, but it pales in comparison to what Google's MLKit OCR offers. MLKit OCR also does language detection + more languages out of the box.
EasyOCR is definitely interesting and something that's worked well for us at a prototyping level.
This was my experience too... I tried to use Tesseract for a mobile app that scanned food labels in real time using the camera video feed, and found that Google's ML library's text recognition was much faster and more reliable.
It's part of the Cloud Vision API, which supports (g)RPC and REST. I used it in a trading bot to detect whether a tweet image (from Elon Musk) contained any mention of the text Doge or Dogecoin, or even a real dog.
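The text part is a single call with the official Node client; roughly like this (a sketch, assuming credentials are already configured and with a made-up image path):

```js
// Sketch: detect "Doge"/"Dogecoin" mentions in an image via Cloud Vision.
const vision = require("@google-cloud/vision");
const client = new vision.ImageAnnotatorClient();

async function mentionsDoge(imagePath) {
  const [result] = await client.textDetection(imagePath);
  const text = result.fullTextAnnotation ? result.fullTextAnnotation.text : "";
  return /doge(coin)?/i.test(text); // the "real dog" case needs labelDetection instead
}
```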
I tried to use Tesseract for a personal hobby project and found it very lacking. The OCR was not very accurate. I ended up switching to Azure Vision services which gives you 500 free OCR API calls a day (or some similar limit). This was perfect for my needs.
Sadly, one of my small hobbies is to convert my own movies (Blu-ray) into digital files for my home server, and the subtitles are all image-based. The app I use (SubtitleEdit) relies on Tesseract for conversion, but it's far from perfect. :(
Sure, I could use someone else's subtitle file from the Internet, but that's not as fun as doing it yourself.
I'm not sure what engine it uses for OCR, but I recently had to convert some image-based subtitles to text-based `.srt` format, and had quite good success using a tool named `subtitlecomposer` to do the conversion. During the initial import of the original image-based subtitles, the software stops on every character it does not immediately recognize and asks you what symbol it's looking at. It then builds up a "symbols" map by doing this over and over again, each time it does not recognize the image of a letter, number, or other symbol. This map file can then be renamed to match other subtitle files, in order to process other subtitles in the same font/typestyle. Eventually the symbol map ends up so complete that entire subtitles get converted automagically on import, without ever needing to ask the user any further questions.
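The idea is simple enough to sketch in a few lines. This is not subtitlecomposer's actual code, just the shape of the technique, with made-up names:

```js
// Sketch of an interactive symbol map: key each glyph image by a hash,
// ask the user only on a cache miss; the map fills up as you go.
const crypto = require("crypto");

const symbolMap = new Map(); // glyph-image hash -> character(s)

function recognizeGlyph(glyphPixels, askUser) {
  const key = crypto.createHash("md5").update(glyphPixels).digest("hex");
  if (!symbolMap.has(key)) {
    symbolMap.set(key, askUser(glyphPixels)); // interactive fallback
  }
  return symbolMap.get(key);
}
```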
It's annoying to find out, after looking through the entire thing, that the actual code that does the OCR is not in this repo. It's just a bunch of scheduling and worker logic, and for some reason the JS is written twice, once for the browser and once for Node.
The actual code that does the OCR is wrapped and included via this package [0], which just wraps the original C++ Tesseract [1] using WASM. Shameful title.
In my experience "pure JS" is normally used to differentiate projects from those using NodeJS FFI - the important part is the target / executing runtime (which would be the JS runtime for WASM), rather than the project source.
For my own purposes, the priority for me when reading "pure" is that the core runtime I'm using (the JS runtime) is the only runtime dependency - I'm not depending on external binaries and execution environments like an FFI implementation would.
It also opens up codebases to browser compat, where FFI would typically not be available.
WASM isn't an interface or a wrapper, it's a language/format. Having trouble understanding what you mean by this, unless you're arguing that the WASM VM itself is the FFI?
I can kinda see what they mean (though it's somewhat of a stretch).
WASM "embeds" modules within the JS runtime, in a similar way that traditional FFI "embeds" native bindings compiled separately & externally. It's still quite different insofar as the VM is a part of the runtime, but there are vague parallels.
For me though, the practical runtime-environment problems one encounters with traditional FFI bindings calling dynamically linked native libraries are rarely present with a WASM library, as the support within the runtime is explicit (the only real exception here is architecture, which is always an issue regardless).
It is FFI from JavaScript's point of view: a way to call multiple native languages from JavaScript that requires import and export definitions. A .wasm file is no different from a .o, .a, .obj, or .lib, other than not using the instructions of a real CPU on the market.
> A foreign function interface (FFI) is a mechanism by which a program written in one programming language can call routines or make use of services written in another.
Much, much worse, unfortunately. Though through no fault of the maintainer.
I use it in an Electron project, and a document that takes about 1.5 sec per page with the Tesseract CLI, I can get down to about 15 sec per page with Tesseract.js using parallelization.
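The parallelization I mean is tesseract.js's scheduler API, roughly like this (a sketch; the worker count and page list are made up):

```js
// Sketch: OCR several pages concurrently with a pool of tesseract.js workers.
const { createWorker, createScheduler } = require("tesseract.js");

async function ocrPages(pages) {
  const scheduler = createScheduler();
  for (let i = 0; i < 4; i++) {
    scheduler.addWorker(await createWorker("eng"));
  }
  // Jobs are distributed across the workers as they become free.
  const results = await Promise.all(
    pages.map(page => scheduler.addJob("recognize", page))
  );
  await scheduler.terminate();
  return results.map(r => r.data.text);
}
```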
To be honest, it is a confusing sentence. If it requires a Node.js server then the answer should be no. Also, it's confusing to me why it requires a Node.js server ...
I cannot give you numbers (it would be nice to have a benchmark), but I can tell you the results are quite satisfactory, and compared to some OCR results you find around from mainstream commercial products, it can be much better than them if you pre-process the input.
It may miss a few features (some of which I needed and had to code in myself).
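By pre-processing I mean things like grayscaling and binarizing before handing the image to Tesseract, e.g. with the `sharp` library (a sketch; the threshold value is something you'd tune per document):

```js
// One common pre-processing pass before OCR: grayscale + binarize.
const sharp = require("sharp");

sharp("scan.png")
  .grayscale()
  .threshold(160) // pixels at/above 160 become white, below become black
  .toFile("scan-clean.png");
```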
This is not written in JS. It's C++ code compiled into something runnable under JS. If you want to run Tesseract from Python there is PyTesseract, which is a wrapper around the Tesseract CLI. Also, I'm sure there are Python bindings to invoke the Tesseract libs without going through the CLI, but I've never looked it up.
> so "pure" in the title doesn't really make sense.
I can see where you're coming from, but I've never used or heard anyone in the web world use "pure" to mean only "written entirely in Javascript without transpilation or other tools."
If it hits the parts of "pure JS" that most people care about:
- it's running entirely in Javascript.
- it has no native dependencies.
- it can run entirely clientside.
- it can be embedded in a normal web page.
then I think most people will be fine with using "pure" to describe it.
----
I wouldn't have that many quibbles with their phrasing even if they were compiling to WASM. Sure, at that point it wouldn't be running as pure Javascript, but it would still hit 3 of the 4 points above.
“Pure” does definitely connote that you’ll be able to read all the code in the given language.
That’s exactly what “pure” means - “pure Rust”, “pure Go”, etc. IMO you can’t say the heart of all the work is a C++ lib and call it a “pure JS” anything.
It would be more accurate / correct / useful to call it a “JS wrapper over a C++ library cross-compiled to JS”. Or maybe “all JS at runtime” or some other qualifier.
Want to see how to cross-compile a non-trivial C++ lib? Check this out!
Want to see a great JS wrapper library where we had to make cross-language & cross memory-management API decisions? Check this out!
But - want to read some awesome high performance image processing algorithms in JS? Not this.
> “Pure” does definitely connote that you’ll be able to read all the code in the given language.
I'm not sure where the line here is supposed to be drawn, but I don't personally think looking at asm.js code is significantly harder than looking at something like compiled Typescript, and certainly it's not any harder than looking at minified Javascript. If I'm going to be debugging code, they're both going to be annoying to look at. We're quibbling over definitions so I'm not going to say that you're wrong, "pure" can mean whatever you want it to mean. I'm just saying that most JS devs I know consider (for example) JSX code to still be pure Javascript when it's compiled.
It seems a little odd to me to look at something where every single line of code is Javascript, being run entirely in a Javascript interpreter, and say that isn't actually completely real Javascript, but if the programming circles you frequent are different and think about this differently, :shrug: more power to you.
> But - want to read some awesome high performance image processing algorithms in JS? Not this.
I also wouldn't necessarily assume that every Open Source program written in only JS without compilation is going to be well suited for reading or learning from. But again, sort of splitting hairs here -- my only concern is that you're probably going to be disappointed a lot if you equate "pure Javascript" with "readable".
I tend to agree. When I see "pure" I don't think about the code being hand-written in JS; I think more about the potential to run in the browser, or that no native modules are required for Node.
That being said perhaps a poll is needed to find out what most people think.
From a practical standpoint, WASM is "pure JS" insofar as if I am browsing libraries and see one advertising itself as "pure JS", by convention that to me means "no FFI" or "potentially works in browser".
The only place I've seen "pure JS" used in the JS world is differentiating e.g. a Postgres client implementation that does or doesn't depend on the specific version & configuration of libpq you have on your current system. That's about runtime deps, not about source language.
If this is compiling to asm.js then every single line of the program is Javascript and it's running entirely in a Javascript interpreter. If it's not Javascript, then what is it?
I mean, if someone compiles a Markdown document and sticks the result on their website, do you say, "this isn't HTML"? People are free to use words however they want I guess, but I don't understand the perspective where the way a project was written suddenly means that the compiled result isn't Javascript -- it feels like it's taking the word "pure" in a metaphysical direction that I just don't really grok.
I'm not sure which they're targeting, but early Emscripten targeted asm.js; it predated WASM. I would consider asm.js to still be Javascript, it's just an optimized subset of the language.
I'll concede though that if they're targeting WASM it's not technically pure Javascript, but it still feels a bit to me like splitting hairs since it's always been explained to me that WASM and the Javascript runtime under the hood have a lot of overlap.
Yes, historically Emscripten began before asm.js, targeting pure JS. That JS backend was replaced by an asm.js backend, which was later replaced by the current wasm backend (the wasm backend in upstream LLVM).
(But as already mentioned, JS is still supported today, using wasm2js.)