Right this very moment (well, a few moments ago when I wasn't procrastinating on HN) I was in the midst of extracting data from a client's old website in preparation of creating a new website.
A lot of that data is contained within images.
From a few preliminary tests, I'm hugely impressed. This seems on-par with any other OCR software I've used, and the fact that it happens in realtime in the browser is amazing.
I tried it on a piece of content I'd just had to type out, that was originally in an image. Typing out the content took about 10 minutes. Copying and pasting with Naptha, and then making some minor edits/corrections, did the same thing in about 2 minutes.
There's actually been a bit of research on the error rates you need to beat for OCR to be cost-effective vs. having people re-type. I don't have the references handy, but I believe it's generally cost-effective to OCR with error rates up to nearly 2%, and most current "consumer grade" OCR is well below 1% error rates for scans that aren't of atrociously poor quality.
My MSc thesis was on reducing OCR error rates through various forms of pre-processing, and while I managed to achieve some reduction, one of the things I found was that, given how low error rates generally are to begin with, you have a very tiny budget of extra processing time before further error reduction just isn't worth it. If a human needs to check the document for errors anyway, a "quick and dirty" scan+OCR is often far better than spending the time to get "as good as possible" results. Spending even a few extra seconds per page to place the page perfectly in a scanner, or waiting a few extra seconds for more complicated processing, can be a net loss.
It's a perfect example of "worse is better": OCR, at least for typed text, is good enough today that the best available solutions aren't really worthwhile to spend resources on (for users) unless/until they give results so perfect it doesn't need to be checked by a person afterwards.
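A back-of-the-envelope sketch of that trade-off (every number below is illustrative, not taken from the research mentioned above):

```python
def ocr_is_worth_it(words, error_rate, typing_wpm=60,
                    reading_wpm=250, seconds_per_fix=10):
    """Compare re-typing time against proofreading plus per-error fix time.
    All parameter defaults are made-up illustrative values."""
    typing_minutes = words / typing_wpm
    correcting_minutes = (words / reading_wpm
                          + words * error_rate * seconds_per_fix / 60)
    return correcting_minutes < typing_minutes

print(ocr_is_worth_it(1000, 0.01))  # low error rate: correcting OCR wins
print(ocr_is_worth_it(1000, 0.10))  # high error rate: re-typing wins
```

With these made-up rates the break-even lands around 7-8% errors; a stricter threshold like the 2% figure presumably reflects the added expectation that the corrected output match a good typist's quality.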
A friend suggested to me that to get good OCR results, you should run the document through the scanner/OCR twice and then diff the results. Usually one or the other will get it right, and if you run the two results through a difference editor like 'meld', it's quick to fix.
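A minimal sketch of that diff step, with Python's difflib standing in for meld and two made-up OCR passes:

```python
import difflib

def diff_ocr_runs(run1, run2):
    """Report word-level disagreements between two OCR passes."""
    w1, w2 = run1.split(), run2.split()
    sm = difflib.SequenceMatcher(None, w1, w2)
    return [
        (" ".join(w1[i1:i2]), " ".join(w2[j1:j2]))
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"  # keep only the spans where the passes disagree
    ]

# Two hypothetical passes over the same line of a page
pass1 = "The quick brown fox jumps over the lazy dog"
pass2 = "The quick brovvn fox jumps over the 1azy dog"
for a, b in diff_ocr_runs(pass1, pass2):
    print(f"pass 1 read {a!r}, pass 2 read {b!r}")
```

Only the disagreeing spans need a human look; everywhere both passes agree, the text is (statistically) very likely correct.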
That may work in some cases, especially with horrible OCR engines and low-quality scanners, but frankly, when I did my research into this, the results varied extremely little from run to run, and you could usually easily identify specific artefacts in the source that tripped the engine up (rather than problems with the quality of the scan): e.g. letters that were damaged or had run together, creases in the paper, etc.
With really low-res scanners I can imagine it could make a big difference.
Back in the late '90s I worked for a company that did a lot of OCRing, and they ran the same image through multiple engines and then manually corrected the results. I think they had 3 engines, all from different companies, which processed all images and put the results into a custom format. Human beings were then employed to manually merge and correct the final text. It worked fairly well, especially considering the hardware/software available at the time.
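A toy sketch of that multi-engine merge idea, assuming (unrealistically) that the engine outputs already align character-for-character; a real pipeline would align them first, e.g. via edit distance:

```python
from collections import Counter

def majority_vote(outputs):
    """Merge OCR outputs by per-character majority vote. Assumes the
    outputs are already aligned character-for-character."""
    assert len({len(o) for o in outputs}) == 1, "outputs must be aligned"
    return "".join(Counter(chars).most_common(1)[0][0]
                   for chars in zip(*outputs))

# Hypothetical results from three different engines on the same line
engines = ["He11o world", "Hello world", "Hello wor1d"]
print(majority_vote(engines))  # → "Hello world"
```

Each engine makes different mistakes, so as long as any two agree at a given position, the vote recovers the right character; the humans then only need to arbitrate genuine three-way disagreements.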
The biggest problem was stuffing too many files into an NTFS directory. Apparently, NTFS didn't like tens of thousands of files in one directory. :)
To a certain extent, of course. The 2% was based on the assumption that if you are benchmarking against re-typing, you expect the same kind of quality you'd get from a good typist re-typing the documents.
From my own experiments, I tend to find that you can read through and correct errors only marginally faster than you can type, because you either follow along with the cursor or need to be able to position the cursor very quickly when you find an error; as the error rate increases, positioning the cursor at each error quickly becomes too slow.
Dropping accuracy in your effort to correct the text doesn't really seem to speed things up much. You might speed it up if you're willing to assume that anything that passes the spellchecker is OK (but it won't be, especially as modern OCR engines often rely on data about letter sequences, or dictionaries, when they're uncertain about characters).
If you're ok with lower accuracy, e.g. for search, and the alternative is not processing the document at all, then it'd be drastically different.
Holy crap, antimatter15 does so many cool things. I keep finding things that are really cool, then scroll down to find they are all written by him. First Shinytouch, then Protobowl years later, and now this. And he's only a year older than me (19), so it isn't that he's had more time. Check out his GitHub profile for more of his projects: http://github.com/antimatter15
Yeah, I just haven't gotten around to packaging the whole thing as a Firefox add-on. It's actually technically possible to run the whole thing on a normal unprivileged webpage (in fact, that's my development environment).
Reminds me of Powersnap on the Amiga. Many applications did their own text rendering without supporting cut and paste, so this guy called Nico Francois had the bright idea of letting you select a region of a window and matching the standard fonts against the window's bitmap.
Of course then it was "easy": almost all the text would have been rendered with one of a tiny number of fonts available on the system, with little to no distortion.
Powersnap was amazing. I seem to recall it was usually able to figure out which font each program was using and only had to search for letters in that specific font, falling back to a bigger search if that failed. I might be misremembering, but regardless, it was essentially as fast as any copy-paste today, in an environment where many programs weren't even written to support it.
Even though it solved a problem we don't usually have today (this story notwithstanding), it was still one of the most amazingly useful programs ever.
You're probably right; the manual says it did. It'd be able to get the last-used font from the RastPort structure used to draw to the window [1].
If the window was rendered with multiple fonts, that wouldn't be reliable, but I guess it'd likely be "good enough" to avoid a wider search most of the time.
@antimatter15, I have a project that does client-side image analysis and decomposes document structures. It looks like your OCR code would be a great replacement for the server-side Tesseract OCR I currently use :)
Here's what the project does now with JS + web workers:
Processing time is < 1500ms in Chrome and < 2000ms in FF.
The code is open source, though using it isn't yet polished. I'm working slowly on a blog post series detailing how to use the lib(s). https://github.com/leeoniya/pXY.js
Doesn't work great. Went to reddit's advice animal page to try it out and it doesn't seem to work with livememe (I think they have an invisible layer over their images to try and block hot linking).
Bottom: TN[ FACTTNATl'M MAWING TNISM[M[ g
INST[AD of DRIVING D[TERMIN[D TN#rWASA ll[
Maybe it needs to be a certain font for better results. Still pretty cool. Hopefully all the kinks get worked out. I would definitely find this useful.
EDIT: need to make sure the language is set to "internet meme" and it works much better.
By default it uses Ocrad.js, a pure JavaScript OCR engine (ported via Emscripten; see http://antimatter15.github.io/ocrad.js/demo.html). But if you right-click on the selection and change the language to "Internet Meme", it should transcribe it correctly (note that this sends the selection off to a server for remote processing; it's not the default due to privacy and scalability considerations at the moment).
Every time I click "Allow" on "Access data on all sites" for an extension I creep closer to my security-hole paranoia threshold. If it were all in JS, who cares? But this sends AJAX to remote servers, of course.
Checking the "Disable Lookup" item on the settings menu prevents it from making AJAX calls to any server; all processing is done locally. Of course there's a resulting drop in speed and OCR accuracy. The lookup requests are all HTTPS, are never logged, and contain no user-identifying information.
That is the wording that Google Chrome chose for "allow this extension to access the DOM on any page". It sounds bad but these are the permissions an extension needs to be able to access images and text on any page.
2) Erase Text option menu location
Using version 0.7.2, the "Erase Text" option is displayed under the "Translate" section (certainly not where I would ever intentionally look for it).
3) Select Text -> Right-click changes selection
After selecting my text, when I right-click, the selected text often (almost always) changes. For example, with the kitten text, I selected both paragraphs, but when I right-clicked to go to Translate -> Erase, the first paragraph ceased to be highlighted. After erasing the second paragraph I tried in vain to select and erase the first paragraph, but every time I'd right-click, only a single word of the selected paragraph would still be highlighted. I eventually tried erasing text while only one word was highlighted, and the entire first paragraph was erased.
4) I really appreciate the Security & Privacy section of the project page.
5) I would love to see a Firefox version of Project Naptha!
Looking through the code, you'll see he cites everything down to the blog posts he used. As he mentioned, it's based on the already-published Ocrad.js too.
This is simply incredible. I'm just blown away by it.
I wonder if you could get better performance when running locally by sending the result through a spellchecker and doing some Bayesian magic on the word choice...
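A toy sketch of that idea: pick the most frequent dictionary word close to the OCR output. The vocabulary and frequencies below are made up, and difflib's similarity ratio stands in for a real OCR error model:

```python
import difflib

# Made-up word frequencies standing in for a language model
VOCAB = {"the": 1000, "over": 200, "quick": 50, "brown": 40, "fox": 30}

def correct_word(word):
    """Return the most frequent vocabulary word that closely matches the
    OCR output; keep the original word if nothing comes close."""
    candidates = difflib.get_close_matches(word.lower(), VOCAB, n=3, cutoff=0.6)
    return max(candidates, key=VOCAB.get) if candidates else word

print(correct_word("qulck"))  # "i" misread as "l", corrected to "quick"
print(correct_word("zzz"))    # no close match, left alone
```

A proper Bayesian version would weight each candidate by P(candidate) times P(observed | candidate) under a model of common OCR confusions (l/1, O/0, rn/m), rather than a generic similarity score.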
One of the rules in the heuristic for which images to ignore is that an image needs to cover more than 19,000 square pixels, and that first image was a bit under that.
Very slick!
Does it automatically start OCRing every image, or does it wait for a user to try to select the image text?
Asking because I'm concerned about this decreasing performance.
It waits until you start selecting the image text, but the text detection starts when your cursor moves toward an image. It uses WebWorkers extensively, so on a multicore system, the performance shouldn't be hit. I haven't noticed an effect on battery life, but that's not out of the question.
Wow. Just wow. How did I live my life before this?
Once again, a simple implementation by somebody who grabs components that have been around for ages and mashes them up in a way that makes people wonder why it wasn't invented before.
I've got this installed and it'll probably never leave my Chrome profiles. Keep up the awesome work!
I remember your 2nd-place win at HackMIT; congrats again. It was THE most useful hack by far, and I'm glad you've made it a public product now, and free. Wow, it seems like you've beaten all those years of industrial OCR products, and by far. This is simply amazing; keep up the great work!!!
Randall Munroe's handwriting is a bit difficult to OCR because a lot of the letters are smushed together closely enough that it's not possible to unambiguously segment the text into distinct letters (which is a necessary first step in any OCR engine that I'm aware of). Maybe Google's (or Vicarious's) magical convolutional neural net that can solve CAPTCHAs would fare better.
> it's not possible to unambiguously segment the text into distinct letters (which is a necessary first step in any OCR engine that I'm aware of)
This made me realize I've never seen such a thing as OWR, i.e. software that would first try to recognize whole words, then go down to the character level if no satisfying match is found.
> it's not possible to unambiguously segment the text into distinct letters (which is a necessary first step in any OCR engine that I'm aware of)
In my experience, the ability to handle overlapping letters (which is very common on type-written text and professionally typeset material) is one of the key things that separate the relatively lightweight OCRs (like Ocrad and GOCR) from the big complicated ones (Tesseract, Cuneiform, Abbyy etc). Whitespace character segmentation cannot be taken for granted if you want to do any useful OCR of "historical" material.
This is amazing, and it has truly revolutionary implications for learners of scripts like Chinese, which are completely indecipherable to learners when embedded in images. I was really happy to see that this extension supports both simplified and traditional Chinese. I tried it out, and while it shows promise there, it definitely still needs a lot of work.
Cool, I implemented the stroke width transform for text detection about a year ago. Nice to see someone else implementing it, but I'm pretty sure convolutional neural nets do a better job at text localization.
1. The implementation of the Stroke Width Transform is not super good. So far, http://libccv.org/ has the best implementation of SWT, but then again, you can make neither head nor tail of that implementation.
2. There are just too many false text regions, and the text detection accuracy is nowhere near what you could call good. A mixed use of multiple OCR engines might give better results.
All that said, you can't take away the cleverness of the application of detecting text. Mind == blown, in that area.
I actually modeled my implementation after libccv's. Part of what libccv seems to do is run the transform multiple times at different scales, which isn't very computationally feasible for a pure JavaScript implementation. My implementation has a second-stage color filter which refines the SWT (this is something of a tradeoff that improves accuracy for machine-generated text and reduces accuracy for natural scenes, and I'm under the impression that the corpus used by SWT focuses on the latter).
Ocrad is being used as the default because it runs locally and it's small enough that it's easy to ship with. The remote OCR engine uses Tesseract which gets much closer to acceptable in a lot of circumstances.
But there is a lot of work that can be done to improve it. I have a friend who constantly nags me for not having a solid test corpus to run regression analysis/parameter tuning/science on. Certainly it lacks the rigor of an academic and scientific endeavor, but I've always imagined this as a sort of advanced proof of concept. I think the application of transparent and automatic computer vision deserves to be part of the interaction paradigm for the next generation of operating systems and browsers.
This looks very cool and could come in quite handy.
In case anyone from the project is monitoring: text selection did seem to work fine for me in Firefox (ESR 24.3) despite the "Not Supported" text being displayed.
The extension is awesome, and while the code is messy, it has enough little jokes to keep you amused. For those looking to access the backend OCR service: it seems to be down right now, but will hopefully come back up soon.
Here were the API references I could find for the remote OCR:
Apparently the author was one of the winners of HackMIT 2013 according to some of the comments. Couple of fun things in there if you decide to poke around in the code. Jump into naptha-wick.js for the remote logic.
It's been six months since I started this project. Just under two years after I first came up with the idea.

It's weird to think of time as something that happens, to think of code as something that evolves. And it may be obvious to recognize that code is not organic, that it changes only in discrete steps as dictated by some intelligence's urging, but coupled with a faulty and mortal memory, its gradual slopes are indistinguishable from autonomy.

Hopefully, this project is going to launch soon. It looks like there's actually a chance that this will be able to happen.

The proximity of its launch has kind of been my own little perpetual delusion. During the hackathon, I announced that it would be released in two weeks' time.

When winter break rolled by, I had determined to finish and release before the end of the year 2013. This deadline rolled further away, to the end of January term, IAP as it is known. But like all the artificial dates set earlier, it too folded against the tides of procrastination.

I'll spare you February and March, but they too simply happened with a modicum of dread. This brings us to the present day, which hopefully will have the good luck to be spared from the fate of its predecessors. After all, it is the gaseous vaporware that burns.
Yeah, I made the mistake of setting the App Engine budget to $1.00. Turns out that's probably not enough for a sustained run as HN's #2.
Yeah, the code is super messy, but I'd prefer if you didn't play around too much with the remote OCR service, specifically, the translation parts because Google Translate is pretty expensive per-use.
You have no donate link... if you're going to be on big sites like HN, you might as well have a donation link so that you can hopefully break even on App Engine.
Very impressive work. I'm not surprised to find antimatter15 behind it.
The website was not very clear about whether the work is done client-side or not (it mentions server calls). It turns out that server calls can be disabled, and the extension works quite well without them. I would disable this option by default and offer opt-in; it is better for privacy, I think.
I have a big problem with various people sending me screenshots with stack dumps in them. This is perfect for extracting them into ticket bodies, and it does it perfectly (I've just done 20 with it and manually checked them!)
This is the sort of stuff that really improves people's lives by making all data equal.
Please help: it looks brilliant, but only the test page works for me. I can't get any other pages to work. Text simply isn't selectable; the cursor remains a pointer, not an 'I' :(
I'm using the latest version of Chrome on a modern Mac and have Naptha properly installed and Chrome has been relaunched.
Awesome. I was actually at HackMIT. It's great to see you continuing to work on this. As a matter of fact, I told my friends, who were working on a similar idea for their senior project, about your project last fall. I emailed you for the Microsoft reference papers :) Not sure if I should copy and paste that.
This is really neat. I was playing with it on pictures of street signs and buildings, and realized that if I select some text and then hit ctrl+a, it tries to select everything it thinks is text. Then I used right click > Translate > Reprint to see what it thought each thing was.
I had high hopes for this, as I sometimes need to manually transcribe serial numbers from customers' screenshots.
However, it seems to confuse letter O and number 0.
Since serial numbers are not English words, I'm not sure how you would solve this unless you had a lookup table of commonly used web fonts.
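If the serial numbers follow a known format, one alternative to a font lookup is constraint-based disambiguation: map confusable characters to whatever character class each position expects. The LLLL-DDDD pattern and the confusion table below are hypothetical:

```python
# Common OCR letter/digit confusions (a hypothetical, partial table)
CONFUSION_TO_DIGIT = {"O": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
CONFUSION_TO_ALPHA = {v: k for k, v in CONFUSION_TO_DIGIT.items() if k.isupper()}

def fix_serial(text, pattern):
    """pattern uses 'L' for letter positions and 'D' for digit positions;
    any other pattern character (e.g. '-') is passed through unchanged."""
    out = []
    for ch, kind in zip(text, pattern):
        if kind == "D" and not ch.isdigit():
            ch = CONFUSION_TO_DIGIT.get(ch, ch)  # force digit positions to digits
        elif kind == "L" and ch.isdigit():
            ch = CONFUSION_TO_ALPHA.get(ch, ch)  # force letter positions to letters
        out.append(ch)
    return "".join(out)

print(fix_serial("AB0X-1O23", "LLLL-DDDD"))  # → "ABOX-1023"
```

This only works when the format is known up front, but for fixed-layout serial numbers that is often the case.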
Seemed like an interesting project; I clicked the link, scanned the page, and it seems to be an empty, pointless web page trying to explain, over pages' worth of scrolling, that it lets you deal with text trapped inside images, which I already knew when I clicked the link.
Going back to the page after closing it once, I noticed, written in smaller characters, that this somewhat pointless page is for an extension that is exclusively limited to the worst privacy offender among web browsers, one that I would not touch with a stick. Google Chrome is the new Internet Explorer to me, as its main use is to download Firefox.
In conclusion, this looked promising, but a confusing web page and browser lock-in render it useless and show that it is far from doing what it claims.
"... on every image you see while browsing the web" should be "... on every image you see while browsing the web in Google Chrome".
No GitHub and no open license tell me that, as a Linux user of Opera, I'm pretty much assured I will never see a version of this extension.
This is extremely powerful for the end user. I've been doing a bit of OCR work using some pre-processing methods combined with Tesseract and OpenCV. I am curious to know how you are doing this on the fly and also as a chrome extension. Is the processing done in JS?
The biggest thing I'd like to see is enabling in-page (control/command-f) search. In my quick scan through the page it looks like it doesn't do that… is that right? Are there plans to add invisible text to the DOM that control-f can find?
One problem with that is that it processes images lazily. It continually extrapolates cursor movements ~1 second into the future and processes the relevant parts of relevant images. But it should be possible that, after an image is processed (or even eagerly, by looking up previously recognized regions from the cached OCR server), the page could be made Ctrl+F-able.
I like the way this extension removes text in the image, but I would much rather have a video delogo filter that does not suck. It would be very useful for removing hard subtitles, station logos, screener warnings, etc.
In any case, pretty cool project, I'm a bit amazed how far we've come since I've last played with OCRs (and defeated one bad CAPTCHA implementation, still in use at pastebin.com it seems).
Cool idea, definitely worth exploring the possibilities. A quick run showed me that it often interprets "i" as "l" whenever the gap between the line and the dot is not apparent.
Now that is pretty damn cool. It will help at work when marketing people don't copy-paste an email/article and instead just put up a screenshot of it, and you want to quote something from that picture...
It basically runs SWT on the image and creates a 3D Lab histogram of the colors the SWT marked as text. Then it does a morphological dilation of 10 pixels and subtracts the original mask to get the colors of the pixels that represent the background.
Then it just binarizes the image by whether the internal histogram is larger than the corresponding value of the color on the external histogram.
It's a strategy that works quite well on machine-printed text, but probably less effective than existing strategies when it comes to scans or photographs.
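A much-simplified sketch of that two-histogram binarization, using grayscale intensities instead of a 3D Lab histogram and a crude square dilation; everything here is a stand-in for the real implementation:

```python
import numpy as np

def dilate(mask, r):
    """Crude binary dilation by a (2r+1)x(2r+1) square structuring element."""
    out = mask.copy()
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def binarize(gray, text_mask, r=10):
    """Mark a pixel as text when its intensity is more probable under the
    foreground (SWT-marked) histogram than under the histogram of the
    background ring (dilated mask minus the original mask)."""
    ring = dilate(text_mask, r) & ~text_mask
    fg = np.bincount(gray[text_mask], minlength=256).astype(float)
    bg = np.bincount(gray[ring], minlength=256).astype(float)
    fg /= max(fg.sum(), 1.0)  # normalize counts to probabilities
    bg /= max(bg.sum(), 1.0)
    return fg[gray] > bg[gray]

# Toy image: a dark 4x12 "stroke" on a light background
gray = np.full((20, 20), 240, dtype=np.uint8)
gray[8:12, 4:16] = 20
mask = gray < 128  # pretend this mask came from the SWT
print(binarize(gray, mask).sum())  # → 48 pixels classified as text
```

The real thing compares full 3D color histograms in Lab space, which is what lets it separate, say, red text from a blue gradient even when both have similar lightness.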
Curious about this too. Also, what's the stack providing Tesseract-as-a-service? According to my cursory search, Google App Engine won't run Tesseract, as it's a native library, not an API. I'd like to try this on non-Latin/CJK hardcoded subtitles, but Ocrad does Latin only.
I wrote a little C program that uses TessBaseAPI to extract letter locations, which gets triggered with ImageMagick's convert by a NodeJS script. The App Engine frontend acts as a caching reverse proxy.
I have wanted an extension to do this for so long. I even started coding my own at one stage but hit various issues. Thank you so much for creating this.
I remember seeing that from the project list and really wishing I could download it right away.
Just another example that the "ideas are worthless!" saying is bullshit. This was a great idea; anyone implementing it decently first would find success with it.
Fixing that! I also have to write the entire second half of the chronology section, but at least it looks less like I pulled a "Monty Python animator".
Almost garbage? This is the OCR result for the 2nd paragraph. Almost perfect, although the last word in each line gets joined to the first one in the next line:
"The fundamental problem of communication is that of reproducing atone point either exactly or approximately a message selected at anotherpoint. Frequently the messages have meamlng; that is they refer to or arecorrelated according to some system with certain physical or conceptualentities. These semantic aspects of communication are irrelevant to theengineering problem. The significant aspect is that the actual message isone selected from a set of possible messages. The system must be designedto operate for each possible selection, not just the one which will actuallybe chosen since this is unknown at the time of design."
I tried it with both Ocrad and Tesseract modes, and indeed, the Ocrad mode produces garbage; the Tesseract mode produces a really good result but takes longer (mainly the time it takes to upload the entire thing and get the result back).
That seems to make sense to me, at least. Use Ocrad mode by default; if it doesn't perform well, switch to Tesseract and you'll hopefully get a better result.
Cool idea; a bit buggy yet. When I try to save images I get the custom extension right-click bar instead of the normal Chrome bar to save the image, but I guess it's still under development.
It's spelled Naphtha (http://en.wikipedia.org/wiki/Naphtha). And for the HN hordes - read the bottom of the linked project page, it is supposed to be a reference to Naphtha.
Curious, what makes you find those to be better than Chrome? I recently switched to FF for a variety of random work reasons, and found it so much worse than chrome (basic UI, dev tools, speed) that I switched back asap. Maybe I'm missing something awesome about them.
I use Safari, and I find it to be better than Chrome because it's easier to sync with my iPhone and iPad, and with iCloud keychain even my passwords are synced.