I've said this many times: Google's AI services are superior in every way. They have better results and are easier to use.
Amazon is the master of the "good enough". Their service works well enough that you can check the box that it exists and then point it at all that data you already have in AWS. And that's all that most everyone needs.
If you are using AI and your competitors aren't, it doesn't really matter all that much how good the AI is -- you're gonna do better and be more efficient.
It's only after everyone is using AI that it will start to matter how good your particular implementation is. Right now we're at the stage where any implementation is better than none.
I disagree with this sentiment that any "AI" is better than none, no matter how poor it is. If I'm a customer and you try to extract text from an image for me, but it's wrong over half the time, then it looks really bad. An incorrect extraction is a bug. You're basically just adding bugs to your product. Better a good product with fewer features than a bug-ridden product with lots of features.
> "any "AI" is better than none no matter how poor it is"
You don't have to disagree. The OP said:
> "it doesn't really matter all that much how good the AI is"
I think you're both correct and you bring up an interesting point. As long as your AI is "good enough" to replace what would've taken more resources to do otherwise, it's a win. I'm not sure if that's "half" as you stated, but I bet it depends on the task. If the task is saving a few seconds to query something, then I'd agree, half of the time wrong isn't a savings. But, if you have less than half a chance at saving thousands of dollars or hundreds of hours if it works correctly, then that may be chalked up as a win.
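The tradeoff above is really just an expected-value calculation. A minimal sketch (the probabilities and dollar amounts here are made up for illustration, not taken from anyone's benchmark):

```python
def expected_value(p_success, value_on_success, cost_on_failure):
    """Expected payoff of running an unreliable automated task once."""
    return p_success * value_on_success - (1 - p_success) * cost_on_failure

# A 40% chance of saving $5,000, with a $200 cleanup cost when it fails:
# 0.4 * 5000 - 0.6 * 200 = 2000 - 120 = 1880 -> clearly still a win.
big_win = expected_value(0.4, 5000, 200)

# A 40% chance of saving a few seconds, where a failure costs more time
# than a success saves (in seconds): 1.2 - 6.0 < 0 -> not a win.
small_win = expected_value(0.4, 3, 10)
```

Same success rate, opposite conclusions, which is why "is half good enough?" depends entirely on the task.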
I dunno, there's clearly a bar for "good enough". From the article it's not obvious to me that AWS Rekognition meets that bar (yet). 21% precision with 54% recall isn't the kind of thing that achieves "if you're using AI and your competitors aren't, you're ahead".
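For context on those numbers: combining 21% precision with 54% recall into a single score via the harmonic mean (F1, which is also what Hmean-style OCR benchmarks report) gives only about 0.30:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

score = f1(0.21, 0.54)  # 2 * 0.1134 / 0.75 = 0.3024
```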
I like your framing/sentiment though: it's not about small differences in "betterness", it's about the difference in kind in going from "no ML/AI" => "whoa, it works!".
Disclosure: I work on Google Cloud (but not in ML).
> Disclosure: I work on Google Cloud (but not in ML).
Since you work there, hopefully you can see this feedback and filter it up: I love your tools. They are the best. I have tried using your tools. They are hard to use, despite the fact that I have a pretty solid understanding of how to use them. I would like to use your tools more, but getting support is hard (partly because the docs aren't great, partly because there is no community, because see #1).
I don't know how to fix this, but it would be great if Google spent some time building a community around its tools, like AWS did. At the beginning, AWS had a lot of employees hanging out on their own forums and elsewhere, answering questions, building a community of users, and especially helping third parties who tried to build libraries for their tools (like boto for Python). It would be great if Google did that too.
Google has contributed quite a bit to Boto. The GCS command line tool, gsutil, is built on top of boto. If you're talking about Boto 3, though, not so much.
Is there any data that points to Google's AI service being superior?
I want to know which AI platform is the strongest. In my anecdotal tests, Microsoft's Vision API performed better, but that was in early-to-mid 2016.
Well, I'm not so sure that Google's AI services are superior in every way.
Actually, Google has fewer public APIs than Msft/AWS for AI use cases (e.g. face recognition). Maybe they just don't release stuff unless it's much better than the competition's.
If I put in a “typo” (read: ambiguous letter), I want the meaning that was originally intended. Technically correct doesn't buy me anything; we can play that game all day, but I just don't have much use for a document full of errors. In a way: I want the AI to do what a human would do.
This is like speech recognition using context to fill in the gaps. Without this it would be unusable.
They do! For example, Vision takes languageHints [1]. With the Speech API, I took an hour long tour of Rouffignac (in French) and translated it into English. I considered adding some SpeechContext [2] for things like mammoth and so on, but actually it did a fine enough job as is (besides, I had to listen to it later, as there was obviously no coverage in the cave).
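For concreteness, here's a minimal sketch of where `languageHints` sits in a Vision `images:annotate` request body (the bucket path is a placeholder; you'd POST this JSON with your credentials):

```python
import json

def build_annotate_request(image_uri, language_hints):
    """Build the JSON body for a Vision images:annotate call with language hints."""
    return {
        "requests": [{
            "image": {"source": {"imageUri": image_uri}},
            "features": [{"type": "TEXT_DETECTION"}],
            "imageContext": {"languageHints": language_hints},
        }]
    }

body = build_annotate_request("gs://my-bucket/tour-photo.jpg", ["fr"])
payload = json.dumps(body)  # POST to https://vision.googleapis.com/v1/images:annotate
```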
Disclosure: I work on Google Cloud (but not these APIs).
Was the image upside down? Sometimes stuff coming out of iOS is stored in a different orientation than it appears on the screen. (There is some metadata in jpeg that says to rotate it before displaying - browsers typically ignore this for backwards compatibility.)
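For anyone debugging this: the EXIF Orientation tag (274 / 0x0112) records how the stored pixels must be transformed before display. A stdlib-only sketch of the common rotation-only cases (values 2, 4, 5, 7 are mirrored variants, rare in practice and omitted here; with Pillow you'd just call `ImageOps.exif_transpose` instead):

```python
# EXIF Orientation tag 274 (0x0112): counter-clockwise rotation needed
# to display the stored pixels upright.
UPRIGHT_ROTATION_CCW = {
    1: 0,    # already upright
    3: 180,  # stored upside down
    6: 270,  # i.e. rotate 90 degrees clockwise to fix (common for phone photos)
    8: 90,   # rotate 90 degrees counter-clockwise to fix
}

def upright_rotation(orientation):
    """CCW degrees to rotate; unknown/mirrored values fall back to no rotation."""
    return UPRIGHT_ROTATION_CCW.get(orientation, 0)
```

If an OCR service reads the raw pixels and ignores this tag, it's effectively OCRing a sideways or upside-down image.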
OCR.space is a "good enough" option for many projects. It has a very generous free tier of 25,000 free conversions/month per IP address (Google: only 1,000/month per account). In my tests it didn't perform as well as Google, but it was good enough for many applications (and much better than Tesseract).
Current state-of-the-art approaches in this field are significantly better than existing commercial solutions at recognizing text in the wild (i.e. scene text).
As an example, see the ICDAR 2015 results [1], where the Google Vision API is at 59.60% (Hmean) while the best ones are over 80%. Note that this test is about localization, i.e. finding the text location without recognizing the actual content, though on a more challenging dataset.
As for recognition, see the table on page 6 of this paper [2]. The "IIIT5K None" column should be pretty close to what was done in the OP, using the same dataset, with recognition accuracies of around 80%, while the Google Vision API is at 322/500 = 64.4%. Note that since this paper is only about recognition, there is no localization step beforehand, which would otherwise act as a filter and decrease accuracy a bit by failing to localize some text that the recognition step could have recognized.
Click through to the competitions. My point was that here are current competitions giving you an idea of the state of the art. These are not student competitions or top coder things.
The analysis is weak.
"There were 10 images where all three APIs got it wrong".
Guess what. Zoom in on the "PRINCE" image, and you'll see it says top-right: A MIKE NEWELL FILM. So... both google and AWS did a nice job.
It's not reasonable to expect PRINCE as the outcome.
The point another person makes below about "payloads" and MSoft is valid too... As is the accented g (not recognized because the UTF codes weren't processed).
A human would record that as clearly being PRINCE. For their described use case, reading images of business and movie names, the presence of other, small text in the picture seems quite fair.
> The only drawback with AWS rekognition APIs is that it only takes an image stored as an AWS S3 object as input while the other API work with any image stored on the web.
It seems his examples include logos and such. Some of those services could be tuned for OCRing books and documents, which should be the bulk of OCR use cases in commercial applications. Since there's usually a trade-off between flexibility vs. accuracy, I wouldn't be surprised to see inverted results w/ a different dataset. Might be worth doing the test.
https://www.gutenberg.org might have this. Some books available are just OCRed. Some are proof read and corrected. Not sure if scanned image sources are available?
Image text recognition is a major problem we're trying to solve in our startup. I would love to be pointed to some SOTA research in this space. Hard to find anything by Googling about it.
As far as our experience goes, Cloud Vision API is a killer option compared to both AWS and MSFT. It's pricier than AWS though and is slower. MSFT is terrible in both price and speed.
If you're trying to handle text "in the wild" and not scanned documents, the keyword is "scene text". Most papers are focused on either detection/localization, i.e. finding the location of text, or recognition, i.e. recognizing the actual content given a cropped text image.
Here are some current state-of-the-art papers + code where available about detection:
Note that this paper is from 2010 and thus, while quite influential for its time, is far from the current state of the art. The stroke width transform method that it introduced is simply not as good as current deep learning-based methods.
If you want a (slightly out-of-date, but what can you do, the field is moving very fast) overview, see this survey from 2016:
The big companies, particularly Google, already use state-of-the-art techniques. Additionally, OCR has long been a topic of study in academia. I think it's unlikely that we'll see any major "disruption" in this industry. Improvements will likely continue to come from the usual suspects.
The really sad part is that, with all the talk of AI taking over humanity, a seemingly simple task like reading printed text is still so far behind. It seems ML/AI has many years to go before it can really make big leaps.
Automatic translation has also already reached the low-hanging statistical fruit, meaning it has been stuck at 70-80% performance/accuracy for the last couple of years (i.e. you can use it to get a general idea of what a text written in a foreign language might mean, but you can't use it for day-to-day language interactions without sounding unprofessional or even stupid).
This article reminded me to check if any updates had been made to Google's OCR since Cloud vision was in beta sometime last year or earlier this year [0]. It looks like a new parameter/option for "Document Text Detection" -- i.e. something more akin to Tesseract, rather than just detecting words in images (such as road signs):
edit: I would attempt my own test right now, but it's been a while since I've tried to use Google Cloud. Right now I'm getting constant "Server Error" popups until Chrome crashes, just from checking my account and billing page. The Cloud Console's wonkiness is probably one of the reasons why I stopped using GC in favor of AWS :/
Sorry to hear about whatever is going on between Chrome and the Console (can you try an incognito window?).
You can always test out the Vision API via the landing page (https://cloud.google.com/vision/). The full text results seem to be under the little document tab. I took a screenshot of the text above it, and it seemed to work as expected (breaking it into two paragraphs).
Disclosure: I work on Google Cloud (but not on ML APIs).
Thanks. I'm not sure what the problem could be, since GC has billed me every month for the past couple of years for other APIs and services; perhaps I never officially enabled the Cloud Vision API after the beta. The crash at the account screen is puzzling, but I'm assuming it must be on my end (a conflicting plugin, perhaps) -- I'll have to investigate it over the weekend. I'm hoping to use GC for a class, so I'm going to try the signup and onboarding process from scratch anyway.
I’m surprised by these results being so poor because many of these reading tasks are so easy for human readers. And it seems so much more straightforward than stuff like face recognition and self driving cars!
That's what I've been doing. I had ~1,800 images to read; the machine could only read about 900, so I tossed like $20 at it ($0.02/image), and the accuracy is pretty good.
MTurk is hit-and-miss when it comes to workers: some will just click buttons to see if they can get paid; others completely knock it out of the park.
I have another project that I have put about $100 in so far and had decent results (incredibly quickly too!)
Getting my work to "good enough", then tossing the rest on Mturk is much, much cheaper in the long run.
I ran a bunch of stained glass images through several APIs. Google's did by far the best, although there are a lot of issues (curved text, hand-written text in odd alignments or on a curve).
Some of the companies that do this work have been around for many decades and have tons of photographs / scanned images, so I'm investigating ways to ingest images into a search engine to help locate old projects.
I would be curious to know if anyone in this community is indeed already using one of these APIs for a product and what your real-life experience is. Care to share your use cases?
Pingtype is my program for learning Chinese. I tried to use Tesseract to recognise some Chinese text, taken directly from a PDF that I couldn't copy-paste from for some reason. The results were awful.
I tried again with English text. I wanted a word list from a book that helps people learn English, so I took photos of the index. The format is word....page #, in two columns.
The results were just as bad.
I've given up on OCR, and decided I have to transcribe everything by hand. I only do it in my free time, and it's been taking months.
Is there any tool that can take a photo of a book where the pages curl towards the middle, and "flatten" it so that OCR will work better?
I have used Tesseract, and in my experience, unless you train it for the particular type of text you want to recognize (font, background color, etc.), it will do quite poorly (including the recent LSTM-based versions). Would be great to see how it stacks up against these APIs though.
The supplied models are trained on document-like images, so I wouldn't expect it to do particularly well on things like street signs. My experience with the new LSTM-based versions is that they're very much competitive with closed-source solutions for document-like OCR.
The one time I needed to turn a scanned PDF (600+ page book) into searchable text, I used this Ruby script https://github.com/gkovacs/pdfocr/ , which pulls out individual pages using pdftk, turns them into images to feed into an OCR engine of your choice (Tesseract seems to be the gold standard) and then puts them back together. It can blow up the file size tremendously, but worked well enough for my use case. (I did write a very special purpose PDF compressor to shrink the file back, but that was more for fun.)
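The per-page loop that script implements can be sketched as plain command construction (pdftoppm and tesseract are the usual tools for the render and OCR steps; the file names and 300 DPI are illustrative assumptions, not taken from the script):

```python
def ocr_page_commands(pdf_path, page, dpi=300):
    """Commands to render one PDF page to an image, then OCR it to a searchable PDF.

    Run each with subprocess.run(cmd, check=True). With -singlefile,
    pdftoppm writes page-<n>.png; tesseract then writes page-<n>.pdf.
    """
    stem = f"page-{page:03d}"
    render = ["pdftoppm", "-png", "-singlefile", "-r", str(dpi),
              "-f", str(page), "-l", str(page), pdf_path, stem]
    ocr = ["tesseract", f"{stem}.png", stem, "pdf"]
    return render, ocr

render, ocr = ocr_page_commands("book.pdf", 1)
```

The per-page PDFs can then be concatenated back together (e.g. with pdftk), which is where the file-size blowup mentioned above comes from: each page becomes an image plus a text layer.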