We think it's easier to integrate, as there is nothing to install and nothing to maintain. We also do some pre- and post-processing, which allows orientation detection, for example. We also think it might be useful on mobile, where resources are limited. Also, on a PDF, we have a hybrid way of doing both, so that we don't use slow OCR methods on text that is easy to extract, but it's still quite a young product. New advanced features are coming.
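A minimal sketch of what such a hybrid approach could look like. This is not Stamplin's actual implementation: it assumes a page counts as "easy" when its embedded text layer already has a minimum number of visible characters, and the names `choose_method` and `MIN_TEXT_CHARS` are hypothetical.

```ruby
# Hypothetical threshold: pages whose embedded text layer has fewer
# non-whitespace characters than this are treated as scanned images.
MIN_TEXT_CHARS = 20

# Decide, per page, whether the embedded text is usable or OCR is needed.
# `page_texts` is an array of strings, one per page, as produced by a
# plain text extractor (for example pdftotext output split on form feeds).
def choose_method(page_texts, min_chars = MIN_TEXT_CHARS)
  page_texts.map do |text|
    visible = text.gsub(/\s+/, "").length
    visible >= min_chars ? :extract : :ocr
  end
end

pages = ["Invoice 2013-04\nTotal due: 120.00 EUR\n", "  \n", ""]
p choose_method(pages)  # => [:extract, :ocr, :ocr]
```

Only the pages marked `:ocr` would then go through the slow path; the rest can use the text that is already in the file.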
The benefit is stamplin.com gets insight on what people are viewing and reading. They get to see what the user sees. They can compile a database and use or sell that information to be used for marketing purposes.
Also, it's an "API" (looks more like a URL pointing to a CGI program to me, but whatever). APIs are "cool" and "fun", while running local programs that you have control over is old and boring and not the future of computing.
Our API is quite new, and we understand it doesn't offer outstanding value for everybody, as it targets ease of use for the moment, but the next release is going to add more advanced things.
About us using your data, our privacy policy will clarify that.
I come bearing gifts, if anyone would like to host some of this themselves.
This follows the API documented by Stamplin (minus the throttling errors)--it does not currently do the OCR, but as mentioned elsewhere by zdw you can probably get tesseract to get you like 80% of the way there. If you wanted to use that, you'd likely just replace the hacky `pdftotext` callout with your preferred toolchain.
You'll need Ruby, Sinatra, and the Xpdf tools, I believe.
Dual-licensed under the AGPL, BSD, and WTFPL licenses. idklol.
The code:
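  require 'sinatra'
  require 'json'
  require 'shellwords'

  use Rack::Logger

  # POST /extracttext: expects a multipart upload in the "file" field and
  # responds with JSON of the form {"text": [line, line, ...]}.
  post '/extracttext' do
    begin
      # No file uploaded: nothing to extract.
      halt 204 if params["file"].nil?

      type = params["type"] || "text"  # accepted for API compatibility, unused
      lang = params["lang"] || "en"    # accepted for API compatibility, unused (no OCR)

      tmpfilename = params["file"][:tempfile].path
      # "-" makes pdftotext write to stdout, so there is no intermediate
      # .txt file whose name depends on the upload's extension.
      text = `pdftotext #{Shellwords.escape(tmpfilename)} -`
      raise "pdftotext failed" unless $?.success?

      content_type "application/json"
      { "text" => text.split("\n") }.to_json
    rescue StandardError
      halt 500
    ensure
      # Always remove the uploaded temp file, even on failure.
      File.delete(tmpfilename) if tmpfilename && File.exist?(tmpfilename)
    end
  end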
EDIT:
For God's sake run this in a jail and only on an internal network!
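If you want to try the tesseract route in place of the `pdftotext` callout, here is a rough, untested-against-real-scans sketch. It assumes poppler's `pdftoppm` (for the `-png` flag) and a tesseract build that accepts `stdout` as the output base; the `TESSERACT_LANGS` mapping and the function names are made up for illustration.

```ruby
require 'open3'
require 'tmpdir'

# Hypothetical mapping from the API's two-letter "lang" values to
# tesseract's three-letter language codes.
TESSERACT_LANGS = { "en" => "eng", "fr" => "fra", "de" => "deu" }.freeze

# Command that rasterizes each PDF page to <prefix>-N.png.
def rasterize_command(pdf_path, prefix, dpi = 300)
  ["pdftoppm", "-r", dpi.to_s, "-png", pdf_path, prefix]
end

# Command that OCRs one image and prints the recognized text on stdout.
# Unknown languages fall back to English.
def ocr_command(image_path, lang = "en")
  ["tesseract", image_path, "stdout", "-l", TESSERACT_LANGS.fetch(lang, "eng")]
end

# Runs both steps and returns the recognized text as an array of lines.
# Commands are passed as argument arrays, so no shell is involved.
def ocr_pdf(pdf_path, lang = "en")
  Dir.mktmpdir do |dir|
    prefix = File.join(dir, "page")
    _out, status = Open3.capture2(*rasterize_command(pdf_path, prefix))
    raise "pdftoppm failed" unless status.success?

    # Process the page images in file-name order.
    Dir.glob("#{prefix}-*.png").sort.flat_map do |png|
      text, status = Open3.capture2(*ocr_command(png, lang))
      raise "tesseract failed" unless status.success?
      text.split("\n")
    end
  end
end
```

In the route you would then call `ocr_pdf(tmpfilename, lang)` and put the result under the `"text"` key, same as before.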
I like the concept and it is a good start. Pulling text from PDFs is especially painful. I think the output format needs improvement. It is just a large array of strings. It seems like the strings are sometimes a single line, and sometimes not. My particular use case is extracting raw data from a PDF. I would like to see more structure to the output. For example, knowing where new lines, tabs, etc are located would be very helpful for parsing raw data.
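For anyone self-hosting with `pdftotext`, a small sketch of what more structured output could look like. It assumes the raw text comes from something like `pdftotext -layout file.pdf -`, where `-layout` keeps column alignment as runs of spaces and each page ends with a form feed; `structure_text` is a hypothetical helper, not part of the Stamplin API.

```ruby
require 'json'

# Turn raw pdftotext output into pages and lines. Each page ends with a
# form feed ("\f"); leading and inner whitespace is left untouched so
# column positions survive for later parsing.
def structure_text(raw)
  raw.split("\f").each_with_index.map do |page, i|
    { "page" => i + 1, "lines" => page.lines.map(&:chomp) }
  end
end

raw = "Name    Qty\nWidget    3\n\fTotal    3\n\f"
puts structure_text(raw).to_json
```

With the sample input this yields two pages, the first with the lines `Name    Qty` and `Widget    3`, the second with `Total    3`, so a parser can rely on page and line boundaries instead of guessing them from a flat array.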
That's something we can provide pretty easily, and we will try to include it in our next release. If you want us to help you with your specific problem, please send us an email at info@stamplin.com.
The limit has been set to prevent our server from crashing, as we do not have, for the moment, the financial capability to support a massive server farm. Again, if this limit prevents you from using our API, we might raise it if you ask by email.
The OCR is really useless. I tested it with some reddit "advice animal" memes (because there is a need for transcriptions). You would think that text is pretty simple and easy, but the output I got was like:
/\n\nnmrs wn\ufb02qyi mm mm\nTlIIEI\ufb02|\ufb02llllM\u2018l co
Sorry that didn't work properly for you. We are working on improving the quality of our OCR results. Could you please send the file you used to get this useless result to info@stamplin.com?
The upgrade button doesn't work, and nobody is going to hover long enough to see the "Not Available Yet" title. And the current 10 requests isn't even enough to test with.
I'm excited to try this.. so figure out a way to take my money soon.
Some languages might not have an appropriate library, and some people might not want heavy processes running on their device (mobiles). We also think it's easier to use, as there is nothing to install. That mainly depends on your case.
Thanks for your comment!
We would really appreciate it if you could explain in more detail the problems you faced. I am sending you an email to discuss that, if that's OK with you.