Show HN: An API to extract texts from images and PDF files

zdw · on July 18, 2013

What's the benefit to using this over `pdftotext` and/or `pdfimages | convert | tesseract`?
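For readers who have not used those tools, here is a rough Ruby sketch of the local pipeline the question refers to. It assumes `pdftotext`, `pdfimages`, ImageMagick's `convert`, and `tesseract` are all on the PATH. The helper names (`text_layer`, `ocr_commands`, `ocr_pdf`) are made up for this illustration and are not part of any of those tools.

```ruby
require 'open3'
require 'tmpdir'

# Text layer of a PDF via `pdftotext file.pdf -` (the "-" sends output
# to stdout). Returns "" if the command fails.
def text_layer(pdf_path)
  out, status = Open3.capture2("pdftotext", pdf_path, "-")
  status.success? ? out : ""
end

# Command lines (as argv arrays) for OCR-ing one image that pdfimages
# wrote out: convert it to TIFF, then have tesseract write <base>.txt.
def ocr_commands(image_path, lang = "eng")
  base = image_path.sub(/\.[^.\/]+\z/, "")
  [
    ["convert", image_path, "#{base}.tif"],
    ["tesseract", "#{base}.tif", base, "-l", lang]
  ]
end

# pdfimages | convert | tesseract, run over every embedded image.
# Returns "" as soon as any step fails.
def ocr_pdf(pdf_path, lang = "eng")
  Dir.mktmpdir do |dir|
    system("pdfimages", pdf_path, File.join(dir, "img")) or return ""
    Dir[File.join(dir, "img-*")].sort.map do |image|
      ocr_commands(image, lang).each { |cmd| system(*cmd) or return "" }
      File.read(image.sub(/\.[^.\/]+\z/, "") + ".txt")
    end.join("\n")
  end
end
```

Passing each command as an argument list keeps the shell out of the picture, which matters if file names ever come from users.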

trez · on July 18, 2013

We think it's easier to integrate, as there is nothing to install and nothing to maintain. We also do some pre- and post-processing, which allows orientation detection, for example. We think it might be useful on mobile devices too, since they are limited in resources. Also, for PDFs we have a hybrid approach that does both, so we don't run slow OCR on text that is easy to extract. It's still quite a young product, though, and new advanced features are coming.
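The hybrid idea described above (try cheap text extraction first, fall back to OCR only when it yields little) can be sketched roughly as follows. This is not Stamplin's implementation, which is not public. The function name, the keyword arguments, and the 20-character threshold are all assumptions made for the example.

```ruby
# Hypothetical threshold: fewer non-whitespace characters than this is
# treated as "no usable text layer".
MIN_TEXT_CHARS = 20

# text_extractor and ocr_extractor are any callables that take a path
# and return a string, for example wrappers around pdftotext and
# tesseract. OCR is only invoked when the cheap extractor comes up short.
def extract_hybrid(path, text_extractor:, ocr_extractor:, min_chars: MIN_TEXT_CHARS)
  text = text_extractor.call(path).to_s
  if text.gsub(/\s+/, "").length >= min_chars
    { method: :text, text: text }
  else
    { method: :ocr, text: ocr_extractor.call(path).to_s }
  end
end
```

Keeping the two extractors as injected callables means the decision logic can be exercised without any external tools installed.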

jingo · on July 18, 2013

The benefit is stamplin.com gets insight on what people are viewing and reading. They get to see what the user sees. They can compile a database and use or sell that information to be used for marketing purposes.

Also, it's an "API" (looks more like a URL pointing to a CGI program to me, but whatever). APIs are "cool" and "fun", while running local programs that you have control over is old and boring and not the future of computing.

trez · on July 21, 2013

Our API is quite new, and we understand it doesn't offer outstanding value for everybody, as it targets ease of use for the moment, but the next release is going to add more advanced features.

About us using your data, our privacy policy will clarify that.

angersock · on July 18, 2013

I have sought to answer exactly that question above.

:)

I love the happy little Unix-style legos of productivity.

taf2 · on July 18, 2013

http://www.stamplin.com/api/ returns 403 when clicking on the API docs after confirming an account via email link

trez · on July 18, 2013

sorry, the correct url is http://www.stamplin.com/api/docs/

angersock · on July 18, 2013

I come bearing gifts, if anyone would like to host some of this themselves.

This follows the API documented by Stamplin (minus the throttling errors)--it does not currently do the OCR, but as mentioned elsewhere by zdw you can probably get tesseract to get you like 80% of the way there. If you wanted to use that, you'd likely just replace the hacky `pdftotext` callout with your preferred toolchain.

You'll need Ruby, Sinatra, and the Xpdf tools, I believe.

Dual-licensed under the AGPL, BSD, and WTFPL licenses. idklol.

The code:

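  require 'sinatra'
  require 'json'

  use Rack::Logger

  post '/extracttext' do
    # No upload: nothing to convert.
    halt 204 if params["file"].nil?

    # Accepted for API compatibility; unused until OCR is wired in.
    type = params["type"] || "text"
    lang = params["lang"] || "en"

    begin
      tmpfilename = params["file"][:tempfile].path
      outfilename = "#{tmpfilename}.txt"

      # The argument-list form of system bypasses the shell. The output
      # file is named explicitly instead of relying on pdftotext's default.
      ok = system("pdftotext", tmpfilename, outfilename)
      raise "pdftotext failed" unless ok

      lines = File.read(outfilename).split("\n")

      content_type "application/json"
      {"text" => lines}.to_json
    rescue
      halt 500
    ensure
      # Clean up both files whether or not the conversion succeeded.
      File.delete(tmpfilename) if tmpfilename && File.exist?(tmpfilename)
      File.delete(outfilename) if outfilename && File.exist?(outfilename)
    end
  end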

EDIT:

For God's sake run this in a jail and only on an internal network!

rpedela · on July 18, 2013

I like the concept and it is a good start. Pulling text from PDFs is especially painful. I think the output format needs improvement. It is just a large array of strings. It seems like the strings are sometimes a single line, and sometimes not. My particular use case is extracting raw data from a PDF. I would like to see more structure to the output. For example, knowing where new lines, tabs, etc are located would be very helpful for parsing raw data.
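As a rough illustration of the kind of structure being asked for: if the text is extracted locally with `pdftotext -layout` (a flag that keeps the physical page layout), columns usually end up separated by runs of spaces, and a few lines of Ruby can split them back into cells. The helper names and the sample data below are made up for the example, and the two-or-more-spaces rule is a heuristic that will misfire on some documents.

```ruby
# Splits one line of layout-preserved text into cells wherever a tab or
# a run of two or more spaces/tabs occurs. Single spaces stay inside a cell.
def split_columns(line)
  line.strip.split(/[ \t]{2,}|\t/)
end

# Turns a block of layout-preserved text into an array of rows,
# skipping blank lines.
def parse_table(text)
  text.each_line
      .map(&:chomp)
      .reject { |l| l.strip.empty? }
      .map { |l| split_columns(l) }
end

# Made-up sample resembling `pdftotext -layout` output.
sample = "Region        2011    2012\nNorth East    1,204   1,310\n"
rows = parse_table(sample)
```

With the sample above, `rows` holds one array per line, with "North East" kept together as a single cell because it contains only a single space.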

Here is the PDF I used to test: https://www.gov.uk/government/uploads/system/uploads/attachm...

Is there a technical reason for the 1-2MB limit or is it arbitrary?

trez · on July 18, 2013

Thanks for your comment!

That's something we can provide pretty easily, and we will try to include it in our next release. If you want us to help you with your specific problem, please send us an email at info@stamplin.com.

The limit has been set to prevent our server from crashing, as we do not have, for the moment, the financial capability to support a massive server farm. Again, if this limit prevents you from using our API, we might raise it if you ask us by email.

mappum · on July 18, 2013

The OCR is really useless. I tested it with some reddit "advice animal" memes (because there is a need for transcriptions). You would think that text is pretty simple and easy, but the output I got was like:

    /\n\nnmrs wn\ufb02qyi mm mm\nTlIIEI\ufb02|\ufb02llllM\u2018l co

trez · on July 19, 2013

Sorry that didn't work properly for you. We are working on improving the quality of our OCR results. Could you please send the file that produced this result to info@stamplin.com?

gkoberger · on July 18, 2013

The upgrade button doesn't work, and nobody is going to hover long enough to see the "Not Available Yet" title. And the current 10 requests isn't even enough to test with.

I'm excited to try this.. so figure out a way to take my money soon.

gnosis · on July 18, 2013

This looks nice except for having to depend on your servers as a middle man.

Any chance you could release the code as Free or open source so that its users can use it standalone on their own machines?

trez · on July 18, 2013

That's not planned at the moment, but if we can't find a way to monetize it, we will do it for sure.

RivieraKid · on July 18, 2013

Why would someone want to use an API instead of a library?

trez · on July 18, 2013

Some languages might not have an appropriate library, and some developers might not want heavy processes running on their devices (mobiles). We also think it's easier to use, as there is nothing to install. It mainly depends on your case.

RivieraKid · on July 18, 2013

I agree that there might be situations where it can be useful, but:

1) Mobiles have pretty good CPUs. I think uploading and waiting for response would be slower and less reliable.

2) If the mobile user doesn't have an internet connection, the app won't work.

3) As a developer, I would be dependent on an external service that could stop working someday.

rpedela · on July 18, 2013

Can I assume API keys are on the roadmap? I don't particularly like using my username and password.

trez · on July 18, 2013

Yes, we'd like to increase security with each release. It should be available in one of the next releases.

rpedela · on July 18, 2013

Great!

antrover · on July 18, 2013

Nice. Are you using the Tesseract OCR lib at the core of the extraction?

trez · on July 18, 2013

Yes, we are.

it_learnses · on July 18, 2013

Any custom requests? Let us *know.

trez · on July 18, 2013

thx, I am gonna fix that

trez · on July 19, 2013

fixed

smougel · on July 18, 2013

Any Feedback Welcomed

sebg · on July 18, 2013

Looks good - does it do data tables? That's a big issue and something I've heard about (run into) many times...

trez · on July 18, 2013

Thanks for your comment! We would really appreciate it if you could explain the problems you faced in more detail. I am sending you an email to discuss that, if that's OK with you.

sebg · on July 18, 2013

responded to your email. good luck.