I have been working on a side project that needs to read dynamic table layouts and extract financial information. I was excited to hear about Tabula a few weeks ago, but I had zero success in getting even one PDF extracted.
I ended up using the pdfquery package in Python, which heavily utilizes PDFMiner under the covers.
Besides ABBYY's software (which is proprietary and licensed), does anyone have other recommendations?
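For anyone curious, the pdfquery approach looks roughly like this (a minimal sketch; the file name, the "Total Revenue" label, and the bounding-box offset are made-up placeholders for whatever fields you're after):

```python
import pdfquery

# Load the PDF; pdfquery builds an lxml/pyquery tree of PDFMiner layout objects.
pdf = pdfquery.PDFQuery("report.pdf")  # placeholder file name
pdf.load()

# Locate a known label, then pull whatever text sits in a box to its right.
label = pdf.pq('LTTextLineHorizontal:contains("Total Revenue")')  # placeholder label
if label:
    x0, y0, x1, y1 = (float(label.attr(a)) for a in ("x0", "y0", "x1", "y1"))
    # The 200pt width to the right of the label is a rough guess, not a rule.
    value = pdf.pq(
        f'LTTextLineHorizontal:in_bbox("{x1}, {y0}, {x1 + 200}, {y1}")'
    ).text()
    print(value)
```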
I can't help but say I refuse to work with PDF files. I will email and sit through a ton of meetings and one-on-ones to explain that PDF is a container and that the format inside the container is the real battle. Just give me the plain format, and if it costs the company money, it's worth it.
Much of the use of these tools is to extract data from government or corporate sources that, while required to publish the information, may not want to make it easy to access. Thus they prefer PDFs.
Those of us trying to extract the data bound up in these PDFs do advocate for access to the original data, but we have to deal with what we have today.
And this is not good for anyone and is the opposite of the spirit behind the Sunshine Laws.
My school district (what a mess) publishes images (horribly bad images) of all the school notes, including all the financial information and spreadsheets. One night I had to spend four hours manually typing in the year's budget just to check our spending per student. It was $5,400, the lowest in our state.
Congrats on 1.0! We've been using Tabula in the office to get data, usually from government sources, out of PDFs. It's been very handy--though I don't especially love having Java on interns' PCs to use it. But it's worth the tradeoff to not waste their--and our--time manually extracting that data.
Congrats on the 1.0 release guys! We've been using Tabula since the days before the app packaging. It's been really cool to observe development progress, and especially to see you guys tackle the problem of distributing as an application.
If I'd had this when I was working on extracting the ISIR data fields from the Department of Education's documentation, it would've saved me time. Bleh, it's a shame it didn't exist then. :(
This is positively phenomenal, and the UI is great for non-technical users. Super, super tool. Thanks so much for developing it and opening it up to the public!
Unfortunately it looks like the developers of JPedal decided to discontinue the LGPL version and focus on the proprietary version, so it's unmaintained unless someone else picks up development.
We use JPedal for rendering pages as images. For parsing, we use Apache PDFBox. In the near future, we plan to render the PDFs client-side with Mozilla's PDF.js.
PDFBox 1.8's less-than-great rendering engine forced us to include a separate library for that purpose only.
Moving to PDFBox 2.0 is also on our roadmap. But the text extraction API in 2.0 has changed a lot too, so porting our engine would require quite a bit of effort.
Friendly reminder: we're an MIT-licensed open source project, and we're always open to contributions!
I used this pretty heavily in May for a recent data-science project and it really saved my butt. Easy to use, sped things up a lot. It choked on only one PDF. Looking forward to seeing the progress made.
I have a bunch of scanned PDFs from an open data request that I'm looking forward to trying when I'm home. My own solution with pytesser was pretty effective but required a ton of tweaking.
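A rough sketch of that kind of OCR pipeline, using pdf2image and pytesseract as modern stand-ins for pytesser (the file name and DPI are just placeholders):

```python
from pdf2image import convert_from_path  # requires the poppler utilities
import pytesseract                       # requires the tesseract binary

# Rasterize each page of the scanned PDF, then OCR it page by page.
pages = convert_from_path("scanned.pdf", dpi=300)  # placeholder path and DPI
for i, page in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page)
    print(f"--- page {i} ---")
    print(text)
```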
I don't think it's going to be able to help you if they're scans. From the README:
> Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
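If you want to check that programmatically before feeding files to Tabula, here's a quick heuristic using pdfminer.six (my own sketch, not part of Tabula; note that a scanned PDF with an embedded OCR text layer would also pass):

```python
from pdfminer.high_level import extract_text

def looks_text_based(path, max_pages=3):
    """Heuristic: does the PDF expose any extractable text on its first pages?"""
    text = extract_text(path, maxpages=max_pages)
    return bool(text.strip())

print(looks_text_based("some_report.pdf"))  # placeholder file name
```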