Hacker News
Tabula: Extract Tables from PDFs (tabula.technology)
67 points by yrochat on Aug 7, 2015 | hide | past | favorite | 26 comments



I have been working on a side project that needs to read dynamic table layouts and extract financial information. I was excited to hear about Tabula a few weeks ago, but I had zero success in getting even one PDF extracted.

I ended up using the pdfquery package in Python, which heavily utilizes PDFMiner under the covers.

Besides ABBYY software (which is proprietary and licensed), does anyone have other recommendations?


Shameless plug: https://pdftables.com


I have tried this and it was very useful as well.


I can't help but say I refuse to work with PDF files. I will email and do a ton of meetings and one-on-ones to explain that PDF is a container and that the format inside the container is the battle. Just give me the plain format, and if it costs the company money, it's worth it.


Much of the use of these tools is to extract data from government or corporate sources that, while required to publish the information, may not want to make it easy to access. Thus they prefer PDFs.

Those of us trying to extract the data bound up in these PDFs do advocate for access to the original data, but we have to deal with what we have today.


And this is not good for anyone and is the opposite of the spirit behind the Sunshine Laws.

My school district (what a mess) publishes images (horribly bad images) of all the school notes, including all financial information and spreadsheets. One night I had to spend four hours manually typing in the year's budget just to check on our spending per student. It was $5,400, the lowest in our state.


It's nice that you can opt out of working with PDFs but that's not an option for a large portion of the world.


I usually use "pdftotext -layout" and write python or perl code to handle the table extraction.
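For anyone curious, a minimal sketch of that approach, assuming a hypothetical two-column table and relying on the fact that `pdftotext -layout` separates columns with runs of two or more spaces (real output varies by document):

```python
import re

# Hypothetical sample of what "pdftotext -layout" might emit for a
# simple two-column table; real output varies by document.
layout_text = """\
Item            Amount
Salaries        4,200
Supplies          950
"""

def split_columns(line):
    # Columns in -layout output show up as runs of 2+ spaces.
    return re.split(r"\s{2,}", line.strip())

rows = [split_columns(l) for l in layout_text.splitlines() if l.strip()]
# rows is now a list of [column, column] lists, one per table row.
```

This falls apart on cells that contain double spaces or on ragged layouts, which is usually when you fall back to the XML route.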

If I need more detailed formatting information, I use "pdftohtml -xml -fullfontname" and process the resulting XML.
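A minimal sketch of processing that XML with Python's standard library; the fragment below is a hypothetical example in the shape `pdftohtml -xml` produces (real output carries font specs and more attributes), grouping cells into rows by their `top` coordinate:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the shape "pdftohtml -xml" produces:
# <text> elements carry top/left coordinates, which lets you group
# words into rows and columns.
xml_fragment = """\
<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="72" width="60" height="12" font="0">Item</text>
    <text top="100" left="300" width="60" height="12" font="0">Amount</text>
    <text top="120" left="72" width="70" height="12" font="0">Salaries</text>
    <text top="120" left="300" width="50" height="12" font="0">4,200</text>
  </page>
</pdf2xml>
"""

root = ET.fromstring(xml_fragment)
rows = {}
for t in root.iter("text"):
    # Cells sharing the same vertical position belong to one row.
    rows.setdefault(int(t.get("top")), []).append((int(t.get("left")), t.text))

# Sort rows top-to-bottom and cells left-to-right.
table = [[cell for _, cell in sorted(cells)] for _, cells in sorted(rows.items())]
```

In practice you'd bucket `top` values within a small tolerance, since baselines in real PDFs rarely align exactly.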


What about ABBYY? Does that work well for you?

I'd even pay money to get something that works well.


Congrats on 1.0! We've been using Tabula in the office to get data, usually from government sources, out of PDFs. It's been very handy--though I don't especially love having Java on interns' PCs to use it. But it's worth the tradeoff to not waste their--and our--time manually extracting that data.


Congrats on the 1.0 release guys! We've been using Tabula since the days before the app packaging. It's been really cool to observe development progress, and especially to see you guys tackle the problem of distributing as an application.


If I had this when I was working on extracting the ISIR data fields in the Department of Education's documentation, it would've saved me time. Bleh, it's a shame it didn't exist then. :(


This is positively phenomenal, and the UI is great for non-technical users. Super, super tool. Thanks so much for developing it and opening it up to the public!


It bombed on the very first PDF I fed to it. (Admittedly, a technical datasheet of ~50 pages.)


Hi. Can you share the PDF with us on our issue tracker? (https://github.com/tabulapdf/tabula/issues) We'd be happy to take a look at it.


I have found that you need to start a page at a time. Throwing in a multi-page document to start with typically leads to failure.


How does it read data from the PDF? Is there a PDF parser somewhere down inside the code?


It embeds the free-software version of JPedal: https://github.com/tabulapdf/tabula/tree/master/lib/jars

Unfortunately it looks like the developers of JPedal decided to discontinue the LGPL version and focus on the proprietary version, so it's unmaintained unless someone else picks up development.


Hi. Tabula author here.

We use JPedal for rendering pages as images. For parsing, we use Apache PDFBox. In the near future, we plan to render the PDFs client-side with Mozilla's PDF.js.


It's worth mentioning that PDFBox 2.0 does a great job of rendering PDFs too.


PDFBox 1.8's less-than-great rendering engine forced us to include a separate library for that purpose alone.

Moving to PDFBox 2.0 is also on our roadmap. But the text extraction API in 2.0 has changed a lot too, so porting our engine would require quite a bit of effort.

Friendly reminder: we're an MIT-licensed open source project, and we're always open to contributions!


I used this pretty heavily in May for a recent data-science project and it really saved my butt. Easy to use, sped things up a lot. Choked on only one PDF. Looking forward to seeing the progress made.


I have a bunch of scanned PDFs from an open data request I'm looking forward to trying when I'm home. My own solution with pytesser was pretty effective but required a ton of tweaking.


I don't think it's going to be able to help you if they're scans. From the README:

> Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.


Ah thanks, missed that. They gave me half text-based and half scans, gotta love the government.


Great project. Extracting tables from PDFs is a task I need to do so damn frequently.



