Tabula: Convert PDF Table to CSV (tabula.technology)
86 points by wolpoli on Feb 26, 2022 | 18 comments



Excalibur (1) is also an alternative. It’s great! The installation process was lackluster, though, with multiple dependency issues on an M1 Mac, Ubuntu, and WSL. YMMV.

1) https://excalibur-py.readthedocs.io


Why do these packages insist on involving databases, web servers, etc.?

Just give me a CLI package that takes a PDF and gives me a text file as output.


You might be interested in the library underneath, called Camelot:

https://camelot-py.readthedocs.io/en/master/

It's usable from Python or via a CLI.


>Why do these packages insist on involving databases, web servers, etc.?

Wholeheartedly agree. Give this a try:

    pdftotext -layout somePDF.pdf -


Thanks! That's what I've been using, but some tables give pdftotext problems :-(


I've been using Excalibur/Camelot in production. It has been great (considering how non-standard PDF tables are).

You just cannot approach it in a fire-and-forget way. It has two modes of operation and various PDF "styles" can respond differently to each mode.

If you have a series of similarly-structured PDFs, try importing a few manually (e.g. using IPython) and take note of which mode worked better and of any adjustments you needed (e.g. detection thresholds). Then you can pretty much automate the rest with those collected parameters.
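
A rough sketch of that "try both modes, keep the better one" step. Camelot calls its two modes `lattice` and `stream`, and each extracted table carries a `parsing_report` dict with an `accuracy` field. The helper names (`mean_accuracy`, `camelot_reports`, `pick_flavor`) are made up for illustration, and using mean accuracy as the deciding score is an assumption, not something Camelot prescribes. Eyeballing the output is still a good idea.

```python
from typing import Callable, Dict, List, Tuple

FLAVORS = ("lattice", "stream")  # Camelot's two parsing modes


def mean_accuracy(reports: List[dict]) -> float:
    """Average the 'accuracy' field of the reports; 0.0 when no table was found."""
    if not reports:
        return 0.0
    return sum(r.get("accuracy", 0.0) for r in reports) / len(reports)


def camelot_reports(path: str, flavor: str, pages: str = "1") -> List[dict]:
    """Run Camelot in one mode and return each table's parsing report."""
    import camelot  # third-party; imported lazily so the helpers above work without it

    tables = camelot.read_pdf(path, flavor=flavor, pages=pages)
    return [table.parsing_report for table in tables]


def pick_flavor(
    path: str,
    extract: Callable[[str, str], List[dict]] = camelot_reports,
) -> Tuple[str, Dict[str, float]]:
    """Score every mode on one sample PDF and return (best_mode, all_scores).

    Assumption: higher mean accuracy means the better mode. Ties go to 'lattice'.
    """
    scores = {flavor: mean_accuracy(extract(path, flavor)) for flavor in FLAVORS}
    best = max(FLAVORS, key=lambda flavor: scores[flavor])
    return best, scores
```

Run `pick_flavor("sample.pdf")` once on a representative file, store the winning mode next to any thresholds you tuned, and reuse those settings for the rest of the batch.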


The tabula algorithm either clusters text or looks for lines bordering cells. The text-clustering is hit or miss. In line-detection mode, on tables with the standard alternating-row shading (and no ruling lines), tabula only picks up the text in the shaded rows.

It's great if it works.
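
To give a feel for why text-clustering is hit or miss, here is a toy version of the idea: group the x positions of words into columns wherever the gap between neighbours stays small. This is not Tabula's actual code. `cluster_columns` and the `max_gap` threshold are invented for illustration, and real extractors use far more signals than this.

```python
from typing import Iterable, List


def cluster_columns(x_positions: Iterable[float], max_gap: float = 10.0) -> List[List[float]]:
    """Group x positions into columns: start a new cluster when the gap exceeds max_gap."""
    clusters: List[List[float]] = []
    for x in sorted(x_positions):
        if clusters and x - clusters[-1][-1] <= max_gap:
            clusters[-1].append(x)  # close enough to the previous position: same column
        else:
            clusters.append([x])  # large gap: treat it as the start of a new column
    return clusters
```

The whole result hinges on `max_gap`: too small and one ragged column splits in two, too large and neighbouring columns merge. That is roughly the failure you see on real PDFs.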


I first try Tabula. If and when it fails, AWS Textract always works.


Is that related to this [0] Python library? If not, is it something one can run locally or is it purely an AWS product?

0: https://textract.readthedocs.io/en/stable/


AFAIK, these two are not related, in spite of the similar names and functionality.

AWS Textract is a cloud-only service that you can use either on the console or through the APIs. You cannot run this locally.


I've used Tabula for many years, and it works great. I use it to convert PDF invoices from a specific supplier for import into our system. It changed a task that took about a day each month into one that takes about 10 minutes.


Does it work with scanned tables? That’s usually more common in my experience.


Nope, it doesn't. I've honestly found tabula is pretty limited in its use. When it works, it works well, but when it doesn't, you're still stuck writing a lot of hodgepodge code. Not sure why there's not more criticism of it.


Mmmh.

Bearer of bad news (sorry), but my experience with tabula has been so-so.

First, installing it is a major PITN.

Second, the output is unpredictable.

Ultimately, I've found that using:

    pdftotext -layout somePDF.pdf - | python3 myparser.py

where myparser.py is a 20-line Python script with a couple of regexes and a simple state machine, works absolute wonders and can extract relevant data from PDFs even when the data isn't really organized in a table.
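
For the curious, a minimal sketch of what such a script can look like. Everything specific here is an assumption for illustration: the `Description / Qty / Amount` header, the `Total` line that ends the table, and the columns being separated by two or more spaces. Real PDFs will need their own regexes.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical layout: a header line opens the table, a "Total" line closes it.
HEADER = re.compile(r"^\s*Description\s{2,}Qty\s{2,}Amount\s*$")
TOTAL = re.compile(r"^\s*Total\b")
# Columns in `pdftotext -layout` output are assumed to be separated by 2+ spaces.
ROW = re.compile(r"^\s*(?P<desc>\S.*?)\s{2,}(?P<qty>\d+)\s{2,}(?P<amount>\d+\.\d{2})\s*$")


def parse(lines: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Two-state machine: skip lines until the header, then collect rows until 'Total'."""
    rows: List[Tuple[str, int, float]] = []
    in_table = False
    for line in lines:
        if not in_table:
            if HEADER.match(line):
                in_table = True  # header seen: start collecting rows
            continue
        if TOTAL.match(line):
            in_table = False  # end of table: go back to skipping
            continue
        match = ROW.match(line)
        if match:  # blank or malformed lines inside the table are ignored
            rows.append((match["desc"], int(match["qty"]), float(match["amount"])))
    return rows
```

To use it in the pipeline above, feed it standard input, e.g. `for row in parse(sys.stdin): print(row)` after an `import sys`.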

Also, pdftotext is open source, written in C++, and doesn't require installing a bottomless pit of dependencies like tabula does.

And of course, neither of these things will solve extracting data from PDFs that embed rasterized images or data that is the result of a complex SVG-type rendering.

I believe the real solution to that problem will be: render the PDF to a high-resolution image and pipe it to some ML-powered process that reverse-engineers the relevant data out of arbitrary images.

Startup idea?


Can't even execute this on my local machine. Why bundle this with a web server and then distribute the code to run locally? Also, is this written in Ruby or Java? Make up your mind.

1 star, won't even try it.


As far as I understand, all of the library/business logic is actually implemented in Java and available in its own repo (linked in the readme).

They then wanted a client application but didn't want to build a GUI in Java (I assume?), so they grabbed Ruby and created a webapp?

That said, it seems like you can run a CLI version using only Java and nothing else.


I was able to run this in Windows Sandbox by simply installing the JRE, so no Ruby is required.


I use Tabula all the time to extract transaction data from sources that only provide this in PDF. Thanks Tabula!



