Excalibur (1) is also an alternative. It’s great! The installation process was lackluster, though, with multiple dependency issues on an M1 Mac, Ubuntu, and WSL. YMMV.
I've been using Excalibur/Camelot in production. It has been great (considering how non-standard PDF tables are).
You just cannot approach it in a fire-and-forget way. It has two modes of operation and various PDF "styles" can respond differently to each mode.
If you have a series of similarly structured PDFs, try importing a few manually (e.g. using IPython), and take note of which mode worked better and of any adjustments you needed (such as detection thresholds). Then you can pretty much automate the rest with these collected parameters.
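A rough sketch of that workflow, assuming Camelot's `read_pdf` with its two flavors (`lattice` and `stream`) and the `parsing_report` dict it attaches to each table. The helper names (`score_reports`, `pick_flavor`, `try_flavors`) are made up for illustration, and using mean accuracy as the score is my own assumption, not something Camelot prescribes:

```python
from statistics import mean


def score_reports(reports):
    """Mean of the 'accuracy' field over a list of parsing reports (0.0 if empty)."""
    return mean(r["accuracy"] for r in reports) if reports else 0.0


def pick_flavor(reports_by_flavor):
    """Return the flavor whose tables scored the highest mean accuracy."""
    return max(reports_by_flavor, key=lambda f: score_reports(reports_by_flavor[f]))


def try_flavors(pdf_path, pages="1", settings=None):
    """Run each flavor on one sample PDF and collect the parsing reports.

    `settings` maps a flavor name to extra keyword arguments for that flavor,
    e.g. {"lattice": {"line_scale": 40}, "stream": {"row_tol": 10}}.
    """
    import camelot  # third-party; imported lazily so the helpers above work without it

    settings = settings or {"lattice": {}, "stream": {}}
    reports = {}
    for flavor, kwargs in settings.items():
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor, **kwargs)
        reports[flavor] = [t.parsing_report for t in tables]
    return reports
```

In IPython you would call `pick_flavor(try_flavors("sample.pdf"))` on a representative file, eyeball the resulting tables anyway, and then hard-code the winning flavor and settings for the rest of the batch.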
Tabula's algorithm either clusters text or looks for lines bordering cells. The text clustering is hit or miss. As for line detection, with the standard alternating-row shading (and no lines), Tabula only picks up the text in the shaded rows.
I've used Tabula for many years, and it works great. I use it to convert PDF invoices from a specific supplier for import into our system. It turned a task that took about a day each month into one that takes about 10 minutes.
Nope, it doesn't. I've honestly found Tabula pretty limited in its use. When it works, it works well, but when it doesn't, you're still stuck writing a lot of hodgepodge code. Not sure why there isn't more criticism of it.
where myparser.py is a 20-line Python script with a couple of regexes and a simple state machine, works absolute wonders and can extract relevant data from PDFs even when the data isn't really organized in a table.
Also, pdftotext is open source, written in C++, and doesn't require installing a bottomless pit of dependencies like Tabula does.
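For anyone wondering what such a script could look like, here is a minimal sketch of the regex-plus-state-machine idea. It is not the actual myparser.py: the invoice-like layout (a "Description" header line, item rows ending in a quantity and a price, a "Total" line) and all the names below are assumptions chosen for illustration, and the input is assumed to be plain text such as `pdftotext -layout` produces.

```python
import re

# Hypothetical layout: description, then 2+ spaces, quantity, price with two decimals.
ITEM_RE = re.compile(r"^\s*(?P<desc>\S.*?)\s{2,}(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})\s*$")
START_RE = re.compile(r"^\s*Description\b", re.IGNORECASE)  # header opens the item section
END_RE = re.compile(r"^\s*Total\b", re.IGNORECASE)          # total line closes it


def parse_items(lines):
    """Return (description, quantity, price) tuples found between header and total."""
    state = "SEEK_HEADER"
    items = []
    for line in lines:
        if state == "SEEK_HEADER":
            # Ignore everything until the header line shows up.
            if START_RE.match(line):
                state = "IN_ITEMS"
        elif state == "IN_ITEMS":
            if END_RE.match(line):
                break  # past the item section; stop reading
            m = ITEM_RE.match(line)
            if m:  # blank or unrelated lines inside the section are skipped
                items.append((m["desc"], int(m["qty"]), float(m["price"])))
    return items
```

Hooked up to a pipe, you would pass `sys.stdin` to `parse_items` and print the tuples. The state variable is what keeps the item regex from firing on lookalike lines outside the section you care about.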
And of course, neither of these things will solve extracting data from PDFs that embed rasterized images or data that is the result of a complex SVG-type rendering.
I believe the real solution to that problem will be: render the PDF to an image at high resolution and pipe it to some ML-powered process that reverse-engineers the relevant data out of arbitrary images.
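The rendering half of that idea is easy to sketch today. The snippet below assumes poppler's `pdftoppm` is on the PATH (it ships alongside pdftotext) and uses its `-r` (DPI) and `-png` options; the function names and the 300 DPI default are my own choices, and the ML step is deliberately left out since that part is still speculation.

```python
import subprocess


def render_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm argument list that renders each page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def render_pages(pdf_path, out_prefix, dpi=300):
    """Render the PDF to PNG files whose names start with out_prefix; raises if pdftoppm fails."""
    subprocess.run(render_command(pdf_path, out_prefix, dpi), check=True)
```

The resulting page images would then be handed to whatever image-to-data model you end up trusting.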
Can't even execute this on my local machine.
Why bundle this with a web server and then distribute the code to run locally? Also, is this written in Ruby or Java? Make up your mind.
1) https://excalibur-py.readthedocs.io