PDF format does not give you enough semantic information to understand there is ...

haolez · on Sept 5, 2023

Yeah, but Textract uses OCR/computer vision even in PDFs with embedded text data and it can extract tables incredibly well. I believe there isn't an open source equivalent. Maybe some advanced usage of tesseract?

aidos · on Sept 5, 2023

This seems to have stalled but if popped up a few times on HN in the past. Might still be worth a look.

https://github.com/tabulapdf/tabula

Are the documents scans, or do they have real text on them? It’s worth trying to convert them to svg or html using “mutool convert” and then seeing what you can do with the results. If you’re dealing with the same type of document each time you’ll probably find the patterns in there are common enough that you can easily grab what you want.