
The PDF format does not give you enough semantic information to tell that a table is there. The content stream only contains drawing instructions, such as moving to a coordinate, adding some text, adding some lines. No tool can extract tables with 100% precision.
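
To make that concrete, here is a rough sketch of the guesswork an extractor is left with. It assumes you already have text fragments as `(x, y, text)` tuples; the function name `group_into_rows` and the `y_tolerance` threshold are made up for illustration and are not taken from any particular tool.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Guess table rows from positioned text.

    fragments: iterable of (x, y, text) tuples (hypothetical input shape).
    Fragments whose y is within y_tolerance of a row's first fragment
    are treated as the same row; cells are then ordered by x.
    """
    rows = []  # each entry: (y of first fragment in the row, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Sort each row's cells left to right and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


fragments = [
    (200, 100.5, "Price"),
    (50, 100.0, "Item"),
    (50, 120.0, "Apple"),
    (200, 119.2, "1.50"),
]
print(group_into_rows(fragments))
```

The tolerance is a heuristic: too small and one row splits in two, too large and neighbouring rows merge. Nothing in the file says which answer is right.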



Yeah, but Textract uses OCR/computer vision even on PDFs with embedded text data, and it can extract tables incredibly well. I don't believe there is an open source equivalent. Maybe some advanced usage of Tesseract?


This seems to have stalled, but it popped up a few times on HN in the past. Might still be worth a look.

https://github.com/tabulapdf/tabula

Are the documents scans, or do they have real text in them? It’s worth trying to convert them to SVG or HTML using “mutool convert” and then seeing what you can do with the results. If you’re dealing with the same type of document each time, you’ll probably find the patterns in there are common enough that you can easily grab what you want.
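
A minimal sketch of the "see what you can do with the results" step, using only the Python standard library. It assumes the converted HTML places each line of text in a `<p>` element with an inline style carrying `top` and `left` offsets in points; that markup is an assumption, so check what your converter actually emits. The class name `PositionedText` is hypothetical.

```python
import re
from html.parser import HTMLParser

# Matches "top:12.5pt" or "left:40pt", but not "margin-left:..." (assumed style format).
STYLE_POS = re.compile(r"(?<![\w-])(top|left)\s*:\s*(-?\d+(?:\.\d+)?)pt")


class PositionedText(HTMLParser):
    """Collect (left, top, text) for every <p> that has a top/left inline style."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._pos = None  # (left, top) of the <p> currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        style = dict(attrs).get("style") or ""
        found = dict(STYLE_POS.findall(style))
        if "top" in found and "left" in found:
            self._pos = (float(found["left"]), float(found["top"]))
            self._buf = []

    def handle_data(self, data):
        if self._pos is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._pos is not None:
            text = "".join(self._buf).strip()
            if text:
                self.fragments.append((*self._pos, text))
            self._pos = None


sample = (
    '<div id="page1">'
    '<p style="top:100.0pt;left:50.0pt"><span>Item</span></p>'
    '<p style="top:100.0pt;left:200.0pt"><span>Price</span></p>'
    "</div>"
)
parser = PositionedText()
parser.feed(sample)
print(parser.fragments)
```

With documents of the same type, the `left` values of each column tend to repeat from file to file, so filtering fragments by a known offset is often enough to pull out one column.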



