I thought this was discussed on HN before, but I only found this link: https://news.ycombinator.com/item?id=6083051

Kudos to Gary for packaging this up: https://github.com/garysieling/pdf-js-csv

Of course it has issues extracting data from many tables. There is a body of research literature on how to automatically extract tabular data from PDF (and other sources) and it is not considered an easy task.

You can always fall back to a manual tool like Tabula. They also have automatic table detection now, but last I checked it only worked on certain kinds of tables.

I write the PDF table extraction code for docmunch.com. We think we have figured out how to achieve a very high degree of accuracy in PDF table extraction and how to make a nice UI for manual intervention. We would love to hear about your table extraction use cases.



