Extracting Tables from PDFs in Javascript with PDF.js (garysieling.com)
52 points by garysieling on Sept 21, 2013 | 6 comments



Very nice. I've been doing some table extraction from PDFs recently. Also check out PDF2JSON for Node.js-based parsing: it grabs all the text and positions and dumps them out as JSON, so you don't have to 'intercept' draw calls.
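
Once you have text fragments with positions, the table step is mostly a matter of bucketing them into rows and columns. A rough sketch, assuming a hypothetical item shape of `{ x, y, text }` (this is not the exact PDF2JSON or PDF.js output format, and `groupIntoRows`, `toCsv`, and `yTolerance` are names made up for illustration):

```javascript
// Hypothetical input: one { x, y, text } object per text fragment on a page.
// Real extractor output will need to be mapped into this shape first.
function groupIntoRows(items, yTolerance = 2) {
  // Sort top to bottom, then left to right.
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const rows = [];
  for (const item of sorted) {
    const last = rows[rows.length - 1];
    // Start a new row when the item is further than yTolerance
    // from the y of the first item in the current row.
    if (!last || Math.abs(item.y - last.y) > yTolerance) {
      rows.push({ y: item.y, cells: [item] });
    } else {
      last.cells.push(item);
    }
  }
  // Within each row, order cells by x and keep only the text.
  return rows.map((row) =>
    row.cells.sort((a, b) => a.x - b.x).map((cell) => cell.text)
  );
}

// Quote every cell and double any embedded quotes.
function toCsv(rows) {
  return rows
    .map((row) =>
      row.map((cell) => `"${String(cell).replace(/"/g, '""')}"`).join(",")
    )
    .join("\n");
}
```

This only handles simple grids. Merged cells, wrapped cell text, and empty cells (which would need column positions inferred from x) are not covered.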


Thanks. I looked into that recently; it does make this a lot easier, so now I have a Node version of this as well.


I thought this was discussed on HN before, but I only found this link: https://news.ycombinator.com/item?id=6083051

Kudos to Gary for packaging this up: https://github.com/garysieling/pdf-js-csv

Of course it has issues extracting data from many tables. There is a body of research literature on how to automatically extract tabular data from PDF (and other sources) and it is not considered an easy task.

You can always fall back to a manual tool like Tabula. They also have automatic table detection now, but last I checked it only worked on certain kinds of tables.

I write the PDF table extraction code for docmunch.com. We think we have figured out how to achieve a very high degree of accuracy in PDF table extraction and how to make a nice UI for manual intervention. We would love to hear about your table extraction use cases.


I've done similar (but more single-use) things to extract text from PDFs, and data from PDF and PostScript plots. PDFs are actually surprisingly easy to dig into when they're decompressed (e.g. with pdftk), since they're mostly text based.
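
To give a sense of what "mostly text based" looks like: in a decompressed content stream, simple text is drawn with the `Tj` operator applied to a literal string in parentheses. A minimal sketch of pulling those strings out, assuming the stream is already decompressed and available as a JavaScript string (`extractShownText` is a hypothetical helper name):

```javascript
// Assumes `stream` is the text of an already-decompressed content stream.
// Only handles literal strings shown with Tj, e.g. "(Hello) Tj".
function extractShownText(stream) {
  // A literal string "( ... )" followed by the Tj operator.
  // Backslash escapes such as \( and \) are allowed inside the string.
  const pattern = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  const results = [];
  let match;
  while ((match = pattern.exec(stream)) !== null) {
    // Undo the simple escapes \( \) and \\.
    results.push(match[1].replace(/\\([()\\])/g, "$1"));
  }
  return results;
}
```

This ignores `TJ` arrays, hex strings, unescaped nested parentheses, and font encodings, so on many real PDFs the recovered strings will be incomplete or unreadable. It is closer to the single-use kind of digging than to a general parser.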


You can also use pdf2html with the option -x (to get XML). You would then also have the position of each text token.


Pretty cool, thank you for sharing



