Extracting Tables from PDFs in Javascript with PDF.js

unmei · on Sept 21, 2013

Very nice. I've been doing some table extraction from PDFs recently. Also check out PDF2JSON for Node.js-based parsing: it grabs all the text and positions and dumps them out as JSON, so you don't have to 'intercept' draw calls.
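
Once a parser hands back text with positions, a rough first pass at rebuilding a table is to group items that share roughly the same vertical position into rows, then order each row left to right. The sketch below assumes a hypothetical input shape of { x, y, text } with y growing downward; this is not the actual PDF2JSON output format, so the field names would need adapting, and the tolerance value is a guess that depends on the document.

```javascript
// Hypothetical input shape: each item is { x, y, text } in page units.
// This is NOT the real PDF2JSON schema; map your parser's fields onto it.
function groupIntoRows(items, yTolerance = 2) {
  // Sort top-to-bottom first, then left-to-right.
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const rows = [];
  for (const item of sorted) {
    const last = rows[rows.length - 1];
    // Join the current row if this item is within the tolerance of the
    // row's first item; otherwise start a new row.
    if (last && Math.abs(item.y - last.y) <= yTolerance) {
      last.cells.push(item);
    } else {
      rows.push({ y: item.y, cells: [item] });
    }
  }
  // Order cells within each row by x and keep only the text.
  return rows.map(row =>
    row.cells.sort((a, b) => a.x - b.x).map(cell => cell.text)
  );
}

// Made-up sample data: two rows with slightly jittered baselines.
const sample = [
  { x: 120, y: 50.4, text: "Price" },
  { x: 10, y: 50, text: "Item" },
  { x: 10, y: 70, text: "Widget" },
  { x: 120, y: 70.2, text: "4.99" },
];
const table = groupIntoRows(sample);
console.log(table.map(row => row.join(",")).join("\n"));
```

This only recovers rows. Real tables usually also need column alignment (for example, clustering x positions) to cope with empty cells and multi-line cells, which is where most of the difficulty lies.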

garysieling · on Sept 21, 2013

Thanks. I looked into that recently, and it does make this a lot easier, so I now have a Node version of this as well.

gregwebs · on Sept 21, 2013

I thought this was discussed on HN before, but I only found this link: https://news.ycombinator.com/item?id=6083051

Kudos to Gary for packaging this up: https://github.com/garysieling/pdf-js-csv

Of course it has issues extracting data from many tables. There is a body of research literature on how to automatically extract tabular data from PDF (and other sources) and it is not considered an easy task.

You can always fall back to a manual tool like Tabula. They also have automatic table detection now, but last I checked it only worked on certain kinds of tables.

I write the PDF table extraction code for docmunch.com. We think we have figured out how to achieve a very high degree of accuracy in PDF table extraction and how to make a nice UI for manual intervention. We would love to hear about your table extraction use cases.

mkl · on Sept 21, 2013

I've done similar (but more single-use) things to extract text from PDFs, and data from PDF and PostScript plots. PDFs are actually surprisingly easy to dig into when they're decompressed (e.g. with pdftk), since they're mostly text based.
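
To illustrate why a decompressed PDF is approachable: page content streams are sequences of plain-text operators, and simple text is often drawn with a literal string followed by the Tj (show text) operator. The sketch below pulls those strings out of a content stream with a regular expression. The function name and the sample stream are made up for illustration, and it deliberately ignores TJ arrays, hex strings, octal escapes, and font encodings, all of which show up in real files.

```javascript
// Assumption: `stream` is the text of an already-decompressed PDF content stream.
function extractShownStrings(stream) {
  // Match a literal string "( ... )" followed by the Tj operator.
  // Backslash-escaped characters such as \( and \) may appear inside.
  const pattern = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  const results = [];
  let match;
  while ((match = pattern.exec(stream)) !== null) {
    // Undo the simple escapes for parentheses and backslash only.
    results.push(match[1].replace(/\\([()\\])/g, "$1"));
  }
  return results;
}

// Made-up sample: a text object (BT ... ET) that draws two strings.
const sampleStream = [
  "BT",
  "/F1 12 Tf",
  "72 720 Td",
  "(Item) Tj",
  "100 0 Td",
  "(Price \\(USD\\)) Tj",
  "ET",
].join("\n");

const shown = extractShownStrings(sampleStream);
console.log(shown);
```

The Td operands next to each Tj carry the positioning, so a fuller version would track those as well in order to recover where each string lands on the page.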

trez · on Sept 21, 2013

You can also use pdf2html with the option -x (to get XML). You would also have the position of each text token.

briankim · on Sept 22, 2013

Pretty cool, thank you for sharing.