
Try extracting tabular data from a PDF! With XML it's trivial, but for PDF you need highly specialized software packages to do this. One of the best, pdfplumber, is largely based [1] on a Master's thesis titled Algorithmic Extraction of Data in Tables in PDF Documents [2].

[1] https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/...

[2] https://trepo.tuni.fi/bitstream/handle/123456789/21520/Nurmi...




This was mostly aimed at the various ways an XML document may or may not conform to any number of XSD types. What we 'see' as a table might not be described and stored as a table in the same way in XML. And by XML I mean whatever XML Microsoft Office generates.

A 6000-page spec, with attributes that specify whether data is tabular based on various properties (be it columns and rows or just plain text with start and stop pointers) and that may or may not render it visually as a table, is error-prone, even in first-party implementations (the first-party desktop versions on Windows vary, as do those on macOS, Android, iOS and the web offering).

If there were one simple data structure describing the table, with all other aspects optional, then yes, an XML-based format would be easier than OCR. But that's not the case I was pointing at.
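
To illustrate what the "one simple data structure" case would look like, here is a minimal sketch in Python. The `<table>/<row>/<cell>` schema, `SAMPLE` and `read_table` are made up for illustration; this is not what Office actually emits.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal schema: one <table>, rows of <cell> elements.
# Real office documents are nowhere near this simple.
SAMPLE = """
<table>
  <row><cell>name</cell><cell>qty</cell></row>
  <row><cell>apple</cell><cell>3</cell></row>
  <row><cell>pear</cell><cell>5</cell></row>
</table>
"""

def read_table(xml_text):
    """Return the table as a list of rows, each a list of cell strings."""
    root = ET.fromstring(xml_text.strip())
    return [
        [(cell.text or "").strip() for cell in row.findall("cell")]
        for row in root.findall("row")
    ]

rows = read_table(SAMPLE)
```

With a format like that, extraction is a few lines. The trouble starts when the same visual table can be encoded in several different ways.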


> for PDF you need highly specialized software packages to do this.

Not really, or at least not all that specialized. You need:

a: a PDF-to-raster-image converter (i.e. any working PDF viewer, plus maybe the X server it talks to)

b: a reasonably decent OCR system capable of scanning tables (definitely nontrivial, but hardly "highly specialized" since things other than PDFs display data in tables).
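
For a sense of where the nontrivial part of (b) sits, here is a rough Python sketch of one step: grouping recognized words into table rows by vertical position. It assumes the OCR stage has already produced `(x, y, text)` tuples; that input shape, the `group_into_rows` name and the `y_tol` tolerance are all hypothetical, and this is not how any particular OCR system or pdfplumber does it.

```python
def group_into_rows(words, y_tol=3.0):
    """Group (x, y, text) tuples into rows of text, ordered left to right.

    A word joins the current row if its y is within y_tol of the y of the
    first word placed in that row; otherwise it starts a new row.
    """
    rows = []  # each entry: [anchor_y, [(x, text), ...]]
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y - rows[-1][0]) <= y_tol:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Sort each row's cells by x so they read left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]

# Slightly jittered coordinates, as a scan might produce.
words = [
    (10, 100.5, "name"), (80, 101.0, "qty"),
    (80, 120.2, "3"), (10, 119.8, "apple"),
]
table = group_into_rows(words)
```

This only recovers rows. Aligning columns, merged cells and multi-line cells is where most of the real work is.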



