This is why iPhone didn't initially ship with double-tap to zoom for PDF paragraphs (like it had for blocks on web pages). I know because I was assigned the feature, and I went over to the PDF guy to ask how I would determine on an arbitrary PDF what was probably a "block" (paragraph), and I got a huge explanation on how hard it would be. I relayed this to my manager and the bug was punted.
Edit: To add a little more color, given that none of us was (or at least certainly I wasn't) an expert on the PDF format, we had so far treated the bug as one of probably at-most moderate complexity (just have to read up on PDF and figure out what the base unit is or whatever). After discovering what this article talks about, it became evident that any solution we cobbled together in the time we had left would really just be signing up for an endless stream of it-doesn't-work-quite-right bugs. So, a feature that would become a bug emitter. I remember in particular considering one of the main use cases: scientific articles, which are usually set in two columns AND also use justified text. A lot of the time the spaces between words could be as large as the spaces between columns, so the statistical "grouping" of characters to try to identify the "macro rectangle" shape could get tricky without severely special-casing for this. All this being said, as the story should make clear, I put about one day of thought into this before the decision was made to avoid it for 1.0, so for all I know there are actually really good solutions to this. Even writing this now I am starting to think of fun ways to deal with this, but at the time, it was one of a huge list of things that needed to get done and had been underestimated in complexity.
> I know because I was assigned the feature, and I went over to the PDF guy to ask how I would determine on an arbitrary PDF what was probably a "block" (paragraph), and I got a huge explanation on how hard it would be.
The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext is probably comparable in difficulty to building level 5 self-driving cars, and would accrue at least as much profit to any company that can solve it. But there are hundreds of billions of dollars going into self-driving cars, and like zero dollars going into this problem.
What are the groups that would benefit most from the PDF-to-HTML conversion? Who are the customers that would drive this profit? I tried to phrase those questions so they don't sound contentious, but unfortunately they still might; I am genuinely curious about this space and who is feeling the lack of this technology most.
Almost any business that has physical suppliers or business customers.
PDF is the de facto standard for any invoicing, POs, quotes, etc.
If you solve the problem, you can effectively deal programmatically with invoicing/payments and large parts of ordering/dispensing. It's a no-brainer to add it on to almost any financial/procurement software that deals with inter-business stuff.
Any small-to-medium physical business can probably halve their financial department if you can dependably solve this issue.
A business that invests in building a machine that reads data, produced by a 3rd-party machine, using a format intended for lay humans to read, is not investing in the right tech IMO.
Small-mediums should be looking to consolidate buying through a few good suppliers and working with them directly to automate processes, or adopting interchange formats.
The problem for some small businesses is that the cost (process changes, licensing, etc.) of adopting interchange formats and working with large vendors is prohibitive at their scale, e.g. the airline BSP system.
I agree that solving the problem generally, i.e. replacing an accounts-payable staff person capable of processing arbitrary invoice documents, will be comparable to self-driving in difficulty.
If a company deals with a lot of a single type of PDF, then the approach could be economical. I am actually involved in a project looking at doing this with AWS Textract.
> A business that invests in building a machine that reads data, produced by a 3rd-party machine, using a format intended for lay humans to read, is not investing in the right tech IMO.
Building machines that understand formats that are understood by humans is exactly what we should be doing. People should read, write, and process information in a format that is comfortable and optimized for them. Machines should bend to us; we should not bend to them.
If businesses only dealt with machine readable formats, everyone's computer would still be using the command line.
And there's real condescension in your post:
> Small-mediums should be looking to consolidate buying through a few good suppliers and working with them directly to automate processes
You're saying that businesses need to change their business to accommodate data formats, but it should be the other way around.
The proliferation of computers in business over the last 50 years is precisely because businesses can save money/expand capacity by adapting the business processes to the capabilities of the computers.
Over that time, computers have become more friendly to humans, but businesses have adapted and humans have been trained to use what computers can do.
Yes, most invoices are in PDF, but only about 40% of them are native PDFs, meaning actual documents rather than scanned images converted to PDF. There are also compound PDF invoices, which contain images. So, in order to extract data from them, one needs not only a good PDF parser but an OCR engine too.
This is a huge pet peeve of mine. Most invoices are generated on a computer (often in Word) but a huge fraction of the people who generate them don't know how to export to a PDF. So they print the invoice on paper, scan it back in to a PDF, and email that to you. Thus the proliferation of bitmap PDFs.
> So, in order to extract data from them, one needs not only a good PDF parser but an OCR engine too.
You can go further. Invoices often contain block sections of text with important terms of the invoice, such as shipping time information, insurance, warranties, etc. To build something that works universally, you also need very good natural language processing.
If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for? You can always just render an image of a document and then use that.
> If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for?
This should be obvious, but the answer is because OCR engines are not terribly accurate. If you have a native PDF, you're far better off parsing the PDF than converting it to an image and OCRing. But if OCR ever becomes perfect, then sure.
> The market SOTA Abbyy is far from being accurate.
While Abbyy is likely the best, it's also incredibly expensive. Roughly on the order of $0.01/page or maybe at best a tenth of that in high volume.
For comparison, I run a bunch of OCR servers using the open source tesseract library. The machine-time on one of the major cloud providers works out to roughly $0.01 for 100-1000 pages.
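For anyone who wants to see roughly what that kind of pipeline looks like, here is a minimal sketch using only open-source CLI tools (poppler-utils' pdftoppm plus tesseract; file names are placeholders, and this is an illustration rather than the setup described above):

    # rasterize each page of the PDF at 300 DPI (poppler-utils)
    pdftoppm -r 300 -png scanned.pdf page
    # OCR every rendered page; tesseract writes a .txt next to each image
    for img in page-*.png; do
        tesseract "$img" "${img%.png}" -l eng
    done
    # stitch the per-page text back together
    cat page-*.txt > scanned.txt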
So I have a lot of experience with basically the same problem just from working on this: https://www.prettyfwd.com. As an example of the opportunity size just in the email domain, the amount of personal non-spam email sent every day is like 100x the total size of Wikipedia, but nothing is really done with any of this information because of this challenge. Basically applications are things like:
- Better search engine results
- Identifying experts within a company
- Better machine translation
- Finding accounting fraud
- Automating legal processes
For context, the reason why Facebook is the most successful social network is that they're able to turn behavioral residue into content. If you can get better at taking garbage data and repackaging it into something useful, it stands to reason that there are lots of other companies the size of Facebook that can be created.
I often ponder how much of the "old world" will get "digitized" — translated into digital form, bits. And how much will just disappear. The question might seem trivial if you think of books, but now think of architecture, language itself (as it evolves), etc.
There's almost no question in my mind that most new data will endure in some form, by virtue of being digital from day 1.
The endgame for such a company, imho, is to become the "source entity" of information management (in abstracted form), whose two major products are one that expresses this information in the digital space and another that expresses it in the analog/physical space. You may imagine variations of both (e.g. AR/VR for the former).
Kinda like language in the brain is "abstract" (A) (concept = pattern of neurons firing) and then speech "translates" into a given language, like English (B) or French (C) (different sets of neurons). So from A you easily go to either B or C or D... We've observed that Deep Learning actually does that for translation (there's a "new" "hidden" language in the neural net that expresses all human languages in a generic form of sorts, i.e. "A" in the above example).
The similarities between the ontology of language and the ontology of information in a system (e.g. a business) are remarkable — and what you want is really this fundamental object A, this abstract form which then generates all possible expressions of it (among which a little subset of ~1,000 gives you human languages, a mere 300 active iirc; and you might extend that into any formal language fitting the domain, like engineering math/physics, programming code, measurements/KPIs, etc.).
It's a daunting task for sure but doable because the space is highly finite (nothing like behavior for instance; and you make it finite through formalization, provided your first goal is to translate e.g. business knowledge, not Shakespeare). It's also a one-off thing because then you may just iterate (refine) or fork, if the basis is sound enough.
I know it all sounds sci-fi but having looked at the problem from many angles, I've seen the PoC for every step (notably linguistics software before neural nets was really interesting, producing topological graphs in n dimensions of concepts e.g. by association). I'm pretty sure that's the future paradigm of "information encoding" and subsequent decoding, expression.
It's just really big, like telling people in the 1950's that because of this IBM thing, eventually everybody will have to get up to speed like it's 1990 already. But some people "knew", as in seeing the "possible" and even "likely". These were the ones who went on to make those techs and products.
Digital data is arguably more fragile than analogue, offline, paper (or papyrus, or clay tablet) media. We have documents over 3000 years old that can still be read. Meanwhile, the proprietary software necessary to access many existing digital data formats is tied to obsolete hardware, working examples of which may no longer exist, emulators for which may not exist, and insufficient documentation may exist to even enable their creation. Just as one example, see the difficulty in enabling modern access to the BBC's 1986 Domesday Project.
Academics and other people that rely on scientific publications. Most of the world's knowledge in science is locked into PDFs and screenshots (or even pictures) of manufacturers' (often proprietary) software... So extracting it in a more structured way would be a win (so HTML may not be best). On a related note, I've seen people using Okular to convert PDF tables to a usable form (to be honest its table extraction tool is one of the best I've seen despite being pretty manual).
> What are the groups that would benefit most from the PDF-to-HTML conversion? Who are the customers that would drive this profit? I tried to phrase those questions so they don't sound contentious, but unfortunately they still might; I am genuinely curious about this space and who is feeling the lack of this technology most.
Legal technology. Pretty much everything a lawyer submits to a court is in PDF, or is physically mailed and then scanned in as PDF. If you want to build any technology that understands the law, you have to understand PDFs.
Organisations that have existing business processes to publish to print and PDF but now want to publish in responsive formats for mobile or even desktop web.
Changing their process might be more expensive than paying a lot of money for them to carry on as is for a few more years while getting the benefit of modern eyes on their content.
Edit: concrete example would be government publications like budget narrative documents.
I’ve done a bunch of this work myself and while it’s a bit of a pain to do in general, you can make some reasonable attempts at getting something workable for your use cases.
PDFs are incredibly flexible. Text can be specified in a bunch of ways. Glyphs can be defined to the nth degree. Text sometimes isn’t text at all. There’s no layout engine and everything is absolutely positioned. Fonts in PDFs are insane because they’re often subset so they only include the required glyphs, and the characters are remapped back to 1, 2, 3 etc. instead of the usual ASCII codes.
> Fonts in PDFs are insane because they’re often subset so they only include the required glyphs, and the characters are remapped back to 1, 2, 3 etc. instead of the usual ASCII codes.
I've actually seen obfuscation used in a PDF where they load in a custom font that changes the character mapping, so the text you get out of the PDF is gibberish, but the fonts displayed on rendering are correct (a simple character substitution cipher).
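A blunt way to catch that kind of remapped-font gibberish, sketched with stock poppler and tesseract tools (file names are placeholders; this isn't foolproof, just a sanity check): render the page, OCR it, and compare against the extracted text layer.

    # the text layer as the PDF reports it (page 1 only)
    pdftotext -f 1 -l 1 suspect.pdf extracted.txt
    # what the page actually shows, via rendering + OCR
    pdftoppm -singlefile -f 1 -r 300 -png suspect.pdf page
    tesseract page.png ocr -l eng
    # a wildly different result suggests the font's character mapping is scrambled
    diff -u extracted.txt ocr.txt | head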
The important thing to remember whenever you think something should be simple, is that someone somewhere has a business need for it to be more complicated, so you'll likely have to deal with that complication at some point.
Your website demo video is impressive and I can imagine there are many businesses that would save a lot of time and man-hours by incorporating a solution like this.
I've often thought about creating products like these but as a one-man operation I am daunted by the "getting customers" part of the endeavour. How do you get a product like this into the hands of people who make the decisions in a business? (For anyone, not just OP). PPC AdWords campaigns? Cold-calling? Networking your ass off? Pay someone? Basically, how does one solve the "discoverability problem"?
Surprisingly, Hacker News has been our number one source of leads. We tried Google Ads and Reddit Ads, but the signup rate was literally three orders of magnitude lower than organic traffic from Hacker News and Reddit.
Is your product only on the cloud? My privacy/internet security team won't let me use products that save customer or vendor data on the cloud because you might get hacked. Only giants, like Microsoft, have been approved after an evaluation.
More than half of our customers have asked to be able to skip our cloud and go directly to their database. We’re working on this right now. It’s scheduled to be released this week, so keep an eye open.
In the meantime, if you have any questions, feel free to send me an email at siftrics@siftrics.com. I’d love to hop on the phone or do a Zoom meeting or a Google Hangouts.
We do table recognition and pride ourselves on being better at it than ABBYY. We can handle variable number of rows in a table and we take that into account when determining the position of other text on the page.
Feel free to email me at siftrics@siftrics.com with any questions. We can set up a phone call, Zoom meeting, or Google Hangouts too, if you’d like.
Gini GmbH performs document processing for almost all German banks and for many accounting companies. For banks it does realtime invoice photo processing -- OCR and extraction of amount, bank information, receiver etc. For accounting it extracts all kind of data from a PDF. Unfortunately, only for German language market. But here you go, ABBYY by far is not the only one. In fact ABBYY does only OCR and has some mediocre table detection. That's it.
I do not remember which of the two it was, but 'poppler' or 'pdfbox' (they may use the same backend) created great HTML output, with absolute positions. They also have an XML mode, which is easily transformed.
Of course, there is absolutely no semantics, just display.
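For reference, the poppler incarnation of this is pdftohtml's XML mode, which emits one element per text fragment with absolute coordinates (a quick sketch; the exact element and attribute names can vary between versions):

    # pdf2xml-style output: <text top=".." left=".." width=".." height=".." font="..">...</text>
    pdftohtml -xml -i input.pdf output.xml
    # peek at the positioned fragments
    grep '<text ' output.xml | head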
That’s actually often just a consequence of the subsetting (I think). Believe it or not, you can often rebuild the cmaps using information in the pdf to fix the mapping and make the extraction work again.
> That’s actually often just a consequence of the subsetting (I think).
I would believe that. It was a pretty poor obfuscation method as they go, if it was intended for that.
> Believe it or not, you can often rebuild the cmaps using information in the pdf to fix the mapping and make the extraction work again.
Oh, I did. That's the flip side of my second paragraph above. When there's a business need to work around complications or obfuscations, that will also happen. :)
> PDFs are incredibly flexible. Text can be specified in a bunch of ways. Glyphs can be defined to the nth degree. Text sometimes isn’t text at all. There’s no layout engine and everything is absolutely positioned.
Can't stress this enough. The next time you open a multi-column PDF in adobe reader and it selects a set of lines or a paragraph in the way you would expect, know that there is a huge amount of technology going on behind the scenes trying to figure out the start and end of each line and paragraph.
> The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext is probably comparable in difficulty to building level 5 self-driving cars, and would accrue at least as much profit to any company that can solve it. But there are ... like zero dollars going into this problem.
Converting PDFs to HTML well is a very hard problem, but it's hard to build a very big company on that alone. When processing PDFs or documents generally, the value is not in the format, it's in the substantive content.
The real money is not in going from PDF to HTML, but from HTML (or any doc format) into structured knowledge. There are plenty of companies trying to do this (including mine! www.docketalarm.com), and I agree it has the potential to be as big as self-driving cars. However, technology to understand human language and ideas is not nearly as well developed as technology to understand images, video, and radar (what self-driving cars rely on).
The problem is much more difficult to solve than building safer-than-human self-driving cars. If you can build a machine that truly understands text, you have built a general AI.
There's a lot more than zero dollars going into this... it's just that the end result is universally something that's "good enough for this one use-case for this one company" and that's as far as it gets.
Not really, it's just a different set of challenges. The original article sums it up well, in terms of a lack of text-order hints. I haven't really tried incorporating OCR approaches at all, but I suspect they could probably be used to detect hidden text.
The basic issue imho is that NLP algorithms are very inaccurate even with perfect input; e.g. even with perfect input, they're maybe only 75% accurate. And even a text-processing algorithm that's like 99.9% accurate will yield input to your NLP algorithms that's like 50% accurate, so any results will be mostly unusable.
NLP algorithms are just fine. It is the combination of regexes, NLP and deep learning that allows you to achieve good extraction results. So, basically OCR / pdf parser -> jpeg/xml/json -> regexes + NLP / DL extractor.
Semantic segmentation to identify blocks and OCR to convert to text - I think OneNote is already doing that. PDF is a horrible format for representing text, though PostScript is even worse.
“The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext is probably comparable in difficulty to building level 5 self-driving cars, ”
Since you can always print a PDF to a bitmap and use OCR, I assume you're implicitly asking for something that does substantially better. How much better, and why?
> The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext...would accrue at least as much profit [as self-driving cars] to any company that can solve it.
Can you explain a bit more about why this is so valuable? I don't know anything about this industry.
Does this mean some abstraction is lost between the creation phase and final "save to pdf" phase? It'd seem ridiculous to not easily be able to track blocks while it's a WIP.....
I don't know if PDF has truly evolved from its desktop publishing origins, but it is a terrible format because it no longer contains the higher-level source information that you would have in an InDesign or a LaTeX file. PDF/PostScript were meant to represent optical fidelity and thus are too low-level an abstraction for a lot of end-user, word-processing tasks (such as detecting layout features), so trying to reverse engineer the "design intent" from them feels like doing work that is unnecessarily tedious. But that's the way it seems to be, given the popularity of the format.
Where to center is only one vector; the other is how much to zoom: ideally it’s such that the text block fits on the screen. But again, that requires knowing the bounds of the text block. Zooming by a constant wherever you tap is a much less useful feature for text (vs. a map, for instance), but I think it’s what we defaulted to (can’t remember if it was that or just nothing).
One of the main features of the product I work on is data extraction from a specific type of PDF. If you want to build something similar these are my recommendations for you:
- Use a good text parsing library. Regexes are probably not enough for your use case. In case you are not aware of the limitations of regexes you may want to learn about Chomsky hierarchy of formal languages.
Here is the section of our Dockerfile that builds pdf2json for those of you that might need it:
# Download and install pdf2json
ARG PDF2JSON_VERSION=0.70
RUN mkdir -p $HOME/pdf2json-$PDF2JSON_VERSION \
&& cd $HOME/pdf2json-$PDF2JSON_VERSION \
&& wget -q https://github.com/flexpaper/pdf2json/releases/download/$PDF... \
&& tar xzf pdf2json-$PDF2JSON_VERSION.tar.gz \
&& ./configure > /dev/null 2>&1 \
&& make > /dev/null 2>&1 \
&& make install > /dev/null \
&& rm -Rf $HOME/pdf2json-$PDF2JSON_VERSION \
&& cd
I thoroughly enjoyed both the blog post (as an accessible but thorough explanation of your experience with PDF data extraction) and the linked news article [0], an all-too-familiar story of a company realizing that a creative person is using their freely-available data in novel and exciting ways and immediately requesting that they shut it down, because, faced with the perceived dichotomy of maintaining control versus encouraging progress, they will often err on the safe side.
pdf2json font names can sometimes be incorrect, as it only extracts them based on a pre-set collection of fonts. I suggest using this fork that fixes it:
Bounding boxes can also be off with pdf2json. Pdf.js does a better job but has a tendency to not handle some ligatures/glyphs well, sometimes transforming a word like "finish" into "f nish" (eating the "i" in this case). pdfminer (Python) is the best solution yet, but a thousand times slower...
I worked on an online retailer's book scan ingestion pipeline. It's funny because we soon got most of our "scans" as print-ready PDFs, but we still ran them through the OCR pipeline (that would use the underlying pdf text) since parsing it any other way was a small nightmare.
I am an ML engineer at one of the PDF extraction companies processing thousands of invoices and receipts per day in realtime. Before we started adding ML, all our processing logic was built on top of hundreds of regexes and gazetteers. Even now, handcrafted rules are the backbone of our extraction system, with ML used as a fallback.
Yes, regexes accumulate tech debt and become a maintenance black hole, but if they work, they are faster and more accurate than any fancy DL tech out there.
> Use a good text parsing library. Regexes are probably not enough for your use case. In case you are not aware of the limitations of regexes you may want to learn about Chomsky hierarchy of formal languages.
Most programming languages offer a regex engine capable of matching non-regular languages. I agree though, if you are actually trying to _parse_ text then a regex is not the right tool. It just depends on your use case.
For simple cases, I've also found "pdftotext -layout" useful. For a quick one-off job, this would save someone the trouble of assembling the lines themselves.
I have used this in the past to extract tables, but it doesn't help much in cases where you need font size information.
I’m a contractor. One of my gigs involved writing parsers for 20-something different kinds of pdf bank statements. It’s a dark art. Once you’ve done it 20 times it becomes a lot easier. Now we simply POST a pdf to my service and it gets parsed and the data it contains gets chucked into a database. You can go extremely far with naive parsers. That is, regex combined with positionally-aware fixed-length formatting rules. I’m available for hire re. structured extraction from PDFs. I’ve also got a few OCR tricks up my sleeve (eg for when OCR thinks 0 and 6 are the same)
Many years ago, I regularly had to parse specifications of protocols from various electronic exchanges. The general approach I used was to do a first pass using a Linux tool to convert it to text: pdftotext. Something like:
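    # reconstructed example -- the exact command wasn't preserved in this thread
    pdftotext -layout spec.pdf spec.txt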
After that, it was a matter of writing and tweaking custom text parsers (in python or java) until the output was acceptable, generally an XML file consumed by the build (mainly to generate code).
A frequent need was to parse tables describing fields (name, id, description, possible values etc.). Unfortunately, sometimes tables spanned several pages and the column width was different on every page, which made column splitting difficult. So I annotated page jumps with markers (e.g. some 'X' characters indicating where to cut).
As someone else said, this is like black magic, but kind of fun :)
I worked for an epub firm that used a similar approach a while ago - we took PDFs and produced Flash (yes, that old) versions for online, and created iOS and Android apps for the publisher.
I've come across most of the problems in this post but the most memorable thing was when we were asked to support Arabic, when suddenly all your previous assumptions are backwards!
Oh my goodness, this whole thread is deja vu from some code I wrote to parse my bank statements. I arrived at exactly the same solution of "pdftotext -layout" followed by a custom parser in Python. And ran into the same difficulty with tables: I wrote a custom table parser that uses heuristics to decide where column breaks are.
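Not that commenter's Python parser, but the core heuristic can be sketched in a line or two of shell: treat runs of two or more spaces in the -layout output as column separators (the file name and column count here are made up):

    # -layout preserves the visual column positions; runs of 2+ spaces act as separators
    pdftotext -layout statement.pdf - |
        awk -F'  +' 'NF >= 3 { printf "%s | %s | %s\n", $1, $2, $NF }'

The pain starts, as described above, when a cell is empty or a column shifts between pages, because the run of spaces then swallows or moves the boundary.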
I work in the print industry and some clients have the naive idea they'll save money by formatting their own documents (naive because usually this just means a lot more work for us, which they end up paying for).
We need some metadata to rearrange and sort PDF pages for mailing and delivery (such as name, address, and start/end page for that customer).
Our general rule is you provide metadata in an external file to make it easy for us. Otherwise, we run pdftotext and hope there's a consistent formatting for the output (e.g. every first page has "Issue Date:", "Dear XYZ,", or something written on it).
If that doesn't work then we're re-negotiating. It is not too difficult usually to build a parser for one family of PDF files based on a common setup as you've said and you get to learn various tricks. It is very difficult though to write a general parser.
Personally, I found parsing postscript easier since usually it was presented linearly.
I can co-sign this methodology. I used to work in an organization that built PDFs for accounting and licensing documentation. I used a proprietary tool (Planetpress :( ) to generate the documents, using metadata from a separate input file (CSV or XML) to determine what column maps to what field.
The good thing about this was, as you have already outlined, that it allowed for some flexibility in what was acceptable input data. For specific address formats or names we could accept multiple formats as long as they were consistent and in the proper position in the input file.
Regarding renegotiating: We didn't get that far. However, if a customer within our organization was enlisting our expertise and could not produce an acceptable input file, then we would go back to them and explain the format that we require in order to generate the necessary documents. Of course, creating our document through our data pipelines is obviously the better choice, but this was not an option in some cases at the time.
As far as doing the work of creating these documents in a tool like Planetpress is concerned, well, don't use Planetpress. You are better off doing it with your favorite language's libraries, tbh. Nothing worse than having to use proprietary code (Presstalk/PostScript) that you have to learn and will never be able to use anywhere else.
By re-negotiating I mean in terms of quoting billable hours. A rule of thumb for a typical Postscript scraper was around 20 hours end to end (dev, testing, and integration into our workflow system).
The problem we have with a lot of client files is that they look fine but printers don't care about "look fine", they crash hard when they run out of virtual memory due to poor structure. And usually without a helpful error message, so that's more billable hours to diagnose. The most common culprit is workflows that develop single document PDFs then merge them resulting in thousands of similar and highly redundant subset fonts.
Any tricks for decimal points versus noise? It's a terrifying outcome, and all I've got is doing statistical analysis on the data you've already got and highlighting "outliers".
For something like bank statements, I'd use the rigidly-defined formatting (both number formatting and field position) to inform how to interpret OCR misfires. My larger concern then would be missing a leading 1 (1500.00 v 500.00), but checking for dark pixels immediately preceding the number will flag those errors. And I suppose looking for dark pixels between numbers could help with missed decimals too.
I've done this a bit. I define ranges per numeric field and if it exceeds or is below that range, I send it to another queue for manual review. Sometimes I'll write rules where if it's a dollar amount that usually ends ".00" and I don't read a decimal but I do have "00", then I'll just fix that automatically if it's outside my range.
(Novice speaking) Maybe there's something about looking for the spacing / kerning that is taken up by a decimal point? (Not sure if OCR tools have any way to look for this)
OCR tricks? Assuming post-processing dev stuff - may I know your OCR engine? We are supported with Kofax and OpenText, along with cloud engines like GVision as a backup.
I built such a service, but it is impossible to guarantee any reliable result. I ended up shutting it down.
The PDF standard is a mess, and the number of 'tricks' I've seen done is astonishing.
Example: to add a shade or border effect to text, most PDF generators simply add the text twice with a subtle offset and different colors. Result: your SaaS service returns every sentence twice.
Of course there were workarounds, but at some point it became unmaintainable.
I'd say exactly the opposite. PDF makes it easy to create a document that looks exactly the way you want it to, which seems to be all that most web designers want (witness all the sites that force a narrow column on a large screen and won't reflow their text properly on a small screen).
In a way it has. In my experience, there have been multiple times where a "generate PDF" requirement has come up, with the best viable solution being "develop it in HTML using standard tech" followed by "and then convert it to PDF".
The demand for automating text extraction is still very high — or at least it feels like it when you’re working around the clock to cater to 3 of your customers, only to wake up to 10 more the next day. We’re small but growing extremely quickly.
Everything. Insurance companies to fledgling AI startups.
It’s definitely harder to get government business because the sales process is so long and compliance is so stringent. That said, we are GDPR compliant.
Well, I am putting the finishing touches on a front end that allows extracting PDF text visually. It's also able to adjust when the PDF page size varies for a given document type. Once you build the extractor for a document type, it can run on a batch of PDFs and store to Excel or a database (or any other format).
I sense this tool facilitates and automates a lot of the 'dark art' you mention. Of course there are always difficult documents that don't fit exactly in the initial extraction paradigm, for those I use the big guns ...
I'd also be interested in a blog or any basic tips/examples! I totally understand you don't want to give too much away, but I'm sure HN would love to see it!
I remember writing one of my first parsers was for a pdf and I had to employ a similar methodology where I had to rely on regex and "positionally-aware fixed-length" formatting rules. I would literally chunk specific groups by the number of spaces they contained lol. I had to do very little manual intervention but, damn it all, it worked :D .
I've written similar code for investment banks, to extract financial reporting data from PDFs. It's shocking to think how much of the financial world runs on this kind of tin-cans-on-a-piece-of-string solution.
My first internship was at a small company that did PDF parsing and building for EU government agencies and it was really painful work but paid an absolute shitton.
Are you me? Wish that I had known the insertion-order trick, though it isn't straightforward to implement with the stack I was using at a previous gig (Tabula + naive parsing + Pandas data munging). I can expand on a few challenges I've run into when parsing PDFs:
# Parser drift and maintenance hell
Let's say that you receive 100 invoices a month from a company over the course of 3 months. You look over a handful of examples, pick features that appear to be invariant, and determine your parsing approach. You build your parser. You're associating charges from tables with the sections they're declared in, and possibly making some kind of classification to make sure everything is adding up right. It works for the example PDF or two you were building against. It goes live.
You get a call or bug report: it's not working. You try the new PDF they send you. It looks similar, but won't parse because it is--in fact--subtly different. It has a slightly different formatting of the phone number on the cover page, but is identical everywhere else. You change things to account for that. You retest your examples; they break. OK, two different formats, same month, same supplier. You fix it. Chekhov's Gun has been planted.
A month passes, it breaks. You inspect the offending pdf. Someone racked up enough charges they no longer fit on a page. You alter the parser to check the next page. Sometimes their name appears again, sometimes not, sometimes their next page is 300 pages away. It works again.
A few more months later, a sense of deja vu starts to set in. Didn't I fix this already? You start tracking three PDFs across 3 months:
pdf 1 : a -> b -> c (starts with format a, changes to be the same as pdf 2, then changes again)
pdf 2 : b -> b -> c (starts with one format, stays the same, then changes the same way as pdf 1)
pdf 3 : b -> a -> b (starts the same as pdf 2, changes to pdf 1's original format, then changes back)
What's the common factor between these version changes? The return address is determining the version.
PDFs are slightly different from office to office, with templates drifting slightly each month in diverging directions. You have to start reevaluating parsing choices and splitting up parsers. It's difficult to account for incurring linear maintenance cost for each new supplier and amortize that over a sizeable period of time. My arch nemesis is an intern who got put to work fixing the invoices at one office of one foreign supplier.
# PDFs that aren't standards compliant
In this case, most pdf processing libraries will bail out. Pdf viewers on the other hand will silently ignore some corrupted or malformed data. I remember seeing one that would be consistently off by a single bit. Something like `\setfont !2` needed to have '!' swapped out for another syntactically valid character that would leave byte offsets for the pdf unchanged.
TLDR: If you can push back, push back. Take your data in any format other than PDF if there is any way that is possible.
If you upload a pdf to google drive and download it 10 minutes later it will magically have BY FAR the best OCR results in the pdf. Note my pdf tests were fairly clean so your experience may not be the same.
I have used Google's fine OCR results to simulate a hacker.
- Download a youtube video that shows how to attack a server on the website hackthebox.eu
2 minutes is probably long enough. I did notice that google drive doesn't seem to like it if you upload a lot of files. I have had files sit and never get OCR, but I forgot about them so they may have OCR on them now.
PDF is, without a doubt, one of the worst file formats ever produced and should really be destroyed with fire... That said, as long as you think of PDF as an image format it's less soul destroying to deal with.
PDF is good at what it's supposed to be good at.
Parsing pdf to extract data is like using a rock as a hammer and a screw as a nail, if you try hard enough it'll eventually work but it was never intended to be used that way.
I think my fastener analogy would probably involve something more like trying to remove a screw that's been epoxied in. Or perhaps trying to do your own repairs on a Samsung phone.
It's not that the thing you're trying to do is stupid. It's probably entirely legitimate, and driven by a real need. It's just that the original designers of the thing you're trying to work on didn't give a damn about your ability to work on it.
Actually, parsing text data from a pdf is more like using the rock to unscrew a screw, in that it was not meant to be done that way at all. But yeah, the pdf was designed to provide a fixed-format document that could be displayed or printed with the same output regardless of the device used.
I'm not sure (I haven't thought about it a lot) that you could come up with a format that duplicates that function and is also easier to parse or edit.
It's pretty silly when you think about it. There's an underlying assumption that you'll work with the data in the original format that you used to make the PDF.
QFT. PDF should really have been called “Print Description Format”. At heart it’s really just a long list of non-linear drawing instructions for plotting font glyphs; a sort of cut-down PostScript.
(And, yes, I have done automated text extraction on raw PDF, via Python’s pdfminer. Even with library support, it is super nasty and brittle, and very document specific. Makes DOCX/XLSX parsing seem a walk in the park.)
What’s really annoying is that the PDF format is also extensible, which allows additional capabilities such as user-editable forms (XFDF) and Accessibility support.
Accessibility makes text content available as honest-to-goodness actual text, which is precisely what you want when doing text extraction. What’s good for disabled humans is good for machines too; who knew?
i.e. PDF format already offers the solution you seek. Yet you could probably count on the fingers of one hand the PDF generators that write Accessible PDF as standard.
(As for who’s to blame for that, I leave others to join up the dots.)
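If you want to check whether a given file is one of those rare accessible ones before falling back to heuristics, poppler's pdfinfo will tell you (just an illustration; a "yes" still doesn't guarantee the tagging is any good):

    # "Tagged: yes" means the PDF carries logical structure (Tagged PDF)
    pdfinfo paper.pdf | grep -i '^Tagged'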
PDF is great at what it's meant to be, a digital printed paper, with its pros (it will look exactly the same anywhere) and cons (can't easily extract data from it or modify it).
Currently, there is no viable alternative if you want the pros but not the cons
For me, the biggest con of PDFs is that like physical books, the font family and size cannot be changed. This means you can't blow the text up without having to scroll horizontally to read each line or change the font to one you prefer for whatever reason. It boggles my mind that we accept throwing away the raw underlying text that forms a PDF. PDF is one step above a JPEG containing the same contents.
> Currently, there is no viable alternative if you want the pros but not the cons
I remember OpenXPS being much easier to work with. That might be due to cultural rather than structural differences, mind - fewer applications generate OpenXPS, so there's fewer applications to generate them in their own special snowflake ways.
This is the first time I heard of it. When I search for it I only find the Wikipedia article and 99 links to how to convert it to pdf.
The problem with this is that from an average person perspective it doesn't have the pros. There is no built-in or first-party app that can open this format on Mac and Linux.
More than 99% of the users only want to read or print it. It's hard to convince them to use an alternative format when it's way more difficult to do the only thing they want to do.
It's a Windows thing, since Windows 7, IIRC. It's OK now, but it was buggy for years, and, yes, who consumes XPS files anyway? So however much better it is, it's not more useful.
We have to fill existing PDFs from a wide range of vendors and clients. Our approach is to raster all PDFs to 300DPI PNG images before doing anything with them.
Once you have something as a PNG (or any other format you can get into a Bitmap), throwing it against something like System.Drawing in .NET(core) is trivial. Once you are in this domain, you can do literally anything you want with that PDF. Barcodes, images, sideways text, html, OpenGL-rendered scenes, etc. It's the least stressful way I can imagine dealing with PDFs. For final delivery, we recombine the images into a PDF that simply has these as scaled 1:1 to the document. No one can tell the difference between source and destination PDF unless they look at the file size on disk.
This approach is non-ideal if minimal document size is a concern and you can't deal with the PNG bloat compared to native PDF. It is also problematic if you would like to perform text extraction. We use this technique for documents that are ultimately printed, emailed to customers, or submitted to long-term storage systems (which currently get populated with scanned content anyways).
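The rasterize-and-recombine steps themselves can also be done with stock open-source tools rather than System.Drawing; a rough sketch of the same idea (an illustration, not the poster's actual stack; file names are placeholders):

    # burst the PDF into 300 DPI PNGs (poppler-utils)
    pdftoppm -r 300 -png original.pdf page
    # ...draw barcodes, text, images, etc. onto the PNGs here...
    # wrap the edited images back up as a PDF; img2pdf embeds them without re-encoding
    img2pdf page-*.png -o filled.pdf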
You could probably reduce file size by generating your additions as a single PDF, and then combining that with the original 'form', using something like
There's iText 7 (also for Java). Not sure how it compares with other libraries, but it will parse text along with coordinates. You just need to write your own extraction strategy to parse how you want.
From my experience, it seems to grab text just fine, the tricky part is identifying & grabbing what you want, and ignoring what you don't want... (for reasons mentioned in the article)
... and is much more difficult to extract text from than PDF, given that it's Turing complete (hello, halting problem) and doesn't even restrict your output to a particular bounding box.
I read ebooks on my Nintendo DSi for several years when I was in college; The low-resolution screen combined with my need for glasses (and dislike of wearing them) made reading PDF files unbearable. Later on I got a cheap android tablet and reading PDF files was easier, but still required constant panning and zooming. Today I use a more modern device (2013 Nexus 7 or 2014 NVidia Shield), and I still don't like PDF files. I usually open the PDF in word if possible, save it in another format, then convert to epub with calibre, and dump the other formats.
Epubs in comparison are easy, as all it takes is a single tap or button press to continue. When there's no DRM on the file (thanks HB, Baen) I read in FBReader with custom fonts, colors, and text size. It doesn't hurt any that the epub files I get are usually smaller than the PDF version of the same book.
Personally, I think the fact that Calibre's format converter has so many device templates for PDF conversion says a lot.
As a meta point, it's really nice to see such a well-written, well-researched article that is obviously used as a form of lead generation for the company, and yet has no in-your-face "calls to action" which try to stop you reading the article you came for and get out your wallet instead.
i mean except for the banner at the top and bottom! but yeah, an SEO article with actual substance, well formatted, not grey-on-grey[1], no trackers[2], is rare these days.
[1] recently read an SEO post on okta's site. who can read that garbage?
it doesn't correlate across sites by default -- the reasonable definition of a 3rd-party tracker. by your definition, everything not completely self-hosted is a 3rd-party tracker. eg, netlify, which uses server logs to "self"-analyze, would be a 3rd-party tracker. it is not self-hosted and the data is stored elsewhere.
some might add: for the purpose of resale of the data, but I don't think that's a requirement to be classified as 3rd party tracker. the mere act of correlation, no matter what you then do with the data, makes you a 3rd party tracker. in case you think that's just semantics, this is important for GDPR and the new california law.
you can turn on the "doubleclick" option, which does do said correlation and tracks you. but that's up to the site to decide. GA doesn't do it by default.
The best technique for having a PDF with extractable data is to include the data within the PDF itself. That is what LibreOffice can do, it can slip in the entire original document within a PDF. Since a compressed file is quite small, the resulting files are not that much larger, and then you don't need to fuss with OCR or anything else.
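On the receiving side, you can get that embedded original back out without any parsing heuristics at all; poppler ships a small tool for exactly this (assuming the producer really did embed the source document):

    # list and extract files embedded in the PDF (e.g. the source .odt of a hybrid PDF)
    pdfdetach -list hybrid.pdf
    pdfdetach -saveall hybrid.pdf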
Yes to embedding. In Canada, folks have always been able to e-file tax returns, but the CRA (Canada Revenue Agency) also has fillable PDF form for folks who insist on mailing in their returns (with their receipts and stuff so they don't have to store them and risk losing them).
When you're done filling the form, the PDF runs form validity checks and generates a 2D barcode [1] -- which stores all your field-entry data -- on the first page. This 2D barcode can then be digitally extracted on the receiving end with either a 2D barcode scanner or a computer algorithm. No loss of fidelity.
Looks like Acrobat supports generation of QR, PDF417 and Data Matrix 2D barcodes.[2]
Just the receipts relevant to the tax return. If you e-file you're responsible for storing receipts up to 6 years in case of audit. (or something like that)
If you're worried about malicious differences, "regular" PDFs are worse.
As noted in the article, it is extremely difficult to figure out the original text given only a "normal" PDF, so you end up using a lot of heuristics that sometimes guess correctly. There's no guarantee that you'll be able to extract the "original text" when you start with an arbitrary PDF without embedded data. So if you're extracting text, neither way guarantees that you'll get "original text" that exactly matches the displayed PDF if an attacker created the PDF.
That said, there's more you can do if you have an embedded OpenDocument file. For example, you could OCR the displayed PDF, and then show the differences with the embedded file. In some cases you could even regenerate the displayed PDF & do a comparison. There are lots of advantages when you have the embedded data.
It's nice to note how several of these problems already exist in much more structured document types, such as HTML.
Using white-on-white black-hat SEO techniques for keyword boosting? Check. Custom fonts with random glyphs? Check. I didn't see custom encodings (yet).
We try to keep HTML semantic, but Google has been interpreting pages at a much higher level in order to spot issues such as these. If you ever tried to work on a scraper, you know how hard it is to get far nowadays without using a full-blown browser as a backend.
What worries me is that it's going to get massively worse. Despite me hating HTML/web interfaces, one big advantage for me is that everything which looks like text is normally selectable, as opposed to a standard native widget which isn't. It's just much more "usable", as a user, because everything you see can be manipulated.
We've seen already asm.js-based dynamic text layout inspired by tex with canvas rendering that has no selectable content and/or suffers from all the OP issues! Now, make it fast and popular with WASM...
Hiding page content unless rendered via JS is the darkest dark pattern in HTML I've noted.
Though absolute-positioning of all text elements via CSS at some arbitrary level (I've seen it by paragraph), such that source order has no relationship to display order, is quite close.
I went down a rabbit hole while making a canvas based UI library from scratch.. and started reading about the history of NeWS, display postscript, and postscript in general.
What actually needs to be done to extract text correctly is to be able to parse the postscript, have a way of figuring out how the raw text.. or the curves that draw the text.. are displayed (whether they are or not and in relation to each other) using information that the postscript gives you.
Edit: More than anything I think understanding deeply the class of PDFs you want to extract data from is the most important part. Trying to generalize it is where the real difficulty comes from.. as in most things.
A couple of years ago I was working on a home project and utilised Tesseract and Leptonica for OCR, with HDFS, HBase and SolrCloud for storage and search over the extracted text. You can find the details on my website. I was very impressed with the conversion of handwritten PDF docs, with 90% readable accuracy. I have named it Content Data Store (CDS): http://ammozon.co.in/headtohead/?p=153 . The source code is open and you can find installation steps and how to run it here:
http://ammozon.co.in/headtohead/?p=129
http://ammozon.co.in/headtohead/?p=126
A short demo:
http://ammozon.co.in/gif/ocr.gif
I did not get time to enhance it further but am planning to containerize the whole application. See if you find it useful in its current form.
I had a similar problem and ended up using AWS's Textract tool to return the text as well as bounding-box data for each letter, then overlaid that on a UI with an SVG of the original page, allowing the user to highlight handwritten and typed text. I plan to open source it, so if anyone's interested let me know.
Not a fan of the potential vendor lock in though, so it's only really suitable for those in an already AWS environment not worried about them harvesting your data.
I use a screen reader, so of course some kind of text extraction is how I read PDFs all the time. There were some nice gotchas I've found.
* Polish ebooks, which usually use Watermarks instead of DRM, sometimes hide their watermarks in a weird way the screen reader doesn't detect. Imagine hearing "This copy belongs to address at example dot com: one one three a f six nine c c" at the end of every page. Of course the hex string is usually much longer, about 32 chars long or so.
* Some tests I had to take included automatically generated alt texts for their images. The alt text contained full paths to the JPG files on the designer's hard drive. For example, there was one exercise where we were supposed to identify a building. Normally, it would be completely inaccessible, but the alt was something like "C:\Documents and Settings\Aneczka\Confidential\tests 20xx\history\colosseum.jpg".
* My German textbook had a few conversations between authors or editors in random places. They weren't visible, but my screen reader still could read them. I guess they used the PDF or Indesign project files themselves as a dirty workaround for the lack of a chat / notetaking app, kind of like programmers sometimes do with comments. They probably thought they were the only ones that will ever read them. They were mostly right, as the file was meant for printing, and I was probably the only one who managed to get an electronic copy.
* Some big companies, mostly carriers, sometimes give you contract templates. They let you familiarize yourself with the terms before you decide to sign, in which case they ask you for all the necessary personal info and give you a real contract. Sometimes though, they're quite lazy, and the template contracts are actually real contracts. The personal data of people that they were meant for is all there, just visually covered, usually by making it white on white, or by putting a rectangle object that covers them. Of course, for a screen reader, this makes no difference, and the data are still there.
Similar issues happen on websites, mostly with cookie banners, which are supposed to cover the whole site and make it impossible to use before closing. However, for a screen reader, they sometimes appear at the very beginning or end of the page, and interacting with the site is possible without even realizing they're there.
I almost always have to resort to a dedicated parser for that specific PDF. I use it, for example, to ingest invoice data from suppliers that won't send me plain text. I always end up with a parser per supplier, and copious amounts of sanity checking to notify me when they break/change the format.
I'm an ML engineer; I worked part-time as a data engineering consultant for a medical lines/claims extraction company for 3 years, which mostly involved extracting tabular data from PDFs and images. Developing rules or parsers as such is JUST no help: you end up creating a new rule every time you miss an extraction.
With that in mind, and since existing resources are little help, especially with skewed, blurry, or handwritten input, or two different table structures in the same input, I ended up creating an API service to extract tabular data from images and PDFs, hosted at https://extracttable.com . We built it to be robust; average extraction time on images is under 5 seconds. On top of maintaining accuracy, a bad extraction is eligible for a credit-usage refund, which literally no other service offers.
I invite HN users to give it a try, and feel free to email saradhi@extracttable.com for extra API credits for the trial.
Hi, author and maintainer of Tabula (https://github.com/tabulapdf/tabula). We've been trying to contact you about the "Tabula Pro" version that you are offering.
Am I reading the repos correctly? It looks like ExtractTable copied Tabula (MIT) to its own repo rather than forking it, removed the attribution, and then tried to re-license it as Apache 2.0. If so, that would be pretty fucked up.
Not really. They import tabula_py, which is a Python wrapper around tabula-java (the library of which I'm a maintainer).
Still, I would have loved at least a heads up from the team that sells Tabula Pro. I know they're not required to do so, but hey, they're kinda piggybacking on Tabula's "reputation".
If you control the Tabula trademark (which doesn't necessarily require a formal registration), you may be able to prohibit them from using the TabulaPro name. That's exactly what trademark law is for.
William, the intention of "TabulaPro" is to give the developers a chance to use a single library instead of switching ExtractTable for images and tabula-py for text PDFs.
What do you recommend we do so that you don't feel we made a dick move?
"commercialize the original author's work with the author"
- No, but let me highlight this, any extraction with tabula-py is not commercialized - you can look into the wrapper too :) or even compare the results with tabula-py vs tabulaPro.
Copying the TabulaPro description here, "TabulaPro is a layer on tabula-py library to extract tables from Scan PDFs and Images." - we respect every effort of the contributors & author, never intended to plagiarize.
I understand the misinterpretation here is that we are charging for the open-sourced library because of the name. We already informed the author in the email about unpublishing the library. This morning, I just deleted the project and came here to mention it is deleted :)
Sorry, Saradhi, I don't think you can reasonably claim there was no intention to plagiarize. Adding a "pro" to something is clearly meant to suggest it's the paid version of something. And it's equally clear that "TabulaPro" is derived from "Tabula".
It may be that you didn't realize that people would see your appropriation as wrong, although I have a hard time believing that as well given that the author tried to contact you and was ignored. As they say, "The wicked flee when no man pursueth."
So what I see here is somebody knowingly doing something dodgy and then panicking when getting caught. If you'd really like to make amends, I'd start with some serious introspection on what you actually did, and an honest conversation with the original author that hopefully includes a proper [1] apology.
And I'm going to add that it's really weird that your answer ("No, No, Zero") is exactly the same as what the library author said [1] two hours before you posted. But you did that again without acknowledging the author, and with just enough difference in formatting that it's not a copy-paste. It's extremely hard for me to imagine you didn't read what he said before writing that; it's just too similar.
I chuckled at your "the worst image" sample. Which still looked quite decent all things considered.
Your "handwritten" example looks a bit "too decent" as well. I can see how that works: you first look for the edges of the table, and then evaluate the symbol in each cell as something that matches a Unicode character.
So, how well does this cope with increasing degradation? i.e. pencil written notes that bleed outside cell borders, curve around borders, etc.? Stamps and symbols (watermarks) across tables?
"pencil written notes that bleed outside cell borders, curve around borders, etc.? Stamps and symbols (watermarks) across tables?"
"The Worst Image" is a close match to that, except it is a print.
Regarding increasing degradation: as stated above, the OCR engine is not proprietary. At the moment we confine ourselves to detecting the structure, and we started with the most common problems.
What a glorious format for storing mankind's knowledge. Consider that by now displays have arbitrary sizes and a variety of proportions, and that papers are often never printed but only read from screens. To reflow text for different screen sizes, you need its ‘semantic’ structure.
And meanwhile if you say on HN that HTML should be used instead of PDF for papers, people will jump on you insisting that they need PDF for precise formatting of their papers—which mostly barely differ from Markdown by having two columns and formulas. What exactly they need ‘precise formatting’ for, and why it can't be solved with MathML and image fallback, they can't say.
Not everything needs to look good on every screen size. I don't expect to be able to read academic papers on my smartwatch, and a simple alarm clock app looks kind of silly when it's fullscreen across a desktop monitor. Likewise, when layout makes the difference between reader understanding or confusion, it's hard to trust automatic reflowing on unknown screen sizes.
PDF is simply better than HTML when it comes to preserving layout as the author intended.
I wonder if you realize that both your points wildly miss what I said.
First, there's no need to stretch my argument to the point of it being ridiculous. I don't have to reach for a watch to suffer from PDF. Even a tablet is enough: I don't see many 14" tablets flying off the shelves. I also know for sure that the vast majority of ubiquitous communicator devices, aka smartphones, are about the same size as mine, so everyone with those is guaranteed to have the same shitty experience with papers on their communicators and will have to sedentary-lifestyle their ass off in front of a big display for barely any reason.
Secondly:
> Likewise, when layout makes the difference between reader understanding or confusion, it's hard to trust automatic reflowing on unknown screen sizes. PDF is simply better than HTML when it comes to preserving layout as the author intended.
As I wrote right there above, still no explanation of why the layout makes that difference and why preserving it is so important when most papers are just walls of text + images + some formulas. Somehow I'm able to read those very things off HTML pages just fine.
Some people are very "funny" about the layout of items and text and want it preserved identically to their "vision" when they created it. For example, every "marketing" individual, when they see a webpage, seems to want it pixel-perfect.
I think it's the artist in them.
This is understandable in some instances:
a. Picasso's or Monet's works probably wouldn't be as good if you just roll them up into a ball. Sure, the component parts are still there (it's just paper/canvas and paint after all!) but the result isn't what they intended.
b. A car that has hit a tree is made up of the composite parts but isn't quite as appealing (or useful) as the car before hitting the tree.
c. A wedding cake doesn't look as good if the ingredients are just thrown over the wedding party's table. The ingredients are there, but it just isn't the same...
That's a tradeoff between complex formatting and accessibility of the result. Authors are making readers sit in front of desktops/laptops for some wins in formatting. Considering that papers, at least ones that I see, are all just columns of text, images and formulas, the win seems to be marginal, while the loss in accessibility is maddening with the current tech-ecosphere.
> A PDF isn’t for storage it’s for display. It’s the equivalent of a printout.
This conjecture would have some practical relevance if I had access to the same papers in other formats, preferably HTML. Yet I'm saddened time and again to find that I don't.
In fact, producing HTML or PDF from the same source was exactly my proposed route, before I was told that apparently TeX is only good for printing or PDFs. I hope this is false, but I'm not in a position to argue it currently.
But when you access a paper it’s for reading it, correct?
It is worrying if places that are “libraries” of knowledge aren’t taking the opportunity to keep searchable/parseable data, but it’s no worse than a library of books.
That's not my complaint in the first place. The problem is that while we progressed beyond books on the device side in terms of even just the viewport, we seemingly can't move past the letter-sized paged format. The format may be a bit better than books—what with it being easily distributed and with occasionally copyable text—but not enough so.
I'm not even touching the topic of info extraction here, since it's pretty hard on its own and despite it also being better with HTML.
Yeah, it's better with HTML than with PDF, but it's still pretty terrible... Use an actually structured data format like XML (XHTML would be good), because you don't want to include a complete browser just to search for text.
HTML has all the same problems and degrades over time. A PDF from 20 years ago will at least be readable by a human; an HTML page doesn't even guarantee that much.
You're right that most of the relevant semantics would fit into Markdown. So store the markdown! There are problems with PDF but HTML is the worst of all worlds.
What exactly degrades about HTML in twenty years? I can read pages from the 90s just fine: the main thing off is the font size due to the change in screen resolutions, but—surprise!—plain HTML scales and reflows beautifully on big and small screens. (Which is the complete opposite of ‘HTML has the same problems’.) I hope you're not lamenting the loss of the ‘blink’ tag.
If you're talking about images and whatnot falling off, that's a problem of delivery and not the format.
Markdown translates to HTML one-to-one, it's in the basic features of Markdown. For some reason I have to repeat time and again: use a subset of HTML for papers, not ‘glamor magazine’ formatting. The use of HTML doesn't oblige you to go wild with its features.
> What exactly degrades about HTML in twenty years? I can read pages from the 90s just fine: the main thing off is the font size due to the change in screen resolutions, but—surprise!—plain HTML scales and reflows beautifully on big and small screens. (Which is the complete opposite of ‘HTML has the same problems’.) I hope you're not lamenting the loss of the ‘blink’ tag.
I am indeed, and of other tags that are no longer supported. Old sites are often impossible to render with the correct layout. Resources refuse to load because of mixed-content policy or because they're simply gone - which is a problem with the format because the format is not built for providing the whole page as a single artifact. And while the oldest generation of sites embraced the reflowing of HTML, the CSS2-era sites did not, so it's not at all clear that they will be usable on different-resolution screens in the future.
> Markdown translates to HTML one-to-one, it's in the basic features of Markdown. For some reason I have to repeat time and again: use a subset of HTML for papers, not ‘glamor magazine’ formatting. The use of HTML doesn't oblige you to go wild with its features.
This is one of those things that sounds easy but is impossible in practice. Unless you can clearly define the line between which features should be used and which should not, you'll end up with all of the features of HTML being used, and all of the problems that result.
> Unless you can clearly define the line between which features should be used and which should not, you'll end up with all of the features of HTML being used, and all of the problems that result.
You'll notice that I said in the top-level comment that my beef is with PDF papers (i.e. scientific and tech). I don't care about magazines and such, since they obviously have different requirements. So let's transfer your argument to current papers publishing:
“Since PDF can format text and graphics in arbitrary ways, you'll end up with papers that look like glamor and design magazines and laid out like Principia Discordia and Dada posters. You'll have embedded audio, video and 3D objects since PDF supports those, and since it can embed Flash apps, you'll have e.g. ‘RSS Reader, calculator, and online maps’ as suggested by Adobe, and probably also games. PDF also has Javascript and interactive input forms, so papers will be dynamic and interactive and function as clients to web servers.”
You can decide for yourself whether this corresponds to reality, and if the hijinks of CSS2-era websites are relevant.
What is it with people, one after another, jumping to the same argument of ‘if authors have HTML, they will immediately go bonkers’? It really looks like some Freudian transfer of innate tendencies. We have Epub, for chrissake, which is zipped HTML—what, have Epub books gone full Dada while I wasn't looking? Most of the trouble I've had with Epub is inconvenience with preformatted code.
> Old sites are often impossible to render with the correct layout. Resources refuse to load because of mixed-content policy or because they're simply gone - which is a problem with the format because the format is not built for providing the whole page as a single artifact.
Yes, as I mentioned under the link provided in the top-level comment, the non-use of a packaged-HTML delivery is precisely my beef here. The entire idea of using HTML for papers implies employing a package format, since papers are usually stored locally. It's a chicken-and-egg problem. It's solved by the industry picking one of the dozen available package formats and some version of HTML for the content. Which would still mean that HTML is used for formatting. HTML could be embedded in PDF for all I care, if I can sanely read the damn thing on my phone.
> “Since PDF can format text and graphics in arbitrary ways, you'll end up with papers that look like glamor and design magazines and laid out like Principia Discordia and Dada posters. You'll have embedded audio, video and 3D objects since PDF supports those, and since it can embed Flash apps, you'll have e.g. ‘RSS Reader, calculator, and online maps’ as suggested by Adobe, and probably also games. PDF also has Javascript and interactive input forms, so papers will be dynamic and interactive and function as clients to web servers.”
Those things don't happen in PDFs in the wild, or at least not to any great extent. It's not that technical paper authors have shown some special restraint and limited themselves to a subset of what the rest of the PDF world does. Technical papers look much like any other PDF and do much the same thing that any other PDF does; if they were authored in HTML, we should expect them to look much like any other HTML page and do much the same thing that any other HTML page does. Based on my experience of HTML pages, that would be a massive regression.
> Yes, as I mentioned under the link provided in the top-level comment, the non-use of a packaged-HTML delivery is precisely my beef here. The entire idea of using HTML for papers implies employing a package format, since papers are usually stored locally. It's a chicken-and-egg problem. It's solved by the industry picking one of the dozen available package formats and some version of HTML for the content. Which would still mean that HTML is used for formatting. HTML could be embedded in PDF for all I care, if I can sanely read the damn thing on my phone.
The details matter; you can't just handwave the idea of a sensible set of restrictions and a good packaging format, because it turns out those concepts mean very different things to different people. If you want to talk about, say, Epub, then we can potentially have a productive conversation about how practical it is to format papers adequately in the Epub subset of CSS and how useful Epub reflowing is versus how often a document doesn't render correctly in a given Epub reader. If all you can say about your proposal is "a subset of HTML" then of course people will assume you're proposing to use the same kind of HTML found on the web, because that's the most common example of what "a subset of HTML" looks like.
> It's not that technical paper authors have shown some special restraint and limited themselves to a subset of what the rest of the PDF world does. Technical papers look much like any other PDF and do much the same thing that any other PDF does.
This makes zero sense to me. You're saying that technical papers look the same as Principia Discordia or glamor/design magazines or advertising booklets, including those that just archive printed media. That technical papers include web-form functionality just like some PDFs do—advertising or whatnot, I'm not sure. If that's the reality for you then truly I would abhor living in it—guess I'm relatively lucky here in my world.
However, if you point me to where such papers hang out, I would at least finally learn what mysterious ‘complex formatting’ people want in papers and which can only be satisfied by PDF.
Actually, no thanks. "Semantic" structure is how we got the responsive web soup of ugly websites with hamburger menus.
We need the opposite: a format that keeps the same size and proportions and is vectorized so you can zoom to any level, while the spatial relationships between elements remain constant.
PDF is an amazing format IMO. Think of it like Docker - the designer knows exactly how it's going to appear on the user's device.
No, not everything is orthogonal. When you have semantic structure, you're gonna display it in a responsive and adaptive way. That __breaks__ the design intent. For a designer, WYSIWYG is a godsend. The parent comment is right: PDF is like a Docker container for designers, people who work with media.
If you have opposing thoughts, please elaborate further instead of simply saying "nothing to do with it". HN works by explaining and arguing about issues to get to the bottom of something.
I read "I think the problem is in your head" as talking about the other user personally. Looking more closely, I can read it as a general statement, in which case it wasn't a personal attack.
Statements of the form "So you're saying [obviously stupid thing]?" still break the site guidelines, though.
Well, firstly I don't need to place words in fermienrico's mouth, since the comment to which I was replying says: ‘"Semantic" structure is how we got responsive web soup of ugly websites with hamburger menus’. On my part I'm trying to figure out why fermienrico considers that connection to be inevitable, in the context of HTML production, as compared to PDF production—i.e. with my correspondent taking the place of an author.
Next:
> I don't think that's a problem in HTML. I think the problem is in your head.
This directly addresses fermienrico's complaint as being targeted at HTML. The ‘in your head’ part says that the problem is imaginary and that the person's reasoning and motivation in arriving at that connection is a mystery to me. So there are two meanings combined, both of which aren't ad hominems. The statement also deliberately invokes the words of the character of Preobrazhensky from Bulgakov's ‘Heart of a Dog’, on the condition of the fresh-born Soviet Union: “The Disruption isn't in the lavatories, it's in the heads”.
Let's inspect the statement more closely. The use of the second-person ‘you’ in hypothetical constructs is ubiquitous in English, instead of a third-person ‘a man’ or ‘one’, e.g.: “When you try reading PDF on a phone, you experience unspeakable horror and loss of all hope”—this doesn't imply that the addressed correspondent is the one doing this, and in fact may be directed at multiple unknown readers or listeners.
The hyperbole of ‘you're nuts’ is, to my knowledge, also a typical feature of colloquial English language, e.g.: “You must be crazy to try reading PDF on a phone”, or “What's wrong with you, that tablet is too small to display PDF adequately”. These both don't mean that the correspondent is literally mentally damaged, but that the speaking party doesn't understand their reasoning or doesn't agree with it.
My choice of words there is rather harsh, yes. Why I needed that is, I've had this same discussion before and I repeatedly failed to extract from people the reasons why they make this inference. I tried different approaches, and now came the time of directly placing the person in the shoes of an author. Still nothing so far.
On top of all this and nitpicking further, even disregarding the above I still can't quite fit the statement in a ‘personal attack’ category: as I understand it, an ‘ad hominem’ works by carrying a belief ‘the person has some bad quality A’ over to ‘the person's opinion B hence must be wrong’. In the case of ‘you must be insane/naive to have the opinion B’ no other personal qualities are involved, and the opinion B is directly stated to be wrong. Possibly uncivil and possibly unsupported yes, personal no.
P.S. Could someone please make the ‘collapse comments’ button-link larger on phones? It's even worse on a higher-PPI display, easily taking a dozen attempts to hit it—and extensions like Stylus aren't readily available on phones, what with Firefox dumping them in Preview. Just making the link a dozen characters wide would be splendid.
One thing to point out about our library is that while we do take PDF as input and use it to calculate visual features, we also rely on an HTML representation of the PDF for structural cues. In our pipeline this is typically done by using Adobe Acrobat to generate an HTML representation for each input PDF.
What type of visual features are you looking at? I've been trying to find a web-clipper that uses both visual and structural cues from the rendered page and HTML, but have no luck finding a good starting point.
There are a handful. We look at bounding boxes to featurize which spans are visually aligned with other spans, which page a span is on, etc. You can see more in the code at [1]. In general, visual features seem to give some nice redundancy on top of the structural features of HTML, which helps when dealing with an input as noisy as PDF.
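As a toy illustration of that kind of feature (not the library's actual code), something like this computes pairwise cues from span bounding boxes; the dict layout and feature names are invented for the example:

    def visual_features(a, b, tol=2.0):
        """a, b: spans as {"bbox": (x0, y0, x1, y1), "page": int} in PDF points."""
        ax0, ay0, ax1, ay1 = a["bbox"]
        bx0, by0, bx1, by1 = b["bbox"]
        return {
            "same_page": a["page"] == b["page"],
            "left_aligned": abs(ax0 - bx0) <= tol,
            "right_aligned": abs(ax1 - bx1) <= tol,
            # vertical distance from a's bottom edge to b's top edge (y grows upward in PDF)
            "vertical_gap": ay0 - by1,
        }

    span_a = {"bbox": (72.0, 700.0, 300.0, 712.0), "page": 1}
    span_b = {"bbox": (72.0, 686.0, 290.0, 698.0), "page": 1}
    print(visual_features(span_a, span_b))  # same page, left-aligned, small gap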
You'd hope so, but some printers run very finicky software with less horsepower than your desktop machine, so they can fall over on complex PDF structures. I preferred Postscript!
Not sure if I'm misunderstanding but PCL is another page description language like PS, some printers can use both depending on the driver.
Most of our Xerox printers spoke Postscript natively, these days more printers can use PDF. We generally used a tool to convert PCL to PS to suit our workflow if that was the only option for the file, because being able to manipulate the file (reordering and applying barcodes or minor text modifications) was important. Likewise for AFP and other formats. PCL jobs were rare so I never worked on them personally.
Unfortunately, when you need the output of program A as the input to B sometimes you have to jump through such hoops. I've never done it with .pdf but I've fought similar battles with .xps and never fully conquered them. (And the parser was unstable as hell, besides--it would break with every version and sometimes for far lesser reasons.)
The relatively small company I work for makes me fill out some forms by hand, because they receive them from vendors as a PDF. So I print it out, sign it, and return it to my company by hand.
If someone could make a service that lets you upload a PDF that contains a form, and then let users fill out that form and e-sign it and collect the results, and then print them out all at once, it would be great.
It's not a billion dollar idea but there are a lot of little companies that would save a lot of time using it.
There are quite a few services that should be able to solve this problem (turning a PDF into a web form and collecting signatures). Here are a few of the services I'm aware of:
(I know about all these because I'm working on a PDF generation service for developers called DocSpring [1]. I'm also working on e-signature support [2], but that's still under development, and still won't be a perfect fit for your use-case.)
I used Xournal for a couple years in college. It was perfect in how simple it was to mix handwritten and typed notes or markup documents. The only thing is that I wish it had some sort of notebook organization feature. It would have been nice keeping all of my course notes in one file, broken down by chapter or daily pages. Instead, I ended up with a bunch of individual xojs that did the job but made searching for material take longer.
Coming from a documents format world (publishing), there are a lot of cases like this.
In theory it sounds like it should be straightforward but it hinges so much on how well the document is structured underneath the surface.
Because these tools were designed primarily for non-technical users, the priority is the visual and printed outcome, not the underlying structure.
One document can look much the same as another in form (black borders to outline fields, similar or identical field names, etc.) but may be structured entirely differently underneath, and that can be a madhouse of frustrating problems.
It can be complex enough to write a solution for one specific document source. Writing a universal tool that could take in any form like that would probably be a pretty decent moneymaker.
My first intuition, though, is that it may be more successful (though no simpler) to develop a model that reads from the rendered visual of the document rather than parsing its internals.
Out of curiosity, what exactly are non-technical people doing with PDF's, and why does there need to be a universal tool in the space? What would the tool do with the extracted data?
All kinds of things. PDF is the unifying data exchange format for a lot of businesses who use computers at some end to manage things and need to exchange documents of any kind without relying on the old "can you open Word files?" type problems.
There is a wide world outside of consumers of SaaS products for every little niche problem.
Sometimes they are baked into processes that still use PDFs to share information, sometimes they're old forms of any kind, sometimes even old scanned docs that are still in use but shared digitally. A lot of the businesses that carry on that way are of the mind that "if it's not broke, don't fix it", which is quite rational given their problem areas and existing knowledge base. They might be a potential market at some point for a new solution, but good luck selling them on a web-based subscription SaaS solution when a simple form has been serving their needs for 30+ years.
OP's problem of the PDF being the go-between to digital endpoints is more common than you might think.
The universality I was referring to was the wide range of possibilities for how a given form might be laid out. And old documents contain a lot of noise when they've been added to or manipulated. Look inside an old PDF form from some small-to-medium-sized business sometime. Now imagine 1,000 variations of that form for one standard problem. Then multiply that by the number of potential problem areas the forms are managing.
Also like OP said—it's not sexy, but it's very real and having an intelligent PDF form reader and consumer would be a time-saver for those businesses who aren't geared to completely alter their workflow.
The tool could do anything with the extracted data, if it let you connect to any of your in-house services (like payroll or accounting), either with a quick config/API or a custom patch, or to Google Drive, or whatever, without complications like requiring online access and web accounts. No whole solution like that exists to my knowledge. At least nothing accessible to the wider market.
Thanks for the comment, this is really interesting. I guess I'm still confused about what people actually do with these PDFs, though. Are people looking at a PDF sent to them and manually entering that data somewhere else (like payroll or accounting), so this tool would take that data from the PDF and pump it in there automatically?
Thanks again, I just want to make sure I understand.
I assume you mean a drawn form as opposed to a true PDF form. The former would be difficult to parse automatically into inputs.
OTOH, a PDF form works exactly the way you'd like. Maybe there's a small market in helping convert one to the other for collecting input from old paper-ish forms.
Something like that would work for signing, but the hard part is "turn this PDF into an online form". That way, after a user finishes a form, you can perform some basic error checking: did they fill out everything, is this field in a valid format, etc. After 100 employees turn in a multi-page printed-out form, someone has to go through it and make sure they signed everywhere, filled out all the fields, etc.
Again, not sexy, but it is so stupid I have to fill out a direct deposit form by hand and turn it into my company, who checks it, then hands it off to the payroll vendor, who has to check it, just to enter the damn data into a form on their end.
> By looking at the content, understanding what it is talking about and knowing that vegetables are washed before chopping, we can determine that A C B D is the correct order. Determining this algorithmically is a difficult problem.
Sorry, this is a bit off-topic regarding PDF extraction, but it distracted me greatly while reading...
I'm pretty sure the intention was A B C D (cut then wash). Not sure why the author would not use alphabetical order for the recipe...
[edit] Sorry, I showed it to a colleague and he mentioned the A B C D annotations were probably not in the original document. This was not clear to me at all while reading, and if they are not included it is indeed hard to find the correct paragraph order.
Even if the ABCD was in the original document, how would the computer figure out it's supposed to indicate the order?
And of course, even if the letters were there in the original document, it would be clear to a human that they're incorrect because it doesn't make sense to wash vegetables after cutting.
The article mentions various ways in which text that appears normal can actually be screwed up inside a PDF. I have found this when running PDFs through the BeeLine Reader PDF converter that my startup built.
One workaround I've found is that sometimes it helps to "print to PDF" the original PDF using Preview on Mac. This doesn't fix all the problems, but it does sometimes fix issues with the input PDF — even though both files appear identical to the human eye.
Are there any other workarounds or "PDF cleaners" out there? It would be awesome if there were a web-based service where you could get a PDF de-gunkified, for lack of a better term.
I had to go through a fair bit of this when writing my Android receipt printer driver. Parse a PDF print job, detect tables and basic formatting, align text to a grid, reformat for 58 mm paper roll width… and that's when the fun begins, since every ESC/POS printer maker supports a different dialect or character encoding set, or maybe just one, or has certain quirks you have to account for…
Is there a tool that works for the limited subset of PDFs generated by Latex? Do those documents have more structure than the average PDF? Less? It'd be nice to extract text from scientific articles at least.
I spent some time extracting abstracts from NLP papers (ACL conferences) and it was mostly straightforward. Using pdfquery to extract PDF -> XML gave each character as an element, and they were mostly ordered sensibly and grouped into paragraphs.
However... this didn't work in some cases, mainly with formatted text but sometimes with PDFs that looked like they were compiled in some nonstandard way. As a result I ended up chucking the XML structure entirely and recompiling the text from character-level coordinates. Formatted text was also an issue, with slightly offset y coordinates from regular characters on the same line.
I'm not sure I could take this experience and say that extracting _all text_ would be straightforward. Hopefully for most documents the XML is nicely structured, but I imagine there are many more opportunities for inconsistencies in how the PDF is generated when thinking about diagrams, tables etc. rather than just abstracts.
Considered writing up a blog post about my experiences with the above but imagined that it was far too niche. Code's here [1] if it's of interest.
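Not the code linked above, but a rough sketch of the character-level regrouping described: sort the characters top-down and left-to-right, then merge characters whose y coordinates fall within a tolerance into the same line (the tolerance also absorbs the slightly offset formatted text):

    def group_chars_into_lines(chars, y_tol=2.0):
        """chars: list of {"x": float, "y": float, "c": str} with PDF coordinates."""
        lines = []
        # PDF y grows upward, so sort by -y to go top-down, then left-to-right
        for ch in sorted(chars, key=lambda c: (-c["y"], c["x"])):
            if lines and abs(lines[-1]["y"] - ch["y"]) <= y_tol:
                lines[-1]["text"] += ch["c"]
            else:
                lines.append({"y": ch["y"], "text": ch["c"]})
        return [line["text"] for line in lines]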
I've had remarkably good results in general (for reading) using the Poppler library's "pdftotext" utility. Since it defaults to writing output to a file, I wrap that in a bash function to arrive at a less-like pager, with page breaks noted:
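I haven't reproduced the original bash function here, but a rough Python equivalent (assuming Poppler's pdftotext is on the PATH) looks like this: extract with -layout, turn the form feeds pdftotext emits between pages into visible markers, then page the result:

    import subprocess, pydoc, sys

    def pdfless(path):
        # "-layout" keeps the original layout; "-" writes to stdout instead of a file
        text = subprocess.run(
            ["pdftotext", "-layout", path, "-"],
            capture_output=True, text=True, check=True,
        ).stdout
        # pdftotext separates pages with form feeds; make the page breaks visible
        text = text.replace("\f", "\n----- page break -----\n")
        pydoc.pager(text)  # uses $PAGER / less when available

    if __name__ == "__main__":
        pdfless(sys.argv[1])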
The key is the "-layout" argument, which preserves the original layout of the document. This ... may not be what you want visually, but it makes backing out the original text somewhat easier.
Of course, requesting the LaTeX sources would be preferred.
Not a general tool but arxiv-vanity - which produces webpages of articles submitted to arxiv - works by parsing the source code that's submitted along with the PDF. You can probably use this data to train a model that converts between pdf, tex, and html.
FYI, redaction from PDF can be similarly difficult. I was once tangentially involved with a piece of PDF redaction software, and due to many different issues with PDF, the solution ended up being to create images of the input PDF, draw over the redacted info, and then create a new PDF that was just a container for the JPEGs. It was the only way to be sure the info wasn't in the PDF at all, since it could be in all kinds of places and duplicated in interesting ways. But since you'd be working on the final rendering, you could be sure everything you covered would be covered in the final output. The biggest challenges after that were related to text extraction, since we wanted a nice UI where you could select text and the redaction would auto-cover it, using a uniform width based on the heights of all characters in the redaction. More often than we were happy with, a user would need to simply use a bounding box, since extracting all the pertinent data related to the text was so hard.
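A hedged sketch of that rasterize-and-rebuild approach with today's Python tooling (pdf2image for rendering via Poppler, Pillow for drawing); the box coordinates are assumed to be given per page in pixels:

    from pdf2image import convert_from_path
    from PIL import ImageDraw

    def redact(in_pdf, out_pdf, boxes_per_page, dpi=200):
        pages = convert_from_path(in_pdf, dpi=dpi)        # render every page to an image
        for page_no, image in enumerate(pages):
            draw = ImageDraw.Draw(image)
            for box in boxes_per_page.get(page_no, []):   # box = (left, top, right, bottom)
                draw.rectangle(box, fill="black")         # paint over the redacted region
        # write a new PDF that is only a container for the page images,
        # so none of the original text objects can leak through
        pages[0].save(out_pdf, save_all=True, append_images=pages[1:])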
I walked away from the product over a decade ago since it always seemed like it'd be trivial for Adobe to implement the feature in Reader. Though every couple of years there's a redaction scandal, and I keep wondering how lucrative the product could have been with some marketing.
> our most successful solution was to run OCR on these pages.
That's the most interesting point in the article.
Reminds me of how a friend managed to fix bugs in an assembly source file written in the original programmer's very own undocumented special language implemented in the assembler's macro language. He disassembled the resulting object file, fixed the problems, and checked in the disassembly as the new source code.
The open source project I work on [0] returns the letters, their positions and other associated information.
We provide support for retrieving words as well as a bunch of different algorithms for document layout analysis [1]. But like the other commenters here mention, it's an extremely difficult problem which doesn't have an easy or general solution.
I was trying to build a custom library on top of the open-source library that did a bit more processing, multi-column analysis, statistical analysis of whitespace size, etc. But building something that works for the general case is difficult enough to be functionally impossible.
Despite that, I think the PDF format is well suited to what it is for, and there are very few "implementation mistakes" in the spec itself (no up-front length for inline image data is the main one, plus accessibility obviously). It has ultimately become too successful, and as a result developers are stuck handling cases where it's being used for entirely the wrong purpose, but I can't see another format gaining purchase for the correct purpose (perhaps it's like JavaScript in that way: huge adoption because it was first, not because it does every job well).
Perhaps a content-first format which also handles presentation well could gain a foothold if it came with a shim for PDF viewers and software to use but I dread to think how much effort that would be.
I've also done a lot of work in this space and one thing I don't understand is why more extraction libraries don't support images as input. If your PDF isn't layered or OCR'd, it might as well be an image. I've lost count of the number of times I've downloaded some PDF extraction tool and then had to hack it into accepting an image.
The open-source Ghostscript [1] can convert simple PDFs to text, while keeping the layout. I doubt it will handle some of the more complicated cases outlined in the article though.
I use it quite successfully to turn my bank statements into text, which can then be further processed.
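If you haven't used it, the txtwrite device is what does this; a minimal invocation, wrapped in Python here, might look like the following (the gs flags are standard Ghostscript options):

    import subprocess

    def pdf_to_text(in_pdf, out_txt):
        # txtwrite is Ghostscript's text-extraction device
        subprocess.run(
            ["gs", "-q", "-dBATCH", "-dNOPAUSE",
             "-sDEVICE=txtwrite", f"-sOutputFile={out_txt}", in_pdf],
            check=True,
        )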
I've recently done this. I've scanned over 5,000 documents to PDF, then batch-converted those from PDF to TIFF using Ghostscript, and then used Tesseract to OCR the TIFFs and combine both back into a searchable PDF. Tesseract may not be the world's best OCR software, but it's free, and both it and Ghostscript are easy to automate.
Now all I need is a good front end search system for my document archive.
I have a Brother ADS-2700w[1] as my scanner which is network connected. It scans directly to a network share (SMB, but also supports FTP, nfs etc.) and outputs as PDF. The PDFs are basically 'dumb' PDFs in that each page of the PDF is an image all wrapped up inside the PDF container.
So that's where Ghostscript comes in. On a schedule I have a script that picks up new PDFs in the share and runs them through Ghostscript to create a multi-page TIFF; that TIFF is then given to Tesseract (which can't handle PDFs natively), which does the OCR and outputs a nice PDF with searchable text. All very simple.
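A rough sketch of that pipeline as a script (binary names and flags assume stock Ghostscript and Tesseract installs; the file names are illustrative):

    import subprocess

    def ocr_scanned_pdf(in_pdf, out_base, dpi=300):
        tiff = f"{out_base}.tif"
        # 1. render the image-only PDF to a multi-page grayscale TIFF
        subprocess.run(
            ["gs", "-q", "-dBATCH", "-dNOPAUSE",
             "-sDEVICE=tiffgray", f"-r{dpi}",
             f"-sOutputFile={tiff}", in_pdf],
            check=True,
        )
        # 2. OCR the TIFF and emit a searchable PDF (written to out_base.pdf)
        subprocess.run(["tesseract", tiff, out_base, "pdf"], check=True)

    ocr_scanned_pdf("scan_0001.pdf", "scan_0001_ocr")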
The scanning of the pages is very fast, but the scanner takes an age sending the PDFs over the network - its Ethernet port is only 100 Mbit/s, but to be honest I just think the CPU inside the scanner is slow. It also doesn't have enough internal buffer, which means you can't scan the next document until the previous one has finished being sent to the share.
If I hooked the scanner up to USB, then the PC could run the Brother software which does use OCR - but it's not automatic, all it does is display the PDF inside Paperport once the scan is complete. For bulk scanning, it's not workable.
Regarding indexing - I've started looking at Solr, and it might suit my needs. I was hoping for a visual type search system, where you could see thumbnails of the PDFs in the results.
xpdf seems to have started to respect the not-copyable flags, while in days of yore it didn't. So now, even for something like a manual for some command-line tools or a textbook on C++ or Rust, you still have to re-type the text (wtf). Time to remove it and search for something better, something that doesn't need a 0.5 GB update every 3 days (on Windows). (Yes, exaggerating slightly.)
Maybe it's the new QT version, OpenBSD still has the Motif one, and it works great. For Windows you have SumatraPDF which is pretty good and it's libre.
I worked on PDF generating software for years. It's a horrible format that should never have been approved as an ISO standard.
When in doubt, use plain text. It's a million times better in every way that counts.
I wish my bank statements and such could be downloaded as plain text files, instead of massive PDF files that embed another copy of a bunch of typefaces in each file.
Ugh, this. I still fail to understand how a device from 2019, even a phone, could show any rendering delay when scrolling to page 200 of a 400 page static document. I thought PDF was less programmable than PostScript, but there's still got to be some kind of non-local semantics in there.
Since GDPR, businesses are "required" to make your data available to you for transfer in a machine-readable format, and you could argue that PDF is not exactly machine-readable in the sense of the law. In practice I have seen cases where you do get CSV or something similar, but smaller firms especially will probably give you Word documents, Excel files, or PDFs.
Interesting! Halifax Bank in the UK changed their generation library for PDFs the other year such that new statements rendered incorrectly on the Mac. Old statements were fine. New ones were garbage text.
Chrome displayed them fine, Preview on Mac did not.
Trying to communicate this to them was like talking to a tree, or an alien, or a room of catatonic individuals.
I think the root problem here is that most people still have trouble separating the data from the presentation. We have to understand that, in the end, substance always beats form.
My bank lets me download bank statements in several formats, CSV among them - not entirely plain text, although embedded in it; seems like the best choice for the use case.
Wish I had this to share with my boss years ago. My first big project at my first post-college job was building a PDF parser that would generate notifications when a process document had been updated and it was the first time the logged-in user was seeing it (to ensure they read the changelog of the process). Even with a single source of PDFs (one technical document writer), I could only get a 70% success rate because the text I needed to parse was all over the place. When I stated we would need to use OCR to get better results, no further development was done (ROI reasons). The technical writer was unwilling to standardize more than they already had, or to consider an alternative upload process where they confirm the revision information, which didn't help.
I don't envy working on ingesting even more diverse PDFs.
Another site that breaks the browser's back navigation. Why do so many sites do this? Do they imagine they retain user attention for longer if they break navigation? It's pretty trivial to long-press the back button or just close the tab and not come back again to your site...
Good to know! I don't believe PM is possible on hacker news so I hope you don't mind that I describe some details right here?
My browser is the latest (v73.0.1) Firefox on the latest build of Windows 10. I confirmed the issue with all addons disabled, so it is not an addon issue. I think I know what may be responsible. When I initially load the page, the back button works as intended for about a second. After that delay the page seems to load some resources from static.parastorage.com and www.mymobileapp.online. Once those resources have finished loading, the back button does not navigate back to the HN article on the first press; you have to press once more. So I presume a script from one of those domains is responsible. Hope this helps!
Broken back buttons are often due to a site where a placeholder loads and it turns around and loads the real thing. Back takes you to the placeholder which promptly takes you back where you were.
There are sites that explicitly mess with the back button but I haven't seen one in ages.
I'm curious, how would fiddling with navigation impede text extraction?
The site renders perfectly fine without JavaScript, and the markup looks straightforward enough.
I saw this and thought it might be something my company could use so I contacted them for info.
The CEO Simon Mahony basically told me to piss off when I told him I thought the site was misleading since there was no product or service to directly purchase. They make custom developed software that you must pay their consultants to integrate. I would not do business with such a company that acts so unprofessionally even if they have a decent team.
On a personal project, I had a good experience extracting PDF text using Tabula[1]. You specify the bounding boxes where desired data is, and it spits out the content it finds.
It still hits the issues mentioned in this article (surprise spaces appearing in middle of words, etc)
There's also camelot in Python [1]. Discovered it on HN [2]. Still a decent amount of manual work afterwards, though it's probably unreasonable to expect otherwise.
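A short example of camelot usage for anyone curious (the file name, page range, and flavor are just placeholders):

    import camelot

    tables = camelot.read_pdf("statement.pdf", pages="1-3", flavor="lattice")
    for table in tables:
        print(table.parsing_report)   # per-table accuracy / whitespace metrics
        df = table.df                 # pandas DataFrame of the extracted cells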
Does anybody regularly use Acrobat's text extraction engine? I've had fine results as far as accuracy goes when compared to other OCR engines, but one sticking point drives me nuts. My problem is, and I'm typically doing this in batches of thousands of files: if a PDF has a footer applied, Acrobat sees that as renderable text and blows off the rest of the page. I've tried all manner of sanitizing, removing hidden information, and saving as another PDF standard, and I still can't get around the plain-text footers/headers. In a perfect world I'd have unlimited Tesseract or ABBYY access, but we're trying to do this on the cheap, and I'm working with client data that I don't want to bang through Google. I'll have to poke at some of the open source tools mentioned so far, too.
14 years ago I used the personal edition of Abbyy FineReader to OCR about 400,000 scanned journal articles. It took me a few months.
The workflow was:
- Extract the page images as TIFF, and store the page ranges so I could map the page ranges back to the individual articles afterward.
- Concatenate a range of images into one big file, with an upper limit of (IIRC) about 4000 pages. FR would start to generate weird errors when I made the files any bigger than this.
- Run OCR over the giant 4000 page file.
- Export the result as one big PDF with OCR text layer under the scanned pages.
- Split the PDF back into individual PDF files corresponding to articles, using the data I saved in step 1.
- Optimize the individual PDF article files for compact storage, using the Multivalent [1] optimizer.
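For what it's worth, the splitting step (5) is easy to reproduce today with pypdf instead of PDFtk; `ranges` here is assumed to map an article id to its 1-based first and last pages in the big OCR'd file:

    from pypdf import PdfReader, PdfWriter

    def split_articles(big_pdf, ranges, out_dir="."):
        reader = PdfReader(big_pdf)
        for article_id, (first, last) in ranges.items():
            writer = PdfWriter()
            for i in range(first - 1, last):       # convert the 1-based range to 0-based indices
                writer.add_page(reader.pages[i])
            with open(f"{out_dir}/{article_id}.pdf", "wb") as fh:
                writer.write(fh)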
I did this with a combination of FineReader -- the only paid software -- Python, Multivalent, AutoHotKey, and PDFtk.
I was living on a grad student stipend at the time so I optimized for spending the least amount of cash possible, at the cost of writing my own automation to replace the batch processing found in more expensive editions of FineReader.
The most time consuming part was dealing with weird one-off errors thrown by FR's OCR engine. I had to resolve them all manually. They were too varied and infrequent to be worth automating away.
I tried Acrobat's own OCR too before I resorted to FineReader, but it was pretty terrible. At the time it also appeared to make the PDF files significantly larger, which was weird since a text layer shouldn't take much additional storage.
It's interesting to see other views of PDF. As someone who lives in Illustrator, ripping every little piece of data out of a PDF to import into an Illustrator or InDesign file, then making a production PDF for large-format printing and fixing plenty of issues along the way, I find the text almost inconsequential to the whole thing. It's just another element among many: images, vector illustrations, etc. PDF might not be the best way to pass along pure text, but as a container for graphical representation it works pretty well. I build PDF files describing 20 ft walls with 1+ gigabyte images, complex vector illustrations, and finely formatted text, and it all prints out damn close to how I planned it, down to exact colors that match specific Pantone swatches. It's amazing what can be packed into a PDF...
You are probably working directly with native formats embedded in PDF without even processing the visualized elements. Adobe tools like to do that.
Sometimes, publishers make their PDF e-books from printed source in which images are “optimized” to low quality JPEGs, but next to them non-display Photoshop data streams with pristine megapixel illustrations are kept. If you catch big PDF files, check their insides, it's one line of `mupdf extract`.
Are there 30- to 40-year-old application formats that you think have done a better job adapting to new needs and 4 to 6 orders of magnitude improvements in the systems they run on?
I think TIFF has a number of advantages there. It was from the beginning an interchange format, so it had the opportunity to look at a bunch of existing formats and extract the commonality. It's also not an application format, so the pace of change is slower and more controlled; it can trail rather than lead. And it is of course a standard, which means a different set of dynamics around how things get added and how clear the specs have to be.
That's not to say it isn't great; I could well believe it. But I'm just not shocked that PSD and PostScript have ended up being a bit of a mess over the decades. I doubt I could have done any better.
The one that hits me all the time is trying to reference the OpenGL and OpenGL ES spec PDFs. The last numbered section of the specs contains state tables in landscape layout, versus the rest of the spec in portrait layout. Neither Chrome's nor Firefox's reader will search the text in these tables, which I need to reference often.
The fact that text might be oriented differently wasn't covered in the article. IIRC Preview on Mac might search there (I'm not near my Mac at the moment to check).
For academic papers: GROBID [0] is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications.
I had problems with copy-pasting Chinese text from PDFs before. The characters would come out as Kangxi radicals, rather than Traditional Chinese characters. They look the same, but are different code points!
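If it helps anyone hitting the same thing: the Kangxi radical code points carry compatibility decompositions to the corresponding unified ideographs, so NFKC normalization repairs that particular paste problem:

    import unicodedata

    pasted = "\u2F08"                             # KANGXI RADICAL MAN, looks just like 人
    fixed = unicodedata.normalize("NFKC", pasted)
    print(fixed == "\u4EBA")                      # True: mapped to the unified ideograph 人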
I've worked on the other end of this: trying to make it easy to extract text from PDFs that we generated. Turns out that is pretty hard too. There just isn't a good way to include metadata about how text flows. So columns, callouts, captions, etc. all cause problems. The PDF format just wasn't designed for text extraction.
On the other hand... OCR is now so good that it can be used for many PDF text extraction projects. Often there is no longer any need to bother with PDF internals: just screenshot the PDF document and parse that. A free PDF OCR service is, for example, ocr.space.
I'm still unclear on why SumatraPDF bothered implementing this anti-user copy prevention feature, it's really annoying to have to actually break out separate tooling to strip the flags on datasheets and schematics.
I used to work on PDF extraction during my bachelor's thesis, analyzing German law texts. The most fun part was that the text came in two columns. Sometimes the extraction worked in the correct order; sometimes two lines from two columns were recognized as one line. In the end I implemented something like the algorithm described in chapter 4.4 here: https://www.dbai.tuwien.ac.at/staff/hassan/pdf2html/final.pd...
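Not the algorithm from the linked paper, but a toy version of the idea: if a "line" coming out of the extractor contains one horizontal gap much wider than the typical word gap, treat it as two column fragments. Word tuples here are assumed to be (text, x0, x1):

    def split_columns(words, gap_factor=3.0):
        words = sorted(words, key=lambda w: w[1])                     # sort by left edge
        gaps = [(words[i + 1][1] - words[i][2], i) for i in range(len(words) - 1)]
        if not gaps:
            return [words]
        widest, at = max(gaps)
        typical = sorted(g for g, _ in gaps)[len(gaps) // 2]          # median word gap
        if typical > 0 and widest > gap_factor * typical:
            return [words[:at + 1], words[at + 1:]]                   # split at the column gutter
        return [words]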
I've been wrestling with a similar set of tasks, and have arrived at a similar set of tools and options.
How you process PDF depends greatly on the scale at which you're working with documents. For large-volume, high-speed processing, automation is necessary. Where you're translating a more stable corpus, human input may be tractable. The ability to look at source PDF, OCR, and an edited text version to correct for errors seems a part of that workflow.
Often it's possible to get close or approximate transcription using standard tools. I've found the Poppler library's "pdftotext" remarkably good with many PDFs, so long as there's some text within them: https://poppler.freedesktop.org
There's a general concept I've been working toward of a minimum sufficient document complexity, which follows a rough (though not strict) hierarchy. It's remarkable how much online content is little more than paragraph-separated text, with no further structure. Even images are not strictly informational, but rather window-dressing.
Typically, additional elements added are hyperlinks, images, text emphasis (italic and bold, often only the first), sections, lists, blockquotes, super- and sub-script, in roughly that order.
(A study looking at the prevalence of specific semantic HTML elements within a corpus would be ... interesting.)
Then there are the elements NOT natively supported in HTML: equations, endnotes/footnotes, tables of contents, etc.
It seems to me there should be an analogue of Kolmogorov complexity for the layout of textual documents. That is: there is a minimum necessary and sufficient level of markup (perhaps: number, type, and relationship of elements) required to lay out a specific work.
I've tagged out novel-length books in Markdown with little more than the occasional italic and chapter marks.
Documents which use more markup than is required are overspecified. This is the underlying problem with a great deal of layout, and the ability to reduce texts to their minimum complexity would be useful. It's a nontrivial problem, though large swathes of it should be reasonably achievable.
Another approach would be for information-exchange formats to actually be, you know, information exchange formats rather than PDF.
(Though the latter is often, though not always, well-suited to reading.)
We're working on a number of fun problems like this over at PDFTron in Vancouver! Currently growing and looking for software devs in a few different areas. ltully(at)pdftron.com
Not exactly an industrial solution, but for common types of text documents used by common people (i.e. books), k2pdfopt performs a lot of that magic under the hood.
I wouldn't bother with parsing the pdf. Directly reading from pixels can be more accurate than the parsed output, but will require some R&D. You'll need very high recall text detection and an accurate algorithm for OCR. And a lot of real documents as training data.
It's critical that the training data is good quality and much of the engineering effort should go into good annotation interfaces. We built an end-to-end system for all this at evolution.ai. Please email me if interested in an off-the-shelf solution. martin@evolution.ai.
An order of magnitude increase of time is very significant. If you're just processing a few documents with a lot of human oversight you may be right, but it's definitely not a generalised best approach, at least going by the article.
Too true! That's probably why all of the agencies near me wanted .doc files so they could scrape them and remove my address to insert themselves as middlemen with the aim of holding both employer and prospective employee hostage to their bounties.
I want to read the article, but it's pointless because PDF doesn't handle anything but ASCII chars well. Add some Asian languages and there is no way to get that text back.
What do you mean? As I understand it, it depends on the font: you can provide any sort of encoding. So Unicode is there; I don't see how that would be harder than with the Latin alphabet (which is still a hard problem, as per the article).
So all through this I’m thinking “just OCR it and be done”, and we get to:
> Why not OCR all the time?
> Running OCR on a PDF scan usually takes at least an order of magnitude longer than extracting the text directly from the PDF.
... so? Google can OCR video and translate it in something that feels like real-time; what PDF processing are they doing that is so performance bound?
> Difficulties with non-standard characters and glyphs
> OCR algorithms have a hard time dealing with novel characters, such as smiley faces, stars/circles/squares (used in bullet point lists), superscripts, complex mathematical symbols etc.
Sure, but more than the random shit you find in PDFs anyway?
> Extracting text from images offers no such hints
Finding an algorithm that approximates how a human approaches a page layout doesn’t feel like it would be all that hard.
Obviously it’s very easy to stand on the sidelines and throw stones, but parsing PDFs using anything other than OCR + some machine learning models to work out what the type of a piece of text feels like pretending we are still constrained by the processing costs of 5 years ago
1) Do you have a trillion or so dollars at your beck and call? If not, you're not Google.
> Finding an algorithm that approximates how a human...
2) ...is generally nigh impossible even for someone with Google's resources (e.g. Waymo, although when it comes to reading, it's somewhat usable). Also, look at 1)
Unless by approximate you mean toddler level. In that case:
Without having heard of or tested the solution, I'll bet anyone $1M that I can produce an image that produces an incorrect answer. Which would mean it's not "solved".
If I can produce an image that you incorrectly label as Bird or No Bird, does that mean it's accurate to say you cannot tell me if pictures have birds in them? Or is that needlessly pedantic beyond any practical use case and clearly the intended context?
"Doing better than me", or any other human, wasn't the problem proposed. Anything other than 100% accuracy means the problem isn't solved as there will always be room for a better solution.