PDF is a pretty interesting format. The spec is actually a great read. It's amazing how many features they've needed to add over the years to support everyone's use cases.
It's a display format that doesn't have a whole lot of semantic meaning for the most part. Often every character is individually placed so even extracting words is a pain. It's insane that OCR (which it sounds like this uses) is the easiest way to deal with extraction.
I highly recommend having a look inside a couple of pdfs to see how they look. I've posted about this before but the trick is to expand the streams.
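If you want to poke around yourself, here's a rough Python sketch that inflates the FlateDecode streams of a simple, unencrypted PDF so the content streams become readable (proper tools such as qpdf's QDF mode handle the general case):

    import re
    import zlib

    # Rough sketch: inflate FlateDecode streams so the content streams
    # become readable. Assumes a simple, unencrypted file; real tools
    # handle the general case properly.
    with open("paper.pdf", "rb") as f:
        data = f.read()

    # Grab the raw bytes between the stream/endstream keywords.
    for i, m in enumerate(re.finditer(rb"stream\r?\n(.*?)endstream", data, re.DOTALL)):
        try:
            text = zlib.decompress(m.group(1)).decode("latin-1")
            print(f"--- stream {i} ---")
            print(text[:500])
        except zlib.error:
            pass  # not FlateDecode (e.g. a JPEG image stream)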
Sometimes there's real text broken into lines, and then you're pretty good. Often it's a subset font, where the character codes used in the content stream don't correspond to their visual glyphs. When that happens there might be a Unicode map that tells you which internal character maps to which external character (viewers use it for copying text out of the PDF). Sometimes that's missing and you can rebuild it from other encoding information attached to the font. Other times you can't recover the relationship, and all you're left with is the randomly selected codes for each character; at that point it's not too dissimilar from a simple Enigma machine problem, I guess.
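To give a concrete feel for that Unicode map: a ToUnicode CMap is a small PostScript-like stream, and a minimal (deliberately incomplete) way to read its bfchar entries looks something like this; real maps also use bfrange entries, so treat it as illustrative only:

    import re

    # Minimal sketch: parse only the bfchar form of a ToUnicode CMap
    # (already decompressed). Real CMaps also use bfrange entries and
    # can map one code to several characters.
    def parse_bfchar(cmap_text: str) -> dict[int, str]:
        mapping = {}
        for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.DOTALL):
            for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
                # Destination is UTF-16BE hex, e.g. <0066> -> "f".
                mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
        return mapping

    # Text extraction is then a lookup per internal code:
    # "".join(cmap.get(code, "\ufffd") for code in glyph_codes)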
I sometimes see documents in which the same font has been subset again and again, once for each word. If you have a Unicode map for each one, that's fine; if not, it's not going to be much fun. In that case every word is going to look like random characters in the PDF, and those character-glyph relationships are going to change from word to word.
Other times the glyphs are rendered as vector paths at write time, and you're down to trying to identify the character from the outline of the shape. I deal with this a lot and there are common patterns, but normally each glyph will itself be broken into several bits, so you have to work out which bits go with which glyphs before you even start.
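As a toy illustration of that first grouping step (the clustering rule and names here are mine, not a description of any real pipeline), you can merge subpath bounding boxes that overlap horizontally, so the dot of an 'i' or the counter of an 'o' ends up with its base shape:

    def group_subpaths(boxes):
        """boxes: (x0, y0, x1, y1) bounding boxes of outline subpaths."""
        clusters = []
        for box in sorted(boxes, key=lambda b: b[0]):   # left to right
            for cluster in clusters:
                cx0, cy0, cx1, cy1 = cluster["bbox"]
                if box[0] <= cx1 and box[2] >= cx0:     # horizontal overlap
                    cluster["parts"].append(box)
                    cluster["bbox"] = (min(cx0, box[0]), min(cy0, box[1]),
                                       max(cx1, box[2]), max(cy1, box[3]))
                    break
            else:
                clusters.append({"bbox": box, "parts": [box]})
        return clusters  # each cluster is one glyph candidate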
Does anyone have a technical paper to hand? We could crack it open and take a look for fun.
EDIT: If you're trying to reliably convert everything, in a way you're better off catering for the lowest common denominator so you can do it consistently. Here, that means assuming you don't have any raw text data to work with and you have to do everything with image recognition. Either way, it's a fun problem.
Some time ago I came to a similar conclusion: In most cases, the only way to properly process PDF files is to render them and work on the raster images.
I was involved in a project where we needed to determine the final size of an image in a PDF document.
This seemed simple: Just keep track of all transformation matrices applied to the image, then calculate the final size.
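In sketch form, that naive approach is just composing the matrices from the cm operators in force when the image's Do operator runs, and mapping the image's unit square through the result (matrices written as the usual PDF [a b c d e f] sextuple; the function names are only for illustration):

    def multiply(m1, m2):
        # Row-vector convention, as in the PDF spec: result = m1 . m2
        a1, b1, c1, d1, e1, f1 = m1
        a2, b2, c2, d2, e2, f2 = m2
        return (a1*a2 + b1*c2,      a1*b2 + b1*d2,
                c1*a2 + d1*c2,      c1*b2 + d1*d2,
                e1*a2 + f1*c2 + e2, e1*b2 + f1*d2 + f2)

    def image_bbox(cm_matrices):
        ctm = (1, 0, 0, 1, 0, 0)            # identity
        for m in cm_matrices:               # in the order the cm operators appear
            ctm = multiply(m, ctm)
        a, b, c, d, e, f = ctm
        # Image space is the unit square; map its corners to page space.
        corners = [(a*x + c*y + e, b*x + d*y + f) for x in (0, 1) for y in (0, 1)]
        xs, ys = zip(*corners)
        return min(xs), min(ys), max(xs), max(ys)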
But we underestimated the sheer complexity of PDF: The image could be a real image or an embedded EPS, which are completely different cases. The image could have inner transparency, but can also have an outer alpha mask applied by the PDF document. Then there are clipping paths, but be aware of the implicit clipping path that is always present: the page boundary. Oh, and an image may be overlapped by text, or even another image, in which case you need to do the same processing for that one, too. And so on.
After wasting lots of time accidentally almost rebuilding a PDF renderer, we decided to use an existing renderer instead.
It turned out the only feasible solution was to render the PDF twice, with and without the image, and compare the results pixel by pixel.
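The comparison step itself is the easy part once you have the two rasters; something along these lines, assuming both versions of the page were rendered to same-sized images:

    import numpy as np
    from PIL import Image

    # Assumes the page has already been rendered twice, with and
    # without the image, at the same resolution.
    a = np.asarray(Image.open("page_with_image.png").convert("RGB"))
    b = np.asarray(Image.open("page_without_image.png").convert("RGB"))

    changed = np.any(a != b, axis=-1)       # pixels the image actually affected
    ys, xs = np.nonzero(changed)
    if len(xs):
        print("visible image bbox (px):", xs.min(), ys.min(), xs.max(), ys.max())
    else:
        print("image is completely covered or clipped away")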
I'm afraid the modern web might develop in a similar direction.
This looks really cool and is badly needed. Our company would kill for a PDF-to-semantic-HTML algorithm (or service) too, using machine learning based on computer vision. Existing options just vomit enough CSS to match the PDF output, rather than marking it up into headings, tables and the like.
What I think would be a really nice killer app is using OCR to extract formulas directly into MATLAB code. It would be awesome for reproducibility studies, or just for people trying to implement algorithms for whatever reason.
This project seems to convert the PDF into an image before doing the semantic annotation, so it would work on scans as well. That doesn't give you the text, but it gets you halfway there. The other half can be done by passing the discovered regions into an OCR engine to pull out the text.
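As a sketch of that second half, assuming the layout model hands you pixel bounding boxes per region (the region format below is made up for illustration), you could crop each region out of the page raster and pass it to Tesseract via pytesseract:

    from PIL import Image
    import pytesseract  # needs the Tesseract binary installed

    page = Image.open("page_001.png")
    regions = [                                   # hypothetical detector output
        {"label": "heading", "bbox": (90, 120, 1100, 180)},
        {"label": "paragraph", "bbox": (90, 210, 1100, 900)},
    ]

    for region in regions:
        crop = page.crop(region["bbox"])
        text = pytesseract.image_to_string(crop)
        print(region["label"], "=>", text.strip()[:80])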
The one time I needed to turn a scanned PDF (600+ page book) into searchable text, I used this Ruby script https://github.com/gkovacs/pdfocr/ , which pulls out individual pages using pdftk, turns them into images to feed into an OCR engine of your choice (Tesseract seems to be the gold standard) and then puts them back together. It can blow up the file size tremendously, but worked well enough for my use case. (I did write a very special purpose PDF compressor to shrink the file back, but that was more for fun.)
I haven't had a chance to read through this completely yet, but I'm curious whether this method is agnostic to how the PDF was created originally (LaTeX, Adobe, scanned images). It reads like that doesn't matter (it treats the PDF as an image), but I wanted to make sure.
My experience with PDF is that it's a pretty open-ended, and thus pretty difficult, format to work with. There's not a whole lot in the way of "what is this thing supposed to be" semantics encoded into the spec. Even PDF-to-HTML is kind of a crapshoot.