This is why iPhone didn't initially ship with double-tap to zoom for PDF paragraphs (like it had for blocks on web pages). I know because I was assigned the feature, and I went over to the PDF guy to ask how I would determine on an arbitrary PDF what was probably a "block" (paragraph), and I got a huge explanation on how hard it would be. I relayed this to my manager and the bug was punted.
Edit: To add a little more color, given that none of us was (or at least certainly I wasn't) an expert on the PDF format, we had so far treated the bug as one of probably at-most moderate complexity (just have to read up on PDF and figure out what the base unit is or whatever). After discovering what this article talks about, it became evident that any solution we cobbled together in the time we had left would really just be signing up for an endless stream of it-doesn't-work-quite-right bugs. So, a feature that would become a bug emitter. I remember in particular considering one of the main use cases: scientific articles, which are usually set in two columns AND also use justified text. A lot of the time the spaces between words could be as large as the spaces between columns, so the statistical "grouping" of characters to try to identify the "macro rectangle" shape could get tricky without severely special-casing for this. All this being said, as the story should make clear, I put about one day of thought into this before the decision was made to avoid it for 1.0, so for all I know there are actually really good solutions to this. Even writing this now I am starting to think of fun ways to deal with this, but at the time, it was one of a huge list of things that needed to get done and had been underestimated in complexity.
> I know because I was assigned the feature, and I went over to the PDF guy to ask how I would determine on an arbitrary PDF what was probably a "block" (paragraph), and I got a huge explanation on how hard it would be.
The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext is probably comparable in difficulty to building level 5 self-driving cars, and would accrue at least as much profit to any company that can solve it. But there are hundreds of billions of dollars going into self-driving cars, and like zero dollars going into this problem.
What are the groups that would benefit most from the PDF-to-HTML conversion? Who are the customers that would drive this profit? I tried to phrase those questions so they don't sound contentious, but unfortunately they still might; I am genuinely curious about this space and who is feeling the lack of this technology most.
Almost any business that has physical suppliers or business customers.
PDF is the de facto standard for any invoicing, POs, quotes, etc.
If you solve the problem, you can effectively deal programmatically with invoicing/payments and large parts of ordering/dispensing. It's a no-brainer to add it on to almost any financial/procurement software that deals with inter-business stuff.
Any small-to-medium physical business can probably halve their financial department if you can dependably solve this issue.
A business that invests in building a machine that reads data, produced by a 3rd-party machine, using a format intended for lay humans to read, is not investing in the right tech IMO.
Small-mediums should be looking to consolidate buying through a few good suppliers and working with them directly to automate processes, or adopting interchange formats.
The problem for some small businesses is that the cost (process changes, licensing, etc.) of adopting interchange formats and working with large vendors is prohibitive at their scale, e.g. the airline BSP system.
I agree that solving the problem generally, i.e. replacing an accounts-payable staff person capable of processing arbitrary invoice documents, will be comparable to self-driving in difficulty.
If a company deals with a lot of a single type of PDF, then the approach could be economical. I am actually involved in a project looking at doing this with AWS Textract.
> A business that invests in building a machine that reads data, produced by a 3rd-party machine, using a format intended for lay humans to read, is not investing in the right tech IMO.
Building machines that understand formats that are understood by humans is exactly what we should be doing. People should read, write, and process information in a format that is comfortable and optimized for them. Machines should bend to us; we should not bend to them.
If businesses only dealt with machine readable formats, everyone's computer would still be using the command line.
And there's real condescension in your post:
> Small-mediums should be looking to consolidate buying through a few good suppliers and working with them directly to automate processes
You're saying that businesses need to change their business to accommodate data formats, but it should be the other way around.
The proliferation of computers in business over the last 50 years is precisely because businesses can save money/expand capacity by adapting the business processes to the capabilities of the computers.
Over that time, computers have become more friendly to humans, but businesses have adapted and humans have been trained to use what computers can do.
Yes, most invoices are in PDF, but only about 40% of them are native PDFs, meaning actual documents rather than scanned images converted to PDF. There are also compound PDF invoices, which contain images. So, in order to extract data from them, one needs not only a good PDF parser but an OCR engine too.
This is a huge pet peeve of mine. Most invoices are generated on a computer (often in Word) but a huge fraction of the people who generate them don't know how to export to a PDF. So they print the invoice on paper, scan it back in to a PDF, and email that to you. Thus the proliferation of bitmap PDFs.
> So, in order to extract data from them, one needs not only a good PDF parser but an OCR engine too.
You can go further. Invoices often contain block sections of text with important terms of the invoice, such as shipping time information, insurance, warranties, etc. To build something that works universally, you also need very good natural language processing.
If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for? You can always just render an image of a document and then use that.
> If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for?
This should be obvious, but the answer is because OCR engines are not terribly accurate. If you have a native PDF, you're far better off parsing the PDF than converting it to an image and OCRing. But if OCR ever becomes perfect, then sure.
> The market SOTA Abbyy is far from being accurate.
While Abbyy is likely the best, it's also incredibly expensive. Roughly on the order of $0.01/page or maybe at best a tenth of that in high volume.
For comparison, I run a bunch of OCR servers using the open source tesseract library. The machine-time on one of the major cloud providers works out to roughly $0.01 for 100-1000 pages.
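For anyone who wants to see roughly what that kind of pipeline looks like, here is a minimal sketch using only open-source CLI tools (poppler-utils' pdftoppm plus tesseract; file names are placeholders, and this is an illustration rather than the setup described above):

    # rasterize each page of the PDF at 300 DPI (poppler-utils)
    pdftoppm -r 300 -png scanned.pdf page
    # OCR every rendered page; tesseract writes a .txt next to each image
    for img in page-*.png; do
        tesseract "$img" "${img%.png}" -l eng
    done
    # stitch the per-page text back together
    cat page-*.txt > scanned.txt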
So I have a lot of experience with basically the same problem just from working on this: https://www.prettyfwd.com. As an example of the opportunity size just in the email domain, the amount of personal non-spam email sent every day is like 100x the total size of Wikipedia, but nothing is really done with any of this information because of this challenge. Basically applications are things like:
- Better search engine results
- Identifying experts within a company
- Better machine translation
- Finding accounting fraud
- Automating legal processes
For context, the reason why Facebook is the most successful social network is that they're able to turn behavioral residue into content. If you can get better at taking garbage data and repackaging it into something useful, it stands to reason that there are lots of other companies the size of Facebook that can be created.
I often ponder how much of the "old world" will get "digitized" — translated into digital form, bits. And how much will just disappear. The question might seem trivial if you think of books, but now think of architecture, language itself (as it evolves), etc.
There's almost no question in my mind that most new data will endure in some form, by virtue of being digital from day 1.
The endgame for such a company, imho, is to become the "source entity" of information management (in abstracted form), whose two major products are one that expresses this information in the digital space and another that expresses it in the analog/physical space. You may imagine variations of both (e.g. AR/VR for the former).
Kinda like language in the brain is "abstract" (A) (concept = pattern of neurons firing) and then speech "translates" into a given language, like English (B) or French (C) (different sets of neurons). So from A you easily go to either B or C or D... We've observed that Deep Learning actually does that for translation (there's a "new" "hidden" language in the neural net that expresses all human languages in a generic form of sorts, i.e. "A" in the above example).
The similarities between the ontology of language and the ontology of information in a system (e.g. a business) are remarkable — and what you want is really this fundamental object A, this abstract form which then generates all possible expressions of it (among which a little subset of ~1,000 gives you human languages, a mere 300 active iirc; and you might extend that into any formal language fitting the domain, like engineering math/physics, programming code, measurements/KPIs, etc.).
It's a daunting task for sure but doable because the space is highly finite (nothing like behavior for instance; and you make it finite through formalization, provided your first goal is to translate e.g. business knowledge, not Shakespeare). It's also a one-off thing because then you may just iterate (refine) or fork, if the basis is sound enough.
I know it all sounds sci-fi but having looked at the problem from many angles, I've seen the PoC for every step (notably linguistics software before neural nets was really interesting, producing topological graphs in n dimensions of concepts e.g. by association). I'm pretty sure that's the future paradigm of "information encoding" and subsequent decoding, expression.
It's just really big, like telling people in the 1950's that because of this IBM thing, eventually everybody will have to get up to speed like it's 1990 already. But some people "knew", as in seeing the "possible" and even "likely". These were the ones who went on to make those techs and products.
Digital data is arguably more fragile than analogue, offline, paper (or papyrus, or clay tablet) media. We have documents over 3000 years old that can still be read. Meanwhile, the proprietary software necessary to access many existing digital data formats is tied to obsolete hardware, working examples of which may no longer exist, emulators for which may not exist, and insufficient documentation may exist to even enable their creation. Just as one example, see the difficulty in enabling modern access to the BBC's 1986 Domesday Project.
Academics and other people that rely on scientific publications. Most of the world's knowledge in science is locked into PDFs and screenshots (or even pictures) of manufacturers' (often proprietary) software... So extracting it in a more structured way would be a win (so HTML may not be best). On a related note, I've seen people using Okular to convert PDF tables to a usable form (to be honest its table extraction tool is one of the best I've seen despite being pretty manual).
> What are the groups that would benefit most from the PDF-to-HTML conversion? Who are the customers that would drive this profit? I tried to phrase those questions so they don't sound contentious, but unfortunately they still might; I am genuinely curious about this space and who is feeling the lack of this technology most.
Legal technology. Pretty much everything a lawyer submits to a court is in PDF, or is physically mailed and then scanned in as PDF. If you want to build any technology that understands the law, you have to understand PDFs.
Organisations that have existing business processes to publish to print and PDF but now want to publish in responsive formats for mobile or even desktop web.
Changing their process might be more expensive than paying a lot of money for them to carry on as is for a few more years while getting the benefit of modern eyes on their content.
Edit: concrete example would be government publications like budget narrative documents.
I’ve done a bunch of this work myself and while it’s a bit of a pain to do in general, you can make some reasonable attempts at getting something workable for your use cases.
PDFs are incredibly flexible. Text can be specified in a bunch of ways. Glyphs can be defined to the nth degree. Text sometimes isn’t text at all. There’s no layout engine and everything is absolutely positioned. Fonts in PDFs are insane because they’re often subset so they only include the required glyphs, and the characters are remapped back to 1, 2, 3 etc. instead of the usual ASCII codes.
> Fonts in PDFs are insane because they’re often subset so they only include the required glyphs, and the characters are remapped back to 1, 2, 3 etc. instead of the usual ASCII codes.
I've actually seen obfuscation used in a PDF where they load in a custom font that changes the character mapping, so the text you get out of the PDF is gibberish, but the fonts displayed on rendering are correct (a simple character substitution cipher).
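A blunt way to catch that kind of remapped-font gibberish, sketched with stock poppler and tesseract tools (file names are placeholders; this isn't foolproof, just a sanity check): render the page, OCR it, and compare against the extracted text layer.

    # the text layer as the PDF reports it (page 1 only)
    pdftotext -f 1 -l 1 suspect.pdf extracted.txt
    # what the page actually shows, via rendering + OCR
    pdftoppm -singlefile -f 1 -r 300 -png suspect.pdf page
    tesseract page.png ocr -l eng
    # a wildly different result suggests the font's character mapping is scrambled
    diff -u extracted.txt ocr.txt | head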
The important thing to remember whenever you think something should be simple, is that someone somewhere has a business need for it to be more complicated, so you'll likely have to deal with that complication at some point.
Your website demo video is impressive and I can imagine there are many businesses that would save a lot of time and man-hours by incorporating a solution like this.
I've often thought about creating products like these but as a one-man operation I am daunted by the "getting customers" part of the endeavour. How do you get a product like this into the hands of people who make the decisions in a business? (For anyone, not just OP). PPC AdWords campaigns? Cold-calling? Networking your ass off? Pay someone? Basically, how does one solve the "discoverability problem"?
Surprisingly, Hacker News has been our number one source of leads. We tried Google Ads and Reddit Ads, but the signup rate was literally three orders of magnitude lower than organic traffic from Hacker News and Reddit.
Is your product only on the cloud? My privacy/internet security team won't let me use products that save customer or vendor data on the cloud because you might get hacked. Only giants, like Microsoft, have been approved after an evaluation.
More than half of our customers have asked to be able to skip our cloud and go directly to their database. We’re working on this right now. It’s scheduled to be released this week, so keep an eye open.
In the meantime, if you have any questions, feel free to send me an email at siftrics@siftrics.com. I’d love to hop on the phone or do a Zoom meeting or a Google Hangouts.
We do table recognition and pride ourselves on being better at it than ABBYY. We can handle variable number of rows in a table and we take that into account when determining the position of other text on the page.
Feel free to email me at siftrics@siftrics.com with any questions. We can set up a phone call, Zoom meeting, or Google Hangouts too, if you’d like.
Gini GmbH performs document processing for almost all German banks and for many accounting companies. For banks it does realtime invoice photo processing -- OCR and extraction of amount, bank information, receiver etc. For accounting it extracts all kind of data from a PDF. Unfortunately, only for German language market. But here you go, ABBYY by far is not the only one. In fact ABBYY does only OCR and has some mediocre table detection. That's it.
I do not remember which of the two it was, but 'poppler' or 'pdfbox' (they may use the same backend) created great HTML output, with absolute positions. They also have an XML mode, which is easily transformed.
Of course, there is absolutely no semantics, just display.
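For reference, the poppler incarnation of this is pdftohtml's XML mode, which emits one element per text fragment with absolute coordinates (a quick sketch; the exact element and attribute names can vary between versions):

    # pdf2xml-style output: <text top=".." left=".." width=".." height=".." font="..">...</text>
    pdftohtml -xml -i input.pdf output.xml
    # peek at the positioned fragments
    grep '<text ' output.xml | head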
That’s actually often just a consequence of the subsetting (I think). Believe it or not, you can often rebuild the cmaps using information in the pdf to fix the mapping and make the extraction work again.
> That’s actually often just a consequence of the subsetting (I think).
I would believe that. It was a pretty poor obfuscation method as they go, if it was intended for that.
> Believe it or not, you can often rebuild the cmaps using information in the pdf to fix the mapping and make the extraction work again.
Oh, I did. That's the flip side of my second paragraph above. When there's a business need to work around complications or obfuscations, that will also happen. :)
> PDFs are incredibly flexible. Text can be specified in a bunch of ways. Glyphs can be defined to the nth degree. Text sometimes isn’t text at all. There’s no layout engine and everything is absolutely positioned.
Can't stress this enough. The next time you open a multi-column PDF in adobe reader and it selects a set of lines or a paragraph in the way you would expect, know that there is a huge amount of technology going on behind the scenes trying to figure out the start and end of each line and paragraph.
> The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext is probably comparable in difficulty to building level 5 self-driving cars, and would accrue at least as much profit to any company that can solve it. But there are ... like zero dollars going into this problem.
Converting PDFs to HTML well is a very hard problem, but it's hard to build a very big company on that alone. When processing PDFs or documents generally, the value is not in the format, it's in the substantive content.
The real money is not in going from PDF to HTML, but from HTML (or any doc format) into structured knowledge. There are plenty of companies trying to do this (including mine! www.docketalarm.com), and I agree it has the potential to be as big as self-driving cars. However, technology to understand human language and ideas is not nearly as well developed as technology to understand images, video, and radar (what self-driving cars rely on).
The problem is much more difficult to solve than building safer-than-human self-driving cars. If you can build a machine that truly understands text, you have built a general AI.
There's a lot more than zero dollars going into this... it's just that the end result is universally something that's "good enough for this one use-case for this one company" and that's as far as it gets.
Not really, it's just a different set of challenges. The original article sums it up well, in terms of a lack of text-order hints. I haven't really tried incorporating OCR approaches at all, but I suspect they could probably be used to detect hidden text.
The basic issue imho is that NLP algorithms are very inaccurate even with perfect input; e.g. even with perfect input, they're maybe only 75% accurate. And even a text-processing algorithm that's like 99.9% accurate will yield input to your NLP algorithms that's like 50% accurate, so any results will be mostly unusable.
NLP algorithms are just fine. It is the combination of regexes, NLP and deep learning that allows you to achieve good extraction results. So, basically OCR / pdf parser -> jpeg/xml/json -> regexes + NLP / DL extractor.
Semantic segmentation to identify blocks and OCR to convert to text - I think OneNote is already doing that. PDF is a horrible format for representing text, though PostScript is even worse.
“The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext is probably comparable in difficulty to building level 5 self-driving cars, ”
Since you can always print a PDF to a bitmap and use OCR, I assume you're implicitly asking for something that does substantially better. How much better, and why?
> The funny thing is that creating a universal algorithm to convert PDFs and/or HTML to plaintext...would accrue at least as much profit [as self-driving cars] to any company that can solve it.
Can you explain a bit more about why this is so valuable? I don't know anything about this industry.
Does this mean some abstraction is lost between the creation phase and final "save to pdf" phase? It'd seem ridiculous to not easily be able to track blocks while it's a WIP.....
I don't know if PDF has truly evolved from its desktop publishing origins, but it is a terrible format because it no longer contains the higher-level source information that you would have in an InDesign or a LaTeX file. PDF/PostScript were meant to represent optical fidelity and thus are too low-level an abstraction for a lot of end-user, word-processing tasks (such as detecting layout features), so trying to reverse engineer the "design intent" from them feels like doing work that is unnecessarily tedious. But that's the way it seems to be, given the popularity of the format.
Where to center is only one vector; the other is how much to zoom: ideally it’s such that the text block fits on the screen. But again, that requires knowing the bounds of the text block. Zooming by a constant wherever you tap is a much less useful feature for text (vs. a map, for instance), but I think it’s what we defaulted to (can’t remember if it was that or just nothing).
One of the main features of the product I work on is data extraction from a specific type of PDF. If you want to build something similar these are my recommendations for you:
- Use a good text parsing library. Regexes are probably not enough for your use case. In case you are not aware of the limitations of regexes you may want to learn about Chomsky hierarchy of formal languages.
Here is the section of our Dockerfile that builds pdf2json for those of you that might need it:
# Download and install pdf2json
ARG PDF2JSON_VERSION=0.70
RUN mkdir -p $HOME/pdf2json-$PDF2JSON_VERSION \
&& cd $HOME/pdf2json-$PDF2JSON_VERSION \
&& wget -q https://github.com/flexpaper/pdf2json/releases/download/$PDF... \
&& tar xzf pdf2json-$PDF2JSON_VERSION.tar.gz \
&& ./configure > /dev/null 2>&1 \
&& make > /dev/null 2>&1 \
&& make install > /dev/null \
&& rm -Rf $HOME/pdf2json-$PDF2JSON_VERSION \
&& cd
I thoroughly enjoyed both the blog post (as an accessible but thorough explanation of your experience with PDF data extraction) and the linked news article [0], an all-too-familiar story of a company realizing that a creative person is using their freely-available data in novel and exciting ways and immediately requesting that they shut it down, because, faced with the perceived dichotomy of maintaining control versus encouraging progress, they will often err on the safe side.
pdf2json font names can sometimes be incorrect, as it only extracts them based on a pre-set collection of fonts. I suggest using this fork that fixes it:
Bounding boxes can also be off with pdf2json. Pdf.js does a better job but has a tendency to not handle some ligatures/glyphs well, sometimes transforming a word like "finish" into "f nish" (eating the "i" in this case). pdfminer (Python) is the best solution yet, but a thousand times slower...
I worked on an online retailer's book scan ingestion pipeline. It's funny because we soon got most of our "scans" as print-ready PDFs, but we still ran them through the OCR pipeline (that would use the underlying pdf text) since parsing it any other way was a small nightmare.
I am an ML engineer at one of the PDF extraction companies processing thousands of invoices and receipts per day in realtime. Before we started adding ML, all our processing logic was built on top of hundreds of regexes and gazetteers. Even now, handcrafted rules are the backbone of our extraction system, with ML used as a fallback.
Yes, regexes accumulate tech debt and become a maintenance black hole, but if they work, they are faster and more accurate than any fancy DL tech out there.
> Use a good text parsing library. Regexes are probably not enough for your use case. In case you are not aware of the limitations of regexes you may want to learn about Chomsky hierarchy of formal languages.
Most programming languages offer a regex engine capable of matching non-regular languages. I agree though, if you are actually trying to _parse_ text then a regex is not the right tool. It just depends on your use case.
For simple cases, I've also found "pdftotext -layout" useful. For a quick one-off job, this would save someone the trouble of assembling the lines themselves.
I have used this in the past to extract tables, but it doesn't help much in cases where you need font size information.
I’m a contractor. One of my gigs involved writing parsers for 20-something different kinds of pdf bank statements. It’s a dark art. Once you’ve done it 20 times it becomes a lot easier. Now we simply POST a pdf to my service and it gets parsed and the data it contains gets chucked into a database. You can go extremely far with naive parsers. That is, regex combined with positionally-aware fixed-length formatting rules. I’m available for hire re. structured extraction from PDFs. I’ve also got a few OCR tricks up my sleeve (eg for when OCR thinks 0 and 6 are the same)
Many years ago, I regularly had to parse specifications of protocols from various electronic exchanges. The general approach I used was to do a first pass using a Linux tool to convert it to text: pdftotext. Something like:
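    # reconstructed example -- the exact command wasn't preserved in this thread
    pdftotext -layout spec.pdf spec.txt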
After that, it was a matter of writing and tweaking custom text parsers (in python or java) until the output was acceptable, generally an XML file consumed by the build (mainly to generate code).
A frequent need was to parse tables describing fields (name, id, description, possible values etc.). Unfortunately, sometimes tables spanned several pages and the column width was different on every page, which made column splitting difficult. So I annotated page jumps with markers (e.g. some 'X' characters indicating where to cut).
As someone else said, this is like black magic, but kind of fun :)
I worked for an epub firm that used a similar approach a while ago - we took PDFs and produced Flash (yes, that old) versions for online, and created iOS and Android apps for the publisher.
I've come across most of the problems in this post but the most memorable thing was when we were asked to support Arabic, when suddenly all your previous assumptions are backwards!
Oh my goodness, this whole thread is deja vu from some code I wrote to parse my bank statements. I arrived at exactly the same solution of "pdftotext -layout" followed by a custom parser in Python. And ran into the same difficulty with tables: I wrote a custom table parser that uses heuristics to decide where column breaks are.
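Not that commenter's Python parser, but the core heuristic can be sketched in a line or two of shell: treat runs of two or more spaces in the -layout output as column separators (the file name and column count here are made up):

    # -layout preserves the visual column positions; runs of 2+ spaces act as separators
    pdftotext -layout statement.pdf - |
        awk -F'  +' 'NF >= 3 { printf "%s | %s | %s\n", $1, $2, $NF }'

The pain starts, as described above, when a cell is empty or a column shifts between pages, because the run of spaces then swallows or moves the boundary.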
I work in the print industry and some clients have the naive idea they'll save money by formatting their own documents (naive because usually this just means a lot more work for us, which they end up paying for).
We need some metadata to rearrange and sort PDF pages for mailing and delivery (such as name, address, and start/end page for that customer).
Our general rule is you provide metadata in an external file to make it easy for us. Otherwise, we run pdftotext and hope there's a consistent formatting for the output (e.g. every first page has "Issue Date:", "Dear XYZ,", or something written on it).
If that doesn't work then we're re-negotiating. It is not too difficult usually to build a parser for one family of PDF files based on a common setup as you've said and you get to learn various tricks. It is very difficult though to write a general parser.
Personally, I found parsing postscript easier since usually it was presented linearly.
I can co-sign this methodology. I used to work in an organization that built PDFs for accounting and licensing documentation. I used a proprietary tool (Planetpress :( ) to generate the documents, using metadata from a separate input file (CSV or XML) to determine what column maps to what field.
The good thing about this was, as you have already outlined, that it allowed for some flexibility in what was acceptable input data. For specific address formats or names we could accept multiple formats as long as they were consistent and in the proper position in the input file.
Regarding renegotiating: We didn't get that far. However, if a customer within our organization was enlisting our expertise and could not produce an acceptable input file, then we would go back to them and explain the format that we require in order to generate the necessary documents. Of course, creating our document through our data pipelines is obviously the better choice, but this was not an option in some cases at the time.
As far as doing the work of creating these documents in a tool like Planetpress is concerned, well, don't use Planetpress. You are better off doing it with your favorite language's libraries, tbh. Nothing worse than having to use proprietary code (Presstalk/PostScript) that you have to learn and will never be able to use anywhere else.
By re-negotiating I mean in terms of quoting billable hours. A rule of thumb for a typical Postscript scraper was around 20 hours end to end (dev, testing, and integration into our workflow system).
The problem we have with a lot of client files is that they look fine but printers don't care about "look fine", they crash hard when they run out of virtual memory due to poor structure. And usually without a helpful error message, so that's more billable hours to diagnose. The most common culprit is workflows that develop single document PDFs then merge them resulting in thousands of similar and highly redundant subset fonts.
Any tricks for decimal points versus noise? It's a terrifying outcome, and all I've got is doing statistical analysis on the data you've already got and highlighting "outliers".
For something like bank statements, I'd use the rigidly-defined formatting (both number formatting and field position) to inform how to interpret OCR misfires. My larger concern then would be missing a leading 1 (1500.00 v 500.00), but checking for dark pixels immediately preceding the number will flag those errors. And I suppose looking for dark pixels between numbers could help with missed decimals too.
I've done this a bit. I define ranges per numeric field and if it exceeds or is below that range, I send it to another queue for manual review. Sometimes I'll write rules where if it's a dollar amount that usually ends ".00" and I don't read a decimal but I do have "00", then I'll just fix that automatically if it's outside my range.
(Novice speaking) Maybe there's something about looking for the spacing / kerning that is taken up by a decimal point? (Not sure if OCR tools have any way to look for this)
OCR tricks? Assuming post-processing dev stuff - may I know your OCR engine? We are supported with Kofax and OpenText, along with cloud engines like GVision as a backup.
I built such a service, but it is impossible to guarantee any reliable result. I ended up shutting it down.
The PDF standard is a mess, and the number of 'tricks' I've seen done is astonishing.
Example: to add a shade or border effect to text, most PDF generators simply add the text twice with a subtle offset and different colors. Result: your SaaS service returns every sentence twice.
Of course there were workarounds, but at some point it became unmaintainable.
I'd say exactly the opposite. PDF makes it easy to create a document that looks exactly the way you want it to, which seems to be all that most web designers want (witness all the sites that force a narrow column on a large screen and won't reflow their text properly on a small screen).
In a way it has. In my experience, there have been multiple times where a "generate PDF" requirement has come up, with the best viable solution being "develop it in HTML using standard tech" followed by "and then convert it to PDF".
The demand for automating text extraction is still very high — or at least it feels like it when you’re working around the clock to cater to 3 of your customers, only to wake up to 10 more the next day. We’re small but growing extremely quickly.
Everything. Insurance companies to fledgling AI startups.
It’s definitely harder to get government business because the sales process is so long and compliance is so stringent. That said, we are GDPR compliant.
Well, I am putting the finishing touches on a front end that allows extracting PDF text visually. It's also able to adjust when the PDF page size varies for a given document type. Once you build the extractor for a document type, it can run on a batch of PDFs and store to Excel or a database (or any other format).
I sense this tool facilitates and automates a lot of the 'dark art' you mention. Of course there are always difficult documents that don't fit exactly in the initial extraction paradigm, for those I use the big guns ...
I'd also be interested in a blog or any basic tips/examples! I totally understand you don't want to give too much away, but I'm sure HN would love to see it!
I remember writing one of my first parsers was for a pdf and I had to employ a similar methodology where I had to rely on regex and "positionally-aware fixed-length" formatting rules. I would literally chunk specific groups by the number of spaces they contained lol. I had to do very little manual intervention but, damn it all, it worked :D .
I've written similar code for investment banks, to extract financial reporting data from PDFs. It's shocking to think how much of the financial world runs on this kind of tin-cans-on-a-piece-of-string solution.
My first internship was at a small company that did PDF parsing and building for EU government agencies and it was really painful work but paid an absolute shitton.
Are you me? Wish that I had known the insertion-order trick, though it isn't straightforward to implement with the stack I was using at a previous gig (Tabula + naive parsing + Pandas data munging). I can expand on a few challenges I've run into when parsing PDFs:
# Parser drift and maintenance hell
Let's say that you receive 100 invoices a month from a company over the course of 3 months. You look over a handful of examples, pick features that appear to be invariant, and determine your parsing approach. You build your parser. You're associating charges from tables with the sections they're declared in, and possibly making some kind of classification to make sure everything is adding up right. It works for the example PDF or two you were building against. It goes live.
You get a call or bug report: it's not working. You try the new PDF they send you. It looks similar, but won't parse because it is--in fact--subtly different. It has a slightly different formatting of the phone number on the cover page, but is identical everywhere else. You change things to account for that. You retest your examples; they break. OK, two different formats, same month, same supplier. You fix it. Chekhov's Gun has been planted.
A month passes, it breaks. You inspect the offending pdf. Someone racked up enough charges they no longer fit on a page. You alter the parser to check the next page. Sometimes their name appears again, sometimes not, sometimes their next page is 300 pages away. It works again.
A few more months later, a sense of deja vu starts to set in. Didn't I fix this already? You start tracking three PDFs across 3 months:
pdf 1 : a -> b -> c (starts with format a, changes to be the same as pdf 2, then changes again)
pdf 2 : b -> b -> c (starts with one format, stays the same, then changes the same way as pdf 1)
pdf 3 : b -> a -> b (starts the same as pdf 2, changes to pdf 1's original format, then changes back)
What's the common factor between these version changes? The return address is determining the version.
PDFs are slightly different from office to office, with templates drifting slightly each month in diverging directions. You have to start reevaluating parsing choices and splitting up parsers. It's difficult to account for incurring linear maintenance cost for each new supplier and amortize that over a sizeable period of time. My arch nemesis is an intern who got put to work fixing the invoices at one office of one foreign supplier.
# PDFs that aren't standards compliant
In this case, most pdf processing libraries will bail out. Pdf viewers on the other hand will silently ignore some corrupted or malformed data. I remember seeing one that would be consistently off by a single bit. Something like `\setfont !2` needed to have '!' swapped out for another syntactically valid character that would leave byte offsets for the pdf unchanged.
TLDR: If you can push back, push back. Take your data in any format other than PDF if there is any way that is possible.
If you upload a pdf to google drive and download it 10 minutes later it will magically have BY FAR the best OCR results in the pdf. Note my pdf tests were fairly clean so your experience may not be the same.
I have used Google's fine OCR results to simulate a hacker.
- Download a youtube video that shows how to attack a server on the website hackthebox.eu
2 minutes is probably long enough. I did notice that google drive doesn't seem to like it if you upload a lot of files. I have had files sit and never get OCR, but I forgot about them so they may have OCR on them now.
PDF is, without a doubt, one of the worst file formats ever produced and should really be destroyed with fire... That said, as long as you think of PDF as an image format it's less soul destroying to deal with.
PDF is good at what it's supposed to be good at.
Parsing pdf to extract data is like using a rock as a hammer and a screw as a nail, if you try hard enough it'll eventually work but it was never intended to be used that way.
I think my fastener analogy would probably involve something more like trying to remove a screw that's been epoxied in. Or perhaps trying to do your own repairs on a Samsung phone.
It's not that the thing you're trying to do is stupid. It's probably entirely legitimate, and driven by a real need. It's just that the original designers of the thing you're trying to work on didn't give a damn about your ability to work on it.
Actually, parsing text data from a pdf is more like using the rock to unscrew a screw, in that it was not meant to be done that way at all. But yeah, the pdf was designed to provide a fixed-format document that could be displayed or printed with the same output regardless of the device used.
I'm not sure (I haven't thought about it a lot) that you could come up with a format that duplicates that function and is also easier to parse or edit.
It's pretty silly when you think about it. There's an underlying assumption that you'll work with the data in the original format that you used to make the PDF.
QFT. PDF should really have been called “Print Description Format”. At heart it’s really just a long list of non-linear drawing instructions for plotting font glyphs; a sort of cut-down PostScript.
(And, yes, I have done automated text extraction on raw PDF, via Python’s pdfminer. Even with library support, it is super nasty and brittle, and very document specific. Makes DOCX/XLSX parsing seem a walk in the park.)
What’s really annoying is that the PDF format is also extensible, which allows additional capabilities such as user-editable forms (XFDF) and Accessibility support.
Accessibility makes text content available as honest-to-goodness actual text, which is precisely what you want when doing text extraction. What’s good for disabled humans is good for machines too; who knew?
i.e. PDF format already offers the solution you seek. Yet you could probably count on the fingers of one hand the PDF generators that write Accessible PDF as standard.
(As for who’s to blame for that, I leave others to join up the dots.)
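If you want to check whether a given file is one of those rare accessible ones before falling back to heuristics, poppler's pdfinfo will tell you (just an illustration; a "yes" still doesn't guarantee the tagging is any good):

    # "Tagged: yes" means the PDF carries logical structure (Tagged PDF)
    pdfinfo paper.pdf | grep -i '^Tagged'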
PDF is great at what it's meant to be, a digital printed paper, with its pros (it will look exactly the same anywhere) and cons (can't easily extract data from it or modify it).
Currently, there is no viable alternative if you want the pros but not the cons
For me, the biggest con of PDFs is that like physical books, the font family and size cannot be changed. This means you can't blow the text up without having to scroll horizontally to read each line or change the font to one you prefer for whatever reason. It boggles my mind that we accept throwing away the raw underlying text that forms a PDF. PDF is one step above a JPEG containing the same contents.
> Currently, there is no viable alternative if you want the pros but not the cons
I remember OpenXPS being much easier to work with. That might be due to cultural rather than structural differences, mind - fewer applications generate OpenXPS, so there's fewer applications to generate them in their own special snowflake ways.
This is the first time I heard of it. When I search for it I only find the Wikipedia article and 99 links to how to convert it to pdf.
The problem with this is that from an average person perspective it doesn't have the pros. There is no built-in or first-party app that can open this format on Mac and Linux.
More than 99% of the users only want to read or print it. It's hard to convince them to use an alternative format when it's way more difficult to do the only thing they want to do.
It's a Windows thing, since Windows 7, IIRC. It's OK now, but it was buggy for years, and, yes, who consumes XPS files anyway? So however much better it is, it's not more useful.
We have to fill existing PDFs from a wide range of vendors and clients. Our approach is to raster all PDFs to 300DPI PNG images before doing anything with them.
Once you have something as a PNG (or any other format you can get into a Bitmap), throwing it against something like System.Drawing in .NET(core) is trivial. Once you are in this domain, you can do literally anything you want with that PDF. Barcodes, images, sideways text, html, OpenGL-rendered scenes, etc. It's the least stressful way I can imagine dealing with PDFs. For final delivery, we recombine the images into a PDF that simply has these as scaled 1:1 to the document. No one can tell the difference between source and destination PDF unless they look at the file size on disk.
This approach is non-ideal if minimal document size is a concern and you can't deal with the PNG bloat compared to native PDF. It is also problematic if you would like to perform text extraction. We use this technique for documents that are ultimately printed, emailed to customers, or submitted to long-term storage systems (which currently get populated with scanned content anyways).
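The rasterize-and-recombine steps themselves can also be done with stock open-source tools rather than System.Drawing; a rough sketch of the same idea (an illustration, not the poster's actual stack; file names are placeholders):

    # burst the PDF into 300 DPI PNGs (poppler-utils)
    pdftoppm -r 300 -png original.pdf page
    # ...draw barcodes, text, images, etc. onto the PNGs here...
    # wrap the edited images back up as a PDF; img2pdf embeds them without re-encoding
    img2pdf page-*.png -o filled.pdf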
You could probably reduce file size by generating your additions as a single PDF, and then combining that with the original 'form', using something like
There's iText 7 (also for Java). Not sure how it compares with other libraries, but it will parse text along with coordinates. You just need to write your own extraction strategy to parse how you want.
From my experience, it seems to grab text just fine, the tricky part is identifying & grabbing what you want, and ignoring what you don't want... (for reasons mentioned in the article)
... and is much more difficult to extract text from than PDF, given that it's Turing complete (hello, halting problem) and doesn't even restrict your output to a particular bounding box.
I read ebooks on my Nintendo DSi for several years when I was in college; The low-resolution screen combined with my need for glasses (and dislike of wearing them) made reading PDF files unbearable. Later on I got a cheap android tablet and reading PDF files was easier, but still required constant panning and zooming. Today I use a more modern device (2013 Nexus 7 or 2014 NVidia Shield), and I still don't like PDF files. I usually open the PDF in word if possible, save it in another format, then convert to epub with calibre, and dump the other formats.
Epubs in comparison are easy, as all it takes is a single tap or button press to continue. When there's no DRM on the file (thanks HB, Baen) I read in FBReader with custom fonts, colors, and text size. It doesn't hurt any that the epub files I get are usually smaller than the PDF version of the same book.
Personally, I think the fact that Calibre's format converter has so many device templates for PDF conversion says a lot.
As a meta point, it's really nice to see such a well-written, well-researched article that is obviously used as a form of lead generation for the company, and yet has no in-your-face "calls to action" which try to stop you reading the article you came for and get out your wallet instead.
i mean except for the banner at the top and bottom! but yeah, an SEO article with actual substance, well formatted, not grey-on-grey[1], no trackers[2], is rare these days.
[1] recently read an SEO post on okta's site. who can read that garbage?
it doesn't correlate across sites by default -- the reasonable definition of a 3rd-party tracker. by your definition, everything not completely self-hosted is a 3rd-party tracker. eg, netlify, which uses server logs to "self"-analyze, would be a 3rd-party tracker. it is not self-hosted and the data is stored elsewhere.
some might add: for the purpose of resale of the data, but I don't think that's a requirement to be classified as 3rd party tracker. the mere act of correlation, no matter what you then do with the data, makes you a 3rd party tracker. in case you think that's just semantics, this is important for GDPR and the new california law.
you can turn on the "doubleclick" option, which does do said correlation and tracks you. but that's up to the site to decide. GA doesn't do it by default.
The best technique for having a PDF with extractable data is to include the data within the PDF itself. That is what LibreOffice can do, it can slip in the entire original document within a PDF. Since a compressed file is quite small, the resulting files are not that much larger, and then you don't need to fuss with OCR or anything else.
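On the receiving side, you can get that embedded original back out without any parsing heuristics at all; poppler ships a small tool for exactly this (assuming the producer really did embed the source document):

    # list and extract files embedded in the PDF (e.g. the source .odt of a hybrid PDF)
    pdfdetach -list hybrid.pdf
    pdfdetach -saveall hybrid.pdf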
Yes to embedding. In Canada, folks have always been able to e-file tax returns, but the CRA (Canada Revenue Agency) also has fillable PDF form for folks who insist on mailing in their returns (with their receipts and stuff so they don't have to store them and risk losing them).
When you're done filling the form, the PDF runs form validity checks and generates a 2D barcode [1] -- which stores all your field-entry data -- on the first page. This 2D barcode can then be digitally extracted on the receiving end with either a 2D barcode scanner or a computer algorithm. No loss of fidelity.
Looks like Acrobat supports generation of QR, PDF417 and Data Matrix 2D barcodes.[2]
Just the receipts relevant to the tax return. If you e-file you're responsible for storing receipts up to 6 years in case of audit. (or something like that)
If you're worried about malicious differences, "regular" PDFs are worse.
As noted in the article, it is extremely difficult to figure out the original text given only a "normal" PDF, so you end up using a lot of heuristics that sometimes guess correctly. There's no guarantee that you'll be able to extract the "original text" when you start with an arbitrary PDF without embedded data. So if you're extracting text, neither way guarantees that you'll get "original text" that exactly matches the displayed PDF if an attacker created the PDF.
That said, there's more you can do if you have an embedded OpenDocument file. For example, you could OCR the displayed PDF, and then show the differences with the embedded file. In some cases you could even regenerate the displayed PDF & do a comparison. There are lots of advantages when you have the embedded data.
It's nice to note how several of these problems already exist in much more structured document types, such as HTML.
Using white-on-white black-hat SEO techniques for keyword boosting? Check. Custom fonts with random glyphs? Check. I didn't see custom encodings (yet).
We try to keep HTML semantic, but Google has been interpreting pages at a much higher level in order to spot issues such as these. If you ever tried to work on a scraper, you know how hard it is to get far nowadays without using a full-blown browser as a backend.
What worries me is that it's going to get massively worse. Despite me hating HTML/web interfaces, one big advantage for me is that everything which looks like text is normally selectable, as opposed to a standard native widget which isn't. It's just much more "usable", as a user, because everything you see can be manipulated.
We've seen already asm.js-based dynamic text layout inspired by tex with canvas rendering that has no selectable content and/or suffers from all the OP issues! Now, make it fast and popular with WASM...
Hiding page content unless rendered via JS is the darkest dark pattern in HTML I've noted.
Though absolute-positioning of all text elements via CSS at some arbitrary level (I've seen it by paragraph), such that source order has no relationship to display order, is quite close.
I went down a rabbit hole while making a canvas based UI library from scratch.. and started reading about the history of NeWS, display postscript, and postscript in general.
What actually needs to be done to extract text correctly is to be able to parse the postscript, have a way of figuring out how the raw text.. or the curves that draw the text.. are displayed (whether they are or not and in relation to each other) using information that the postscript gives you.
Edit: More than anything I think understanding deeply the class of PDFs you want to extract data from is the most important part. Trying to generalize it is where the real difficulty comes from.. as in most things.
A couple of years ago I was working on a home project and utilised Tesseract and Leptonica for OCR, with HDFS, HBase and SolrCloud for storage and search over the extracted text. You can find the details on my website. I was very impressed with the conversion of handwritten PDF docs, with 90% readable accuracy. I have named it Content Data Store (CDS): http://ammozon.co.in/headtohead/?p=153 . The source code is open and you can find installation steps and how to run it here:
http://ammozon.co.in/headtohead/?p=129
http://ammozon.co.in/headtohead/?p=126
A short demo:
http://ammozon.co.in/gif/ocr.gif
I did not get time to enhance it further but am planning to containerize the whole application. See if you find it useful in its current form.
I had a similar problem and ended up using AWS's Textract tool to return the text as well as bounding-box data for each letter, then overlaid that on a UI with an SVG of the original page, allowing the user to highlight handwritten and typed text. I plan to open source it, so if anyone's interested let me know.
Not a fan of the potential vendor lock in though, so it's only really suitable for those in an already AWS environment not worried about them harvesting your data.
I use a screen reader, so of course some kind of text extraction is how I read PDFs all the time. There were some nice gotchas I've found.
* Polish ebooks, which usually use Watermarks instead of DRM, sometimes hide their watermarks in a weird way the screen reader doesn't detect. Imagine hearing "This copy belongs to address at example dot com: one one three a f six nine c c" at the end of every page. Of course the hex string is usually much longer, about 32 chars long or so.
* Some tests I had to take included automatically generated alt texts for their images. The alt text contained full paths to the JPG files on the designer's hard drive. For example, there was one exercise where we were supposed to identify a building. Normally, it would be completely inaccessible, but the alt was something like "C:\Documents and Settings\Aneczka\Confidential\tests 20xx\history\colosseum.jpg".
* My German textbook had a few conversations between authors or editors in random places. They weren't visible, but my screen reader still could read them. I guess they used the PDF or Indesign project files themselves as a dirty workaround for the lack of a chat / notetaking app, kind of like programmers sometimes do with comments. They probably thought they were the only ones that will ever read them. They were mostly right, as the file was meant for printing, and I was probably the only one who managed to get an electronic copy.
* Some big companies, mostly carriers, sometimes give you contract templates. They let you familiarize yourself with the terms before you decide to sign, in which case they ask you for all the necessary personal info and give you a real contract. Sometimes though, they're quite lazy, and the template contracts are actually real contracts. The personal data of people that they were meant for is all there, just visually covered, usually by making it white on white, or by putting a rectangle object that covers them. Of course, for a screen reader, this makes no difference, and the data are still there.
Similar issues happen on websites, mostly with cookie banners, which are supposed to cover the whole site and make it impossible to use before closing. However, for a screen reader, they sometimes appear at the very beginning or end of the page, and interacting with the site is possible without even realizing they're there.
I almost always have to resort to a dedicated parser for that specific PDF. I use it, for example, to ingest invoice data from suppliers that won't send me plain text. I always end up with a parser per supplier, and copious amounts of sanity checking to notify me when they break/change the format.
I'm an ML engineer; I worked part-time as a data engineering consultant for a medical lines/claims extraction company for 3 years, which mostly involved extracting tabular data from PDFs and images. Developing rules or parsers as such is JUST no help: you end up creating a new rule every time you miss an extraction.
With that in mind, and since existing resources are little help, especially with skewed, blurry, or handwritten input, or two different table structures in the same input, I ended up creating an API service to extract tabular data from images and PDFs, hosted at https://extracttable.com . We built it to be robust; average extraction time on images is under 5 seconds. On top of maintaining accuracy, a bad extraction is eligible for a credit-usage refund, which literally no other service offers.
I invite HN users to give it a try, and feel free to email saradhi@extracttable.com for extra API credits for the trial.
Hi, author and maintainer of Tabula (https://github.com/tabulapdf/tabula). We've been trying to contact you about the "Tabula Pro" version that you are offering.
Am I reading the repos correctly? It looks like ExtractTable copied Tabula (MIT) to its own repo rather than forking it, removed the attribution, and then tried to re-license it as Apache 2.0. If so, that would be pretty fucked up.
Not really. They import tabula_py, which is a Python wrapper around tabula-java (the library of which I'm a maintainer).
Still, I would have loved at least a heads up from the team that sells Tabula Pro. I know they're not required to do so, but hey, they're kinda piggybacking on Tabula's "reputation".
If you control the Tabula trademark (which doesn't necessarily require a formal registration), you may be able to prohibit them from using the TabulaPro name. That's exactly what trademark law is for.
William, the intention of "TabulaPro" is to give the developers a chance to use a single library instead of switching ExtractTable for images and tabula-py for text PDFs.
What do you recommend we do so that you don't feel we made a dick move?
"commercialize the original author's work with the author"
- No, but let me highlight this, any extraction with tabula-py is not commercialized - you can look into the wrapper too :) or even compare the results with tabula-py vs tabulaPro.
Copying the TabulaPro description here, "TabulaPro is a layer on tabula-py library to extract tables from Scan PDFs and Images." - we respect every effort of the contributors & author, never intended to plagiarize.
I understand the misinterpretation here is that we are charging for the open-sourced library because of the name. We already informed the author in the email about unpublishing the library. This morning, I just deleted the project and came here to mention it is deleted :)
Sorry, Saradhi, I don't think you can reasonably claim there was no intention to plagiarize. Adding a "pro" to something is clearly meant to suggest it's the paid version of something. And it's equally clear that "TabulaPro" is derived from "Tabula".
It may be that you didn't realize that people would see your appropriation as wrong, although I have a hard time believing that as well given that the author tried to contact you and was ignored. As they say, "The wicked flee when no man pursueth."
So what I see here is somebody knowingly doing something dodgy and then panicking when getting caught. If you'd really like to make amends, I'd start with some serious introspection on what you actually did, and an honest conversation with the original author that hopefully includes a proper [1] apology.
And I'm going to add that it's really weird that your answer ("No, No, Zero") is exactly the same as what the library author said [1] two hours before you posted. But you did that again without acknowledging the author, and with just enough difference in formatting that it's not a copy-paste. It's extremely hard for me to imagine you didn't read what he said before writing that; it's just too similar.
I chuckled at your "the worst image" sample. Which still looked quite decent all things considered.
Your "handwritten" example looks a bit "too decent" as well. I can see how that works: you first look for the edges of the table, and then evaluate the symbol in each cell as something that matches a Unicode character.
So, how well does this cope with increasing degradation? i.e. pencil written notes that bleed outside cell borders, curve around borders, etc.? Stamps and symbols (watermarks) across tables?
"pencil written notes that bleed outside cell borders, curve around borders, etc.? Stamps and symbols (watermarks) across tables?"
"The Worst Image" is a close match to that, except it is a print.
Regarding increasing degradation: as stated above, the OCR engine is not proprietary. At the moment we confine ourselves to detecting the structure, and we started with the most common problems.
What a glorious format for storing mankind's knowledge. Consider that by now displays have arbitrary sizes and a variety of proportions, and that papers are often never printed but only read from screens. To reflow text for different screen sizes, you need its ‘semantic’ structure.
And meanwhile if you say on HN that HTML should be used instead of PDF for papers, people will jump on you insisting that they need PDF for precise formatting of their papers—which mostly barely differ from Markdown by having two columns and formulas. What exactly they need ‘precise formatting’ for, and why it can't be solved with MathML and image fallback, they can't say.
Not everything needs to look good on every screen size. I don't expect to be able to read academic papers on my smartwatch, and a simple alarm clock app looks kind of silly when it's fullscreen across a desktop monitor. Likewise, when layout makes the difference between reader understanding or confusion, it's hard to trust automatic reflowing on unknown screen sizes.
PDF is simply better than HTML when it comes to preserving layout as the author intended.
I wonder if you realize that both your points wildly miss what I said.
First, there's no need to stretch my argument to the point of it being ridiculous. I don't have to reach for a watch to suffer from PDF. Even a tablet is enough: I don't see many 14" tablets flying off the shelves. I also know for sure that the vast majority of ubiquitous communicator devices, aka smartphones, are about the same size as mine, so everyone with those is guaranteed to have the same shitty experience with papers on their communicators and will have to sedentary-lifestyle their ass off in front of a big display for barely any reason.
Secondly:
> Likewise, when layout makes the difference between reader understanding or confusion, it's hard to trust automatic reflowing on unknown screen sizes. PDF is simply better than HTML when it comes to preserving layout as the author intended.
As I wrote right there above, still no explanation of why the layout makes that difference and why preserving it is so important when most papers are just walls of text + images + some formulas. Somehow I'm able to read those very things off HTML pages just fine.
Some people are very "funny" about the layout of items and text and want it preserved identically to their "vision" when they created it. For example, every "marketing" individual, when they see a webpage, seems to want it pixel-perfect.
I think it's the artist in them.
This is understandable in some instances:
a. Picasso's or Monet's works probably wouldn't be as good if you just roll them up into a ball. Sure, the component parts are still there (it's just paper/canvas and paint after all!) but the result isn't what they intended.
b. A car that has hit a tree is made up of the composite parts but isn't quite as appealing (or useful) as the car before hitting the tree.
c. A wedding cake doesn't look as good if the ingredients are just thrown over the wedding party's table. The ingredients are there, but it just isn't the same...
That's a tradeoff between complex formatting and accessibility of the result. Authors are making readers sit in front of desktops/laptops for some wins in formatting. Considering that papers, at least ones that I see, are all just columns of text, images and formulas, the win seems to be marginal, while the loss in accessibility is maddening with the current tech-ecosphere.
> A PDF isn’t for storage it’s for display. It’s the equivalent of a printout.
This conjecture would have some practical relevance if I had access to the same papers in other formats, preferably HTML. Yet I'm saddened time and again to find that I don't.
In fact, producing HTML or PDF from the same source was exactly my proposed route, before I was told that apparently TeX is only good for printing or PDFs. I hope this is false, but I'm not in a position to argue it currently.
But when you access a paper it’s for reading it, correct?
It is worrying if places that are “libraries” of knowledge aren’t taking the opportunity to keep searchable/parseable data, but it’s no worse than a library of books.
That's not my complaint in the first place. The problem is that while we progressed beyond books on the device side in terms of even just the viewport, we seemingly can't move past the letter-sized paged format. The format may be a bit better than books—what with it being easily distributed and with occasionally copyable text—but not enough so.
I'm not even touching the topic of info extraction here, since it's pretty hard on its own and despite it also being better with HTML.
Yeah, it's better with HTML than with PDF, but it's still pretty terrible... Use an actually structured data format like XML (XHTML would be good), because you don't want to include a complete browser just to search for text.
HTML has all the same problems and degrades over time. A PDF from 20 years ago will at least be readable by a human; an HTML page doesn't even guarantee that much.
You're right that most of the relevant semantics would fit into Markdown. So store the markdown! There are problems with PDF but HTML is the worst of all worlds.
What exactly degrades about HTML in twenty years? I can read pages from the 90s just fine: the main thing off is the font size due to the change in screen resolutions, but—surprise!—plain HTML scales and reflows beautifully on big and small screens. (Which is the complete opposite of ‘HTML has the same problems’.) I hope you're not lamenting the loss of the ‘blink’ tag.
If you're talking about images and whatnot falling off, that's a problem of delivery and not the format.
Markdown translates to HTML one-to-one, it's in the basic features of Markdown. For some reason I have to repeat time and again: use a subset of HTML for papers, not ‘glamor magazine’ formatting. The use of HTML doesn't oblige you to go wild with its features.
> What exactly degrades about HTML in twenty years? I can read pages from the 90s just fine: the main thing off is the font size due to the change in screen resolutions, but—surprise!—plain HTML scales and reflows beautifully on big and small screens. (Which is the complete opposite of ‘HTML has the same problems’.) I hope you're not lamenting the loss of the ‘blink’ tag.
I am indeed, and of other tags that are no longer supported. Old sites are often impossible to render with the correct layout. Resources refuse to load because of mixed-content policy or because they're simply gone - which is a problem with the format because the format is not built for providing the whole page as a single artifact. And while the oldest generation of sites embraced the reflowing of HTML, the CSS2-era sites did not, so it's not at all clear that they will be usable on different-resolution screens in the future.
> Markdown translates to HTML one-to-one, it's in the basic features of Markdown. For some reason I have to repeat time and again: use a subset of HTML for papers, not ‘glamor magazine’ formatting. The use of HTML doesn't oblige you to go wild with its features.
This is one of those things that sounds easy but is impossible in practice. Unless you can clearly define the line between which features should be used and which should not, you'll end up with all of the features of HTML being used, and all of the problems that result.
> Unless you can clearly define the line between which features should be used and which should not, you'll end up with all of the features of HTML being used, and all of the problems that result.
You'll notice that I said in the top-level comment that my beef is with PDF papers (i.e. scientific and tech). I don't care about magazines and such, since they obviously have different requirements. So let's transfer your argument to current papers publishing:
“Since PDF can format text and graphics in arbitrary ways, you'll end up with papers that look like glamor and design magazines and laid out like Principia Discordia and Dada posters. You'll have embedded audio, video and 3D objects since PDF supports those, and since it can embed Flash apps, you'll have e.g. ‘RSS Reader, calculator, and online maps’ as suggested by Adobe, and probably also games. PDF also has Javascript and interactive input forms, so papers will be dynamic and interactive and function as clients to web servers.”
You can decide for yourself whether this corresponds to reality, and if the hijinks of CSS2-era websites are relevant.
What is it with people, one after another, jumping to the same argument of ‘if authors have HTML, they will immediately go bonkers’? It really looks like some Freudian transfer of innate tendencies. We have Epub, for chrissake, which is zipped HTML—what, have Epub books gone full Dada while I wasn't looking? Most of the trouble I've had with Epub is inconvenience with preformatted code.
> Old sites are often impossible to render with the correct layout. Resources refuse to load because of mixed-content policy or because they're simply gone - which is a problem with the format because the format is not built for providing the whole page as a single artifact.
Yes, as I mentioned under the link provided in the top-level comment, the non-use of a packaged-HTML delivery is precisely my beef here. The entire idea of using HTML for papers implies employing a package format, since papers are usually stored locally. It's a chicken-and-egg problem. It's solved by the industry picking one of the dozen available package formats and some version of HTML for the content. Which would still mean that HTML is used for formatting. HTML could be embedded in PDF for all I care, if I can sanely read the damn thing on my phone.
> “Since PDF can format text and graphics in arbitrary ways, you'll end up with papers that look like glamor and design magazines and laid out like Principia Discordia and Dada posters. You'll have embedded audio, video and 3D objects since PDF supports those, and since it can embed Flash apps, you'll have e.g. ‘RSS Reader, calculator, and online maps’ as suggested by Adobe, and probably also games. PDF also has Javascript and interactive input forms, so papers will be dynamic and interactive and function as clients to web servers.”
Those things don't happen in PDFs in the wild, or at least not to any great extent. It's not that technical paper authors have shown some special restraint and limited themselves to a subset of what the rest of the PDF world does. Technical papers look much like any other PDF and do much the same thing that any other PDF does; if they were authored in HTML, we should expect them to look much like any other HTML page and do much the same thing that any other HTML page does. Based on my experience of HTML pages, that would be a massive regression.
> Yes, as I mentioned under the link provided in the top-level comment, the non-use of a packaged-HTML delivery is precisely my beef here. The entire idea of using HTML for papers implies employing a package format, since papers are usually stored locally. It's a chicken-and-egg problem. It's solved by the industry picking one of the dozen available package formats and some version of HTML for the content. Which would still mean that HTML is used for formatting. HTML could be embedded in PDF for all I care, if I can sanely read the damn thing on my phone.
The details matter; you can't just handwave the idea of a sensible set of restrictions and a good packaging format, because it turns out those concepts mean very different things to different people. If you want to talk about, say, Epub, then we can potentially have a productive conversation about how practical it is to format papers adequately in the Epub subset of CSS and how useful Epub reflowing is versus how often a document doesn't render correctly in a given Epub reader. If all you can say about your proposal is "a subset of HTML" then of course people will assume you're proposing to use the same kind of HTML found on the web, because that's the most common example of what "a subset of HTML" looks like.
> It's not that technical paper authors have shown some special restraint and limited themselves to a subset of what the rest of the PDF world does. Technical papers look much like any other PDF and do much the same thing that any other PDF does.
This makes zero sense to me. You're saying that technical papers look the same as Principia Discordia or glamor/design magazines or advertising booklets, including those that just archive printed media. That technical papers include web-form functionality just like some PDFs do—advertising or whatnot, I'm not sure. If that's the reality for you then truly I would abhor living in it—guess I'm relatively lucky here in my world.
However, if you point me to where such papers hang out, I would at least finally learn what mysterious ‘complex formatting’ people want in papers and which can only be satisfied by PDF.
Actually, no thanks. "Semantic" structure is how we got the responsive web soup of ugly websites with hamburger menus.
We need the opposite: a format that keeps the same size and proportions and is vectorized so you can zoom to any level, while the spatial relationships between elements remain constant.
PDF is an amazing format IMO. Think of it like Docker - the designer knows exactly how it's going to appear on the user's device.
No, not everything is orthogonal. When you have semantic structure, you're gonna display it in a responsive and adaptive way. That __breaks__ the design intent. For a designer, WYSIWYG is a godsend. The parent comment is right: PDF is like a Docker container for designers, people who work with media.
If you have opposing thoughts, please elaborate further instead of simply saying "nothing to do with it". HN works by explaining and arguing about issues to get to the bottom of something.
I read "I think the problem is in your head" as talking about the other user personally. Looking more closely, I can read it as a general statement, in which case it wasn't a personal attack.
Statements of the form "So you're saying [obviously stupid thing]?" still break the site guidelines, though.
Well, firstly I don't need to place words in fermienrico's mouth, since the comment to which I was replying says: ‘"Semantic" structure is how we got responsive web soup of ugly websites with hamburger menus’. On my part I'm trying to figure out why fermienrico considers that connection to be inevitable, in the context of HTML production, as compared to PDF production—i.e. with my correspondent taking the place of an author.
Next:
> I don't think that's a problem in HTML. I think the problem is in your head.
This directly addresses fermienrico's complaint as being targeted at HTML. The ‘in your head’ part says that the problem is imaginary and that the person's reasoning and motivation in arriving at that connection is a mystery to me. So there are two meanings combined, both of which aren't ad hominems. The statement also deliberately invokes the words of the character of Preobrazhensky from Bulgakov's ‘Heart of a Dog’, on the condition of the fresh-born Soviet Union: “The Disruption isn't in the lavatories, it's in the heads”.
Let's inspect the statement more closely. The use of the second-person ‘you’ in hypothetical constructs is ubiquitous in English, instead of a third-person ‘a man’ or ‘one’, e.g.: “When you try reading PDF on a phone, you experience unspeakable horror and loss of all hope”—this doesn't imply that the addressed correspondent is the one doing this, and in fact may be directed at multiple unknown readers or listeners.
The hyperbole of ‘you're nuts’ is, to my knowledge, also a typical feature of colloquial English language, e.g.: “You must be crazy to try reading PDF on a phone”, or “What's wrong with you, that tablet is too small to display PDF adequately”. These both don't mean that the correspondent is literally mentally damaged, but that the speaking party doesn't understand their reasoning or doesn't agree with it.
My choice of words there is rather harsh, yes. Why I needed that is, I've had this same discussion before and I repeatedly failed to extract from people the reasons why they make this inference. I tried different approaches, and now came the time of directly placing the person in the shoes of an author. Still nothing so far.
On top of all this and nitpicking further, even disregarding the above I still can't quite fit the statement in a ‘personal attack’ category: as I understand it, an ‘ad hominem’ works by carrying a belief ‘the person has some bad quality A’ over to ‘the person's opinion B hence must be wrong’. In the case of ‘you must be insane/naive to have the opinion B’ no other personal qualities are involved, and the opinion B is directly stated to be wrong. Possibly uncivil and possibly unsupported yes, personal no.
P.S. Could someone please make the ‘collapse comments’ button-link larger on phones? It's even worse on a higher-PPI display, easily taking a dozen attempts to hit it—and extensions like Stylus aren't readily available on phones, what with Firefox dumping them in Preview. Just making the link a dozen characters wide would be splendid.
One thing to point out about our library is that while we do take PDF as input and use it to calculate visual features, we also rely on an HTML representation of the PDF for structural cues. In our pipeline this is typically done by using Adobe Acrobat to generate an HTML representation for each input PDF.
What type of visual features are you looking at? I've been trying to find a web-clipper that uses both visual and structural cues from the rendered page and HTML, but have no luck finding a good starting point.
There are a handful. We look at bounding boxes to featurize which spans are visually aligned with other spans, which page a span is on, etc. You can see more in the code at [1]. In general, visual features seem to give some nice redundancy on top of the structural features of HTML, which helps when dealing with an input as noisy as PDF.
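As a toy illustration of that kind of feature (not the library's actual code), something like this computes pairwise cues from span bounding boxes; the dict layout and feature names are invented for the example:

    def visual_features(a, b, tol=2.0):
        """a, b: spans as {"bbox": (x0, y0, x1, y1), "page": int} in PDF points."""
        ax0, ay0, ax1, ay1 = a["bbox"]
        bx0, by0, bx1, by1 = b["bbox"]
        return {
            "same_page": a["page"] == b["page"],
            "left_aligned": abs(ax0 - bx0) <= tol,
            "right_aligned": abs(ax1 - bx1) <= tol,
            # vertical distance from a's bottom edge to b's top edge (y grows upward in PDF)
            "vertical_gap": ay0 - by1,
        }

    span_a = {"bbox": (72.0, 700.0, 300.0, 712.0), "page": 1}
    span_b = {"bbox": (72.0, 686.0, 290.0, 698.0), "page": 1}
    print(visual_features(span_a, span_b))  # same page, left-aligned, small gap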
You'd hope so, but some printers run very finicky software with less horsepower than your desktop machine, so they can fall over on complex PDF structures. I preferred Postscript!
Not sure if I'm misunderstanding but PCL is another page description language like PS, some printers can use both depending on the driver.
Most of our Xerox printers spoke Postscript natively, these days more printers can use PDF. We generally used a tool to convert PCL to PS to suit our workflow if that was the only option for the file, because being able to manipulate the file (reordering and applying barcodes or minor text modifications) was important. Likewise for AFP and other formats. PCL jobs were rare so I never worked on them personally.
Unfortunately, when you need the output of program A as the input to B sometimes you have to jump through such hoops. I've never done it with .pdf but I've fought similar battles with .xps and never fully conquered them. (And the parser was unstable as hell, besides--it would break with every version and sometimes for far lesser reasons.)
The relatively small company I work for makes me fill out some forms by hand, because they receive them from vendors as a PDF. So I print it out, sign it, and return it to my company by hand.
If someone could make a service that lets you upload a PDF that contains a form, and then let users fill out that form and e-sign it and collect the results, and then print them out all at once, it would be great.
It's not a billion dollar idea but there are a lot of little companies that would save a lot of time using it.
There are quite a few services that should be able to solve this problem (turning a PDF into a web form and collecting signatures). Here are a few of the services I'm aware of:
(I know about all these because I'm working on a PDF generation service for developers called DocSpring [1]. I'm also working on e-signature support [2], but that's still under development, and still won't be a perfect fit for your use-case.)
I used Xournal for a couple years in college. It was perfect in how simple it was to mix handwritten and typed notes or markup documents. The only thing is that I wish it had some sort of notebook organization feature. It would have been nice keeping all of my course notes in one file, broken down by chapter or daily pages. Instead, I ended up with a bunch of individual xojs that did the job but made searching for material take longer.
Coming from a documents format world (publishing), there are a lot of cases like this.
In theory it sounds like it should be straightforward but it hinges so much on how well the document is structured underneath the surface.
Because these tools were designed primarily for non-technical users, the priority is the visual and printed outcome, not the underlying structure.
One document can look much the same as another in form (black borders to outline fields, similar or identical field names, etc.) but may be structured entirely differently underneath, and that can be a madhouse of frustrating problems.
It can be complex enough to write a solution for one specific document source. Writing a universal tool that could take in any form like that would probably be a pretty decent moneymaker.
My first intuition, though, is that it may be more successful (though no simpler) to develop a model that reads from the rendered visual of the document rather than parsing its internals.
Out of curiosity, what exactly are non-technical people doing with PDF's, and why does there need to be a universal tool in the space? What would the tool do with the extracted data?
All kinds of things. PDF is the unifying data exchange format for a lot of businesses who use computers at some end to manage things and need to exchange documents of any kind without relying on the old "can you open Word files?" type problems.
There is a wide world outside of consumers of SaaS products for every little niche problem.
Sometimes they are baked into processes that still use PDFs to share information, sometimes they're old forms of any kind, sometimes even old scanned docs that are still in use but shared digitally. A lot of the businesses that carry on that way are of the mind that "if it's not broke, don't fix it", which is quite rational given their problem areas and existing knowledge base. They might be a potential market at some point for a new solution, but good luck selling them on a web-based subscription SaaS solution when a simple form has been serving their needs for 30+ years.
OP's problem of the PDF being the go-between to digital endpoints is more common than you might think.
The universality I was referring to was the wide range of possibilities for how a given form might be laid out. And old documents contain a lot of noise when they've been added to or manipulated. Look inside an old PDF form from some small-to-medium-sized business sometime. Now imagine 1,000 variations of that form for one standard problem. Then multiply that by the number of potential problem areas the forms are managing.
Also like OP said—it's not sexy, but it's very real and having an intelligent PDF form reader and consumer would be a time-saver for those businesses who aren't geared to completely alter their workflow.
The tool could do anything with the extracted data, if it let you connect to any of your in-house services (like payroll or accounting), either with a quick config/API or a custom patch, or to Google Drive, or whatever, without complications like requiring online access and web accounts. No whole solution like that exists to my knowledge. At least nothing accessible to the wider market.
Thanks for the comment, this is really interesting. I guess I'm still confused about what people actually do with these PDFs, though. Are people looking at a PDF sent to them and manually entering that data somewhere else (like payroll or accounting), so this tool would take that data from the PDF and pump it in there automatically?
Thanks again, I just want to make sure I understand.
I assume you mean a drawn form as opposed to a true PDF form. The former would be difficult to parse automatically into inputs.
OTOH, a PDF form works exactly the way you'd like. Maybe there's a small market in helping convert one to the other for collecting input from old paper-ish forms.
Something like that would work for signing, but the hard part is "turn this PDF into an online form". That way, after a user finishes a form, you can perform some basic error checking: did they fill out everything, is this field in a valid format, etc. After 100 employees turn in a multi-page printed-out form, someone has to go through it and make sure they signed everywhere, filled out all the fields, etc.
Again, not sexy, but it is so stupid I have to fill out a direct deposit form by hand and turn it into my company, who checks it, then hands it off to the payroll vendor, who has to check it, just to enter the damn data into a form on their end.
> By looking at the content, understanding what it is talking about and knowing that vegetables are washed before chopping, we can determine that A C B D is the correct order. Determining this algorithmically is a difficult problem.
Sorry, this is a bit off-topic regarding PDF extraction, but it distracted me greatly while reading...
I'm pretty sure the intention was A B C D (cut then wash). Not sure why the author would not use alphabetical order for the recipe...
[edit] Sorry, I showed it to a colleague and he mentioned the A B C D annotations were probably not in the original document. This was not clear to me at all while reading, and if they are not included it is indeed hard to find the correct paragraph order.
Even if the ABCD was in the original document, how would the computer figure out it's supposed to indicate the order?
And of course, even if the letters were there in the original document, it would be clear to a human that they're incorrect because it doesn't make sense to wash vegetables after cutting.
The article mentions various ways in which text that appears normal can actually be screwed up inside a PDF. I have found this when running PDFs through the BeeLine Reader PDF converter that my startup built.
One workaround I've found is that sometimes it helps to "print to PDF" the original PDF using Preview on Mac. This doesn't fix all the problems, but it does sometimes fix issues with the input PDF — even though both files appear identical to the human eye.
Are there any other workarounds or "PDF cleaners" out there? It would be awesome if there were a web-based service where you could get a PDF de-gunkified, for lack of a better term.
I had to go through a fair bit of this when writing my Android receipt printer driver. Parse a PDF print job, detect tables and basic formatting, align text to a grid, reformat for 58 mm paper roll width… and that's when the fun begins, since every ESC/POS printer maker supports a different dialect or character encoding set, or maybe just one, or has certain quirks you have to account for…
Is there a tool that works for the limited subset of PDFs generated by Latex? Do those documents have more structure than the average PDF? Less? It'd be nice to extract text from scientific articles at least.
I spent some time extracting abstracts from NLP papers (ACL conferences) and it was mostly straightforward. Using pdfquery to extract PDF -> XML gave each character as an element, and they were mostly ordered sensibly and grouped into paragraphs.
However... this didn't work in some cases, mainly with formatted text but sometimes with PDFs that looked like they were compiled in some nonstandard way. As a result I ended up chucking the XML structure entirely and recompiling the text from character-level coordinates. Formatted text was also an issue, with slightly offset y coordinates from regular characters on the same line.
I'm not sure I could take this experience and say that extracting _all text_ would be straightforward. Hopefully for most documents the XML is nicely structured, but I imagine there are many more opportunities for inconsistencies in how the PDF is generated when thinking about diagrams, tables etc. rather than just abstracts.
Considered writing up a blog post about my experiences with the above but imagined that it was far too niche. Code's here [1] if it's of interest.
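Not the code linked above, but a rough sketch of the character-level regrouping described: sort the characters top-down and left-to-right, then merge characters whose y coordinates fall within a tolerance into the same line (the tolerance also absorbs the slightly offset formatted text):

    def group_chars_into_lines(chars, y_tol=2.0):
        """chars: list of {"x": float, "y": float, "c": str} with PDF coordinates."""
        lines = []
        # PDF y grows upward, so sort by -y to go top-down, then left-to-right
        for ch in sorted(chars, key=lambda c: (-c["y"], c["x"])):
            if lines and abs(lines[-1]["y"] - ch["y"]) <= y_tol:
                lines[-1]["text"] += ch["c"]
            else:
                lines.append({"y": ch["y"], "text": ch["c"]})
        return [line["text"] for line in lines]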
I've had remarkably good results in general (for reading) using the Poppler library's "pdftotext" utility. Since it defaults to writing output to a file, I wrap that in a bash function to arrive at a less-like pager, with page breaks noted:
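I haven't reproduced the original bash function here, but a rough Python equivalent (assuming Poppler's pdftotext is on the PATH) looks like this: extract with -layout, turn the form feeds pdftotext emits between pages into visible markers, then page the result:

    import subprocess, pydoc, sys

    def pdfless(path):
        # "-layout" keeps the original layout; "-" writes to stdout instead of a file
        text = subprocess.run(
            ["pdftotext", "-layout", path, "-"],
            capture_output=True, text=True, check=True,
        ).stdout
        # pdftotext separates pages with form feeds; make the page breaks visible
        text = text.replace("\f", "\n----- page break -----\n")
        pydoc.pager(text)  # uses $PAGER / less when available

    if __name__ == "__main__":
        pdfless(sys.argv[1])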
The key is the "-layout" argument, which preserves the original layout of the document. This ... may not be what you want visually, but it makes backing out the original text somewhat easier.
Of course, requesting the LaTeX sources would be preferred.
Not a general tool but arxiv-vanity - which produces webpages of articles submitted to arxiv - works by parsing the source code that's submitted along with the PDF. You can probably use this data to train a model that converts between pdf, tex, and html.
FYI, redaction from PDF can be similarly difficult. I was once tangentially involved with a piece of PDF redaction software, and due to many different issues with PDF, the solution ended up being to create images of the input PDF, draw over the redacted info, and then create a new PDF that was just a container for the JPEGs. It was the only way to be sure the info wasn't in the PDF at all, since it could be in all kinds of places and duplicated in interesting ways. But since you'd be working on the final rendering, you could be sure everything you covered would be covered in the final output. The biggest challenges after that were related to text extraction, since we wanted a nice UI where you could select text and the redaction would auto-cover it, using a uniform width based on the heights of all characters in the redaction. More often than we were happy with, a user would need to simply use a bounding box, since extracting all the pertinent data related to the text was so hard.
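A hedged sketch of that rasterize-and-rebuild approach with today's Python tooling (pdf2image for rendering via Poppler, Pillow for drawing); the box coordinates are assumed to be given per page in pixels:

    from pdf2image import convert_from_path
    from PIL import ImageDraw

    def redact(in_pdf, out_pdf, boxes_per_page, dpi=200):
        pages = convert_from_path(in_pdf, dpi=dpi)        # render every page to an image
        for page_no, image in enumerate(pages):
            draw = ImageDraw.Draw(image)
            for box in boxes_per_page.get(page_no, []):   # box = (left, top, right, bottom)
                draw.rectangle(box, fill="black")         # paint over the redacted region
        # write a new PDF that is only a container for the page images,
        # so none of the original text objects can leak through
        pages[0].save(out_pdf, save_all=True, append_images=pages[1:])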
I walked away from the product over a decade ago since it always seemed like it'd be trivial for Adobe to implement the feature in Reader. Though every couple of years there's a redaction scandal, and I keep wondering how lucrative the product could have been with some marketing.
> our most successful solution was to run OCR on these pages.
That's the most interesting point in the article.
Reminds me of how a friend managed to fix bugs in an assembly source file written in the original programmer's very own undocumented special language implemented in the assembler's macro language. He disassembled the resulting object file, fixed the problems, and checked in the disassembly as the new source code.
The open source project I work on [0] returns the letters, their positions and other associated information.
We provide support for retrieving words as well as a bunch of different algorithms for document layout analysis [1]. But like the other commenters here mention, it's an extremely difficult problem which doesn't have an easy or general solution.
I was trying to build a custom library on top of the open-source library that did a bit more processing, multi-column analysis, statistical analysis of whitespace size, etc. But building something that works for the general case is difficult enough to be functionally impossible.
Despite that, I think the PDF format is well suited to what it is for, and there are very few "implementation mistakes" in the spec itself (no up-front length for inline image data is the main one, plus accessibility obviously). It has ultimately become too successful, and as a result developers are stuck handling cases where it's being used for entirely the wrong purpose, but I can't see another format gaining purchase for the correct purpose (perhaps it's like JavaScript in that way: huge adoption because it was first, not because it does every job well).
Perhaps a content-first format which also handles presentation well could gain a foothold if it came with a shim for PDF viewers and software to use but I dread to think how much effort that would be.
I've also done a lot of work in this space and one thing I don't understand is why more extraction libraries don't support images as input. If your PDF isn't layered or OCR'd, it might as well be an image. I've lost count of the number of times I've downloaded some PDF extraction tool and then had to hack it into accepting an image.
The open-source Ghostscript [1] can convert simple PDFs to text, while keeping the layout. I doubt it will handle some of the more complicated cases outlined in the article though.
I use it quite successfully to turn my bank statements into text, which can then be further processed.
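If you haven't used it, the txtwrite device is what does this; a minimal invocation, wrapped in Python here, might look like the following (the gs flags are standard Ghostscript options):

    import subprocess

    def pdf_to_text(in_pdf, out_txt):
        # txtwrite is Ghostscript's text-extraction device
        subprocess.run(
            ["gs", "-q", "-dBATCH", "-dNOPAUSE",
             "-sDEVICE=txtwrite", f"-sOutputFile={out_txt}", in_pdf],
            check=True,
        )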
I've recently done this. I've scanned over 5,000 documents to PDF, then batch-converted those from PDF to TIFF using Ghostscript, and then used Tesseract to OCR the TIFFs and combine both back into a searchable PDF. Tesseract may not be the world's best OCR software, but it's free, and both it and Ghostscript are easy to automate.
Now all I need is a good front end search system for my document archive.
I have a Brother ADS-2700w[1] as my scanner which is network connected. It scans directly to a network share (SMB, but also supports FTP, nfs etc.) and outputs as PDF. The PDFs are basically 'dumb' PDFs in that each page of the PDF is an image all wrapped up inside the PDF container.
So that's where Ghostscript comes in. On a schedule I have a script that picks up new PDFs in the share and runs them through Ghostscript to create a multi-page TIFF; that TIFF is then given to Tesseract (which can't handle PDFs natively), which does the OCR and outputs a nice PDF with searchable text. All very simple.
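A rough sketch of that pipeline as a script (binary names and flags assume stock Ghostscript and Tesseract installs; the file names are illustrative):

    import subprocess

    def ocr_scanned_pdf(in_pdf, out_base, dpi=300):
        tiff = f"{out_base}.tif"
        # 1. render the image-only PDF to a multi-page grayscale TIFF
        subprocess.run(
            ["gs", "-q", "-dBATCH", "-dNOPAUSE",
             "-sDEVICE=tiffgray", f"-r{dpi}",
             f"-sOutputFile={tiff}", in_pdf],
            check=True,
        )
        # 2. OCR the TIFF and emit a searchable PDF (written to out_base.pdf)
        subprocess.run(["tesseract", tiff, out_base, "pdf"], check=True)

    ocr_scanned_pdf("scan_0001.pdf", "scan_0001_ocr")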
The scanning of the pages is very fast, but the scanner takes an age sending the PDFs over the network - its Ethernet port is only 100 Mbit/s, but to be honest I just think the CPU inside the scanner is slow. It also doesn't have enough internal buffer, which means you can't scan the next document until the previous one has finished being sent to the share.
If I hooked the scanner up to USB, then the PC could run the Brother software which does use OCR - but it's not automatic, all it does is display the PDF inside Paperport once the scan is complete. For bulk scanning, it's not workable.
Regarding indexing - I've started looking at Solr, and it might suit my needs. I was hoping for a visual type search system, where you could see thumbnails of the PDFs in the results.
xpdf seems to have started to respect the not-copyable flags, while in days of yore it didn't. So now, even for something like a manual for some command-line tools or a textbook on C++ or Rust, you still have to re-type the text (wtf). Time to remove it and search for something better, something that doesn't need a 0.5 GB update every 3 days (on Windows). (Yes, exaggerating slightly.)
Maybe it's the new QT version, OpenBSD still has the Motif one, and it works great. For Windows you have SumatraPDF which is pretty good and it's libre.
I worked on PDF generating software for years. It's a horrible format that should never have been approved as an ISO standard.
When in doubt, use plain text. It's a million times better in every way that counts.
I wish my bank statements and such could be downloaded as plain text files, instead of massive PDF files that embed another copy of a bunch of typefaces in each file.
Ugh, this. I still fail to understand how a device from 2019, even a phone, could show any rendering delay when scrolling to page 200 of a 400 page static document. I thought PDF was less programmable than PostScript, but there's still got to be some kind of non-local semantics in there.
Since GDPR, businesses are "required" to make your data available to you for transfer in a machine-readable format, and you could argue that PDF is not exactly machine-readable in the sense of the law. In practice I have seen cases where you do get CSV or something similar, but smaller firms especially will probably give you Word documents, Excel files, or PDFs.
Interesting! Halifax Bank in the UK changed their generation library for PDFs the other year such that new statements rendered incorrectly on the Mac. Old statements were fine. New ones were garbage text.
Chrome displayed them fine, Preview on Mac did not.
Trying to communicate this to them was like talking to a tree, or an alien, or a room of catatonic individuals.
I think the root problem here is that most people still have trouble separating the data from the presentation. We have to understand that, in the end, substance always beats form.
My bank lets me download bank statements in several formats, CSV among them - not entirely plain text, although embedded in it; seems like the best choice for the use case.
Wish I had this to share with my boss years ago. My first big project at my first post-college job was building a PDF parser that would generate notifications when a process document had been updated and it was the first time the logged-in user was seeing it (to ensure they read the changelog of the process). Even with a single source of PDFs (one technical document writer), I could only get a 70% success rate because the text I needed to parse was all over the place. When I stated we would need to use OCR to get better results, no further development was done (ROI reasons). The technical writer was unwilling to standardize more than they already had, or to consider an alternative upload process where they confirm the revision information, which didn't help.
I don't envy working on ingesting even more diverse PDFs.
Another site that breaks the browser's back navigation. Why do so many sites do this? Do they imagine they retain user attention for longer if they break navigation? It's pretty trivial to long-press the back button or just close the tab and not come back again to your site...
Good to know! I don't believe PM is possible on hacker news so I hope you don't mind that I describe some details right here?
My browser is the latest (v73.0.1) Firefox on the latest build of Windows 10. I confirmed the issue with all addons disabled, so it is not an addon issue. I think I know what may be responsible. When I initially load the page, the back button works as intended for about a second. After that delay the page seems to load some resources from static.parastorage.com and www.mymobileapp.online. Once those resources have finished loading, the back button does not navigate back to the HN article on the first press; you have to press once more. So I presume a script from one of those domains is responsible. Hope this helps!
Broken back buttons are often due to a site where a placeholder loads and it turns around and loads the real thing. Back takes you to the placeholder which promptly takes you back where you were.
There are sites that explicitly mess with the back button but I haven't seen one in ages.
I'm curious, how would fiddling with navigation impede text extraction?
The site renders perfectly fine without JavaScript, and the markup looks straightforward enough.
I saw this and thought it might be something my company could use so I contacted them for info.
The CEO Simon Mahony basically told me to piss off when I told him I thought the site was misleading since there was no product or service to directly purchase. They make custom developed software that you must pay their consultants to integrate. I would not do business with such a company that acts so unprofessionally even if they have a decent team.
On a personal project, I had a good experience extracting PDF text using Tabula[1]. You specify the bounding boxes where desired data is, and it spits out the content it finds.
It still hits the issues mentioned in this article (surprise spaces appearing in middle of words, etc)
There's also camelot in Python [1]. Discovered it on HN [2]. Still a decent amount of manual work afterwards, though it's probably unreasonable to expect otherwise.
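A short example of camelot usage for anyone curious (the file name, page range, and flavor are just placeholders):

    import camelot

    tables = camelot.read_pdf("statement.pdf", pages="1-3", flavor="lattice")
    for table in tables:
        print(table.parsing_report)   # per-table accuracy / whitespace metrics
        df = table.df                 # pandas DataFrame of the extracted cells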
Does anybody regularly use Acrobat's text extraction engine? I've had fine results as far as accuracy goes when compared to other OCR engines, but one sticking point drives me nuts. My problem is, and I'm typically doing this in batches of thousands of files: if a PDF has a footer applied, Acrobat sees that as renderable text and blows off the rest of the page. I've tried all manner of sanitizing, removing hidden information, and saving as another PDF standard, and I still can't get around the plain-text footers/headers. In a perfect world I'd have unlimited Tesseract or ABBYY access, but we're trying to do this on the cheap, and I'm working with client data that I don't want to bang through Google. I'll have to poke at some of the open source tools mentioned so far, too.
14 years ago I used the personal edition of Abbyy FineReader to OCR about 400,000 scanned journal articles. It took me a few months.
The workflow was:
- Extract the page images as TIFF, and store the page ranges so I could map the page ranges back to the individual articles afterward.
- Concatenate a range of images into one big file, with an upper limit of (IIRC) about 4000 pages. FR would start to generate weird errors when I made the files any bigger than this.
- Run OCR over the giant 4000 page file.
- Export the result as one big PDF with OCR text layer under the scanned pages.
- Split the PDF back into individual PDF files corresponding to articles, using the data I saved in step 1.
- Optimize the individual PDF article files for compact storage, using the Multivalent [1] optimizer.
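For what it's worth, the splitting step (5) is easy to reproduce today with pypdf instead of PDFtk; `ranges` here is assumed to map an article id to its 1-based first and last pages in the big OCR'd file:

    from pypdf import PdfReader, PdfWriter

    def split_articles(big_pdf, ranges, out_dir="."):
        reader = PdfReader(big_pdf)
        for article_id, (first, last) in ranges.items():
            writer = PdfWriter()
            for i in range(first - 1, last):       # convert the 1-based range to 0-based indices
                writer.add_page(reader.pages[i])
            with open(f"{out_dir}/{article_id}.pdf", "wb") as fh:
                writer.write(fh)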
I did this with a combination of FineReader -- the only paid software -- Python, Multivalent, AutoHotKey, and PDFtk.
I was living on a grad student stipend at the time so I optimized for spending the least amount of cash possible, at the cost of writing my own automation to replace the batch processing found in more expensive editions of FineReader.
The most time consuming part was dealing with weird one-off errors thrown by FR's OCR engine. I had to resolve them all manually. They were too varied and infrequent to be worth automating away.
I tried Acrobat's own OCR too before I resorted to FineReader, but it was pretty terrible. At the time it also appeared to make the PDF files significantly larger, which was weird since a text layer shouldn't take much additional storage.
It's interesting to see other views of PDF. As someone who lives in Illustrator, ripping every little piece of data out of a PDF to import into an Illustrator or InDesign file, then making a production PDF for large-format printing and fixing plenty of issues along the way, I find the text almost inconsequential to the whole thing. It's just another element among many: images, vector illustrations, etc. PDF might not be the best way to pass along pure text, but as a container for graphical representation it works pretty well. I build PDF files describing 20 ft walls with 1+ gigabyte images, complex vector illustrations, and finely formatted text, and it all prints out damn close to how I planned it, down to exact colors that match specific Pantone swatches. It's amazing what can be packed into a PDF...
You are probably working directly with native formats embedded in PDF without even processing the visualized elements. Adobe tools like to do that.
Sometimes, publishers make their PDF e-books from printed source in which images are “optimized” to low quality JPEGs, but next to them non-display Photoshop data streams with pristine megapixel illustrations are kept. If you catch big PDF files, check their insides, it's one line of `mupdf extract`.
Are there 30- to 40-year-old application formats that you think have done a better job adapting to new needs and 4 to 6 orders of magnitude improvements in the systems they run on?
I think TIFF has a number of advantages there. It was from the beginning an interchange format, so it had the opportunity to look at a bunch of existing formats and extract the commonality. It's also not an application format, so the pace of change is slower and more controlled; it can trail rather than lead. And it is of course a standard, which means a different set of dynamics around how things get added and how clear the specs have to be.
That's not to say it isn't great; I could well believe it. But I'm just not shocked that PSD and PostScript have ended up being a bit of a mess over the decades. I doubt I could have done any better.
The one that hits me all the time is trying to reference the OpenGL and OpenGL ES spec PDFs. The last numbered section of the specs contains state tables in landscape layout, versus the rest of the spec in portrait layout. Neither Chrome's nor Firefox's reader will search the text in these tables, which I need to reference often.
The fact that text might be oriented differently wasn't covered in the article. IIRC Preview on Mac might search there (I'm not near my Mac at the moment to check).
For academic papers: GROBID [0] is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications.
I had problems with copy-pasting Chinese text from PDFs before. The characters would come out as Kangxi radicals, rather than Traditional Chinese characters. They look the same, but are different code points!
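If it helps anyone hitting the same thing: the Kangxi radical code points carry compatibility decompositions to the corresponding unified ideographs, so NFKC normalization repairs that particular paste problem:

    import unicodedata

    pasted = "\u2F08"                             # KANGXI RADICAL MAN, looks just like 人
    fixed = unicodedata.normalize("NFKC", pasted)
    print(fixed == "\u4EBA")                      # True: mapped to the unified ideograph 人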
I've worked on the other end of this: trying to make it easy to extract text from PDFs that we generated. Turns out that is pretty hard too. There just isn't a good way to include metadata about how text flows. So columns, callouts, captions, etc. all cause problems. The PDF format just wasn't designed for text extraction.
On the other hand... OCR is now so good that it can be used for many PDF text extraction projects. Often there is no longer any need to bother with PDF internals: just screenshot the PDF document and parse that. A free PDF OCR service is, for example, ocr.space.
I'm still unclear on why SumatraPDF bothered implementing this anti-user copy prevention feature, it's really annoying to have to actually break out separate tooling to strip the flags on datasheets and schematics.
I used to work on PDF extraction during my bachelor's thesis, analyzing German law texts. The most fun part was that the text came in two columns. Sometimes the extraction worked in the correct order; sometimes two lines from two columns were recognized as one line. In the end I implemented something like the algorithm described in chapter 4.4 here: https://www.dbai.tuwien.ac.at/staff/hassan/pdf2html/final.pd...
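Not the algorithm from the linked paper, but a toy version of the idea: if a "line" coming out of the extractor contains one horizontal gap much wider than the typical word gap, treat it as two column fragments. Word tuples here are assumed to be (text, x0, x1):

    def split_columns(words, gap_factor=3.0):
        words = sorted(words, key=lambda w: w[1])                     # sort by left edge
        gaps = [(words[i + 1][1] - words[i][2], i) for i in range(len(words) - 1)]
        if not gaps:
            return [words]
        widest, at = max(gaps)
        typical = sorted(g for g, _ in gaps)[len(gaps) // 2]          # median word gap
        if typical > 0 and widest > gap_factor * typical:
            return [words[:at + 1], words[at + 1:]]                   # split at the column gutter
        return [words]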
I've been wrestling with a similar set of tasks, and have arrived at a similar set of tools and options.
How you process PDF depends greatly on the scale at which you're working with documents. For large-volume, high-speed processing, automation is necessary. Where you're translating a more stable corpus, human input may be tractable. The ability to look at source PDF, OCR, and an edited text version to correct for errors seems a part of that workflow.
Often it's possible to get close or approximate transcription using standard tools. I've found the Poppler library's "pdftotext" remarkably good with many PDFs, so long as there's some text within them: https://poppler.freedesktop.org
There's a general concept I've been working toward of a minimum sufficient document complexity, which follows a rough (though not strict) hierarchy. It's remarkable how much online content is little more than paragraph-separated text, with no further structure. Even images are not strictly informational, but rather window-dressing.
Typically, additional elements added are hyperlinks, images, text emphasis (italic and bold, often only the first), sections, lists, blockquotes, super- and sub-script, in roughly that order.
(A study looking at the prevalence of specific semantic HTML elements within a corpus would be ... interesting.)
Then there are the elements NOT natively supported in HTML: equations, endnotes/footnotes, tables of contents, etc.
It seems to me there should be an analogue of Kolmogorov complexity for the layout of textual documents. That is: there is a minimum necessary and sufficient level of markup (perhaps: number, type, and relationship of elements) required to lay out a specific work.
I've tagged out novel-length books in Markdown with little more than the occasional italic and chapter marks.
Documents which use more markup than is required are overspecified. This is the underlying problem with a great deal of layout, and the ability to reduce texts to their minimum complexity would be useful. It's a nontrivial problem, though large swathes of it should be reasonably achievable.
Another approach would be for information-exchange formats to actually be, you know, information exchange formats rather than PDF.
(Though the latter is often, though not always, well-suited to reading.)
We're working on a number of fun problems like this over at PDFTron in Vancouver! Currently growing and looking for software devs in a few different areas. ltully(at)pdftron.com
Not exactly an industrial solution, but for common types of text documents used by common people (i.e. books), k2pdfopt performs a lot of that magic under the hood.
I wouldn't bother with parsing the pdf. Directly reading from pixels can be more accurate than the parsed output, but will require some R&D. You'll need very high recall text detection and an accurate algorithm for OCR. And a lot of real documents as training data.
It's critical that the training data is good quality and much of the engineering effort should go into good annotation interfaces. We built an end-to-end system for all this at evolution.ai. Please email me if interested in an off-the-shelf solution. martin@evolution.ai.
An order of magnitude increase of time is very significant. If you're just processing a few documents with a lot of human oversight you may be right, but it's definitely not a generalised best approach, at least going by the article.
Too true! That's probably why all of the agencies near me wanted .doc files so they could scrape them and remove my address to insert themselves as middlemen with the aim of holding both employer and prospective employee hostage to their bounties.
I want to read the article, but it's pointless because PDF doesn't handle anything but ASCII chars well. Add some Asian languages and there is no way to get that text back.
What do you mean? As I understand it, it depends on the font: you can provide any sort of encoding. So Unicode is there; I don't see how that would be harder than with the Latin alphabet (which is still a hard problem, as per the article).
So all through this I’m thinking “just OCR it and be done”, and we get to:
> Why not OCR all the time?
> Running OCR on a PDF scan usually takes at least an order of magnitude longer than extracting the text directly from the PDF.
... so? Google can OCR video and translate it in something that feels like real-time; what PDF processing are they doing that is so performance bound?
> Difficulties with non-standard characters and glyphs
> OCR algorithms have a hard time dealing with novel characters, such as smiley faces, stars/circles/squares (used in bullet point lists), superscripts, complex mathematical symbols etc.
Sure, but more than the random shit you find in PDFs anyway?
> Extracting text from images offers no such hints
Finding an algorithm that approximates how a human approaches a page layout doesn’t feel like it would be all that hard.
Obviously it’s very easy to stand on the sidelines and throw stones, but parsing PDFs using anything other than OCR + some machine learning models to work out what the type of a piece of text feels like pretending we are still constrained by the processing costs of 5 years ago
1) Do you have a trillion or so dollars at your beck and call? If not, you're not Google.
> Finding an algorithm that approximates how a human...
2) ...is generally nigh impossible even for someone with Google's resources (e.g. Waymo, although when it comes to reading, it's somewhat usable). Also, look at 1)
Unless by approximate you mean toddler level. In that case:
Without having heard of or tested the solution, I'll bet anyone $1M that I can produce an image that produces an incorrect answer. Which would mean it's not "solved".
If I can produce an image that you incorrectly label as Bird or No Bird, does that mean it's accurate to say you cannot tell me if pictures have birds in them? Or is that needlessly pedantic beyond any practical use case and clearly the intended context?
"Doing better than me", or any other human, wasn't the problem proposed. Anything other than 100% accuracy means the problem isn't solved as there will always be room for a better solution.