I’m a contractor. One of my gigs involved writing parsers for 20-something different kinds of PDF bank statements. It’s a dark art. Once you’ve done it 20 times it becomes a lot easier. Now we simply POST a PDF to my service and it gets parsed and the data it contains gets chucked into a database. You can go extremely far with naive parsers. That is, regex combined with positionally-aware fixed-length formatting rules. I’m available for hire re: structured extraction from PDFs. I’ve also got a few OCR tricks up my sleeve (e.g. for when OCR thinks 0 and 6 are the same).
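
To give a flavour of what a “naive parser” means here (a minimal sketch with a made-up statement layout, not any particular bank’s): fixed column slices for the fields plus a regex to confirm the amount.

    import re

    # Hypothetical fixed-width layout: cols 0-10 date, 10-60 description, 60+ amount.
    AMOUNT_RE = re.compile(r"-?\d{1,3}(?:,\d{3})*\.\d{2}")

    def parse_line(line):
        """Parse one transaction line from pdftotext -layout output."""
        date = line[0:10].strip()
        description = line[10:60].strip()
        amount_text = line[60:].strip()
        if not AMOUNT_RE.fullmatch(amount_text):
            return None  # header, footer, or continuation line
        return {"date": date,
                "description": description,
                "amount": float(amount_text.replace(",", ""))}

    # parse_line(f"{'01/02/2020':<10}{'COFFEE SHOP':<50}{'4.50':>10}")
    # -> {'date': '01/02/2020', 'description': 'COFFEE SHOP', 'amount': 4.5}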


Many years ago, I regularly had to parse specifications of protocols from various electronic exchanges. The general approach I used was to do a first pass using a Linux tool to convert it to text: pdftotext. Something like:

    pdftotext -layout -nopgbrk -eol unix -f $firstpage -l $lastpage -y 58 -x 0 -H 741 -W 596 "$FILE"
After that, it was a matter of writing and tweaking custom text parsers (in python or java) until the output was acceptable, generally an XML file consumed by the build (mainly to generate code).

A frequent need was to parse tables describing fields (name, id, description, possible values etc.). Unfortunately, sometimes tables spanned several pages and the column width was different on every page, which made column splitting difficult. So I annotated page jumps with markers (e.g. some 'X' characters indicating where to cut).
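
To make the column-splitting part concrete, a minimal sketch (mine, not the original code), assuming the marker-annotation step has already produced cut positions for each page:

    def split_row(line, cuts):
        """Split one table row at the given character positions."""
        bounds = [0] + list(cuts) + [len(line)]
        return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]

    # Cut positions recovered from the 'X' markers differ per page,
    # because the column widths differ per page.
    cuts_per_page = {1: [12, 20, 58], 2: [10, 18, 55]}

    def parse_table(pages):
        """pages: {page_number: [row, row, ...]} for one logical table."""
        for page_no, rows in pages.items():
            for row in rows:
                yield split_row(row, cuts_per_page[page_no])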

As someone else said, this is like black magic, but kind of fun :)

Edit: grammar


I've discovered page-oriented processing in awk, which is a godsend for parsing PDFs.

See:

https://news.ycombinator.com/item?id=22156456

In the GNU Awk User's Guide:

https://www.gnu.org/software/gawk/manual/html_node/Multiple-...

Tracking column and field widths across page breaks is ... interesting, but more tractable.
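
The same page-at-a-time idea, sketched in Python rather than awk, relying on the form feeds pdftotext emits between pages (when -nopgbrk isn't used):

    import subprocess

    def pages_of(pdf_path):
        """Yield each page of `pdftotext -layout` output as a list of lines."""
        text = subprocess.run(
            ["pdftotext", "-layout", pdf_path, "-"],
            capture_output=True, text=True, check=True,
        ).stdout
        for page in text.split("\f"):   # form feed marks a page break
            yield page.splitlines()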


I worked for an epub firm that used a similar approach a while ago - we took PDFs and produced Flash (yes, that old) versions for online, and created iOS and Android apps for the publisher.

I've come across most of the problems in this post but the most memorable thing was when we were asked to support Arabic, when suddenly all your previous assumptions are backwards!


Oh my goodness, this whole thread is deja vu from some code I wrote to parse my bank statements. I arrived at exactly the same solution of "pdftotext -layout" followed by a custom parser in Python. And ran into the same difficulty with tables: I wrote a custom table parser that uses heuristics to decide where column breaks are.
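
For anyone curious, one such heuristic (a sketch of the idea, not the parent's actual code) is to treat runs of character columns that are blank on every row of the table as column breaks:

    def column_breaks(rows, min_gap=2):
        """Return cut positions: runs of >= min_gap columns blank in every row."""
        width = max(len(r) for r in rows)
        blank = [all(c >= len(r) or r[c] == " " for r in rows)
                 for c in range(width)]
        cuts, run_start = [], None
        for c, is_blank in enumerate(blank + [False]):
            if is_blank and run_start is None:
                run_start = c
            elif not is_blank and run_start is not None:
                if c - run_start >= min_gap:
                    cuts.append(run_start)
                run_start = None
        return cuts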


I work in the print industry and some clients have the naive idea they'll save money by formatting their own documents (naive because usually this just means a lot more work for us, which they end up paying for).

We need some metadata to rearrange and sort PDF pages for mailing and delivery (such as name, address, and start/end page for that customer).

Our general rule is that you provide metadata in an external file to make it easy for us. Otherwise, we run pdftotext and hope there's consistent formatting in the output (e.g. every first page has "Issue Date:", "Dear XYZ,", or something similar written on it).
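
Concretely, that step looks something like this (a minimal sketch; "Issue Date:" is just the assumed marker), grouping the pages of a merged print file back into per-customer documents:

    def customer_page_ranges(pages, marker="Issue Date:"):
        """Group pages into (first, last) index ranges, one per customer.

        Assumes every customer's first page contains the marker text.
        """
        starts = [i for i, page in enumerate(pages) if marker in page]
        ranges = []
        for n, start in enumerate(starts):
            end = starts[n + 1] - 1 if n + 1 < len(starts) else len(pages) - 1
            ranges.append((start, end))
        return ranges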

If that doesn't work, then we're re-negotiating. As you've said, it's usually not too difficult to build a parser for one family of PDF files based on a common setup, and you get to learn various tricks along the way. It is very difficult, though, to write a general parser.

Personally, I found parsing postscript easier since usually it was presented linearly.


I can cosign this methodology. I used to work in an organization that built PDFs for accounting and licensing documentation. I used a proprietary tool (Planetpress :( ) to generate the documents, using metadata from a separate input file (CSV or XML) to determine which column maps to which field.

The good thing about this was, as you have already outlined, that it allowed for some flexibility in what counted as acceptable input data. For specific address formats or names we could accept multiple formats as long as they were consistent and in the proper position in the input file.

Regarding renegotiating: we didn't get that far. However, if a customer within our organization was enlisting our expertise and could not produce an acceptable input file, we would go back to them and explain the format we require in order to generate the necessary documents. Of course, creating the documents through our own data pipelines was the better choice, but in some cases that was not an option at the time.

As far as doing the work of creating these documents in a tool like Planetpress is concerned, well, don't use Planetpress. You are better off doing it with the libraries of your favorite language, tbh. Nothing is worse than having to learn proprietary code (Presstalk/PostScript) that you will never be able to use anywhere else.


By re-negotiating I mean in terms of quoting billable hours. A rule of thumb for a typical Postscript scraper was around 20 hours end to end (dev, testing, and integration into our workflow system).

The problem we have with a lot of client files is that they look fine, but printers don't care about "looks fine": they crash hard when they run out of virtual memory due to poor structure, usually without a helpful error message, so that's more billable hours to diagnose. The most common culprit is workflows that generate single-document PDFs and then merge them, resulting in thousands of similar and highly redundant subset fonts.


Any tricks for decimal points versus noise? It's a terrifying outcome, and all I've got is doing statistical analysis on the data you've already got and highlighting "outliers".


Change the decimal point in the font to something distinctive before rasterizing.


For something like bank statements, I'd use the rigidly-defined formatting (both number formatting and field position) to inform how to interpret OCR misfires. My larger concern then would be missing a leading 1 (1500.00 v 500.00), but checking for dark pixels immediately preceding the number will flag those errors. And I suppose looking for dark pixels between numbers could help with missed decimals too.
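
Something along these lines, assuming you have the rendered page image and the OCR engine's bounding box for the number (a sketch using Pillow, not production code):

    from PIL import Image

    def dark_pixels_left_of(page_image_path, box, window=15, threshold=128):
        """True if there are dark pixels just left of an OCR'd number's box.

        box = (left, top, right, bottom) in pixels; a hit suggests a
        leading digit the OCR engine may have dropped.
        """
        left, top, right, bottom = box
        img = Image.open(page_image_path).convert("L")  # grayscale
        strip = img.crop((max(0, left - window), top, left, bottom))
        return any(p < threshold for p in strip.getdata())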


I've done this a bit. I define ranges per numeric field and if it exceeds or is below that range, I send it to another queue for manual review. Sometimes I'll write rules where if it's a dollar amount that usually ends ".00" and I don't read a decimal but I do have "00", then I'll just fix that automatically if it's outside my range.
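
Roughly this kind of rule, as a sketch (the range is per field and made up here):

    def check_amount(raw, lo, hi):
        """Return (value, needs_review); re-insert a missed decimal if that
        brings the value back into the expected range."""
        value = float(raw.replace(",", ""))
        if lo <= value <= hi:
            return value, False
        if "." not in raw and "," not in raw and raw.endswith("00"):
            fixed = value / 100            # treat the trailing "00" as ".00"
            if lo <= fixed <= hi:
                return fixed, False
        return value, True                 # off to the manual-review queue

    # check_amount("150000", 0, 5000) -> (1500.0, False)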


(Novice speaking) Maybe there's something about looking for the spacing / kerning that is taken up by a decimal point? (Not sure if OCR tools have any way to look for this)


Do you have a blog? I'd enjoy reading some of your tricks.

Also, how do you manage things when one of those banks decides to change the layout/format?


Interestingly, I was doing similar stuff for 3 years for a US company. Curious: is your client a legal tech company? Mine was.

The experience helped me roll out an API for developers, https://extracttable.com.

OCR tricks? Assuming post-processing dev stuff: may I know your OCR engine? We're backed by Kofax and OpenText, along with cloud engines like GVision as a backup.


Maybe there's a SaaS opportunity for you to explore.


I built such a service, but it is impossible to guarantee any reliable result. I ended up shutting it down.

The PDF standard is a mess, and the number of 'tricks' I've seen done is astonishing.

Example: to add a shade or border effect to text, most PDF generators simply add the text twice with a subtle offset and different colors. Result: your SaaS service returns every sentence twice.

Of course there were workarounds, but at some point it became unmaintainable.
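
A minimal sketch of one such workaround (not necessarily what this service did), assuming the extractor yields (text, x, y) runs: drop runs that repeat the same string within a sub-point offset of one already kept.

    def dedupe_runs(runs, tolerance=1.0):
        """Collapse text drawn twice with a subtle offset (fake shadows/borders)."""
        kept = []
        for text, x, y in runs:
            duplicate = any(text == t and abs(x - kx) <= tolerance
                            and abs(y - ky) <= tolerance
                            for t, kx, ky in kept)
            if not duplicate:
                kept.append((text, x, y))
        return kept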


I'm actually surprised that PDF hasn't been superseded by some form of embedded HTML by now.


It partly has: ePub [1], an open format for ebooks, contains HTML.

[1] https://en.wikipedia.org/wiki/EPUB


I'd say exactly the opposite. PDF makes it easy to create a document that looks exactly the way you want it to, which seems to be all that most web designers want (witness all the sites that force a narrow column on a large screen and won't reflow their text properly on a small screen).


In a way it has. In my experience, there have been multiple times where a "generate PDF" requirement has come up, with the best viable solution being "develop it in HTML using standard tech" followed by "and then convert it to PDF".


I blame CSS.


Why?


Hi! I’m the founder of a startup (https://siftrics.com) in this exact space.

The demand for automating text extraction is still very high — or at least it feels like it when you’re working around the clock to cater to 3 of your customers, only to wake up to 10 more the next day. We’re small but growing extremely quickly.


As someone who works in aviation... what made you choose an avionics company as your demo business?

I've bookmarked your site for future research... but the aviation part has me curious!


What space are your customers in? Healthcare? Government?


Everything. Insurance companies to fledgling AI startups.

It’s definitely harder to get government business because the sales process is so long and compliance is so stringent. That said, we are GDPR compliant.


Great demo video. Congrats on growing your startup!


Well, I am putting the finishing touches on a front end that allows extracting PDF text visually. It's also able to adjust when the PDF page size varies for a given document type. Once you build the extractor for a document type, it can run on a batch of PDFs and store the results to Excel or a database (or any other format). I sense this tool facilitates and automates a lot of the 'dark art' you mention. Of course there are always difficult documents that don't fit exactly into the initial extraction paradigm; for those I use the big guns ...


I'd also be interested in a blog or any basic tips/examples! I totally understand you don't want to give too much away, but I'm sure HN would love to see it!


I remember one of my first parsers was for a PDF, and I had to employ a similar methodology, relying on regex and "positionally-aware fixed-length" formatting rules. I would literally chunk specific groups by the number of spaces they contained, lol. I had to do very little manual intervention but, damn it all, it worked :D
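
That space-counting trick is roughly this (my reading of it, not the original code): split wherever two or more spaces appear in the layout-preserved text.

    import re

    def chunk(line):
        """Split a layout-preserved line into fields on runs of 2+ spaces."""
        return [f for f in re.split(r"\s{2,}", line.strip()) if f]

    # chunk("01/02/2020   COFFEE SHOP        4.50")
    # -> ['01/02/2020', 'COFFEE SHOP', '4.50']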


I've written similar code for investment banks, to extract financial reporting data from PDFs. It's shocking to think how much of the financial world runs on this kind of tin-cans-on-a-piece-of-string solution.


Do these PDFs even get printed?


My first internship was at a small company that did PDF parsing and building for EU government agencies and it was really painful work but paid an absolute shitton.


Are you open to doing more of this? Trying to do the same thing but I’d rather have an expert do it and focus on the app.


Are you building an app?


Building a personal finance app to keep track of multiple bank accounts and investments, categorise spending, etc.

Parsing statement PDFs from every bank is pretty hellish.


That’s why the open banking API is amazing these days.

In the past we purposely made it more difficult to parse our PDFs.


Do you have any tricks for dealing with missing unicode character mapping tables for embedded fonts?


What’s your contact info? Didn’t see any on your github.


dan at threeaddresscode dot com


Can I PM you?


Of course.


I don't think Hacker News supports PMs.

I managed to find your email address from your GitHub profile. Going to send you an old fashioned email.


Are you me? Wish I had known the insertion-order trick, though it isn't straightforward to implement with the stack I was using at a previous gig (Tabula + naive parsing + pandas data munging). I can expand on a few challenges I've run into when parsing PDFs:

# Parser drift and maintenance hell

Let's say that you receive 100 invoices a month from a company over the course of 3 months. You look over a handful of examples, pick features that appear to be invariant, and determine your parsing approach. You build your parser. You're associating charges from tables with the sections they're declared in, and possibly making some kind of classification to make sure everything adds up right. It works for the one or two example PDFs you were building against. It goes live.

You get a call or bug report: it's not working. You try the new PDF they send you. It looks similar, but won't parse because it is, in fact, subtly different. It has slightly different formatting of the phone number on the cover page but is identical everywhere else. You change things to account for that. You retest your examples; they break. Ok, two different formats, same month, same supplier. You fix it. Chekhov's Gun has been planted.

A month passes, it breaks. You inspect the offending pdf. Someone racked up enough charges they no longer fit on a page. You alter the parser to check the next page. Sometimes their name appears again, sometimes not, sometimes their next page is 300 pages away. It works again.

A few more months later, a sense of deja vu starts to set in. Didn't I fix this already? You start tracking three PDFs across 3 months:

pdf 1 : a -> b -> c (starts with format a, changes to be the same as pdf 2, then changes again)

pdf 2 : b -> b -> c (starts with one format, stays the same, then changes the same way as pdf 1)

pdf 3 : b -> a -> b (starts the same as pdf 2, changes to pdf 1's original format, then changes back)

What's the common factor between these version changes? The return address is determining the version.

PDFs are slightly different from office to office, with templates drifting slightly each month in diverging directions. You have to start reevaluating parsing choices and splitting up parsers. It's difficult to account for the linear maintenance cost you incur for each new supplier and to amortize it over a sizeable period of time. My arch-nemesis is an intern who got put to work fixing the invoices at one office of one foreign supplier.

# PDFs that aren't standards compliant

In this case, most PDF processing libraries will bail out. PDF viewers, on the other hand, will silently ignore some corrupted or malformed data. I remember seeing one that would be consistently off by a single bit. Something like `\setfont !2` needed to have '!' swapped out for another syntactically valid character that would leave the PDF's byte offsets unchanged.

TLDR: If you can push back, push back. Take your data in any format other than PDF if there is any way that is possible.



