
This is arguably a lot more than I need. I'm a hoarder in that I have every email I've ever sent or received (bar junk mail), and every piece of paper I've ever received.

Most of my paper is now scanned - I think I have two boxes left in my garden shed. I don't bother with OCR because search doesn't help me when I don't know what to search for (e.g. the invoice for a jumper I bought in 2010 - fashion labels rarely call their jumpers "jumper").

And so I rely on metadata. There's not much out there in terms of open-source tagging software, and even less in terms of an open tagging approach. I ended up with TagSpaces, which is a web app packaged up as a native app. The approach to tagging is good (tags appended to the file name), but the app is abysmally poor. It's slow - I wait up to 30 seconds for a pop-up menu to appear - and it assumes tag-based searches work in only one way.
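For anyone curious what "tags appended to the file name" can look like in practice, here is a minimal sketch. It assumes a bracketed convention such as `name[tag1 tag2].ext`; the helper names (`split_tags`, `add_tag`, `has_all_tags`) are made up for illustration and are not any app's API.

```python
import re
from pathlib import Path

# Assumed convention: a single "[tag1 tag2]" block at the end of the stem.
TAG_BLOCK = re.compile(r"\[([^\[\]]*)\]$")

def split_tags(filename):
    """Return (base_name, tags, extension) for e.g. 'scan[contract 2010].pdf'."""
    path = Path(filename)
    stem, ext = path.stem, path.suffix
    match = TAG_BLOCK.search(stem)
    if not match:
        return stem, [], ext
    tags = match.group(1).split()
    return stem[:match.start()].rstrip(), tags, ext

def add_tag(filename, tag):
    """Return a new file name with `tag` added (no duplicates)."""
    base, tags, ext = split_tags(filename)
    if tag not in tags:
        tags.append(tag)
    return f"{base}[{' '.join(tags)}]{ext}"

def has_all_tags(filename, wanted):
    """True if the file name carries every tag in `wanted`."""
    return set(wanted) <= set(split_tags(filename)[1])
```

Because the tags live in the name itself, they survive copying, backups and changes of software, and a search is just a filter over a directory listing, e.g. `[p for p in Path(".").iterdir() if has_all_tags(p.name, ["contract"])]`.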

The intent is to write some native apps to solve my biggest problems. For now I'm still trying to clear the backlog of unscanned paper docs (I'm not going to have someone else do this for me, because privacy). I tag important stuff, like employment contracts, mortgage agreements, passports and birth certificates...

Hope to have everything done by the time I cash in my chips. Might make for a useful dataset for someone somewhere some day.




A few years ago I was involved with a startup that built a document management system for consumers, and we actually got pretty good results with OCR + automatic tagging based on a very simple database that maps keywords to tags.

Let's say you want to auto-tag bills and other documents from your ISP. So you add the ISP's name, phone number, website address etc. into the database - any uniquely-identifying keywords that typically appear on the documents that they send. Now any document that contains these keywords will get tagged as "ISP", making it very easy to find in the future.

Even if the OCR quality isn't perfect, at least one of these keywords will most likely get matched.

Another example - you could add the names of your family members as keywords, making it easy to find all documents related to Jenny or Susan.
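To make the idea concrete, here is a minimal sketch of such a keyword-to-tag lookup. It is not the startup's actual code: the table contents, the ISP name, phone number and URL are all invented placeholders, and `auto_tag` is a hypothetical helper.

```python
import re

# Hypothetical keyword -> tag table; every entry here is a made-up example.
KEYWORD_TAGS = {
    "example isp ltd": "ISP",
    "www.example-isp.test": "ISP",
    "0800 555 0100": "ISP",
    "susan": "Susan",
    "jenny": "Jenny",
}

def normalize(text):
    """Lower-case and collapse whitespace so OCR line breaks don't matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def auto_tag(ocr_text, keyword_tags=KEYWORD_TAGS):
    """Return the sorted tags whose keywords appear anywhere in the OCR text."""
    haystack = normalize(ocr_text)
    return sorted({
        tag
        for keyword, tag in keyword_tags.items()
        if normalize(keyword) in haystack
    })

# The phone number is misread ("5S5"), but the company name still matches.
print(auto_tag("EXAMPLE  ISP Ltd\nInvoice for Susan\nCall 0800 5S5 0100"))
```

Listing several keywords per sender is what gives the tolerance to OCR errors: only one of them has to come through cleanly. Note that this sketch uses plain substring matching, so "susan" would also match "Susanna"; word-boundary matching would be a straightforward refinement.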

You could argue that full-text search would achieve the same result, but uploading documents into the system and having them auto-tagged as "ISP", "car-payments", "Walmart", "Susan" and so on feels a little bit like magic, as if the system is actively helping you organize your papers.

The keyword approach is also very easy to understand and tweak, unlike more rigorous but opaque methods such as document clustering on tf-idf vectors.


Out of curiosity, what is the state of the art today for extracting text or other data from scanned documents (forms, legal docs, receipts, etc.)?


I don't have an exact answer but can tell you that Expensify still resorts to human parsing sometimes. How often "sometimes" is, I have no idea. I would guess a lot.


Everything you say is true, and the value, I think, is clear. The part I don't like is that I have to create a database manually. Granted, the results will save me time, as I won't have to manually tag the routine stuff.

Food for thought.



