
I'm not ignorant of the needs and concerns of self-promotion in order to build a popular campaign...but I hope they have a technical advisor who will, at some point, inform them about the technique of OCR and how a large hashtag-watermark can obstruct such a technique.

Also, minor detail, but the images should also be rotated to their proper orientation. Crowdsourcing data collection has to be as frictionless as possible, and this is an easy fix.

Depending on how many actual documents there are (i.e. how many pages are in those 200 folders), it might be worth it to go the route of ProPublica's "Free the Files" project, in which they built a mini-app that let people voluntarily transcribe the important fields in each document:

https://www.propublica.org/series/free-the-files

Their Al Shaw wrote a piece about designing for efficient crowd-sourcing:

http://www.propublica.org/nerds/item/casino-driven-design

They even open-sourced the Rails plugin for it:

http://www.propublica.org/nerds/item/transcribable-free-the-...



> the images should also be rotated to their proper orientation

there's a rotate button when you click through to the detail page. We're tracking which rotation users pick for each image and setting its orientation accordingly. It's still a bit buggy, but we're getting there.
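For anyone curious how crowd-sourced rotation could be settled, here is a minimal sketch. It is not the site's actual code: the function name `consensus_rotation` and the thresholds `min_votes` and `min_share` are assumptions made up for illustration. The idea is to accept a rotation only once enough users agree on it.

```python
from collections import Counter


def consensus_rotation(votes, min_votes=3, min_share=0.6):
    """Return the agreed rotation in degrees (0, 90, 180 or 270), or None.

    votes: rotation angles (in degrees) that users applied to one image.
    min_votes and min_share are hypothetical thresholds, not known values.
    """
    # Keep only quarter-turn votes and fold them into the range 0..270.
    normalized = [v % 360 for v in votes if v % 90 == 0]
    if len(normalized) < min_votes:
        return None  # not enough data to trust any answer yet

    angle, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) < min_share:
        return None  # users disagree too much

    return angle
```

Requiring both a minimum vote count and a minimum share keeps one accidental click from flipping an image that most people already see the right way up.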

> the technique of OCR and how a large hashtag-watermark can obstruct such a technique

We're running OCR over non-watermarked versions. We're hoping to have a search function up later today.

Thanks for the links -- we'll look at them and see what we can use.


> there's a rotate button when you click through to the detail page.

Rotating some of the images (for example Img 999) seems to cut off the edges of the image, and clicking to zoom doesn't help.
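I don't know how the site does its rotation, but a common cause of this symptom is rotating the pixels inside a canvas that keeps its original width and height, so a landscape page turned 90 degrees loses its ends. A minimal sketch of the size the canvas would need instead (the helper name `rotated_canvas_size` is hypothetical):

```python
import math


def rotated_canvas_size(width, height, degrees):
    """Return (width, height) of the smallest canvas that holds the whole
    image after rotating it by the given angle about its center."""
    theta = math.radians(degrees % 360)
    cos_t, sin_t = abs(math.cos(theta)), abs(math.sin(theta))

    # Bounding box of a rotated rectangle.
    new_w = width * cos_t + height * sin_t
    new_h = width * sin_t + height * cos_t

    # The small epsilon keeps floating-point noise from adding a stray pixel.
    return math.ceil(new_w - 1e-9), math.ceil(new_h - 1e-9)


print(rotated_canvas_size(800, 600, 90))  # (600, 800)
```

If the container stays 800 wide and 600 tall after a quarter turn, the top and bottom 100 pixels of the rotated page fall outside it, which would match the cropping described above.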



