


But that was at least a bug, and a somewhat understandable one (though still stupid). I did my MSc on statistical methods for reducing error rates in OCR, and one of the methods that actually worked very well was nearest-neighbour variations over small windows of the pixel data. As part of that I did a literature review, of course, and there has been quite a lot of work on algorithms for cleaning up images by replacing patches of pixels with presumed "clean" samples (sometimes from a known font, but more often by applying clustering methods to patches from the image itself). Get that wrong and you'd very easily end up with something like this.

My own methods would also have easily produced this kind of error if you set the threshold for what counts as identical when clustering too loosely. For OCR the risk is somewhat mitigated by people not trusting the output to be error-free, so it can be an acceptable tradeoff if it reduces the overall error rate. But if you're outputting the raw pixel data and letting people think it's an unmanipulated image, you're begging for trouble.
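
To make the threshold point concrete, here is a toy sketch of patch replacement by greedy clustering. It is not the method from my thesis or from any particular product; the 3x5 glyph bitmaps, the Hamming distance, and the function names are all made up for illustration. Each patch is replaced by the first earlier patch that lies within `threshold` differing pixels.

```python
def parse(glyph):
    """Flatten a space-separated string of 0/1 rows into a tuple of ints."""
    return tuple(int(c) for row in glyph.split() for c in row)


def hamming(a, b):
    """Number of pixel positions at which two equally sized patches differ."""
    return sum(x != y for x, y in zip(a, b))


def replace_with_representatives(patches, threshold):
    """Greedy clustering: reuse the first representative within `threshold`
    pixels of a patch, otherwise make the patch a new representative."""
    representatives = []
    output = []
    for patch in patches:
        match = next(
            (r for r in representatives if hamming(r, patch) <= threshold),
            None,
        )
        if match is None:
            representatives.append(patch)
            match = patch
        output.append(match)
    return output


# Hypothetical 3x5 bitmaps (illustrative shapes, not taken from a real font).
SIX = parse("011 100 111 101 111")
NOISY_SIX = parse("011 100 111 101 110")  # SIX with one pixel flipped
EIGHT = parse("111 101 111 101 111")      # differs from SIX in two pixels

patches = [SIX, NOISY_SIX, EIGHT]

# Tight threshold: the noisy patch is cleaned up, distinct glyphs survive.
cleaned = replace_with_representatives(patches, threshold=1)

# Loose threshold: the eight is silently replaced by the six.
overmerged = replace_with_representatives(patches, threshold=2)
```

With `threshold=1` the noisy six is snapped to the clean six and the eight is left alone. With `threshold=2` the eight also gets replaced by the six, and the output still looks like a perfectly clean image.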



