
> they're just OCRing the rendered page

Not quite. Usually the PDF specifies each character (although the reader still has to do a slightly wacky conversion from glyph name to Unicode character), but each character is placed at an (x, y) position rather than stored in reading order. So the reader has to reconstruct the order the characters come in, add spaces and newlines, etc.
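
To make that concrete, here is a rough sketch of the reordering step. It is not how any particular PDF reader does it: the `Glyph` type, the `reconstruct_text` function, and both tolerance values are made up for illustration, and it assumes a single column of horizontal text with the glyphs already decoded to Unicode.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned character."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; y grows upward, as in PDF user space
    width: float  # advance width


def reconstruct_text(glyphs, line_tol=2.0, space_gap=1.0):
    """Rebuild reading order from positions (single column, horizontal text).

    line_tol and space_gap are assumed thresholds in the same units as x/y.
    """
    # 1. Group glyphs into lines: sort top to bottom, then start a new line
    #    whenever the baseline differs from the line's first glyph by more
    #    than line_tol.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    # 2. Within each line, sort left to right and insert a space wherever the
    #    horizontal gap between neighbouring glyphs exceeds space_gap.
    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "
            text += cur.char
        out.append(text)

    # 3. Join lines with newlines, which the PDF never stored explicitly.
    return "\n".join(out)


# Glyphs deliberately given out of reading order.
sample = [
    Glyph("o", 16, 688, 5),
    Glyph("H", 10, 700, 6),
    Glyph("u", 27, 688, 5),
    Glyph("i", 16, 700, 3),
    Glyph("y", 10, 688, 5.5),
]
result = reconstruct_text(sample)
```

Fixed thresholds like these fall over quickly on real documents (multiple columns, rotated text, varying font sizes), which is part of why extracted PDF text is so often mangled.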



