> they're just OCRing the rendered page

Not quite. Usually the PDF specifies each character (although the reader still has to do a slightly wacky conversion from glyph name to Unicode character), but each character's position is given as an (x, y) coordinate. So the reader has to reconstruct the order the characters come in, add spaces and newlines, etc.
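
To make the reconstruction step concrete, here is a rough sketch of the idea, not how any particular PDF reader does it. The `Glyph` record, the `reconstruct_text` function, and both thresholds are made up for illustration, and it assumes horizontal, single-column, left-to-right text with y increasing up the page:

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned character."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; assumed to increase up the page
    width: float  # advance width of the glyph


def reconstruct_text(glyphs, line_tol=2.0, space_ratio=0.3):
    """Rebuild reading order from unordered positioned glyphs.

    line_tol and space_ratio are illustrative guesses, not standard values.
    """
    # Group glyphs into lines: walk from the top of the page down and start
    # a new line whenever the baseline moves by more than line_tol.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        # Within a line, order glyphs left to right.
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # A horizontal gap noticeably wider than normal glyph spacing
            # is treated as a word break.
            gap = cur.x - (prev.x + prev.width)
            if gap > space_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)


# Glyphs given in scrambled order, as they might appear in a content stream.
sample = [
    Glyph("k", 5, 88, 5),
    Glyph("a", 14, 100, 5),
    Glyph("H", 0, 100, 6),
    Glyph("o", 0, 88, 5),
    Glyph("i", 6, 100, 3),
]
print(reconstruct_text(sample))
```

Even this toy version has to guess at two thresholds, and it would fall over on multi-column layouts, rotated text, or tightly kerned fonts, which is roughly why extracted PDF text so often comes out with odd spacing.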