
If you go this route you will discover a world full of pain.

PDFs look nice on screen and/or printed, but internally they are not always so nice for data extraction (unless the creator specifically set them up for data extraction).

Internally, a PDF is simply a set of instructions to position font glyphs at 2D coordinates on a virtual sheet of paper. Depending on how the creating system generated the PDF, the text might be relatively easy to extract (the PDF was written left to right, top to bottom, and positions nothing smaller than whole words at a time) or a royal pain (each individual letter is independently positioned at a specific x,y coordinate [this is unlikely, but possible]).
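
To make the "royal pain" case concrete, here is a rough sketch of what an extractor has to do when all it gets is positioned glyph runs. Everything in it is an assumption for illustration: the `(x, y, text)` tuple format, the `glyphs_to_lines` name, and the tolerance values are hypothetical, and a fixed `char_width` stands in for the real font metrics an actual PDF would supply. It is not how any particular PDF library works.

```python
from typing import List, Tuple

# Hypothetical input format: (x, y, text), with y growing upward.
Glyph = Tuple[float, float, str]


def glyphs_to_lines(
    glyphs: List[Glyph],
    line_tol: float = 2.0,    # assumed: max baseline difference within one line
    space_gap: float = 3.0,   # assumed: horizontal gap that counts as a space
    char_width: float = 5.0,  # assumed: fixed glyph width instead of font metrics
) -> List[str]:
    """Rebuild text lines from independently positioned glyph runs."""
    # Visit glyphs top to bottom (larger y first), then left to right.
    ordered = sorted(glyphs, key=lambda g: (-g[1], g[0]))

    # Group glyphs whose baselines are within line_tol of the line's first glyph.
    rows: List[Tuple[float, List[Glyph]]] = []
    for g in ordered:
        if rows and abs(rows[-1][0] - g[1]) <= line_tol:
            rows[-1][1].append(g)
        else:
            rows.append((g[1], [g]))

    lines: List[str] = []
    for _, row in rows:
        row.sort(key=lambda g: g[0])  # left to right within the line
        text = ""
        prev_end = None
        for x, _, s in row:
            # Insert a space only when the horizontal gap is large enough.
            if prev_end is not None and x - prev_end > space_gap:
                text += " "
            text += s
            prev_end = x + char_width * len(s)
        lines.append(text)
    return lines


# Glyphs arrive in arbitrary order, some as single letters, some as whole words.
sample = [(30, 700.4, "there"), (10, 680, "Bye"), (15, 700, "i"), (10, 699.8, "H")]
print(glyphs_to_lines(sample))
```

Even this toy version needs three tuning knobs, and each one can break on a generator that spaces or aligns text differently.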

If you intend to consume a specific PDF from a specific generator you'll have better luck (because you can adapt to that specific generator's methods), but if you expect to extract from any PDF from any source you'll be constantly updating to cover some PDF creator program's quirks that you had not seen before.




Of the ChatGPT plugins, OCR gives the best results so far.



