Ask HN: Converting PDF into CSV
2 points by ed_balls 11 months ago | 4 comments
I'd like to create a simple tool for a coworker that would read a PDF file and convert it to CSV. Usually it's an invoice or a file with rates.

I could take a screenshot and paste it into ChatGPT, but it struggles (and it cannot handle PDFs, just images).

Is there a better way to automate this?




If you go this route you will discover a world full of pain.

PDFs look nice on screen and/or printed, but internally they are not always so nice for data extraction (unless the creator specifically set them up for data extraction).

Inside a PDF, the structure is simply instructions to position font glyphs at 2D coordinates on a virtual sheet of paper. Depending on how the creating system generated the PDF, it might be relatively easy to extract (the PDF was created left to right, top to bottom, and positions nothing smaller than whole words at a time) or a royal pain (each individual letter is independently positioned at a specific x,y coordinate [this is unlikely, but possible]).

If you intend to consume a specific PDF from a specific generator you'll have better luck (because you can adapt to that specific generator's methods), but if you expect to extract from any PDF from any source you'll be constantly updating to cover some PDF creator program's quirks that you had not seen before.
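
To make the "glyphs at 2D coordinates" point concrete, here is a rough Python sketch of what a converter has to do once it has word positions: group words into rows by their y coordinate, order each row by x, and write the result as CSV. Everything here is an assumption for illustration: the (x, y, text) input shape, the function names words_to_rows and rows_to_csv, the y_tolerance value, and the convention that y grows downward (native PDF coordinates grow upward, so you may need to flip the sort). A real PDF library would have to supply the word positions; this only uses the standard library.

```python
import csv
import io


def words_to_rows(words, y_tolerance=2.0):
    """Group (x, y, text) words into rows, top to bottom, left to right.

    Hypothetical helper: assumes y grows downward and that words on the
    same visual line differ in y by at most y_tolerance.
    """
    rows = []  # each entry is [reference_y, [(x, text), ...]]
    # Sort by vertical position first so words on the same line are adjacent.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Within each row, order the cells left to right by x.
    return [
        [text for _, text in sorted(cells, key=lambda c: c[0])]
        for _, cells in rows
    ]


def rows_to_csv(rows):
    """Serialize a list of rows to CSV text using the standard csv module."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up positions, deliberately out of reading order and slightly misaligned.
sample_words = [
    (300.0, 100.4, "Rate"),
    (50.0, 100.0, "Item"),
    (50.0, 120.0, "Widget"),
    (300.0, 119.5, "9.99"),
]
csv_text = rows_to_csv(words_to_rows(sample_words))
```

This only covers the easy case of whole words on clean lines. It does not merge individually positioned letters into words, and it does not line cells up into columns when a row has a missing value, which is where the generator-specific tuning comes in.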


Of the ChatGPT plugins, the OCR ones have given the best results so far.


What does the PDF contain? Maybe this online tool will be helpful; it converts PDFs into Excel files: https://www.ilovepdf.com/es/pdf_a_excel

If it works you can extract the CSV from there.

Hope it helps!


My experience converting from PDF has been ... less than pleasant (even manually copy-pasting from a PDF into Excel only works some of the time)



