Hacker News

Can I ask how you parse PDFs? I'm curious both in terms of reading the PDF data (Python library?) and parsing it (regex?)... and do you have to deal with OCR as well?



I use "pdftotext -layout" and then parse that. Here is some more info from people who have tried this approach:

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
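For anyone wanting to try this, here is a rough sketch of what that workflow could look like in Python. It assumes the `pdftotext` binary (from Poppler or Xpdf) is on your PATH. The function names `pdf_to_layout_text` and `split_columns`, and the rule of treating two or more spaces as a column break, are my own illustration, not necessarily what the commenter above does.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` on a PDF and return the text.

    Passing "-" as the output file makes pdftotext write to stdout.
    Assumes the pdftotext binary is installed and on PATH.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_columns(text):
    """Split layout-preserved text into rows of cells.

    Assumption: columns are separated by runs of two or more spaces,
    which is typical of -layout output but not guaranteed.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between blocks
        rows.append(re.split(r" {2,}", line))
    return rows
```

Usage would be something like `rows = split_columns(pdf_to_layout_text("report.pdf"))`, where `report.pdf` is a placeholder path. Note that pdftotext only extracts text that is already embedded in the PDF; scanned pages would need a separate OCR step first.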


Thanks!



