notafraudster on Feb 26, 2021 | on: Automatic Text Summarization in PDF Documents with...

The use of the CNN for page segmentation is great. I recently had to extract text from a few hundred thousand pages of PDFs that had actual text (so it's an easier problem than raw images), but that switched between one-, two-, and three-column layouts, sometimes within a single page. I ended up with what was basically a probabilistic model: I searched coordinate grids over each page and looked for low-density columns in the grid. It worked well enough for my exact dataset, but I don't think it would generalize very well, and at the time I was looking I wasn't satisfied with anything off the shelf. Kudos.
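
For readers who want a concrete picture of the grid idea described above, here is a minimal sketch. It is not the commenter's actual code. It assumes each word's horizontal extent is available as an `(x0, x1)` pair, and the function name, bin count, and thresholds are all hypothetical choices for illustration. It bins the page width into a grid, counts how many words cover each bin, and reports runs of low-density bins between text regions as likely column gutters.

```python
from typing import List, Tuple


def find_column_gutters(
    boxes: List[Tuple[float, float]],
    page_width: float,
    n_bins: int = 200,          # grid resolution (assumed value)
    max_density: float = 0.05,  # fraction of words a bin may hold and still count as "empty"
    min_gutter_bins: int = 3,   # ignore gaps narrower than this many bins
) -> List[Tuple[float, float]]:
    """Return (x_start, x_end) spans of low text density between columns."""
    if not boxes:
        return []

    bin_width = page_width / n_bins
    counts = [0] * n_bins

    # Count how many word boxes overlap each vertical strip of the page.
    for x0, x1 in boxes:
        first = max(0, int(x0 / bin_width))
        last = min(n_bins - 1, int(x1 / bin_width))
        for b in range(first, last + 1):
            counts[b] += 1

    threshold = max_density * len(boxes)
    dense = [b for b, c in enumerate(counts) if c > threshold]
    if not dense:
        return []

    # Only search between the first and last dense strip, so page margins
    # are never reported as gutters.
    gutters = []
    run_start = None
    for b in range(dense[0], dense[-1] + 1):
        if counts[b] <= threshold:
            if run_start is None:
                run_start = b
        else:
            if run_start is not None and b - run_start >= min_gutter_bins:
                gutters.append((run_start * bin_width, b * bin_width))
            run_start = None
    return gutters
```

To handle layouts that change within a page, one option under the same assumptions is to call this on horizontal bands of the page (for example, groups of adjacent lines) instead of on the whole page at once.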

konfuzio on Feb 26, 2021

Hi notafraudster, is this dataset or your approach public? Perhaps we can collaborate to expand our approach. FYI: We detect text embeddings automatically and use that to decide whether we need OCR. Thanks for the feedback!
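
As a rough illustration of the "do we need OCR?" decision mentioned above, here is a minimal sketch. It is not Konfuzio's implementation. It assumes the text already extracted from a page's text layer is passed in as a string, and the function name and the `min_chars` threshold are hypothetical.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a page with almost no extractable text is probably a scan.

    `min_chars` is an assumed cutoff; a real pipeline would tune it.
    """
    # Strip all whitespace so pages containing only blank lines count as empty.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars
```

A page that returns `True` would be routed to OCR, while one that returns `False` can use its existing text layer directly.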
