Hacker News
Automatic Text Summarization in PDF Documents with Faster R-CNN and PEGASUS (konfuzio.com)
66 points by konfuzio on Feb 26, 2021 | 9 comments



The use of the CNN for page segmentation is great. I recently had to extract text from a few hundred thousand pages of PDFs which had actual text (so it's an easier problem than raw images), but which moved around between one-, two-, and three-column layouts, sometimes within a page. I ended up building what was basically a probabilistic model: I searched over coordinate grids and looked for low-density columns in the grid. It worked well enough for my exact dataset, but I don't think it would generalize very well, and at the time I wasn't satisfied with anything off the shelf. Kudos.
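The grid-search idea described above can be sketched roughly like this — a minimal pure-Python illustration, assuming word bounding boxes have already been extracted from the PDF; the function name, bin width, and gap threshold are all made up for the sketch, not the commenter's actual code:

```python
def find_gutters(word_boxes, page_width, bin_width=5, min_gap=20):
    """Find column gutters by locating empty vertical strips of the page.

    word_boxes: iterable of (x0, y0, x1, y1) word bounding boxes.
    min_gap: minimum gutter width (in page units) to count as a column break.
    Returns the x-coordinate of the centre of each internal gutter.
    """
    n_bins = int(page_width // bin_width) + 1
    density = [0] * n_bins
    # Project every word box onto the x-axis and count coverage per bin.
    for x0, _, x1, _ in word_boxes:
        for b in range(int(x0 // bin_width), min(int(x1 // bin_width), n_bins - 1) + 1):
            density[b] += 1
    gutters = []
    i = 0
    while i < n_bins:
        if density[i] == 0:
            j = i
            while j < n_bins and density[j] == 0:
                j += 1
            # Keep only internal gaps wide enough to be a column gutter
            # (runs touching the page edges are just margins).
            if i > 0 and j < n_bins and (j - i) * bin_width >= min_gap:
                gutters.append((i + j) * bin_width / 2)
            i = j
        else:
            i += 1
    return gutters
```

Splitting word boxes at the returned x-coordinates would then recover the per-column text, which matches why the approach is dataset-specific: the empty-strip assumption breaks on pages with figures or tables spanning columns.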


Hi notafraudster, is this dataset or your approach public? Perhaps we can collaborate to expand our approach. FYI: we detect embedded text automatically and use that to decide whether OCR is needed. Thanks for the feedback!
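The "decide whether we need OCR" step could be as simple as a density check on the text layer extracted from the PDF — a hypothetical sketch with a guessed threshold; Konfuzio's actual rule isn't published:

```python
def needs_ocr(extracted_text, min_chars=25):
    """Decide whether a page should be sent to OCR.

    extracted_text: text pulled from the PDF's embedded text layer,
    which is empty or near-empty for scanned pages. The min_chars
    threshold is an illustrative guess: pages with fewer than that
    many visible characters are treated as scans.
    """
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars
```

A real pipeline would likely also check for garbage encodings (e.g. a text layer produced by a bad previous OCR pass), not just character count.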


Ever thought about trying to do extractive summarization the same way? I'm constantly frustrated that there is no extractive PEGASUS variant, and that all the existing transformer-based extractive models rank (or rather highlight) whole sentences, rather than highlighting/underlining at the word level like most humans do.
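To make the word-level idea concrete: here is a toy frequency-based saliency scorer — nothing like PEGASUS or a trained model, just an illustration of scoring and highlighting individual words rather than ranking sentences; the function and stopword list are invented for the sketch:

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that"}

def highlight_words(text, top_k=5):
    """Score words by frequency and return the top_k as 'highlights'.

    A stand-in for a learned token-saliency model: a word-level
    extractive summarizer would replace the Counter with per-token
    model scores, then underline the highest-scoring spans.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(top_k)]
```

The point of the sketch is the output granularity: it selects words, not sentences, which is what the sentence-ranking extractive models don't give you.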


We collected all models in our documentation at https://deep-tech.com/training_documentation.html#text-modul... Extractive PEGASUS is not yet there. What exactly do you mean by highlighting/underlining?


I'm confused by a few things - hopefully someone with more experience can help.

* The text mentions OCR, but the screenshots show documents with fidelity far beyond what I would expect via scanning. I would guess that the screenshots in actuality show PDFs that include character and layout information, i.e. they don't simply contain scanned images. If my guess is correct, why is OCR needed?

* How does segmentation contrast with layout analysis, or are they synonymous?

* I know a lot of work has been done on layout analysis in commercial off-the-shelf OCR software. How do these results (up to but obviously not including the summarization itself) compare? Or, how would you expect them to compare?

Thanks!


Hi Walter,

thanks for your questions! We have updated the post and included the answers to your questions.

- Of course, this step can be omitted if the documents already contain embedded text. However, OCR is often necessary, for example to read tables or scanned documents. In our software solution, users can decide per project whether to use the embedded text, Tesseract, or a commercial OCR engine.

- By page segmentation, also called layout analysis, we mean the division of a document into separate parts.

- Segmentation is done with our own trained model, because we couldn't achieve the results we needed with off-the-shelf software like Tesseract or Abbyy FineReader.
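For intuition on what sits downstream of such a detection model: a Faster R-CNN segmenter outputs region boxes, which then need to be put into reading order before summarization. A simple column-then-row sort might look like this — the box format and the column-gap tolerance are assumptions for illustration, not Konfuzio's actual pipeline:

```python
def reading_order(boxes, column_gap=50):
    """Sort detected region boxes (x0, y0, x1, y1) into reading order.

    Groups boxes into columns by their left edge, then reads each
    column top to bottom, leftmost column first. column_gap is an
    assumed tolerance for what counts as the same column.
    """
    ordered = sorted(boxes, key=lambda b: b[0])
    columns = []
    for box in ordered:
        for col in columns:
            if abs(col[0][0] - box[0]) < column_gap:
                col.append(box)
                break
        else:
            columns.append([box])
    result = []
    for col in sorted(columns, key=lambda c: c[0][0]):
        result.extend(sorted(col, key=lambda b: b[1]))
    return result
```

Heuristics like this are exactly where off-the-shelf layout analysis tends to break on mixed one-/two-/three-column pages, which is presumably why a trained model was needed upstream.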


Thanks for the follow-up.

Did you see this, posted earlier today? It looks like the actual data isn't available yet, however.

https://news.ycombinator.com/item?id=26339769

Wit: Wikipedia-Based Image Text Dataset (github.com/google-research-datasets)


Do you offer this service of text summarization via API? I didn't exactly catch that, but we would be interested.

Note: I think the "Register for free" button for the webinar is broken.


Hi David, thanks for reporting the link issue! We'll fix it. It should be https://app.konfuzio.com

The page segmentation API is already live; the PDF summarization API is a work in progress. We wanted to share our approach now to incorporate feedback early! We are also working on a retraining loop to fine-tune our model on a small sample of other documents. So far we support this for custom NER models and document classification.

Best, Chris



