The use of the CNN for page segmentation is great. I recently had to extract text from a few hundred thousand pages of PDFs which had actual text (so it's an easier problem than raw images), but which moved between one-, two-, and three-column layouts, sometimes within a single page. I ended up with basically a probabilistic model: I searched over coordinate grids and looked for grid columns with low text density. It worked well enough for my exact dataset, but I think it would not generalize very well, and at the time I was looking I wasn't satisfied with anything off the shelf. Kudos.
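For readers who want to picture the low-density-column idea, here is a minimal sketch. It is not the commenter's actual code. It assumes the horizontal extents of words have already been extracted from the PDF, and it swaps in a plain density threshold for the probabilistic model. The function name, parameters, and default thresholds are all hypothetical.

```python
from typing import List, Tuple


def find_column_gutters(
    word_spans: List[Tuple[float, float]],
    page_width: float,
    bin_width: float = 2.0,         # hypothetical grid resolution, in page units
    max_density: float = 0.02,      # hypothetical "low density" threshold
    min_gutter_width: float = 10.0  # hypothetical minimum gutter width
) -> List[Tuple[float, float]]:
    """Return (x_start, x_end) ranges that look like gaps between columns.

    word_spans holds the (x0, x1) horizontal extent of every word on the page.
    """
    n_bins = int(page_width // bin_width) + 1
    counts = [0] * n_bins

    # Count how many words overlap each vertical strip of the grid.
    for x0, x1 in word_spans:
        first = max(0, int(x0 // bin_width))
        last = min(n_bins - 1, int(x1 // bin_width))
        for b in range(first, last + 1):
            counts[b] += 1

    total = max(len(word_spans), 1)
    gutters = []
    run_start = None

    # Scan one bin past the end so a run reaching the right edge is closed.
    for b in range(n_bins + 1):
        is_low = b < n_bins and counts[b] / total <= max_density
        if is_low and run_start is None:
            run_start = b
        elif not is_low and run_start is not None:
            # Runs touching either page edge are margins, not column gutters.
            touches_margin = run_start == 0 or b == n_bins
            width = (b - run_start) * bin_width
            if not touches_margin and width >= min_gutter_width:
                gutters.append((run_start * bin_width, b * bin_width))
            run_start = None

    return gutters
```

Running this on horizontal bands of a page rather than the whole page is one way to cope with layouts that change partway down, although picking the band height is its own tuning problem.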
Hi notafraudster, is this dataset or your approach public? Perhaps we can collaborate to expand our approach. FYI: we automatically detect whether a PDF has embedded text and use that to decide whether OCR is needed. Thanks for the feedback!