Great work! I am a bit confused by the comparison with nougat throughout the repo. Nougat was specifically trained for academic documents, and I don't think anyone ever claimed Nougat was the best OCR model out there. That's kinda clear in your benchmark too, where you mention that nougat has higher accuracy on arXiv documents. You also mention that marker will convert fewer equations when compared to nougat, and yet you compare with nougat in terms of speed? (Again, only complaining because it's a model designed for academic documents.)
For anyone trying to do OCR on any pdf with math in it, definitely do try nougat. It's very easy to install (just a python package), and extracts the math, text, tables and beyond (in a .mmd file) with a single command line command. It also runs reasonably fast for personal uses - it takes about 30 seconds to convert a 6 page document using CPU only on my 4 year old i5 laptop.
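For reference, the whole invocation is roughly: nougat path/to/file.pdf -o output_dir (I'm going from memory on the flags, so check the repo's README for the current options).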
> I don't think anyone ever claimed Nougat was the best OCR model out there
Comparing two things doesn't inherently imply the previous thing was being touted with superlatives. It's just a way to juxtapose the new thing with something that may be familiar. As you said, nougat is easy to install and run, so it makes sense they'd compare to it. Would it be better if they added more libraries to the comparison? Absolutely; that'd be helpful.
How do you think nougat would handle RPG rulebook PDFs?
I'm looking for a good OCR model to help me transcribe sections of RPG books to markdown. Ideally, I'd like emphasis such as bold or italics to be transcribed.
The combo of text, numbers, and math symbols seems similar to technical and academic writing, but often has weird formatting, text boxes in the margins, and many diagrams.
I'm not completely sure, to be honest, but you should try it yourself with a sample page! I believe Hugging Face hosts it online on their demo pages, so you don't even have to install the package to test it on one page.
Author here: for my use case (converting scientific PDFs in bulk), nougat was the best solution, so I compared to it as the default. I also compare to naive text extraction further down.
Nougat is a great model, and converts a lot of PDFs very well. I just wanted something faster, and more generalizable.
Reading your comment and the parent's, I think there is perhaps a mistake in the comparison chart on GitHub? It says nougat takes around 700 seconds per page and yours around 90. This doesn't match the parent's claim that it took him 30 seconds to run nougat on 6 pages.
Yes, nougat is used as part of the pipeline to convert the equations (basically marker detects the equations then passes those regions to nougat). It's a great model for this.
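For anyone curious, here's roughly the shape of that step (not marker's actual code; the box coordinates are made up and the layout-detection part is hand-waved): render the page to an image, crop each detected equation region, and only those crops go to the math model.

    # Rough sketch of the idea (not marker's actual code; the equation box
    # is made up for illustration).
    from pdf2image import convert_from_path   # needs poppler installed

    page_img = convert_from_path("paper.pdf", dpi=200)[0]

    # pretend these boxes came from a layout/equation detector
    equation_boxes = [(100, 450, 520, 560)]   # (left, top, right, bottom) in px

    for i, box in enumerate(equation_boxes):
        crop = page_img.crop(box)
        crop.save(f"equation_{i}.png")        # this crop is what nougat would see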
Let's not underestimate the impact of such a tool: we are talking about freeing up tons of knowledge from a "good for consumption / bad for distribution" format.
I'm very excited about it.
Let's build a pipeline: all the pdfs -> markdown them all -> archive.org them all
I don't think that is the right approach for archiving. The preferred pipeline would be
all the pdfs -> archive them all -> markdown them
This way you can always re-run the conversion as bugs are fixed and improvements are made. Generally, archivists prefer to save as close to the source material as possible, because every transformation from there can only lose data.
Yeah if you get down into the weeds these models are significantly corrupting the source data.
I opened the first example to a random chapter (1.4 Formal and natural languages); within the first three paragraphs it:
- Hallucinated spurious paragraph breaks
- Ignored all the boldfacing
- Hallucinated a blockquote into a new section
This is not a tool to produce something for humans to read.
Maybe it might be useful as part of some pipeline that needs to feed markdown into some other machine process. I would not waste my time reading the crud that came out of this thing.
> we are talking about freeing up tons of knowledge from a "good for consumption/bad for distribution" format.
FWIW PDF is actually great for distribution. It allows you to invisibly embed all the raw data used to generate the document that the end user is seeing, in whatever format you want. So if you are generating your PDFs by using PrinceXML to render HTML, you can embed the raw JSON used to generate all of the text, graphs, charts, etc. Now most people don't actually do this of course, but that isn't the fault of the spec.
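As a concrete (hedged) example of the attachment trick, here's roughly how it looks with pypdf; the file names and data are made up:

    # Sketch: attach the raw JSON used to generate a report to the PDF itself,
    # so the machine-readable source travels with the rendered pages.
    import json
    from pypdf import PdfReader, PdfWriter

    data = {"title": "Q3 report", "series": [1, 2, 3]}   # made-up source data

    writer = PdfWriter()
    writer.append(PdfReader("report.pdf"))               # copy the rendered pages
    writer.add_attachment("source.json", json.dumps(data).encode("utf-8"))

    with open("report_with_source.pdf", "wb") as f:
        writer.write(f)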
Are the standards for building accessible PDFs worse than the standards for building accessible websites, or are they just not as commonly implemented?
(Anecdotally) PDFs usually come from many people, departments, companies, and apps. It's hard to shoehorn in accessibility if someone didn't add it at the origin (in InDesign or whatever app they used). Or if they printed to PDF, whatever accessibility they had would probably be lost. Much of the time it's like working with a raster image with some embedded text. Not really the same as being able to edit a proper semantic document.
With a website and available source code, any dev working on it later on can still add accessibility, tweak contrasts and fonts and add screen reader hints, etc.
It's much harder to do so for PDFs after the fact. And PDF viewer apps may or may not even support the accessibility annotations. By contrast all the major browsers and operating systems have OK support for web accessibility.
Author here - this is one of the reasons I made this. Also see https://github.com/VikParuchuri/libgen_to_txt , although I haven't integrated marker with it yet (it uses naive text extraction).
Yeah, I know, but a lot of this content can be pretty sensitive and sometimes can't be uploaded outside organization networks (hospitals, governments, etc.).
Like most software, LLMs can be run locally, or on private infrastructure. This was on the front page yesterday, which is not the only way to run an LLM locally, but about the easiest way possible: https://news.ycombinator.com/item?id=38464057
This also has tons of use cases for accessibility: getting PDF accessibility right is tons of work, and even if you manage it, it's highly likely that the PDF viewers your users use don't support the necessary standards anyway.
This looks amazing, I'll have to play around with this over the weekend.
I regularly hand-transcribe RPG PDF scans from dubious sources that have not always been run through OCR to have selectable text. If it has been, it wasn't always done very well.
It's literally faster to type it all myself than fix all the errors from copy-pasting (or after using OCR to turn it into text).
Even if the file was an official PDF the formatting would often get screwed up with lots of double or triple spaces and even tabs included between words.
This would save so much time if I can get it to work. Thanks for sharing!
I had this use case in mind too. I already tried it with one book, but the results were not that good. Many of the tables and text boxes were messed up. I had pretty good results converting tables to markdown with ChatGPT by taking a screenshot of a table and pasting it into the chat. It was able to handle some "irregular" tables with a bit of prompting, like "Read the table row by row. Column headers are X, Y, Z. X is text, Y is a number, Z is a word" as a simplified example.
I suppose it depends on your use case. For personal tasks like this it should be more than sufficient, and you won't need to hand over user details/a credit card or whatever to use it.
I found it to be surprisingly good and I was very impressed with the in-browser performance. It is very very sensitive to resolution though. Once my images got down to a certain size they produced garbage from Tesseract even though they were very human readable.
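A quick thing worth trying if you hit this (a rough sketch, nothing scientific): upscale the crops before handing them to Tesseract instead of feeding in the tiny originals.

    # Quick experiment: upscale a small crop before OCR. The 3x factor is
    # arbitrary; aim for text that's roughly 30+ px tall.
    from PIL import Image
    import pytesseract

    img = Image.open("small_crop.png")
    big = img.resize((img.width * 3, img.height * 3), Image.LANCZOS)
    print(pytesseract.image_to_string(big))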
I tried it quite recently and it failed on a very basic image. I also tried the iOS Vision API, which also failed. My test case was a clear photo of a book page.
Question for the author: Why to markdown? It seems to me the hard part of this tool is parsing pdfs with high accuracy, not whatever you do with them. As such, I would love if this tool allowed the user to choose the output format. I know that I would use a high accuracy pdf parser to render into epub.
You would want to have some kind of markup that preserves structural information as much as possible. I manage ebooks for a university press, and we have a deep backlist waiting for conversion, a lot of which only exists as page scans of old print volumes. I want to be able to offer them as epubs, which means I need to know where there are chapter breaks, heads, tables, charts, math, blockquotes, and so on and so forth. I have vendors that can do this for me, but it costs more than we'd get for some of these books in sales. I'd love to be able to do some of this myself.
I agree, the intermediate format should be plain text that could optionally be converted to any other format. I suppose that Markdown, however, is used as the intermediate format here. It is close to plain text while still preserving simple layout information.
In practice, I would use the Markdown output and plug it into any tool that converts that into the desired final output format.
That sounds reasonable. I might explore pdf -> markdown -> epub.
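For the markdown -> epub leg, pandoc should cover it: something like pandoc book.md -o book.epub --metadata title="My Book" (from memory, so double-check the flags).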
I wonder if this could somehow be used directly by calibre. I think calibre's pdf->epub conversion isn't amazing. In particular, tables often end up broken.
I chose markdown because I wanted to preserve equations (fenced by $/$$), tables, bold/italic information, and headers. I haven't looked into epub output, but this ruled out plain text.
I have an odd use case that I've yet to find a good solution for: reading construction documents (blueprints are always PDFs). I've had much better luck parsing DXF (AutoCAD) files, but it's not always easy to get an architect to send them to me, even if I'm the GC on the job.
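For anyone in the same boat, the DXF side is fairly scriptable; here's a minimal sketch with ezdxf (just an example library and file name, not necessarily what anyone here uses) that dumps the text entities so schedules and annotations become searchable:

    # Pull raw text entities out of a DXF with ezdxf.
    import ezdxf

    doc = ezdxf.readfile("floorplan.dxf")
    msp = doc.modelspace()

    for entity in msp.query("TEXT MTEXT"):
        if entity.dxftype() == "TEXT":
            print(entity.dxf.text)
        else:                              # MTEXT can span multiple lines
            print(entity.plain_text())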
Nice work. I tend to do most of my longer reading on an e-reader. PDFs, especially multi-column layouts, are a nightmare with the out-of-the-box offerings from Amazon Kindle or Pocketbook. This looks like something that'll improve my experience quite a lot.
I have a question regarding the output of Nougat: Where do the "hallucinations" come from (just scroll through the Nougat output of the Think Python example to see what I mean)?
Never mind, I just read that it runs it through an LLM, so hallucinations are par for the course.
I think these sorts of tools are dangerous, at least until the hallucination rate (in text or formatting) is below that experienced by a careful reader repeatedly re-reading a document, which is almost but not quite zero, and, depending on the application, potentially even until it's actually zero. I guess they're mostly fine for cases where the exact document content isn't important, but it's probably not common to have a lot of documents that nobody anywhere considers or ever will consider important, yet which must be more accessible than PDFs.
Nice. This would have been very helpful when I was building an e-discovery document processing engine. Back then we could get text out (OCR, so kind of) but it was a bear to present. Markdown would have been a whole lot easier.
I have a set of PDF files, and this week was thinking how I can link them to an LLM and be able to ask questions about them. So this was very timely.
I did a quick side-by-side test against Nougat, and Marker clearly works better. On the handful of PDFs I tested (academic papers, no math in the text), Marker extracted considerably more text, finished the job faster, and did not crash on any PDF, while Nougat took a lot longer to finish and sometimes crashed due to an out-of-memory error (it could not allocate more than 7 GB of RAM!).
Might the OCRing of, for example, MIT's student magazine The Tech have used a similar stack to this, sans Markdown output of course? I'm wondering because any given historical issue's complex layout has been OCR'd so well.
I know it is included. The problem is that the available selection of languages is not good enough to include any of the languages I need it for. There is only support for a handful of languages.
>Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
Does this mean it isn't suitable if I wanted to use it in a product for sale or I cannot use it for tasks at my work?
I would like to try to use this at work to convert vendor documentation to include in our internal wiki.
If your work is commercial, then you cannot use it. Think of it this way: is your work being used in a commercial business? Then it cannot be used. If you are using this for personal use, or anything else that is not part of a business, it's OK.
Why are people converting PDF to Markdown? I get the impression that it is a thing currently in the LLM / ML world. But shouldn't we be converting to an unambiguously machine readable format like JSON, and then separately writing JSON to md formatters?
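To make the idea concrete, something like this (a made-up block schema, not any existing standard): parse once into typed blocks, serialize as JSON, and render markdown (or anything else) in a separate step.

    # Made-up intermediate representation: typed blocks that survive a JSON
    # round-trip, plus a separate markdown renderer.
    import json

    blocks = [
        {"type": "heading", "level": 2, "text": "Results"},
        {"type": "paragraph", "text": "We observe a 12% improvement."},
        {"type": "equation", "latex": "E = mc^2"},
    ]

    def to_markdown(blocks):
        parts = []
        for b in blocks:
            if b["type"] == "heading":
                parts.append("#" * b["level"] + " " + b["text"])
            elif b["type"] == "paragraph":
                parts.append(b["text"])
            elif b["type"] == "equation":
                parts.append("$$" + b["latex"] + "$$")
        return "\n\n".join(parts)

    print(to_markdown(json.loads(json.dumps(blocks))))   # JSON round-trips cleanly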
I've struggled with the other part of this flow: getting a good clean PDF of a website in an automated way. Whatever archive.today does is probably the best approach I've seen, but they don't publish their code as far as I can tell.
My current workflow (for getting a magazine onto a website) is Calibre's HTMLZ export, then through Pandoc to markdown. It produces good enough Markdown to feed into Hugo, and it extracts the images.
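Concretely, after unzipping the HTMLZ, the Pandoc step is roughly: pandoc index.html -f html -t markdown -o out.md --extract-media=images (from memory, so your flags may differ).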
I've been through a number of options in the past and this is what I've settled on.
My workflow still takes manual tweaking. When I find floated figures with captions, the lines get intertwingled and need to be unintertwingled. So I'm not surprised it didn't work for you.
Good luck, report back if you find what you're looking for. I'm always on the lookout for a better way.
I can report that the closest I've come before is with PDFMiner (https://pypi.org/project/pdfminer/) for Python. The benefit of this one is that it retains styling information, so that italics and the like can be preserved, at least with some post-processing (I think one might need to convert certain CSS classes to actual <i> or <em> tags).
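Another route than the CSS-class one is to use pdfminer.six's Python API and look at font names directly; a rough sketch (the "Italic"/"Oblique" check is just a heuristic):

    # Heuristic sketch with pdfminer.six: flag text lines whose characters use
    # an italic-looking font and wrap them in <em> for later post-processing.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

    for page in extract_pages("input.pdf"):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                chars = [c for c in line if isinstance(c, LTChar)]
                text = "".join(c.get_text() for c in chars).strip()
                italic = any("Italic" in c.fontname or "Oblique" in c.fontname
                             for c in chars)
                print(f"<em>{text}</em>" if italic else text)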
The other option I have started looking into is the PDFCPU library for Go. It is a bit more low-level than PDFMiner, but one gets out very well structured info, that seem it might be possible to post-process quite well, for one's particular use case and PDF layouts: https://github.com/pdfcpu/pdfcpu
I also now tried the Marker tool from the OP, and it seems to do a reasonable job. It did intermingle some columns, though, at least in some tricky cases such as when there was a round-shaped image between the two columns. One note is that Marker doesn't seem to retain styling like italics.
Especially for those who want to move out of Confluence. It is rather easy to obtain a docx or PDF from the API, as well as the raw, uncompressed attachments; it's a bit more complicated to convert said files to markdown with full-quality attachments and no formatting errors on every page.
I'd love to try this for a magazine I publish in PDF (designed with Adobe InDesign), but I couldn't make the repo work on my local machine. Any chance anyone could make a guide for trying it in the cloud? It would be appreciated :)
I'm curious if anyone has had any success building this package. I've spent a lot of time trying to build it myself, but unfortunately haven't been able to get it to work. Has anyone else had better luck?
(Author) Please feel free to open an issue if you try again. Poetry can be painful; I might just switch to a requirements.txt file in the future. (You can also skip poetry by just pulling everything from pyproject.toml into a requirements.txt file.)
I found the use of poetry a breath of fresh air compared to the usual Python silliness. Painless, as opposed to getting the CUDA stuff working, which took a lot longer.
The installation of this thing took more time than manual fixups of the .md generated by a simple pdf2md converter. And I got a perfect result, unlike with marker/nougat.
Are there any other libraries or online services that do this well? I have a large number of PDFs from government agencies. I've tried AWS Textract and it works fairly well.
https://www.handwritingocr.com is aimed specifically at handwriting, and will do that better than Textract and co, but works well for printed text too.
It actually doesn't matter. For my cases, I found Mathpix to be much more reliable than Nougat, for example. So, when you have hundreds of documents to convert a year and little time for manual labor on the results, paying a yearly "pro" subscription fee is worth it. However, it will really hit your pocket when you need to prepare datasets from thousands of PDFs... That's what you can't afford without a budget allocation from your project.
While I can see that Mathpix might be the superior choice, what matters to me is that, knowing the affiliation of the GP, the comment has a very different feel to it.