Marker: Convert PDF to Markdown quickly with high accuracy (github.com/vikparuchuri)
683 points by sebg on Dec 1, 2023 | 95 comments



Great work! I am a bit confused by the comparison with nougat throughout the repo. Nougat was specifically trained for academic documents, and I don't think anyone ever claimed Nougat was the best OCR model out there. That's kinda clear in your benchmark too, where you mention that nougat has higher accuracy on arxiv documents. You also mention that marker will convert fewer equations than nougat, and yet compare with nougat in terms of speed? (Again, only complaining because it's a model designed for academic documents.)

For anyone trying to do OCR on any pdf with math in it, definitely do try nougat. It's very easy to install (just a python package), and extracts the math, text, tables and beyond (in a .mmd file) with a single command line command. It also runs reasonably fast for personal use - it takes about 30 seconds to convert a 6 page document using CPU only on my 4 year old i5 laptop.


> I don't think anyone ever claimed Nougat was the best OCR model out there

Comparing two things doesn't inherently imply the previous thing was touted with superlatives. It's just a way to juxtapose the new thing with something that may be familiar. As you said, nougat is easy to install/run, so it makes sense they'd compare against it. Would it be better if they could add more libraries to the comparison? Absolutely; that'd be helpful.


How do you think nougat would handle RPG rulebook PDFs?

I'm looking for a good OCR model to help me transcribe sections of RPG books to markdown. Ideally, I'd like emphasis such as bold or italics to be transcribed.

The combo of text, numbers, and math symbols seems similar to technical and academic writing, but often has weird formatting, text boxes in the margins, and many diagrams.


I'm not completely sure to be honest, but you should try it yourself with a sample page! I believe hugging face hosts it online on their demo pages so you don't even have to install the package to test on one page.


Author here: for my use case (converting scientific PDFs in bulk), nougat was the best solution, so I compared to it as the default. I also compare to naive text extraction further down.

Nougat is a great model, and converts a lot of PDFs very well. I just wanted something faster, and more generalizable.


Reading your comment and the parent's, I think perhaps there is a mistake in the comparison chart on GitHub? It says nougat takes around 700 seconds per page and yours around 90. This doesn't match the parent's claim that it took him 30 seconds to run nougat on 6 pages.


Great work! I just tried it on "Linux for System Administrators" and it did a great job of properly picking up code and config text.

I noticed marker downloaded a PyTorch checkpoint called `nougat-0.1.0-small`, do you use nougat under the hood too or is that just a coincidence?


Yes, nougat is used as part of the pipeline to convert the equations (basically marker detects the equations then passes those regions to nougat). It's a great model for this.


> extracts the math, text, tables

I want to extract financial statements from pdfs, which are in tables; would Nougat be suitable for that use case?


Let's not underestimate the impact of such a tool: we are talking about freeing up tons of knowledge from a "good for consumption/bad for distribution" format.

I'm very excited about it.

Let's build a pipeline: all the pdfs -> markdown them all -> archive.org them all


> Let's build a pipeline

I don't think that is the right approach for archiving. The preferred pipeline would be

all the pdfs -> archive them all -> markdown them

This way you can always re-run the conversion as bugs are fixed and improvements are made. Generally archivists prefer to save as close to the source material as possible, because every transformation from there can only lose data.


Yeah if you get down into the weeds these models are significantly corrupting the source data.

I opened the first example to a random chapter (1.4 Formal and natural languages); within the first three paragraphs it:

- Hallucinated spurious paragraph breaks

- Ignored all the boldfacing

- Hallucinated a blockquote into a new section

This is not a tool to produce something for humans to read.

Maybe it might be useful as part of some pipeline that needs to feed markdown into some other machine process. I would not waste my time reading the crud that came out of this thing.

It's a stunt.


> we are talking about freeing up tons of knowledge from a "good for consumption/bad for distribution" format.

FWIW PDF is actually great for distribution. It allows you to invisibly embed all the raw data used to generate the document that the end user is seeing, in whatever format you want. So if you are generating your PDFs by using PrinceXML to render HTML, you can embed the raw JSON used to generate all of the text, graphs, charts, etc. Now most people don't actually do this of course, but that isn't the fault of the spec.
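
With pikepdf, for instance, attaching the source data is only a few lines (a minimal sketch; the filenames are made up):

    # Minimal sketch: embed the raw JSON used to generate the document into the PDF itself.
    from pathlib import Path
    import pikepdf

    with pikepdf.open("report.pdf", allow_overwriting_input=True) as pdf:
        spec = pikepdf.AttachedFileSpec.from_filepath(pdf, Path("report_data.json"))
        pdf.attachments["report_data.json"] = spec
        pdf.save("report.pdf")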


The problem with PDF is not distribution, it's consumption. It has a fixed layout that's so 1990 it makes me itch.


pdfs don't play well with ereaders.


Are the standards for building accessible PDFs worse than the standards for building accessible websites, or are they just not as commonly implemented?


(anecdotally) PDFs usually come from many people, departments, companies, and apps. It's hard to shoehorn in accessibility if someone didn't add it in at the origin (like in indesign or whatever app they used). Or if they printed to PDF, whatever accessibility they had would probably be lost. Much of the time it's like working with a raster image with some embedded text. Not really the same as being able to edit a proper semantic document.

With a website and available source code, any dev working on it later on can still add accessibility, tweak contrasts and fonts and add screen reader hints, etc.

It's much harder to do so for PDFs after the fact. And PDF viewer apps may or may not even support the accessibility annotations. By contrast all the major browsers and operating systems have OK support for web accessibility.


I don't know anything about websites. I had ebooks in mind.


Yeah, totally. PDFs are wonderful for archiving.*

They can hold so many different types of data that they're extremely difficult to parse.

Because of this, you can put several malicious programs into them for RCE.

That way, if someone archives many PDFs, there can be a plethora of different RCE vulnerabilities just waiting for the user to discover.

It's a wonderful dream for any malicious actor.

* /s


Author here - this is one of the reasons I made this. Also see https://github.com/VikParuchuri/libgen_to_txt , although I haven't integrated marker with it yet (it uses naive text extraction).


Yes, there is an enormous interest in this kind of thing, not the least in larger organizations with tons of PDF documents in various forms.

Even though this would only cover a small part of the needs or use cases, it will still be hugely useful if it works well.


cough L cough L cough M cough anyone? :)


Yeah, I know, but a lot of this content can be pretty sensitive, and sometimes it might not be allowed to be uploaded outside organization networks (hospitals, governments, etc.).


Like most software, LLMs can be run locally, or on private infrastructure. This was on the front page yesterday, which is not the only way to run an LLM locally, but about the easiest way possible: https://news.ycombinator.com/item?id=38464057


Thanks! Well, yeah, I just thought the quality of offline models might not yet be good enough. But I'm glad to be told otherwise :)


This also has tons of use cases for accessibility: getting PDF accessibility right is tons of work, and even if you manage it, it's highly likely that the PDF viewers your users use don't support the necessary standards anyway.


Finally a good usecase for AI/ML/LLM.


This looks amazing, I'll have to play around with this over the weekend.

I regularly hand-transcribe RPG PDF scans from dubious sources that have not always been run through OCR to have selectable text. If they have, it wasn't always done very well.

It's literally faster to type it all myself than fix all the errors from copy-pasting (or after using OCR to turn it into text).

Even if the file was an official PDF the formatting would often get screwed up with lots of double or triple spaces and even tabs included between words.

This would save so much time if I can get it to work. Thanks for sharing!


I also had this use case in mind. I already tried it with one book, but the results were not that good. Many of the tables and text boxes were messed up. I had pretty good results converting tables to markdown with ChatGPT by taking a screenshot of a table and pasting it into the chat. It was able to handle some "irregular" tables with a bit of prompting, like "Read the table row by row. Column headers are X, Y, Z. X is text, Y is number, Z is word" as a simplified example.


> I regularly hand transcribe RPG PDFs scans from dubious sources

Heh, that was my immediate thought too. There's a ton of RPG stuff that never had any kind of physical release and is totally orphaned as IP.


How good is tesseract for OCR nowadays? I tried using it a while back and it was nowhere near as good as the online offerings from AWS, Azure and GCP.


The last update was pretty recent, and the repo mentions tesseract 5 as a dependency, so it's likely moved on a bit from when you last tried it:

https://github.com/tesseract-ocr/tesseract/releases

I suppose it depends on your use case. For personal tasks like this it should be more than sufficient, and you won't need to hand over user details/a credit card or whatever to use it.


I found it to be surprisingly good and I was very impressed with the in-browser performance. It is very very sensitive to resolution though. Once my images got down to a certain size they produced garbage from Tesseract even though they were very human readable.


It requires quite a bit of preprocessing. I've only tried GCP's solution, which is better in my experience.
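
By preprocessing I mean something along these lines before handing the image to Tesseract (a rough sketch with Pillow and pytesseract; the upscale factor and the threshold are arbitrary starting points you'd tune per source):

    # Rough preprocessing sketch before Tesseract (Pillow + pytesseract).
    from PIL import Image
    import pytesseract

    img = Image.open("page.png").convert("L")                         # grayscale
    img = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)  # upscale low-res scans
    img = img.point(lambda p: 255 if p > 180 else 0)                  # simple binarization
    print(pytesseract.image_to_string(img))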


I tried it quite recently and it failed on a very basic image. I also tried the iOS Vision API, which also failed. My test case was a clear photo of a book page.


Question for the author: Why to markdown? It seems to me the hard part of this tool is parsing pdfs with high accuracy, not whatever you do with them. As such, I would love if this tool allowed the user to choose the output format. I know that I would use a high accuracy pdf parser to render into epub.


You would want some kind of markup that preserves structural information as much as possible. I manage ebooks for a university press, and we have a deep backlist waiting for conversion, a lot of which only exists as page scans of old print volumes. I want to be able to offer them as epubs, which means I need to know where there are chapter breaks, heads, tables, charts, math, blockquotes, and so on and so forth. I have vendors that can do this for me, but it costs more than we'd get for some of these books in sales. I'd love to be able to do some of this myself.


I agree, the intermediate format should be plain text that could optionally be converted to any other format. I suppose that Markdown, however, is used as the intermediate format here. It is close to plain text while still preserving simple layout information.

In practice, I would use the Markdown output and plug it into any tool that converts that into the desired final output format.
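
For epub, for example, that can be as little as one pandoc call on the Markdown (a sketch using the pypandoc wrapper and a local pandoc install; filenames are illustrative):

    # Sketch: Markdown output -> epub via pandoc (through the pypandoc wrapper).
    import pypandoc

    pypandoc.convert_file("book.md", "epub3", outputfile="book.epub")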


That sounds reasonable. I might explore pdf -> markdown -> epub.

I wonder if this could somehow be used directly by calibre. I think calibre's pdf->epub conversion isn't amazing. In particular, tables often end up broken.


I chose markdown because I wanted to preserve equations (fenced by $/$$), tables, bold/italic information, and headers. I haven't looked into epub output, but this ruled out plain text.


Why not choose an unambiguously parseable output format such as JSON, and then convert JSON to markdown/ html / etc when needed?
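
Something like a flat list of typed blocks would be trivial to render into Markdown afterwards (a hypothetical schema, just to illustrate the idea):

    # Hypothetical block-level JSON schema and a trivial Markdown renderer.
    import json

    doc = json.loads("""[
        {"type": "heading", "level": 2, "text": "Results"},
        {"type": "paragraph", "text": "We observe a **significant** speedup."},
        {"type": "equation", "latex": "E = mc^2"}
    ]""")

    def to_markdown(blocks):
        out = []
        for b in blocks:
            if b["type"] == "heading":
                out.append("#" * b["level"] + " " + b["text"])
            elif b["type"] == "paragraph":
                out.append(b["text"])
            elif b["type"] == "equation":
                out.append("$$" + b["latex"] + "$$")
        return "\n\n".join(out)

    print(to_markdown(doc))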


I have an odd usecase that I've yet to find a good solution to: Reading construction documents (Blueprints are always PDF). I've had much better luck parsing DXF (AutoCAD) files but it's not always easy to get an architect to send them to me even if I'm the GC on the job.


Nice work. I tend to do most of my longer reading on an e-reader. PDFs, especially multi-column layouts, are a nightmare with the out-of-the-box offerings from Amazon Kindle or Pocketbook. This looks like something that'll improve my experience quite a lot.


Great stuff!

I have a question regarding the output of Nougat: Where do the "hallucinations" come from (just scroll through the Nougat output of the Think Python example to see what I mean)?

Never mind, I just read that it runs it through an LLM, so hallucinations are par for the course.


I think these sorts of tools are dangerous, at least until the hallucination rate (in text or formatting) is below that experienced by a careful reader repeatedly re-reading a document, which is almost but not quite zero (and, depending on the application, potentially even until it's actually zero). I guess they're mostly fine for cases where the exact document content isn't important, but it's probably not common to have a lot of documents that nobody anywhere considers or ever will consider important, yet which must be more accessible than pdfs.


This seems like a great tool to help migrate my notes out of OneNote



How can it help with OneNote?


Really interesting stuff... it might be worth adding some before-and-after examples to the repo.

What kind of PDF are you tweaking it for? How does it handle handwritten annotations?


Kosmos-2.5 seems promising, and I hope we see it in OSS (otherwise I assume it will just make Azure's cloud OCR better).

https://arxiv.org/pdf/2309.11419.pdf


Nice. This would have been very helpful when I was building an e-discovery document processing engine. Back then we could get text out (OCR, so kind of) but it was a bear to present. Markdown would have been a whole lot easier.


Amazing work. Thank you.

I have a set of PDF files, and this week I was thinking about how I could link them to an LLM and be able to ask questions about them. So this was very timely.

I did a quick side-by-side test against Nougat, and it clearly works better. On a handful of PDFs I tested, Marker extracted considerably more text (the text did not have any math, just academic papers), finished the job faster, and did not crash on any pdf, while Nougat took a lot longer to finish and sometimes crashed due to an out-of-memory error (it could not allocate more than 7GB of RAM!).


Might the OCRing of, for example, MIT's student magazine The Tech have used a similar stack to this, sans the Markdown output of course? I ask because any given historical issue's complex layout has been OCR'd so well.

https://thetech.com/issues

Random old issue for example: https://thetech.com/issues/33/34


Impressive. It would be nice to have access to a spellchecker with support for more languages though. But the results are pretty good despite that.


Spellchecker is included. Just change the spell_Lang from eng to your lang


I know it is included. The problem is that the available selection of languages is not good enough to include any of the languages I need it for. There is only support for a handful of languages.


Can someone help me understand the line

>Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.

Does this mean it isn't suitable if I wanted to use it in a product for sale or I cannot use it for tasks at my work? I would like to try to use this at work to convert vendor documentation to include in our internal wiki.


If your work is commercial, then you cannot use it. Think of it this way: is the output being used in a commercial business? Then it cannot be used. If you are using this for personal use or anything that is not part of a business, it's ok.


What if your business is education and you use it in your educational output? (Serious question)


Thank you for your help!


Why are people converting PDF to Markdown? I get the impression that it is a thing currently in the LLM / ML world. But shouldn't we be converting to an unambiguously machine readable format like JSON, and then separately writing JSON to md formatters?


I've struggled with the other part of this flow: getting a good clean PDF of a website in an automated way. Whatever archive.today does is probably the best approach I've seen, but they don't publish their code as far as I can tell.


It'd be really great if there was something like this that also supported image extraction


My current workflow (for getting a magazine onto a website) is Calibre's HTMLZ export, then through Pandoc to markdown. It produces good enough Markdown to feed in to Hugo, and extracts images.

I've been through a number of options in the past and this is what I've settled on.
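
Scripted, that pipeline is roughly the following (a sketch assuming Calibre's ebook-convert CLI and pypandoc are installed; the filenames are made up):

    # Rough sketch of the Calibre HTMLZ -> pandoc -> Markdown workflow described above.
    import subprocess
    import zipfile
    import pypandoc

    subprocess.run(["ebook-convert", "issue.pdf", "issue.htmlz"], check=True)
    with zipfile.ZipFile("issue.htmlz") as z:
        z.extractall("issue_html")          # index.html plus the extracted images
    pypandoc.convert_file("issue_html/index.html", "gfm", outputfile="issue.md")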


Interesting! I tried it, but it seems to struggle with multi-column layouts (lines get intermingled). Is that something you tried?


No, only standard paragraphs.

My workflow still takes manual tweaking. When I find floated figures with captions, the lines get intertwingled and need to be unintertwingled. So I'm not surprised it didn't work for you.

Good luck, report back if you find what you're looking for. I'm always on the lookout for a better way.


I can report that the closest I've come before is with PDFMiner (https://pypi.org/project/pdfminer/) for Python. The benefit of this one is that it retains styling information, so that italics and the like can be kept, at least with some post-processing (I think one might need to convert certain CSS classes to actual <i> or <em> tags).
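
Getting at the styling is mostly a matter of walking the layout tree and checking each character's font name (a sketch using the pdfminer.six fork's high-level API; the "Italic" substring check is a heuristic that depends on the fonts in your PDFs):

    # Sketch: flag italic characters by inspecting per-character font names with pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

    for page in extract_pages("input.pdf"):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for ch in line:
                    if isinstance(ch, LTChar) and "Italic" in ch.fontname:
                        print("italic char:", ch.get_text())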

The other option I have started looking into is the PDFCPU library for Go. It is a bit more low-level than PDFMiner, but one gets out very well-structured info, which it seems might be possible to post-process quite well for one's particular use case and PDF layouts: https://github.com/pdfcpu/pdfcpu

I also now tried the Marker tool from the OP, and it seems to do a reasonable job. It did intermingle some columns though, at least in tricky cases such as when there was a round-shaped image in between the two columns. One note is that Marker doesn't seem to retain styling like italics, though.


Especially for those that want to move out of Confluence. It is rather easy to obtain a docx or pdf from the API, as well as the raw, uncompressed attachments; it's a bit more complicated to convert said files to markdown with full-quality attachments and no formatting errors on every page.


I'd love to try this for a magazine I publish in PDF (designed with Adobe InDesign), but I couldn't make the repo work on my local machine. Any chance anyone could make a guide for trying it in the cloud? It would be appreciated :)


I'm curious if anyone has had any success building this package. I've spent a lot of time trying to build it myself, but unfortunately haven't been able to get it to work. Has anyone else had better luck?


The hard part was getting CUDA and torch to work. The package itself was just poetry install. Easy-peasy.


I did it on a Mac without any issues. Are you using Mac or Linux? What is the issue?


I'm using Ubuntu 22.04. I encountered several errors with Poetry and attempted to fix them but eventually gave up.


(author) Please feel free to open an issue if you try again. Poetry can be painful, I might just switch to a requirements.txt file in the future. (you can skip poetry if you want by just pulling everything in pyproject.toml into a requirements.txt file also)


I found the use of poetry a breath of fresh air compared to the usual python silliness. Painless, as opposed to getting the CUDA stuff working, which took a lot longer.


Is there a plan to release this package as a docker image?


Yes, this is on my list of things to do :)


Nice! The only missing feature is conversion of plots to ASCII art ; )


This could be achieved with chafa.py:

https://chafapy.mage.black/

https://hpjansson.org/chafa/


The installation of this thing took more time than manually fixing up the .md generated by a simple pdf2md converter. And that gave me a perfect result, unlike marker/nougat.


How well does this handle tables and text embedded in images?


Are there any other libraries or online services that do this well? I have a large number of PDFs from government agencies. I've tried AWS Textract and it works fairly well.


https://www.handwritingocr.com is aimed specifically at handwriting, and will do that better than Textract and co, but works well for printed text too.


[flagged]


Suggest adding a disclaimer that you are the founder


You’re right :) got to love HN, I broke the code and got downvoted! I like a community that has teeth


This kind of tool should also be built into the post-processing pipeline of paperless-ngx. Well-parsed markdown would be more easily indexable for search.


What’s the best tool for writing with chatGPT so that markdown gets rendered properly? Copy pasting in google docs is always misery.


I'd just use pandoc to convert to docx.


I'm not very technical but could benefit from this tool tremendously. Is there a way to use it from R?


That looks great! I would think that the same with LaTeX or Typst output could be even better, if doable.


[flagged]


You should mention that you are the CEO of Mathpix.


It actually doesn't matter. For my cases, I found Mathpix to be much more reliable than Nougat, for example. So, when you have hundreds of documents to convert a year and little time for manual labor on the results, paying a yearly "pro" subscription fee is worth it. However, it will really hit your pocket when you need to prepare datasets from thousands of PDFs... That's what you can't afford without a budget allocation from your project.


While I can see that Mathpix might be the superior choice, what matters to me is that, knowing the affiliation of the GP, the comment has a very different feel to it.


Mathpix is going to lower prices a lot for larger-scale pdf processing, stay tuned!


$$



