Diff-pdf: tool to visually compare two PDFs (github.com/vslavik)
589 points by Olshansky 5 months ago | 80 comments



This inspired me to have Claude 3.5 Sonnet knock out a quick web page prototype for me, using PDF.js to load and render the PDFs to canvas elements and then display visual diffs between their pages.

Two prompts:

    Build a tool where I can drag and drop on two PDF files and
    it uses PDF.js to turn each of their pages into canvas
    elements and then displays those pages side by side with a
    third image that highlights any differences between them, if
    any differences exist

    rewrite that code to not use React at all
Here's the result: https://tools.simonwillison.net/compare-pdfs

It actually works quite well! Screenshot here: https://gist.github.com/simonw/9d7cbe02d448812f48070e7de13a5...


What's the best way to make effective use of Claude 3.5? I signed up a few days ago for API access. Besides console.anthropic.com, do you recommend any other tools I can run locally to give it an API key and use Claude effectively?


The web interface gives you access to the Artifacts feature where it can build SPAs and render them in the browser.

For terminal access I like using my own https://llm.datasette.io/ tool with the https://github.com/simonw/llm-claude-3 plugin

For Python library access I recommend checking out Claudette: https://www.answer.ai/posts/2024-06-21-claudette.html


How much additional work did you have to do to get it into this form? That's impressive work using those prompts.


Almost no extra work at all. You can see the full transcript here: https://gist.github.com/simonw/9d7cbe02d448812f48070e7de13a5... - it really was just those two prompts, then I copied the result out into a document to test it.

I modified the HTML a tiny bit before publishing it - I set the font to Helvetica and added the note at the bottom of the page showing the prompt I used.

The whole project took less than 5 minutes - then another 10 to write it up.


That's incredible.


It's hard to challenge a comment that is 4 days old: What's the difference between the 2 PDFs?

I didn't find any actual difference. But maybe it's just me that's hallucinating.


In a previous job, I had to validate the output of an unreliable production publishing system, so I tested dozens of PDF comparison tools available at the time. The best I found was called Delta Walker. It was proprietary commercial Mac-only software, but reasonably inexpensive, accurate, and could handle long PDFs with lots of graphics well.

Most of the tools I tried failed to identify changes or reported false positives. I remember evaluating this diff-pdf tool and finding that it fell short in some way, although it's been so long that I don't recall the specifics. I also remember being disappointed, since this one is open source and could easily be scripted.


It looks like Delta Walker's added Windows and Linux support: https://www.deltawalker.com/download


I found "Draftable" to be great.


Wouldn't exporting pages to images and using a pixel diff accurately identify differences in PDFs?


I guess it depends on the use case. Imagine adding an extra sentence in the second PDF: the paragraph now has 6 lines instead of 5, the next paragraph begins a line further down, the last paragraph of that page spills onto the next page, and so on...


Thanks. That helped me understand it better.


Related - this might be helpful to someone.

ImageMagick can do a visual PDF compare:

    magick compare -density "$DENSITY" -background white "$1[0]" "$2[0]" "$TMP"
(density = 100, $1 and $2 are the filenames to compare, $TMP the output file)

You need to do some work to support multiple pages, so I use this script:

https://gist.github.com/mbafford/7e6f3bef20fc220f68e467589bb...

This also uses `imgcat` to show the difference directly in the terminal.

You can also use ImageMagick to get a perceptual hash difference using something like:

    convert -metric phash "$1" null: "$2" -compose Difference -layers composite -format '%[fx:mean]\n' info:
I take advantage of the fact that you can configure git to use custom diff tools, with the following in my .gitconfig:

    [diff "pdf"]
        command = ~/bin/git-diff-pdf
And in my .gitattributes I enable the above with:

    *.pdf binary diff=pdf
~/bin/git-diff-pdf does a diff of the output of `pdftotext -layout` (from poppler) and also runs pdf-compare-phash.

To use this custom diff with `git show`, you need to add an extra argument (`git show --ext-diff`), but `git diff` picks it up automatically.
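
For anyone curious, here is a minimal sketch of what a script like that might look like. This is my own guess, not the author's actual ~/bin/git-diff-pdf; it only covers the pdftotext part and assumes Python 3 plus poppler's pdftotext on PATH. Git hands an external diff command seven arguments, of which the 2nd and 5th are the old and new file paths:

    #!/usr/bin/env python3
    # Hypothetical git external diff driver for PDFs: diff the text layers.
    # Git invokes it as: <cmd> path old-file old-hex old-mode new-file new-hex new-mode
    # (new/deleted files, where git passes /dev/null, are not handled here)
    import subprocess
    import sys
    from difflib import unified_diff

    def pdf_text(path):
        # "-" writes the extracted text to stdout; -layout keeps the column layout
        result = subprocess.run(["pdftotext", "-layout", path, "-"],
                                capture_output=True, text=True, check=True)
        return result.stdout.splitlines(keepends=True)

    path, old_file, new_file = sys.argv[1], sys.argv[2], sys.argv[5]
    sys.stdout.writelines(unified_diff(pdf_text(old_file), pdf_text(new_file),
                                       fromfile=f"a/{path}", tofile=f"b/{path}"))
The perceptual-hash step from the parent comment would bolt on after the text diff.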


Next level, especially with the git attribute calls, well played.

I'm still blown away by how powerful ImageMagick is after using it for a decade or two. What an inspiring piece of open source software.


ImageMagick really is magical.


I have been using this in a CI pipeline to maintain a business-critical PDF generation (healthcare) app (started circa 2010, I think). Here are the RSpec helpers I'm using:

https://gist.github.com/thbar/d1ce2afef68bf6089aeae8d9ddc05d...

The code contains git-stored reference PDFs, and the test suite re-generates them and asserts that nothing has changed.

It has helped a lot when auditing visual changes or PDF library upgrades!


Are you using signed digests in the PDFs?


Could you not just compare the source (or perhaps even the hash) of the PDF and assert on that?


I use some custom tools for PDF comparison (visual, textual, and perceptual hash) for my personal records/accounting purposes.

A number of the financial and medical institutions I deal with re-generate PDFs every time you request them, but the content is 99-100% identical. Sometimes just a date changes. So I use a perceptual hash and content comparison to automate detecting truly new documents vs. ones that are only slightly changed.
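
For what it's worth, here is a rough sketch of that kind of near-duplicate check. It's my own illustration, not the parent's actual tooling; it assumes pdf2image (poppler-backed) and the imagehash library, and the 4-bit threshold is arbitrary:

    # Treat two PDFs as near-duplicates when every page's perceptual hash
    # differs by only a few bits. Assumes pdf2image and imagehash are installed.
    from pdf2image import convert_from_path
    import imagehash

    def page_hashes(path, dpi=100):
        return [imagehash.phash(page) for page in convert_from_path(path, dpi=dpi)]

    def near_duplicate(path_a, path_b, max_bits=4):
        hashes_a, hashes_b = page_hashes(path_a), page_hashes(path_b)
        if len(hashes_a) != len(hashes_b):
            return False
        # Subtracting two ImageHash objects gives their Hamming distance
        return all((ha - hb) <= max_bits for ha, hb in zip(hashes_a, hashes_b))
A text-level check (for example diffing pdftotext output) can then catch the "only a date changed" case.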


If the document is a legally required disclosure (like a bank's fee schedule, for example), then you need to validate the rendered document directly rather than its source. PDFs are horrible, and there is a lot that can go wrong in producing them between writing and publishing.


Hashes can change regularly due to metadata. Source checks may also require some filtering or preprocessing before comparison. Visual comparison is the best option here, especially if you have a complex document with multiple third-party components that may change both the hash and the source while keeping the visual appearance the same.


In this case, we indeed have multiple components (although not third-party), and being able to refactor those without risk is quite nice.


What should I do when the assertion fails – inspect the PDF with my sad little caveman eyeball?


In my own tests where I inspect PDF differences in Python, I iterate through the pages. If the number of pages is the same, I convert each page to a bitmap with PIL, get the diff (ImageChops.difference is black everywhere the images are the same and colored where they differ), and find the extent of the diff with `getbbox`. This gives me the coordinates of the rectangle where changes appeared, which I then use to print the page with a colored rectangle and to print out the crops.

I output the original page, the original rectangle, the original page with the colored rectangle, the new page and the new rectangle, and the diff both cropped and uncropped; only after that do I start using my caveman eyeballs.

I also pixelate it a bit and apply a brightness cutoff to the diff to see if the difference actually matters, and I also try re-cropping, i.e. shifting by a limited number of pixels, to see whether it becomes an ignorable difference because everything just moved a bit to the left; but that is optional.
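
A condensed sketch of that page-diff step, in case it's useful (rasterizing with pdf2image is my assumption; ImageChops.difference and getbbox are the PIL calls described above):

    # Render both PDFs, diff each page's pixels, and mark the changed region.
    from pdf2image import convert_from_path  # assumption: poppler-based rasterizer
    from PIL import ImageChops, ImageDraw

    old_pages = convert_from_path("old.pdf", dpi=150)
    new_pages = convert_from_path("new.pdf", dpi=150)
    assert len(old_pages) == len(new_pages), "page counts differ"

    for i, (old, new) in enumerate(zip(old_pages, new_pages), start=1):
        # assumes identical page sizes; difference() raises otherwise
        diff = ImageChops.difference(old.convert("RGB"), new.convert("RGB"))
        box = diff.getbbox()  # None when the pages are pixel-identical
        if box:
            marked = new.copy()
            ImageDraw.Draw(marked).rectangle(box, outline="red", width=3)
            marked.save(f"page-{i}-changed.png")
            diff.crop(box).save(f"page-{i}-diff-crop.png")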

I also recommend exporting the new PDF from the CI/CD tool so it can be put back into the test as the reference. Even between Linux distros and versions, small changes in fonts and the like make a difference.


The source sometimes changed for small internal reasons in the library generating the PDF (Prawn), so just comparing the source would not give a clear-cut answer. A visual comparison has helped quite nicely over time.


Reminds me of the tool Bob Nystrom wrote to help himself out when working on the physical edition of Crafting Interpreters: https://journal.stuffwithstuff.com/2020/04/05/crafting-craft...

The whole article is worth reading, but if you want the relevant bits, search for "I wrote a Dart script that would take a PDF of the book".


We've been using this in the Micro:bit Educational Foundation (microbit.org) to fill a gap in hardware design tooling, and get visual diffs of our schematics and gerbers during PCB design iterations. It's kinda wild that's what we ended up doing, but if you want to be sure your radio layout didn't change at all when you're making a minor revision to a different part of the board, visual diffs are perfect.

That said, next project we want to try something more integrated with EDA tools. If anyone else has followed this path, we'd love to know.


You can do this with Beyond Compare (it's not free, but not very expensive either) https://www.scootersoftware.com/


Beyond Compare is one of those priceless tools I pay for myself instead of waiting for my employer to pay for it. Price/functionality-wise it's worth its weight in gold, it's cross-platform, and its licensing is very liberal. There are just no FOSS compare tools out there that can match BC.


What are the BC features that you find so great?

I'm genuinely curious - I've heard a lot about BC being 'the tool' for diffing. I'm used to Meld, but my current employer has a pretty strict policy about which tools can be used, so at some point I managed to get a licence for an older version of BC. But for some reason I found its UI and the way it works a bit less optimal than what I was accustomed to. Since I'm primarily doing text diffs these days, I usually use the diff tool from IntelliJ IDEA (I have IDEA open all the time).


In comparison, Meld is neither stable nor fast, especially for big diffs. The UI is also more limited. Araxis Merge and WinMerge are good alternatives.


Yeah, I found Araxis Merge better than Beyond Compare, FileMerge, DiffForm, and Meld (mostly based on diffing prose rather than code, though).


Araxis Merge is ~4x the price of BC though. What does it do that makes it 4x better?


Well, whether it's worth it is going to depend both on the use case and on the user. (I figure for many folk in this thread, the difference in price is going to be pretty negligible for a tool they use ~weekly.)

For me, I eliminated BC immediately because I was often diffing prose and it didn't have word wrap; that ability is apparently available now in the beta version of BC5, but it wasn't when I was testing it. I suspect it will continue to be non-optimized for prose in how it handles long lines.


I like this tool better: https://www.qtrac.eu/diffpdf.html

It shows the differences in the GUI side by side instead of overlaid.


From the GitHub README:

Another option is to compare the two files visually in a simple GUI, using the --view argument:

$ diff-pdf --view a.pdf b.pdf

This opens a window that lets you view the files' pages and zoom in on details. It is also possible to shift the two pages relatively to each other using Ctrl-arrows (Cmd-arrows on MacOS). This is useful for identifying translation-only differences.


Shifting the offset is very far from the experience of a side-by-side diff, and more useful for nudging the images to align them.


There is also an open-source/free version of this [1], which I use regularly. You can install it, e.g., in Fedora, with the 'diffpdf' package. It is no longer maintained but works very well, has a nice GUI with a side-by-side view, drag&drop support, and both text and visual modes.

[1] https://www.qtrac.eu/diffpdf-foss.html


I use BeyondCompare 5 for this.


We use this tool regularly in our team to compare PDFs we obtain from third-party services that might have changed after code changes on our side. Big thanks to the author <3


Interestingly, GitHub thinks the project is 46% Shell, due to the fairly huge wxwin.m4.


I noticed this a while back with a private project of mine. The GitHub languages breakdown seems broken. Mine is a Python project with a handful of Jupyter notebooks but many, many Python files. The LOC must be 80% Python, yet GitHub sees the project as 50% Jupyter.



I had no idea. Thanks for sharing.


I wrote a pixel-based visual diffing algorithm long ago, intended for a CI tool that finds all of the UI changes in a PR. As an intern at Inkling, I broke the layout of a page I didn't even know existed, and I've had this idea in my head ever since.

https://github.com/deckar01/narcis


I will just chime in to mention Draftable (https://www.draftable.com/compare). It really works well. It’s not so easy to have a visually comfortable diff of two PDFs.


Can anyone recommend a method to deduplicate PDFs? The hash is often different, but the content and metadata are 99.99% the same.


You might want to strip metadata before doing a comparison, using exiftool. Even though exiftool was originally written for EXIF metadata on JPGs, these days it supports a lot of metadata standards, including PDF. This command will do it, assuming you set filename=`basename your.pdf .pdf`:

    exiftool -all= -o ${filename}.stripped.pdf ${filename}.pdf
That won't help you with small differences in the contents, but might help with small differences in metadata. Running `md5sum` on the stripped PDF should give more reliable dedupe results.

I was recently working on a similar problem for JPG, RAW, and MP4 files (photo/video backup) so it is fresh in my mind.


I would consider rasterizing the PDFs and then hashing the resulting bitmaps.
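
Something along these lines, perhaps (a sketch, assuming pdf2image for rendering; hashing the rendered pixels ignores metadata churn but is still strict about any visual change):

    # A "visual hash": hash the rendered pixels of every page rather than
    # the file bytes, so metadata-only regenerations compare as equal.
    import hashlib
    from pdf2image import convert_from_path  # assumption: poppler-based rasterizer

    def visual_hash(path, dpi=100):
        digest = hashlib.sha256()
        for page in convert_from_path(path, dpi=dpi):
            digest.update(page.convert("RGB").tobytes())
        return digest.hexdigest()
Two renders of the "same" PDF from the same toolchain should hash identically; any font or layout change will break the match, which may or may not be what you want for dedupe.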


cp?


https://10052.ai has a tool that will visually compare documents (PDFs, DOCs, images, etc.) and cluster them together. It works amazingly well.


Coincidentally, I downloaded and tried using this just a while ago. I was trying to see if it could identify an Elsevier fingerprint between two PDFs. It can't; it only compares visible things.

I used vbindiff instead.


Use this to compare university textbook editions 8 and 9 before buying.


Uh how can you compare without buying? Or put another way, why buy if you can compare?


time machine research


libgen exists, bro


But then why would you need to buy it?


Because for textbooks, paper is often superior.


I created a similar in-browser version a while back with Mozilla's PDF.js. The diff rendering all runs client-side.

https://www.parepdf.com

The diff-pdf project was my inspiration but I wanted to create a version that was distributable to non-programmers.


This reminds me of a book author who posted here, IIRC. He had a little tool allowing him to quickly compare two revisions of his book, for example to make sure that fixed typos didn't wreak havoc. I remember his tool would show in red what had changed on the page thumbnails.


Back when I was writing my final paper, I faced a similar issue: I needed to de-duplicate a bunch of PDFs, so I came up with a simple solution:

https://github.com/victorqribeiro/dtf


I really like the overlay view and that it is not cloud based. Will try to test it at work.

I rely heavily on PDF comparison via PDF-XChange Editor, which is accurate for text, but often has trouble highlighting visual changes correctly.


I always used DiffPDF, only to read on their website [1]:

> in the view of the EU's Cyber Resilience Act and an abundance of caution, we have withdrawn all our free software

Good to see post-cyberresilience alternatives :)

PDF diffs are really great for versioning/comparing PCB designs. (The only real use case I had, 15 years back.)

[1] http://www.qtrac.eu/diffpdf-foss.html


What a convenient excuse for them to try to get people to switch to their proprietary fork.

I genuinely need a side-by-side PDF comparison tool, and the diff-pdf tool linked from the main link doesn't do that. Any thoughts?


Of course, Adobe Compare does this too.

https://www.adobe.com/acrobat/features/compare-pdfs.html


https://onlinetextcompare.com/pdf lets you compare the text of two PDF files locally, within the browser.


Thanks. I'll give this a shot to see if any counterparties try to sneak in any last second changes to the executable version of the doc.


Crazy, I'd have thought that modern multi-modal LLMs could do this, but when I tried Gemini, ChatGPT-4o, and Claude, they all pooped out:

- Gemini at first only diff'd the text, and then when pushed it identified the items in the images and then hallucinated the differences between the versions. It could not produce an image output.

- Claude only diff'd the text and refused to believe that there were images in the PDFs.

- ChatGPT attempted to write and execute Python code for this, which errored out.


Visually comparing two PDFs is something a PC can do deterministically, without any resource- (and energy-) intensive LLMs. People will soon use LLMs for things they are not especially good or efficient at, like computing the sum of numbers in an Excel table... (or are they doing it already?).


As a bonus, they'll get a result that looks likely.


This may be the type of thing that LLMs are currently the worst at. I'm not surprised at all.


This is definitely not a strength of multi-modal LLMs. Multi-modal capabilities are still too flaky, especially when looking at a page of a PDF, which can have multiple areas of focus.


I would fully expect an LLM not to get natively good at this, but to know how to reach out to another tool in order to get good at this.


Maybe this could be used to generate PDFs using LaTeX and use the diff as a distance metric to optimize.


No screenshots?



Can anyone explain how to interpret that screenshot? It just looks like very blurry text to me.


It's showing you both PDFs overlaid on each other. The main window looks blurry because the main text has shifted vertically slightly. The regions that have changes are highlighted in the thumbnails on the left.

I agree it's not the best initial example to demonstrate the tool, but it does show how it can be used to detect even minor spacing changes.


If it were sharp, they would be identical. The "blurriness" is doubling, where the lines are not quite aligned. Red text there shows you content that is in one and not the other.



