Crazy, I'd have thought that modern multi-modal LLMs could do this, but when I tried Gemini, ChatGPT-4o, and Claude they all pooped out:
- Gemini at first only diff'd the text; when pushed, it identified the items in the images but then hallucinated the differences between the versions. It could not produce an image output.
- Claude only diff'd the text and refused to believe that there were images in the PDFs.
- ChatGPT attempted to write and execute Python code for this, which errored out.
Visually comparing two PDFs is something a PC can do deterministically, without any resource- (and energy-) intensive LLMs. People will soon use LLMs for things they are not especially good or efficient at, like computing the sum of numbers in an Excel table... (or are they doing it already?).
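For what it's worth, here is a minimal sketch of that deterministic approach, assuming PyMuPDF and Pillow are installed and that both files have the same page count and page dimensions: rasterize each page and take a pixel-wise difference.

```python
# A minimal sketch of a deterministic visual PDF diff.
# Assumes: pip install pymupdf pillow, and that corresponding
# pages in the two PDFs have identical dimensions.
import fitz  # PyMuPDF
from PIL import Image, ImageChops

def render_page(doc, page_number, dpi=150):
    """Rasterize one PDF page to a PIL image."""
    pix = doc[page_number].get_pixmap(dpi=dpi)
    return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

def diff_pdfs(path_a, path_b, out_prefix="diff_page"):
    """Write a per-page pixel-diff image for two PDFs."""
    doc_a, doc_b = fitz.open(path_a), fitz.open(path_b)
    for i in range(min(len(doc_a), len(doc_b))):
        img_a, img_b = render_page(doc_a, i), render_page(doc_b, i)
        diff = ImageChops.difference(img_a, img_b)
        if diff.getbbox():  # None means the pages are pixel-identical
            diff.save(f"{out_prefix}_{i + 1}.png")

diff_pdfs("v1.pdf", "v2.pdf")
```

ImageChops.difference returns a black image wherever the pages match, so getbbox() is None only when a page pair is pixel-identical; only pages that actually differ get written out.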
This is definitely not a strength of multi-modal LLMs. Multi-modal capabilities are still too flaky, especially when looking at a page of a PDF, which can have multiple areas of focus.