Crazy, I'd have thought that modern multi-modal LLMs could do this, but when I tried Gemini, ChatGPT-4o, and Claude, they all fell flat:

- Gemini at first only diff'd the text; when pushed, it identified the items in the images but hallucinated the differences between the versions. It could not produce an image output.

- Claude only diff'd the text and refused to believe that there were images in the PDFs.

- ChatGPT attempted to write and execute Python code for this, which errored out.


Visually comparing two PDFs is something a PC can do deterministically, without any resource- (and energy-) intensive LLMs. People will soon use LLMs for things they are not especially good or efficient at, like computing the sum of the numbers in an Excel table... (or are they doing that already?).
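For instance, here's a minimal sketch of such a deterministic diff, assuming PyMuPDF and Pillow are installed (the file names are placeholders):

    # Deterministic visual diff of two PDFs: rasterize pages, compare pixels.
    # pip install pymupdf pillow
    import fitz  # PyMuPDF
    from PIL import Image, ImageChops

    def render_page(page, zoom=2):
        """Rasterize a PDF page to a PIL image at the given zoom factor."""
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

    def diff_pdfs(path_a, path_b, out_prefix="diff"):
        doc_a, doc_b = fitz.open(path_a), fitz.open(path_b)
        for i in range(min(len(doc_a), len(doc_b))):
            img_a, img_b = render_page(doc_a[i]), render_page(doc_b[i])
            if img_a.size != img_b.size:  # page geometry changed
                print(f"page {i + 1}: size differs")
                continue
            diff = ImageChops.difference(img_a, img_b)
            if diff.getbbox():  # getbbox() is None when pixel-identical
                diff.save(f"{out_prefix}_page{i + 1}.png")
                print(f"page {i + 1}: differs, see {out_prefix}_page{i + 1}.png")

    diff_pdfs("old.pdf", "new.pdf")

Same inputs, same output, every time, and it runs in a fraction of a second on a laptop.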


As a bonus, they'll get a result that merely looks plausible.


This may be the type of thing that LLMs are currently the worst at. I'm not surprised at all.


This is definitely not a strength of multi-modal LLMs. Multi-modal capabilities are still too flaky, especially on a page of a PDF, which can have multiple areas of focus.


I would fully expect an LLM not to get natively good at this, but to know how to reach out to another tool that is.
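As a rough sketch of what that could look like with OpenAI-style function calling, where diff_pdfs is the hypothetical helper from the sketch above (model name and file paths are placeholders):

    # Expose a deterministic PDF diff as a tool the model can call
    # instead of guessing at pixels itself.
    import json
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "diff_pdfs",
            "description": "Render two PDFs and report per-page visual differences.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path_a": {"type": "string"},
                    "path_b": {"type": "string"},
                },
                "required": ["path_a", "path_b"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "What changed between old.pdf and new.pdf?"}],
        tools=tools,
    )

    # The model responds with a tool call; we run the deterministic diff.
    for call in resp.choices[0].message.tool_calls or []:
        print(call.function.name, json.loads(call.function.arguments))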

