could you not just compare the source (or perhaps even the hash) of the PDF and ...

ydant · 2024-07-02T14:30:28 1719930628

I use some custom tools for PDF comparison (visual, textual, and perceptual hash) for my personal records/accounting purposes.

A number of the financial and medical institutions I deal with re-generate PDFs every time you request them, but the content is 99-100% identical. Sometimes just a date changes. So I use a perceptual hash and content comparison to automate detecting truly new documents vs. ones that are only slightly changed.
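As a rough illustration of the perceptual-hash idea, here is a minimal sketch of an "average hash" over a page that has already been rendered to a grid of grayscale values. This is not ydant's tooling (the comment does not say which hash is used); `average_hash`, `hamming`, and the 8x8 grid size are all assumptions made for the example.

```python
def average_hash(gray, size=8):
    """Return a size*size-bit hash of a 2D grid of grayscale values (0-255).

    Assumes the grid is at least `size` pixels in each dimension.
    """
    h, w = len(gray), len(gray[0])
    cells = []
    # Shrink the page to size x size by averaging rectangular blocks.
    for by in range(size):
        for bx in range(size):
            y0, y1 = by * h // size, (by + 1) * h // size
            x0, x1 = bx * w // size, (bx + 1) * w // size
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: is it brighter than the page average?
    bits = 0
    for c in cells:
        bits = (bits << 1) | (c > mean)
    return bits


def hamming(a, b):
    """Number of bits that differ between two hashes."""
    return bin(a ^ b).count("1")
```

Two renders of the same statement that differ only in a small region should land within a few bits of each other, while a genuinely different page should be far away. Where to draw that line is a tuning decision that depends on the documents.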

jabroni_salad · 2024-07-02T13:34:18 1719927258

If the document is a legally required disclosure (like a bank's fee schedule for example) then you need to grade that document directly rather than its source code. PDFs are horrible and there is a lot that can go wrong with making them between writing and publishing.

alexdoesh · 2024-07-02T13:08:00 1719925680

Hashes can change regularly due to metadata. Source checks may also require some filtration or preprocessing before comparison. Visual comparison is the best option here, especially if you have a complex document with multiple third-party components that may change both the hash and source but keep the visual appearance the same.
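To make the metadata point concrete, here is a small sketch of the kind of preprocessing a source or hash comparison might need. The `VOLATILE` pattern and both helper names are hypothetical; the pattern only blanks out the `/CreationDate`, `/ModDate`, and `/ID` entries, and real files would likely need more rules (and compressed streams would have to be decoded first).

```python
import hashlib
import re

# Hypothetical normalizer: strips a few entries that commonly vary between
# otherwise identical PDFs. This is an assumption, not a complete list.
VOLATILE = re.compile(
    rb"/(CreationDate|ModDate)\s*\([^)]*\)|/ID\s*\[[^\]]*\]"
)


def raw_digest(data: bytes) -> str:
    """SHA-256 of the file bytes exactly as they are."""
    return hashlib.sha256(data).hexdigest()


def normalized_digest(data: bytes) -> str:
    """SHA-256 after removing the volatile entries matched by VOLATILE."""
    return hashlib.sha256(VOLATILE.sub(b"", data)).hexdigest()
```

Even with this kind of filtering, a change inside the generating library can alter the bytes without altering what the reader sees, which is why the visual comparison is attractive.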

thibaut_barrere · 2024-07-02T22:05:56 1719957956

In this case, we indeed have multiple components (although not third-party), and being able to refactor those without risk is quite nice.

knallfrosch · 2024-07-02T22:24:22 1719959062

What should I do when the assertion fails – inspect the PDF with my sad little caveman eyeball?

ggrosskopf · 2024-07-03T07:31:54 1719991914

In my own tests, where I inspect PDF differences in Python, I iterate through the pages if the page count is the same. I convert each page to a bitmap with PIL, compute the diff (`ImageChops.difference` is black wherever the images match and colored where they differ), and find the extent of the diff with `getbbox`. This gives me the coordinates of the rectangle where changes appeared. I then use those to print the page with a colored rectangle and to print out the crops.

I output the original page, the original rectangle, the original page with the colored rectangle, the new page, the new rectangle, and the diff both cropped and uncropped. Only after that do I start using my caveman eyeballs.

I also pixelate it a bit and apply a brightness cutoff to the diff to see whether the difference actually matters. Optionally, I also try re-cropping, shifting by a limited number of pixels, to check whether the difference is ignorable because everything just moved to the left a bit.

I also recommend exporting the new PDF from the CI/CD tool so it can be put back into the test as the reference. Even between Linux distros and versions, small changes in fonts and the like make a difference.
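The per-page diff described above could be sketched roughly like this with Pillow. It assumes both pages have already been rendered to `PIL.Image` objects of the same size (the rendering step is out of scope here); `diff_region`, `annotate`, and the `cutoff` value are names and defaults invented for the example, and the pixelation and shift-tolerance steps are left out.

```python
from PIL import Image, ImageChops, ImageDraw


def diff_region(old, new, cutoff=32):
    """Return the bounding box of pixels differing by more than `cutoff`, or None."""
    if old.size != new.size:
        raise ValueError("pages have different sizes")
    # Black where the pages match, brighter where they differ.
    diff = ImageChops.difference(old.convert("L"), new.convert("L"))
    # Brightness cutoff: faint differences are treated as no difference.
    mask = diff.point(lambda v: 255 if v > cutoff else 0)
    # (left, upper, right, lower) of the non-black area, or None if all black.
    return mask.getbbox()


def annotate(page, box, colour="red"):
    """Return a copy of `page` with `box` outlined, for manual inspection."""
    out = page.convert("RGB")
    ImageDraw.Draw(out).rectangle(box, outline=colour, width=3)
    return out
```

When `diff_region` returns a box, `old.crop(box)` and `new.crop(box)` give the two crops to save next to the annotated pages, so a failing assertion points straight at the region to look at.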

thibaut_barrere · 2024-07-02T22:05:18 1719957918

The source sometimes changed for small internal reasons in the library generating the PDF (Prawn), so just comparing the source would not give a clear-cut answer. Visual comparison has worked quite nicely over time.