
Could you not just compare the source (or perhaps even the hash) of the PDF and assert on that?



I use some custom tools for PDF comparison (visual, textual, and perceptual hash) for my personal records/accounting purposes.

A number of the financial and medical institutions I deal with re-generate PDFs every time you request them, but the content is 99-100% identical. Sometimes just a date changes. So I use a perceptual hash and content comparison to automate detecting truly new documents vs. ones that are only slightly changed.
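A perceptual hash of this kind could be sketched as a simple average hash. This is not the commenter's actual tool, just a minimal illustration: it assumes each page has already been rendered to a Pillow image, and the names `average_hash` and `hamming` are hypothetical.

```python
from PIL import Image


def average_hash(page, size=8):
    """Return a size*size-bit average hash of a Pillow image."""
    # Shrink to a tiny grayscale thumbnail so fine detail drops out.
    small = page.convert("L").resize((size, size), Image.BOX)
    pixels = list(small.tobytes())  # one byte per pixel in mode "L"
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        # One bit per thumbnail pixel: 1 if brighter than the mean.
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(hash_a, hash_b):
    """Number of differing bits between two hashes."""
    return bin(hash_a ^ hash_b).count("1")
```

A small Hamming distance would then suggest a regenerated copy of a known document, and a large one a genuinely new document. Where to put the threshold is a judgment call that depends on the documents.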


If the document is a legally required disclosure (like a bank's fee schedule for example) then you need to grade that document directly rather than its source code. PDFs are horrible and there is a lot that can go wrong with making them between writing and publishing.


Hashes can change regularly due to metadata. Source checks may also require some filtration or preprocessing before comparison. Visual comparison is the best option here, especially if you have a complex document with multiple third-party components that may change both the hash and source but keep the visual appearance the same.
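To illustrate the metadata point, here is a rough sketch of the kind of preprocessing a source or hash check might need. It assumes the volatile parts are the uncompressed `/CreationDate`, `/ModDate`, and trailer `/ID` entries; real PDFs can also carry XMP metadata or compressed object streams that this does not handle. The names `raw_hash` and `normalized_hash` are made up for the example.

```python
import hashlib
import re

# Assumed volatile entries: Info-dictionary dates and the trailer /ID pair.
_VOLATILE = [
    re.compile(rb"/(?:CreationDate|ModDate)\s*\([^)]*\)"),
    re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"),
]


def raw_hash(data: bytes) -> str:
    """SHA-256 of the file bytes exactly as written."""
    return hashlib.sha256(data).hexdigest()


def normalized_hash(data: bytes) -> str:
    """SHA-256 after stripping the assumed volatile metadata entries."""
    for pattern in _VOLATILE:
        data = pattern.sub(b"", data)
    return hashlib.sha256(data).hexdigest()
```

Two files that differ only in those entries get different raw hashes but the same normalized hash. Anything the patterns miss still breaks the comparison, which is part of why visual comparison tends to be more robust.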


In this case, we indeed have multiple components (although not third-party), and being able to refactor those without risk is quite nice.


What should I do when the assertion fails – inspect the PDF with my sad little caveman eyeball?


In my own tests, where I inspect PDF differences in Python, I iterate through the pages (if the number of pages is the same), convert each page to a bitmap with PIL, compute the diff (ImageChops.difference is black wherever the pages match and colored wherever they differ), and find the extent of the diff with `getbbox`. This gives me the coordinates of the rectangle where changes appeared. I then use those to print the page with a colored rectangle and to print out the crops.

I output the original page, the original rectangle, the original page with the colored rectangle, the new page and the new rectangle, and the diff both cropped and uncropped. Only after that do I start using my caveman eyeballs.
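Roughly, the core of that could look like the sketch below. It assumes the pages are already rendered to same-sized Pillow images (the PDF-to-image step is left out), and `diff_bbox` and `diff_report` are hypothetical names, not the commenter's code.

```python
from PIL import Image, ImageChops, ImageDraw


def diff_bbox(old_page, new_page):
    """Bounding box (left, upper, right, lower) of changed pixels, or None."""
    diff = ImageChops.difference(old_page.convert("RGB"), new_page.convert("RGB"))
    # getbbox() returns None when the diff is entirely black (no change).
    return diff.getbbox()


def diff_report(old_page, new_page, color="red"):
    """Return images worth eyeballing, or None when the pages match."""
    box = diff_bbox(old_page, new_page)
    if box is None:
        return None
    marked = new_page.convert("RGB")  # convert() returns a copy
    # ImageDraw treats the second corner as inclusive, hence the -1.
    ImageDraw.Draw(marked).rectangle(
        (box[0], box[1], box[2] - 1, box[3] - 1), outline=color, width=2
    )
    return {
        "bbox": box,
        "old_crop": old_page.crop(box),
        "new_crop": new_page.crop(box),
        "new_marked": marked,
    }
```

Saving the returned images as CI artifacts means a failing assertion comes with pictures of exactly where the pages differ.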

I also pixelate the diff a bit and apply a brightness cutoff to see whether the difference actually matters. Optionally, I also try re-cropping, i.e. shifting by a limited number of pixels, to see whether the difference becomes ignorable because everything just moved to the left a bit.
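The pixelate-and-cutoff step might be sketched like this. The block size and cutoff are arbitrary assumed values, `significant_diff_bbox` is a hypothetical name, the pages are assumed to be same-sized Pillow images, and the shift-tolerance step is not covered.

```python
from PIL import Image, ImageChops


def significant_diff_bbox(old_page, new_page, block=4, cutoff=32):
    """Bounding box of differences that survive pixelation, or None."""
    diff = ImageChops.difference(
        old_page.convert("RGB"), new_page.convert("RGB")
    ).convert("L")
    # Pixelate: average each block x block tile into one pixel,
    # so isolated specks are diluted.
    small_size = (max(1, diff.width // block), max(1, diff.height // block))
    small = diff.resize(small_size, Image.BOX)
    # Brightness cutoff: keep only tiles whose average difference is large.
    mask = small.point(lambda value: 255 if value > cutoff else 0)
    box = mask.getbbox()
    if box is None:
        return None
    # Scale tile coordinates back to page pixels.
    return tuple(coord * block for coord in box)
```

With these defaults a single changed pixel is averaged away, while a solid changed region still produces a bounding box, rounded to the tile grid.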

I also recommend exporting the new PDF from the CI/CD tool so it can be put back into the test as the reference. Even between Linux distros and versions, small changes in fonts and the like make a difference.


The source sometimes changed for small internal reasons in the library generating the PDF (Prawn), so just comparing the source would not give a clear-cut answer. A visual comparison has helped quite nicely over time.



