Various mass-market desktop apps that I've worked on have had a test system like this, where tests will screenshot the new build in various states and the pictures are compared with known good reference screenshots from a previous build. If a comparison fails, QA will check that the change is intentional and either update the reference image or file a bug.
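For anyone who hasn't built one of these: the core comparison step is usually just a pixel diff of the new screenshot against the stored reference, with some tolerance so trivial rendering noise doesn't fail the run. A minimal sketch in Python, assuming Pillow is available (paths and the tolerance knob are placeholders):

    from PIL import Image, ImageChops

    def compare_screenshot(new_path, reference_path, diff_path, tolerance=0):
        """Return True if the new screenshot matches the reference.

        tolerance is the number of differing pixels allowed, to absorb
        trivial rendering noise; 0 means an exact match is required.
        """
        new = Image.open(new_path).convert("RGB")
        ref = Image.open(reference_path).convert("RGB")

        # Different dimensions are always a failure.
        if new.size != ref.size:
            return False

        diff = ImageChops.difference(new, ref)
        # getbbox() is None when the images are pixel-identical.
        if diff.getbbox() is None:
            return True

        # Count pixels that differ in any channel.
        differing = sum(1 for px in diff.getdata() if px != (0, 0, 0))
        if differing > tolerance:
            diff.save(diff_path)  # save the diff image for QA to inspect
            return False
        return True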
Even quite minor OS updates can cause the tests to fail en masse, because of a global OS change to the system font or button design or whatever, which is a shame as that is just the point where you want to see what the OS change actually broke (like one of your windows now comes up offscreen, or your help screen is now always in Icelandic).
With app testing you can't restrict the detection to the window content area, as a bug could for example give the main window the wrong kind of title bar, or make it draw its default title in the wrong localized language, and you would want to detect that.
Then, when making UI changes, you need a mechanism for marking which comparisons are expected to fail after the change, so that their reference images can be automatically regenerated for the next build.
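One workable (if low-tech) version of that marking mechanism is a plain list of test names expected to change, checked in alongside the UI change; the comparison step consults it and adopts the new screenshot instead of failing. A rough sketch, with the manifest name and layout being my own invention:

    import shutil
    from pathlib import Path

    # Hypothetical manifest checked in with the UI change: one test name per line.
    EXPECTED_CHANGES = Path("expected_visual_changes.txt")

    def expected_to_change(test_name):
        """True if this comparison was marked as intentionally changing."""
        if not EXPECTED_CHANGES.exists():
            return False
        return test_name in EXPECTED_CHANGES.read_text().split()

    def handle_failed_comparison(test_name, new_shot, reference_shot):
        """Called when a screenshot no longer matches its reference.

        Returns True if the failure was expected and the reference has been
        regenerated for the next build, False if QA should look at it.
        """
        if expected_to_change(test_name):
            shutil.copyfile(new_shot, reference_shot)
            return True
        return False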
I don't think I've ever worked on a project that got this entirely automated, and it resulted in a lot of work for QA. On a complicated app like a web browser it is a really valuable system though.
>Even quite minor OS updates can cause the tests to fail en masse
I found it didn't even take that. Sometimes indeterminism in the code you wrote or indeterminism in code you can't even control will cause tests to fail en masse.
I'm currently working on a testing framework that uses Sikuli, which takes this a step further and tests interaction. It can be used on any app, not just websites. Sikuli uses OpenCV to handle the interactions. It's hard to test things like drag and drop, HTML canvas, and resizing. Let me know if anyone is interested. It will be open source and produce visual test outputs.
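For context, SikuliX scripts are written in Python (Jython) and drive the UI by matching reference images on screen, so an interaction test reads roughly like this (the .png file names are made up; click, wait, type, exists, and dragDrop are the standard SikuliX script functions):

    # SikuliX script: drive the app by matching reference images on screen.
    # The .png files are screenshots of the controls, captured beforehand.

    wait("login_button.png", 10)        # wait up to 10s for the login screen
    click("login_button.png")
    type("username_field.png", "demo_user")

    # Drag-and-drop is where image-based automation gets fragile.
    dragDrop("item_in_list.png", "drop_target.png")

    if not exists("success_banner.png", 15):
        print("FAIL: success banner never appeared")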
I have a similar pipeline set up at work -- it's surprisingly valuable.
CSS is rife with potential to cause faraway effects. Catching these regressions is very satisfying.
It's also useful to get a survey of all of your UX. Being able to see everything at once has helped us improve the dark corners of our app/site and spot patterns we can extract into a design system.
Running mobile width screenshots has been awesome. Designing/developing for mobile first doesn't always happen and this surfaces areas that need responsive work pretty effectively.
Automating this as part of the build process is very cool. As a UX designer, I once put together a rudimentary node script that used phantom.js (I think, it was a while ago) to grab different-sized screenshots of a page I was lead designer for. It would run once or twice a week. Things were always changing and people would always want to see what the site looked like before change X. Sometimes we had design artifacts, but sometimes we didn't.
It worked, but it wasn't an ideal solution for a variety of reasons.
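The modern equivalent is a few lines with a headless browser. Here's a rough sketch of the same idea using Playwright's Python API rather than the original PhantomJS script (the URL and widths are placeholders):

    from datetime import date
    from playwright.sync_api import sync_playwright

    URL = "https://example.com/"          # placeholder for the page in question
    WIDTHS = [320, 768, 1024, 1440]       # mobile through desktop

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for width in WIDTHS:
            page.set_viewport_size({"width": width, "height": 900})
            page.goto(URL)
            # full_page captures the whole scroll height, not just the viewport
            page.screenshot(path=f"shots/{date.today()}-{width}px.png", full_page=True)
        browser.close()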
Having a tool that can do visual regression testing without necessarily being part of the build pipeline would be great for designers who are already doing this manually, because, for a great variety of reasons, these issues aren't being caught by anyone else (been there, done that, "we don't need QA!" they said...). I've seen automated screen capture services before, but having to manually check them for regressions is a pain.
If there's already something like this out there, I'd love to hear about it.