Imho, the problem with Lighthouse (and Pagespeed before it) isn't that they're not perfect, it's that they assign scores/grades.
When Google assigns a score to something, people understand it to mean highest score = best and start optimizing for the grade Google gives them, not for performance and user experience, which the grade is supposed to represent.
It would be more fruitful to list the issues and their severity but not add overall scores, because scores change the objective from fixing the problems to getting high scores. They also occasionally have bugs where they punish something with a worse score that is actually an improvement in the real world, discouraging people from doing the right thing ("I want my site to load faster, but Google is buggy and will rank me down if I do").
The problem, as I see it, is that it more or less makes Google the arbiter of the internet, and when their purpose is to provide internet advertising, I see a conflict of interest. Especially when things like "supporting AMP" will likely impact your score.
While unfortunately many businesses have to care about Google SEO due to their complete monopoly status in the search field (one that is of course growing increasingly unhealthy), I'd prefer to not let Google be an ML-driven judge, jury and executioner when it comes to being visible on the net.
If you read my comment again, you'll see that I'm not claiming that they are "judge, jury, and executioner" at all. I'm saying that people perceive them to be, and act accordingly by optimizing for the score, not what the score is supposed to measure.
Wouldn't be surprised if they use it in search rankings, chances are they already do. And not being on Google effectively == dead. They already take into account page load speed (which discriminates against complex but useful & good web apps) and iirc also whether you have AMP.
Google using technology they have developed to rank other sites on their search engine, while making sure that nobody can leave the Google-verse (AMP sites are served from Google's CDN, not the original source), would open them up to huge antitrust issues, so at least officially they have said that it's not a ranking factor.
Do a test and compare your own optimized site to an amp site. See if you can't beat it. Not that hard when you consider the 3rd party requests amp requires.
Maybe Google's CEO runs lighthouse on a server in his closet and personally has hacked it into something ads something something data something something search
My point being: I'm not personally amenable to conspiratorial thinking; it correlates with argument dysfunction.
> They also occasionally have bugs where they punish something with a worse score that is actually an improvement in the real world
If there's a bug and it lists an issue like "you got the ARIA roles here wrong" when they aren't wrong, it really isn't any different from a less-than-100 score with that issue listed underneath, in terms of feeling pressure to fix it.
Completely automating accessibility testing isn't possible (the article calls this out), but it's still a net good for these tools to exist.
I mostly agree with you, although I think having a "score" can be really useful. It is only as useful as the underlying formula for calculating it, however.
I view it like I do code coverage: it doesn't mean your code is well tested just because it has 100% coverage, but if the number is going down with each commit then that does likely indicate that testing is not a priority and isn't happening with regularity. The real problem comes in when people (especially managers) assume there is qualitative information inherent in the score, when there is not, or that it is a quantitative measure of test quality, when it is not.
Depends on what the right thing is in your specific context. If you drop from 100% to 98% but increase your conversion rate 5x, the trade off should be obvious.
From my experience anything 95% or better is basically perfect. Although stuff like "time to first byte" might impact the number of pages crawled when you have millions of pages etc. There will always be edge cases.
The article makes a funny point, but I'm not sure it is a practical one. Who would honestly go through those steps? Maybe someone copy-pasting or installing templates could make a few of those errors?
Overall the score is helpful and the specific errors detected are always visible in the breakdown below the score. As with anything if you don't understand the underlying technology, tooling and metrics can be troublesome. In those cases, a simple score is still better than giving a novice developer a complex breakdown.
Yeah, we did an analysis* on the recent upgrades to use LCP, and found that the score now heavily incentivizes Iframe/Embed content even though they are a big negative for actual performance.
My guess is if there wasn't a Lighthouse score, the focus would be on another singular metric instead, like First Contentful Paint for web performance. The score, while imperfect, helps at least to have a focus on something that weighs multiple metrics.
As an industry, we should be wary of having a single company define these scores. It seems like a conflict of interest. I'd rather have this controlled by a standards group.
My issue with the score is that it is numerical. If I get 91, I want 100.
Having a traffic light style system, whereby green is good, red is bad, etc., would give a decent target without the costly over-optimisation for perfection.
To your point of a single company defining these scores - a standards body would be great, I agree, but there's a danger it'll be like all the other web standards bodies, whereby Google is a member, albeit one with pretty much all the power.
Yeah, I don't use static analysis tools to "grade" my code. I use them to automatically find simple issues I might have overlooked. If ESLint doesn't find any issues, it doesn't mean my Javascript is actually good. It just means it isn't bad in the ways ESLint is looking for.
The solution is simple: start disengaging from Google and its scores and move to other search engines (DuckDuckGo, Ecosia?), so these "optimizations" stop altogether, and the monopoly as well.
And yet, making it a target can still be a positive move.
The upside of having it be a score is that it makes it easier to have a big cultural push to deal with the issues covered in the score. So the measure isn't as good, but you have more people taking it seriously.
Wow, CSS system color keywords seem like a massive privacy leak. I just tested setting the property:
background: Background;
on an element, and then changing my Windows desktop background. The element immediately changes color to match my desktop. Then if I call getComputedStyle on the element, I get my desktop background color in javascript. This is in Firefox private mode, and apparently every website can read all my system colors. Why in the world is this enabled by default?
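For anyone who wants to reproduce it, this is roughly all it takes (the element and variable names are mine, and the example colour in the comment is made up):

    // probe a CSS system color and read it back from script
    const probe = document.createElement('div');
    probe.style.background = 'Background';  // CSS2 system color keyword
    document.body.appendChild(probe);
    const desktopColor = getComputedStyle(probe).backgroundColor;  // e.g. "rgb(58, 110, 165)"
    probe.remove();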
I mean, not to be defeatist but... once you’ve got JS turned on you’ve already handed out such a massive amount of entropy I’m not sure this one extra item makes a huge difference.
If it's the OS default, it's probably worthless. But if it isn't, I would imagine it could be quite unique, no? Presuming it's an RGB color, that's 16M possibilities. And there are multiple system colors, meaning even more chance you're a snowflake if you customized them. If you chose a random color on just 2 of them, that's probably enough to make you unique among the entire world. (But it is, of course, likely that you might choose something common, like #ff00ff.)
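Back-of-the-envelope, assuming the picks really are uniformly random (they won't be, so treat these as upper bounds):

    Math.log2(16777216);      // ≈ 24 bits for one fully custom 24-bit colour
    2 * Math.log2(16777216);  // ≈ 48 bits for two custom colours
    Math.log2(8e9);           // ≈ 33 bits is enough to single out one person on Earth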
If you turn off JavaScript, that's also probably a pretty good signal, no? (I'm just hearing someone shouting "There are dozens of us! Dozens!")
If you turn Javascript off, the only information the website can get is user agent and IP, which would narrow it down much less than using Javascript even just among the pool of non-javascript users.
Keep in mind that there are a lot of services that load sites without Javascript enabled (scrapers, mail, preloading).
Pretty sure you can get some extra information through CSS media queries that only trigger a server hit when active (allowing you to add, say, screen size and color range to the fingerprint even without javascript).
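Something along these lines, I believe (the beacon URL is made up): a background image inside a media query is only fetched when the query matches, so the request itself leaks the answer to the server.

    /* no-JS fingerprinting sketch: only the matching rule fetches its
       background image, so the request tells the server the viewport width */
    @media (min-width: 1920px) {
      body { background-image: url('/beacon?width=wide'); }
    }
    @media (max-width: 1919px) {
      body { background-image: url('/beacon?width=narrow'); }
    }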
I'd imagine most of the background colors are the same, as most people set an image as their background.
I've not really thought about (or even known, to be honest) what my desktop background color is these days. It's not something I've thought about since Windows 95. Once XP came along with that pretty background, I think I've used a photo ever since.
But oh well, one more bit is one more bit for the people that do still set a background.
I set my background to a solid color, mostly so compression on screenshots and screen captures is more effective. Though I'd consider myself an edge case here.
One of the "philosophers' stone" goals of the software industry is to completely replace human testing with automated testing.
Basically, I think automated testing is a very good thing, and we should definitely try to do as much of it as possible, so we can clear the way for more useful and meaningful human testing.
I've always thought that the engineers in QC should be just as skilled and qualified as the ones building the product.
Part of what they should do is design and build really cool automated tests, but I think that they should also be figuring out how to "monkey-test" the products, and get true users (this is 1000% required for usability and accessibility testing) banging on the product.
"True users" != anyone even remotely connected with the product development or testing, beyond a contract agreement.
But I'm kind of a curmudgeonly guy, and my opinions are not always welcomed.
> One of the "philosophers' stone" goals of the software industry is to completely replace human testing with automated testing.
Generally the people who say that don't have a clue - I meet them monthly: "We can save X money by writing a Selenium script and firing the whole QA team!" is typical.
I've managed QA teams at a few companies in telco, boxed software, and SaaS.
And people invariably get all huffy when I say, "Are you going to assign one of your programmers to review changed screens before integration and deploy?"
Then suddenly their eyes glaze over since they just lost a partial headcount and accepted even more responsibility for the release.
What I want are world-class manual testers that understand all of the product features, know how to approach testing it, what the problem areas are, and can communicate the issues to developers.
Luckily I know a few, and I wouldn't call what they do "monkey-testing" at all - their opinion of a release's quality is the only one that actually counts to me.
What I don't want are programmers writing Selenium scripts with no understanding of the product, then moving on to some other project and abandoning half-done tests. That's what you get when you fire your QA team.
> I've always thought that the engineers in QC should be just as skilled and qualified as the ones building the product.
World-class QA people are skilled at manual testing. World-class automated test programmers are not QA, they're called programmers.
For developers, what you can do is make the software you write testable. For web applications, for example, add IDs to all buttons, which automated testing software usually needs in order to know what to target.
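A quick sketch of what that looks like in practice (the id, markup, and browser choice here are made up):

    <!-- markup: a stable id the test can target -->
    <button id="save-invoice">Save</button>

    // test script, using the selenium-webdriver JavaScript bindings
    const { Builder, By } = require('selenium-webdriver');
    const driver = await new Builder().forBrowser('chrome').build();
    await driver.findElement(By.id('save-invoice')).click();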
Also, if you crapped out a feature (did a copy and paste of other code with "it works for me" local testing), just tell your QA person, "I crapped it out. Please take a look." so they know they'll need extra time to look at it.
I worked for a Japanese company that was renowned for Quality (with a capital "Q").
In the US, having "Quality" in your title often means that you are in a dead-end job.
In that company, it meant that you were an elite, and that you had considerable power over Engineering.
It also meant that you were a completely anal-retentive S.O.B.
They had spreadsheets with 3,000 rows (each row was a test, usually a "monkey" test).
If even one of those rows got a red "X," the whole shooting match would come to a screeching halt, with heads rolling around like foosballs.
It also meant that they double-checked every bug report six ways to Sunday. If they reported a bug, It. Was. A. Bug. No ifs, ands, or buts. If you questioned it, they would get quite huffy; which was not a good thing (see "power," above).
Management kept the QC organization quite separate from Engineering, and they often had an adversarial relationship; which was sometimes encouraged.
This led to engineering departments having some very large testing teams; often outnumbering the engineers. The engineering departments would be penalized for bugs found by the official QC organization, so having large in-house testing teams was worth it.
Their QC doesn't work especially well for software. They would get frowny faces, when I'd suggest automated testing, or process quality best practices.
Quality was always treated separately from construction. I could never quite agree with that, but it also meant that I was "on my own," if I wanted to try using modern quality engineering techniques.
Their [hardware] products are damn good, though. What many companies would consider minor quality issues are treated like Extinction-Level Events, at that company. They've been doing it for 100 years, so it's difficult to argue with them.
> The engineering departments would be penalized for bugs found by the official QC organization, so having large in-house testing teams was worth it.
Some more details about that behavior are:
In Japan, employees start with a zero score, then for each mistake lose a point. So they will analyze/block anything that could potentially cause a demerit. Obviously that's the polar opposite of "move fast and break things."
Regarding the "3,000 line spreadsheet", yup, that's normal there and part of why their meetings often run to midnight. Gotta check and double-check every row before moving on to the next one. :)
If you want to learn more about these cultural differences, read patio11's excellent posts about working in Japan.
Totally unrelated, but your anecdote reminds me of Japan's WW2-era history - the animosity between the Imperial Army and Navy was so bad that the army built their own submarines.
- Correct. Before the Doolittle Raid, they weren't cooperating with each other. Things totally changed after Tokyo was bombed.
- One of the reasons for the Pearl Harbor bombing was because the Japanese navy had nothing to do, while the army was engaged in China. But had the navy and army been coordinated, they could have occupied Hawaii.
- Before Guadalcanal, the US Navy and Marines weren't very coordinated. After the Marines were abandoned there without half of their equipment and supplies, a protocol was developed that improved communications.
To this day, the Air Force and Army fight over who can operate aircraft on behalf of the Army.
I think this is important and requires a big shift in how most organizations think of QA.
A lot of QA being done is very mechanical and done by junior staff offshore to keep the cost as low as possible. This causes the value to be low too. For example, the QA team for an F100 company I worked with would meticulously test against specs and file bugs such as "the error message is misaligned on page XYZ". Which was true, but they missed that the error message didn't make any sense at all.
Improved automated testing has the opportunity to free up people to move from Quality Assurance (preventing defects) to being the voice of the user and ensuring quality products. This shift needs a completely different skill set and mindset, but equally it needs organizations to rethink both the cost of software engineering and the value it creates.
It's probably a matter of how important your application actually is. If it's dealing with critical numbers or features that have a real cost when they break, then the investment into good testing pays for itself.
And for specialized products, you want a range of testers from "this is my first day working with this" up through "I send you bug reports so often your helpdesk knows my name."
Having seen many other instances where chasing numbers has led to a worse outcome, I think "metrics driven development" is an abomination that must be abolished. Unfortunately, management seems to really like the idea of turning everything into a number and increasing it at all costs. I have fought against such things, and when I pointed out all the negatives associated with it, they would often agree, but then dismiss the thought completely with a response that essentially means "but it makes the numbers look better."
As the saying goes: "Not everything that counts can be counted, and not everything that can be counted, counts."
On the other hand, a true accessibility score (for example, one awarded by a human or a robust grading process) could reliably quantify accessibility. Goodhart wouldn't apply in that case.
Yeah, as a TDD advocate, this is the most common objection I hear leveled against that methodology: code with 100% test coverage can still be complete crap. And I agree. If you add test coverage only to increase that metric, the result is likely going to be crap. If instead good coverage is a side effect of a healthy development cycle, it's likely the code is going to be excellent.
Basically, the metric should never be the goal, but discovering after the fact that your code does well on a metric is a good sign.
Cool, but this article would have been more useful with some practical examples of things Lighthouse doesn't catch. If the point is "this automated metric isn't perfect", well, no automated metric is, but how bad is it exactly?
I still don't have a sense for how bad Lighthouse is because I've never disabled all keyboard events, disabled all mouse events, or changed the high contrast stylings. The article almost makes the opposite point to me -- how bad can Lighthouse be if the only loopholes are things that would pretty obviously have accessibility issues?
The only useful examples I could see were the ones at the bottom of the article, which show up in Lighthouse next to the score.
You didn't understand the article. The article doesn't contain a single loophole. Each of those features has legitimate use cases and it is impossible to detect whether they are illegitimate.
Just look at the first example. It straight up denies access to the content, and the reason Lighthouse accepts it is that hiding content just hides content. Usually, when content is marked hidden, the user is not supposed to see it, so it is not a concern if a user can't see hidden content. Imagine if you had a hidden dialog that is only shown on user interaction. No automated tool can tell whether this is an accessibility issue or not. It's like deciding if your car should be blue or black: there is no objective answer, and choosing the wrong color won't reduce your score.
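To make that concrete: the markup below is fine as long as something eventually flips the flag, and a total content blocker if nothing ever does, and no static check can tell those two pages apart (ids and text are made up).

    <!-- hidden until the user asks for it: a perfectly legitimate use of hidden -->
    <div id="help-dialog" hidden>How to fill in this form ...</div>
    <button onclick="document.getElementById('help-dialog').hidden = false">
      Help
    </button>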
All the other tricks in the article follow the same pattern.
The only thing I can imagine is screen-reader-only labels (although you usually achieve this with a special "sr-only" or "visually-hidden" utility class that doesn't touch the font size).
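For reference, the usual version of that utility class looks roughly like this; it removes the text visually while keeping it in the accessibility tree, without ever touching font-size:

    .sr-only {
      position: absolute;
      width: 1px;
      height: 1px;
      margin: -1px;
      padding: 0;
      overflow: hidden;
      clip: rect(0, 0, 0, 0);
      white-space: nowrap;
      border: 0;
    }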
The article repeatedly states that it's not about Lighthouse and that there's nothing wrong with Lighthouse. The point is that your website can be completely inaccessible and no amount of automated testing will detect it. You need user testing, that's all. The "practical examples" to take away are simple: for a start, make sure that your (human) testers can read and navigate your website. Seems practical enough.
This reminds me of a time when a team member was adamant about code coverage metrics. It felt like an intense amount of busy work that really didn’t improve our codebase or ensure thoughtful tests that actually, you know, caught stuff.
It was just some weird metric we were chasing that involved making sure we go through each function and superficially test calls, regardless of the fact that some of the stuff we were testing gave us no confidence in the actual internals. I will not even mention that code coverage became this number that he/she believed was the standard, even though no attempt was made to build the codebase via TDD from the get-go (making chasing the code coverage metric after the fact laughable). What could I say? The person appealed to the authority of the code coverage metric.
But hey, we got that code coverage percentage up :)
As one of those people who pushes for code coverage, I'd just like to jump in and dump my thoughts, because no tool does what I want and because of that most people don't understand what I'm going for.
The single most important thing is getting code where no consideration was given down to 0%.
This doesn't necessarily mean "all code is covered by tests", though definitely I'd prefer that be a high percent. What it means is all code was either covered by tests, or deemed not worth the time to add a test for and marked as such so the coverage tool ignores it.
Unfortunately, all coverage tools I've ever seen only allow "skip this for coverage", with no nuance. I want at least two reasons for skipping to be built in to the tool, that would be separated in the report:
* Things intentionally not covered because they're simple helpers in the realm of "obviously no bugs", or a small manually-tested API wrapper that works and shouldn't ever change again but that would be a pain or waste of time to write a worthwhile test (significantly reducing the value-to-effort ratio). There are probably other reasons, but those are the two that jump into my mind as the most common in the codebases I've worked with.
* Things that you're not writing a test for because of some other reason, but a test probably could be written for - such as a time crunch while fixing a bug, something complicated you know should have a test but are uncertain about, and so on. This one is measurable technical debt, and ideally should trend towards 0%, but isn't terribly important to do right away - it's a place where a decision was actually made, where losing test coverage is acceptable.
And of course, that leaves code that's covered by no tests and was not intentionally left uncovered. This is what I mean by "no consideration was given" - it's not covered by tests by accident. It's a likely place to find bugs. This is what, IMO, should always be at 0%.
(Quick aside, adding coverage to a legacy codebase would involve marking everything as the second exception. The project functions, but adding all tests immediately is not feasible, so it becomes explicitly marked as technical debt to be reduced over time. All new code then becomes unmarked-by-default where a decision should be made or a test added as the new code is written.)
Having the separate ways to mark uncovered code, and actually using them, is a way to signal to future developers "here be dragons!" when they look at the uncovered and unmarked code.
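Since I don't know of a coverage tool that supports this distinction natively, the closest approximation I can offer is a comment convention layered on top of whatever generic ignore marker your tool has, then counting the buckets separately (purely a sketch; the marker names and the ticket id are made up):

    // coverage-skip(trivial): one-line wrapper, obviously correct, not worth a test
    export const toCents = (euros) => Math.round(euros * 100);

    // coverage-skip(debt): shipped under time pressure, needs a real test, see TICKET-123
    export function reconcileLegacyOrders(orders) { /* ... */ }

A grep for each marker then reports the "deliberately skipped" and "known debt" totals alongside the raw coverage number, and anything uncovered without a marker is the set that should always be at 0%.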
The most interesting part to me (as someone with vision problems) was the WebAIM link [1]. The biggest problem I have is with the almost total blind adoption of low contrast (so often too low for me to even read) and sure enough the section about low contrast [2] says:
"found on 86.3% of home pages. _This was the most commonly-detected accessibility issue_.
My basic question then is why do so many designers and websites choose to break the WCAG guidelines?
From my experience, it is often not an active choice to break the guidelines. It's rather the lack of knowledge/experience or ignorance - or a mix of both ("What is this accessibility thing? Do we need it?").
For contrast ratio specifically, I wish more people would adopt the approach that USWDS uses [1]. It enforces accessible color combinations by using standardized naming with a special property. E.g., I know that `blue-60` on `purple-10` is accessible (WCAG AA), because the absolute difference is 50+ (60 - 10). I'm currently writing a blog post about this approach to spread awareness.
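The whole convention fits in a couple of lines; a rough sketch of the rule as I described it above (token names are illustrative, and the 50+ threshold is the normal-text AA rule):

    // USWDS-style "magic numbers": token names encode a lightness grade, and a
    // grade difference of 50+ implies a WCAG AA pair for normal text
    const grade = (token) => parseInt(token.split('-').pop(), 10);
    const meetsAA = (fg, bg) => Math.abs(grade(fg) - grade(bg)) >= 50;

    meetsAA('blue-60', 'purple-10');  // true
    meetsAA('blue-40', 'gray-10');    // false, check it by hand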
The easiest explanation is that the company hasn't been sued yet. Once a company gets sued and has to settle out of court, suddenly everyone will care about a11y.
First, let me start by saying that this is a good article and sheds light on one of the challenges of accessibility adoption.
Tools like Lighthouse, axe-core, etc. run a subset of tests, which gives a false sense of security about accessibility. Similarly, Accessibility Insights for Web has a fast-pass option, which does the same thing: it runs a subset of tests to catch the most common issues on a website.
But it does not, and cannot (at this moment), catch all the issues that require semantic analysis of a website, like checking that the alt text on an image is meaningful. For tests like those, a human is needed to perform a comprehensive assessment, something Accessibility Insights for Web offers as an Assessment option.
In my opinion, all of these tools do one thing well, and that is raise awareness of the problems that users with a disability face daily when trying to use a website. They are making accessibility a must. The tools still need more work, and I feel confident that they will continue to improve. It all comes down to how much time a development team puts in to make their website completely accessible, which ideally every team should budget and plan for.
Where I am, we use axe-core as part of our automated integration testing. Any change that regresses on accessibility (i.e. new severe a11y issue etc) is blocked from submission in the same way that it would be blocked from submission by causing tests to fail.
I find this useful as it prevents "accessibility rot" where people think "Oh we'll sort out the aria stuff later".
But yeah, I agree that a human test is best. I think of axe et al. as an equivalent of "static analysis" that can pick out the obvious mistakes but won't understand the dynamic nature of the application. It is, after all, great for flagging the "easy" problems, allowing the human tester to focus on the major a11y issues without ending up raising hundreds of bugs for every button, link, and colour combination that should have been sorted out way before any human testing is involved.
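For anyone who wants to wire up the same kind of gate, a minimal sketch with jest-axe, one of several axe-core wrappers (not necessarily what we run internally; the markup is made up):

    const { axe, toHaveNoViolations } = require('jest-axe');
    expect.extend(toHaveNoViolations);

    test('signup form has no detectable a11y violations', async () => {
      document.body.innerHTML = `
        <form>
          <label for="email">Email</label>
          <input id="email" type="email" />
          <button type="submit">Sign up</button>
        </form>`;
      expect(await axe(document.body)).toHaveNoViolations();
    });

Any new violation axe-core can detect fails the build, which is exactly the rot prevention described above; human testers still cover everything it can't see.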
I usually rephrase the same point as: "Automatic testing is able to tell you if a page is inaccessible. However, it cannot tell you if a page is accessible."
Another technique to mess up keyboard users: don’t use the document scroll area, but make your own (and don’t focus it or anything in JavaScript). Thus the user will have to press Tab or click in the area before keyboard navigation keys will work. So for best results put a large number of focusable elements before the scrollable pane, so that the keyboard user must press Tab a large and unpredictable number of times before it works.
You could probably mess with other tabindexes (randomly jump through the document with Tab!) without Lighthouse baulking.
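For the curious, here's roughly how little it takes, plus the one-attribute mitigation if you really must own the scroll area (class names and sizes are made up, and exact behaviour varies by browser):

    <!-- keyboard users can't scroll this until focus lands inside it -->
    <div class="pane" style="height: 300px; overflow-y: auto;">
      ... the actual article content ...
    </div>

    <!-- making the pane focusable at least gives arrow keys a fighting chance -->
    <div class="pane" tabindex="0" style="height: 300px; overflow-y: auto;">
      ... the actual article content ...
    </div>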
I was going to suggest adding `pointer-events: none` so that the user can’t just click to focus it, but that was already done!
(I mentioned focusing your scroll area element as something you need to do if you roll your own rather than using the document scroll area; but that’s not all you need to do. You also need to monitor blur events and change any .blur() calls, so as to avoid the document element ever retaining focus. It inherently depends on JavaScript, and is very fiddly to get fully right—I’m actually not sure if anyone gets it fully right; the interactions of focus and selection are nuanced and inconsistent, and it’s extremely easy to mess up accessibility software; I haven’t finished my research on the topic. I strongly recommend against the technique on web pages; web apps can occasionally warrant it.)
I just read an article a couple days ago about how even YouTube widgets and stuff have huge a11y problems. I think it's time to admit Google is terrible at accessibility. All their devrels talk like it's important and have all these beautiful demos but whenever you look behind the curtain at their products it's terrible.
I'm sure that you find it funny. Consider, though, those for whom accessibility is a matter of great importance, because for those of us, taxes, health, and employment might depend on the diligence of some motherfucker who was not in the mood or just decided to copy/paste from Stack Overflow and ruined someone's experience on a rather important webpage.
Not to mention Atlassian, who had the habit of adding aria-hidden to Jira so that enterprises felt incentivized to purchase the accessibility plugin for their employees.