So the problem here is a bit different. It's not that devices will say "I can play Format X" and then not play it. It's that devices say "I can play Format X at Resolution A, B, C". When you give the device resolution A and B it succeeds but at resolution C it fails to decode it.
In H.264 this would be the "Level" https://en.wikipedia.org/wiki/Advanced_Video_Coding#Levels . A device may say that it can decode Level 4.2 but in reality it can only do 4.1. That means it can play back 1080p30 but not 1080p60. The only way to know is to actually try and observe the failure (which, btw, is often a silent failure from the browser's point of view, meaning you need to rely on user reports).
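To make the 4.1 vs 4.2 arithmetic concrete, here is a rough sketch that checks a resolution and frame rate against the macroblock limits from the H.264 level table. The names `H264_LEVELS` and `fitsLevel` are made up for illustration, only a few levels are listed, and the check ignores the bitrate and decoded-picture-buffer limits that a real level check would also include.

```typescript
// Subset of the H.264 level limits (Annex A of the spec).
// maxMbps: max macroblocks per second; maxFs: max frame size in macroblocks.
interface LevelLimits {
  maxMbps: number;
  maxFs: number;
}

const H264_LEVELS: Record<string, LevelLimits> = {
  "3.1": { maxMbps: 108000, maxFs: 3600 },
  "4.0": { maxMbps: 245760, maxFs: 8192 },
  "4.1": { maxMbps: 245760, maxFs: 8192 },
  "4.2": { maxMbps: 522240, maxFs: 8704 },
  "5.1": { maxMbps: 983040, maxFs: 36864 },
};

// Hypothetical helper: does width x height at fps fit the level's
// frame-size and macroblock-rate limits? (Bitrate and DPB limits are ignored.)
function fitsLevel(level: string, width: number, height: number, fps: number): boolean {
  const limits = H264_LEVELS[level];
  if (!limits) throw new Error(`unknown level ${level}`);
  // A macroblock is 16x16 pixels; partial blocks round up (1080 -> 68 rows).
  const macroblocks = Math.ceil(width / 16) * Math.ceil(height / 16);
  return macroblocks <= limits.maxFs && macroblocks * fps <= limits.maxMbps;
}
```

1080p is 120 x 68 = 8160 macroblocks per frame, so 30 fps needs 244,800 macroblocks per second (just under the Level 4.1 cap) while 60 fps needs 489,600, which only fits from Level 4.2 up.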
Wouldn't it be just as easy to test videos in formats A, B, and C and see if they play? You could check that video.currentTime advances. If it lies about that you could draw to a canvas and check the values. That seems more robust than checking WebGL.
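The currentTime check could look roughly like this. It is only a sketch: `HasCurrentTime`, `advancedEnough` and `playheadAdvances` are hypothetical names, the 0.5 ratio is an arbitrary threshold, and it assumes the video is already playing at normal speed.

```typescript
// Minimal structural type so the sketch does not need DOM typings;
// an HTMLVideoElement satisfies it.
interface HasCurrentTime {
  currentTime: number;
}

// Pure check: did the playhead move at least minRatio of the wall-clock wait?
function advancedEnough(
  startSec: number,
  endSec: number,
  waitedMs: number,
  minRatio: number = 0.5,
): boolean {
  return endSec - startSec >= (waitedMs / 1000) * minRatio;
}

// Sample currentTime, wait, sample again, and compare.
async function playheadAdvances(video: HasCurrentTime, waitMs: number = 1000): Promise<boolean> {
  const start = video.currentTime;
  await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
  return advancedEnough(start, video.currentTime, waitMs);
}
```

In a page you would call `await playheadAdvances(videoElement)` after `play()` has resolved.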
The issue here is the architectural difference between the hardware decoder and the GPU. What happens under the hood with MSE ( https://developer.mozilla.org/en-US/docs/Web/API/Media_Sourc...) is that you are responsible for handing off a buffer to the hardware decoder as a bitstream. Underneath, the GPU sets up a texture and sends the bitstream to the hardware decoder that's responsible for painting the decoded video into that texture.
What often ends up happening is that the GPU driver says "yes, the hardware decoder can do this", accepts the bitstream, and sets up the texture for you, which is bound to your canvas in HTML. It starts playing the video and moves the timeline playhead, but the actual buffer is just an empty black texture. From the software's point of view, the pipeline is doing what it's supposed to. Because the hardware decoder is a black box from the JavaScript perspective, it's impossible to know if it "actually" worked. Good decoders will throw errors or refuse to advance the PTS; bad decoders won't.
Knowing this, your second suggestion was to read back the canvas and detect video. That would work, but the problem here is "what constitutes working video?" We can detect if the video is just a black box, but what if the video plays back at 1 frame per second, or plays back with the wrong colors? It's impossible to know without knowing the exact source content, a luxury that a UGC platform like Twitch does not have.
For this reason just doing heuristics with WebGL is often the "best" path to detecting bad actors when it comes to decoders.
My point with the video to canvas is if you create samples of the various formats in various resolutions then you can check a video with known content (solid red on top, solid green on left, solid blue on right, solid yellow on bottom) and check if that video works. If it does then other videos of the same format/res should render? I've written conformance tests that do this.
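Roughly what I mean by checking known content, as a sketch rather than the actual conformance test code: it takes RGBA bytes in the layout that `getImageData().data` returns and probes one pixel inside each colored band. `PROBES`, `failedProbes`, the probe positions and the tolerance of 48 are all assumptions for illustration; a lossy codec and color conversion will shift the values a bit, so an exact match is not expected.

```typescript
type Rgb = [number, number, number];

// One probe per band of the test pattern, at fractional frame coordinates:
// red on top, green on left, blue on right, yellow on bottom.
const PROBES: { name: string; x: number; y: number; expected: Rgb }[] = [
  { name: "top", x: 0.5, y: 0.1, expected: [255, 0, 0] },
  { name: "left", x: 0.1, y: 0.5, expected: [0, 255, 0] },
  { name: "right", x: 0.9, y: 0.5, expected: [0, 0, 255] },
  { name: "bottom", x: 0.5, y: 0.9, expected: [255, 255, 0] },
];

// Returns the names of the probes whose color is off by more than `tolerance`
// in any channel. An empty array means the frame matches the pattern.
function failedProbes(
  rgba: ArrayLike<number>,
  width: number,
  height: number,
  tolerance: number = 48,
): string[] {
  const failed: string[] = [];
  for (const probe of PROBES) {
    const px = Math.min(width - 1, Math.floor(probe.x * width));
    const py = Math.min(height - 1, Math.floor(probe.y * height));
    const offset = (py * width + px) * 4; // 4 bytes per pixel: R, G, B, A
    const matches = probe.expected.every(
      (channel, k) => Math.abs(rgba[offset + k] - channel) <= tolerance,
    );
    if (!matches) failed.push(probe.name);
  }
  return failed;
}
```

In the browser you would `drawImage` the video onto a 2D canvas, call `getImageData(0, 0, width, height)`, and pass its `data`, `width` and `height` in. An all-black frame fails all four probes.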
At worst it seems like you'd need to do this once per format per user per device but only if that user hasn't already had the test for that video size/format. (save a cookie/indexed-db/local-storage that their device supports that format) so after that only new sizes and formats need to be checked.
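A sketch of the "only test once" part, assuming a storage object with `getItem`/`setItem` (which `localStorage` has). `KeyValueStore`, `cacheKey` and `checkOnce` are hypothetical names, and `runTest` is a synchronous callback here only to keep the example short; a real playback test would be async.

```typescript
// Minimal storage shape; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// One cache entry per codec / resolution / frame rate combination.
function cacheKey(codec: string, width: number, height: number, fps: number): string {
  return `decode-test:${codec}:${width}x${height}@${fps}`;
}

// Run the playback test only if there is no stored result for this combination.
function checkOnce(
  store: KeyValueStore,
  codec: string,
  width: number,
  height: number,
  fps: number,
  runTest: () => boolean,
): boolean {
  const key = cacheKey(codec, width, height, fps);
  const cached = store.getItem(key);
  if (cached !== null) return cached === "1";
  const ok = runTest();
  store.setItem(key, ok ? "1" : "0");
  return ok;
}
```

Failures are cached too, so a device that can't do 1080p60 isn't retested on every visit. You'd probably want some way to expire entries, since a driver or browser update could change the answer.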
Just an idea. No idea what problems would crop up.