Because synching subtitles to the video isn't even a goal.
Most people read more slowly than they can listen. Many other people can read faster than they can listen. But you need subtitles that stick around long enough for the first group, and really, you want a comfortable margin so that people aren't frantically trying to finish reading every subtitle before it disappears.
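To make the "comfortable margin" idea concrete, here is a rough sketch of how you might estimate a minimum on-screen time for a subtitle. The reading rate of 15 characters per second, the 1.5x margin, and the one-second floor are all assumed figures chosen for illustration, not numbers from this discussion, and the function names are hypothetical.

```python
def min_display_seconds(text, chars_per_second=15.0, margin=1.5, floor=1.0):
    """Estimate how long a subtitle should stay on screen.

    chars_per_second, margin and floor are illustrative assumptions:
    a slow reader's rate, a comfort multiplier, and a minimum duration.
    """
    visible = len(text.replace("\n", " ").strip())
    return max(floor, margin * visible / chars_per_second)


def too_fast(text, start, end, **kwargs):
    """Return True if the cue disappears before a slow reader could finish it."""
    return (end - start) < min_display_seconds(text, **kwargs)
```

A cue that is timed to match the spoken line exactly can easily fail this kind of check, since speech and reading do not take the same amount of time.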
I watched the show 大江大河 ("Like a Flowing River") on Viki, which has excellent community-generated subtitling. The second season isn't available on Viki - only YouTube. (Actually, Viki seems to have lost the license to the first season by now, too.) And the subtitles are abominably bad, bad enough to make me stop watching the show. But they're perfectly synched - every subtitle on YouTube is exactly matched, millisecond-to-millisecond, with a Chinese subtitle which it attempts to translate.[1]
[1] Ignoring the timing, which is much too fast for the English subtitles, the fact that each subtitle is translated independently is another huge problem. It leads to nonsense when one sentence is split across multiple subtitles, because the English and Chinese do not naturally present the same information in the same order.
With tools like alass[1] (using it to synchronise against the original-language subtitles), it is about as close to solved as you can get.
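For a feel of what "synchronise against the original-language subtitles" means, here is a minimal sketch that finds a single constant time shift by maximising how much the two tracks overlap. This is not alass's algorithm, just a brute-force illustration; the function names, the search window, and the step size are all assumptions.

```python
def overlap(a, b):
    """Total seconds during which cues from both tracks are on screen."""
    return sum(
        max(0.0, min(e1, e2) - max(s1, s2))
        for s1, e1 in a
        for s2, e2 in b
    )


def best_offset(reference, target, max_shift=10.0, step=0.1):
    """Try constant shifts in [-max_shift, max_shift]; keep the best one.

    reference and target are lists of (start, end) times in seconds.
    Returns the shift to add to every target cue.
    """
    steps = int(round(max_shift / step))
    best, best_score = 0.0, -1.0
    for k in range(-steps, steps + 1):
        shift = k * step
        shifted = [(s + shift, e + shift) for s, e in target]
        score = overlap(reference, shifted)
        if score > best_score:
            best, best_score = shift, score
    return best
```

Note that this only compares cue timings and never looks at the text, which is why a well-timed reference track in another language can work as the anchor.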
All of the attempts I've seen at using audio information to synchronise subtitles have been awful. One issue is that some languages (such as Japanese) subtitle everything, even screams and incoherent shouts, while others only subtitle dialogue and often rework it to keep the subtitles short enough to read easily. It feels like you need too much domain knowledge: you have to know how different languages subtitle things, and that subtitles which only match the general meaning of what is being said should still be matched up.
Every movie, every show ever produced? Subtitles are required even for domestic markets for the hearing impaired, even if we disregard the audience that prefers to have subtitles.
Why can’t someone just loosely transcribe without time stamps and sync it to the video?