I'm guessing the reason is that, when predicting video frames, hallucinating detail is undesirable, so you would rather remove detail than add detail that isn't there. AVIF also seems to have some kind of deblocking filter, which JXL lacks, to my surprise.
AVIF's deblocking filter works along one axis at a time, whereas JPEG XL applies an axis-non-separable filter, doing its selection in 2D at once. It is not clear that AVIF can be parameterised to filter the way JPEG XL does; at least it hasn't been done yet.
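
To make the "one axis at a time" versus "2D at once" distinction concrete, here is a toy sketch in Python. It is not either codec's actual filter (real in-loop filters are adaptive, not fixed linear kernels); the function names and kernels are made up for illustration. It only shows that two 1D passes can reproduce exactly those 2D kernels that are an outer product of 1D kernels, and that a kernel such as a "plus" shape cannot be factored that way.

```python
def filter_rows(img, k):
    """Apply a 1D kernel k (odd length) along each row, clamping at the edges."""
    r = len(k) // 2
    w = len(img[0])
    return [
        [sum(k[i + r] * row[min(max(x + i, 0), w - 1)] for i in range(-r, r + 1))
         for x in range(w)]
        for row in img
    ]


def transpose(img):
    return [list(col) for col in zip(*img)]


def filter_separable(img, k):
    """One axis at a time: filter the rows, then filter the columns."""
    rows_done = filter_rows(img, k)
    return transpose(filter_rows(transpose(rows_done), k))


def filter_2d(img, kernel):
    """Apply a full 2D kernel (odd size, square) in one pass, clamping at the edges."""
    r = len(kernel) // 2
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                yy = min(max(y + dy, 0), h - 1)
                for dx in range(-r, r + 1):
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + r][dx + r] * img[yy][xx]
            row.append(acc)
        out.append(row)
    return out


def outer(k):
    """2D kernel equivalent to applying 1D kernel k along both axes."""
    return [[a * b for b in k] for a in k]


def is_rank_one(kernel, eps=1e-12):
    """A 2D kernel factors into two 1D passes only if every 2x2 minor is zero."""
    n = len(kernel)
    for i in range(n):
        for p in range(i + 1, n):
            for j in range(n):
                for q in range(j + 1, n):
                    minor = kernel[i][j] * kernel[p][q] - kernel[i][q] * kernel[p][j]
                    if abs(minor) > eps:
                        return False
    return True


if __name__ == "__main__":
    k = [0.25, 0.5, 0.25]
    img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    # Two 1D passes match the single 2D pass with the outer-product kernel.
    print(filter_separable(img, k) == filter_2d(img, outer(k)))
    # A plus-shaped kernel has no such factorisation.
    plus = [[0, 0.125, 0], [0.125, 0.5, 0.125], [0, 0.125, 0]]
    print(is_rank_one(outer(k)), is_rank_one(plus))
```

This is only the linear case, but it hints at why a filter that makes its decision over a 2D neighbourhood at once is not automatically reachable by tuning a per-axis one.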