Thumb provides a lot of 2-byte instructions (like the RISC-V C extension), so it makes sense that it significantly reduces code size. ~20-30% is about the savings you would expect.
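To see where a number like that could come from, here's a back-of-envelope sketch; the 50/50 instruction mix is an assumption purely for illustration, not measured data:

```c
#include <stdio.h>

/* Back-of-envelope: if roughly half of a program's instructions can use a
 * 2-byte encoding instead of a 4-byte one, the average drops from 4 to
 * 3 bytes per instruction, i.e. ~25% smaller code -- right in the quoted
 * ~20-30% range. The 50% share is an assumed figure for illustration. */
int main(void)
{
    double frac_short = 0.5;  /* assumed share of instructions with a 2-byte encoding */
    double avg_bytes  = frac_short * 2.0 + (1.0 - frac_short) * 4.0;

    printf("average bytes per instruction: %.2f\n", avg_bytes);
    printf("code size reduction vs. all-4-byte: %.0f%%\n",
           (1.0 - avg_bytes / 4.0) * 100.0);
    return 0;
}
```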
x86 has the huge disadvantage that it can't use 2-byte encodings for common instructions, since a lot of those encodings are taken up by 16-bit and 8-bit instructions (which are almost completely unused today).
SH4 is relatively compact (16-bit instructions) but it isn't really compressed in the same way Thumb is. They shaved some corners to make things fit; e.g. there's only room for a 4-bit displacement in a load [R + disp] -> R type instruction, and if you want to use an 8-bit displacement you have to go through R0. This naturally leads to a lot of two-instruction sequences to calculate an offset into the stack, etc., which in practice ends up closer to classic MIPS in density.
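A rough C sketch of the kind of access that runs into this limit; the struct, function, and commented SH-4 sequence are hand-written illustrations, not compiler output:

```c
/* SH-4's MOV.L @(disp,Rm),Rn only has a 4-bit displacement (scaled by 4,
 * so 0..60 bytes). Anything further away typically goes through R0 and
 * the indexed form MOV.L @(R0,Rm),Rn -- two instructions instead of one. */
struct frame {
    int  near;       /* offset 0: reachable with the 4-bit displacement */
    char pad[60];
    int  far;        /* offset 64: out of range of the short form       */
};

int load_far(struct frame *f)
{
    /* Plausible SH-4 sequence (illustrative):
     *     mov    #64, r0          ; materialize the offset
     *     mov.l  @(r0, r4), r0    ; indexed load, R0 required
     * versus a single mov.l @(disp, r4), rN for f->near.               */
    return f->far;
}
```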
Code density is not only about instruction length; it's also about instruction count. With shorter instructions you usually need more of them to do the same thing (e.g. extra MOVs, stack spills, etc.).
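A toy C example of where the extra MOV comes from; the commented instruction sequences are hand-written for illustration, not compiler output:

```c
/* With a 3-operand 32-bit encoding, c = a | b can be a single
 *     ORR r3, r1, r2
 * With a 2-operand 16-bit encoding the destination is also a source, so
 * when a and b both stay live the compiler usually has to copy first:
 *     MOV r3, r1
 *     ORR r3, r2
 * Total bytes are similar (4 vs. 4 here), but the instruction count goes
 * up, which is exactly the trade-off described above. */
unsigned combine(unsigned a, unsigned b)
{
    unsigned c = a | b;      /* wants its own register...           */
    return c ^ (a - b);      /* ...because a and b are still needed */
}
```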
The article (yours?) mentions two current concerns with code density: "cache hit ratio" and "instruction fetch bandwidth."
Does the footprint of Thumb (total bytes, opcode count, speed, and likely other things I have not thought of) impact the conclusions in the paper?
Interesting that Thumb wasn't carried over to AArch64, and that Intel never added anything like it.
My experience with Thumb was that it makes code slower, probably because the CPU had to execute more instructions than in "ARM" mode.
Thumb was a thing for embedded systems with very limited RAM etc. It was not designed for optimal speed.
My guess is that ARM64 targeted the higher end, and that severe memory constraints were no longer considered a big issue there. ARM has its Cortex-M4 (Thumb-only microcontrollers) and the like for those markets.