I wonder whether a modern byte-sized instruction encoding would look something like UTF-8, where every byte is self-synchronizing... I guess the guarantee could be even weaker than that: probably only every second or fourth byte needs to be a possible synchronization point.
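For anyone who hasn't looked at how UTF-8 pulls this off, here is a minimal Python sketch of the resynchronization property. The helper names `is_continuation` and `resync` are made up for illustration. The only fact it relies on is that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so a decoder dropped at an arbitrary byte offset can find the next boundary without any earlier context.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes always look like 10xxxxxx.
    return (byte & 0xC0) == 0x80


def resync(buf: bytes, pos: int) -> int:
    """Return the index of the first code point boundary at or after pos."""
    while pos < len(buf) and is_continuation(buf[pos]):
        pos += 1
    return pos


# 1-byte, 2-byte, 3-byte and 1-byte sequences back to back.
data = "a\u00e9\u20acb".encode("utf-8")

# Landing in the middle of the 3-byte sequence, skip ahead to the next start.
start = resync(data, 4)
tail = data[start:].decode("utf-8")
```

An instruction encoding with the same property would let several decoders start at arbitrary offsets in a fetch block and each find a valid instruction start locally, at the cost of spending a bit or two per byte on the marker.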
Honestly, I think modern (meaning wide, with multiple instruction decoders, and designed today without backward-compatibility concerns) and byte-sized are sort of mutually exclusive. Most byte-granular ISAs were designed around 8-bit data buses, where having simple ops consume only a single memory read cycle was paramount to competitive performance. Without that constraint, there are probably better options.
IMO, you would go either towards bit-aligned instructions like the iAPX 432 or the Mill, or towards 16-bit-aligned variable-width instructions like the S/360 and m68k on the CISC side, and ARM Thumb and RV-C on the RISC side.
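To make the 16-bit-aligned option concrete, here is a rough Python sketch of how instruction length is determined in RISC-V with the C extension. The function names are mine, and the sketch only handles the standard 16-bit and 32-bit cases: if the two low bits of the first 16-bit parcel are not 11, the instruction is 16 bits; otherwise, if bits [4:2] are not 111, it is 32 bits. Longer encodings are deliberately left out.

```python
def rv_insn_length(first_parcel: int) -> int:
    """Length in bytes, judged from the first (lowest-addressed) 16-bit parcel."""
    if (first_parcel & 0b11) != 0b11:
        return 2  # compressed (RV-C) instruction
    if (first_parcel & 0b11100) != 0b11100:
        return 4  # standard 32-bit instruction
    # Assumption of this sketch: encodings longer than 32 bits are not handled.
    raise ValueError("longer-than-32-bit encoding not handled in this sketch")


def split_stream(code: bytes) -> list[bytes]:
    """Split a little-endian instruction stream into individual instructions."""
    out = []
    pos = 0
    while pos < len(code):
        parcel = int.from_bytes(code[pos:pos + 2], "little")
        n = rv_insn_length(parcel)
        out.append(code[pos:pos + n])
        pos += n
    return out


# c.nop (0x0001), nop = addi x0, x0, 0 (0x00000013), c.nop again.
stream = bytes.fromhex("0100" "13000000" "0100")
lengths = [len(insn) for insn in split_stream(stream)]
```

Note that `split_stream` is inherently serial: where instruction k+1 starts depends on the length of instruction k. A wide decoder has to either speculate on every 16-bit boundary or predecode lengths, which is exactly the cost that variable-width encodings trade against code density.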
That being said, you're definitely thinking about it the right way. Modern ISAs that care about I-stream bandwidth absolutely (and perhaps unsurprisingly) approach the problem from a constrained, poor man's Huffman encoding perspective, similar to how UTF-8 was conceived.
Interestingly, Thumb-2 was dropped when going from 32-bit to 64-bit Arm. Perhaps the encoding was getting really complicated and would have been even harder with 32 registers, while no longer saving much memory (if many instructions need 4 bytes anyway).
Maybe one could come up with an instruction encoding that packs some number of instructions per cache line. Every time the CPU jumps to a new instruction (at cache line address + index), the whole cache line needs to be loaded into the icache anyway, and could be decoded at that point, since internally the instructions end up represented as micro-ops anyway.
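A toy Python sketch of what that could look like, purely to make the idea concrete. Everything here is hypothetical: the 64-byte line size, the header layout (one count byte followed by one length byte per instruction), and the names `pack_line` and `decode_line`. It is not how any real ISA lays out its lines.

```python
LINE_BYTES = 64  # assumed cache line size


def pack_line(insns: list[bytes]) -> bytes:
    """Hypothetical layout: [count][len_0 .. len_n-1][payloads...][zero padding]."""
    header = bytes([len(insns)]) + bytes(len(i) for i in insns)
    line = header + b"".join(insns)
    if len(line) > LINE_BYTES:
        raise ValueError("instructions do not fit in one line")
    return line.ljust(LINE_BYTES, b"\x00")


def decode_line(line: bytes) -> list[bytes]:
    """Decode the whole line once, on fill; the result is indexable by slot."""
    count = line[0]
    lengths = line[1:1 + count]
    pos = 1 + count
    out = []
    for n in lengths:
        out.append(line[pos:pos + n])
        pos += n
    return out


insns = [b"\x01", b"\x02\x03", b"\x04\x05\x06\x07"]
line = pack_line(insns)
decoded = decode_line(line)
# A jump target is then (line address, index), e.g. slot 2 of this line.
target = decoded[2]
```

With all lengths sitting in the header, every instruction start in the line is known up front, so the serial length-chasing problem goes away. The obvious costs are the header bytes, padding at the end of each line, and instructions not being able to straddle lines.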