Seriously. Permutes get harder as you scale - VBMI on CNL (Cannon Lake) is an indicator that 64-way is pretty good, but it's still considerably more expensive than four 16-way permutes on the same architecture.
There's a reason that gather is hard to do; I think if you rocked up and asked the architecture guys for a gather that was competitive with small-scale permute they would reply with the time-honored Intel putdown ("You are overpaid for whatever it is you do").
Hey, you can have 7-bit lookup tables at the byte level on AVX512VBMI (using the 2-register shuffle forms) and you can already have 6-bit lookups with 2-register 16-bit shuffles if you can play around on Skylake Server.
Mass availability of the VBMI goodies looks to be bottlenecked behind Icelake/Sunny Cove, so you'll have plenty of time to think through the implications of fast 6-bit lookup. :-)
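For anyone who wants to think through the 7-bit case without the hardware, here is a rough scalar sketch in C of what a two-register byte shuffle does when used as a lookup table. It models the index semantics as I understand them for the VBMI two-source byte permute (exposed as the `_mm512_permutex2var_epi8` intrinsic): index bit 6 picks the register, bits 5:0 pick the byte, and the top bit is ignored. The names `permute2_bytes_model` and `lookup7` are made up for illustration, and this says nothing about performance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 64  /* bytes in one 512-bit register */

/* Scalar model (an assumption, not the real instruction) of a two-register
 * byte permute: each output byte is picked from lo or hi by a 7-bit index. */
static void permute2_bytes_model(const uint8_t lo[LANES], const uint8_t hi[LANES],
                                 const uint8_t idx[LANES], uint8_t out[LANES])
{
    assert(lo && hi && idx && out);
    for (size_t i = 0; i < LANES; i++) {
        uint8_t sel = idx[i] & 0x7F;                 /* only 7 index bits are used */
        const uint8_t *src = (sel & 0x40) ? hi : lo; /* bit 6 selects the register */
        out[i] = src[sel & 0x3F];                    /* bits 5:0 select the byte   */
    }
}

/* 7-bit lookup: a 128-entry byte table split across two "registers".
 * n must be a multiple of 64 in this sketch (no tail handling). */
static void lookup7(const uint8_t table[128], const uint8_t *in,
                    uint8_t *out, size_t n)
{
    assert(n % LANES == 0);
    for (size_t off = 0; off < n; off += LANES)
        permute2_bytes_model(table, table + LANES, in + off, out + off);
}
```

The 6-bit variant with 16-bit shuffles on Skylake Server has the same shape: 32 words per register, two registers, so 64 entries, with one index bit choosing the register.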