You can do that today; the only real advantage is being able to fit the model in memory. It's sequential and slower due to communication costs, though batching might make it faster?
You can't get flops on a Hailo-8; they're fixed-point only. As much as these specialised inference chips are cool, we're a long way from being able to just drop them in where a GPU was. Not to mention the memory is hugely constrained: the Hailo chips I've worked with were all limited to 20MiB for the weights, which is a squeeze even at 4-bit.
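For a sense of scale, a rough back-of-the-envelope (not a Hailo spec, just the 20MiB figure above, ignoring quantization scales/zero-points and any per-layer overhead):

    # How many weights fit in a 20 MiB budget at various bit widths?
    BUDGET_BYTES = 20 * 1024 * 1024  # 20 MiB

    for bits in (16, 8, 4):
        params = BUDGET_BYTES * 8 // bits
        print(f"{bits}-bit: ~{params / 1e6:.0f}M parameters")

    # 16-bit: ~10M, 8-bit: ~21M, 4-bit: ~42M parameters -- small even
    # for vision models, and nowhere near current LLM sizes.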
> We therefore leave the attention layers untouched
Meaning, presumably, that GPU memory remains the bottleneck.
Flops really are quite cheap by now; e.g. vision inference chips come in at roughly $2 per teraflop/s!
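To put that in context, a quick sanity check (the throughput number is hypothetical; only the $2/TFLOP/s rate is the figure quoted above):

    # Implied chip price at ~$2 per TFLOP/s of throughput.
    price_per_tflops = 2.0   # $ per TFLOP/s, figure quoted above
    chip_throughput = 25.0   # TFLOP/s, hypothetical example chip
    print(f"Implied chip price: ~${price_per_tflops * chip_throughput:.0f}")
    # ~$50 for the compute -- but see the fixed-point and memory caveats above.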