Not to mention that if inference is done on the CPU, the cost shouldn't be that hard to control: the matrices are a fixed size by the time you're running the VST. That's the actual simple answer.
The medium answer is "this is a wavenet model, so inference is probably really expensive unless the continuous output is a huge improvement to performance".
Indeed. Having spent some time myself in the "VST lifestyle business" back in grad school (I was selling a guitar emulation based on physical-modelling synthesis), and now working in ML, I think there's no chance of such an approach hitting the mainstream anytime soon. Even if you do your inference on CPU, most deep learning libraries are designed for throughput, not latency. In a VST plugin environment you're also only one of many components competing for the CPU, so your computational requirements had better be low.
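To make the latency point concrete, here's a back-of-the-envelope budget calculation. The sample rate and buffer size are assumptions (typical values for low-latency guitar use), not numbers from the plugin in question:

```python
# Time budget for one audio callback at an assumed 44.1 kHz sample rate
# with an assumed 64-sample buffer.
sample_rate = 44_100   # Hz (assumed)
buffer_size = 64       # samples per callback (assumed)

budget_ms = buffer_size / sample_rate * 1000
print(f"time budget per callback: {budget_ms:.2f} ms")

# If the host is running, say, 10 plugins in the same callback,
# your fair share shrinks accordingly:
share_ms = budget_ms / 10
print(f"per-plugin share: {share_ms:.3f} ms")
```

About 1.45 ms per callback in total, and well under a millisecond once you share it with other plugins, which is why a throughput-oriented inference stack struggles here.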
You might be able to combine it with the recent work on minimizing models to obtain something that is small enough to run reliably in real time.
Although the unusual structure of the net here may mean you'd be doing original, possibly publication-level work to adapt those techniques to it.
If you were really interested in this, there could also be some profit in minimizing the model and then figuring out how to replicate it in a non-neural-net way. Direct study of the resulting small net may be worthwhile in itself.
(I'm not in the ML field. I haven't seen anyone report this but I may just not be seeing it. But I'd be intrigued to see the result of running the size reduction on the net, running training on that network, then seeing if maybe you can reduce the resulting network again, then training that, and iterating until you either stop getting reduced sizes or the quality degrades too far. I've also wondered if there is something you could do to a net to encourage it not to have redundancies in it... although in this case the structure itself may do that job.)
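The prune-retrain-prune loop described above can be sketched on a toy model. This is a hypothetical illustration on a tiny linear regression, not the wavenet in question; every number here (sparsity per round, learning rate, problem size) is made up:

```python
import numpy as np

# Toy setup: 32 weights, but only 8 of them actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
true_w = np.zeros(32)
true_w[:8] = rng.normal(size=8)
y = X @ true_w

w = rng.normal(size=32) * 0.1
mask = np.ones(32, dtype=bool)   # which weights are still "alive"

def train(w, mask, steps=500, lr=0.01):
    """Plain gradient descent on MSE, updating only surviving weights."""
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(X)
        w -= lr * grad * mask
    return w

for round_ in range(4):
    w = train(w, mask)
    # Prune: permanently zero the 25% smallest-magnitude surviving weights.
    alive = np.flatnonzero(mask)
    k = max(1, len(alive) // 4)
    drop = alive[np.argsort(np.abs(w[alive]))[:k]]
    mask[drop] = False
    loss = np.mean((X @ (w * mask) - y) ** 2)
    print(f"round {round_}: {int(mask.sum())} weights left, loss {loss:.4f}")
```

The loop stops either after a fixed number of rounds (as here) or, as suggested above, when the loss degrades too far; on a real net you'd prune whole channels or layers rather than individual weights.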
I wonder if teddykoker has looked at applying FFTNet or similar methods as a replacement for Wavenet. I'm not sure but it seems to me like FFTNet is a lot more tractable than Wavenet, and not necessarily that much worse for equivalent training data.