Real-time has a few slightly different meanings. So it's hard to say what the au...

aea12 · on May 11, 2020

Implementing a VST plugin is literally the exact definition of requiring strict latency guarantees. Your comment winds through a lot of unrelated comparisons to ultimately not make any sense.

“Usually fast enough” are three words that guarantee failure in a live show/MIDI environment, which is a large use case of VST and its peers beyond production. By extension, “usually fast enough” further guarantees nobody will ever use your software. That’s noticeable right away.

The question isn’t about compsci real-time theorycrafting, it’s “here’s a buffer of samples, if you don’t give it back in a dozen milliseconds the entire show collapses.” That’s pretty clearly meant by “real time“ contextually.

boomlinde · on May 11, 2020

"Usually fast enough" is unfortunately the only guarantee a preemptive multitasking OS can give you. Unless your system is guaranteeing your program x cycles of uninterrupted processing per frame of audio and you can consistently process the frame in that amount of cycles, the only mitigation is to deliver frames in large enough chunks that you never run out of time in practice under agreeable circumstances.

That said, I agree that the question of what "real-time" might mean is irrelevant given the context.

aea12 · on May 11, 2020

It is completely irrelevant, given the context. The only, only, only thing real-time means here is “can be run on a live signal passing through it” rather than “is a slow, offline effect for a DAW”. No hard real-time, no soft real-time, no QNX, no pulling out the college compsci textbook. There IS real-time in that sense in DSP, it just isn’t in a VST plugin.

I’ll repeat again that any compsci theorycrafting is not the concern here, and real-time has a very specific meaning in DSP. Computer science does not own the concept of real-time, and the only people tripping over the terminology are those with more compsci experience than DSP. I appreciate everyone trying to explain this to me, but (a) I understand both, and (b) this is like saying “no, Captain, a vector could mean anything like a mathematical collection, air traffic control should learn a thing or two from mathematics.”

boomlinde · on May 11, 2020

Just to be perfectly clear here because I'm not sure you're just using my post as a soapbox or if you have misunderstood my argument: I agree that it's clear what real-time means in this context. I disagree that "usually fast enough" guarantees failure for a VST, because in the case of VST, "usually fast enough" is the only guarantee the host operating system will offer your software.

It's not "theorycrafting" to say that real-time music software running in a preemptive multitasking operating system without deterministic process time allocation will have to suffer the possibility of occasional drops. It happens in practice and audio drivers have to be implemented to account for the bulk of it, and the VST API is designed in such a way that failure to fill a buffer on time needn't be fatal.

TheOtherHobbes · on May 11, 2020

It usually doesn't happen in practice unless you're doing a lot of other things at the same time. Which you shouldn't be.

Of course audio is block buffered over (mostly) USB, and as long as the buffers are being filled more quickly than they're being played out, the odd ms glitch here and there is irrelevant.

As real-time systems Windows, MacOS and Linux are terrible from a theoretical POV, and they're useless for the kinds of process control applications where even a ms of lag can destroy your control model.

But with adequate buffering and conservative loading they work well enough to handle decent amounts of audio synthesis processing without glitching - live, on stage.

boomlinde · on May 11, 2020

> It usually doesn't happen in practice unless you're doing a lot of other things at the same time. Which you shouldn't be.

> Of course audio is block buffered over (mostly) USB, and as long as the buffers are being filled more quickly than they're being played out, the odd ms glitch here and there is irrelevant.

As I've noted earlier in the thread. In fact, that the only thing you can offer under such circumstances is that "it usually doesn't happen" because "it's usually fast enough" is my entire point.

> As real-time systems Windows, MacOS and Linux are terrible from a theoretical POV, and they're useless for the kinds of process control applications where even a ms of lag can destroy your control model.

You could employ the same strategies to process control problems where latency is not a problem so much as jitter. You don't, because unlike a music performance an occasional once-in-a-week buffer underflow caused by a system that runs tens to hundreds of processes already at boot can actually make lasting damage there.

microcolonel · on May 11, 2020

Not to mention if the inference is done on the CPU, it shouldn't be that hard to control it. The matrices are of a set size by the time you're running a VST; this is the actual simple answer.

The medium answer is "this is a wavenet model, so inference is probably really expensive unless the continuous output is a huge improvement to performance".

mochomocha · on May 11, 2020

Indeed. Having myself spent some time in the "VST lifestyle business" when I was in grad school (was selling a guitar emulation based on physical modelling synthesis), and now working in ML, I think there's no chance for such an approach to hit "mainstream" anytime soon. Even if you do your inference on CPU, most deep learning libraries are designed for throughput, not latency. In a VST plugin environment, you're also only one of the many components requiring computation, so your computational requirements better be low.

jerf · on May 11, 2020

You might be able to combine it with the recent work on minimizing models to obtain something that is small enough to run reliably in real time.

Although the unusual structure of the net here may mean you're doing original and possibly publication-level work to adapt that stuff to this net structure.

If you were really interested in this, there could also be some profit in minimizing the model and then figuring out how to replicate it in a non-neural net way. Direct study of the resulting net may be profitable.

(I'm not in the ML field. I haven't seen anyone report this but I may just not be seeing it. But I'd be intrigued to see the result of running the size reduction on the net, running training on that network, then seeing if maybe you can reduce the resulting network again, then training that, and iterating until you either stop getting reduced sizes or the quality degrades too far. I've also wondered if there is something you could do to a net to encourage it not to have redundancies in it... although in this case the structure itself may do that job.)

microcolonel · on May 11, 2020

I wonder if teddykoker has looked at applying FFTNet or similar methods as a replacement for Wavenet. I'm not sure but it seems to me like FFTNet is a lot more tractable than Wavenet, and not necessarily that much worse for equivalent training data.

swebs · on May 11, 2020

No, the other guy is right. Technically the definition of real-time can have a lot of leeway. Here's the paper linked in the article. Note how the authors never define what they really mean by real-time. They even make statements like "runs 1.9 times faster than real-time". They certainly imply your definition, but there's plenty of wiggle room to say "Well technically, I wasn't lying"

https://www.mdpi.com/2076-3417/10/3/766/pdf

zodiac · on May 11, 2020

In the context of e.g. offline video encoders "1.9x realtime" is a statement about throughput, not latency

eru · on May 11, 2020

The show won't collapse, if you have one glitch an hour.

munificent · on May 11, 2020

If you drop an audio buffer and fire off a 22kHz impulse into a 50,000 watt soundsystem, you are going to have thousands of very unhappy people and likely some hearing damage.

melq · on May 11, 2020

Point taken, but 22khz is too high for people to hear I think.

munificent · on May 11, 2020

You don't need to perceive a sound to have your ears be damaged by it.

(This goes in both directions on the spectrum too. You can have your hearing damaged by infrasound as well.)

melq · on May 11, 2020

Ah, this makes a lot of sense, thank you. Much like there are spectrums of light we can't see that can damage the eyes.

ssalazar · on May 12, 2020

Idk exactly what the poster meant but an impulse is broadband, theoretically encompassing all frequencies representable in the host sampling rate.

aea12 · on May 11, 2020

Yes, it absolutely 100% will, depending on what you mean by handwaving “glitch”. VST is built into chains, and a flaky plugin will derail an entire performance, often making downstream plugins crash. I’m speaking from extensive experience writing plugins and performing with them in multiple hosts and trigger setups. It’s not a robust protocol, but it gets the job done.

Are you speaking from some experience with which I’m unfamiliar where it’s okay for DSP code to fail hourly? Trying to understand your viewpoint.

nseggs · on May 11, 2020

Agreed. If anyone wants to see some of the more successful DSP work being done today for pro or prosumer audio, I recommend checking out Strymon and Universal Audio products. Both make use of SHARC SoCs and achieve great results.

remcob · on May 11, 2020

Are there any VST containers? Something that will wrap the VST, intercept under-runs or other bad behaviour and substitute some alternative signal (zero, passthrough, etc.). This could also be part of the host software.

The article and your comments inspired in me the idea of a wave-net based VST learning wrapper. If the real plugin fails, substitute a wave-net based simulation of the plugin.

boomlinde · on May 11, 2020

Underruns are not bad behavior. It's the host application's responsibility to hand VSTs buffers to process, and the VSTs themselves have no concept of how much processing time is available to them (except a method that signals to distinguish real-time processing from offline processing) or what it means to underrun the buffer.

The behavior you describe (zero signal on underruns) is a common mitigation. The DAW or the driver itself initializes that'll eventually be handed to the sound card to zero before the host application requests the plugins to process, and if it doesn't have time to mix the plugin outputs it'll play back the initialized buffer instead.

From aea12 one might think that it's normal for an underrun to be fatal. Because underruns are not an exceptional occurrence during production (where you might occasionally load one plugin too many or run a different application with unpredictable load characteristics like a web browser) it really isn't an unexplored area and although they're are a pretty jarring degradation I've never experienced crashes that directly correlated with underruns.

gregsadetsky · on May 11, 2020

Hey @aea12, would you be available for a quick chat? My email is in my profile. Thank you!

boomlinde · on May 11, 2020

A VST that doesn't fill its buffer on time shouldn't crash another plugin. It's your other plugins that are flaky.

boomlinde · on May 11, 2020

To expand on this, each plugin will receive host managed buffers that they're requested to fill and the input they're expected to process. If they don't do that in time for the host to mix the buffers and deliver the mixed buffer to the audio driver, it simply won't. Nowhere do the plugins directly interact through this process.

If your plugins are crashing because of an underrun you have a much more serious problem than underruns. Then you have plugins writing to or reading from memory that wasn't either handed to them by the host or allocated by themselves. That bad code running in your process can cause it to crash is an orthogonal problem to buffer underruns causing skips or stuttering in audio.