Pretty cool, though I wonder what the latency of this would be if used as a plugin?
The author says it works in real-time, but to non-music/audio folks this could mean '100 ms latency is real-time enough, right?'
Generally, I think the audio VST business is a really fun space to be in for a lifestyle business, as it is way too small to be attractive for VCs. It seems like a space that provides many niches for lots of small players to thrive in.
As an aside, it's really quite interesting that a lot of cutting-edge tech is now used to emulate the hardware-based tech of yesteryear. Think film filters for Photoshop, and the roughly 90% of audio plugins that emulate high-end hardware: compressors, pedals, etc.
I know of a few shops that took VC money. The big problem isn't the market size so much as how slow the market moves. The product lifetime of a plugin is around a decade. And users hate subscriptions. And it's really hard to determine the value you add to your customers. And no one wants to pay you.
It's basically a terrible place to be a developer in it for the money. Really fun work otherwise. The cool gigs are the ones where you build custom plugins for someone's crazy idea.
In consumer applications, plugins are used all the time for prototyping before you go to hardware. MATLAB is way too slow for anything useful.
The success of Splice would disagree with your notion that “users hate subscriptions”. Given the horrendous price point of many of these plugins, it seems perfect for a subscription-based model. To me it always seemed there is more pushback from the industry producing VSTs than from the consumers.
Splice's numbers aren't public so I can't comment on their success. Avid's are, and they had a terrible quarter - and they're the poster child (alongside Adobe) for subscription licensing in creative software. But I'd be interested to see what the breakdown in revenue is for plugin licenses versus preset/sample packs (bit of a razor-and-blades model there).
The price points really aren't horrendous if you consider how expensive the engineering is, how little demand there is, and how long you need to maintain a product. You aren't being ripped off by spending a couple hundred bucks on a plugin. I think we'll end up at a place where everything is a subscription, but I can tell you from experience that it creates friction for the users.
Agreed. The business model seems to be to give access to the rent-to-own deals via the sample subscription fee. I don’t think they make any money off their plugin deals. I’m also not arguing it’s too expensive or a rip-off. But it’s still a large amount of money for software, in the private space at least. The rent-to-own thing seems like a smart tool to get rid of the barrier to entry.
Do solo or small-shop VST plugin developers make any money?
I’m curious if anyone has any direct knowledge about that.
There are so many professional activities like this where no one makes any money and people really just do it for the love of it, and then there are seemingly similar ones where people make surprisingly large amounts of money.
I'm fairly new to the game, but I'm a solo developer. Currently I don't make enough to quit my day job, but it is a nice supplementary income, and it's nice to get paid a bit for something I truly enjoy.
There are also several solo/small shop developers that do make a living from selling plug-ins. Here are a few that I can think of off the top of my head.
Steve Duda, the developer of Serum is kind of the poster child for this. He contracts out for pieces of the synth (UI design, resampler, filters), but he's mostly a one-man shop and, as I understand it, Serum pays the bills.
It's hard to tell how much Duda is an outlier, though, and how many other people could successfully follow his path.
I was in talks with a (new-style) 'label' that sells samples, sound packs, and VST plugins. Some of their plugins have been purchased 25k times.
One of the things I've also heard from labels is that not only is there money in the VST world (it's also very crowded, piracy is rampant as noted, etc.), but a lot of plugins are also ported over to iOS and sold as "virtual pedals". The number of sales and the revenue there were described as very interesting.
When I had an active band, our guitarist went from bringing his amp to rehearsal, to having a bunch of pedals, to having a digital pedal board, to having an iPhone with some sort of tiny adapter.
I made fun of him and we wouldn't have trusted it to be used live, but damn it worked impressively well
They do. Strezov Sampling is one guy. Serum is one guy. Chris Heinz is one guy, etc. etc.
But you have to be willing to put in the time and make phenomenal products, because no one wants average instruments and effects, we can get those for free.
Quite a few small developers in this space. It's not like indie gaming, but there's also less competition.
I think you need to be a musician/producer to be successful here though.
Steve Duda wrote Serum, probably the most popular synth plugin in modern electronic music. Everyone I know has a license. So "yes", with the caveat that it's difficult to actually create products of this level of quality.
Yeah but facts don't really care about agreement: there are loads of renowned single-person VST shops, and many more "just a handful of folks" ones. Chris Heinz, Steve Duda, Strezov Sampling, Matt Tytel, heck even Plugin Guru, etc. etc. are all renowned folks in the VST/VSTi world, and that doesn't even scratch the surface.
AFAIK, Mike Scuffham (www.scuffhamamps.com) earns a living developing and selling S-Gear. It might be a semi-retirement or lifestyle type living - not sure - but he's been doing it for over a decade now. He doesn't charge as much as he could and gives away free updates for far too long. Despite being a (mostly, at least) solo effort, it's widely regarded as a top-tier amp sim. I personally think it sounds better than both Helix and Bias, which are both heavily bank-rolled outfits.
It doesn't have their breadth, but the tones it does have are nearly as good as it gets without serious air movement.
There's latency, and there's the somewhat separate question of how much time is needed to make a prediction. Wavenet is causal (no look-ahead) and operates at the sample level, so there are no buffers and thus no latency in the strict sense, beyond encoding/decoding into the sample rate and format required by the ML model, which should take <1ms.
Whether a model manages to make a prediction in that amount of time depends on things like the receptive field and number of layers. The linked paper says their custom implementation runs at 1.1x real-time. I guess this isn't impossible; their receptive field is ~40ms, vs. 300ms for the original (notoriously slow) wavenet, and the model likely has fewer layers and channels.
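Back-of-the-envelope, for anyone curious how a receptive field translates to milliseconds (the kernel size and dilation pattern below are my assumptions for illustration, not the paper's actual configuration):

    # Receptive field of a WaveNet-style stack of dilated causal convolutions.
    # Kernel size and dilations are illustrative assumptions, not the paper's.
    SAMPLE_RATE = 44_100  # Hz

    def receptive_field(kernel_size, dilations):
        # receptive field, in samples, of stacked dilated causal convolutions
        return (kernel_size - 1) * sum(dilations) + 1

    dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
    rf = receptive_field(kernel_size=3, dilations=dilations)
    print(f"{rf} samples = {1000 * rf / SAMPLE_RATE:.1f} ms")  # 2047 samples, ~46 ms

And because the model is causal, that receptive field is context rather than look-ahead, so it adds no latency by itself; the question is only whether the per-sample compute fits in the time budget.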
"Round trip," or guitar to processing to speakers needs to be sub 10ms to be transparent to the musician. Source: spent years playing guitar through my guitar -> DAC -> PC -> DAC -> speaker signal chain
That's not what real-time means, though. Real-time processing means taking signals as they come in and outputting the transformed result with as close to no signal lag as possible. The output can in fact be of wildly lower or higher resolution; real-time doesn't particularly say anything about that. It's all about whether the output plays (for practical purposes) at the perceived "same time" as the input signal. There will always be some delay, but that delay can't become perceivable, and for obvious reasons there can't be any (significant) buffering.
Is that your private definition of "real-time"? I think it is common to define real-time processing by a specified, finite time between input and output. Many real-time processes are concerned more with the consistency of the latency than with its absolute value.
Latency is much more noticeable when you’re playing a musical instrument; 25-30ms is the point at which it becomes distracting in my (anecdotal) experience as a keyboardist. 50ms would be literally unplayable: I cannot keep in time if latency is that severe. And that’s total output latency from the moment a key is depressed to the moment the sound comes out the speakers, so it’s important for every component in the signal chain to have the lowest possible latency. A bunch of 5-10ms delays adds up really quickly.
You can learn to play it. Pipe organs routinely have more than 50ms latency just from the distance the sound has to travel from the pipes to the organist. Add the time needed to set up steady oscillations in large pipes, and the slow pneumatic actions found in some organs, and >200ms latency is nothing unusual. The important thing is that the latency is consistent.
I think "rate" in the parent comment was just referring to speed, not sample rate. But yes, latency is critical for anything used during recording or performance. However way back when I used to make my own music I used non-realtime plugins sometimes and it was okay.
Real-time has a few slightly different meanings. So it's hard to say what the author means.
One meaning is just that you can guarantee specific deadlines. So if your programme can react within an hour guaranteed, that would be real-time. (Though usually we are talking about tighter deadlines, like what's needed to make ABS brakes work.)
For 'real time' music usage you wouldn't need strict guarantees, but something that's usually fast enough.
Implementing a VST plugin is literally the exact definition of requiring strict latency guarantees. Your comment winds through a lot of unrelated comparisons to ultimately not make any sense.
“Usually fast enough” are three words that guarantee failure in a live show/MIDI environment, which is a large use case of VST and its peers beyond production. By extension, “usually fast enough” further guarantees nobody will ever use your software. That’s noticeable right away.
The question isn’t about compsci real-time theorycrafting, it’s “here’s a buffer of samples, if you don’t give it back in a dozen milliseconds the entire show collapses.” That’s pretty clearly meant by “real time” contextually.
"Usually fast enough" is unfortunately the only guarantee a preemptive multitasking OS can give you. Unless your system is guaranteeing your program x cycles of uninterrupted processing per frame of audio and you can consistently process the frame in that amount of cycles, the only mitigation is to deliver frames in large enough chunks that you never run out of time in practice under agreeable circumstances.
That said, I agree that the question of what "real-time" might mean is irrelevant given the context.
It is completely irrelevant, given the context. The only, only, only thing real-time means here is “can be run on a live signal passing through it” rather than “is a slow, offline effect for a DAW”. No hard real-time, no soft real-time, no QNX, no pulling out the college compsci textbook. There IS real-time in that sense in DSP, it just isn’t in a VST plugin.
I’ll repeat again that any compsci theorycrafting is not the concern here, and real-time has a very specific meaning in DSP. Computer science does not own the concept of real-time, and the only people tripping over the terminology are those with more compsci experience than DSP. I appreciate everyone trying to explain this to me, but (a) I understand both, and (b) this is like saying “no, Captain, a vector could mean anything like a mathematical collection, air traffic control should learn a thing or two from mathematics.”
Just to be perfectly clear here, because I'm not sure whether you're just using my post as a soapbox or have misunderstood my argument: I agree that it's clear what real-time means in this context. I disagree that "usually fast enough" guarantees failure for a VST, because in the case of VST, "usually fast enough" is the only guarantee the host operating system will offer your software.
It's not "theorycrafting" to say that real-time music software running in a preemptive multitasking operating system without deterministic process time allocation will have to suffer the possibility of occasional drops. It happens in practice and audio drivers have to be implemented to account for the bulk of it, and the VST API is designed in such a way that failure to fill a buffer on time needn't be fatal.
It usually doesn't happen in practice unless you're doing a lot of other things at the same time. Which you shouldn't be.
Of course audio is block buffered over (mostly) USB, and as long as the buffers are being filled more quickly than they're being played out, the odd ms glitch here and there is irrelevant.
As real-time systems Windows, MacOS and Linux are terrible from a theoretical POV, and they're useless for the kinds of process control applications where even a ms of lag can destroy your control model.
But with adequate buffering and conservative loading they work well enough to handle decent amounts of audio synthesis processing without glitching - live, on stage.
> It usually doesn't happen in practice unless you're doing a lot of other things at the same time. Which you shouldn't be.
> Of course audio is block buffered over (mostly) USB, and as long as the buffers are being filled more quickly than they're being played out, the odd ms glitch here and there is irrelevant.
As I've noted earlier in the thread. In fact, my entire point is that the only thing you can offer under such circumstances is "it usually doesn't happen", because "it's usually fast enough".
> As real-time systems Windows, MacOS and Linux are terrible from a theoretical POV, and they're useless for the kinds of process control applications where even a ms of lag can destroy your control model.
You could apply the same strategies to process control problems where latency is not so much the problem as jitter. You don't, because unlike in a music performance, an occasional once-a-week buffer underflow caused by a system that runs tens to hundreds of processes already at boot can actually do lasting damage there.
Not to mention if the inference is done on the CPU, it shouldn't be that hard to control it. The matrices are of a set size by the time you're running a VST; this is the actual simple answer.
The medium answer is "this is a wavenet model, so inference is probably really expensive unless the continuous output is a huge improvement to performance".
Indeed. Having myself spent some time in the "VST lifestyle business" when I was in grad school (was selling a guitar emulation based on physical modelling synthesis), and now working in ML, I think there's no chance for such an approach to hit "mainstream" anytime soon. Even if you do your inference on CPU, most deep learning libraries are designed for throughput, not latency. In a VST plugin environment, you're also only one of the many components requiring computation, so your computational requirements better be low.
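As a rough illustration of the budget involved (a sketch only: the layer count, channel width, block size and single-thread assumption are mine, not the paper's configuration):

    # Can a small dilated conv stack process one audio block within its deadline?
    # Model shape, block size and thread count are illustrative assumptions.
    import time
    import torch

    torch.set_num_threads(1)                     # a plugin shares the CPU with the host
    SAMPLE_RATE = 44_100
    BLOCK = 256                                  # frames per host callback
    DEADLINE_MS = 1000 * BLOCK / SAMPLE_RATE     # ~5.8 ms

    context = 2 * sum(2 ** i for i in range(8))  # samples consumed by the dilations
    model = torch.nn.Sequential(
        *[torch.nn.Conv1d(16, 16, kernel_size=3, dilation=2 ** i) for i in range(8)]
    ).eval()

    x = torch.randn(1, 16, BLOCK + context)      # one block plus its causal context
    with torch.no_grad():
        model(x)                                 # warm-up
        t0 = time.perf_counter()
        for _ in range(100):
            model(x)
        per_block_ms = 1000 * (time.perf_counter() - t0) / 100

    print(f"deadline {DEADLINE_MS:.1f} ms, inference {per_block_ms:.2f} ms per block")

If the per-block figure is anywhere near the deadline you're already in trouble, because the plugin only gets a slice of that window alongside everything else the host is doing.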
You might be able to combine it with the recent work on minimizing models to obtain something that is small enough to run reliably in real time.
Although the unusual structure of the net here may mean you're doing original and possibly publication-level work to adapt that stuff to this net structure.
If you were really interested in this, there could also be some profit in minimizing the model and then figuring out how to replicate it in a non-neural net way. Direct study of the resulting net may be profitable.
(I'm not in the ML field. I haven't seen anyone report this but I may just not be seeing it. But I'd be intrigued to see the result of running the size reduction on the net, running training on that network, then seeing if maybe you can reduce the resulting network again, then training that, and iterating until you either stop getting reduced sizes or the quality degrades too far. I've also wondered if there is something you could do to a net to encourage it not to have redundancies in it... although in this case the structure itself may do that job.)
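From what I can tell, that iterate-and-shrink loop maps fairly directly onto what the pruning literature calls iterative magnitude pruning; a rough sketch of what it could look like with PyTorch's pruning utilities, where train_one_epoch and quality_ok are hypothetical placeholders for the fine-tuning loop and whatever quality check you'd trust:

    # Iterative prune -> retrain loop, stopping when quality degrades.
    # `train_one_epoch` and `quality_ok` are hypothetical placeholders.
    import torch
    import torch.nn.utils.prune as prune

    def iterative_prune(model, train_one_epoch, quality_ok, rounds=5, amount=0.2):
        for _ in range(rounds):
            # zero out the smallest 20% of the remaining weights in each conv layer
            for module in model.modules():
                if isinstance(module, torch.nn.Conv1d):
                    prune.l1_unstructured(module, name="weight", amount=amount)
            train_one_epoch(model)         # retrain with the pruned weights in place
            if not quality_ok(model):      # stop once the audio quality degrades
                break
        # make the pruning permanent (drop the masks, keep the zeroed weights)
        for module in model.modules():
            if isinstance(module, torch.nn.Conv1d):
                prune.remove(module, "weight")
        return model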
I wonder if teddykoker has looked at applying FFTNet or similar methods as a replacement for Wavenet. I'm not sure but it seems to me like FFTNet is a lot more tractable than Wavenet, and not necessarily that much worse for equivalent training data.
No, the other guy is right. Technically the definition of real-time can have a lot of leeway. Here's the paper linked in the article. Note how the authors never define what they really mean by real-time. They even make statements like "runs 1.9 times faster than real-time". They certainly imply your definition, but there's plenty of wiggle room to say "Well technically, I wasn't lying"
If you drop an audio buffer and fire off a 22kHz impulse into a 50,000 watt soundsystem, you are going to have thousands of very unhappy people and likely some hearing damage.
Yes, it absolutely 100% will, depending on what exactly you're handwaving away as a “glitch”. VST is built into chains, and a flaky plugin will derail an entire performance, often making downstream plugins crash. I’m speaking from extensive experience writing plugins and performing with them in multiple hosts and trigger setups. It’s not a robust protocol, but it gets the job done.
Are you speaking from some experience with which I’m unfamiliar where it’s okay for DSP code to fail hourly? Trying to understand your viewpoint.
Agreed. If anyone wants to see some of the more successful DSP work being done today for pro or prosumer audio, I recommend checking out Strymon and Universal Audio products. Both make use of SHARC SoCs and achieve great results.
Are there any VST containers? Something that will wrap the VST, intercept under-runs or other bad behaviour and substitute some alternative signal (zero, passthrough, etc.). This could also be part of the host software.
The article and your comments gave me the idea of a wavenet-based VST learning wrapper: if the real plugin fails, substitute a wavenet-based simulation of the plugin.
Underruns are not bad behavior. It's the host application's responsibility to hand VSTs buffers to process, and the VSTs themselves have no concept of how much processing time is available to them (except a method that signals to distinguish real-time processing from offline processing) or what it means to underrun the buffer.
The behavior you describe (zero signal on underruns) is a common mitigation. The DAW or the driver itself initializes the buffer that'll eventually be handed to the sound card to zero before the host application asks the plugins to process, and if it doesn't have time to mix the plugin outputs it'll play back the initialized buffer instead.
From aea12's comment one might think that it's normal for an underrun to be fatal. Because underruns are not an exceptional occurrence during production (where you might occasionally load one plugin too many or run another application with unpredictable load characteristics like a web browser), it really isn't an unexplored area, and although they're a pretty jarring degradation, I've never experienced crashes that directly correlated with underruns.
To expand on this, each plugin will receive host managed buffers that they're requested to fill and the input they're expected to process. If they don't do that in time for the host to mix the buffers and deliver the mixed buffer to the audio driver, it simply won't. Nowhere do the plugins directly interact through this process.
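In toy form, that host-side contract looks roughly like this (a conceptual sketch only: real hosts do this inside the audio callback in C++, and the plugin.process interface and deadline check here are made up):

    # Conceptual sketch: host mixes plugin outputs into a pre-zeroed buffer and
    # falls back to silence if processing misses the deadline. The `plugin.process`
    # interface is illustrative only.
    import time
    import numpy as np

    def process_block(plugins, input_block, block_size, deadline_s):
        out = np.zeros(block_size, dtype=np.float32)   # pre-initialized to silence
        mix = np.zeros(block_size, dtype=np.float32)
        start = time.perf_counter()
        for plugin in plugins:
            mix += plugin.process(input_block)         # each plugin fills its own buffer
        if time.perf_counter() - start <= deadline_s:
            out[:] = mix                               # made the deadline: ship the mix
        # otherwise the zeroed buffer goes out as-is: an audible dropout, not a crash
        return out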
If your plugins are crashing because of an underrun you have a much more serious problem than underruns. Then you have plugins writing to or reading from memory that wasn't either handed to them by the host or allocated by themselves. That bad code running in your process can cause it to crash is an orthogonal problem to buffer underruns causing skips or stuttering in audio.