Sometimes the distinction is made between "real-time" and "online" processing.
The first refers to processing speed relative to the length of the recording: if you can process a 10-minute recording in 1 minute, you're running at 10x real-time. However, your analysis might require the full track to be available for best results, so you cannot really start processing until the full source is available.
The latter is what "online" processing refers to: the ability to process on the fly, in parallel with the recording. Obviously, this cannot be faster than real-time ;-) but hopefully it is not slower, either. Often, though, you get a (somewhat constant and) hopefully small lag, i.e., you can process a 10-minute recording online in the same time, but you need another 10 seconds on top of that.
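To make the two notions concrete, here is a minimal sketch of the arithmetic above, using the illustrative numbers from the example (10-minute recording, 1-minute offline processing, 10-second online lag); the function name is just for illustration:

```python
def real_time_factor(recording_s: float, processing_s: float) -> float:
    """How many times faster than real time the processing runs."""
    return recording_s / processing_s

# Offline: a 10-minute recording processed in 1 minute -> 10x real-time.
rtf = real_time_factor(600, 60)  # 10.0

# Online: processing keeps pace with the recording, but the last output
# arrives 10 seconds after the recording ends.
lag_s = 610 - 600  # 10 seconds of trailing lag
```

So "10x real-time" is a statement about throughput, while the 10-second figure is a statement about latency; the two are independent.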
This is, by the way, not restricted to source separation; it applies to other disciplines as well, e.g., automatic speech recognition.
I experimented with the spleeter architecture quite a bit, and I would say it is not suitable for real-time audio processing. The reason is that the model needs at least 512 frames of audio to produce output usable for source separation, which adds a ton of latency. I tried smaller windows, but the results are very bad.
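A quick back-of-the-envelope calculation shows why a 512-frame minimum hurts. The hop size and sample rate below are assumptions (common STFT settings, not taken from the spleeter config, so check the actual model parameters):

```python
HOP_SIZE = 1024      # samples advanced per STFT frame (assumed)
SAMPLE_RATE = 44100  # Hz (assumed)
MIN_FRAMES = 512     # minimum frames the model needs (per the comment above)

# Audio that must be buffered before the model can produce any output:
buffer_samples = MIN_FRAMES * HOP_SIZE
latency_s = buffer_samples / SAMPLE_RATE
print(f"{latency_s:.1f} s of audio must be buffered")  # ~11.9 s
```

Under these assumptions you'd buffer on the order of ten seconds of audio before the first separated output appears, which rules out interactive use regardless of how fast the model itself runs.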
It's way faster than real-time; I'm not sure why slowing it down would be an advantage. You still need to take the resulting data and do things with it as a DJ, and faster is better.
So can it be run in real-time?
I am thinking about extracting features for music visualization, but it could also make a DJ happy.