I guess I must be completely miscalibrated w.r.t. the performance of newer technologies, because I'd have imagined it's the opposite. In particular, I'd be surprised to see a Python+Node loop passing large amounts of data around like that run at 30+ FPS, unless everything Python-side is carefully written to keep the work on the C side. At the same time, I'd assume the inference/ML part is the fastest one, because, as far as I understand how NNs work, they're supposed to be blazingly fast once trained (it's just lots of parallelizable linear algebra). Is the inference part in your solution doing anything more complicated than that in real time?
A modern laptop will run BodyPix at about 30 fps. There could be additional bottlenecks, but deep (and wide) NNs are usually not super fast; they're just fast relative to the wondrous things they do.
You can usually trade off performance (with BodyPix that's an accuracy/speed tradeoff), or do something sillier like downscaling the frame, running the model, and upscaling the mask. I'd like to try this.
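The downscale/run/upscale idea can be sketched without any OpenCV dependency, using nearest-neighbour index arrays in NumPy. `segment_fn` here is a hypothetical stand-in for whatever actually produces the person mask (BodyPix behind an RPC, or any other segmenter); the point is only the resolution trick around it.

```python
import numpy as np

def run_mask_at_scale(frame, segment_fn, scale=0.5):
    """Run the (expensive) segmentation on a downscaled frame,
    then blow the boolean mask back up to full resolution."""
    h, w = frame.shape[:2]
    small_h = max(1, int(h * scale))
    small_w = max(1, int(w * scale))
    # Nearest-neighbour downscale via integer index arrays.
    ys = np.arange(small_h) * h // small_h
    xs = np.arange(small_w) * w // small_w
    small = frame[ys][:, xs]
    # segment_fn returns a boolean mask of shape (small_h, small_w).
    mask = segment_fn(small)
    # Nearest-neighbour upscale of the mask back to (h, w).
    ys_up = np.arange(h) * small_h // h
    xs_up = np.arange(w) * small_w // w
    return mask[ys_up][:, xs_up]
```

A blocky mask upscaled this way is usually fine in practice, since the mask edge is typically feathered/blurred before compositing anyway.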
BodyPix does downsample before masking out of the box; the article uses the 'medium' (50%) internal resolution (though for this script we ought to move that work over to the Python side). Even so, it's not 30 fps without egregiously sacrificing quality, at least on my (fairly powerful) machine, unless I've missed something.
Amusingly, I did some hacking on this, and the current bottleneck is actually reading from the webcam, which is capped below 10 fps before doing anything else. Switching the capture to MJPG helps.