I feel like it should be easier than with images, since we have all the 3D information. There are features based on 3D structure that image segmenters can't even begin to use.
It's one of those things humans can do efficiently, so with enough priors about what scenes look like, a sufficiently informed ML system should be able to get decent accuracy.
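Concretely, here's the kind of 3D-structure feature I mean. This is just my own toy sketch (nobody's actual pipeline): estimating per-point surface normals and a curvature score from a raw point cloud via PCA over local neighborhoods, which an image segmenter has no access to.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_geometry(points, k=16):
    """Return (normals, curvature) for an (N, 3) point cloud."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)        # k nearest neighbours per point
    normals = np.empty_like(points)
    curvature = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        patch = points[nbrs] - points[nbrs].mean(axis=0)
        # The eigenvector of the smallest eigenvalue of the local covariance
        # approximates the surface normal; the smallest-eigenvalue ratio
        # measures how non-planar (curved) the neighbourhood is.
        cov = patch.T @ patch
        w, v = np.linalg.eigh(cov)          # eigenvalues in ascending order
        normals[i] = v[:, 0]
        curvature[i] = w[0] / w.sum()
    return normals, curvature

# Sanity check: points sampled on a flat plane should get near-zero
# curvature and normals aligned with the z-axis.
rng = np.random.default_rng(0)
plane = np.column_stack([rng.uniform(-1, 1, 200),
                         rng.uniform(-1, 1, 200),
                         np.zeros(200)])
n, c = local_geometry(plane)
```

Cluster on normals and curvature alongside position and you already get a crude planar-surface segmentation for free.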
lol, hey, it's hard enough to do with static images. Feature matching point clouds is probably Turing-complete :P
That's the kind of shit we're working on though. We're trying to turn the real world into a platform.