Overall I'm really happy to work in a domain where people share their code and models in such an open way.
I take issue with Detectron in particular though, because a company the size of Facebook in 2018 has no excuse to publish a major software package in Python 2.
The oldest models they implement are from 2015 (excluding VGG16, which is so ubiquitous it's available in literally every library for Python 3) and Caffe2 is quite a bit more recent than that. Like I said: no excuse.
The team behind Detectron have published an enormous amount of really good research, but the Detectron codebase struck me as "good research code" rather than something you'd ideally want in production.
Of course, I'm not criticising the fact that they publish those models, nor the models themselves.
But even publishing arguably polished Python 2 code in 2018 is something I take issue with if it's not a legacy code base.
The README on there has a very neat TLDR of the model:
"DeepLabv1 [1]: We use atrous convolution ['a shorthand for convolution with upsampled filters'] to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks.
DeepLabv2 [2]: We use atrous spatial pyramid pooling (ASPP) ['a computationally efficient scheme of resampling a given feature layer at multiple rates prior to convolution'] to robustly segment objects at multiple scales with filters at multiple sampling rates and effective fields-of-views.
DeepLabv3 [3]: We augment the ASPP module with image-level feature [5, 6] to capture longer range information. We also include batch normalization [7] parameters to facilitate the training. In particular, we apply atrous convolution to extract output features at different output strides during training and evaluation, which efficiently enables training BN at output stride = 16 and attains a high performance at output stride = 8 during evaluation.
DeepLabv3+ [4]: We extend DeepLabv3 to include a simple yet effective decoder module to refine the segmentation results especially along object boundaries. Furthermore, in this encoder-decoder structure one can arbitrarily control the resolution of extracted encoder features by atrous convolution to trade-off precision and runtime."
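For anyone who hasn't met the "convolution with upsampled filters" idea before, here is a rough 1-D sketch in plain Python. It is not DeepLab's code; the function names and the toy signal are made up for illustration. It only shows that spacing the filter taps `rate` samples apart gives the same result as convolving with a filter that has zeros inserted between its taps.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D cross-correlation with taps spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # input samples covered by one output
    return [
        sum(w * signal[i + k * rate] for k, w in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


def upsample_kernel(kernel, rate):
    """Insert rate-1 zeros between taps (the 'upsampled filter' view)."""
    out = []
    for k, w in enumerate(kernel):
        if k:
            out.extend([0] * (rate - 1))
        out.append(w)
    return out


# Toy data, purely illustrative.
signal = [1, 2, 4, 7, 11, 16, 22, 29]
kernel = [1, 0, -1]

dilated = atrous_conv1d(signal, kernel, rate=2)
dense = atrous_conv1d(signal, upsample_kernel(kernel, 2), rate=1)
assert dilated == dense  # same numbers, but the dilated form skips the zeros
```

The point of the dilated form is that the receptive field grows with `rate` while the number of multiplications per output stays at `len(kernel)`.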
Congratulations, DeepLabv3+ finally discovered that the U-Net architecture, first proposed 3 years ago, is more efficient than the flat architecture they used before.
DeepLabv3+ is still a wildly inefficient network structure, but it undeniably works if you can afford the computational resources. Just keep in mind you can achieve similar results (within 1% mIoU) with much leaner structures.
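In case the metric is unfamiliar: mIoU is the per-class intersection-over-union averaged over classes. A minimal sketch on flattened label lists, assuming classes that appear in neither prediction nor ground truth are skipped (benchmarks differ on that detail, and `mean_iou` is just an illustrative name):

```python
def mean_iou(pred, truth, num_classes):
    """Mean intersection-over-union over classes present in pred or truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:  # assumption: ignore classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy labels for six pixels and three classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, truth, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3 and 1/2, so a "1% mIoU" gap means a difference of 0.01 in this averaged number.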
Not at the kind of resolution you'd want to be using on, e.g., Twitch. In that setting, you could just use chromakey, though? That's '70s technology, cheap and very reliable.
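To show how little chroma keying needs, here is a toy sketch: call a pixel "background" when green beats both red and blue by some margin. The `margin` value and the tiny test image are arbitrary assumptions, and real keyers work in other colour spaces and deal with spill and soft edges.

```python
def chroma_key_mask(image, margin=40):
    """Return a mask: True = foreground, False = green-screen background."""
    return [
        [not (g - max(r, b) > margin) for (r, g, b) in row]
        for row in image
    ]


# 2x2 RGB image: two green-screen pixels, two "subject" pixels.
image = [
    [(10, 200, 20), (180, 150, 140)],
    [(30, 90, 60), (0, 255, 0)],
]
mask = chroma_key_mask(image)
```

That is one comparison per pixel, which is why it scales to full-resolution video without a GPU.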
This is a very active field of research. Another thread worth pulling on is Mask R-CNN: https://arxiv.org/abs/1703.06870
It's not quite as simple as "this one has highest mAP, let's use it"; the tradeoffs are complex. In particular, as you can see in the image here, one thing DeepLab doesn't do is segment instances – so you get a mask of "people", not a mask per person. Mask R-CNN does a better job on that by design, because it predicts both bounding boxes and a mask per bounding box.
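The difference in output is easy to see with toy data. The sketch below uses a made-up 1x6 "image" with two people side by side; the dict layout is only illustrative and is not the actual output format of Mask R-CNN or DeepLab.

```python
# Instance-style output: one box and one mask per detected person.
instances = [
    {"box": (0, 0, 3, 1), "mask": [[1, 1, 1, 0, 0, 0]]},
    {"box": (3, 0, 6, 1), "mask": [[0, 0, 0, 1, 1, 1]]},
]


def to_semantic(instances):
    """Collapse per-instance masks into one class mask (pixelwise OR)."""
    height = len(instances[0]["mask"])
    width = len(instances[0]["mask"][0])
    return [
        [int(any(inst["mask"][y][x] for inst in instances)) for x in range(width)]
        for y in range(height)
    ]


# Semantic-style output: a single "person" mask with no notion of who is who.
semantic = to_semantic(instances)
```

Going from instances to a semantic mask is trivial, but the reverse is not: once the two people touch, the merged mask no longer tells you where one ends and the other begins.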