Semantic Image Segmentation with DeepLab in Tensorflow (googleblog.com)
101 points by EvgeniyZh on March 12, 2018 | 10 comments



If you're interested in this but have no background, the best place to start is "Fully Convolutional Networks for Semantic Segmentation" – https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn...

This is a very active field of research. Another thread worth pulling on is Mask R-CNN: https://arxiv.org/abs/1703.06870

It's not quite as simple as "this one has highest mAP, let's use it"; the tradeoffs are complex. In particular, as you can see in the image here, one thing DeepLab doesn't do is segment instances – so you get a mask of "people", not a mask per person. Mask R-CNN does a better job on that by design, because it predicts both bounding boxes and a mask per bounding box.
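
For intuition, here's a toy illustration in NumPy (mine, not from either paper) of what the two output formats look like:

    import numpy as np

    H, W = 4, 6

    # Semantic output (DeepLab-style): a single label map, class id per pixel.
    # Two people collapse into one "person" region.
    semantic = np.zeros((H, W), dtype=np.int32)
    semantic[1:3, 0:2] = 1  # person A
    semantic[1:3, 4:6] = 1  # person B, same label -- instances are merged

    # Instance output (Mask R-CNN-style): a binary mask per detected object,
    # each paired with a box and a class score.
    person_a = np.zeros((H, W), dtype=bool)
    person_a[1:3, 0:2] = True
    person_b = np.zeros((H, W), dtype=bool)
    person_b[1:3, 4:6] = True

    print(np.unique(semantic))  # [0 1] -- you only know where "person" is
    print(int(person_a.sum()), int(person_b.sum()))  # one mask per person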


Great summary. I believe both models are available in Detectron if anyone wants to give them a go:

https://github.com/facebookresearch/Detectron


Yes for Mask R-CNN. For FCN, there is R-FCN.

Overall I'm really happy to work in a domain where people share their code and models in such an open way. I take issue with Detectron in particular, though, because a company the size of Facebook has no excuse for publishing a major software package in Python 2 in 2018. The oldest models it implements are from 2015 (excluding VGG16, which is so prolific it's available in literally every library under Python 3), and Caffe2 is quite a bit more recent than that. Like I said: no excuse...


The team behind Detectron have published an enormous amount of really good research, but the Detectron codebase struck me as "good research code" rather than something you'd ideally want in production.


Of course, I'm not criticising the fact that they publish these models, nor the models themselves. But publishing even arguably polished Python 2 code in 2018 is something I take issue with when it's not a legacy code base.


Link to arXiv (DeepLabv2): https://arxiv.org/abs/1606.00915

Link to arXiv (DeepLabv3): https://arxiv.org/abs/1706.05587

Link to GitHub: https://github.com/tensorflow/models/tree/master/research/de...

The README on there has a very neat TL;DR of the model (a rough code sketch of these pieces follows the quote):

"DeepLabv1 [1]: We use atrous convolution ['s a shorthand for convolution with upsampled filter'] to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks.

DeepLabv2 [2]: We use atrous spatial pyramid pooling (ASPP) [a computationally efficient scheme of resampling a given feature layer at multiple rates prior to convolution] to robustly segment objects at multiple scales with filters at multiple sampling rates and effective fields-of-view.

DeepLabv3 [3]: We augment the ASPP module with image-level features [5, 6] to capture longer-range information. We also include batch normalization [7] parameters to facilitate the training. In particular, we apply atrous convolution to extract output features at different output strides during training and evaluation, which efficiently enables training BN at output stride = 16 and attains a high performance at output stride = 8 during evaluation.

DeepLabv3+ [4]: We extend DeepLabv3 to include a simple yet effective decoder module to refine the segmentation results especially along object boundaries. Furthermore, in this encoder-decoder structure one can arbitrarily control the resolution of extracted encoder features by atrous convolution to trade-off precision and runtime."
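
To make that jargon concrete, here's a minimal tf.keras sketch of the three building blocks the README names: atrous convolution, an ASPP head, and a v3+-style decoder. This is my own toy, not the released implementation; the stand-in backbone, input size, class count, and rates are illustrative assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    def conv_bn_relu(x, filters, kernel=3, stride=1, rate=1):
        # dilation_rate > 1 is the "atrous" trick: the kernel is applied
        # with holes, enlarging the field of view without downsampling.
        x = layers.Conv2D(filters, kernel, strides=stride, padding="same",
                          dilation_rate=rate, use_bias=False)(x)
        x = layers.BatchNormalization()(x)  # the BN parameters v3 mentions
        return layers.ReLU()(x)

    def aspp(x, filters=256, rates=(2, 4, 6)):
        # ASPP: resample the same feature map at several rates in parallel,
        # plus the image-level (global pooling) branch added in v3.
        # The paper uses rates (6, 12, 18) at output stride 16; smaller
        # rates here because the toy feature map is tiny.
        branches = [conv_bn_relu(x, filters, kernel=1)]
        for r in rates:
            branches.append(conv_bn_relu(x, filters, rate=r))
        pooled = layers.GlobalAveragePooling2D(keepdims=True)(x)
        pooled = conv_bn_relu(pooled, filters, kernel=1)
        pooled = layers.UpSampling2D((x.shape[1], x.shape[2]),
                                     interpolation="bilinear")(pooled)
        branches.append(pooled)
        return conv_bn_relu(layers.Concatenate()(branches), filters, kernel=1)

    inputs = layers.Input((128, 128, 3))
    # Stand-in backbone: strided convs down to output stride 16.
    low = conv_bn_relu(conv_bn_relu(inputs, 64, stride=2), 128, stride=2)  # OS=4
    x = conv_bn_relu(conv_bn_relu(low, 256, stride=2), 512, stride=2)      # OS=16
    x = aspp(x)
    # v3+ decoder: upsample 4x, merge with reduced low-level features,
    # refine, then upsample to full resolution.
    x = layers.UpSampling2D(4, interpolation="bilinear")(x)
    low = conv_bn_relu(low, 48, kernel=1)  # channel reduction, as in the paper
    x = conv_bn_relu(layers.Concatenate()([x, low]), 256)
    x = layers.UpSampling2D(4, interpolation="bilinear")(x)
    logits = layers.Conv2D(21, 1)(x)  # e.g. 21 PASCAL VOC classes
    model = Model(inputs, logits)

The released models of course swap the stand-in backbone for a real one (Xception or ResNet) and tune the rates per output stride; this just shows how the pieces connect.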


Congratulations, DeepLabv3+ finally discovered that the U-Net architecture, first proposed three years ago, is more efficient than the flat architecture they used before.

DeepLabv3+ is still a wildly inefficient network structure, but it undeniably works if you can afford the computational resources. Just keep in mind that you can achieve similar results (within 1% mIoU) with much leaner structures.


Is this fast enough to be used for background removal in live streams?


Not at the kind of resolution you'd want to be using on, e.g., Twitch. In that setting, you could just use chromakey, though? That's '70s technology, cheap and very reliable.
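
To show how cheap the chromakey route is, here's a minimal OpenCV sketch; the HSV thresholds, frame size, and webcam device index are illustrative assumptions that would need tuning for real lighting:

    import cv2
    import numpy as np

    lower = np.array([40, 70, 70])    # assumed green range in HSV
    upper = np.array([80, 255, 255])

    cap = cv2.VideoCapture(0)                             # webcam (assumed)
    background = np.zeros((480, 640, 3), dtype=np.uint8)  # replacement backdrop

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (640, 480))
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, lower, upper)   # 255 where the backdrop is green
        mask = cv2.medianBlur(mask, 5)          # cheap denoising
        fg = cv2.bitwise_and(frame, frame, mask=cv2.bitwise_not(mask))
        bg = cv2.bitwise_and(background, background, mask=mask)
        cv2.imshow("keyed", cv2.add(fg, bg))
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()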


You could, but it's cumbersome; amateur streamers might not want to invest in the setup.



