Whether networks learn abstractions isn't really an interesting question. It's almost tautological: an image classification network will, by definition, (attempt to) distill an image into a distribution over categories. So when people criticize the abstractions, I think they're really criticizing the quality of those abstractions...
Because the key phrase in the grandparent post is "really represent or mean." Grass + white strands = sheep is a hierarchical abstraction, but it's a bogus one.
Feature visualizations do not answer this more relevant question.
Check https://distill.pub/2017/feature-visualization/appendix/
I think some of the filters are pretty clearly developing "higher levels of abstraction".
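For anyone who hasn't seen how those visualizations are made: the basic idea is activation maximization, i.e. optimizing an input image by gradient ascent so that one filter fires strongly. Below is a minimal sketch of that idea in PyTorch. It is not the Distill authors' exact setup (they use InceptionV1 plus transformation robustness and a decorrelated image parameterization); the model, layer index, channel, and hyperparameters here are arbitrary illustrative choices, and it assumes a recent torchvision with the `weights=` API.

```python
# Minimal activation-maximization sketch (illustrative, not the Distill method).
import torch
import torchvision.models as models

model = models.vgg16(weights="IMAGENET1K_V1").eval()

# Capture the output of one intermediate conv layer via a forward hook.
# Layer index 24 and channel 42 are arbitrary choices for illustration.
acts = {}
model.features[24].register_forward_hook(lambda m, i, o: acts.update(out=o))

# Start from noise and optimize the pixels themselves.
img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
channel = 42

for _ in range(256):
    opt.zero_grad()
    model(img)
    # Gradient ascent on the mean activation of the chosen channel.
    loss = -acts["out"][0, channel].mean()
    loss.backward()
    opt.step()
```

Run long enough, `img` drifts toward whatever pattern that filter responds to, which is exactly what makes the appendix interesting: you can eyeball whether a given filter's preferred stimulus looks like "texture of grass" or something closer to a real object-level concept.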