Interesting, I've long wondered why parametric nonlinearities are not used more often. They add very little to the overall parameter count (the number of units is most often dwarfed by the number of connections), but they should increase expressiveness a lot (e.g. adaptively interpolating between soft and hard nonlinearities). Taken to its extreme, I've been toying with the idea of combining DNNs and Genetic Programming - using a large number of arbitrary symbolic expressions with high connectivity and many layers.
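As a minimal sketch of what such a parametric nonlinearity could look like (assuming a PyTorch-style module; the name `SoftHardNonlinearity` and the particular soft/hard blend are just illustrative, not something from the discussion above):

```python
import torch
import torch.nn as nn

class SoftHardNonlinearity(nn.Module):
    """Illustrative parametric nonlinearity: each unit learns to blend a
    soft saturation (tanh) with a hard clipping (clamp to [-1, 1])."""
    def __init__(self, num_units):
        super().__init__()
        # One extra scalar per unit -- negligible next to the weight matrices.
        self.mix_logit = nn.Parameter(torch.zeros(num_units))

    def forward(self, x):
        # Assumes features on the last dimension, e.g. x of shape (batch, num_units).
        alpha = torch.sigmoid(self.mix_logit)       # learned blend in (0, 1), per unit
        soft = torch.tanh(x)                        # smooth saturation
        hard = torch.clamp(x, -1.0, 1.0)            # hard saturation
        return alpha * soft + (1.0 - alpha) * hard  # adaptively soft or hard
```

Since each unit contributes only one extra scalar, the added parameter count stays negligible compared to the connections feeding into those units.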
Genetic programming is an inefficient search method, and will require many evaluations of the cost function to optimize anything. In the case of DCNNs, evaluating an architecture can keep a modern GPU busy for days, so genetic algorithms are pretty much out of the question.
I think an easy way to improve our models in the short term is to make more of the parameters we use learnable: the parameters of the non-linearities are a good place to start, and another would be the parameters of the data augmentation transformations.
One could consider that learned data augmentation schemes implement a form of guided visual attention.
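To make the learnable-augmentation idea concrete, here is a sketch under assumptions of my own (PyTorch-style code; the module name `LearnableRotation` and the choice of a rotation transform are illustrative): a random-rotation augmentation whose maximum angle is itself a trainable parameter, so gradients can flow into the augmentation strength.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableRotation(nn.Module):
    """Illustrative sketch: random rotation whose maximum angle is learnable."""
    def __init__(self):
        super().__init__()
        self.max_angle = nn.Parameter(torch.tensor(0.1))  # radians

    def forward(self, x):  # x: (N, C, H, W)
        if not self.training:
            return x
        n = x.size(0)
        # Per-image angle, scaled by the learnable maximum (reparameterized,
        # so gradients reach max_angle through cos/sin below).
        angle = self.max_angle * (2 * torch.rand(n, device=x.device) - 1)
        cos, sin = torch.cos(angle), torch.sin(angle)
        zeros = torch.zeros_like(cos)
        theta = torch.stack([torch.stack([cos, -sin, zeros], dim=1),
                             torch.stack([sin,  cos, zeros], dim=1)], dim=1)  # (N, 2, 3)
        grid = F.affine_grid(theta, list(x.size()), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

One caveat with this naive sketch: optimizing augmentation strength directly on the training loss tends to drive it toward zero (weaker augmentation makes the training task easier), so such parameters would likely need to be trained against a held-out objective or regularized in some way.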