
It would probably work with careful choice of learning rate, initialization, and weight decay to keep signals small. Batch norm would play a larger role (you'd probably want to use it after the activation fn). I don't see why it would get stuck on either side individually, but it could obviously get stuck if enough signals grow too large on both sides. A rough sketch of that ordering is below.
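A minimal PyTorch sketch of the setup described above, with assumptions: hardtanh stands in for whatever saturating activation the parent thread is about, and the init scale, learning rate, and weight decay values are illustrative, not from the comment. The point shown is just the ordering (linear -> activation -> batch norm) and keeping signals small via init and weight decay.

  import torch
  import torch.nn as nn

  class Block(nn.Module):
      """Linear layer, then a saturating activation, then BatchNorm *after* the activation."""
      def __init__(self, dim):
          super().__init__()
          self.linear = nn.Linear(dim, dim)
          self.act = nn.Hardtanh()       # flat ("stuck") regions on both sides
          self.bn = nn.BatchNorm1d(dim)  # placed after the activation, as suggested

      def forward(self, x):
          return self.bn(self.act(self.linear(x)))

  model = nn.Sequential(Block(64), Block(64), nn.Linear(64, 10))

  # Small initialization keeps pre-activation signals out of the saturated regions.
  for m in model.modules():
      if isinstance(m, nn.Linear):
          nn.init.normal_(m.weight, std=0.01)  # assumed scale
          nn.init.zeros_(m.bias)

  # Weight decay and a modest learning rate help keep the weights (and hence signals) small.
  opt = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)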
