Back before deep learning was all the rage, textbooks used to argue that two layers were enough, because a two-layer network can approximate anything arbitrarily well.
They were right and wrong. Two-layer networks CAN indeed approximate any function arbitrarily well. They just do a piss-poor job of it: a shallow network can take exponentially more parameters than a deeper formulation to reach the same accuracy[0].
The lesson is that representation matters a lot. There are lots of ways to construct universal approximators, but they are not all created equal.
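To make the gap concrete, here's a toy sketch using the textbook separating example, n-bit parity (the linked paper may analyze a different function). A deep network computes parity as a balanced tree of 2-input XORs using O(n) hidden units, whereas a one-hidden-layer threshold network is known to need on the order of 2^(n-1) units. A minimal Python sketch, assuming ReLU units and n a power of two:

    import numpy as np

    def relu(x):
        return np.maximum(0, x)

    def xor(a, b):
        # 2-input XOR as a tiny ReLU sub-network with 2 hidden units:
        # XOR(a, b) = relu(a + b) - 2 * relu(a + b - 1)  for a, b in {0, 1}
        return relu(a + b) - 2 * relu(a + b - 1)

    def deep_parity(bits):
        # Parity of n bits via a balanced tree of XOR sub-networks.
        # The tree has n - 1 XOR nodes, i.e. 2 * (n - 1) hidden units
        # spread over O(log n) layers.
        layer = list(bits)
        while len(layer) > 1:
            layer = [xor(layer[i], layer[i + 1])
                     for i in range(0, len(layer), 2)]
        return layer[0]

    n = 16  # power of two, so the XOR tree stays balanced
    for _ in range(5):
        bits = np.random.randint(0, 2, n)
        assert deep_parity(bits) == bits.sum() % 2

    print("deep hidden units:", 2 * (n - 1))          # 30
    print("shallow bound ~ 2^(n-1):", 2 ** (n - 1))   # 32768

Both compute the same function; the deep version just gets linear growth in n where the shallow one gets exponential.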
[0] https://www.researchgate.net/publication/3505534_On_the_powe...