This simply isn't true. There are big caveats to the idea that neural networks a...

golol · on Jan 7, 2024

>The function has to be continuous.

That's not a problem. You can show that neural-network-induced functions are dense in a bunch of function spaces, just like continuous functions are. Regularity is not a critical concern anyway.
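To make the density point concrete, here is a rough sketch (an illustration, not a proof) of a tiny one-hidden-layer ReLU network approximating a discontinuous step function in the L^1 sense on [-1, 1]. The names `ramp_net` and `l1_error` and the steepness parameter `k` are made up for this example; the weights are set by hand, not learned.

```python
def relu(z):
    return max(0.0, z)

def step(x):
    # Discontinuous target: the Heaviside step function.
    return 1.0 if x >= 0 else 0.0

def ramp_net(x, k):
    # Hypothetical hand-built network: two hidden ReLU units with input
    # weight k, biases +0.5 and -0.5, and output weights +1 and -1.
    # It ramps linearly from 0 to 1 over an interval of width 1/k around 0.
    return relu(k * x + 0.5) - relu(k * x - 0.5)

def l1_error(k, n=20000):
    # Midpoint-rule estimate of the integral of |ramp_net - step| on [-1, 1].
    h = 2.0 / n
    total = 0.0
    for i in range(n):
        x = -1.0 + (i + 0.5) * h
        total += abs(ramp_net(x, k) - step(x)) * h
    return total

# The L^1 error shrinks roughly like 0.25 / k as the ramp gets steeper.
errors = {k: l1_error(k) for k in (1, 10, 100)}
for k, err in errors.items():
    print(k, err)
```

The pointwise error near the jump never goes away (the network is continuous, the target is not), but the integral of the error can be made as small as you like, which is the sense of "dense" that matters here.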

>functions vs algorithms

Repeatedly applying arbitrary functions to a memory (like in a transformer) yields arbitrary dynamical systems, so we can do algorithms too.
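As a toy illustration of "function applied repeatedly to a memory = algorithm", here is a sketch where one fixed step function, iterated on a small state, computes a GCD via Euclid's algorithm. The step function is written by hand rather than learned, and the names `gcd_step` and `iterate` are made up for this example.

```python
def gcd_step(state):
    # One fixed update of the "memory" (a, b): a single step of Euclid's algorithm.
    a, b = state
    if b == 0:
        return (a, b)  # fixed point: nothing left to do
    return (b, a % b)

def iterate(step_fn, state, max_steps=1000):
    # Apply the same function over and over until the state stops changing.
    for _ in range(max_steps):
        new_state = step_fn(state)
        if new_state == state:
            return state
        state = new_state
    return state

final_state = iterate(gcd_step, (48, 18))
print(final_state)  # first component is the GCD once b reaches 0
```

The step function itself is just a plain map from states to states. The algorithmic behavior comes entirely from iterating it, which is the dynamical-systems view in miniature.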

> an approximator being possible and us knowing how to construct it are very different things,

This is of course the critical point, but it is not so relevant when asking whether something is theoretically possible. The way I see it, this was the big question for deep learning, and over the last decade the evidence has continually grown that SGD is VERY good at finding weights that do in fact generalize quite well. Those weights don't just approximate a function out of step functions, the way you imagine an approximation theorem constructing it. Instead, they efficiently find features in the intermediate layers and use them for multiple purposes, etc. My intuition is that the gradient in high dimensions doesn't just decrease the loss a bit, the way we imagine it on a low-dimensional plot, but really finds directions that are immensely efficient at decreasing the loss. This is how transformers can become so extremely good at memorization.