A visual proof that neural nets can compute any function (neuralnetworksanddeeplearning.com)
259 points by graderjs on March 6, 2022 | 189 comments



There are many caveats to this, esp. that this "fact" has nothing to do with whether training a neural network on a dataset will be useful.

There is often no function to find in solving a problem, ie., there is no mapping from ImageSpace -> DogCatSpace. Ie., most things are genuine ambiguities --- a stick in water appears bent, indistinguishably from an actually bent stick in some other transparent fluid.

Animals solve the problem of the "ambiguity of inference" by being in the world and being able to experiment. Ie., taking the stick out of the water. A neural network, in this sense, cannot "take the stick out of the water" -- it cannot resolve ambiguities. So the fact that it can "approximate functions" is neither necessary nor sufficient for a useful learning system.

More significantly, an NN is a very, very bad approximator of many functions. Consider approximating a trajectory so that one can then find an acceleration, ie., here we need f(x) in order to find d2f/dx2 -- NN approximations are typically "OK" at the f(x) level and really, really crappy at the df/dx level, because the non-linear functions NNs glue together are only trainable if they're very rough.
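For what it's worth, here's a minimal sketch of the kind of check being described (the quadratic trajectory, network size and finite-difference step are my own arbitrary choices, and results will vary with seed and settings): fit x(t) with an off-the-shelf regressor, then compare the implied acceleration with the known constant.

  import numpy as np
  from sklearn.neural_network import MLPRegressor

  g = 9.81
  t = np.linspace(0, 2, 2000).reshape(-1, 1)
  x = (-0.5 * g * t**2).ravel()            # true trajectory, with d2x/dt2 = -g everywhere

  net = MLPRegressor(hidden_layer_sizes=(100, 100), activation='tanh',
                     max_iter=5000, tol=1e-7).fit(t, x)
  x_hat = net.predict(t)

  print("RMSE of x(t) fit:", np.sqrt(np.mean((x - x_hat)**2)))   # how good the fit itself is

  dt = t[1, 0] - t[0, 0]
  a_hat = np.gradient(np.gradient(x_hat, dt), dt)                # finite-difference d2x/dt2 of the fit
  print("mean |d2x/dt2 - (-g)|:", np.mean(np.abs(a_hat + g)))    # compare with the true acceleration

The second number is typically much worse, relative to scale, than the first, which is the point being made above.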

For these, and lots of other reasons, this theoretical approach to learning is largely marketing nonsense. If you go out and actually study the only known systems which learn effectively (ie., animals), one does not find the need for universal approximator theorems in explaining their capacities.

These theorems only show that NNs are, like many computational statistical techniques, sufficient for a wide class of "mere approximation" problems that are only narrowly useful within the whole field of learning-as-such.


Did you really compare a single neural network with a real physical agent (dog) with 5 senses, trillions of cells, and a complex inter-connected brain operating in the real world? This is close to what I'd call unfair, or a straw-man argument.

This is like calling calculus useless, because it doesn't help you pick one kind of milk versus another in a supermarket.

Universal function approximation is a great tool if used correctly. From what I understand, it comes in handy when our puny human brain can't come up with an algorithm to produce such a function. Let's stick to image recognition/processing. Can you write a function that recognizes a cat or a dog in a 64x64 greyscale image? Or write a function to remove the background from a picture of a person? Can you write a function to generate a 3D depth map for a 2D picture?

> These theorems only show that NNs are, like many computational statistical techniques, sufficient for a wide class of "mere approximation" problems that are only narrowly useful within the whole field of learning-as-such.

That's exactly what people are looking for in most cases, the mere approximation can be good enough to consider the problem solved.


I've spoken with no end of people who think "universal fn approximation" is some magic token to be played in the AI debate -- people getting PhDs in ML, no less.

These are really, in my experience, people with no science background -- csci programmers who don't have any conceptual foundations in applied mathematics or science (outside of the discrete math taught in csci) -- and who take these properties as genuinely quite magical.

"Intelligence" is, to them, just some function and if NNs can approximate any fn, then presumably they'll be intelligent.

They aren't aware that, in a sense, everything is a dynamical function of space and time (say, stuff(x, t)) and that to instantiate it, one requires the entire universe.

In other words, csci people are not used to thinking about applied mathematics in the sense of science (implementation of functions, dynamical functions, and so on). I think it is important to demystify these properties for this audience.

Being a universal fn approximator, in my view, is neither a necessary nor sufficient property of any system implementing intelligence. It's really a misdirection.


I'm one of these people with no formal training in any scientific field. I don't think any of the things your stereotype assumes people like me think. I know it's not magic, it's brute-force problem solving. Once you are capable of training systems like this, you get to solutions which are really powerful.

All you see is numbers and functions, and all I see is problems being solved by these AI/ML methods and often in quantifiably better ways than intelligent humans can.

NNs are the building blocks. Neurons in your brain are dumb as well; it's the quality of the network that matters. So instead of moving goalposts, what do you think is required to implement intelligence?


Just because you can brute force things doesn't mean it's a practical way to solve certain problems. The set of all C++ programs humans will ever write is finite and therefore parseable with a regular language, but no one's out there writing C++ compilers that way for obvious reasons. That's the essence of what I think GP is getting at.

I don't know if NNs are sufficiently powerful to escape that argument because my formal understanding of "the world" simply isn't good enough, but it's not obvious to me that they are.


I've seen a case where someone implemented an algorithm by hand after an existing AI was created. I can't remember the details unfortunately, but the hand-made version was better, so it's certainly a possibility. NNs automate human problem solving, as was demonstrated recently by DeepMind's AlphaCode.

It can be quite practical though, as deep neural networks (and massive increases in hardware capabilities) have reached solutions that were previously out of reach.


AlphaCode works by first generating many millions of programs from a large language model trained on all of GitHub and fine-tuned on a dataset of problems like the ones it's supposed to be solving. Then, those millions of programs are reduced to several thousand by "filtering" on input/output examples of the target problem. Finally, these thousands of candidates are further reduced by hand-crafted heuristics, clustering and ranking, which finally select k programs, 10 in the paper introducing AlphaCode.

This is not a practical approach. It's a maddeningly resource-intensive approach that spams programs at an insane rate and hopes to get lucky and catch a few that solve a programming task. In the evaluations listed in the AlphaCode paper, it performs awfully against the test set of its training dataset and a pre-existing benchmark. DeepMind have drummed up their system's ability to do about as well as the average participant in a code contest, but that's no measure of coding ability.

Deep neural nets are certainly very far from automating human problem solving in the way you claim in your comment. Your comment is exactly the kind of overhyped misunderstanding of the state of the art that the OP is warning against.


Thanks for the deeper explanation. Even though it's not practical, it's an approach that previously wasn't possible so at least some cheering for it is valid.

Problem solving is getting automated, but to what extent is debatable; we are indeed in the early stages. I haven't suggested that it's already AGI.


>> Even though it's not practical, it's an approach that previously wasn't possible so at least some cheering for it is valid.

The approach taken by AlphaCode -- generate-and-test -- was always possible and is a staple in inductive program synthesis. AlphaCode's version is a particularly crude and backwards-looking attempt at implementing that approach. It's no advance of any kind and it only looks like it is because DeepMind's preprint reporting on it carefully ignores most of what has gone before in inductive program synthesis, the better to show off its new system. It's a scandalous tactic that is a staple of DeepMind's publications and I'm very sorry that I can't find anything positive to say about it.

Also, I'm not talking about AGI. At this point, AGI is a fantasy. What we can hope for is to create systems that we can control well, for example to make them learn what we want them to learn and not to overfit to their dataset. In this, also, there seems to be a regression, recently, as the tendency is to train systems end-to-end while trying to remove "the human in the loop" (which is, of course, impossible).


> At this point, AGI is a fantasy.

I thought you were, because fully automated problem solving is AGI, but I don't claim that we are anywhere near reaching it.

I agree it's probably far away, but it's a fantasy I like to imagine and I don't think it's an impossible goal. Don't be sorry, I appreciate a fair criticism.

> What we can hope for is to create systems that we can control well, for example to make them learn what we want them to learn and not to overfit to their dataset.

I wish you strong nerves, because it will have AGI written all over it :) Again, this is not a claim, just a prediction.


I don't think that having control over machine learning systems must necessarily mean AGI. For example, suppose I train an image classifier to learn to identify dogs in images. And suppose I find that my classifier has learned to label all images that have a dog collar as "dog" and no image that doesn't have a dog collar as "dog". That means my classifier has learned an over-specialised definition of "dog" that doesn't generalise to dogs without dog collars, and at the same time an over-general definition that may accept, say, a fetish model with a dog collar as a "dog". That is not a system that performs as I want it to perform: I wanted a dog classifier but I got a dog collar classifier. What I want is to be able to control my system to learn to identify a dog as a "dog" rather than a dog collar as a "dog". That's "control" and when I achieve it my system is still not an AGI, just an image classifier that generalises well.


Well, I'm not making a stereotype -- I'm offering an explanation of the people I've met. It's very hard to converse with people whose academic background is discrete mathematics in an essentially empirical domain (that of modelling intelligence).

There's a lot which is odd (and dubious) in AI/ML that traces its origins to this peculiar situation of a discipline (csci) in the "early phases" of modelling an empirical phenomenon, without, yet, an intra-disciplinary theoretical support for it -- csci people take geometry and classical physics to make video games. They don't yet take the equivalent to make intelligent systems (which would include stuff on learning in animals and humans; and, in my view, more applied math).

In any case, to answer your question about implementation, see https://news.ycombinator.com/threads?id=mjburgess#30579711


NNs are clever, but what they do is essentially reverse programming. Rather than you writing a function that translates a -> b, you give it a -> b mappings and the training writes the function. What GP was saying is something that most programmers should already know: that writing the function is usually the easy bit, and the hard bit is defining and then modelling the problem; NNs can't do that.
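As a toy illustration of that framing (the Celsius-to-Fahrenheit mapping and the scikit-learn model are my own picks, not anything from the article): you never write the conversion rule into the model, you only hand it a -> b pairs.

  import numpy as np
  from sklearn.neural_network import MLPRegressor

  # a -> b training pairs (here generated from a known rule, but in practice they'd just be data)
  a = np.linspace(-40, 100, 500).reshape(-1, 1)
  b = a.ravel() * 9 / 5 + 32

  model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000).fit(a, b)
  print(model.predict([[20.0]]))   # hopefully close to 68, with no conversion rule ever written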


There is far more computational power in a single neuron than any NN that I've heard of.

A single approximated "neuron" in an AI/ML NN doesn't even operate anywhere close to a real neuron. Real neurons are oscillators which exhibit nonlinear dynamic behaviors.


I think what the OP is implying is that the intelligence we find in nature is not isolated computation but is physically situated - it can sense and manipulate its environment.

The current DNN-based approaches do not do that. Sure, you can create an autonomous vehicle that uses an NN to help navigate - but it does not update its inference if it is incorrect. That immediate feedback doesn't exist, or rather it is not accessible to the system itself. One has to collect the data, retrain the network and deploy it again to close the loop.


[NOT THE OP]

>> All you see is numbers and functions, and all I see is problems being solved by these AI/ML methods and often in quantifiably better ways than intelligent humans can.

Maybe in playing chess and go, but in any other task that neural networks have been compared to humans directly, the ability to outperform humans is more informative of the limitations of the measures of performance, than of the capabilities of the tested systems (typically, deep neural networks).

Take large language models (LLMs) for example. Last time I looked (it was a while ago, but) LLMs like BERT, ERNIE and their famous friends scored very highly, and even above humans, in natural language understanding tasks, such as those in the GLUE and SuperGLUE benchmarks.

You'd think that a system that scores 85% in a language understanding task has developed at least _some_ ability to understand language. And yet this is not the case. LLMs, despite their ability to score highly in formal benchmarks, remain dumb as bricks and as incapable of understanding human language as a rock.

We know this because every once in a while someone in the field of NLP looks away from their favourite leaderboard for long enough to ask serious questions about the performance of systems, as for example [1], who started by asking why a language model like BERT would be any good at argument comprehension (BERT scored only three points below humans in a relevant benchmark task). Every time anyone undertakes this kind of analysis of the ability of LLMs to understand language, the conclusion is that they don't, and instead they learn "cheap tricks" and overfit to statistical regularities of the benchmark datasets. Similar analyses have found similar results in machine vision (e.g. [2], [3]), etc.

The question then is, in what sense, exactly, is an automated system solving problems "in quantifiably better ways than intelligent humans can", when that system is not even solving the problem it's supposed to be solving, like the problem of understanding language by LLMs? The idea that those systems "solve problems", when they're really only overfitting to their training sets, is an artifact of the shoddy scholarship that is typical in deep learning research, and the OP is right to be concerned that there is little understanding among machine learning researchers of what, exactly, it is that their systems are doing -- or even trying to do.

And, as the OP points out, I'm afraid that really is the result of a poor background, not only in sciences and mathematics, but in AI and even machine learning itself, among the current generation of machine learning researchers. Contrary to what the OP says, though, this paucity of relevant background is the result of an influx to the field of people from the sciences, who have a background in continuous mathematics, but not in AI or computer science. It is, indeed, a background in computer science that can best guard against misconceptions about computers being wonder machines with magickal intelligence abilities. At the very least, you don't have to explain to a computer scientist why solving one instance of a class of problems does not mean a general solution for the entire class. That comes built in to most who have graduated in the discipline. But is it built in to physicists, or biologists, who study very different systems?

________________

[1] Probing Neural Network Comprehension of Natural Language Arguments

https://aclanthology.org/P19-1459/

[2] Intriguing properties of neural networks

https://research.google/pubs/pub42503/

[3] The Elephant in the Room

https://arxiv.org/abs/1808.03305


The influx of scientists is a fascinating criticism. I guess maybe the lunatics run the asylum here, and what I've seen from recent csci grads is partly because they're being educated by people without the traditional csci scepticism (which I agree is there: the older generation of csci academics are much more aligned with my view).

Perhaps "scientist" is too vague a term. I mean people with a "in their bones" appreciation of experimentalism, empirical adequecy etc.

What you're saying overall aligns with my view -- no one here is defining empirical adequacy for ML systems as "theories of natural language". As soon as you do this, which would be step 1 in any experimental endeavour, the whole show collapses.

No one here is being taught to target empirical systems and develop criteria of empirical adequacy... thinking somehow "accuracy on benchmarks" tracks "ability to understand language" -- this, as we both think, is mad.

I guess my cultural issue is better described, then, as "pure-ish mathematicians" having come to own an empirical field, with all the predictable nonsense which follows (whether they are discrete or continuous).


Well, AI research is full of pathologies and I don't decry the "influx" of people from the sciences into the field. I'm just worried that the people who enter the field don't understand what's going on because they don't know what's been done before. To the extent that scientists from other fields can bring in the tools that computer science is lacking and a fresh new look at what AI research has done so far, it's great to have them. I'm convinced, like I think you are, that progress towards artificial intelligence can only be achieved by an inter-disciplinary approach.

>> (...) the older generation of csci academics are much more aligned with my view).

That's true and heartbreaking. I don't know what we can do about it. The success of big tech corporations has broken computer science education, I'm afraid, and now all that graduates care about is how to make a six-figure salary at a FAANG.


I think you don't recognize the power of a universal function approximator. You seem to underestimate the number of possible ways that problems can be abstracted as functions to and from n-dimensional vectors. It's a very non-intuitive statement that a neural network can be used to recognize dogs and cats in images.


Is it more intuitive to say that a convolutional neural network can map sets of pixels to scalars denoting categories of objects? I think not, but it's harder to misunderstand the less intuitive description, and much, much easier to overhype the more intuitive one. "OMG, neural nets can recognise cats and dogs!". Humans can recognise cats and dogs and also know what cats and dogs are, so it's very easy to misunderstand the ability of CNNs to map between sets of pixels and category labels as a general ability to understand the world, which is not there.


The point is that there are many different universal function approximators, so that alone adds no value to explaining why NNs are good. This type of theory is meaningful (it would have been harder to take anything as complex as a multi-layer perceptron seriously without it back in the '80s), but it's not a fact that's going to help us make any progress nowadays. Mentioning it over and over again just distracts us from the real problems we need to address.


Personally, I feel like we're on the other end of this trend, and rather than people constantly claiming DL equals AI, there's now always someone who can't wait to explain why DL doesn't equal AI and bemoan the AI hype in the customer-facing side of the industry.


> Did you really compare...

I think the guilty party is whoever chose the label "neural" to describe high dimensional regression.

I don't see a rebuttal in your reply. NNs can make for good classifiers but they certainly get overused trying to make predictions in spaces that are not predictable.


It's just so annoying to read such a condescending tone about a cool mathematical and computational tool, to the degree of almost ridiculing it. It happened before and it caused the first AI winter, which was unfortunate. Maybe they are annoyed, too, that it gets overhyped and misrepresented.


> There is often no function to find in solving a problem, ie., there is no mapping from ImageSpace -> DogCatSpace. Ie., most things are genuine ambiguities --- a stick in water appears bent, indistinguishably from an actually bent stick in some other transparent fluid.

That's tangential imo. There is a function that maps from image space to dog/cat/don't-know space. A sentient being that gets the "don't know" can get more info to resolve the ambiguity (or rephrase the question). A universal function approximator can still make itself useful even if all it can do is say it doesn't know. This is a question of problem setup.

NNs are bad at extrapolating, including e.g. trivially to periodic functions. This is a limitation if you thought they could do that, but again a question of understanding what they do.
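A quick sketch of that extrapolation point (the training range, width and iteration count are arbitrary choices of mine): fit sin on one period, then ask about points outside it.

  import numpy as np
  from sklearn.neural_network import MLPRegressor

  N = 5000
  X_train = 2*np.pi*np.random.uniform(size=N).reshape(N, 1)
  y_train = np.sin(X_train).ravel()
  net = MLPRegressor(hidden_layer_sizes=(100,), activation='tanh', max_iter=3000)
  net.fit(X_train, y_train)

  X_in  = np.linspace(0, 2*np.pi, 1000).reshape(-1, 1)        # inside the training range
  X_out = np.linspace(2*np.pi, 6*np.pi, 1000).reshape(-1, 1)  # outside it
  print("interpolation MSE:", np.mean((net.predict(X_in)  - np.sin(X_in).ravel())**2))
  print("extrapolation MSE:", np.mean((net.predict(X_out) - np.sin(X_out).ravel())**2))

The second number is typically orders of magnitude worse than the first: the fit says nothing useful about the function beyond where it saw data.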

Hype, as you say, leads some people to believe NNs are magic, leading to mismatched expectations. A universal interpolate-only function approximator is still pretty useful though, just maybe disappointing if you understood it to imply sentience.


Sure, we could see `animal = f(image)` as a partial function and lift it into some total function space, `maybe_animal = f(image)`.

Is our reasoning here actually this total function, though? We really do need to keep in mind that animal reasoning will terminate, the animal will act "guided by other reasoning", and then resume the original reasoning process.

Action really messes up this neat "totalising option" for partial functions. If I'm not sure whether "Fluffy" is a dog or a cat, I might wait longer for it to move; I might throw something at it; I might ask its owner.

This isn't as simple as "don't know"; since reasoning is kinda time-bound and time-parameterised, my very measuring process is sensitive to my own confidence in what-something-is.

Is this really "f(image) = dont-know"? I dont think so.

I think it's more like, "judgement = body-brain-state(t, reasoning-goal, {action policies}, ...large-number-of-other-things)".

This gets to my issue. In my view what animals are doing is better described by a dynamical equation of state (like a wavefunction), as the whole system is operating under its own dynamical evolution (including, eg., what fluffy does in response to your puzzlement).

I don't see Dog|Cat|DontKnow as the answer here. I don't think it's the actual total function which corresponds to our judgement, though it probably is total -- we just end up with "DontKnow" in all the cases where actual intelligence is required... the very thing we're aiming to model.


"NNs are bad at extrapolating, including e.g. trivially to periodic functions."

WaveNet and its descendants are the obvious counter-example here. It is excellent at learning to generate periodic and nearly periodic functions...


I know I've seen papers where they use sinusoidal nonlinearities to learn periodic functions. I felt like that's a bit of a hack though (not necessarily a bad thing) - you're bringing domain knowledge in, which, if you allow it, makes it easy to extend to periodic functions. The failure of a vanilla NN to learn periodicity is, I think, a specific failure to extrapolate, which is the bigger problem.

I'll take a look at the architecture you mention.

Edit: looked it up, I see it's an autoregressive model, like PixelCNN in 1D. I do know PixelCNN; I hadn't really considered the connection to periodic functions, but I see what you're saying. In a sense any AR model is extrapolating, but not in the sense I mean: it has seen training examples of the next point predicted from the last points; it's not extending a relationship it learned in training to something new. Anyway, thanks for pointing the model out.


There's ample literature showing that neural vocoders generalize to new speakers not seen in training. IME, once you train on around ten speakers the models generalize quite well to new speakers, and in practice training sets for production models often use hundreds of speakers.

There are also non autoregressive GAN vocoders which work great, with a tiny fraction of the compute of the original WaveNet. Soundstream is a good example of this.

My own work on bird song separation uses non AR models for generating separation masks. It works great and generalizes to new habitats and species without difficulty. https://ai.googleblog.com/2022/01/separating-birdsong-in-wil...

Calling architecture variants a hack risks a no-true-Scotsman argument... The only true neural networks are the ones that don't work, perhaps?


I think the approach of [0] is closer to what’s going on in our brains, basically evolving symbolic equations of partial derivatives until we get something good enough. Really fascinating and succinct paper.

0. https://cdanfort.w3.uvm.edu/courses/237/schmidt-lipson-2009....


You've got a kind of narrow view of the matter... The need for interactive understanding is addressed in different ways by reinforcement learning, GANs, and autoregressive recurrent neural networks.

In the latter cases, the generative output of the network is a kind of experiment, and back prop from the loss function provides a route to improvement.

I think it's an unfortunate historical accident that the field of machine learning is so transfixed with classifiers. But they're really not the only game in town.


The article also conveniently glosses over the fact that all AI calculations are limited in their maximum complexity by the depth of the AI. For example, for x! the best a regular DL AI can do is to memorize some values and interpolate between them.


> If you go out and actually study the only known systems which learn effectively (ie., animals), one does not find the need for universal approximator theorems in explaining their capacities.

What's the explanation for their capacities, then?


direct, theory-laden, causal contact and pro-active engagement with their environments... eg., taking the stick out of the water.

"data" (even from an animal's pov) is basically, by nature, ambiguous. Measurement is an event in the world (eg., light hitting the eye) which isn't somehow unambiguously informative. To over come this problem, basically, animals move stuff.

The adaptable motor cortex of the most intelligent animals, therefore, isn't something to be tacked on to "intelligence"; it's the precondition for it.

Glibly: tools before thoughts. Reason needs content to operate on, the content of our thoughts is built via our (sensory-)motor systems.

The idea that we need ever more theoretically powerful models of reasoning here is a misdirection -- it misses that the heart of everything we know arrives via some form of repeated experiment.

Every more automated "pure reasoning" either via statistics or symbolically, is always just a means of juicing the data we provide the system. Useful as a technology, but not as a genuine learning system. It will never have the means of resolving the many ambiguities within data itself.

In the case, for example, of NLP -- the structure of 1 trillion documents will never enable a machine to answer the question "what do you like about what I'm wearing?" -- because (1) the machine isn't here with me; (2) I am asking for its personal judgement, not a summary of a trillion documents; and (3) that summary of those trillion documents has to be unique, but the question has no "right answer".

Whilst computer science is in the driving seat over what "intelligence" is, we will forever be stuck with this incredibly diminished view of our own capacities and of the size of the technological challenge. The goal isn't to sift through everything we have already done and "take a mean"; the goal is to produce a system which could have done "everything we have done" without us.


That clearly doesn't explain how you get from the biological system to intelligence though. NNs can be imagined as a (highly simplistic) form of repeated experiment too.


Intelligence starts when the internal physical structure of a system is "dynamically reflective" of its external environment in a way which is stable over time (basically, "complex hysteresis"). You get this with a hard drive (+CPU, etc.) sure. I'd call this sort of minimal intelligence merely "reactive".

You get, let's say "adaptive intelligence" when the physiological structure being changed adapts the system so that it is able to interact with its environment more (eg., it can move differently).

To get to more advanced forms we need an explicit reasoning process which can represent this physical state to the system itself to engage in inference. Let's say "cognitive intelligence".

We get typical mammalian intelligence when the broader physiological structure (in particular the sensory-motor structure) of the system is actually guided by this explicit reasoning process. (Eg., a cyclist grows their muscles differently by reasoning-when-cycling). Let's call this "skill intelligence".

You get human intelligence when the explicit reasoning process becomes communicable, ie., when interior representations can be shared without the physical activity of acquiring those representations. Human intelligence is really "outside-in" in a very important way, which AI today also neglects -- it took 100bn dead apes to write a book, not one. Let's call this "socio-symbolic intelligence".

What we have today is really just systems of "reactive intelligence" with weird Frankensteinian organs attached to them: "look at Alexa turn the lights off!!!!". Alexa isn't turning the lights off in the manner of the "socio-symbolic intelligence" we attribute to Alexa naively (and delusionally!).

Alexa is a reactive system which has something of socio-symbolic significance (to us, not it!) glued on. Alexa does not intend to turn the lights off, and we're not communicating our intent to her. She's a hard drive with a SATA cable to a light switch.


I feel like bringing David Marr's "levels of analysis" into the discussion is useful here, at least in very loose terms.

> explicit reasoning process which can represent this physical state to the system itself to engage in inference

> when interior representations can be shared without the physical activity of acquiring those representations

Roughly, you're talking at the 'computational level' or perhaps above. You're describing qualities of the computation that an agent must be doing. You don't descend to the algorithmic level, which is where the 'universality theorem' discussion is taking place. Which is not to say that any of what you've said is wrong, but to someone like the parent asking "_how_ you get from the biological system to intelligence though" (emphasis mine), I think it's basically a non-answer.


Well, my answer there will alarm many. Broadly, it's whatever algorithm(s) you'd call biochemistry. I think the self-replicating adaptive properties of organic stuff are the heart of how we get beyond what the mere hysteresis of hard drives can do. We require cells, and their organic depth.

We don't often describe reality with algorithms in the sciences; if reality admits a general algorithmic description, it is surely beyond any measurable level. So I don't think the answer to the problem of intelligence will require computer science till much later in the game, if it is even possible to actually create it artificially.

Whatever "algorithm" would comprehensively describe the relevant properties of cells, even if it could be written down, will never be implemented "from spec". One may as well provide the algorithm for "a neutron star" and expect playing around with sand will make one.


I think this is another non-answer.

Biochemistry is broad and does really diverse things, and if your answer to "how does a mammalian brain allow it to reason through interactions with its physical environment" is "biochemistry", and that's also presumably the answer to "why is that pond scum green?" and "why are fingernails hard?", then it fails to be an explanation of any sort.

Similarly, if someone asks "why is your program dying on a division-by-zero error", it's not an explanation to say "well, that's just a matter of how the program executes when provided those particular inputs".

What _specifically_ about the biochemistry of our nervous systems allows us to solve problems, or communicate, as versus just metabolize sugar?


It's about self-replication, adaption and "scale-free properties".

Consider a touch on the surface of my skin -- which can be a few atoms of some object brushing a single cell -- that somehow "recurses up" the organic structure of my body (cell, tissue, organ, ...), both adapting it and coming to be "symbolically objectified" in my reasoning as "a scratch".

The relevant properties here are those that enable similar kinds of adaption (and physiological response) at all relevant scales from the cell to the organ to the whole-body.

I think cells-organs-bodies are implementations of "scale-free adaption algorithms" (if you want to put it in those terms), which enable the implementation of "higher-order intelligence algorithms".

If you want much, much more than this, then even if I had the answer, it wouldn't be comment-sized; it'd be a textbook. But of course, no one has that textbook, or else we wouldn't be talking about this.

I think if you see cells as extremely self-reorganizing systems, and bodies as "recursive scale-free" compositions of "self-reorganizing adaptive systems", then you get somewhere towards the kinds of properties I'm talking about.

I think my ability to type because I can think is a matter of that "organic recursion" from the sub-cellular to the whole-body.


I like what you wrote here and how you think.

However I have to make one remark.

Schopenhauer would say that Alexa does in fact have the will to turn the lights off. A burning will, the same will within yourself and everything that is not idea. That it is your word that sets off an irreversible causal sequence of events leading to the turning off of the lights. Schopenhauer would cite his "Principle of Sufficient Reason" as the reason for its happening. It is not that Alexa chooses to obey, but that, by the causal chain enforced by physics and more, the will of the universe has no choice but to turn off the lights. Same reason why the ball eventually falls down when thrown up. I believe this is the metaphor Schopenhauer uses in his World as Will and Idea.


Well Schopenhauer is an idealist, which confuses the issue. The world as will and representation is quite different than "the world as stuff with will" -- the latter is panpsychism.

To address "world as will", i'd just say it more-or-less doesnt matter what the world is in this sense. There's a distinction between my asking you "please turn the light off" and my rehearsing sounds "alexa, light off" -- and that difference "leaves open" the question of to what degree theyre both grounded in "the nature of the world". Since everything is "will", distinctions then just become distinctions in will.

As for panpsychism, which is really "materialism + the-material-is-conscious", this threatens to confuse the issue more -- since it isn't saying everything is "grounded in x", which allows you to ignore x, really -- as "everything-grounding" theories rarely disable your ability just to ignore them.

In misattributing the properties we are looking to create/find/etc. in things, panpsychism runs the risk of creating a false picture of continuity which impairs our ability to see genuine difference.

In other terms: whilst idealism borders on a dissociative paranoia, panpsychism borders on schizophrenia -- whilst dissociation is your problem, schizophrenia might be our problem too.

Here the schizophrenia of panpsychism is thinking that "Alexa is talking back" -- it isn't.


I am interested in this way of looking at intelligence. Do you have any references one can study where this is discussed, i.e., that intelligence is affected by the physical nature of where/what it is situated in?

Another way of saying this: if I were to build at least an animal-level intelligence (cognitive or skill), I would need to first give it tools to sense & manipulate its environment, whose output it can understand. Has anyone worked on this before?


I found a couple of references, bookmarking here

Physical Intelligence as a new paradigm https://www.sciencedirect.com/science/article/pii/S235243162...

How the Body Shapes the Way we Think https://books.google.co.in/books?hl=en&lr=&id=EHPMv9MfgWwC&o...


It seems to me like a "reactive intelligence" is basically equivalent to an "adaptive intelligence" which just hasn't begun making use of all of its possible outputs yet. Obviously even though we adapt to our environment, there are still ultimate limits to how far we can adapt.


The reason a dolphin isn't as smart as an ape is that it's a tube of meat in the ocean, and an ape is a pianist up a tree.

I see non-trivial forms of intelligence largely as symptoms of physiology. Even a human brain in a dolphin would be dumb; indeed, that's basically just what a dolphin is.

There is something absolutely remarkable in a thought about something, moving my hands to type; and my hands actually typing. Personally, I think 90% of that miracle is organic -- it is in the ability of our cellular microstructure to adapt quickly, and of our macro-structure to adapt over time.

Either way, people who intend to build intelligent systems have a task ahead of them. Building a system which can really use tools it hasn't yet invented is, in my view, a problem of materials science more than it is of discrete mathematics.


Experiments are data collection. They're the input to the thinking process.


This is why we work with our tools though, no? An NN can decide where the ambiguities lie in the data set and communicate that to the data scientists working with it, who can then provide data to help disambiguate.


> Animals solve the problem of the "ambiguity of inference" by being in the world and being able to experiment. Ie., taking the stick out of the water. A neural network, in this sense, cannot "take the stick out of the water" -- it cannot resolve ambiguities.

Couldn't training data include a video of someone taking a stick out of the water?


In that case, a neural net will learn to predict the next frame in a video stream, but it still won't learn what a stick is or why it looks bent when it's in the water.


Then we have no ambiguity. The data is inclusive.

And the cost is a very seriously larger problem space to classify.


Interpolation in GANs seems a lot like "being able to experiment. Ie., taking the stick out of the water"...


They are not taking the stick out of the water, because they don't have hands and are not looking at a real stick. They are being trained on a static data set, and a picture of a stick isn't a stick. They can try to extrapolate all they want, but they are fundamentally not going to be able to get more information out of the data than there exists.

And in a static photo of a bent stick, or a fluffy critter, there simply isn't any information to tell whether this is a bent stick or a stick in water; or whether it's a cat or a dog. The intelligent response is not "I don't know", and it's not "60% it's a cat, 40% it's a dog". It's "here is the set of actions that need to be taken to create more data to be able to settle the question".

And creating that set of actions is completely different from current approaches. No state-of-the-art GAN can say "you need to view the subject from a steeper angle, check for a reflection to see if it's in water" or "poke it with a stick, see if it meows or barks", because they don't have enough information about the world in their training sets to even know that these are possibilities.


They simultaneously have not enough information, and too much information. A human child can learn what a chicken is by seeing a chicken once, but "seeing a chicken" is not the same as the representation of a chicken as a set of pixels in an image. It's ... something else, that we don't quite have the words to describe. Clearly, when we look at things in the real world we take in much more information than when a CNN takes in an image. Machine learning datasets have too much data, but it's the wrong kind of data and they're the wrong kind of learning system, to learn what humans can learn from "single examples".

And humans can also learn about radically new categories of things that they have never experienced just by looking at an image, even a low-quality image.

And that's humans, which are like the apex of intelligence on the planet (I know, right?). Neural nets can't even build the representation of the world that a simpler animal, like a cockroach, must have to be able to exist autonomously in its environment. We don't have anything like a machine that can get close to the ability of a cockroach for autonomy. It's mad. But people keep thinking, "hey, we'll just use more data and more layers", and expect to get to the moon by jumping higher with a larger pogo stick.


> There is often no function to find in solving a problem

In finance NNs can be used to calibrate models, i.e., generate parameters for a model (function) that replicates existing data (observed prices/volatilities).


Well, I don't think prices are functions of economic variables.

Recall a function is `y = f(x)`, not `y1, y2, y3... = f(x)`.

So what you're modelling is something like `E_hope[y] = f(x)`, where `E_hope` is "a hopeful expectation" that the mean of the underlying ambiguous y1,...,yn does "reliably" correspond to a unique `x`.

This "hopeful expectation" is certainly more common than there being any actual function connecting `y` to `x`, but i think its often quite false too. Ie., even the expectation of prices is genuinely ambiguous.

To handle this we might ensemble models `E_ensemble[E1_hope[y], ...En_hope[y]]`, but to repeat a famous idiom in finance, this is very much "building sandcastles in the sky".

The idea that you can just "expect" (/statstics) your way out of the need for experimentation is a dangerous superstition which is at the heart of ML. It is impossible to simply "model data", measurement produces genuine ambiguities which can only be resolved by changing the world and seeing-what-happens. There is no function to find.


>Recall a function is `y = f(x)`, not `y1, y2, y3... = f(x)`.

y1,...,yn is a perfectly reasonable function output. Functions don't have to produce scalars.


Functions have to resolve to one point in the output domain, even if that point is multi-dim.

Here, consider `y_houseprice = price(house data, economic data, etc.)`. There isn't a unique house price in terms of those variables. The real world observes many such prices for the same values of those variables.

An overly mathematical view of the world has obscured the scientific method from our thinking here. Generally, there aren't actually functions from X to Y, and there aren't actually stable XY distributions over time.

The world, as measured, is basically always ambiguous and discontinuous. Data, as measurement, isn't the foundation of our theory-building. We build theories by changing the world; data comes in as a guiding light to our theory building, not as the basis.

...which is our direct causal interaction with our environment, ie., it's the actual stuff of our bodies and the stuff of the world as we change it.


> There isn't a unique house price in terms of those variables

> An overly mathematical view of the world has obscured the scientific method from our thinking here

Since it is understood that house price is not a function of just three variables, the mathematical view (statistical learning theory) commonly used when training models defines house price as a random variable. This takes into account the uncertainty from all the unknown factors that contribute to house prices.

The distribution defining this random variable is a function of the 3 input observations. Commonly, the inputs are used to compute the mean, and the shape of the distribution is fixed - a Gaussian, for example - but not necessarily.

Given observations of the 3 inputs, each observed y_houseprice is just a sample from this random variable.
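A stripped-down sketch of that setup (the linear mean, the noise level and the single input are invented purely for illustration): the regressor only ever sees individual samples, and what it recovers is the conditional mean.

  import numpy as np
  from sklearn.neural_network import MLPRegressor

  N = 5000
  x = np.random.uniform(size=N)            # one observed input (think: a normalized house feature)
  y = np.random.normal(3*x + 1, 0.5)       # each observed price is a sample around a mean that depends on x

  model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=3000).fit(x.reshape(-1, 1), y)
  print(model.predict([[0.5]]))            # roughly 2.5: the conditional mean, not "the" price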


Well, a random variable is a function from event space to the real line. We return to a single measure by taking its expectation. We don't model `Y = f(X)`.

This doesn't play well with the universal theorem in the article. NNs can only be said to model expectations of random variables.


>We return to a single measure by taking its expectation

Only if you want to throw away most of the information in your model.


Well, indeed.

One then needs to explain how a NN being a "universal fn approximator" helps at all in this context.

One models RVs generatively, using distributions (and so on); the actual model (eg., of house prices) isn't a function, it's often an infinity of them.


>One then needs to explain how a NN being a "universal fn approximator" helps at all in this context.

Given that I can't tell why you don't think it does, I don't think I can explain it to you. From the other contexts you've talked about here, you seem to be implying that the only thing that is potentially useful is an AGI which either carries out or merely designs experiments. But that's patently absurd.


I'm happy to hear the very narrow case on this. Can a NN learn geometric Brownian motion?


Can you just show the moving stick to the NN?

Are you just denying information to the NN that is available to the animal?


What techniques are used to estimate trajectories? Would something like a Gaussian Process perform better?


Well NNs like GPs are "basically" non-parametric methods, in the sense that one does not start with a known parameterised statistical distribution that comes from domain expertise. These are worst-case techniques when we dont have the option to start with "the right answer", eg., in the case of large datasets where we have no idea how some pixels distribute over cat/dog images.

In the case of a trajectory we would likely already know the answer, in the form of just doing some physics. The role of computational stats here, then, is to start with the known form of the solution and find the specific parameters to fit it.

Since we have physics, we can find the "perfect answer" to the trajectory question with very few data points -- and take as many derivatives as we like.

Brute-force ML is often used when we don't have theories, making it all the more dangerous; and, alas, all the more useful. We can get a 5% improvement in click-through rate without having any theory of human behavioural psychology --- god knows, then, what we are doing to human behaviour when we implement such a system.
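A small sketch of that contrast (the noise level and number of points are arbitrary choices of mine): assume the quadratic form up front, and a handful of noisy measurements pins down the acceleration, with exact derivatives for free.

  import numpy as np

  g = 9.81
  t = np.linspace(0, 2, 8)                                  # only 8 measurements
  x = -0.5 * g * t**2 + 5 * t + np.random.normal(0, 0.05, size=t.shape)

  # physics gives the form x(t) = x0 + v0*t + 0.5*a*t^2, so fit that form directly
  half_a, v0, x0 = np.polyfit(t, x, 2)
  print("estimated acceleration:", 2 * half_a)              # close to -9.81
  # d2x/dt2 of the fitted model is exact: it's just 2*half_a, no finite differences needed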



If you're asking how to solve difficult differential equations in general, we use numerical methods like finite elements or finite differences.


That said, there has been some promising research on using NNs for solving nonlinear PDEs.


This often-cited fact is a red herring. Lots of things can compute (or rather approximate) any function. Piecewise constant functions obviously can approximate anything, but nobody's giddy about using piecewise constant functions for any numerical purpose (if they do, and they often do, they don't point with pride to their new application of the piecewise-constant-functions "universal approximation theorem"). Polynomials, trigonometric polynomials, splines (i.e. piecewise polynomials), radial basis functions, and on and on.

Just put neural networks to a test against splines, and see how they fare. Take your favorite function, let's say sin(x), and try to approximate it with a neural net with 1000 nodes, or with splines with 20 nodes. You don't stand a chance of matching the quality of the spline approximation.

Edit: here's a short python snippet to show how much better a spline with 20 nodes is vs a neural network with 1000 nodes for approximating the sin function

  import numpy as np
  from sklearn.neural_network import MLPRegressor
  from scipy.interpolate import UnivariateSpline

  N = 10000
  X_train = 2*np.pi*np.random.uniform(size=N)
  Y_train = np.sin(X_train)
  sin_NN = MLPRegressor(hidden_layer_sizes= (1000,)).fit(X_train.reshape(N,1), Y_train)

  spline_nodes = np.linspace(0,2*np.pi, 20, endpoint=True)
  sin_spl = UnivariateSpline(spline_nodes, np.sin(spline_nodes), s=0)

  X_test = np.linspace(0,2*np.pi, 5000, endpoint=True)
  Y_test = np.sin(X_test)
  rmse_NN  = np.mean((Y_test-sin_NN.predict(X_test.reshape(-1,1)))**2)
  rmse_spl = np.mean((Y_test-sin_spl(X_test))**2)

  print("RMSE for NN approx: ", rmse_NN)
  print("RMSE for spline approx: ", rmse_spl)

  >> RMSE for NN approx:  0.00011776185865537907
  >> RMSE for spline approx:  9.540536500968638e-10


This whole example is a red herring. This isn’t a splines versus NNs issue at all, you’re talking about the well known fact that choice of basis affects the ability to fit, which has nothing to do with whether you use a network. As a concrete proof, since a spline is (usually) a polynomial function, it can be defined as a linear network with as many layers as the spline’s polynomial order, in other words splines are a strict subset of the functions you can build using neural networks. You can also make a neural network out of spline neurons if you want. And you can cherry pick lots of different functions that work better for splines than other choices, and you can also cherry pick functions that perform worse for splines than other bases. Splines perform far worse on a periodic function of arbitrary domain than a Fourier fit. Your example is contrived because you artificially constrained the range to [0, 2pi].


I'm sorry, but I have to disagree with you here.

The "Universal Approximation Theorem" is not the point of neural networks. People should stop mentioning it, or if they do, they should state at the same time that there's nothing special about NNs, that numerous classes of functions possess the same property.

Here's my own pitch for neural networks: NN's suck. Big time. They suck in low dimensions and they suck in high dimensions. But the curse of dimensionality is so formidable, that everything sucks in high dimension. Neural networks just happen to suck less than all other known methods. And because they suck a bit less, there are applications where they are useful, and they have no substitute.


> Neural networks just happen to suck less than all other known methods.

Or, perhaps, the best demonstrated performance of NNs exceeds the best demonstrated performance of other known methods for many tasks. But ... the amount of compute, investment in tooling, and attention that have been thrown at deep learning in the past decade is at a scale where ... do we actually know that other methods would perform worse with the same resources? Is there some alternate timeline where in 2012 someone figured out how to run MCMC for bayesian non-parametrics over much larger datasets or something, and the whole field of ML just tilted in a different direction?


That's a very good observation. However, deep learning didn't just get the first mover's share. Before DL was popular, Support Vector Machines used to be where all the fun ML research was happening. And just out of nowhere, Random Forests and XGBoost came and took the crown, if only for a fleeting moment. Gaussian Processes always showed promise, but I'm not sure they delivered. Deep Learning just delivered. I guess it's because of the composability. But you are absolutely right that there's no proof and no way of knowing right now if DL is the best there can possibly be.


It doesn’t matter if you disagree (BTW I don’t know what you disagree with specifically, and I did not mention the Universal Approximation Theorem. It seems like you’re making some assumptions.) A polynomial spline is still a subset of a neural network, so if you’re right, all you’re demonstrating is that splines also suck at solving the same problems that neural networks solve. The discrepancy between the two here, again, has nothing to do with networks and everything to do with your contrived example.


> >> RMSE for NN approx: 0.00011776185865537907

Error is approximately 0.0001 because `tol`, the parameter that tells optimization to finish, is 0.0001.

Set tol=0, and then beta_2=1-1e-15, epsilon=1e-30 to maximize stability from the optimizer, and I got RMSE for the neural network to go below 5e-7.

This is all very academic because stochastic gradient descent is a horrific tool to be using for this purpose. You aren't wrong about that.
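For anyone following along with the snippet upthread, the change is roughly this (these are all real scikit-learn MLPRegressor parameters; the raised max_iter is my own addition so the optimizer is actually allowed to run that long):

  sin_NN = MLPRegressor(hidden_layer_sizes=(1000,), tol=0,
                        beta_2=1 - 1e-15, epsilon=1e-30,
                        max_iter=10000).fit(X_train.reshape(N, 1), Y_train)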


Fair enough, I wrote the snippet in 5 min and didn't check the tolerance parameter.

But with your improved choice of parameters, the NN is still about 1000 times worse than the cubic spline, despite having 50 times as many nodes.


I don't think these numbers are meaningful. It's not far off from a degree-2 interpolation already, and I got a hidden layer size of 20 to an error of 7e-5 by just letting it optimise for longer and picking a seed that worked well, which is basically the same error as a degree-1 interpolation that gets 5e-5.

Like, sure, the spline is doing better, but that's not why we care. It's not like there's a general sense in which spline interpolations are going to be better than the optimal fit from a larger neural network; they're just a simpler, faster, more numerically stable way of solving simpler problems. An optimiser designed for 1D interpolation of small neural networks, for all I know, might get extremely accurate results.


Indeed. I cannot upvote this enough. New fanboys of DNNs seem so enamored with the universal approximation property that they cite it at the slightest provocation. There is no dearth of universal approximators; that's not what makes DNNs special. The special thing is how simple training procedures seem to find approximations that generalize well (or don't, as shown by adversarial examples).


Universality shows potential, not optimality. The article covers this.


Since you are nitpicking, you could well use a sinusoidal activation function in the neural network, and reach an even smaller loss value.


Not sure I understand your point. Do you want to use a bunch of sine functions to approximate a sine function? What would that show?

Splines don't know anything about the nature of a function. They approximate any function with piecewise polynomials.

Maybe you are trying to say that the default activation function (relu) in sklearn is not smooth. No problem, you can add

  activation='tanh'
inside the definition of the NN, and check the RMSE. Turns out it's for some reason worse.


I assume they're referring to Fourier expansion.

In general you can use a pretty wide set of functions to approximate an arbitrary function. You can do it with polynomials (Taylor expansion), and many others, as long as they form a basis of a Hilbert space.

Producing a given function from a linear combination of other functions isn't groundbreaking in the least.
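For instance, here's a tiny sketch of that idea with a Fourier basis (the target function and the number of modes are arbitrary choices of mine):

  import numpy as np

  x = np.linspace(0, 2*np.pi, 2000)
  f = np.abs(x - np.pi)                    # some target function on [0, 2*pi]

  # rebuild f as a linear combination of the first few sines and cosines
  approx = np.full_like(x, f.mean())
  for k in range(1, 8):
      approx += 2*np.mean(f*np.cos(k*x))*np.cos(k*x) + 2*np.mean(f*np.sin(k*x))*np.sin(k*x)

  print("max abs error with 7 modes:", np.max(np.abs(f - approx)))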


I see a lot of commenters up in arms about Universal Approximation for NN’s, and I think the issue is that it’s often framed as a superpower rather than table stakes for any kind of general purpose algorithm.

I posit that any modeling technique which does not have the universal approximation property will be guaranteed to fail on large classes of problems no matter how much elbow grease (say feature engineering) one puts into it. That is, UA is a necessary but not sufficient condition for a modeling technique to be fully general (i.e. could form the basis of whatever AGI is).


Really well said. Is there a term or concept in the AI literature++ that captures this point/conjecture?


I’m not sure. My intuition comes from real analysis, where one can think of the “space of all continuous functions” as being kind of geometric like Euclidean space. Then “Universal Approximation” for some Special set of functions (say those coming from neural networks) is equivalent to saying those special functions are “dense” in the space of all continuous functions.
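Concretely (this is the standard definition, nothing specific to neural networks), "dense under the sup norm" just means:

  \forall f \in C(K),\ \forall \varepsilon > 0,\ \exists g \in \mathcal{F} :\ \sup_{x \in K} |f(x) - g(x)| < \varepsilon

where K is a compact domain and F is the special family of functions (for the universality theorem, the functions computed by one-hidden-layer networks).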

My geometric reasoning is that if a special collection is not dense, then it’s probably something like a lower-dimensional subspace, which really means you’re missing most of the ambient space in some sense. It’s all a bit handwavy but I hope that paints a picture.

Getting more in the weeds, there are theorems like https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theo... which suggest that being “dense in the space of continuous functions” is not as hard to achieve as one might guess.


I don't understand all of the criticism here. This is Ch. 4 in a basic intro to neural networks. The author provides a well written, intuitive, and concise demonstration of how sigmoids can be pieced together to approximate functions, in the layman's sense of the word. It helped me build intuition when I came across this maybe 5 (?) years ago. There are other good 'Chapters' about vanishing gradient or backprop.

The criticism here is mostly about things that aren't even the topic of the article: 'it cannot compute non-computable functions', 'it cannot predict tomorrow's stock price', 'splines are more efficient', 'it cannot predict how a stick looks bent in water'.

It's like saying 'I read "Zen and the Art of Motorcycle Maintenance" twice, and still don't know how to adjust spark plugs. Stupid book.'


The criticism is that it is over-representing neural networks' capabilities: saying that they can compute any function is a very big and incorrect claim. IMO it should at least mention that non-computable functions are a thing, and of course a neural network can't have better capabilities than the machine it is running on. This matters especially because there are people out there thinking that machine learning is able to solve any possible problem; if one of those people comes across this proof they'll think: "Neural networks really can solve any conceivable problem! Look, there's even a mathematical proof of this!"


As I wrote, 'function in the layman's sense of the word'. And no one actually reading this can, without malice, misunderstand it: the author uses a nice and smooth y=f(x) as an example, and shows that it can be approximated with sigmoids. Nothing more, nothing less. And he does a good job showing this.


You say no one can misunderstand what the author wrote without malice, but there is a saying that reads "you should not attribute to malice what can be attributed to ignorance". Especially among people outside of CS, there is currently a lot of hype regarding ML and AI in general; if one of those people looks around to check whether a neural network can really do anything and finds this, what would they understand?


You're telling me you were confused by what the author could possibly mean with his nice and smooth y=f(x), and that you misunderstood him as implying that AGI is just around the corner? The whole discussion here is one long list of "Gotcha!", missing the point and the context of the article completely.

You think these non-CS people you evoke will be too dense to understand that the author is just approximating a simple function, and will then leverage their non-CS background and immediately conclude that ANNs can 'solve' Cantor dust, non-computability and the halting problem? That's a very specific background your non-CS people would have to have to fall for that.

[Edit:]

What do you think they'll make of "Multilayer feedforward networks are universal approximators"?

("This paper rigorously establishes that standard multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available. In this sense, multilayer feedforward networks are a class of universal approximators.")


The people I'm referring to just don't know that non-computable functions exist. So they wouldn't draw specific conclusions about problems like the halting problem and similar, because they don't know them; they would just conclude that any problem they might face can be solved by a NN. That's all.


It’s funny, but this is how I felt about my son’s kindergarten. They taught him about “numbers” but instead they only used Natural numbers. The least they could have done is mention at least the one-point compactification of that space.

Fortunately, I was able to intervene before he was permanently damaged by the class. He's still struggling with showing that it is homeomorphic to the subspace of R formed by {1/n for all n in N} union {0}, and I blame today's pedagogy for that.


The context is different; the problem here is the hype that goes in a very specific direction (i.e. you can do anything with ML), as I briefly explained in another comment.


I feel "Numbers" are pretty hyped-up!


Going back to your original comment, if I say "let's talk about linear functions" and then all I say also applies to a more general context, eg. polynomials, then it is fine. But if I say "let's talk about polynomials" and then all I say only applies to linear functions then it is wrong. Your sarcastic comment falls under the first case, the article discussed here falls under the second.


I wish you had read my original comment, where (let me repeat it a second time for you) I state that the author writes about how to approximate a 'function in the layman's sense of the word'. And that's what he does. Nowhere does he bring hype into it; no one reading about how to approximate a simple y=f(x) will misread this. My guess is that you read the website like you read my comment, if at all, otherwise we wouldn't be having this ridiculous conversation. renewiltord's comment is spot on and much better phrased than I could have done it.


I got that you say the author is talking about a 'function in the layman's sense of the word'; what I'm saying is that this might not be clear from reading that webpage without some prior knowledge of the cases where this will fail.


I think these sorts of arguments are not great because they confuse "limiting behavior" with "behavior at the limit". Yes if you are able to construct an infinite-sized MLP it can exactly replicate a given function, and you can construct a sequence of MLPs that in some sense converge to this infinite behavior. But in other measures the approximation might be infinitely bad and never get better unless the net is truly infinite.

For an example, consider approximating the identity function [f(x) = x] with a sigmoid-activation MLP. For any finite size of the net, the output will have a minimum and maximum value. One can change the parameters of the net to increase the range of the output, but at no point is the output range infinite. So even though you can construct a sequence of MLPs that in the limit in some sense converges to the identity function, in some sense it never does.

The same kind of thinking that leads to the conclusion "neural nets are universal approximators" would support the existence of perpetual motion machines; check out the "ellipsoid paradox" for more info.
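
A tiny numpy illustration of the bounded-output point (the weights below are random; only the bound matters):

  # A one-hidden-layer sigmoid net's output is a finite weighted sum of
  # values in (0, 1), so |net(x)| <= sum(|W2|) + |b2| for every x; it can
  # never track f(x) = x outside a bounded range. Weights are arbitrary.
  import numpy as np

  rng = np.random.default_rng(0)
  W1, b1 = rng.normal(size=(100, 1)), rng.normal(size=100)  # hidden layer
  W2, b2 = rng.normal(size=100), rng.normal()               # output layer

  def net(x):
      a = (W1 * x).ravel() + b1
      h = 1.0 / (1.0 + np.exp(-np.clip(a, -500, 500)))      # sigmoid units
      return W2 @ h + b2

  bound = np.sum(np.abs(W2)) + np.abs(b2)
  print(bound, net(1e9), net(-1e9))                         # both within the bound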


I want to see it compute the Ackermann function.


I want to see it compute the Busy Beaver function.


> Ackermann function

Perhaps quantum neurons?


With the top comments (now) all talking about how this universal approximation theorem doesn't really have much impact in the real world, I wonder: is this interesting outside of theory? Has it motivated any techniques that have created (or may create) real-world, empirical results? Could it even?


Plenty of problems in the area of computation have little impact on the real world, but they contribute to our fundamental understanding of ‘understanding’. The Entscheidungsproblem is probably the most significant of these.

https://en.wikipedia.org/wiki/Entscheidungsproblem


Given that any stock price is also a function of time, there should theoretically exist a neural net that can approximate the stock price for the future? This reasoning is obviously wrong, but what is my exact error of thought?


Your reasoning is not wrong, there is a neural net that approximates future stock price. The problem is we don’t know which one :)


Sure we do. Partition your neurons into a few billion independent networks, embed them on a spinning globe of mostly molten rock, and put each one in a leaky bag of mostly water.

Wait long enough, and one leaky bag will emit the number "42". If you get the initial conditions just right (and quantum nondeterminism isn't really a thing), then you'll also get a good approximation to the stock market.


So it's somewhat akin to the Library of Babel but instead of the set of all possible books, it's the set of all possible functions :p


Those are equivalent as long as you allow for infinite length books.


No need for infinite-length books to encode all possible functions, as there are notations to express infinity in finite space (e.g. programs, encoding an infinity of behaviour in a finite number of instructions).

The Library of Babel contains every possible finite description of every possible function.


And knowing which one it is will probably influence which one it should be, in a halting-problem-esque way.


besides karelp's sister comment, there's also the "obvious" fact that stock price is not a function of time alone: it's not P(t), it's a function of time and of the entire universe, which also evolves through time, more like P(t, U(t, ....)). ...you can simplify things by assuming the laws of physics are deterministic and that you only need one instance of the state of the universe, U, so you'd have P(t, U)

...now if you don't explicitly represent U as a parameter, you'll have it implicit in the function. So your "neural network" contains the entire state of the freakin universe (!!).

Ergo, contingent on your stance on theological immanence vs. transcendence, what you'd call a "neural network approximation of the stock's price function" is probably quite close to what others call... God (!).

(Now, if relativity as we know it is right, you might get away with a "smaller slice of U": look up "light cone". And to phrase this in karelp's explanation context: you'd need to know U to know which of the practically infinitely many such neural networks to pick. The core of (artificial) intelligence is not neural networks in themselves, it's learning; the NN is a quite boring computational structure, but you can implement tractable learning strategies for it, both in code and in living cells, as evolution has shown...)


And you'd have to know the state of U to infinite precision. Which makes me wonder whether neural nets have any hope with a simple chaotic function. Maybe they do but just in the short term, like predicting the weather.


> what is my exact error of thought though?

There isn't one, you're just overestimating the value of existence-propositions.

In practice, knowing that something exists is not a very useful result - it is often more useful to know that something does NOT exist (e.g. solution to the halting problem).


Theoretically, there exists a model that predicts all future stock prices EXACTLY at any given time in the future, as long as the results are completely isolated from all market participants (i.e. the knowledge of the result is COMPLETELY isolated from the market). Here is how you can theoretically prove it exists:

Train a model today so that it overfits for a given stock. It would predict everything very accurately up to today. The ONLY way to make sure that the results are completely isolated from the market is to not make the result available to ANYONE (how do you know that an isolated human observer is not leaking data through some unknown phenomenon, say quantum entanglement with particles in other people's brains, for example?). So, the ONLY way to test the models is back-testing.

You can extend that to say that for any given point in the future (call it the reference point), there will be an overtrained model which backtests perfectly, i.e. a theoretical model that works at any time in the past to predict the exact stock price up to the reference point.



It is not wrong.

It is just that the neural network would have to compute a model of the entire world or even the universe on an atomic scale. It would be computationally unfeasible but not theoretically impossible.

It is theoretically possible that the universe we live in is already being computed on a neural network in some other, external universe.


Contrary to the recent top comment on this (which, I guess, fails to show that such a net existing could be no coincidence), the answer to your problem might be deeply physical and information-theoretic as soon as you speak of time. Simply speaking, any model is good enough if the approximation is tolerably accurate. In that sense, crude nets, as well as expert systems that trigger off clear signals and ample evidence, may already exist.

In particular, given the way stock markets are distributed, the function of time is likely relativistic, and every participant is acting under incomplete information even in the infinite limit.

Also, you have to be cautious about what "any function" really means in this context; I imagine it means differentiable functions (after somebody mentioned the Ackermann function, which is not differentiable anywhere).


The error is that stock price is not a function over time, but instantaneous demand which we record over time. That demand is a function over an undefined number of variables.


We can find a function to express past stock prices.

There isn't one for the future, unless said future is somehow predetermined?

Is it, given enough input data?

Does this discussion then distill down to philosophy?

Do living beings have agency, or are they simply very complex NNs?

How one answers that speaks to consciousness as much as it does the prospect of a predictive stock price model.


With the same type of reasoning, we could plot whatever output our brain gives, and there will be some type of neural network that can predict what we'll think/see/feel in the future. What you said and the thing I just said were both ideas I had when I started learning how AI works; sadly it's something we still can't reach, at least for now.


If I understand you correctly, we are very far away from that example as that is AGI and then some. You will not see that in your lifetime so ‘we can still’ seems an interesting (overly optimistic?) take on it.


AGI, since it lacks a technical/mathematical definition, can be anything. It’s mere philosophy at this point, actually even vaguer than most philosophical problems.


I meant it to mean, indeed in a vague way, what we call human intelligence or beyond; the parent says to make a neural network that can predict what someone will think/feel in the future, which seems the same as, or at least indistinguishable from, the subject's human intelligence, as it will result in the same outputs. So to create the network implied by the parent, we would have to a) be able to make networks of that (unknown) complexity and b) 'copy' the current 'state' of the subject's brain into it, or rather make it learn that state from the outputs. That is incredibly far removed from anything cutting edge we know how to do. If it is at all possible.

So I was just surprised by their use of language as it seems to imply parent thought we would be closer to or there already with our developments of AI tech.


Although I share your sentiment in general, I would presume that @tluyben's take is fairly true to the broader philosophical view. The critique that this view is at least as weak as the views on intelligence per se is a drop in the ocean, really.

Implying, there is a wealth of thought devoted to intelligence! That fact actually proves the conjecture by itself, in a nicely constructive way, that we are thoughtful indeed, if only you believe this somewhat axiomatically. The quintessential theorem was distilled by Descartes, of course, wherefore he is remembered.


Your input features are incomplete and mixed with noise.

The value of a die is also a function of time. Can we learn this function with a neural net? No, because our features don't include the nitty-gritty details of each throw, so it's essentially random.


It would be the same as predicting the future, which is not possible using past performance


Stock prices are not well modelled as continuous functions: the prices you see are generally trades (a discrete function), and the price may or may not have been available to you at the volume you wanted (there are more variables than time).


Your mistake is that you left off "given the right inputs". There are a lot of inputs to stock prices that are unlikely to be readily available to your function.


Doesn't seem wrong to me; the tricky part is to find this network and convince yourself that it indeed predicts correctly over the period of interest.


A series of sine curves can compute any function. A Turing machine can compute any function. Lambda calculus can compute any function. Etc. etc.


I think the author might be overlooking just how weird something can be while still being a function.

Let f(x):R->{1,0} be such that f(x)=1 if x is rational and 0 otherwise.


"Visual proof" pretty much gave away the fact that it was going to be a non-rigorous reasoning process based on drawing an arbitary plot, once you draw a plot you're already assuming a lot about its underlying function. The title isn't very misleading.


He says continuous function


He also says in the first sentence "One of the most striking facts about neural networks is that they can compute any function at all"; the later caveat is incompatible with that, since most functions are not continuous.


Ok so what about the Cantor function? Can it learn that?


That’s a pretty useless “proof” which gives no insight into how neural networks learn in practice. The author takes a sigmoid neuron, tweaks its parameters to the extreme so that it looks like a step function, then concludes that since a linear combination of step functions can approximate anything, so can neural networks. Bravo.
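
For concreteness, the tweak being described is easy to see numerically: a sigmoid with a very large weight really does behave like a step function (the numbers below are arbitrary illustration values):

  # sigmoid(w*x + b) with huge w approximates a step at x = -b/w.
  # w, b and the sample points are arbitrary choices.
  import numpy as np

  sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
  w, b = 1e5, -4e4                          # step located at -b/w = 0.4
  x = np.array([0.395, 0.399, 0.401, 0.405])
  print(sigmoid(w * x + b))                 # approximately [0, 0, 1, 1]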


Did you ever consider that explaining how deep nets "learn in practice" is not the point?


If it's not about learning this result is useless. A lookup table can compute any function, but who cares?


> who cares

The people who need to build some (sophisticated kind of) lookup table for a function of unavailable details through automation.


As someone who has taught this to CS students, just scrolling through I have to say it looks like this has about 5x more text than it should have. This is a homework problem for second-year students (literally, where I used to teach) that should take them no more than a page to answer.


>"No matter what the function, there is guaranteed to be a neural network so that for every possible input, x, the value f(x) (or some close approximation) is output from the network"

With "or some close approximation" being a key I fail to understand why is it not obvious.


What about a function like f(x) = sin(x)? I feel like when x gets big the error will start to increase.


This argument only applies to functions on a compact domain. So we should only consider trying to approximate sin(x) when x is in [0, 1], for example.


You are right. The argument is that it can approximate sin(x) over a compact interval, like [0,1].

Your answer to that might be "I could approximate over [0,1] with just a lookup table, where I split the input range into n equal-sized pieces for increasing values of n", and you'd be right: the "proof" is basically just doing that.

It's one of those things which is nice to show as a basic theory thing (there are approximation methods which can never simulate certain functions), but it's not really of any real value.
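
That lookup-table version is only a few lines; the bucket counts and the target sin are arbitrary choices for illustration:

  # Piecewise-constant "lookup table" approximation of f on [0, 1]:
  # n buckets, return f at the bucket midpoint. Error shrinks as n grows.
  import numpy as np

  def lookup_approx(f, n):
      mids = (np.arange(n) + 0.5) / n
      table = f(mids)                       # the precomputed table
      return lambda x: table[np.minimum((x * n).astype(int), n - 1)]

  f = np.sin
  x = np.linspace(0, 1, 10001)
  for n in (10, 100, 1000):
      g = lookup_approx(f, n)
      print(n, np.max(np.abs(f(x) - g(x))))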


> a more precise statement of the universality theorem is that neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision.

That is a very different claim than being able to compute any function.


I was about to feed P = NP into GPT-3, sigh


This "proof" is pretty sketchy. I think he means every differentiable function? Because I fail to see how you can make a neural network evaluate the indicator function of the rationals


"The second caveat is that the class of functions which can be approximated in the way described are the continuous functions. If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't in general be possible to approximate using a neural net. This is not surprising, since our neural networks compute continuous functions of their input. However, even if the function we'd really like to compute is discontinuous, it's often the case that a continuous approximation is good enough. If that's so, then we can use a neural network. In practice, this is not usually an important limitation."


I guess I didn't see that. But he's still using continuous where the correct term would be differentiable


You can uniformly approximate any continuous function on a compact domain with a differentiable/smooth function.


Continuous functions aren't necessarily differentiable.


yes, that's my point


This is a long written argument with some animations, not a visual proof.


This is called out deep into the article, but shouldn't it be "neural nets can approximate any function"?

Also, how does this relate to the traditional notion of computability of functions?


So can most linear combinations of functions raised to the nth power.


any “well behaved” function.


A number of assumptions seem to be missing from this article. Since the author is using the sigmoid function, which is smooth, this argument actually only applies to approximating smooth functions. That is, you don't just need f to be continuous, you need all of its derivatives to exist and all be continuous. Also, since we are only able to have finitely many neurons, we need to be able to approximate f using step functions with finitely many pieces. So this argument can only be used if f is constant outside of a compact region.


Why does the function you are approximating need to be smooth? From the paper cited in the article, all you need is for f to be continuous on a compact subset of R^n.


At which point, a Fourier transform is a hell of a lot cheaper ;)


The author is taking non-linear neurons. This is, IMO, cheating.


It's not cheating considering it's not even possible otherwise.


My proof that you can compute any function with a single neuron:

1. Use the function as the activation function.


This is obviously true for non-linear neurons. Who needs a proof for that?


Why do you think it's obvious? Can you spell it out?

I hope it's more than "presumably, non-linear neurons can approximate any non-linear function since they both have non-linear in the name".


All functions can be approximated by the Fourier transform. Look up the equation to see why it applies here (hint: it is an integral of f(t)e^(-iωt)).


This is about as far from a proof as we are from the Andromeda Galaxy.



The really simple proof I'd use is:

1. A function can be approximately implemented as a lookup table.

2. It's trivial to make a neural network act like a lookup table.

Which seems to resemble the article but it's much simpler.

Point 2 assumes the neurons are normal non-linear ones. I'm not saying that's cheating, but I do agree with it being pretty obvious, at least from the right angle.


If you fleshed out the "trivial" point 2 as a proof, I think the result would be essentially the same as the article.

The only way you can make it substantially simpler is if you use a neuron whose nonlinearity makes it essentially a restatement of another result. For example, if the neurons are basically just Haar wavelets.


"Divide x into a bunch of buckets.

Make two neurons tied to Input that activate very sharply at the bottom and top edge of each bucket.

Use those to make a neuron that activates when Input is in the bucket.

Weight it so it adds f(x) to Output."

That's over 100 times shorter than the article. The method isn't as elegant since it needs two internal layers but I think it's pretty clear.

Is it wrong to say that the logical leap from "a neuron can go from 0 to 1 at a specific input value" to "neurons can make a lookup table" is trivial? Oh well.

(With "go from 0 to 1 at a specific input value" being the nonlinear part.)


I encourage you to search for "lookup table" in the article.


Why? Yes, it says that.

It's also 6000 words long.

I'm saying it's not that hard.

I'm not saying the article is wrong or anything, I'm saying you can get to the same result MUCH faster.

"You can turn a neural net into a lookup table" should be easily understood by anyone that knows both of those concepts.

Edit: Like, isn't triggering specific outputs on specific input conditions the first thing that's usually shown about neural nets? If not a full lookup table, that's at least 90% of one and you just need to combine the outputs.


If you take linear neurons, then the whole network is just some linear (or affine) function of its inputs, and hence universal approximation fails (not every continuous function can be uniformly approximated by linear functions).
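
Quick sanity check of that, if anyone wants it: stacking affine layers with no non-linearity collapses to a single affine map (the sizes below are arbitrary):

  # Two stacked affine layers equal one affine layer with composed parameters.
  import numpy as np

  rng = np.random.default_rng(0)
  W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
  W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

  x = rng.normal(size=3)
  two_layers = W2 @ (W1 @ x + b1) + b2
  collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
  print(np.allclose(two_layers, collapsed))  # True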


Why? Non-linear activation functions are common and easy?


Did you know a big enough hash-table can compute any function?


Yeah, all you need to do is show you can build a NAND gate with a neural network, and every other logical circuit follows from that. I remember doing that project in a machine learning class in 2010.
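
That one is small enough to write out. A standard version uses a single threshold neuron with weights (-2, -2) and bias 3; the exact numbers are just one working choice, not necessarily what that class used:

  # A single threshold neuron computing NAND on {0,1} inputs. Since NAND is
  # functionally complete, any Boolean circuit can be built from copies of it.
  def nand_neuron(a, b):
      return 1 if (-2 * a - 2 * b + 3) > 0 else 0

  for a in (0, 1):
      for b in (0, 1):
          print(a, b, nand_neuron(a, b))    # prints 1, 1, 1, 0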


“Any function” is a big, wide claim. Can someone fill me in on what’s required of these functions? Can a neural net, for example, compute non-continuous functions like f(x) = [x is rational]?


The article says:

> The second caveat is that the class of functions which can be approximated in the way described are the continuous functions. If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't in general be possible to approximate using a neural net. This is not surprising, since our neural networks compute continuous functions of their input. However, even if the function we'd really like to compute is discontinuous, it's often the case that a continuous approximation is good enough. If that's so, then we can use a neural network. In practice, this is not usually an important limitation.


It can't learn exponents; it can only multiply.

There's also a difference between memorizing part of a line and being able to extrapolate it.


They can with the right activation function. I'm thinking inverse log.


I find this very interesting, can you expand or provide links?


Weights that connect neurons can only be multiplied. Now I am looking at activation functions like `exponent`. I just doubt a neural network regression would be able to predict really high values accurately for something like E = mc^2.


thank you.


Really? Can a neural net compute the function halts(N, I): the neural net N will halt on input I, for any neural net N and input I?


The article says:

> The second caveat is that the class of functions which can be approximated in the way described are the continuous functions. If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't in general be possible to approximate using a neural net. This is not surprising, since our neural networks compute continuous functions of their input. However, even if the function we'd really like to compute is discontinuous, it's often the case that a continuous approximation is good enough. If that's so, then we can use a neural network. In practice, this is not usually an important limitation.


Eh, more a proof with visuals, not a visual proof.


s/neural net/turing machine/gi


s/function/computable function



