
The "squashing function" necessarily is nonlinear in multilayer nueral networks. A single layer of a neural network can be quite simply written a weight matrix, times an input vector, equalling an output vector, like so

Ax = y

Adding another layer is just multiplying a different set of weights by the output of the first, so

B(Ax) = y

If you remember your linear algebra course, you might see the problem: that can be simplified

(BA)x = y

Cx = y

Completely indistinguishable from a single layer, thus only capable of modeling linear relationships.

To prevent this collapse, a nonlinear function must be introduced between the layers.
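
A quick numpy sketch of that collapse (toy sizes, purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 3))       # first layer's weight matrix
    B = rng.normal(size=(2, 4))       # second layer's weight matrix
    x = rng.normal(size=3)            # input vector

    two_layers = B @ (A @ x)          # the "deep" network
    collapsed = (B @ A) @ x           # the single matrix C = BA

    print(np.allclose(two_layers, collapsed))   # True: same model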



Right. All the squashing is doing is keeping the output of any neuron in a range of below 1.

But the entire NN itself (Perceptron ones, which most LLMs are) is still completely using nothing but linearity to store all the knowledge from the training process. All the weights are just an 'm' in the basic line equation 'y=m*x+b'. The entire training process does nothing but adjust a bunch of slopes of a bunch of lines. It's totally linear. No non-linearity at all.


The non linearities are fundamental. Without them, any arbitrarily deep NN is equivalent to a shallow NN (easily computable, as GP was saying), and we know those can't even solve the XOR problem.

> nothing but linearity

No, if you have non linearities, the NN itself is not linear. The non linearities are not there primarily to keep the outputs in a given range, though that's important, too.


Nonlinearity somewhere is fundamental, but it doesn't need to be between each layer. You can, for instance, project each input to a higher dimensional space with a nonlinearity, and the problem becomes linearly separable with high probability (cf Cover's Theorem).

So, for XOR, (x, y) -> (x, y, xy), and it becomes trivial for a linear NN to solve.
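
A tiny numpy illustration of that lift (the separator weights below are hand-picked for the sketch, nothing learned):

    import numpy as np

    # XOR truth table, lifted to (x, y, xy): one linear threshold now separates it.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    lifted = np.column_stack([X, X[:, 0] * X[:, 1]])

    w, b = np.array([1.0, 1.0, -2.0]), -0.5      # hand-picked separator
    print((lifted @ w + b > 0).astype(int))      # [0 1 1 0] == XOR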

Architectures like Mamba have a linear recurrent state space system as their core, so even though you need a nonlinearity somewhere, it doesn't need to be pervasive. And linear recurrent networks are surprisingly powerful (https://arxiv.org/abs/2303.06349, https://arxiv.org/abs/1802.03308).


> The non linearities are not there primarily to keep the outputs in a given range

Precisely what the `Activation Function` does is to squash an output into a range (normally below one, like tanh). That's the only non-linearity I'm aware of. What other non-linearities are there?

All the training does is adjust linear weights tho, like I said. All the training is doing is adjusting the slopes of lines.


> That's the only non-linearity I'm aware of.

"only" is doing a lot work here because that non-linearity is enough to vastly expand the landscape of functions that an NN can approximate. If the NN was linear, you could greatly simplify the computational needs of the whole thing (as was implied by another commenter above) but you'd also not get a GPT out of it.


All the trainable parameters are just slopes of lines tho. Training NNs doesn't involve adjusting any inputs to non-linear functions. The tanh smashing function just makes sure nothing can blow up into large numbers and all outputs are in a range of less than 1. There's no "magic" or "knowledge" in the tanh smashing. All the magic is 100% in the weights. They're all linear. The amazing thing is that all weights are linear slopes of lines.


Simply squashing the output of a linear signal would be multiplying by a small value. To avoid large y, you add a step y' = y/1000.

That would still be linear. And the result would be that despite squashing, no matter how many layers a model had, it could only fit linear problems. Which can always be fit with a single layer, i.e. single matrix.

So nobody does that.

The nonlinearity doesn't just squash some inputs. It creates a new, rich feature: decision making. That's because y gets converted very differently on one side of a threshold than on the other. E.g., if y > 0, y' = y, otherwise y' = 0.

Now you have a discontinuity in behavior, you have a decision.

Multiple layers making decisions can do far more than a linear layer. They can fit any continuous function (or any function with a finite number of discontinuities) arbitrarily well.

Non-linearities add a fundamental new feature. You can think of that feature as the ability to make decisions around the non-linear function's decision points.

---

If you need to prove this to yourself with a simple example, try to create an XOR gate with this function:

    y = w1 * x1 + w2 * x2 + b.
Where you can pick w1, w2 and b.

You are welcome to linearly squash the output, i.e. y' = y * w3, for whatever small w3 you like. It won't help.

Layers with non-linear transformations are layers of decision makers.

Layers of linear transforms are just unnecessarily long ways of writing a single linear transform. Even with linear "squashing".
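
If it helps, here's a minimal numpy sketch (hand-picked weights, purely for illustration) of two ReLU "decision makers" computing the XOR that no choice of w1, w2, b in the linear form above can produce:

    import numpy as np

    relu = lambda z: np.maximum(z, 0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])    # XOR inputs

    # Hidden layer: two ReLU units that both sum the inputs but fire at different thresholds.
    W1 = np.array([[1, 1], [1, 1]])
    b1 = np.array([0, -1])
    # Output layer: a plain linear combination of those two decisions.
    w2 = np.array([1, -2])

    print(relu(X @ W1.T + b1) @ w2)    # [0 1 1 0] -> XOR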


Right, it's obvious that the ReLU is just a gating mechanism, and you can think of that as a decision maker. It's like a "pass thru linearly proportionally" or "block" function.

But I still find it counter-intuitive that it's not common practice in standard LLM NNs to have a trainable parameter that in some way directly "tunes" whatever Activation Function is being applied on EACH output.

For example I almost started experimenting with trigonometric activation functions in a custom NN where the phase angle would be adjusted, inspired by Fourier Series. I can envision a type of NN where every model "weight" is actually a frequency component, because Fourier Series can represent any arbitrary function in this way. There has of course already been similar research done by others along these lines.
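
Roughly what I had in mind, as a sketch (assuming PyTorch; `SinActivation` is just an illustrative name, not an established layer):

    import torch
    import torch.nn as nn

    class SinActivation(nn.Module):
        """Sinusoidal activation with a trainable phase (and frequency) per feature."""
        def __init__(self, features):
            super().__init__()
            self.phase = nn.Parameter(torch.zeros(features))
            self.freq = nn.Parameter(torch.ones(features))

        def forward(self, x):
            return torch.sin(self.freq * x + self.phase)

    # The phase/frequency get gradients and are updated exactly like any weight.
    layer = nn.Sequential(nn.Linear(8, 8), SinActivation(8))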


> The tanh smashing function just makes sure nothing can blow up into large numbers and all outputs are in a range of less than 1.

That's not the main point even though it probably helps. As OkayPhysicist said above, without a nonlinearity, you could collapse all the weight matrices into a single matrix. If you have 2 layers (same size, for simplicity) described by weight matrices A and B, you could multiply them and get C, which you could use for inference.

Now, you can do this same trick not only with 2 layers but 100 million, all collapsing into a single matrix after multiplication. If the nonlinearities weren't there, the effective information content of the whole NN would collapse into that of a single-layer NN.
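
A toy numpy demo of that collapse, and of how a tanh between layers prevents it (sizes and layer count are illustrative):

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(1)
    layers = [rng.normal(size=(8, 8)) for _ in range(10)]   # ten linear "layers"
    x = rng.normal(size=8)

    deep = x
    for W in layers:
        deep = W @ deep                          # run the whole linear stack

    C = reduce(lambda acc, W: W @ acc, layers, np.eye(8))
    print(np.allclose(deep, C @ x))              # True: ten layers were one matrix all along

    deep_nl = x
    for W in layers:
        deep_nl = np.tanh(W @ deep_nl)           # tanh between layers
    print(np.allclose(np.tanh(C @ x), deep_nl))  # False: collapsing first then one tanh != the real network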


You can explain the "effect" of tanh at any level of abstraction you like, up to and including describing things that happen in Semantic Space itself, but my description of what tanh is doing is 100% accurate in the context I used it. All it's doing is squashing a number down to below one. My understanding of how the Perceptron works is fully correct, and isn't missing any details. I've implemented many of them.


Your description of tanh isn't even correct: it squashes a real number to `(-1, 1)`, not "less than one".

You're curious about whether there is gain in parameterising activation functions and learning them instead, or rather, why it's not used much in practice. That's an interesting and curious academic question, and it seems like you're already experimenting with trying out your own kinds of activation functions. However, people in this thread (including myself) wanted to clarify some perceived misunderstandings you had about nonlinearities and "why" they are used in DNNs. Or how "squashing functions" is a misnomer because `g(x) = x/1000` doesn't introduce any nonlinearities. Yet you continue to fixate and double down on your knowledge of "what" a tanh is, and even that is incorrect.


When discussing `tanh squashing` among other AI experts it's generally assumed that even the most pedantic and uncharitable parsing of words won't be able to misinterpret "smashing to less than one" as an incorrect sentence fragment, because the "one", in that context, obviously refers to distance from zero.


With a ReLU activation function, rather than a simple linear function of the inputs, you get a piecewise linear approximation of a nonlinear function.

ReLU enables this by being nonlinear in a simple way, specifically by outputting zero for negative inputs, so each linear unit can then limit its contribution to a portion of the output curve.

(This is a lot easier to see on a whiteboard!)
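
A rough numpy stand-in for the whiteboard picture (the units are hand-picked, purely illustrative):

    import numpy as np

    relu = lambda z: np.maximum(z, 0)
    x = np.linspace(0, 3, 7)

    # Three ReLU units, each switching on at its own kink (0, 1, 2) and adding slope;
    # their sum is a piecewise-linear interpolant of the smooth curve y = x^2.
    approx = 1 * relu(x) + 2 * relu(x - 1) + 2 * relu(x - 2)
    print(approx)    # 0, 0.5, 1, 2.5, 4, 6.5, 9
    print(x ** 2)    # 0, 0.25, 1, 2.25, 4, 6.25, 9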


ReLU technically has a non-linearity at zero, but in some sense it's still even MORE linear than tanh or sigmoid, so it just demonstrates even better than tanh-type squashing that all this LLM stuff is being done ultimately with straight line math. All a ReLU function does is choose which line to use, a sloped one or a zero one.


Well. The word “linear” the way you use it doesn’t seem to have any particular meaning, certainly not the standard mathematical meaning, so I’m not sure we can make further progress on this explanation.

I’ll just reiterate that the single “technical” (whatever that means) nonlinearity in ReLU is exactly what lets a layer approximate any continuous[*] function.

[*] May have forgotten some more adjectives here needed for full precision.


If you're confused just show a tanh graph and a ReLU graph to a 7 year old child and ask which one is linear. They'll all get it right. So you're not confused in the slightest bit about anything I've said. There's nothing even slightly confusing about saying a ReLU is made of two lines.


Well, 7-year-olds don’t know a lot of math, typically, so I wouldn’t ask one that question. “Linear” has a very precise mathematical definition, which is not “made of some straight lines”, that when used properly enables entire fields of endeavor.

It would be less confusing if you chose a different word, or at least defined the ones you’re using. In fact, if you tried to precisely express what you mean by saying something is “more linear”, that might be a really interesting exploration.


It's perfectly legitimate to discuss the linear aspects of piecewise linear functions. I've heard Andrej Karpathy do it in precisely the same way I did on this thread, talking about ReLU.

We just have a lot of very pedantic types on HN who intentionally misinterpret other people's posts in order to have something to "disprove".


I.e. ReLU is _piecewise_ linear. The discontinuity that separates the 2 pieces is precisely what makes it nonlinear, which is what enables the actual universal approximation.


Which is what I said two replies ago.

Followed by "in some sense it's [ReLU] still even MORE linear than tanh or sigmoid functions are". There's no way you misunderstood that sentence, or took it as my "definition" of linearity...so I guess you just wanted to reaffirm I was correct, again, so thanks.


> squash an output into a range

This isn't the primary purpose of the activation function, and in fact it's not even necessary. For example see ReLU (probably the most common activation function), leaky ReLU, or for a sillier example: https://youtu.be/Ae9EKCyI1xU?si=KgjhMrOsFEVo2yCe


You can change the subject by bringing up as many different NN architectures, Activation Functions, etc. as you want. I'm telling you the basic NN Perceptron design (what everyone means when they refer to Perceptrons in general) has something like a `tanh`, and not only is its PRIMARY function to squash a number, that's its ONLY function.


You need a non-linear activation function for the universal approximation theorem to hold. Otherwise, as others have said the model just collapses to a single layer.

Technically the output is still what a statistician would call “linear in the parameters”, but due to the universal approximation theorem it can approximate any non-linear function.

https://stats.stackexchange.com/questions/275358/why-is-incr...


As you can see in what I just posted about an inch below this, my point is that the process of training a NN does not involve adjusting any parameters of any non-linear functions. What goes into an activation function is a pure sum of linear multiplications and an add, but there's no "tunable" parameter (i.e. adjusted during training) that's fed into the activation function.


Learnable parameters on activations do exist; look up parametric activation functions.
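
For example, a PReLU learns the negative-side slope as a parameter. A minimal sketch, assuming PyTorch (`HandRolledPReLU` is just an illustrative name; `nn.PReLU` is the built-in equivalent):

    import torch
    import torch.nn as nn

    # PReLU: a leaky ReLU whose negative-side slope is itself a trained parameter.
    class HandRolledPReLU(nn.Module):
        def __init__(self):
            super().__init__()
            self.a = nn.Parameter(torch.tensor(0.25))    # learned alongside the weights

        def forward(self, x):
            return torch.where(x > 0, x, self.a * x)

    net = nn.Sequential(nn.Linear(16, 16), HandRolledPReLU(), nn.Linear(16, 1))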


Of course they do exist. A parameterized activation function is the most obvious thing to try in NN design, and has certainly been invented/studied by 1000s of researchers.


How was that person derailing the convo? Nothing says an activation function has to "squash" a number to be in some range. Leaky ReLUs for instance do `f(x) = x if x > 0 else ax` (for some coefficient `a != 0`), that doesn't squash `x` to be in any range (unless you want to be peculiar about your precise definition of what it means to squash a number). The function takes a real in `[-inf, inf]` and produces a number in `[-inf, inf]`.

> Sure there's a squashing function on the output to keep it in a range from 0 to 1 but that's done BECAUSE we're just adding up stuff.

It's not because you're "adding up stuff"; there is a specific mathematical or statistical reason why it is used. For neural networks it's there to stop your multi-layer network collapsing to a single layer one (i.e. a linear algebra reason). You can choose whatever function you want; for hidden layers tanh generally isn't used anymore, it's usually some variant of a ReLU. In fact Leaky ReLUs are very commonly used, so OP isn't changing the subject.

If you define a "perceptron" (`g(Wx+b)` where `W` is a `Px1` matrix) and train it as a logistic regression model, then you want `g` to be sigmoid. Its purpose is to ensure that the output can be interpreted as a probability (given that you use the correct statistical loss), which means squashing the number. The inverse isn't true: if I take random numbers from the internet and squash them to `[0,1]`, I don't get to call them probabilities.
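
A quick numpy contrast of the two, with toy values:

    import numpy as np

    leaky_relu = lambda x, a=0.01: np.where(x > 0, x, a * x)
    sigmoid = lambda x: 1 / (1 + np.exp(-x))

    z = np.array([-20.0, -1.0, 0.0, 1.0, 20.0])
    print(leaky_relu(z))   # [-0.2, -0.01, 0, 1, 20]  -- unbounded, nothing squashed
    print(sigmoid(z))      # everything in (0, 1), interpretable as a probability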

> and not only is its PRIMARY function to squash a number, that's its ONLY function.

Squashing the number isn't the reason, it's the side effect. And even then, I just said that not all activation functions squash numbers.

> All the training does is adjust linear weights tho, like I said.

Not sure what your point is. What is a "linear weight"?

We call layers of the form `g(Wx+b)` "linear" layers but that's an abused term, if g() is non-linear then the output is not linear. Who cares if the inner term `Wx + b` is linear? With enough of these layers you can approximate fairly complicated functions. If you're arguing as to whether there is a better fundamental building block then that is another discussion.


> What is a "linear weight"?

In the context of discussing linearity vs. non-linearity, adding the word "linear" in front of "weight" is clearer, which is what my top-level post on this thread was all about too.

It's astounding to me (and everyone else who's being honest) that LLMs can accomplish what they do when linear "factors" (i.e. weights) are all that needs to be adjusted during training to achieve genuine reasoning. During training we're not [normally] adjusting any parameters or weights on any non-linear functions. I include the caveat "normally", because I'm speaking of the basic Perceptron NN using a squashing-type activation function.


> It's astounding to me (and everyone else who's being honest) that LLMs can accomplish what they do when linear "factors" (i.e. weights) are all that needs to be adjusted during training to achieve genuine reasoning.

When such basic perceptrons are scaled enormously, it becomes less surprising that they can achieve some level of 'genuine reasoning' (e.g., accurate next-word prediction), since the goal with such networks at the end of the day is just function approximation. What is more surprising to me is how we found ways to train such models, i.e., advances in hardware accelerators combined with massive data, which are factors just as significant in my opinion.


Yeah, no one is surprised that LLMs do what they're trained to do: predict tokens. The surprise comes from the fact that merely training to predict tokens ends up with model weights that generate emergent reasoning.

If you want to say reasoning and token prediction are just the same thing at scale you can say that, but I don't fall into that camp. I think there's MUCH more to learn, and indeed a new field of math or even physics that we haven't even discovered yet. Like a step change in mathematical understanding analogous to the invention of Calculus.



