OK, I admit that the math flies way over my head and I barely understand the tex...

iNic · 2025-02-26T11:25:23 1740569123

- The Fourier Transform is an invertable operator (i.e. it acts on functions, in the case of matrices both functions and operators are themselves matrices). It transforms into what we call frequency space.

- This is most intuitive for signal analysis or images [1].

- Frequency space is inherently "complex", i.e. represented by complex numbers.

- Frequencies have the advantage that they take a "global" view of the problem.

- This mechanism is not equivalent to the attention mechanism. There is definitely a trade-off.

- But it is possible that it captures many of the important relationships that attention capture.

- I do not have good intuition for modReLU right away, but it seems important because it modifies the frequencies but preserves the inverse Fourier transform.

[1]: https://homepages.inf.ed.ac.uk/rbf/HIPR2/fourier.htm

xpe · 2025-02-26T19:16:24 1740597384

Worth noting that frequency space is often considered one dimensional. Adding the phase is what gives the other dimension.

jampekka · 2025-02-26T11:41:24 1740570084

modReLU seems to just increase the magnitude of the input value, and rotate it to the original polar angle. With clipping off negative magnitudes.

Or equivalently rotates a (real) bias term with the input angle and adds that into the original.

  (abs(z) + c)*exp(i*arg(z)) = abs(z)*exp(i*arg(z)) + c*exp(i*arg(z)) = z + c*exp(i*arg(z))

jampekka · 2025-02-26T11:21:05 1740568865

The actual mechanism at least is quite simple. Essentially it takes the FFT of the input embeddings, multiplies it elementwise with weights that are gotten from the input embeddings using an MLP (plus a constant (but learnable) bias) and then runs it through an activation function and finally takes the inverse FFT.

The "frequencies" are probably something quite abstract. FFT is often used in ways where there aren't really clear frequency interpretation. The use is due to convenient mathematical properties (e.g. the convolution theorem).

Rather amazing if this really works well. Very elegant.

xeonmc · 2025-02-26T11:23:06 1740568986

Essentially leveraging Convolution theorem. Same philosophy pops up in many places, e.g. DFT calculations

physicsguy · 2025-02-26T11:59:50 1740571190

Yep, it's very common to use this. we used it for grid based pairwise interactions since it turns N^2 op into N log N.

dekhn · 2025-02-26T16:18:40 1740586720

That sounds just like Particle Mesh Ewald, which we use in molecular dynamics to approximate the forces of pairwise interactions (interpolated on a grid). Ihttps://en.wikipedia.org/wiki/P3M

physicsguy · 2025-02-26T17:31:33 1740591093

It's similar but I worked on magnetic spin systems with dipole-dipole interactions, so there wasn't the interpolation part, and as I understand it in Ewald summation you're always assuming periodic boundary conditions.

In our spin systems you basically pre-compute the interaction kernel tensor and can either take into account periodicity or ignore it depending on what sort of system you're looking at. Often you don't want the periodic effect since the dipole-dipole interaction is only one of many, much of the interesting phenomena in magnetics is in the interplay between short range forces and the long range forces. At each time step you FFT to the magnetisation tensor and then multiply with the interaction tensor, then iFFT.

jampekka · 2025-02-26T12:12:13 1740571933

Sorry, added the convolution theorem part in an edit after your comment.

pointlessone · 2025-02-26T11:35:03 1740569703

I’m still confused. Does it treat the input tokens as a sampled waveform?

I mean, say I have some text file in ASCII. Do I then just pretend it’s raw wav and do FFT on it? I guess it can give me some useful information (like does it look like any particular natural language or is it just random; sometimes used in encrytion analysis of simple substitution cyphers). It feels surprising that revers FFT can get a coherent output after fiddling with the distribution.

xeonmc · 2025-02-26T11:39:11 1740569951

Do keep in mind that FFT is a lossless, equivalent representation of the original data.

yobbo · 2025-02-26T20:24:39 1740601479

As I understand it, the token embedding stream would be equivalent to multi-channel sampled waveforms. The model either needs to learn the embeddings by back-propagating through FFT and IFFT, or use some suitable tokenization scheme which the paper doesn't discuss (?).

It seems unlikely to work for language.

jampekka · 2025-02-26T11:37:14 1740569834

It embeds them first into vectors. The input is a real matrix with (context length)x(embedding size) dimensions.

blovescoffee · 2025-02-26T12:47:47 1740574067

No. The FFT is an operation on a discrete domain, it is not the FT. In the same way audio waveforms are processed by an FFT you bucket frequencies which is conceptually a vector. Once you have a vector, you do machine learning like you would with any vector (except you do some FT in this case, I haven’t read the paper).

lta · 2025-02-26T13:06:51 1740575211

Most likely the embedding of the token is passed through FFT

agoose77 · 2025-02-26T13:21:32 1740576092

I am not an expert by _any_ means, but to provide _some_ intuition — self-attention is ultimately just a parameterised token mixer (see https://medium.com/optalysys/attention-fourier-transforms-a-...) i.e. each vector in the output depends upon the corresponding input vector transformed by some function of all the other input vectors.

You can see conceptually how this is similar to a convolution with some simplification, e.g. https://openreview.net/pdf?id=8l5GjEqGiRG

Convolutions are often used in contexts where you want to account for global state in some way. - https://openreview.net/pdf?id=8l5GjEqGiRG