cgadski's comments

Yeah, I also wanted to suggest Stylus Labs' app. I don't use a tablet much at present, but two years ago this was the best software I found.

Besides the usual selecting and manipulating with rectangle/lasso/intersection selections, handwritten text that sits roughly level with the page rules can be deleted or moved by using the stylus as a "cursor." It understands punctuation and descenders and works surprisingly well. Text that gets pushed too far to the right even "reflows" onto the next line. The site [1] has a good explanation of how this works.

Rnote looks nice too though. I wonder if it'll be able to match the UX of Write.

[1] https://www.styluslabs.com/


B is the number of data vectors being fed through (the batch size). You can erase the line labeled B without much loss. (You just get the diagram for the feed-forward pass of a single vector.)


Since learning about tensor diagrams a few months ago, I've made them my default notation for tensors. I liked your chart, and also Jordan Taylor's diagram for multi-head attention.

Some notes for other readers seeing this for the first time:

My favorite property of these diagrams is that they make it easy to re-interpret a multilinear expression as a multilinear function of any of its variables. For example, in standard matrix notation you'd write x^T A x to get a quadratic form in the variable x. I think most people read this either left to right or right to left: take a matrix-vector product, then take an inner product between vectors. Tensor notation is more like Prolog: the diagram

  x - A - x 
involves two indices/variables (the lines) "bound" by three tensors/relations (A and two copies of x). That framing makes it easier to think about the expression as a function of A: it's just a "Frobenius inner product" between -A- and the tensor product -x x-. The same thing happens with the inner product between a signal and a convolution of two other signals. In standard notation it might take a little thought to remember how to differentiate <x, y * z> with respect to y (<x, y * z> = <y, x * z'>, where z' is the time-reversal of z), but thinking with a tensor diagram reminds you to focus on the relation i = j + k (an order-3 tensor) constraining the indices i, j, k of the three signals.

All of this becomes increasingly critical when you have more indices involved. For example, how can you write the flattened matrix vec(AX + XB) as a matrix-vector product of vec(X), so we can solve the equation AX + XB = C? (Example stolen from your book.)
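
If anyone wants to check those readings numerically, here's a quick numpy sketch. The only convention assumed is column-stacking, vec(M) = M.flatten(order='F'):

  import numpy as np

  rng = np.random.default_rng(0)
  n = 4
  A, B, X = rng.normal(size=(3, n, n))
  x = rng.normal(size=n)

  # Reading x - A - x as a Frobenius inner product between A and the
  # tensor product x x^T: both contractions bind the same two indices.
  assert np.isclose(np.einsum('i,ij,j->', x, A, x),
                    np.einsum('ij,ij->', A, np.outer(x, x)))

  # vec(AX + XB) as a matrix acting on vec(X), with the column-stacking
  # convention vec(M) = M.flatten(order='F').
  vec = lambda M: M.flatten(order='F')
  K = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(n))
  assert np.allclose(vec(A @ X + X @ B), K @ vec(X))
  # Solving AX + XB = C then amounts to solving K vec(X) = vec(C).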

I still have to get the hang of all the rules for dealing with non-linearities ("bubbles"), though. I'll have to take a look at your tensor cookbook :) I'm also sad that I can't write tensor diagrams easily in my digital notes.

Tensor diagrams are algebraically the same thing as factor graphs in probability theory. (Tensors correspond to factors and indices correspond to variables.) The only difference is that factors in probability theory need to be non-negative. You can define a contraction over indices for tensors taking values in any semiring though. The max-plus semiring gives you maximum log-likelihood problems, and so on.
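
Here's a toy version of that semiring swap, with two made-up factors over a chain of three variables:

  import numpy as np

  # Two factors over variables (i, j) and (j, k), stored as log-potentials.
  f = np.log(np.array([[1., 2.], [3., 4.]]))
  g = np.log(np.array([[5., 6.], [7., 8.]]))

  # Sum-product semiring: ordinary tensor contraction of the positive
  # factors exp(f), exp(g) gives the partition function Z.
  Z = np.einsum('ij,jk->', np.exp(f), np.exp(g))

  # Max-plus semiring: the same "contraction" with (max, +) on the
  # log-potentials gives the log-likelihood of the best joint assignment
  # (up to the -log Z normalization).
  best = (f[:, :, None] + g[None, :, :]).max()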


I'm really glad you've found my "book" useful! Makes me want to continue writing it :)


Tensor network notation is really useful for differentiating with respect to tensors. For example, suppose F is a real function of a matrix variable and think about how you'd differentiate F(A X) with respect to X. Conceptually this is easy, but I used to have to slow down to write it in Einstein notation. Thinking in terms of tensor diagrams, I just see F', X and A strung together in a triangle. Differentiating with respect to X means removing it from the triangle: the dangling edges are the indices of the derivative, and what's left is a product of A and F' contracted along the index that doesn't involve X.
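
For concreteness, here's a small numpy check of that picture, using a made-up F (everything here is just for illustration):

  import numpy as np

  rng = np.random.default_rng(1)
  A = rng.normal(size=(3, 4))
  X = rng.normal(size=(4, 5))

  # Toy scalar function of a matrix: F(M) = sum(sin(M)), so F'(M) = cos(M).
  F = lambda M: np.sin(M).sum()
  Fp = lambda M: np.cos(M)

  # Removing X from the triangle F' - A - X leaves A contracted with F'
  # along their shared index: grad_X F(A X) = A^T F'(A X).
  grad = A.T @ Fp(A @ X)

  # Finite-difference check on one entry of X.
  eps = 1e-6
  E = np.zeros_like(X); E[2, 3] = eps
  fd = (F(A @ (X + E)) - F(A @ (X - E))) / (2 * eps)
  assert np.isclose(grad[2, 3], fd)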

This blog post made me realize that tensor diagrams are the same as the factor graphs we talk about in random field theory. Indices of a tensor network become variables of a factor graph and tensors become factors. The contraction of a tensor network with positive tensors is the partition function of a corresponding field and so on.


Haha yeah, took some love. I have a scrappy little "framework" that I've been adjusting since I started making interactive posts last year. Writing my interactive widgets feels a bit like doing a game jam now: just copy a template and start compiling+reloading the page, seeing what I can get onto the screen. I've just been using the canvas2d API.

Besides figuring out a good way of dealing with reference frames, the only trick I'd pass on is to use CSS variables to change colors and sizes (line widths, arrow dimensions, etc.) interactively. It definitely helps to tighten the feedback loop on those decisions.


Thank you!

In the beginning I used kognise's water.css [1], so most of the smart decisions (background/text colors, margins, line spacing, I think) probably come from there. Since then it's been a series of little adjustments. The font is Le Monde Livre Classic [2], by Jean François Porchez.

I draft in Obsidian [3] and build the site with a couple of Python scripts and KaTeX.

[1] https://watercss.kognise.dev/

[2] https://typofonderie.com/fr/fonts/le-monde-livre-classic

[3] https://obsidian.md/


Thanks so much!

And yes, that's quite true. When the parameter gradients don't quite vanish, the equation

  <g_x, d x / d eps> = <g_y, d y / d eps>

becomes

  <g_x, d x / d eps> = <g_y, d y / d eps> - <g_theta, d theta / d eps>

where g_theta is the gradient with respect to theta.
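
To make that concrete, here's a small numpy check of the corrected identity on a single linear layer y = W x, with a GL-type symmetry that acts on both the input and the weights. The layer and symmetry are made up for illustration, and g_W below is just this layer's contribution to the parameter gradient (which is the full gradient if W appears nowhere else):

  import numpy as np

  rng = np.random.default_rng(3)
  W = rng.normal(size=(4, 5))
  x = rng.normal(size=5)
  M = rng.normal(size=(5, 5))   # generator of the one-parameter group

  # Linear layer y = W x; pretend g_y is the loss gradient from downstream.
  g_y = rng.normal(size=4)
  g_x = W.T @ g_y               # gradient with respect to the activation x
  g_W = np.outer(g_y, x)        # gradient with respect to the parameters W

  # Symmetry: x -> exp(eps M) x, W -> W exp(-eps M) leaves y = W x fixed,
  # so <g_y, dy/deps> = 0 and the identity above reduces to
  #   <g_x, M x> = <g_W, W M>.
  lhs = g_x @ (M @ x)
  rhs = np.sum(g_W * (W @ M))
  assert np.isclose(lhs, rhs)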

In defense of my hypothesis that interesting approximate conservation laws exist in practice, I'd argue that maybe parameter gradients at early stopping are small enough that the last term is pretty small compared to the first two.

On the other hand, stepping back, the condition that our network parameters are approximately stationary for a loss function feels pretty... shallow. My impression of deep learning is that an optimized model _cannot_ be understood as just "some solution to an optimization problem," but is more like a sample from a Boltzmann distribution which happens to concentrate a lot of its probability mass around _certain_ minimizers of an energy. So, if we can prove something that is true for neural networks simply because they're "near stationary points", we probably aren't saying anything very fundamental about deep learning.


Your work here is so beautiful, but perhaps one lesson is that growth and learning result where symmetries are broken. :-D


Hi, thanks!

In that sentence I was only talking about the translations and rotations of the plane as a group of invariances for the action of the two-body problem. This group is generated by one-parameter subgroups producing vertical translation, horizontal translation, and rotation about a particular point. Those are the "three degrees of freedom" I was counting.

You're right about the correspondence from symmetries to conservation laws in general.


That's also a neat result! I'd just like to highlight that the conservation laws proved in that paper are functions of the parameters that are conserved over the course of gradient descent, whereas my post is talking about functions of the activations that are conserved from one layer to the next within an optimized network.

By the way, maybe I'm being too much of a math snob, but I'd argue Kunin's result is only superficially similar to Noether's theorem. (In the paper they call it a "striking similarity"!) Geometrically, what they're saying is that, if a loss function is invariant under a non-zero vector field, then the trajectory of gradient descent will be tangent to the codimension-1 distribution of vectors perpendicular to the vector field. If that distribution is integrable (in the sense of the Frobenius theorem), then any of its integrals is conserved under gradient descent. That's a very different geometric picture from Noether's theorem. For example, Noether's theorem gives a direct mapping from invariances to conserved quantities, whereas they need a special integrability condition to hold. But yes, it is a nice result, certainly worth keeping in mind when thinking about your gradient flows. :)

By the way, you might be interested in [1], which also studies gradient descent from the point of view of mechanics and seems to really use Noether-like results.

[1] Tanaka, Hidenori, and Daniel Kunin. “Noether’s Learning Dynamics: Role of Symmetry Breaking in Neural Networks.” In Advances in Neural Information Processing Systems, 34:25646–60. Curran Associates, Inc., 2021. https://papers.nips.cc/paper/2021/hash/d76d8deea9c19cc9aaf22....


I wouldn't call drawing a distinction between an isomorphism and an analogy to be maths snobbery. I would call it mathematics. :)


Not GP, but thanks for your detailed comment and the paper reference.


Exactly right! In fact, because that symmetry does not include an action on the parameters of the layer, your conserved quantity <g_x, dx> should hold whether or not the network is stationary for a loss. This means it'll be conserved on every single data point. (In an image classification model, these values are just telling you whether or not the loss would be improved if the input image were translated.)
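
For anyone who wants to poke at this numerically, here's a minimal sketch. It swaps in rescaling (rather than image translation) as the one-parameter symmetry, since a bias-free ReLU layer is exactly scale-equivariant and the same chain-rule argument goes through:

  import numpy as np

  rng = np.random.default_rng(2)
  W = rng.normal(size=(6, 5))
  x = rng.normal(size=5)

  # A bias-free ReLU layer is equivariant under positive rescaling:
  # f(a x) = a f(x) for a > 0, with no action on the weights W.
  y = np.maximum(W @ x, 0.0)

  # Pretend g_y is the loss gradient arriving from downstream (any vector).
  g_y = rng.normal(size=6)
  g_x = W.T @ (g_y * (W @ x > 0))   # backprop through the layer

  # The symmetry's vector fields are dx/deps = x and dy/deps = y, so
  # <g_x, x> = <g_y, y> should hold on every data point, stationary or not.
  assert np.isclose(g_x @ x, g_y @ y)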


Everything in the paper is talking about global symmetries. Is there also the possibility of gauge symmetries?

