Einsum Is All You Need – Einstein Summation in Deep Learning (2018)

yorwba · on Feb 3, 2019

The problem with einsum is that you have to explicitly specify the mapping between dimensions and indices every time, without any way to enforce consistency. It would be more ergonomic if each tensor had labeled dimensions. That would prevent the kind of silly mistake where you mix up the ordering of dimensions and only notice it when you later change the shape of the tensors so the different dimensions no longer match up.

sischoel · on Feb 3, 2019

I think, this is something that the community is now slowly realizing. I've must have read about that multiple times recently. Some packages that seem to provide tensors with labeled dimensions.(I haven't personally worked with them):

xarray (Python): https://github.com/pydata/xarray NamedArrays.jl (Julia): https://github.com/davidavdav/NamedArrays.jl

claudius · on Feb 3, 2019

ITensor [1] has somewhat pioneered this concept in tensor networks for condensed matter physics. If I understand correctly, Uni10 [2], a similar project, is even working on a graphical interface for such networks so that you can "draw" the network and have the computer figure out the best contraction order.

In my own code I’ve recently also implemented such named indices (only had "einstein summation"-like contraction specifiers before) and they make tensor contractions so much simpler to write, especially since you can simply overload operator * and have it figure out which legs need to go together.

As a side effect, it also enforces your algorithms to make sense because you can't simply add two tensors living on different (but equal-dimensional) vector spaces together anymore.

[1] https://itensor.org

[2] https://uni10.gitlab.io/

mcabbott · on Feb 3, 2019

I made a thing to audit that you are consistent with your indices [1] as an alternative to explicit named-tensor objects. And found, but have not used, another package in a similar spirit [2].

Although to be honest, simply writing A[n,μ,ν,c] etc. (using different letters for different spaces) makes it pretty easy to visually check that you are getting this right. This is one of the attractions of the notation, even on paper. Unfortunately np.einsum's string notation makes this harder to see, as the indices aren't adjacent to the variable name.

[1] https://github.com/mcabbott/TensorCast.jl#checking (Julia)

[2] https://github.com/ofnote/tsalib (Python)

claudius · on Feb 3, 2019

> I made a thing to audit that you are consistent with your indices [1] as an alternative to explicit named-tensor objects. And found, but have not used, another package in a similar spirit [2].

Cool, I will have to check this out!

> Although to be honest, simply writing A[n,μ,ν,c] etc. (using different letters for different spaces) makes it pretty easy to visually check that you are getting this right. This is one of the attractions of the notation, even on paper. Unfortunately np.einsum's string notation makes this harder to see, as the indices aren't adjacent to the variable name.

The problem is not so much the letter-space association but also handling the ordering of the spaces inside the tensor. For example in my code, to do a contraction over two indices, you could do prod<2>(a, b, "tlx,tr,p1,p2|tlx,p1,tl|tl,p2,tr") where the result would then have index order tl,p2,tr. The problem was then that changing the index order in one place (e.g. for performance reasons) meant having to re-check all other places where this is used. If you want to contract a tensor network like (d) in [1], this quickly gets complicated. With named indices, the above becomes a * b and if I change the index order in any place, it gets automatically changed there, too.

[1] https://journals.aps.org/prb/article/10.1103/PhysRevB.81.165...

mcabbott · on Feb 3, 2019

Right, changing the order would be a pain. Although if the reason for doing so is memory layout (for speed), then a lazy permutedims(A) would decouple index order from this. Perhaps when creating a tensor for the first time there ought to be a way to specify the layout? Haven't thought much but something like A[n^4, μ,ν, c^1] := .... would not be hard to do.

In your prod<2>(a, b, ... example, if p1 and p2 are in the same space, how would a*b know which one to contract? Or do they have different names from when a was created?

claudius · on Feb 3, 2019

> In your prod<2>(a, b, ... example, if p1 and p2 are in the same space, how would a*b know which one to contract? Or do they have different names from when a was created?

ITensor introduced this concept and I mostly just followed their lead – spaces have unique names and tensor legs have a name label and a "prime level". So for example an operator O: A → A would have one leg labelled a[uuid]-prime0 and another leg a[uuid]-prime1. Similar to how one might write O: A → A has elements O_{a a’} when writing it down on paper.

adamnemecek · on Feb 3, 2019

http://nlp.seas.harvard.edu/NamedTensor

curiousmiro · on Feb 3, 2019

See HarvardNLP's namedtensor which is under active development!

disclaimer: i'm a contributor

CuriousSkeptic · on Feb 3, 2019

Why is it that mathematicians are so comfortable with using ordering to encode semantics?

Evident from this thread it seems many developer recognize the need to use proper labels to simplify reasoning. Doesn’t mathematicians have similar desires?

improbable22 · on Feb 3, 2019

Well, many different notations are convenient in different contexts. The compiler they target is a bit more flexible than developers', so they are free to just describe what they are doing on page 1.

Human languages face the same trade-off: in some word order is very important, in others less so, but they need to compensate with some kind of case labels.

Xcelerate · on Feb 3, 2019

I remember when I was learning matrix calculus and realized at some point that it was much simpler to convert everything to index notation, perform all operations, then convert everything back to standard notation at the end. It became almost comically simple, because you're "just" working with labeled scalars at that point. To be fair, it's convenient to memorize some of the more commonly used expressions (like ∂tr(AB)/∂B) rather than rederive them from scratch.

krastanov · on Feb 3, 2019

Roger Penrose (the famous mathematician) has been saying the same thing for a couple of decades (going even further than that, check out Penrose diagrams).

jesuslop · on Feb 3, 2019

And Penrose diagrams (tensor networks) became a prominent example of String Diagram in the monoidal category of vector spaces over a field.

cdavid · on Feb 3, 2019

I am interested: do you have a reference that shows this through examples ? In my grad school years, I was always dissatisfied by matrix calculus notations

cdavid · on Feb 3, 2019

Interesting, I almost always got the exact opposite experience. How do you calculate non trivial matrix derivatives using index notation ?

mcabbott · on Feb 3, 2019

You insert kronecker deltas: ∂M_ab / ∂M_cd = δ_ac δ_bd , and a,b here remain contracted with whatever M was originally contracted with. Then you simplify, δ_ac Z_xya = Z_xyc and so on.

stared · on Feb 3, 2019

Right now I work on Tensor diagram notation for deep learning (for project "thinking in tensors, writing in PyTorch").

To read more about it, see: https://medium.com/@pmigdal/in-the-topic-of-diagrams-i-did-w... (obviously, I refer to the post).

And if you want to create some, here is a short demo: https://jsfiddle.net/stared/8huz5gy7/

In general, I want to expand that to tensor structure (e.g. n, channel, x, y) plus, translate it to the Einstein summation convention.

joppy · on Feb 3, 2019

I've never really understood the point of Einstein notation, as a piece of mathematical notation. Is writing something like A[i, j] * B[j, k] really that much faster than writing something like Sum[j](A[i, j] * B[j, k])? Especially when you have to check the left hand side of the equality sign just to know which indices to sum over, it seems like making things less clear for a minuscule saving on ink.

ssivark · on Feb 3, 2019

It's not just about the sum. The upper and lower indices are semantically crucial. Eg: vectors have upper indices and gradients have lower indices. It's nonsensical to add the two, so the core step in gradient descent doesn't make sense. To add the two, you need the metric (Hessian) to raise the index of the gradient. And just like that, voila, you've discovered natural gradients!

mmmmpancakes · on Feb 3, 2019

Gradients are vectors. You can certainly add them to other vectors.

I think what you mean to say is that you can not add vectors and covectors. The differential of a function is a 1-form. Pointwise, it is a covector. This data is the natural starting point from which to compute gradients.

Metrics can be used to convert covectors into vectors and forms into vector fields. The result is the thing you call the natural gradient. If you do this with the standard metric on R^n, then you get the vector field which you call the gradient.

mhh__ · on Feb 3, 2019

Absolutely yes.

Doing tensor magic without einstein notation will make you shoot yourself, and even you don't someone else will if publish it.

The only real problem I have with it personally is the abstraction of upper and lower indices, which I constantly forget the conventions as to which is which.

krastanov · on Feb 3, 2019

This is interesting... I love the notation, but mainly because how upper and lower indices make it easy to distinguish vectors and forms (not that it matters in ML where everything is "euclidean").

joppy · on Feb 3, 2019

Upper and lower indices are great, but are still unrelated to the convention of dropping the big sigma at the front.

antidesitter · on Feb 3, 2019

> not that it matters in ML where everything is "euclidean"

Non-euclidean spaces are actually quite common in ML, but many people don’t realize the spaces they’re working in are non-euclidean!

improbable22 · on Feb 3, 2019

Would be curious to hear what you have in mind here -- could you expand?

g82918 · on Feb 6, 2019

Co -> low, primes below.

lenticular · on Feb 3, 2019

It's more succinct, and the thing is that in most physics, repeated indices really almost always are sums. When you're doing fluid mechanics or anything like that, it's tensors all the time. The terseness is nice once you get used to it.

ajkjk · on Feb 3, 2019

Yes, it's incredibly faster. You do a lot of tensor algebra in a relativity class.

joppy · on Feb 3, 2019

To the point where writing one large sigma and listing indices under it at the start of each line would significantly slow you down? I found I had to do this mentally during my own relativity class, just to figure out what each expression meant.

frutiger · on Feb 3, 2019

> To the point where writing one large sigma and listing indices under it at the start of each line would significantly slow you down?

Yes (in my relativity & QFT classes). But the timesaving aspect was not that important for me. The notation enabled the intuition of “zipping” together these somewhat unwieldy mathematical objects, and that was the clincher.

pishpash · on Feb 3, 2019

Then you're doing it wrong. The point of notation (or learning any new language) is to operate in that language, not to translate back and forth.

joppy · on Feb 3, 2019

It's not actually a new notation though, it's literally just deleting a part of the old notation and then getting the reader to fill that part back in. While this is fine for calculations, I would rather that when equations are actually presented, that one extra piece of information is explicitly given.

I guess a similar thing that happens on the programming language side would be leaving out types when the compiler can infer them. This is great, and can make code more concise, but when reading an API I would really like the types written down explicitly, rather than having the play the part of the compiler.

ajkjk · on Feb 3, 2019

If types are omitted then there's actual missing information from the equation. If summations are omitted, there... isn't. They add nothing (at least in relativity where every sum is over the same 4 values).

Anyway, index notation is more of a shorthand for symbolic tensor algebra computations than actual sums. The problem is that when dealing with 2, 3, and 4 index tensors, with co- and contra-variant indexes, it's quite cumbersome to notate which component of a tensor contracts with which component of another tensor, and anything you could come up with would end up looking like index notation. Rarely do you actually mean to sum anything.

joppy · on Feb 5, 2019

I don't have a problem with using indices for anything, they are fine. Distinguishing upper and lower indices is even better. The exact complaint I have is the implicit sums. For example, a matrix A would be written with one upper and one lower index, which I'll just write as A[i, j]. In einstein (implicit sum) notation, the trace of the matrix is written Tr A = A[i, i]. Is this really better than saying Tr A = Sum[i] A[i, i]?

mcabbott · on Feb 3, 2019

If you have just one term, then leaving out the summation sign saves little. But in larger expressions, the rule applies sums per term, and then the savings in clutter can be quite large. For example, here v_i is not summed, and the term with z_b has two sums:

A_ia x_a + v_i + x_a y_a ( w_i + B_ib z_b )

You should not need to check the LHS, x_a y_a should always mean the dot product of these two. At least this is the convention among the heavy addicts; among lighter users you will sometimes find other setups.

joppy · on Feb 5, 2019

I still feel that the whole expression could be clarified quite easily be writing a \Sum_{a, b} at the front?

mcabbott · on Feb 5, 2019

But that would sum everything, not just the parts which need it. You'd actually have to write this:

\sum_a(A_ia x_a) + v_i + \sum_a(x_a y_a) ( w_i + \sum_b(B_ib z_b) )

My example is actually still easier without indices. (Although you lose some information, like the fact that x & y are vectors in the indexed by a,b.) Here it's clear that Ax and x⋅y each involve a sum, which never includes v:

Ax + v + (x⋅y)(w + Bz)

feanaro · on Feb 3, 2019

As an analogy, why do you write A⋅B instead of Sum[i](A[i] * B[i])?

joppy · on Feb 5, 2019

This is a bit of a red herring - why would people using Einstein notation write A[i] B[i] instead of writing A⋅B?

feanaro · on Feb 5, 2019

They wouldn't. My point was that A⋅B is also hiding a sum across all indices, just like Einstein notation. The sum never changes and is therefore visual/cognitive noise, once you've understood it the first time. The dot product is an abstraction that hides it and makes it implied.

thecleaner · on Feb 3, 2019

Awesome examples. However einsum fails to express convolutions. Also functions that are applied element wise such as the sigmoid and softmax. All three are crucial in deep learning.

TTPrograms · on Feb 3, 2019

How do you einsum convolution? Arguably the single most important linear operation in deep learning?

ssivark · on Feb 3, 2019

Convolution is a linear operator (with a certain special structure). So convolutions are represented just like any matrix-vector multiplication. Given the special structure of the convolution operator, it can be represented in a sparse manner, and what's more, the actual computation can also be implemented in an efficient manner, compared to a naive matrix-vector multiplication.

TL;DR: It can be represented easily using Einstein notation. Einstein notation just does not capture the sparsity properties we like; it represents the transformation properties quite nicely.

krastanov · on Feb 3, 2019

The smart-aleck answer would be "with a Fourier transformation". But yeah, I think you have a point.

mlthoughts2018 · on Feb 3, 2019

Looks like this is a re-post of the same link from this somewhat recent post: https://news.ycombinator.com/item?id=16986759