Entropy: An Introduction (washington.edu)
155 points by aishwarya_m on July 17, 2020 | 32 comments



Hi HN, I'm the author of this piece (Ethan Weinberger). I originally wrote this as a set of notes for myself while brushing up on concepts in information theory over the past couple of weeks. I found the presentations of the material I was reading a little dry for my taste, so I tried to incorporate more visuals and really emphasize the intuition behind the concepts. Glad to see others are finding it useful/interesting! :)


Thanks, I enjoyed reading. As an electronic engineering student, I remember grappling with information theory in the abstract: it was a weather example very similar to yours that gave me the intuition I was missing.

An observation/suggestion. The intro is accessible to many people; that drops off a steep cliff when you hit the maths. Now, I'm not complaining about that: it's instructive and necessary to formalise things. Where I struggle is in reading the equations in my head when I don't know what words to use for the symbols. For example, that very first `X ~ p(x)`. I didn't know what to say for the tilde character, so couldn't verbalise the statement. I do know that $\in$ (the rounded 'E') means 'is a member of', so I could read the next statement. The problem gets even more confusing for a non-mathematician as the same symbol is used with different meanings in different branches of maths/science (e.g. $\Pi$).

I get that writing out every equation in English isn't feasible (or, at least, is asking a lot of the writer). But I wonder if there's a middle way, e.g. through hyperlinking?

As I say: not a criticism and I don't have a good solution. Just an observation from a non-mathematician. Enjoyed the piece anyway.


"X ~ p(x)" means "X is a random variable drawn from the probability distribution p(x)" or maybe "X is drawn from p(x)" for short.

Are you sure it's a matter of knowing what to say (in your head) vs knowing the definition of the notation in the first place? I am pretty familiar with this notation, but I rarely verbalize it mentally. I can tell because I read and understand it quickly without problem, but on the rare occasion when I have to read it aloud I realize I'm not sure how I should pronounce it.


Thanks for the explanation.

Agree it's more "say in my head" than "speak out loud". But I still need to know what to say - internally or externally. Without knowing that ~ denotes "drawn from", all I can say is "X tilde p of x". That has no semantics; no intuition. Whereas knowing that $\in$ means "is a member of", I can read "$x \in X$" as "x is a member of X".

> but I rarely verbalize it mentally

Neither do I when I know something well. For example, I don't explicitly verbalise "is a member of" now, even internally. There's a shortcut hard-wired in that understands it without needing to pronounce it explicitly. In fact that shortcut goes beyond the syntax: it goes straight to the intuition of "x represents any member of the set X". But I had to go through the process of saying it on the way to the shortcut.


OK, but if you know the formal definition, and you're not reading it out loud, why not just make something up? I actually don't know whether "is drawn from" is the "correct" way to pronounce the tilde. I think maybe other people say "is distributed as".


I don't think a piece on information theory should necessarily be "accessible to many people". It's a topic which is normally taught in grad school.

Something like X ~ p(x) would be seen all over the place in probability, stats, ML, and related courses such as info theory, detection and estimation, etc. By the time someone is interested in info theory, this notation is likely permanently etched into their mind. So for this article it is very "audience appropriate".

> not a criticism and I don't have a good solution

Having a mental map of how different subjects fit together (without actually having to study them in depth) is a good start.

I've seen so many people crash and burn with machine learning because they were unaware that it depends on linear algebra, calculus, and probability.

With a mental map there is less "surprise" and it's more a matter of simply understanding that they didn't have the right dependencies.


I read X ~ p(x) as "X is distributed as p of x"


What would X ~ p(y) mean?


X is distributed as p(y) where p is a probability distribution parameterized by y
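A concrete instance (my own example, not from the article): if y is, say, a rate parameter, then

    X \sim \mathrm{Pois}(y), \qquad p(x \mid y) = \frac{e^{-y} y^x}{x!}, \quad x = 0, 1, 2, \dots

reads as "X is Poisson-distributed with rate y".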


Ethan also writes about machine learning at https://honestyisbest.com/kernels-of-truth each week -- his most recent piece there (https://honestyisbest.com/kernels-of-truth/2020/Jul/14/facia...) has a neat explanation of how convolutional neural networks (CNNs) work.


You might want to give more conditions for the claim that the self-information gives the length of the shortest possible code.

In particular, the condition isn't just that it's the shortest code; it's the shortest self-delimiting code. In your example with probabilities {1/2, 1/4, 1/8, 1/8}, someone could come in and say let's code it as {0,1,01,00}, which would appear to encode the latter two outcomes in 2 bits rather than 3. The problem, of course, is that {0,1,01,00} is not a self-delimiting code: after you receive the bit 0, you don't know if you're done or if you should wait for another bit to form 01. But the code {0,10,110,111} is self-delimiting, because after you get a 0 or a 10, etc., you know you're done.
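Here's a quick sketch of the difference in Python (the symbol names and the helper are mine, not from the article), decoding a bitstream under the self-delimiting code:

    # Prefix-free (self-delimiting) code for outcomes with probs {1/2, 1/4, 1/8, 1/8}.
    prefix_code = {"a": "0", "b": "10", "c": "110", "d": "111"}

    def decode_prefix(bits, code):
        # Greedy decoding works because no codeword is a prefix of another.
        inverse = {v: k for k, v in code.items()}
        out, buf = [], ""
        for bit in bits:
            buf += bit
            if buf in inverse:
                out.append(inverse[buf])
                buf = ""
        return out

    print(decode_prefix("010110111", prefix_code))  # ['a', 'b', 'c', 'd'] -- unambiguous

The same greedy trick fails for {0,1,01,00}: after receiving "01" you can't tell whether it was 'a' followed by 'b' or a single 'c', so no decoder can recover the message without extra delimiters.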

I've found that when I teach this material, if I don't mention the self-delimiting condition, then a clever student in the class always thinks of the {0,1,01,00}-type code. (This can be a good way to identify clever students in an intro information theory class!)


Thank you for the great article. I believe there is a typo in "we assign a value of 0 to p(x) log p(x) when x=0", it should be "when p(x) = 0".
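For anyone wondering why that convention is safe, it comes from the standard limit

    \lim_{t \to 0^+} t \log t = 0,

so setting p(x) log p(x) = 0 when p(x) = 0 just makes the entropy sum continuous in the probabilities.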


Thanks for the catch. Fixed!


Awesome paper Ethan!!!


Might be worth clarifying in the title that this is about entropy in the context of information theory.


In which context is it a different concept?


In thermodynamics there are two other formulations of entropy: the Clausius one, in terms of temperature and heat, and the Boltzmann one. The latter defines entropy as the log of the number of microstates a system could be in given a particular macrostate.

The Shannon definition is equivalent to the Boltzmann def only in the case that the system consists of infinitely many identical subsystems. If there are only finitely many, for instance, the log of the microstate count does not correspond to the same "-p log p".

The Clausius def can be derived from the Boltzmann one, but they are nevertheless also distinct formulations.
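For reference, the standard forms side by side (k_B is Boltzmann's constant, W the number of microstates compatible with the macrostate):

    dS = \delta Q_{\mathrm{rev}} / T          (Clausius)
    S  = k_B \ln W                            (Boltzmann)
    S  = -k_B \sum_i p_i \ln p_i              (Gibbs; Shannon's H drops k_B and uses log base 2)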


https://en.wikipedia.org/wiki/Boltzmann%27s_entropy_formula#...

According to Wikipedia, if you start with the Gibbs entropy (which is the same as Shannon entropy), and then assume all microstate probabilities are equal (which Boltzmann does), you get the Boltzmann entropy formula. It also says Boltzmann himself used a p ln(p) formulation.

So aren't they the same, perhaps up to a constant factor?
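Spelling that out: with W microstates, each with probability p_i = 1/W, the Gibbs formula gives

    S = -k_B \sum_{i=1}^{W} \frac{1}{W} \ln \frac{1}{W} = k_B \ln W,

which is exactly Boltzmann's formula, so in that special case they do agree up to the constant k_B and the choice of log base.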


If you count the number of microstates for a given macrostate you get a multinomial coefficient N!/(n_1!n_2!...). The log of this is the Boltzmann entropy. However, if you consider N to be very large or infinite, you can show using the Stirling approximation that this ends up being the Gibbs/Shannon entropy in that limit. So, in general, no.
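Sketch of the Stirling step (using ln n! ≈ n ln n - n, with p_i = n_i/N held fixed as N grows):

    \ln \frac{N!}{n_1!\, n_2! \cdots}
      \approx N \ln N - \sum_i n_i \ln n_i
      = -\sum_i n_i \ln \frac{n_i}{N}
      = -N \sum_i p_i \ln p_i,

so the Boltzmann count only recovers the Gibbs/Shannon form (per subsystem) in that large-N limit.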


In thermodynamics / statistical mechanics there is another formulation of entropy: Gibbs entropy is different from Boltzmann entropy (and equivalent to Shannon entropy in information theory).


"Entropy" has been a concept in physics for a long time. Far longer than the concept of information entropy.


Sure, but that doesn't mean that the concept of entropy in physics is a different concept than its incarnation in information theory. Just like the concept of energy existed before the development of thermodynamics, but thermodynamic energy is still energy.


If you want to be pedantic about it, sure. The fact remains that a discussion or blog post can be about very different things depending on the context. You're not going to learn anything about the relationship between energy and entropy, for example, if you're talking about information theory. Hence my original comment.


I imagine the timing of this post is correlated with the release of the documentary Bit Player about Claude Shannon. Haven't seen it yet but looking forward to it.

The article does a decent job of graphing and laying out some of the concepts of entropy for information theory, but I'm not sure who the target reader is, since the prereqs are perhaps only slightly narrower than what one needs to read Shannon's paper [0], and the article is really illustrating only a fraction of the concept.

It can perhaps work as a primer for what shows up starting on pages 10-11 of the original document. In any case, provided you grasp the mathematical definition of entropy through thermodynamics, the microstate-based definition through Boltzmann, and "basic probabilities" (expected value, typical discrete distributions, terms like "i.i.d."), you should be good to go. But then you might already know all this..

And if you do, and you like what you read, then the full original thing by Shannon is a delight to explore to truly grasp what has been so foundational to a lot of things since 1948.

[0] http://people.math.harvard.edu/~ctm/home/text/others/shannon...


Not sure if it's appropriate, but here's my own attempt at a very terse restatement of Shannon's original paper [1]:

    https://mechaelephant.com/dev/Shannon-Entropy/
I recommend that everyone who's interested read Shannon's original paper. It's one of the few examples of an original paper that's both clear and readable.

[1] https://homes.cs.washington.edu/~ewein/blog/2020/07/14/entro...


For the example of H(X) where X ~ Geom(p), shouldn't the second term be multiplied by (k-1) after breaking up the logarithm? That is, shouldn't

  pq^(k-1)log(q)
be

  (k-1)pq^(k-1)log(q) 
Apologies if I'm off base here; I'm pretty rusty on infinite series.
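Writing out what I mean, with the standard geometric pmf p(k) = pq^(k-1) where q = 1 - p:

    H(X) = -\sum_{k=1}^{\infty} p q^{k-1} \log\big(p q^{k-1}\big)
         = -\sum_{k=1}^{\infty} p q^{k-1} \log p \;-\; \sum_{k=1}^{\infty} (k-1)\, p q^{k-1} \log q,

which is where I think the (k-1) factor on the second term should come from.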

Great piece, this is really making this concept more intuitive for me.


You're 100% right. Fixed!


Love the post. Just FYI, your post is not mobile-friendly. When scrolling down on iPhone it's impossible not to accidentally shift the viewport away from the left margin, making the left side hard to read.


Neat fact: entropy and expected algorithmic information are asymptotically equivalent. Two quite different approaches to information theory converge.
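More precisely, as I understand the standard statement (for a computable distribution P, with K denoting prefix Kolmogorov complexity):

    0 \le \sum_x P(x)\, K(x) - H(P) \le K(P) + O(1),

i.e. expected description length and Shannon entropy differ by at most a constant that depends only on the distribution itself.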


Possibly unrelated, but I love the simple and clean layout of the website. Does anyone know how to create something similar to this?


I (shamelessly) stole the layout from Gregory Gundersen's wonderful research blog (https://gregorygundersen.com/blog/). He has instructions on how to replicate it here: https://gregorygundersen.com/blog/2020/06/21/blog-theme/


If you do, please make it 'zoomable' (and as a result mobile-friendly). I tend to prefer reading at a larger size with fewer words per line, but zooming in just kept everything the same, except bigger.

Not that it wasn't a great article though :).



