Seven basic rules for causal inference (pedermisager.org)
218 points by RafelMri 9 months ago | 67 comments



>Controlling for a collider leads to correlation

This is a big one that most people are not aware of. Quite often, in economics, medicine, and epidemiology, you'll see researchers adjust for everything in their regression model: income, physical activity, education, alcohol consumption, BMI, ... without realizing that they could easily be inducing collider bias.

A much better, but rare, approach is to sit down with some subject matter experts and draft up a DAG - directed acyclic graph - that makes your assumptions about the causal structure of the problem explicit. Then determine what needs to be adjusted for in order to get a causal estimate of the effect. When you're explicit about your causal assumptions, it makes it easier for other researchers to propose different causal structures, and see if your results still hold up under alternative causal structures.

The DAGitty tool [1] has some cool examples.

[1] https://www.dagitty.net/dags.html
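
To make the collider point concrete, here's a minimal R sketch (my own toy example, not from the article): X and Y are independent by construction, but adjusting for their common effect C makes X look like a predictor of Y.

    set.seed(1)
    n <- 1e4
    X <- rnorm(n)               # some exposure
    Y <- rnorm(n)               # independent of X by construction
    C <- X + Y + rnorm(n)       # collider: caused by both X and Y

    round(cor(X, Y), 3)               # ~0, as it should be
    coef(summary(lm(Y ~ X)))          # X coefficient ~0
    coef(summary(lm(Y ~ X + C)))      # "adjusting" for C: X now predicts Y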


Collider bias or "Berkson's Paradox" is a fun one, there lots of examples of it in everyday life: https://en.wikipedia.org/wiki/Berkson%27s_paradox
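
A quick simulation of the hospital version (toy numbers of my own): two diseases that are independent in the population become negatively correlated once you only look at admitted patients, because admission requires at least one of them.

    set.seed(2)
    n <- 1e5
    disease_a <- rbinom(n, 1, 0.1)
    disease_b <- rbinom(n, 1, 0.1)            # independent of disease_a
    admitted  <- disease_a == 1 | disease_b == 1

    cor(disease_a, disease_b)                        # ~0 in the population
    cor(disease_a[admitted], disease_b[admitted])    # negative among the admitted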


At the bottom, the author mentions that by "correlation" they don't mean "linear correlation", but all their diagrams show the presence or absence of a clear linear correlation, and code examples use linear functions of random variables.

They offhandedly say that "correlation" means "association" or "mutual information", so why not just do the whole post in terms of mutual information? I think the main issue with that is just that some of these points become tautologies -- e.g. the first point, "independent variables have zero mutual information" ends up being just one implication of the definition of mutual information.


This isn't a correction to your post, but a clarification for other readers: correlation implies dependence, but dependence does not imply correlation. Conversely, two variables share non-zero mutual information if and only if they are dependent.


By that measure, all of these Spurious Correlations indicate insignificant dependence, which isn't of utility: https://www.tylervigen.com/spurious-correlations

Isn't it possible to contrive an example where a test of pairwise dependence causes the statistician to error by excluding relevant variables from tests of more complex relations?

Trying to remember which of these factor both P(A|B) and P(B|A) into the test.


I think you're using the word "insignificant" in a possibly misleading or confusing way.

I think in this context, the issue with the spurious correlations from that site is that they're all time series for overlapping periods. Of course, the people who collected these understood that time was an important causal factor in all these phenomena. In the graphical language of this post:

T --> X_i

T --> X_j

Since T is a common cause of both, we should expect to see mutual information between X_i and X_j. In the paradigm here, we could try to control for T and see if a relationship persists (i.e. perhaps in the same month, collect observations for X_i and X_j in each of a large number of locales), and get a signal on whether the shared dependence on time is the only link.
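
In R this might look something like the following sketch (variable names are mine; I use tt because T is R's shorthand for TRUE). A shared trend drives both series, and regressing it out removes the apparent relationship:

    set.seed(3)
    n  <- 200
    tt <- 1:n                          # time, the common cause
    x_i <- 0.05 * tt + rnorm(n)
    x_j <- 0.03 * tt + rnorm(n)

    cor(x_i, x_j)                                  # spuriously high
    cor(resid(lm(x_i ~ tt)), resid(lm(x_j ~ tt))) # ~0 once T is controlled for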


If a test of dependence shows no significant results, that's not conclusive because of complex, nonlinear, and quantum 'functions'.

How are effect lag and lead expressed in said notation for expressing causal charts?

Should we always assume that t is a monotonically-increasing series, or is it just how we typically sample observations? Can traditional causal inference describe time crystals?

What is the quantum logical statistical analog of mutual information?

Are there pathological cases where mutual information and quantum information will not discover a relationship?

Does Quantum Mutual Information account for Quantum Discord if it only uses von Neumann definition of entropy?


Could you give some examples of dependence without correlation?


> A sailor is sailing her boat across the lake on a windy day. As the wind blows, she counters by turning the rudder in such a way so as to exactly offset the force of the wind. Back and forth she moves the rudder, yet the boat follows a straight line across the lake. A kindhearted yet naive person with no knowledge of wind or boats might look at this woman and say, “Someone get this sailor a new rudder! Hers is broken!” He thinks this because he cannot see any relationship between the movement of the rudder and the direction of the boat.

https://mixtape.scunning.com/01-introduction#do-not-confuse-...


A clear graphical set of illustrations is the bottom row in this famous set: https://en.wikipedia.org/wiki/Correlation#/media/File:Correl...

They have clear dependence; if you imagine fixing ("conditioning") x at a particular value and looking at the distribution of y at that value, it's different from the overall distribution of y (and vice versa). But the familiar linear correlation coefficient wouldn't indicate anything about this relationship.


I mentioned it in another comment, but the most trivial example is:

    X ~ Unif(-1,1)
    Y = X^2

In this case X and Y have a correlation of 0.
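
A quick numerical check in R:

    set.seed(4)
    X <- runif(1e6, -1, 1)
    Y <- X^2
    cor(X, Y)   # ~0, even though Y is a deterministic function of X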


You can check the example described here: https://stats.stackexchange.com/questions/644280/stable-viol...

Judea Pearl’s book also goes into the above in some detail, as to why faithfulness might be a reasonable assumption.


Imagine your data points look like a U. There's no (linear) correlation between x and y; you are equally likely to have a high value of y when x is high or low. But low values of y are associated with medium values of x, and a high value of y means x will be very high or very low.


I’d be more interested in those tautologies nonetheless. Much better than literally untrue statements that I have to somehow decipher.


Can these seven be reduced to three basic rules?

- controlling for a node increases correlation among pairs where both are ancestors

- controlling for a node does not affect (the lack of) correlation among pairs where at least one is categorically unrelated (shares no ancestry with that node)

- controlling for a node decreases correlation among pairs where both are related but at least one is not an ancestor


Rule 2 (“causation creates correlation”) would be strongly disputed by a lot of people. It relies on the assumption of “faithfulness” which is not discussed until the bottom of the article.

This is a very innocent sounding assumption but it’s actually quite strong. In particular it may be violated when there are control systems or strategic agents as part of the system you want to study — which is often the case for causal inference. In such scenarios (eg the famous thermostat example) you could have strong causal links which are invisible in the data.
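
The thermostat example is easy to simulate (a sketch with made-up coefficients): outside temperature causally drives the heater, and both causally drive the indoor temperature, yet in the data the indoor temperature is correlated with neither, because the control loop cancels the disturbance.

    set.seed(5)
    n <- 1e4
    outside <- rnorm(n, 0, 5)                             # disturbance
    heater  <- -outside                                   # ideal controller reacts to outside
    indoor  <- 20 + outside + heater + rnorm(n, 0, 0.1)   # sensor noise

    cor(outside, indoor)   # ~0, despite a direct causal arrow outside -> indoor
    cor(heater, indoor)    # ~0 too, though the heater is doing all the work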


I'd argue you both could be right. Your comment could lead to a definition of intelligence: organisms capable of causally influencing deterministic systems to their advantage can be marked as intelligent, with the complexity involved determining the degree of intelligence.

Your point is great in that it also pinpoints the notion of agency scopes. In all the causal DAGs it feels like there are implicit regions: ones we can influence or not, intervene in or not, observe or not, and ones we are responsible for or not.

An intelligent agent is one capable of modelling a system, influencing it, and biasing it, such that it can reach and exploit a corner case of it. I talk about a corner case because of entropy and Murphy's law: for a given energy, there are many more disadvantageous states than advantageous ones. And the intelligence of a system is the complexity required to wield the entropy reduction of an energy source.


Two problems with this: 1. There are many other ways that correlation doesn't imply causation. 2. The phenomenon the GP describes doesn't require broad intelligence, just reactiveness - a thermostat or a guided missile could have this.


Indeed, causally linked variables need not be correlated in observed data; bias in the opposite direction of the causal effect may approximately equal or exceed it in magnitude and "mask" the correlation. Chapter 1 of this popular causal inference book demonstrates this with a few examples: https://mixtape.scunning.com/01-introduction#do-not-confuse-...


My fav way to intuit this is this example

https://stats.stackexchange.com/questions/85363/simple-examp...

Blew my mind the first time I saw it.

Not the same definitions one to one (author specifically talks about correlation vs linear correlation) but same idea.


This was my thought as well.

I don't like showing the scatterplots in these examples, as I think "correlation" is more associated with the correlation coefficient than with the more generic independence that the author means in this scenario. E.g. a U shape in the scatterplot may have a zero correlation coefficient but is not conditionally independent.


From the article:

> NB: Correlated does not mean linearly correlated

> For simplicity, I have used linear correlations in all the example R code. In real life, however, the pattern of correlation/association/mutual information we should expect depends entirely on the functional form of the causal relationships involved.


The standard mathematical definition of correlation means linear correlation. If you are talking about non-independence, it would be better to use that language. This early mistake made me think the author is not really an expert.


That seems a bit harsh. People can independently become experts without being familiar with the terminology used by existing experts. Further, if intended for a non-expert audience, it may even be deliberate to loosen definitions of terms used by experts, and being precise by leaving a note about that instead, which apparently is exactly what this author did.


It's much better to use vocabulary consistently with what everyone else does in the field. Then you don't need to add footnotes correcting yourself. And if you are not familiar with what everyone else means by correlation, you're very unlikely to be an expert. This is not like that Indian mathematician who reinvented huge chunks of mathematics.


> It's much better to use vocabulary consistently with what everyone else does in the field.

Fine, but...

> And if you are not familiar with what everyone else means by correlation, you're very unlikely to be an expert.

Perhaps, but this is not relevant. If there's a problem with this work, then that problem can be criticized directly. There is no need, and it is not useful, to infer "expertise" by indirect means.


What is an appropriate measure of (in)dependence though, if not Pearson correlation? Such that you feed a scatter plot into the formula for this measure, and if the measure returns 0 dependence, the variables are independent.


It's a tough problem.

There are various schemes for estimating mutual information from samples. If you do that and the mutual information is very close to zero, then I guess you can claim the two RVs are independent. But these estimators are pretty noisy and also often computationally frustrating (the ones I'm familiar with require doing a bunch of nearest-neighbor searches between all the points).
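
For a rough sense of what this looks like, here's a crude histogram-based estimate in base R (my own sketch; serious work would use a proper estimator, e.g. a nearest-neighbor one, since binning biases the estimate upward):

    binned_mi <- function(x, y, bins = 20) {
      p  <- table(cut(x, bins), cut(y, bins)) / length(x)  # joint probabilities
      px <- rowSums(p); py <- colSums(p)                   # marginals
      nz <- p > 0
      sum(p[nz] * log(p[nz] / outer(px, py)[nz]))          # MI in nats
    }

    x <- runif(1e4, -1, 1)
    binned_mi(x, x^2)            # clearly positive: dependent but uncorrelated
    binned_mi(x, runif(1e4))     # near 0, up to binning bias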

I agree with the OP that it's better to say "non-independence" and avoid confusion; at the same time, I disagree that linear correlation is actually the standard definition. In many fields, especially those where nobody ever expects linear relationships, it is not, and everybody uses "correlated" to mean "not independent".


Yeah. It would be simpler to talk about causal graphs if the nodes represented only events instead of arbitrary variables, because independence between events is much simpler to determine: X and Y are independent iff P(X) * P(Y) = P(X and Y). For events there also exists a measure of dependence: The so-called odds ratio. It is not influenced by the marginal probabilities, unlike Pearson correlation (called "phi coefficient" for events) or pointwise mutual information. Of course in practice events are usually not a possible simplification.
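
A small R illustration of that invariance (toy counts of my own): scaling one row of a 2x2 table, which changes a marginal, leaves the odds ratio fixed but moves the phi coefficient.

    odds_ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])
    phi <- function(m) {
      (m[1, 1] * m[2, 2] - m[1, 2] * m[2, 1]) /
        sqrt(prod(rowSums(m)) * prod(colSums(m)))
    }

    m  <- matrix(c(30, 10, 20, 40), 2, 2)   # joint counts for two events
    m2 <- m; m2[1, ] <- m2[1, ] * 10        # same dependence, different marginal

    c(odds_ratio(m), odds_ratio(m2))   # identical: 6 and 6
    c(phi(m), phi(m2))                 # changes with the marginals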


This is a separate issue and also a good point. Correlation sometimes means “Pearson’s correlation coefficient” and sometimes means “anything but completely independent” and it’s often unclear. In this context I mean the latter.


> E.g. a U shape in the scatterplot may have a zero correlation coefficient but is not conditionally independent.

Ok this is correct, but has nothing to do with causality. Whether or not two variables are correlated and whether or not they are independent, and when one does or doesn't imply the other, is a conversation that can be had without resorting to the concept of causality at all. And in fact that's how the subject is taught at an introductory level basically 100% of the time.


> Ok this is correct, but has nothing to do with causality.

It does. Dependence and independence have a lot to do with causation, as the article explains.

> Whether or not two variables are correlated and whether or not they are independent, and when one does or doesn't imply the other, is a conversation that can be had without resorting to the concept of causality at all.

Yes, but this is irrelevant. It's like saying "whether or not someone is married is a conversation that can be had without resorting to the concept of a bachelor at all".

You can talk about (in)dependence without talking about causation, but you can't talk in detail about causation without talking about (in)dependence.


For anyone else who went down a rabbit hole - this paper describes the problem control systems present for these methodologies: https://www.sciencedirect.com/science/article/abs/pii/B97801...

(paywalled link, but it's available on a well-known useful website)


> Independent variables are not correlated

But it's important to remember that dependent variables can also be uncorrelated. That is, zero correlation does not imply independence.

Consider this trivial case:

    X ~ Uniform(-1,1)
    Y = X^2
    Cor(X,Y) = 0

Despite the fact that Y's value is absolutely determined by the value of X.


The author is using "correlation" in a somewhat non-standard way. He isn't referring to linear correlation as you are, but any sort of nonzero mutual information between the two variables. So in his usage those two variables are "correlated" in your example.


This is also why it's important to look at your plots. Because simply looking at your scatter plot makes it really obvious what methods you can't use, even if it doesn't really tell you anything about what you should use.


Humble reminder of how easy R is to use. Download and install R for your operating system: https://cran.r-project.org/bin/

Start it in the terminal by typing:

    R
Copy/paste the code from the article to see it run!


R is my least favorite language to use, thanks to the uni courses that force it

https://github.com/ReeceGoding/Frustration-One-Year-With-R


Can't use R without RStudio. It's so much better than the terminal.


Agree RStudio makes R a dream, but isn't necessary for someone to run the code in the article =)


really??? I've developed in R for over a decade using two terminal windows. One runs vim, the other runs R. Keyboard shortcuts to send R code from vim to R.

First Google hit if you want to try this yourself: https://www.freecodecamp.org/news/turning-vim-into-an-r-ide-...

Sooooooo much better than "notebooks". Hating on "notebooks" today.


>Humble reminder of how easy R is to use.

I had to learn R for a statistics course. This was a long time ago. But coming from a programming background I never found any other mainstream language as hard to grok as R.

Has this become better? Is it just me that doesn't get it?


Are the assumptions "No spurious correlation", "Consistency", and "Exchangeability" ever actually true? If a dataset's big enough you should generally be able to find at least one weird correlation, and the others are limits of doing statistics in the real world.


Some situations guarantee certain assumptions: Randomization, for example, guarantees exchangeability.
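
A sketch of why (hypothetical numbers): an unmeasured confounder biases the naive observational contrast, but randomizing treatment deletes the arrow from the confounder into treatment, so the simple difference in means recovers the true effect.

    set.seed(6)
    n <- 1e5
    u <- rnorm(n)                               # unmeasured confounder (severity)
    trt_obs <- rbinom(n, 1, plogis(u))          # sicker people get treated more often
    y_obs   <- 1 * trt_obs + 2 * u + rnorm(n)   # true treatment effect is 1

    mean(y_obs[trt_obs == 1]) - mean(y_obs[trt_obs == 0])   # badly biased upward

    trt_rct <- rbinom(n, 1, 0.5)                # randomized assignment
    y_rct   <- 1 * trt_rct + 2 * u + rnorm(n)
    mean(y_rct[trt_rct == 1]) - mean(y_rct[trt_rct == 0])   # ~1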


I'm keeping this link, taking a backup, and handing it out whenever I can. It is succinct and effective.

These are concepts I find myself constantly having to explain and teach, and they are critical to problem solving.


I highly suggest this paper here for a more complete view of causality that nests do-calculus (at least in economics):

Heckman, JJ and Pinto, R. (2024): “Econometric causality: The central role of thought experiments”, Journal of Econometrics, v.243, n.1-2.


Why should you look this paper up? It argues that certain approaches from statistics and computer science are limited, and (essentially) that economists have a better approach. YMMV, but the criticisms are specific (whether or not you buy the "fix").

From the paper:

> Each of the recent approaches holds value for limited classes of problems. [...] The danger lies in the sole reliance on these tools, which eliminates serious consideration of important policy and interpretation questions. We highlight the flexibility and adaptability of the econometric approach to causality, contrasting it with the limitations of other causal frameworks.


> Rule 8: Controlling for a causal descendant (partially) controls for the ancestor

perhaps this is a quaint or wildly off base question, but an honest one, please forgive any ignorance:

Isn't this essentially defining the partial derivative? Should one arrive at the calculus definition of a partial derivative by following this?


You probably could if you interpret that sentence very creatively. But I think it's useful to remember that this is mathematics, and words like "control", "descendant" and "ancestor" have specific technical meanings (all defined in the article, I believe).

The technical meaning of that sentence has to do with probability theory (probability distributions, correlation, conditionals), and not so much calculus (differentiable functions, limits, continuity).


This is missing my favourite rule.

0. The directions of all arrows not part of a collider are statistically meaningless.
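
For two Gaussian variables this is easy to demonstrate in R (my sketch): X -> Y and Y -> X with matched parameters produce the same joint distribution, so no observational test can distinguish the arrow's direction.

    set.seed(7)
    n <- 1e5
    # Model A: X causes Y
    xa <- rnorm(n)
    ya <- xa + rnorm(n)
    # Model B: Y causes X, with parameters chosen to match Model A's joint
    yb <- rnorm(n, sd = sqrt(2))
    xb <- 0.5 * yb + rnorm(n, sd = sqrt(0.5))

    c(var(xa), var(ya), cor(xa, ya))   # ~1, ~2, ~0.71
    c(var(xb), var(yb), cor(xb, yb))   # the same, with the arrow reversed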


What's not part of a collider? Good luck with your memory in that case.


I'm guessing they mean that given a bunch of correlated nodes but no collider (in which case the causal graph must be a tree of some sort), you not only don't know whether the tree is bushy or linear, you don't even know which node may be the root.

(bushy trees, of which there are very many compared with linear ones, would be an instance of Gwern's model* of confounds being [much] more common than causality?)

* https://news.ycombinator.com/item?id=41291636


Right, but your memory functions as a collider; if there are literally no colliders anywhere, you by definition won't be able to remember anything.


This is brilliant. The whole causal inference thing is something I only came across after university, either I missed it or it is a hole in the curriculum, because it seems incredibly fundamental to our understanding of the world.

The thing that made me read into it was a quite interesting sentence from LessWrong, saying that the common idea that correlation does not imply causation is actually wrong. Now it's not wrong in the face-value sense; it's wrong in the sense that you actually can use correlations to learn something about causation, and there turns out to be a whole field of study here.


When did you go to university? The terminology here came from Pearl 2000, and it probably took years and years after that to diffuse out.


I thought Pearl was writing from 1984 onwards?

I was at university around the millennium.


Causality (2000) made the topic accessible (to students and lecturers) as a single book.


Rigorous causal inference methods are just now starting to diffuse into the undergraduate curriculum, after gradually becoming part of the mainstream in a lot of social science fields. But this is just happening.

Judea Pearl is in some respects a little grandiose, but I think he is right to express shock that it took almost a century to develop to this point, given how long the basic tools of probability and statistics have been fairly mature.


"correlation does not imply causation is wrong"

That's a specific instance of a more general problem with the "logical fallacies", which is that most of them are written to be true in an absolutist, Aristotelian frame. It is true that if two things are correlated you cannot therefore infer a rigidly 100% chance that there is a causative relationship there. And that's how Aristotelian logic works; everything is either True or False, and if there is anything else it is at most "Indeterminate", and there are absolutely, positively, no in-betweens or probabilities or anything else.

However, consider the canonical "logical fallacy":

    1. A -> B.
    2. B
    3. Therefore, A.
It is absolutely a logical fallacy in the Aristotelian sense. Just because B is there does not mean A is. However, probabilistically, if you are uncertain about A, the presence of B can be used to update your expected probability of A. After all, this is exactly what Bayes' rule is for!
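
To put numbers on that (made-up priors and likelihoods), here's the update in R:

    p_a      <- 0.10    # prior P(A)
    p_b_a    <- 0.90    # P(B | A): A reliably produces B
    p_b_nota <- 0.20    # P(B | not A): B has other causes too

    p_b   <- p_b_a * p_a + p_b_nota * (1 - p_a)   # total probability of B
    p_a_b <- p_b_a * p_a / p_b                    # Bayes' rule
    p_a_b   # ~0.33: seeing B more than triples the probability of A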

Many of the "fallacies" can be rewritten to be useful probabilistically, and aren't quite as fallacious as their many internet devotees fancy.

It is certainly reasonable to be "suspicious" about correlations. There often is a "there" there. Of course, whether you can ever figure out what the "there" is is quite a different question; https://gwern.net/everything really gets in your way. (I also recommend https://gwern.net/causality ).

The upshot is basically 1. the glib dismissal that correlation != causation is, well, too glib and throws away too many things but 2. it is still true you still generally can't assume it either. The reality of the situation is exceedingly complicated.


I don't disagree with the substance of your comment, but want to clarify something.

Lesswrong promulgated a seriously misleading view of Aristotle as some fussy logician who never observed reality and was unaware of probability, chance, the unknown, and so on. It is entirely false. Aristotle repeats, again and again and again, that we can only seek the degree of certainty that is appropriate for a given subject matter. In the Ethics, perhaps his most-read work, he says this, or something like it, at least five times.

I mention this because your association of the words "absolutist" and "Aristotelian" suggests your comment may have been influenced by this.

ISTM that there are two entirely different discussions taking place here, not opposed to each other. "Aristotelian" logic tends to be more concerned with ontology -- measles causes spots, therefore if he has measles, then he will have spots. Whereas the question of probability is entirely epistemological -- we know he has spots, which may indicate he has measles, but given everything else we know about his history and situation this seems unlikely; let's investigate further. Both describe reality, and both are useful.

So the fallacies are entirely fallacious: I don't think your point gainsays this. But I agree that, to us, B may suggest A, and it is then that the question of probability comes into play.

Aquinas, who was obviously greatly influenced by Aristotle, makes a similar point somewhere IIRC (I think in SCG when he's explaining why the ontological argument for God's existence fails), so it's not as if this is a new discovery.


I consider Aristotelian logic to be a category. It is the Newtonian physics of the logic world; if your fancier logic doesn't have some sort of correspondence principle to Aristotelian logic, something has probably gone wrong. (Or you're so far out in whacky logic land you've left correspondence to the real universe behind you. More power to you, as long as you are aware you've done that.) And like Newton, being the first to figure it out labels Aristotle as a certifiable genius.

See also Euclid; the fact that his geometry turns out not to be The Geometry does not diminish what it means to have blazed that trail. And it took centuries for anyone to find an alternative; that's quite an accomplishment.

If I have a backhanded criticism hiding in my comment, it actually isn't pointed at Aristotle, but at the school system that may teach some super basic logic at some point and accidentally teaches people that's all logic is, in much the same way stats class accidentally teaches people that everything is uniformly randomly distributed (because it makes the homework problems easier, which is legitimately true, but does reduce the education's value in the real world), leaving people fairly vulnerable to the lists of fallacies they may find on the internet and unequipped to realize that they only apply in certain ways, in certain cases. I don't know that I've ever seen such a list where they point out that they have some validity in a probabilistic sense. There's also the fallacies that are just plain fallacious even so, but I don't generally see them segmented off or anything.


> ...it took centuries for anyone to find an alternative...

Pedantry: s/centuries/millennia/ (roughly 21 of the former, 2 of the latter?)

EDIT: does anyone remember the quote about problems patiently waiting for our understanding to improve?


I liked the way Pearl phrased it originally: a calculus of anti-correlations implies causation. That makes the nature of the analysis clear and doesn't set off the classic mind's alarm bells.


Unfortunately this calculus is exceedingly complicated, and I haven't even seen a definition of "A causes B" in terms of this calculus. One problem is that Pearl and others make use of the notion of "d-separation". This allows for elegant proofs but is hard to understand. I once found a paper which replaced d-separation with equivalent but more intuitive assumptions about common causes, but I have since forgotten the source.

By the way, there is also an alternative to causal graphs, namely "finite factored sets" by Scott Garrabrant. Probably more alternatives exist. Though I don't know more about (dis)advantages.


Great post. It's nice that these rules can be trivially demonstrated by simulation. The simulation (and visuals) helps validate the concepts.


Is there a simple R example for Rule 4?


It is sort of tautological:

    set.seed(42)  # fix the RNG so the correlations below are reproducible
    # variable A has three causes: C1,C2,C3
    C1 <- rnorm(100)
    C2 <- rnorm(100)
    C3 <- rnorm(100)

    A <- ifelse(C1 + C2 + C3 > 1, 1, 0)

    cor(A, C1)
    cor(A, C2)
    cor(A, C3)

    # If we set the values of A ourselves...
    A <- sample(c(1,0), 100, replace=TRUE)

    # then A no longer has correlation with its natural causes

    cor(A, C1)
    cor(A, C2)
    cor(A, C3)




