This statistical test for causation (X->Y) is based on the idea that X and Y each contain noise - noise present in X flows causally to Y but noise present in Y won't flow back to X.
But, even if true, it isn't clear that this makes for a good test. For example, it's plausible that Y could have a damping effect and remove noise, which would reverse the results of the test.
"They say the additive noise model is up to 80 per cent accurate in correctly determining cause-and-effect."
This has been exaggerated by Medium from "accuracies between 65% and 80%" in the original article.
But a coin-flip model should be 50% accurate. 65% accuracy is unconvincing. The journal article's conclusion admits that their results are not statistically significant in any sense. As such, the results do not even meet the weakest possible scientific standard. They couldn't reproduce earlier published results in this field (typical of publication bias).
Their final paragraph concludes that there is surely a method of doing this, but they just haven't found that method here.
In my opinion, the results do not support that conclusion.
I think we are talking about independent samples of X and Y, not about time series. If X causes Y, then you model y = f(x, u), where u is a random variable independent of X (think: unexplained variation, e.g. noise). I don't think there can be any dampening effect in this setup. The model is generic: you can find an f(x, u) for any relationship between X and Y. But you may get a much simpler noise model (like additive Gaussian noise) in one direction. That's no proof, but it's a strong hint (Occam's razor). There is also the family of algorithms like IC* and FCI that can recover a causal graph from the statistical dependencies between random variables. As output you get the set of causal graphs that are still plausible given the observed dependencies, including constraints on the presence or absence of latent common causes.
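For concreteness, here is a rough R sketch of that kind of bivariate check, loosely following the paper's ANM-HSIC idea: regress nonparametrically in each direction, then score how dependent the residuals are on the putative cause with an empirical HSIC statistic. The loess fit, the median-heuristic bandwidths and the toy data below are my own simplifications, not the paper's exact procedure.

    # Empirical HSIC statistic with Gaussian kernels; smaller = "more independent".
    hsic <- function(a, b) {
      n  <- length(a)
      sa <- median(as.vector(dist(a)))   # median-heuristic bandwidth for a
      sb <- median(as.vector(dist(b)))   # median-heuristic bandwidth for b
      K  <- exp(-as.matrix(dist(a))^2 / (2 * sa^2))
      L  <- exp(-as.matrix(dist(b))^2 / (2 * sb^2))
      H  <- diag(n) - matrix(1 / n, n, n)              # centering matrix
      sum((H %*% K %*% H) * L) / n^2                   # (1/n^2) * trace(KHLH)
    }

    # Score a candidate direction: fit effect = f(cause) + residual, then measure
    # how dependent the residuals are on the cause.
    anm_score <- function(cause, effect) {
      r <- effect - predict(loess(effect ~ cause))
      hsic(cause, r)
    }

    # Toy data where X -> Y with additive noise.
    set.seed(1)
    x <- rnorm(300)
    y <- x^3 + rnorm(300)

    anm_score(x, y)   # typically the smaller score: residuals look independent of x
    anm_score(y, x)   # wrong direction: residuals typically stay dependent on y

The direction with the smaller score is taken as the plausible causal one; in the purely linear-Gaussian case the two scores come out essentially the same, which is the known unidentifiable case for this family of methods.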
> But, even if true, it isn't clear that this makes for a good test. For example, it's plausible that Y could have a damping effect and remove noise, which would reverse the results of the test.
It would still be a good test of bivariate causation even if it doesn't handle the case of feedback well.
> The journal article's conclusion admits that their results are not statistically significant in any sense.
...and in the same paragraph, they point out that a standard Bonferroni correction would be inappropriate and very conservative, because many of the methods are very similar & the results correlated, and that almost all the variant methodologies (except 1 or 2) do >50%. (If you really want to be a smart-ass about this point, please calculate the odds of almost all the variants performing >50%.)
The Bonferroni part of the paper is horrible; I almost picked up on that earlier.
By my reading, they don't apply Bonferroni, nor any other correction for multiple testing. The results are already not significant before the correction, so there is no point in making a correction step. Read this part again: they are saying their results are insignificant and would be further away from significance if they had bothered to apply any correction.
Criticising Bonferroni as too conservative for this purpose might be fashionable, but it's irrelevant and highly misleading here (nor is it evidenced). They haven't applied Bonferroni and they haven't identified an appropriate replacement - this is an irrelevant comment that only serves to distract.
> The results are already not significant before the correction, so there is no point in making a correction step. Read this part again: they are saying their results are insignificant and would be further away from significance if they had bothered to apply any correction.
Give me a break. Let me go back and quote from the paper:
"Whether the performance of a method is significantly better than random guessing is not so clear-cut. For example, the p-value for ANM-pHSIC under the null hypothesis of random guessing is 0.068 on the CEP benchmark, only slightly larger than the popular α = 5% threshold. However, because we tested many different methods, one should correct for multiple testing. Using a conservative Bonferroni correction, the results on a single data set would be far from being significant. In other words, from this study we cannot conclude that the ANM-pHSIC method significantly outperforms random guessing on the CauseEffectPairs benchmark. However, the Bonferroni correction would be overly conservative for our purposes, as many of the methods that we compare are small variations of each other, and their results are clearly dependent. In addition, good performance across data sets increases the significance of the results. Although a proper quantitative evaluation of the significance of our results is a nontrivial exercise, we believe that when combining the results on the CEP benchmark with those on the simulated data, the hypothesis that the good performance of the methods ANM-pHSIC, ANM-HSIC, ANM-PSD and ANM-MML is only due to chance is implausible."
A p-value of 0.068 is not horrible, and they do not say it is the smallest p-value obtained. And since they test multiple methods under different conditions, Bonferroni would divide 0.05 by combinatorially many tests, which is absurd because these p-values are clearly highly correlated, making a plain Bonferroni correction needlessly conservative here - so the point is not irrelevant at all (and you saying it is suggests you don't understand how Bonferroni behaves in this setting).
In fact, you're not even right about "The results are already not significant before the correction" - ANM-pHSIC was almost statistically significant at the usual threshold, and it wasn't even the best performer on the full unmodified dataset; that was ANM-HSIC (see figure 11, pg28, left table; or pg27, figure 9).
A more reasonable interpretation would be, rather than shoehorning in a Bonferroni where it has no business, to ask whether there's an overall difference of the various algorithms from baseline chance of 50-50, with something like an ANOVA.
How would that work? Well, we can estimate the fraction right by eye, use the known 88 pairs, and estimate from there. (There's probably a table of exact results somewhere but I didn't see it.) What is the p-value of all the algorithms, lumped together? Does it suggest outperformance? Why yes it does:
    data: sum(cep$Correct) and sum(cep$N)
    number of successes = 633, number of trials = 1144, p-value = 0.0003429
    alternative hypothesis: true probability of success is not equal to 0.5
    95 percent confidence interval:
     0.523970021 0.582398490
    sample estimates:
    probability of success
             0.553321678
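(Output in that format comes from R's exact binomial test; a call along these lines produces it, assuming a data frame named cep with one row per algorithm and columns Correct and N - those names are my guess from the printout, not taken from the paper.)

    # Pooled binomial test: do the algorithms, taken together, beat 50% on the
    # cause-effect pairs? `cep` is assumed to hold one row per algorithm with
    # the number of correct decisions (Correct) and pairs attempted (N).
    binom.test(sum(cep$Correct), sum(cep$N), p = 0.5)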
So even including all the crappy algorithms, they still outperform. An ANOVA would show the same thing. As the authors say: "the hypothesis that the good performance of the methods ANM-pHSIC, ANM-HSIC, ANM-PSD and ANM-MML is only due to chance is implausible."
They don't have that many cause-effect pairs in their dataset (~80, IIRC), so it could be considered insufficient data depending on how skeptical you are. But you don't really need to do a meta-analysis: the cause-effect pair collection is available, so if you dig up more, you can simply rerun the algorithms and see what the new results are with more pairs available.
I don't know how hard it would be to collect more examples: on the one hand, any randomized experiment should be a source of such examples, but on the other hand, you might want datasets large enough to definitively establish a causal effect before you go using it as a ground-truth/benchmark in statistical studies like this (and larger datasets tend to be harder to get the individual datapoints for).
This isn't as generally useful as the title suggests... due to these assumptions:
"that X and Y are dependent (i.e., PXY=PXPY), there is no confounding (common cause of X and Y), no selection bias (common effect of X and Y that is implicitly conditioned on), and no feedback between X and Y (a two-way causal relationship between X and Y)"
> This isn't as generally useful as the title suggests...
My understanding is that the more general case of causal discovery, where you have lots of variables, is already somewhat solved by Pearlian techniques (if you have A/B/C you can condition on each variable and see which graph fits best - all independent, C confounding A & B, A -> B -> C, etc.; sketched below). But these techniques break down when it's just A and B: A->B and B->A look symmetrical - the problem is so simple that it's hard, because there's no C to help break the symmetry and suggest something about the underlying causal graph.
OP is interesting because it points out that even without a C, A and B often aren't symmetrical in one particular way. So now pairs can be attacked as well as triplets and bigger; and maybe it points the way to new ways to tackle more complicated datasets.
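To make the triplet point concrete, here is a small base-R illustration (my own toy example, not from the paper): with three variables you can test conditional independencies and narrow down the graph; with only two there is nothing to condition on.

    # Chain A -> B -> C: A and C are dependent, but independent given B.
    set.seed(1)
    n <- 2000
    A <- rnorm(n)
    B <- A + rnorm(n)
    C <- B + rnorm(n)

    cor.test(A, C)$p.value          # tiny: A and C are marginally dependent

    # Partial correlation of A and C given B, via residuals:
    rA <- resid(lm(A ~ B))
    rC <- resid(lm(C ~ B))
    cor.test(rA, rC)$p.value        # large: A and C independent given B, which is
                                    # consistent with A -> B -> C (and its Markov
                                    # equivalents), but not with B being a common
                                    # effect of A and C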
I must say, I do not understand why they have to assume that there is no common cause. What you're looking for is an asymmetry in the bleeding of noise from one variable into another, right? I.e., if your hypothesis is that A causes B, then you want to confirm that noise in A also appears in B, but that noise in B does not also appear in A.
The fact that there is a common source of noise in both might make it more difficult to detect this asymmetry (because it weakens your characterization of A or B specific noise) but, on their assumption that noise is typically additive, wouldn't it still be theoretically possible to detect the asymmetry you're looking for? I.e., if A and B have common cause X, then noise from X will appear in A and B, but A noise will not appear in B, and B noise will not appear in A (assuming that A and B also have intrinsic noise sources in addition to the noise they inherit from X). Symmetry is maintained.
Maybe because you cannot guarantee that noise will bleed?
If you detect bleeding, then great, you have a result.
If you do not detect bleeding, then it could be that there is a confounding variable, or it could just be that the noise doesn't bleed. Similarly, if there is a confounding variable then you cannot guarantee that the noise will bleed to both sets of observations. You have no way of knowing. Make sense?
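One way to see the confounded case is a small simulation (entirely my own construction; it assumes the 'energy' package is installed and uses its distance-covariance independence test as a stand-in for HSIC): A and B share a hidden cause Z, neither causes the other, and the regress-and-check-residuals recipe is applied in both directions.

    # A and B share a hidden common cause Z; neither causes the other.
    library(energy)
    set.seed(1)
    n <- 300
    Z <- rnorm(n)
    A <- Z^2 + 0.3 * rnorm(n)
    B <- tanh(2 * Z) + 0.3 * rnorm(n)

    # Residual-independence check for one candidate direction.
    check_direction <- function(cause, effect) {
      r <- effect - predict(loess(effect ~ cause))   # residuals of effect = f(cause) + noise
      dcov.test(cause, r, R = 199)$p.value           # small p = residuals depend on cause
    }

    check_direction(A, B)   # candidate A -> B
    check_direction(B, A)   # candidate B -> A
    # In this nonlinear example both p-values tend to be small: neither direction
    # admits an additive-noise fit, so the method comes back inconclusive rather
    # than confidently picking a wrong direction.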
> Another dataset relates to the daily snowfall at Whistler in Canada and contains measurements of temperature and the total amount of snow. Obviously temperature is one of the causes of the total amount of snow rather than the other way round.
This isn't obvious to me at all.
It's true that rainfall causes trees (and that drought can kill them). But it's less obviously true that trees (in massive numbers) can affect the regional climate enough to cause rain. They do this by pumping water out of the ground and increasing humidity, by changing wind patterns, etc.
When trees cause rain, it's a lesser effect than when rain causes trees, but it's still there.
So when someone tells me that it's obvious that hundreds of thousands of tons of frozen, powdered water lying on the ground doesn't affect the temperature, I have to wonder whether they've really thought it through.
"The key assumption is that the pattern of noise in the cause will be different to the pattern of noise in the effect. That’s because any noise in X can have an influence on Y but not vice versa."[...] "That’s a fascinating outcome. It means that statisticians have good reason to question the received wisdom that it is impossible to determine cause and effect from observational data alone." https://medium.com/the-physics-arxiv-blog/cause-and-effect-t...
Does not rule out shared cause. You observe X and Y and find a correlation. You still have to consider that X and Y are caused by Z. Noise in Z will have echoes in both X and Y.
I don't think it's hype or over stating things to suggest that this may be the most significant advance in practical statistics and methodologies for scientific investigation in years, perhaps decades.
Like many brilliant ideas, it seems so obvious in retrospect, another great "Why didn't I think of that?" moment.
Particularly since the confounding issue is really enormous in science, and it sits at the core of the example the article gives in its introduction... It would be an achievement in itself to build an experiment without confounders.
In econometrics this approach is called "identification through functional form" because it relies on assumptions about the exact distribution of some of the variables.
The main problem is that it requires making assumptions that are very hard or impossible to test. It's an interesting idea nonetheless, but I doubt this method can replace randomized trials or instrumental variables except in a tiny fraction of cases.
>Obviously temperature is one of the causes of the total amount of snow rather than the other way round.
Can someone explain how this is 'obvious'?
How can this be a claimed scientific way to tell cause and effect and then drop a sentence like that in the middle of the explanation?
Even if you accept that it's true that temperature determines snowfall, it seems there is likely some feedback loop in there. The fallen snow doesn't just disappear, wouldn't it affect later measured temperatures? Remove a bunch of (cold) snow from an area and the average temperature of the area should increase faster than if you had left the snow, no?
I think this claim is even more dubious: "Another dataset relates to the cost of rooms and apartments for students and consists of the size of the apartment and the monthly rent. Again it is obvious that the size causes the rent and not vice versa"
I think the mechanisms which set rental prices in a market, and the availability of properties of different sizes to students at different prices, are probably pretty complex functions of supply and demand factors that, themselves, are dependent on behaviors in other markets, like the mortgage market, the higher education loan market, the job market, and so on. You don't even need to talk to an economist; a realtor will tell you the most important factor driving housing costs: location, location and location. That's why two apartments of the same size in New York and Nebraska will have different rental rates.
You can think about causality in terms of intervention. If you make an intervention that changes only the price of an apartment (e.g. you subsidize it), will the size of the apartment change because of that? Obviously not. If you make an intervention that changes only the size of the apartment (and leaves all other mechanisms as they are), will the price of the apartment change? Most likely it will.
Nice puzzle. Keep in mind that OP is not interested in finding all causes (like the availability of concrete or the size of the rooms), only in checking the causal link between two observed variables. I think the solution is this: to check if there is a causal link from price to size, you need to make an intervention on the price, not on the buyer. (Unless you are sure your manipulation of the buyer can affect the size only through its effect on the price.) The (price, size) dataset in the paper is from a website for students who look for an apartment. Your manipulated student who is suddenly willing to pay more will probably rent a larger apartment, maybe even cause a new large apartment to be built, and prices may shift a tiny bit. But you largely failed to manipulate the price of a particular data point. The apartment that your student would have rented is still in the data set and its price is still largely influenced by the number of rooms, and the chance that it has gotten a room added is so tiny that it won't register in the data. (The causal link is there, I agree, but it's as weak as a butterfly.)
Why would you be sampling at a single arbitrary point like that?
I'm interested in how things diverge at all scales.
Imagine two identical snowing days, except on one of the days you remove the 6 feet of snow on the mountain. There wouldn't be any difference in the amount of snow that falls? Or the local and global temperatures?
I think it depends on the variance of the temperature across the entire space. If a place has been very cold (below freezing temperature of the water content) for a long period of time, then you might expect all of the water to have precipitated.
However, in many real situations on earth, the temperatures often fluctuate right around the freezing point of the water in the space.
This strikes me as a fairly useless test, because it only works in situations where you are sure there are only 2 variables, and you're trying to determine which one is dependent. Such a situation only happens in a carefully controlled experiment--and in those situations, you can easily determine causation by creating counterfactual tests.
What people really want to know is whether statistics alone can be used to exclude hidden shared causes from an uncontrolled data set. Even the article itself uses such an example: the impact of hormone replacement on heart disease.
This test does not further that goal. I remain convinced that it is impossible. In fact to my understanding, that is the origin of the scientific method: rather than accepting conclusions from the first data set, science constructs hypotheses and tests to exclude hidden causes.
Ideally, upon finding correlated variables, one would perform an experiment, changing one to see if it causes the other(s) to change. Looking at noise enables the same principle to be applied when the researcher lacks the ability to perform such an experiment.
It's like all statistical tests -- it works really well (provably well) when the assumptions it requires hold. However, it's usually impossible to know if those assumptions hold without holding the desired answer in the first place. That's why nonparametric tests are so popular (not saying they have much to do with the article at hand, but people are definitely willing to get less definitive results in exchange for making fewer assumptions).
Nice article. I think the idea of testing whether X caused Y by exploiting the fact that the relationship is not symmetrical has also been used in the "pseudo-causality" Granger causality test (a bare-bones sketch below): http://en.wikipedia.org/wiki/Granger_causality
Also, causality in reality can be quite complicated if there are feedback loops: X-causes-Y-causes-X.
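For reference, here is a bare-bones version of what a Granger-style check does (my own illustration, toy series and a single lag only): ask whether x's past improves the prediction of y beyond y's own past.

    # Minimal Granger-style check with one lag. Real analyses need stationarity
    # checks, lag selection, etc.
    set.seed(1)
    n <- 500
    x <- arima.sim(list(ar = 0.5), n)
    y <- 0.8 * c(0, head(x, -1)) + arima.sim(list(ar = 0.3), n)   # y driven by lagged x

    d <- data.frame(
      y     = y[-1],    # y at time t
      y_lag = y[-n],    # y at time t-1
      x_lag = x[-n]     # x at time t-1
    )

    restricted   <- lm(y ~ y_lag, data = d)           # y's own past only
    unrestricted <- lm(y ~ y_lag + x_lag, data = d)   # plus x's past
    anova(restricted, unrestricted)                   # small p-value: x "Granger-causes" y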
Reminds me of http://www.pnas.org/content/104/16/6533.full - interesting, but probably only applicable to very simple systems. If you have various complex interconnections between components, simple A -> B reasoning is not helpful.
http://noosphere.princeton.edu/ - The Global Consciousness Project
"When human consciousness becomes coherent, the behavior of random systems may change. Random number generators (RNGs) based on quantum tunneling produce completely unpredictable sequences of zeroes and ones. But when a great event synchronizes the feelings of millions of people, our network of RNGs becomes subtly structured. We calculate one in a trillion odds that the effect is due to chance. The evidence suggests an emerging noosphere or the unifying field of consciousness described by sages in all cultures."
Looks pretty, shall we say, wacky, but they are supposedly finding correlations in quantum RNGs.
> Concluding, our results provide evidence that distinguishing cause from effect is indeed possible by exploiting certain statistical patterns in the observational data. However, the performance of current state-of-the-art bivariate causal methods still has to be improved further in order to enable practical applications.
We changed the URL from http://arxiv.org/pdf/1412.3773v1.pdf because, with some exceptions (such as computing), HN tends to prefer the highest-quality general-interest article on a topic with the paper linked in comments.
This comes up often enough that it is a good case for linking related URLs together, which is something we intend to work on in the new year.
I think with science that can be a dangerous preference, because of the tendency of the general-interest articles to overstate (and/or incorrectly state) the results. This one is better than most, especially compared to the junk that comes out of university press offices. But it still somewhat oversells the result compared to the more modest claims of the paper. The paper's introduction and conclusion are pretty accessible, imo, and a better representation of the findings than the Medium blogpost is.
That's why I said "tends to prefer". There's no rigid policy—it's a lot of case by case, and feedback from HN users makes the biggest difference in deciding them. You may be right on this one, but it's a marginal call, and the HN comments have clarified the situation quite a bit either way.
There is indeed an industry based on blowing scientific findings out of all proportion, and we do a lot to keep most of that at bay. But the solution isn't as simple as always linking to the papers, and there's no hope of keeping HN immune from this systemic dysfunction, only of correcting its worst excesses.
Thank you. Yes, that reflects my preference. I come to HN for deep interesting information that's also well-summarized and well-discussed. I may or may not always read through to the journal article, depending on my interest in the area, but I value the article link, the summary, and the discussion.
A high quality summary is important. For example, I would not grasp the significance (in the general case) of an HN post linking to an Arxiv paper titled "The entropy formula for the Ricci flow and its geometric applications". The summary will tell me that this is claimed to be (part of?) a proof of the Poincaré conjecture.
Interesting: a year ago this was one of the challenges at Kaggle. Given a set of sample pairs, determine which one of them (if any) is causing the others.