The claimed effect size is about a zillion times higher than is plausible (columbia.edu)
275 points by luu on Feb 25, 2022 | 61 comments



One interesting issue with plausible effect sizes is that they are very context dependent.

During a graduate seminar, someone asked for a show of hands for people who thought a relative risk of 1.10 (i.e. the exposed are 10% more likely to develop a disease than the unexposed) was a big deal.

Very few of the infectious disease epidemiologists raised their hands. We are often used to double-digit effect sizes for our risk factors (HPV and cervical cancer, a particular exposure in a foodborne outbreak, etc.)

The cancer epidemiology folks were like "Yeah, that's a pretty big deal, and worth digging into."

The environmental epidemiology folks were like "Are you kidding? There are going to be lawsuits. Congressional hearings. Maybe arrests."


It's true that effect sizes are very context dependent. Heck, you see it in technology as well. Sometimes, it's worth paying someone to spend a week to optimize an instruction or two out of code. Other times, the idea that you would spend even an extra 30 seconds looking for that optimization is heresy.


That follows from Amdahl's law, doesn't it? Speeding up the most time-consuming part by a little is worth a lot more than speeding up a tiny part of the execution time a lot.
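
For reference, a minimal sketch of the formula (my own illustration; the numbers are made up):

    # Amdahl's law: overall speedup when a fraction p of the runtime
    # is sped up by a factor s. Illustrative numbers only.
    def amdahl_speedup(p, s):
        return 1.0 / ((1.0 - p) + p / s)

    print(amdahl_speedup(0.80, 2))    # ~1.67x overall: 80% of runtime made 2x faster
    print(amdahl_speedup(0.05, 100))  # ~1.05x overall: 5% of runtime made 100x faster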


It doesn't follow from Amdahl's law directly. Amdahl's law sets an upper limit on how much you can gain in total efficiency from effort. But I actually was speaking about how valuable the effort is. Is it the control loop on a self-landing rocket, a rendering engine for a video game or HFT algorithms? Then the efficiency is probably very valuable in its effects. Is it an app for a dating startup? No one will ever care if you could have shaved a millisecond off the match time.


I suspect such sites have a timer to ensure they take longer so people think it is a lot of work to find a match.


Like TurboTax’s fake progress bars making it look like it’s “trying hard” to get you your maximum refund: https://medium.com/the-billfold/the-progress-bars-in-turbota...


Yeah, before switching to taxact I did my taxes by hand, even itemized deductions; the computer doesn't save time, and there isn't actually much money to save for most people anymore. I only do it by computer now because I once forgot to copy line 13b of form 2345 to line 47d of some other form. The resulting mess was annoying to take care of, not hard, but annoying.


It can, but optimization can mean more than just speed.


This holds true for any optimization process. You have more money to save cutting costs by a little from your biggest expense than cutting costs by a lot from a small expense.


Your 2nd sentence is very difficult to parse; not nit-picking, just hoping to inspire you to reword for clarity.


I enjoyed the target paper that Gelman was posting about (as is common with Gelman, you have to wade through his comments to find the work of others buried in the post). It was interesting and made a strong case for the use of more interesting controls in behavioral research.

After I thought about it for a bit though, I had some concerns. I read another paper on a totally different topic where they generated a sort of negative control, and although that paper was interesting too, a problem with it was that it was really easy to argue that the negative control wasn't actually a control, that it was set up in such a way as to inadvertently induce the thing they were trying to eliminate.

I kinda wondered about this paper too. They seem to assume you'll go along with the idea that how you imagine a video game character to act will induce the most extreme effect size you could imagine. But is how you report you'll act even remotely comparable to how you imagine a video game character will react? I'm not really sure. I could imagine some people being like "WTH? I have no idea" and blowing the task off, or seeing the video game character as incoherent, such that any behavior would be consistent with it. People's own behavior can be difficult to predict from what they say, so it seems like a step even further removed to put any weight on what they say a video game character would do.

I guess it's an interesting idea that deserves more attention, but in my experience, psychological controls are notoriously difficult to implement, which is why you don't see them used more often.


The issue here is that infectious disease epidemiologists don't think anything of a large effect size because they're all about models that constantly mis-predict huge effect sizes. They should be surprised by a large effect, but because they never update their beliefs in response to model failure, they think (predicted but unreal) effect sizes are no big deal.


That's not really true. The example I used was a very large effect size, because it's causal in a way few things are (there's a reason you can use Koch's postulates in infectious diseases, but rarely in anything else).

And these are all things estimated from data, using standard, boring statistical methods like logistic regression.

Dynamical systems models, which are what you're talking about, are a whole different field, and they don't inherently predict huge effect sizes - the last non-COVID one I was working on, for example, was estimating fairly small effect sizes.


Aren't the unexposed at a risk of zero? Or was that your point?


The unexposed aren't at zero risk, but are often quite close.

For example, cervical cancer without HPV infection is extremely rare, but non-zero. Or someone at the party who didn't eat the potato salad could still come down sick by another pathway.

On the other hand, for things like environmental exposures, you're often looking at very small increases in risk over baseline. For example, your increased risk of asthma with higher levels of particulate matter in the air.


There's no premise that you need to be exposed to develop some disease X. The comparison is between developing disease X while being exposed to something S and developing it without being exposed.

(The example the parent gives is "HPV and cervical cancer" - it's not necessary to have HPV to get the latter, so those not exposed to HPV don't get a 0% chance of that disease).

E.g. say a study finds that if you're exposed to VirusX (or ShouldNotHaveBeenApprovedByFDADrugX, as this applies beyond viruses) you have a 10% higher chance of developing a heart arrhythmia than without being exposed to it (which, if I'm not mistaken, should be the same as saying "as compared to the control group").
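
(A toy calculation with made-up numbers, just to make the "10% more" concrete - it's relative to the baseline risk, not 10 percentage points:)

    # Made-up baseline risk, purely illustrative; not from any study.
    risk_unexposed = 0.010                 # 1.0% risk in the control group
    risk_exposed = 0.011                   # 1.1% risk among the exposed
    print(risk_exposed / risk_unexposed)   # relative risk 1.10 -> "10% more likely"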


A good example of this is one’s cancer risk, versus one’s cancer risk after a radiation dose.

One is non-zero, and the other is non-zero + X.


I suppose the above scenario is probably contrived, but it seems all of those replies fail to grasp the nature of the problem - i.e. none of them attempt to quantify the risk of doing nothing against the cost of doing something. Instead, just opinions/speculation, which is disappointing coming from a room full of doctors.


You can't listen to someone's anecdote intended to convey one thing and assume they included sufficient detail for you to determine whether it also conveys some other thing.


We expect large overestimates of effect size in studies as a function of sample size. In statistical genetics this goes by the memorable name of the Beavis effect (https://pubmed.ncbi.nlm.nih.gov/14704201/)

I once mapped a modulator of lifespan in a small family of mice precisely to a chunk of chromosome 2, but using only 24 BXD family members (https://pubmed.ncbi.nlm.nih.gov/23698443/). The placement of this locus is solid, but the effect size is inflated about 3 to 5X.

In recent work in mice we know the effect sizes for single longevity loci max out at about 30 days per gene/allele, but in the small study above the effect size was 100+ days. Eye opener for me.

Beavis effect evaporates with much larger sample size. In our current research on the genetics of longevity in mice we are up to 6000+ animals and have about a dozen significant hits that collectively explain roughly 125 days of life span difference. Should be on bioRxiv in the next month or two (Arends DA et al.).
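
A rough simulation of the idea (my own sketch with toy numbers, not data or code from the mouse studies above): studies that pass a significance filter overstate the effect, and the overstatement shrinks as the sample size grows.

    # Toy winner's-curse / Beavis-effect simulation; all numbers are made up.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_effect, sd = 0.2, 1.0

    def mean_significant_estimate(n, n_studies=20000, alpha=0.05):
        # Average estimated effect among simulated studies reaching p < alpha.
        x = rng.normal(true_effect, sd, size=(n_studies, n))
        est = x.mean(axis=1)
        se = x.std(axis=1, ddof=1) / np.sqrt(n)
        p = 2 * stats.t.sf(np.abs(est / se), df=n - 1)
        return est[p < alpha].mean()

    for n in (10, 50, 500):
        print(n, round(mean_significant_estimate(n), 2))
    # At n=10 the significant studies report several times the true 0.2;
    # by n=500 the inflation has essentially evaporated.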


Related: http://datacolada.org/18

> we were interested in the question of how many participants a typical between-subjects psychology study needs to have an 80% chance to detect a true effect.

> Turns out you need 47 participants per cell to detect that people who like eggs eat egg salad more often than those who dislike eggs.
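
For anyone who wants to play with that kind of number, a quick sketch (not the datacolada authors' code; the Cohen's d below is just an assumed value for illustration):

    # Required sample size per cell for 80% power at alpha = 0.05, for an
    # assumed standardized effect size d (illustrative, not the value
    # estimated in the datacolada post).
    from statsmodels.stats.power import TTestIndPower

    d = 0.6
    n_per_cell = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(n_per_cell)  # participants needed per cell, two-sided two-sample t-test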


>We expect large overestimates of effect size in studies as a function of sample size. In statistical genetics this goes by the memorable name of the Beavis effect.... Beavis effect evaporates with much larger sample size.

This idea would seem to suggest special restraint when negatively interpreting a lot of A/B test replication. It's common to get a good result with a small sample, retest with a much larger audience, and get a less-impressive result. If I understand correctly, that outcome should not necessarily be discouraging for the success of the experiment. No?


Nothing works against p-values like a large sample, no matter what the power analysis says. -- Folk Science Wisdom 101


Shrinkage would help to some extent, by borrowing information from other variables. This is better done in a Bayesian context with multilevel models.

Also weakly informative priors, which have regularizing effects.

Actually these two things are a big focus area of Stan, which is the probabilistic programming language built by the group which writes that blog.
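
A minimal sketch of the shrinkage idea (conjugate normal-normal with made-up numbers; a real multilevel model in Stan does something richer):

    # A weakly informative prior on a standardized effect pulls an
    # implausibly large, noisy estimate back toward zero. Toy numbers only.
    prior_mean, prior_sd = 0.0, 1.0     # weakly informative prior
    estimate, se = 2.5, 1.2             # noisy estimate from a small study

    w = (1 / se**2) / (1 / se**2 + 1 / prior_sd**2)   # precision weighting
    post_mean = w * estimate + (1 - w) * prior_mean
    post_sd = (1 / se**2 + 1 / prior_sd**2) ** -0.5
    print(post_mean, post_sd)           # ~1.0 +/- 0.77, shrunk toward the prior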


Also relevant: http://daniellakens.blogspot.com.au/2017/07/impossibly-hungr... which applies the same logic, but to the famous study showing that judges hand out harsher sentences before lunch than after lunch


Charts are great for asking questions, lousy for answering them.

> That is the type of effects that has a Cohen’s d of around 2: Tautologies.

I thought the counterargument to this finding was that judges are reasonably good at predicting which cases are going to be slam dunks versus which are going to drag on. What I don't recall is if the judge is involved in sorting the cases for the day, or just in looking at the next case and saying, "You know what, this is probably going to drag on for two hours, at the end of which I'm going to say 'guilty' or 'denied', so I should just call lunch now."

Which would make the case immediately before lunch a simple yes/no, and the one after a complex 'hell no' a statistically significant percent of the time. Tautology. Or maybe judges don't like giving bad news on an empty stomach.

You are entitled to due process even if no bookie would take odds on the outcome of that process. At the end you're going to jail unless your lawyer really surprises me, but you'll have had your day in court and hopefully that will help you appreciate just how badly you done fucked up. But you are not entitled to plead your case while I'm hungry and need to use the bathroom. Court adjourned until 1 pm.


This has largely been debunked. Trial sentencing times are not random. Judges set their schedule and have preferences for when to hear different cases.


The link was about debunking it...


Guess the judge hates to be separated from his/her food. Should check the deviation as a function of time to lunch.


>an R^2 of 78%

I generally assume that more intelligent people than me would have stumbled onto any connection that obvious. So, I always take it as a warning sign that I've messed up a parameter or put a proxy variable into my training data if I see things reaching that level of effect.


And this kind of assumption is part of why it took so long for the speed of light to be established!


Can you elaborate


I can't find a source now, but I recall reading something like:

The first measurement of the speed of light was too low. Subsequent experimenters assumed that when their measurement differed from the previous measurement by too much, then their experiment was wrong. So they fiddled with it until it wasn't too far off the previous value, and then published.

EDIT: it was about the charge on an electron, not the speed of light, my mistake: https://scipython.com/blog/measurements-of-the-electron-char...


I think this can be healthy. Being wrong is not necessarily a bad thing as long as you don't stop there. Science is an iterative process: questioning your methodology, especially with unexpected results, and then going back to tweak things or double-check is a healthy way to approach a problem. So it takes longer to home in on an accurate result. That's okay when it means we get a result with more confidence. Speed is nice, but it's not the goal, and a certain amount of skepticism (call it confirmation bias if you want) can serve the real goal if it is used as a tool to spur further investigation.


That chart doesn't really back up the story. There's an oil drop experiment shortly afterwards but presumably that just made the same mistake. Then there's one paper after that. And then they got it right.


Agreed.


Not the OP, but I am guessing something along the lines of the following.

If you assume that it's your calculation errors that lead to observed discrepancies (in, e.g., the speed of light), you won't be able to measure the speed of light until you can get those errors below the observation threshold.

If you don't even assume that you can get those errors below the threshold, you won't try, and you will live happily in the belief that the speed of light is, well, infinite.


Linked handy statistical lexicon glossary is neat: https://statmodeling.stat.columbia.edu/2009/05/24/handy_stat...


Nice post. Nice linked journal article. Nice blog. Nice vocabulary page (the Kangaroo bit).

Bookmarked :)


I like the critique of Pascal's wager. I mean, if the devil exists, maybe he will decide who goes to hell!


> If philosophy is outlawed, only outlaws will do philosophy.


If you stare too long into the abyss, you're under arrest.


You do linguistics? Right to jail. Mathematics? Believe it or not, jail.


Is it better that 10 guilty men go free or 1 innocent man goes to jail? Trick question. Anyone raising their hand with an answer goes straight to jail.


Am I one of the guilty men or the innocent man?


The article in question has a preprint at https://psyarxiv.com/ezkh4/


I think the article is confusing three things. In Bayesian terminology, suppose we've found a posterior distribution for effect size. Then there are three things we might consider: (a) how tightly concentrated the posterior distribution is, (b) how much probability mass there is in a given part of the space, Prob(effect size >= thresh), (c) how we're choosing to incorporate other data.

To illustrate: suppose we've found a posterior distribution for effect size, and we've found that 95% of the posterior probability mass is >= 0. It might be a broad flat distribution (huge MAP effect size, massive uncertainty about effect size), or it might be a tight distribution (small MAP effect size, small uncertainty about effect size). There's nothing intrinsically wrong or suspicious about either of these two cases.

The headline of the linked article has the words "higher than is plausible". In other words, the premise of the article is that there is some other evidence which tells us what effect size is plausible. We could incorporate that evidence by adjusting our prior, which will obviously push the posterior distribution closer to what prior belief says it should be. Alternatively we could incorporate that evidence as a subsequent Bayesian conditioning step, i.e. treat the other evidence as further observations. The magic of Bayes's rule means that both of these approaches give the same answer. (The frequentist approach is pretty much like the latter -- it says "here's the result of my experiment, and I'll leave the reader to incorporate it with their own prior beliefs, in a meta-analysis." There's nothing intrinsically wrong with this.)
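
(A toy check of that equivalence, with conjugate normals and made-up numbers - not anything from the paper:)

    # Folding the "other evidence" into the prior first, or conditioning on it
    # after the experiment, gives the same posterior. Numbers are made up.
    def update(mean, sd, obs, obs_sd):
        # Posterior for a normal mean: N(mean, sd) prior, observation with known obs_sd.
        prec = 1 / sd**2 + 1 / obs_sd**2
        return (mean / sd**2 + obs / obs_sd**2) / prec, prec ** -0.5

    prior = (0.0, 10.0)          # vague starting prior on the effect
    experiment = (3.0, 1.0)      # (estimate, standard error) from the study
    other_evidence = (0.5, 0.8)  # (estimate, standard error) from prior literature

    a = update(*update(*prior, *other_evidence), *experiment)   # other evidence first
    b = update(*update(*prior, *experiment), *other_evidence)   # other evidence last
    print(a, b)                  # same posterior either way (up to rounding)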

The article says "I find it frustrating when researchers don't think about their effect sizes." That's the wrong conclusion. The proper conclusion is (1) always report confidence intervals for your effect sizes, (2) don't pay any attention to a statement about effect sizes unless it comes with confidence / credible intervals -- and so the article's "painful email exchange" is missing the point, (3) when you report your results, make sure there's enough information for the reader to incorporate it with their own priors -- and the whole point of confidence intervals is to let us do this.


The problem (with science in general) is in trying to collapse multivariate and multicausal representations of reality into a single value (cf. Procrustes).


What if peer reviewers put their names behind the review and were held to account in some fashion when issues like the ones the article talks about come up?


Why would anyone volunteer to review?


It could be the cost of publishing - to publish, you are required to take part in some number of others' peer reviews.


Thanks! Not sure whether it would be a great idea, but it certainly seems like an incentive that could conceivably work.


About 2.5yr late for this I'm afraid, but very interesting reading.

General rule of thumb from my experience though is that if the author is resorting to p values to justify their findings they're either a fool or an idiot, or both... (I'm not criticizing p values but they're not the only metric of statistical worth, nor are they an accurate reflection of complex systems)


Yeah this type of thing is used to justify all kinds of draconian policies.

Be extremely wary of any one who just says 'It works' to justify an action and gives no context.

'It works' always has a ton of context. How much does it work? In what situations does it work? What is the sampling error? Is the tradeoff for that improvement worth it?


I found the handy statistical lexicon useful.


Andrew Gelman's blog is fantastic and always happy to see it on HN. I don't have the link handy, but the HN crowd would love his critiques of Why We Sleep.



Now I want to know what the effect size for 'men are taller than women' is.


Yes, using statistics wrong is particularly awful.

But ... Obligatory xkcd. https://xkcd.com/2400/ If you have to use statistics at all, then your data likely isn't good enough to be repeatable. I'm sorry about this. I wish the world was easier to measure.


The thing is, just because you can prove there’s a correlation, doesn’t mean you’ve figured out the actual cause. Like the Flying Spaghetti Monster doctrine of piracy and global warming. And this one they discussed elsewhere in the thread:

http://daniellakens.blogspot.com/2017/07/impossibly-hungry-j...

http://sparrowism.soc.srcf.net/home/graph.png


I don't understand science so this sounds like some bullshit but I have absolutely no idea why



