Statistically speaking, isn't it sound to throw all the papers into the mulcher and see what comes out the other end? We do use the term "outliers" a lot in statistics, do we not? I understand that the quality might not be up to snuff for some, but won't the law of averages take care of that?
Have you ever used a mulcher to chop up some yard waste, only to accidentally put in some dog shit, and then the whole thing stinks to high heaven?
In all seriousness, with meta-analyses it's still "garbage in, garbage out". It only takes one or a few egregiously bad studies to throw off your results if they have large sample sizes but something fundamentally wrong with their methodology or implementation.
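A toy sketch of that failure mode (all numbers invented, Python/numpy purely for illustration): one big, systematically biased study drags a naive size-weighted pool wherever it wants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nine small, honest studies measuring a true effect of 0.0,
# plus one huge study with a systematic bias of +0.5.
true_effect = 0.0
sizes = [50] * 9 + [5000]
biases = [0.0] * 9 + [0.5]

study_means = [rng.normal(true_effect + b, 1.0, n).mean()
               for n, b in zip(sizes, biases)]

# Naive pooling weighted only by sample size: the one flawed study
# dominates, and the pooled estimate lands near its bias, not near 0.
pooled = np.average(study_means, weights=sizes)
print(f"pooled estimate: {pooled:+.3f}   (true effect: {true_effect})")
```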
I've dealt with enough types of data that I feel super skeptical that you can just dump numbers from hundreds of studies into some data store programmatically, do statistical calculations, and get valid results. It's very difficult to believe that there aren't a ton of variations in how the data is gathered, filtered, and presented that need to be accounted for before any comparisons can be made. I'm not going to trust the law of averages to negate the effect of completely out of whack data when people's health is on the line.
This assumes all papers are equal in quality, rigor of peer review, and accuracy of results, which we know they are not. Some studies should carry more weight than others. As has been mentioned in a previous comment, there is no 'right' answer, just a variety of ways to allocate different weights to papers based on various metrics.
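For example, one standard option is the inverse-variance weighting used in a fixed-effect meta-analysis, where noisier studies automatically count for less. A minimal sketch with made-up numbers (quality weighting proper, e.g. risk-of-bias scores, is a further judgment call on top of this):

```python
import numpy as np

# Hypothetical per-study effect estimates and their standard errors.
# Bigger / better-run studies tend to have smaller standard errors,
# so they automatically get more weight.
effects = np.array([0.30, 0.25, 0.80, 0.28])   # study effect sizes
ses     = np.array([0.05, 0.08, 0.40, 0.06])   # standard errors

weights = 1.0 / ses**2                          # inverse-variance weights
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"pooled effect: {pooled:.3f} +/- {1.96 * pooled_se:.3f}")
# The noisy 0.80 estimate barely moves the pooled result, because its
# weight (1/0.40^2 = 6.25) is tiny next to the others (~150-400).
```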
You misinterpret the law of large numbers. What the law says is that, assuming there's no pervasive bias in the sampling, any large enough sample will look essentially identical to any other (and "large enough" is often much smaller than you think: the classic example is election polling, where a few thousand representative voters are enough to predict the outcome of an election in a country with millions of voters). In the case of this article, that means the conclusions of many papers should converge to the same answer, with the outliers marked out as likely "bad" papers.
The only assumption you may reject here is that there's no systematic bias in the papers. Perhaps there is... or perhaps most papers are just very unreliable, in which case there should also be no convergence... but if you find convergence, there's a good chance the result is "real".
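A rough simulation of that argument (purely synthetic numbers): when most papers are estimating the same underlying effect, their results cluster, and the few that aren't stick out.

```python
import numpy as np

rng = np.random.default_rng(42)
true_effect = 0.3

# 90 papers estimate the real effect with independent noise;
# 10 "bad" papers each carry their own systematic error.
good = rng.normal(true_effect, 0.05, 90)
bad = rng.normal(true_effect + rng.choice([-0.4, 0.4], 10), 0.05)
papers = np.concatenate([good, bad])

consensus = np.median(papers)
mad = np.median(np.abs(papers - consensus))
sigma_hat = 1.4826 * mad                 # MAD rescaled to a normal sigma

# Papers far from the consensus stand out as likely outliers.
flagged = np.abs(papers - consensus) > 3 * sigma_hat
print(f"consensus estimate: {consensus:.3f}  (true: {true_effect})")
print(f"flagged {flagged.sum()} of {len(papers)} papers as outliers")
```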
But the crucial bit here is the "large" in "large numbers". I expect that even for quite popular drugs the number of studies is maybe in the hundreds, which, depending on the statistics involved, could well be quite a way from large enough. In particular if a significant fraction are crap studies.
You mean the Law of Large Numbers (LLN), not the Law of Averages, right? Both the Weak LLN and the Strong LLN presume all samples are independent and identically distributed. If we build a hierarchical model over the data from each paper, we can bind all the data into a single distribution, but assuming that each of these studies is independent is a _long_ shot. The WLLN and SLLN _only_ apply to, roughly, sampling from the same process. Their scope is more applicable to things like sensor readings.
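To make the independence point concrete, a toy simulation (all numbers invented): when every study shares the same bias, piling on more studies just tightens the estimate around the wrong value.

```python
import numpy as np

rng = np.random.default_rng(7)
true_effect = 0.0
shared_bias = 0.2      # e.g. the same flawed measurement used everywhere

def pooled_estimate(n_studies):
    # Each study reports: true effect + shared bias + its own noise.
    results = true_effect + shared_bias + rng.normal(0, 0.1, n_studies)
    return results.mean()

for n in (10, 100, 1000, 10000):
    print(f"{n:>6} studies -> pooled estimate {pooled_estimate(n):+.3f}")
# The estimate converges, but to 0.2 rather than the true effect of 0.0:
# the LLN averages away independent noise, not a bias common to every study.
```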
The Law of Large Numbers is an actual math theorem. The Law of Averages is a non-technical name for various informal reasoning strategies, some fallacious (like the gambler's fallacy), but mostly just types of estimation that are justified by more formal probability theory.
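A quick illustration of why the gambler's-fallacy version fails (simulated coin flips, nothing more): a run of tails does not make heads any more likely.

```python
import numpy as np

rng = np.random.default_rng(1)
flips = rng.integers(0, 2, 1_000_000)      # 0 = tails, 1 = heads

# For every window, check whether it was five tails in a row,
# then look at the flip that immediately follows.
tails_run = np.convolve(flips == 0, np.ones(5, dtype=int), mode="valid") == 5
after_five_tails = flips[5:][tails_run[:-1]]

# The "law of averages" intuition says heads is now "due";
# independence says it is still about 0.5.
print(f"P(heads | five tails in a row) = {after_five_tails.mean():.3f}")
```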
You get some numbers and they look good? Fine, but at best that's grounds for a proper study, at worst it's wildly misleading. You can easily fool yourself with statistics, and other people too.
For a good read about studies with solid statistics and bogus results, see [0].