I mean, you're asking for a retrospective study, as opposed to a randomized controlled trial. It's useful and a great idea, but it won't yield data of equivalent quality.
But is the goal to conduct a randomized controlled trial, or to measure the correction rate within the bounds of ethics? You go to war with the army you have.
Well, the goal is to measure the correction rate within the bounds of ethics, but the question is how accurate the result would be without an RCT. Intuitively I would hope it's accurate, but how would you know without actually running the experiment? How do you know there aren't confounding factors greatly skewing the result?
If you'll grant that we're also able to replicate the study many times, we're left with errors that are not caught by Wikipedians or independent teams of experts. At that point I think we're looking at errors that have been written into history - the kind of error that originates in primary sources and can't be identified through fact checking. We could maybe estimate the size of that set by identifying widely-accepted misconceptions that were later overturned, but then we're back to my first suggestion and your objection to it.
But more importantly, we probably won't catch that sort of error by introducing fabrications, either. Fabrications might replicate a class of error we're interested in, but if we just throw one onto Wikipedia, it's not going to be a longstanding misunderstanding that's immune to fact checking (at least not without giving it a lot of time to develop into a citogenesis event, and that's exactly the kind of externality we're trying to avoid).
(Of course, "how many times do we need to replicate it?" remains unanswered. I think after we have several replications, plus data on false negatives from our teams of experts, we could come up with an estimate.)
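For what it's worth, the "how many replications?" question has a standard back-of-the-envelope answer if you treat each introduced error as a Bernoulli trial (corrected or not). Here's a sketch; the 80% correction rate and ±5-point target below are purely hypothetical placeholders, not estimates from any actual study:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion (95% default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

def replications_needed(expected_rate, max_half_width, z=1.96):
    """Rough sample size so a normal-approximation CI has the given half-width."""
    return math.ceil(z**2 * expected_rate * (1 - expected_rate) / max_half_width**2)

# Hypothetical numbers: if ~80% of introduced errors get corrected and we want
# the 95% interval to be no wider than +/- 5 percentage points:
n = replications_needed(0.80, 0.05)
lo, hi = wilson_interval(round(0.80 * n), n)
print(n, lo, hi)
```

The same machinery covers the false-negative side: run the expert teams against a set of known errors, count the misses, and put a Wilson interval around that miss rate too. It doesn't touch the errors-written-into-history problem, but it does bound the sampling error for everything upstream of it.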