Because there's no such thing as a perfect study, replication studies can be victims of any of the same issues, so interpreting lack of replication isn't necessarily straightforward, and certainly doesn't always mean that the original study was flawed.
The Open Science Collaboration was scrupulous in its efforts to replicate original studies as carefully and faithfully as possible. Still, the results weren't pretty. The authors write:
Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.
Interestingly enough, the authors aren't quite sure what any of this means. First, as they point out, direct replication doesn't verify the theoretical interpretation of the result, which could have been flawed originally, and remain flawed. And when a study is not replicated, it's impossible to know why: whether the original was flawed, the replication effort was flawed, or both were.
This effort has been the subject of much discussion, naturally enough. In a piece published last week in The Atlantic, Ed Yong quotes several psychologists, including the project's lead author, saying that this project has been a welcome learning experience for the field. There are plans afoot to change how things are done, including pre-registration of hypotheses so that the reported results can't be cherry-picked, or increasing the size of studies to increase their power, as has been done in the field of genetics.
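The link between study size and power can be made concrete with a small calculation. What follows is only a rough sketch under stated assumptions, not anything taken from the replication project: it assumes a two-group comparison with a known standard deviation of 1, a two-sided z-test, and a true standardized effect of 0.3. The function name power_two_sample_z is made up for illustration.

```python
from statistics import NormalDist

def power_two_sample_z(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    Assumes (for illustration only) equal group sizes and a known
    standard deviation of 1, so effect_size is in standard-deviation units.
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    # Expected value of the test statistic when the effect is real
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic lands in either rejection region
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Hypothetical effect size and sample sizes, chosen only to show the trend
for n in (20, 50, 200):
    print(n, round(power_two_sample_z(0.3, n), 3))
```

Under these assumptions, a study with 20 subjects per group would detect a real effect of this size only a small fraction of the time, so a failed replication at that size says little either way. A study ten times larger would detect it most of the time.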
We'll see whether this is just a predictable wagon-circling welcome, or really means something. One has every reason to be skeptical, and to wonder if these fields really are sciences in the proper sense of the term. Indeed, it's quite interesting to see genetics held up as an exemplar of good and reliable study design. After billions of dollars spent on studies large and small of the genetics of asthma, heart disease, type 2 diabetes, obesity, hypertension, stroke, and so on, we've got a lot of contradictory findings, and most of what has been found are genes with small effects. And epidemiology, many of the 'omics fields, evolutionary biology, and others haven't done any better.
Why? The vagueness of the social and behavioral sciences is only part of the problem (unlike, say, force, outcome variables such as stress, aggression, crime, or intelligence are hard to consistently define, and can vary according to the instrument with which they are measured). Biomedical outcomes can be vague and hard to define as well (autism, schizophrenia, high blood pressure). We don't understand enough about how genes interact with each other or with the environment to understand complex causality.
Statistics and science
The problem may be much deeper than any of this discussion of non-replicable results suggests. From an evolutionary point of view, we expect organisms to be different, not replicates. This is because, first, mutational changes (and recombination) are always making each individual organism's genotype unique and, second, the need to adapt--Darwin's central claim or observation--means that organisms have to be different so that their 'struggle for life' can occur.
We have only a general theory for this, since life is an ad hoc adaptive/evolutionary phenomenon. Far more broadly than just the behavioral or social sciences, our investigative methods are based on 'internal' comparisons (e.g., cases vs controls, various levels of blood pressure and stroke, fitness relative to different trait values) to evaluate samples against each other, rather than as representations of an externally derived, a priori theory. When we rely on statistics and p-value significance tests and probabilities and so on, we are implicitly confessing that we don't in fact really know what's going on. All we can get is a kind of shadow of the underlying process, cast by the differences we detect, and we detect them with generic (not to mention subjective) rather than specific criteria. We've written about these things several times in the past here.
The issue is not just weakly defined terms and study designs. As Freeman Dyson (in "A meeting with Enrico Fermi") wrote in 2004:
In desperation I asked Fermi whether he was not impressed by the agreement between our calculated numbers and his measured numbers. He replied, "How many arbitrary parameters did you use for your calculations?" I thought for a moment about our cut-off procedures and said, "Four." He said, "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." With that, the conversation was over. . . .
Here, parameters refers to what are more properly called 'free parameters', that is, ones not fixed in advance, but that are estimated from data. By contrast, for example, in physics the speed of light and gravitational constant are known, fixed values a priori, not estimated from data (though data were used to establish those values). We just lack such understanding in many areas of science, not just behavioral sciences.
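The point about free parameters can be illustrated with a toy example. This is a minimal sketch, not Dyson's or Fermi's actual calculation: it assumes four data points that are pure random noise and fits them with a cubic, which has four free coefficients. The helper name lagrange_fit and the specific x values are hypothetical choices for illustration.

```python
import random

def lagrange_fit(xs, ys):
    """Return the unique polynomial of degree len(xs) - 1 through the points.

    With n points this polynomial has n free coefficients, so it can
    match any n observations exactly, whatever produced them.
    """
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly

random.seed(0)  # fixed seed so the run is repeatable
xs = [0.0, 1.0, 2.0, 3.0]
ys = [random.gauss(0, 1) for _ in xs]  # pure noise, no underlying law

cubic = lagrange_fit(xs, ys)  # four free parameters for four points

# The fit to the observed data is perfect...
max_residual = max(abs(cubic(x) - y) for x, y in zip(xs, ys))
print("largest residual on fitted points:", max_residual)

# ...but the value at a new point reflects the noise, not any real process
print("extrapolated value at x = 4:", cubic(4.0))
```

The agreement between the curve and the data is perfect by construction, which is exactly why it tells us nothing: the same perfect agreement would appear for any four numbers.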
In a sense we are using Ptolemaic tinkering to fit a theory that doesn't really fit, in the absence of a better (e.g., Copernican or Newtonian) theoretical understanding. Social and behavioral sciences are far behind the at least considerably more rigorous genetic and evolutionary sciences when the latter are done at their best (which isn't always). Like the shadows of reality seen in Plato's cave, statistical inference reflects the shadows of the reality we want to understand, but for many cultural and practical reasons we don't recognize that, or don't want to acknowledge it. The weaknesses and frangibility of our predictive 'powers' could, if properly understood by the general public, be a threat to our business, and our culture doesn't reward candor when it comes to that business. The pressures (including from the public media, with their own agenda and interests) necessarily lead to reducing complexity to simpler models and claims far beyond what has legitimately been understood.
The problem is not just with weakly measured variables or poorly defined terms for, say, outcomes. Nor is the problem if, when, or that people use methods wrongly. The problem is that statistical inference is based on a sample and is often retrospective, or mainly empirical and based on only rather generic theory. No matter how well chosen and rigorously defined, in these various areas (unlike much of physics and chemistry) the estimates of parameters and the like are fitted to data that are necessarily about the subjects' past, such as their culture or upbringing or lifestyles. In the absence of adequate formal theory, these findings cannot be used to predict the future with knowable accuracy. That is because the same conditions can't be repeated, say, decades from now, and we don't know what future conditions will be, and so on.
Rather alarmingly, we were recently discussing this with a colleague who works in very physics- and chemistry-rigorous material science. She immediately told us that they, too, face problems in data evaluation with the number of variables they have to deal with, even under what the rest of us would enviously say were very well-controlled conditions where the assumptions of statistics--basically amounting to replicability of some underlying mathematical process--should really apply well.
So the social and related sciences may be far weaker than other fields, and should acknowledge that. But the rest of us, in various purportedly 'harder' biological, biomedical, and epidemiological sciences, are often not so much better off. Statistical methods and theory work wonderfully well when their assumptions are closely met. But there is too much out-of-the-box analytic toolware that lures us into thinking that quick and definitive answers are possible. Those methods never promise that, because what statistics does is account for repeated phenomena following the same rules, and the rule of many sciences is that, in their essence, they are not following such rules.
But the lure of easy-answer statistics, and the understandable lack of deeply better ideas, perpetuate the expensive and misleading games that we are playing in many areas of science.