Yesterday, I discussed current issues related to statistical studies of things like genetic or other disease risk factors. Recent discussion has criticized the misuse of statistical methods, including a statement on p-values by the American Statistical Association. As many have said, over-reliance on p-values can give a misleading sense that statistical significance means the importance of a tested risk factor. Many touted claims are not replicated in subsequent studies, and analysis has shown that this may apply preferentially to the 'major' journals. Critics have suggested that p-values not be reported at all, or be reported only if other information, such as confidence intervals (CIs) and risk factor effect sizes, is included (I would say prominently included). Strict adherence to such standards will likely undermine what even expensive major studies can claim to have found, and it will become clear that many purported genetic, dietary, etc., risk factors are trivial, unimportant, or largely uninformative.
However, today I want to go further and ask whether even these correctives go far enough, or whether they might serve as a convenient smokescreen for far more serious implications of the same issue. There is reason to believe the problem with statistical studies is more fundamental and broad than has been acknowledged.
Is reporting p-values really the problem?
Yesterday I said that statistical inference is only as good as the correspondence between the mathematical assumptions of the methods and what is being tested in the real world. I think the issues at stake rest on a deep disparity between them. Worse, we don't and often cannot know which assumptions are violated, or how seriously. We can make guesses and do all the auxiliary tests and the like, but as decades of experience in the social, behavioral, biomedical, epidemiological, and even evolutionary and ecological worlds show us, we typically have no serious way to check these things.
The problem is not just that significance is not the same as importance. A somewhat different problem with standard p-value cutoff criteria is that many of the studies in question involve many test variables, such as complex epidemiological investigations based on long questionnaires, or genome-wide association studies (GWAS) of disease. Normally, p = 0.05 means that by chance one test in 20 will seem to be significant even if nothing causal is going on in the data (e.g., if no genetic variant actually contributes to the trait). If you do hundreds or even many thousands of such tests (e.g., of sequence variants across the genome), then even if some of the variables really are causative, you'll get so many false positive results that follow-up will be impossible. A standard way to avoid that is to correct for multiple testing: accept only p-values that would be achieved by chance only once in 20 runs of the whole multivariable (e.g., whole-genome) scan. That is a good, conservative approach, but it means that, to avoid a litter of weak false positives, you claim only those 'hits' that pass the stricter standard. You know you're accounting for only a fraction of the truly causal elements you're searching for; the rest are the litter of weakly associated variables that you're willing to ignore in order to identify the most likely true ones. This is good conservative science, but if your problem is to understand the beach, you are forced to ignore all the sand, though you know it's there. The beach cannot really be understood by noting only its few detectable big stones.
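To make the arithmetic concrete, here is a minimal sketch (invented numbers, not from any real scan) of why a nominal 0.05 cutoff floods a 100,000-test scan with false positives even under a pure null, and what a study-wide Bonferroni-style correction does instead:

```python
# Sketch: false positives under a pure null, nominal vs study-wide thresholds.
import numpy as np

rng = np.random.default_rng(0)
n_tests = 100_000                      # e.g., variants scanned in a GWAS-like study
p_values = rng.uniform(0, 1, n_tests)  # p-values when no variant is causal

alpha = 0.05
naive_hits = (p_values < alpha).sum()                 # expect ~ alpha * n_tests, all false
bonferroni_hits = (p_values < alpha / n_tests).sum()  # study-wide 5% (Bonferroni) cutoff

print(f"nominal 0.05 cutoff:   {naive_hits} 'significant' results (all false)")
print(f"0.05/{n_tests} cutoff: {bonferroni_hits} results survive")
```

Roughly five thousand spurious 'hits' under the nominal cutoff, versus essentially none under the corrected one; hence the conservative habit described above.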
But even this sensible play-it-conservative strategy has deeper problems.
How 'accurate' are even these preferred estimates? Metrics like CIs and effect sizes, which critics are properly insisting be (clearly) presented along with or instead of p-values, face exactly the same issue as the p-value itself: the degree to which what is modeled fits the underlying mathematical assumptions on which the test statistics rest.
To illustrate this point, the Pythagorean Theorem in plane geometry applies exactly and universally to right triangles. But in the real world there are no right triangles! There are approximations to right triangles, and the value of the Theorem is that the more carefully we construct our triangle, the closer the square of the hypotenuse is to the sum of the squares of the other sides. If your result doesn't fit, then you know something is wrong, and you have ideas of what to check (e.g., you might be on a curved surface).
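As a toy version of that point (the noise levels are invented), the theorem gives an external standard to judge against: the residual c^2 - (a^2 + b^2) shrinks as construction gets more careful, and a residual that refuses to shrink tells you to look for a problem elsewhere.

```python
# Sketch: an exact external theory makes 'accuracy' checkable via the residual.
import numpy as np

rng = np.random.default_rng(1)
a, b = 3.0, 4.0
c_true = np.hypot(a, b)            # 5.0 for an ideal right triangle

for sigma in (0.1, 0.01, 0.001):   # increasingly careful construction/measurement
    c_measured = c_true + rng.normal(0, sigma, 10_000)
    residual = np.mean(np.abs(c_measured**2 - (a**2 + b**2)))
    print(f"measurement error {sigma}: mean |c^2 - (a^2 + b^2)| = {residual:.5f}")
```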
In our statistical study case, knowing an estimated effect size and how unusual it is seems meaningful, but we should ask how accurate these estimates are. That question, however, often has almost no testable meaning: accurate relative to what? If we were testing a truth derived from a rigorous causal theory, we could ask by how many decimal places our answers differ from that truth. We could replicate samples and increase accuracy, because the signal-to-noise ratio would systematically improve. Were that to fail, we would know something was amiss, in our theory or our instrumentation, and would have ideas about how to find out what it was. But we are far, indeed unknowably far, from that situation, because we don't have such an externally derived theory, no analog to the Pythagorean Theorem, in the important areas where statistical study techniques are being used.
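For contrast, here is what 'increasing accuracy' looks like when an externally known truth exists; the true effect of 0.3 is invented purely for the sketch, and the point is only that replication drives the estimate toward the truth, with error shrinking roughly as 1/sqrt(n):

```python
# Sketch: with a known external truth, replication measurably improves accuracy.
import numpy as np

rng = np.random.default_rng(2)
true_effect = 0.3                                 # the externally known 'truth' here

for n in (10, 100, 1_000, 10_000):
    sample = rng.normal(true_effect, 1.0, n)      # noisy replicate measurements
    estimate = sample.mean()
    std_err = sample.std(ddof=1) / np.sqrt(n)
    print(f"n={n:6d}: estimate={estimate:+.3f}  "
          f"error vs truth={estimate - true_effect:+.3f}  SE={std_err:.3f}")
```

Without such a truth to converge on, "more data" only gives tighter agreement with our own internal comparisons, which is the point of the next paragraph.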
In the absence of adequate theory, we have to concoct a kind of data analysis that rests almost entirely on internal comparison to reveal whether 'something' of interest (often something we can't or don't specify) is going on. We compare data such as cases versus controls, which forces us to make statistical assumptions such as that, other than (say) exposure to coffee, our samples of diseased and normal subjects differ only in their coffee consumption, or that the distribution of unmeasured variables is random with regard to coffee consumption among our case and control subjects. This is one reason, for example, that even statistically significant correlation does not imply causation or importance. The underlying, often unstated assumptions are often impossible to evaluate. The same problem relates to replicability: in genetics, for example, you can't assume that some other population is the same as the population you first studied, so failure to replicate in this situation does not undermine a first positive study. A result of a genetic study in Finland cannot be replicated properly elsewhere because there's only one Finland! Even another study sample within Finland won't necessarily replicate the original sample. In my opinion, the need for internally based comparison is the core problem, and a major reason why theory-poor fields often do so poorly.
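Here is a minimal simulation (all effect sizes invented) of the internal-comparison trap: an unmeasured factor that drives both coffee drinking and disease makes cases and controls differ in coffee consumption even though coffee does nothing at all in the simulated world.

```python
# Sketch: a spurious case-control association created by an unmeasured confounder.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
u = rng.normal(size=n)                                    # unmeasured factor (e.g., lifestyle)
coffee = (u + rng.normal(size=n)) > 0.5                   # exposure influenced by u
disease = rng.random(n) < 1.0 / (1.0 + np.exp(-(u - 2)))  # risk driven by u alone, not coffee

cases, controls = coffee[disease], coffee[~disease]
print(f"coffee drinkers among cases:    {cases.mean():.3f}")
print(f"coffee drinkers among controls: {controls.mean():.3f}")
# The groups differ markedly on coffee, yet coffee played no causal role here.
```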
The problem is subtle
When we compare cases and controls and insist on a study-wide 5% significance level to avoid a slew of false-positive associations, we know we're being conservative, as described above, but at least those variables that do pass the adjusted test criterion are really causal, with their effect strengths accurately estimated. Right? No!
When you do gobs of tests, some very weak causal factor may by good luck pass your test. But of those many contributing causal factors, the estimated effect size of the lucky one that passes the conservative test is something of a fluke. The estimated effect size may well be inflated, as experience in follow-up studies often or even typically shows.
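This selection effect is often called the winner's curse. Here is a sketch with invented parameters in which every variant has the same weak, real effect, yet the few that clear a study-wide threshold appear markedly stronger than they truly are:

```python
# Sketch: effect-size inflation among 'hits' that pass a stringent multiple-testing threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_variants = 50_000
true_effect = 0.05                      # every variant has the same weak, real effect
se = 0.03                               # sampling error of each per-variant estimate
estimates = true_effect + rng.normal(0, se, n_variants)

p = 2 * stats.norm.sf(np.abs(estimates / se))
hits = p < 0.05 / n_variants            # study-wide (Bonferroni-style) threshold

print(f"variants passing the corrected threshold: {hits.sum()}")
print(f"true effect of every variant:             {true_effect:.3f}")
print(f"mean estimated effect among the 'hits':   {estimates[hits].mean():.3f}")
```

The handful of survivors got there because noise pushed their estimates upward, so the reported effect sizes overstate the truth, just as follow-up studies tend to find.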
In this sense it's not just p-values that are the problem, and providing ancillary values like CIs and effect sizes in study reports is something of a false pretense of openness, because all of these values are vulnerable to similar problems. The promise to require these other data is a stopgap, or even a strategy to avoid adequate scrutiny of the statistical inference enterprise itself.
It is nobody's fault if we don't have adequate theory. The fault, dear Brutus, is in ourselves, for practicing Promissory Science and feigning far deeper knowledge than we actually have, rather than coming clean about the seriousness of the problems. Perhaps we are reaching a point where the let-down from over-claiming is so common that the secret can't be kept in the bag, and the paying public may get restless. Leaking out a few bits of recognition and promising reform is very different from letting it all out and facing the problem bluntly and directly. The core problem is not whether a reported association is strong or meaningful, but, more importantly, that we don't know, or even know how to know.
This can be seen in a different way. If all studies, including negative ones, were reported in the literature, then it would be only right that the major journals should carry those findings that are most likely true, positive, and important. That's the actionable knowledge we want, and a top journal is where the most important results should appear. But the first occurrence of a finding, even if it later turns out to be a lucky fluke, is after all a new finding! So shouldn't investigators report it, even though lots of other similar studies haven't yet been done? That could take many years or, as in the example of the Finnish studies, be impossible. We should expect negative results to be far more numerous and less interesting in themselves if we just tested every variable we could think of willy-nilly; but in fact we usually have at least some reason to look, so it is far from clear what fraction of negative results would undermine the traditional way of doing business. Should we wait for years before publishing anything? That's not realistic.
If the big-name journals are still seen as the place to publish, and their every press conference and issue announcement is covered by the splashy press, why should they change? Investigators may feel that if they don't stretch things to get into these journals, or if they publish only negative results, they'll be thought to have wasted their time or to have done poorly designed studies. Besides normal human vanity, the risk is that they will not be able to get grants or tenure. That feeling is the fault of the research, reputation, university, and granting systems, not of the investigator. Everyone knows the game we're playing. As it is, investigators and their labs hold champagne celebrations when they get a paper into one of these journals, as if winning a yacht race, which reflects what one could call the bourgeois nature of the profession these days.
How serious is the problem? Is it appropriate to characterize what's going on as fraud, hoax, or silent conspiracy? Probably in some senses yes; at least there is certainly culpability among those who do understand the epistemological nature of statistics and their application. 'Plow ahead anyway' is not a legitimate response to fundamental problems.
When reality is closely enough approximated by the statistical assumptions, causation can be identified, and we don't need to worry about the details. Many biomedical and genetic problems, and probably even some sociological ones, are like that, and the methods work very well in those cases. But this doesn't gainsay the accusation that widespread over-claiming is taking place, that the underlying problem is a deep lack of sufficient theoretical understanding of our fields of interest, and that there is a rush to do more of the same year after year.
It's all understandable, but it needs fixing. An entrenched problem like this requires even more criticism than it has been getting recently if it is to be properly addressed. Until better approaches come along, we will continue wasting a lot of money in the rather socialistic support of research establishments that keep on doing science with well-known problems.
Or maybe the problem isn't the statistics, after all?
The world really does seem to involve causation and at its basis seems to be law-like. There is truth to be discovered. We know this because when causation is simple or strong enough to be really important, anyone can find it, so to speak, without big samples or costly gear and software. Under those conditions, the numerous details that modify the effect are minor by comparison to the major signals. Hundreds or even thousands of clear, mainly single-gene disorders are known, for example. What is needed there is remediation: hard-core engineering to do something about the known causation.
However, these are not the areas where the p-value and related problems have arisen. Those arise where very large and SASsy studies seem to be needed, and the reason is that the causal factors there are weak and/or very complex. Along with trying to root out misrepresentation and failure to report the truth adequately, we should ask whether, perhaps, the results showing frustrating complexity are correct.
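A back-of-the-envelope power calculation makes the point about weak effects; the standardized effect sizes below are illustrative choices, and the 5e-8 threshold is the commonly used genome-wide significance convention:

```python
# Sketch: required sample size per group scales roughly as 1/effect^2
# (standard two-group power formula, two-sided test).
from scipy import stats

alpha, power = 5e-8, 0.80               # genome-wide alpha convention; 80% power
z_alpha = stats.norm.isf(alpha / 2)     # two-sided significance threshold
z_power = stats.norm.isf(1 - power)

for d in (0.5, 0.1, 0.02):              # standardized effect sizes, large to tiny
    n_per_group = 2 * ((z_alpha + z_power) / d) ** 2
    print(f"standardized effect {d}: roughly {n_per_group:,.0f} subjects per group")
```

A strong effect needs a few hundred subjects per group; a tiny one needs a couple hundred thousand, which is exactly the territory where these giant studies live.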
Maybe there is not a need for better theory after all. In a sense, the defining aspect of life is that it evolves not by the application of external forces, as in physics, but by internal comparison--which is just what survey methods assess. Life is the result of billions of years of differential reproduction, by chance and by various forms of selection--that is, of continual relative comparison under local natural circumstances. 'Differential' is the key word here. It is the relative success among peers today that determines the genomes, and their effects, that will be here tomorrow. In a way, in effect, and often unwittingly and for lack of better ideas, that's just the sort of comparison made in statistical studies.
From that point of view, the problem is that we don't want to face up to the resulting truth, which is that a plethora of changeable, individually trivial causal factors is what we find because that's what exists. That we don't like that, don't report it cleanly, and want strong individual causation is our problem, not Nature's.