There is a valuable discussion in Nature about the problems that have arisen from the (mis)use of statistics for decision-making. To simplify, the issue is that a rather subjectively chosen cutoff for the p value leads us to dichotomize our inferences, when the underlying phenomena may or may not be dichotomous. For example, to put it simplistically, if a study's results pass such a cutoff test, it means that the chance that a result at least as extreme as the one observed would arise if nothing is going on (as opposed to the hypothesized effect) is so small--less than the chosen cutoff, conventionally 5 percent--that we accept the data as showing that our suggested something is going on. In other words, rare results (using our cutoff criterion for what 'rare' means) are considered to support our idea of what's afoot. The chosen cutoff level is arbitrary and used by convention, and its use doesn't reflect the various aspects of uncertainty or alternative interpretations that may abound in the actual data.
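To make the dichotomizing concrete, here is a minimal sketch in Python, assuming a toy coin-tossing study that is not from the Nature discussion. The function name `binomial_p_value`, the 20 tosses, and the 0.05 cutoff are all illustrative assumptions. It computes the chance of seeing at least the observed number of heads if the coin is fair.

```python
from math import comb

ALPHA = 0.05  # the conventional, and arbitrary, cutoff


def binomial_p_value(successes: int, trials: int, null_prob: float = 0.5) -> float:
    """One-sided p-value: chance of at least `successes` out of `trials`
    if each trial succeeds with probability `null_prob` (nothing going on)."""
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Two hypothetical studies that differ by a single head out of 20 tosses.
for heads in (14, 15):
    p = binomial_p_value(heads, 20)
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{heads}/20 heads: p = {p:.4f} -> {verdict}")
```

In this toy case the two outcomes fall on opposite sides of the cutoff, although the evidence they carry differs only by one toss. That is the sense in which a fixed cutoff imposes a yes/no answer on what may be a continuum.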
The Nature commentaries address these issues in various ways, and suggestions are made. These are helpful and thoughtful in themselves but they miss what I think is a very important, indeed often the critical point, when it comes to their application in many areas of biology and social science.
Instrumentation errors
In these (as in other) sciences, various measurements and technologies are used to collect data. These are mechanical, so to speak, and are always imperfect. Sometimes it may be reasonable to assume that the errors are unrelated to what is being measured (for example, their distribution is unrelated to the value of a given instance) and don't affect what is being measured (as quantum measurements can do). In that case, correcting for them in some reasonably systematic way, such as assuming normally distributed errors, clearly helps adjust findings for the inadvertent but causally unconnected errors.
Such corrections seem to apply quite validly to social and biological, including evolutionary and genetic, sciences. We'll never have perfect instrumentation or measurement, and often don't know the nature of our imperfections. Assuming errors uncorrelated with what is being sought seems reasonable even if approximate to some unknown degree. It's worked so well in the past that this sort of probabilistic treatment of results seems wholly appropriate.
But instrumentation errors are not the only possible errors in some sciences.
Conceptual errors: you can't 'correct' for them in inappropriate studies
Statistics is, properly, a branch of mathematics. That means it is an axiomatic system, an if-then way to make deductions or inductions. When and if the 'if' conditions are met, the 'then' consequences must follow. Statistics rests on probabilism rather than determinism, in the sense that it relates to, and is developed around, the idea that some phenomena only occur with a given probability, say p, and that such a value somehow exists in Nature.
It may have to do with the practicalities of sampling by us, or by some natural screening phenomenon (as in, say, mutation, Mendelian transmission, natural selection). But it basically always rests on some version or other of an assumption that the sampling is parametric, that is, that our 'p' value somehow exists 'out there' in Nature. If we are, say, sampling 10% of a population (and the latter is actually well-defined!) then each draw has the same properties. For example, if it is a 'random' sample, then no property of a potential samplee affects whether or not it is actually sampled.
But note there is a big 'if' here: Sampling or whatever process is treated as probabilistic needs to have a parameter value! It is that which is used to compute significance measures and so on, from which we draw conclusions based on the results of our sample.
Is the universe parametric? Is life?
In physics, for example, the universe is assumed to be parametric. It is, universally, assumed to have some properties, like gravitational constant, Planck's constant, the speed of light, and so on. We can estimate the parameters here on earth (as, for example, Newton himself suggested), but assume they're the same elsewhere. If observation challenges that, we assume the cosmos is regular enough that there are at least some regularities, even if we've not figured them all out yet.
A key feature of a parametric universe is replicability. When things are replicable because they are parametric--that is, have fixed universal properties--then statistical estimates and their standard deviations etc. make sense and should reflect the human-introduced (e.g., measurement) sources of variation, not Nature's. Statistics is a field largely developed for this sort of context, or others in which sampling was reasonably assumed to represent the major source of error.
In my view it is more than incidental, but profound, that 'science' as we know it was an enterprise developed to study the 'laws' of Nature. Maybe this was the product of the theological beliefs that had preceded the Enlightenment or, as I think at least Newton said, 'science' was trying to understand God's laws.
In this spirit, in his Principia Mathematica (his most famous book), Newton stated the idea that if you understand how Nature works in some local example, what you learned would apply to the entire cosmos. This is how science, usually implicitly, works today. Chemistry here is assumed to be the same as chemistry in any distant galaxy, even those we cannot see. Consistency is the foundation upon which our idea of the cosmos and, in that sense, classical science have been built.
Darwin was, in this sense, very clearly a Newtonian. Natural selection was a 'force' he likened to gravity, and his idea of 'chance' was not the formal one we use today. But what he did observe, though implicitly, was that evolution was about competing differences. In this sense, evolution is inherently not parametric.
Not only does evolution rest heavily on probability--chance aspects of reproductive success, which Darwin only minimally acknowledged--but it rests on each individual's own reproductive success being unique. Without variation, and that means variation in the traits that affect success, not just 'neutral' ones, there would be no evolution.
In this sense, the application of statistics and statistical inference in life sciences is legitimate relative to measurement and sampling issues, but is not relevant in terms of the underlying assumptions of its inferences. Each study subject is not identical except for randomly distributed 'noise', whether in our measurement or in its fate.
Life has properties we can measure and assign average values to, like the average reproductive success of AA, Aa, and aa genotypes at a given gene. But that is a retrospective average, and it is contrary to what we know about evolution to assume that, say, all AA's have the same fitness parameter and their reproductive variation is only due to chance sampling from that parameter.
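A toy simulation may help show why a retrospective average says little about whether a shared fitness parameter exists. This is only a sketch under invented assumptions: each AA individual gets 10 reproductive opportunities, and the success probabilities (0.30 for everyone in one world, spread between 0.05 and 0.55 in the other) are hypothetical numbers chosen for illustration, not estimates of anything real.

```python
import random
from statistics import fmean, pvariance

OPPORTUNITIES = 10      # hypothetical reproductive opportunities per individual
SHARED_PROB = 0.30      # hypothetical 'fitness parameter' shared by all AA's
SPREAD = 0.25           # hypothetical half-width of individual differences
N_INDIVIDUALS = 20_000

rng = random.Random(3)  # fixed seed so the sketch is repeatable


def offspring_count(success_prob: float) -> int:
    """Number of successful reproductive opportunities for one individual."""
    return sum(rng.random() < success_prob for _ in range(OPPORTUNITIES))


# World 1: every AA has the same success probability (a true parameter).
parametric = [offspring_count(SHARED_PROB) for _ in range(N_INDIVIDUALS)]

# World 2: every AA has its own probability; only the average matches.
unique = [
    offspring_count(rng.uniform(SHARED_PROB - SPREAD, SHARED_PROB + SPREAD))
    for _ in range(N_INDIVIDUALS)
]

print(f"mean offspring, shared parameter:  {fmean(parametric):.3f}")
print(f"mean offspring, unique individuals: {fmean(unique):.3f}")
print(f"variance, shared parameter:  {pvariance(parametric):.3f}")
print(f"variance, unique individuals: {pvariance(unique):.3f}")
```

The two worlds give nearly the same average reproductive success, so the average alone cannot tell us whether a single fitness value is 'out there'. In this sketch the difference shows up only in the spread, and only because we built the model and know where to look.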
Thinking of life in parametric terms is a convenience, but is an approximation of unknown and often unknowable inaccuracy. Evolution occurs over countless millennia, in which the non-parametric aspects can be dominating. We can estimate, say, recombination or mutation or fitness values from retrospective data, but they are not parameters that we can rigorously apply to the future and they typically are averages among sampled individuals.
Genetic effects are unique to each background and environmental experience, and we should honor that uniqueness as such! The statistical crisis that many are trying valiantly to explain away, so they can return to business as usual (even if not reporting p values) is a crisis of convenience, because it makes us think that a bit of different reportage (confidence limits rather than p values, for example) will cure all ills. That is a band-aid that is a convenient port-in-a-storm, but an illusory fix. It does not recognize the important, or even central, degree to which life is not a parametric phenomenon.
The statistics of Promissory Science. Part II: The problem may be much deeper than acknowledged
Yesterday, I discussed current issues related to statistical studies of things like genetic or other disease risk factors. Recent discussion has criticized the misuse of statistical methods, including a statement on p-values by the American Statistical Association. As many have said, the over-reliance on p-values can give a misleading sense that significance means importance of a tested risk factor. Many touted claims are not replicated in subsequent studies, and analysis has shown this may preferentially apply to the 'major' journals. Critics have suggested that p-values not be reported at all, or only if other information like confidence intervals (CIs) and risk factor effect sizes is included (I would say prominently included). Strict adherence will likely undermine what even expensive major studies can claim to have found, and it will become clear that many purported genetic, dietary, etc., risk factors are trivial, unimportant, or largely uninformative.
However, today I want to go farther, and ask whether even these correctives go far enough, or whether they would perhaps serve as a convenient smokescreen for far more serious implications of the same issue. There is reason to believe the problem with statistical studies is more fundamental and broad than has been acknowledged.
Is reporting p-values really the problem?
Yesterday I said that statistical inference is only as good as the correspondence between the mathematical assumptions of the methods and what is being tested in the real world. I think the issues at stake rest on a deep disparity between them. Worse, we don't and often cannot know which assumptions are violated, or how seriously. We can make guesses and do all auxiliary tests and the like, but as decades of experience in the social, behavioral, biomedical, epidemiological, and even evolutionary and ecological worlds show us, we typically have no serious way to check these things.
The problem is not just that significance is not the same as importance. A somewhat different problem with standard p-value cutoff criteria is that many of the studies in question involve many test variables, such as complex epidemiological investigations based on long questionnaires, or genomewide association studies (GWAS) of disease. Normally, a cutoff of p=0.05 means that, on average, about one test in 20 will seem to be significant by chance alone, even if there's nothing causal going on in the data (e.g., if no genetic variant actually contributes to the trait). If you do hundreds or even many thousands of 0.05 tests (e.g., of sequence variants across the genome), even if some of the variables really are causative, you'll get so many false positive results that follow-up will be impossible. A standard way to avoid that is to correct for multiple testing by using only p-values that would be achieved by chance only once in 20 times of doing a whole multivariable (e.g., whole genome) scan. That is a good, conservative approach, but means that to avoid a litter of weak, false positives, you only claim those 'hits' that pass that standard.
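The arithmetic of multiple testing can be sketched with a small simulation. The assumptions here are mine: a hypothetical scan of 10,000 variables in which nothing is causal, and the Bonferroni rule (dividing the cutoff by the number of tests) as one common way of holding the chance of any false hit in the whole scan to about one in 20. Other corrections exist.

```python
import random

ALPHA = 0.05        # conventional per-test cutoff
N_TESTS = 10_000    # hypothetical number of variables tested in one scan

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# When nothing is going on, a p-value is uniformly distributed on [0, 1),
# so uniform random numbers stand in for a scan with no causal variables.
null_p_values = [rng.random() for _ in range(N_TESTS)]


def count_hits(p_values, threshold):
    """Number of tests whose p-value falls below the threshold."""
    return sum(p < threshold for p in p_values)


naive_hits = count_hits(null_p_values, ALPHA)
corrected_hits = count_hits(null_p_values, ALPHA / N_TESTS)  # Bonferroni

print(f"'significant' at p < {ALPHA}: {naive_hits} of {N_TESTS}")
print(f"'significant' after Bonferroni correction: {corrected_hits}")
```

With the uncorrected cutoff, roughly one test in 20 passes even though every 'hit' is false by construction. The corrected cutoff removes nearly all of them, at the price of a threshold that weak but real effects would also struggle to pass.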
You know you're only accounting for a fraction of the truly causal elements you're searching for; the rest are the litter of weakly associated variables that you're willing to ignore in order to identify the most likely true ones. This is good conservative science, but if your problem is to understand the beach, you are forced to ignore all the sand, though you know it's there. The beach cannot really be understood by noting its few detectable big stones.
But even this sensible play-it-conservative strategy has deeper problems.
How 'accurate' are even these preferred estimates?
The metrics like CIs and effect sizes that critics are properly insisting be (clearly) presented along with or instead of p-values face exactly the same issues as the p-value: the degree to which what is modeled fits the underlying mathematical assumptions on which test statistics rest.
To illustrate this point, the Pythagorean Theorem in plane geometry applies exactly and universally to right triangles. But in the real world there are no right triangles! There are approximations to right triangles, and the value of the Theorem is that the more carefully we construct our triangle the closer the square of the hypotenuse is to the sum of the squares of the other sides. If your result doesn't fit, then you know something is wrong and you have ideas of what to check (e.g., you might be on a curved surface).
In our statistical study case, knowing an estimated effect size and how unusual it is seems to be meaningful, but we should ask how accurate these estimates are. But that question often has almost no testable meaning: accurate relative to what? If we were testing a truth derived from a rigorous causal theory, we could ask by how many decimal places our answers differ from that truth. We could replicate samples and increase accuracy, because the signal to noise ratio would systematically improve. Were that to fail, we would know something was amiss, in our theory or our instrumentation, and have ideas how to find out what that was. But we are far, indeed unknowably far, from that situation. That is because we don't have such an externally derived theory, no analog to the Pythagorean Theorem, in important areas where statistical study techniques are being used.
In the absence of adequate theory, we have to concoct a kind of data that rests almost entirely on internal comparison to reveal whether 'something' of interest (often that we don't or cannot specify) is going on. We compare data such as cases vs controls, which forces us to make statistical assumptions such as that our diseased and normal subjects differ only in (say) their exposure to coffee, or that the distribution of other variation in unmeasured variables is random with regard to coffee consumption among our case and control subjects. This is one reason, for example, that even statistically significant correlation does not imply causation or importance. The underlying, often unstated assumptions are often impossible to evaluate. The same problem relates to replicability: for example, in genetics, you can't assume that some other population is the same as the population you first studied. Failure to replicate in this situation does not undermine a first positive study. For example, a result of a genetic study in Finland cannot be replicated properly elsewhere because there's only one Finland! Even another study sample within Finland won't necessarily replicate the original sample. In my opinion, the need for internally based comparison is the core problem, and a major reason why theory-poor fields often do so poorly.
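The coffee example can be sketched as a simulation of what happens when the 'random with regard to coffee' assumption fails. Everything specific here is a hypothetical assumption of mine: an unmeasured factor (labelled smoking purely for illustration) that makes people both more likely to drink coffee and more likely to become ill, and the made-up probabilities attached to it. Coffee itself has no effect on disease in this toy world.

```python
import random

N_SUBJECTS = 50_000
rng = random.Random(2)  # fixed seed so the sketch is repeatable


def simulate_subject():
    """Return (smoker, coffee, disease) for one hypothetical subject."""
    smoker = rng.random() < 0.30
    coffee = rng.random() < (0.80 if smoker else 0.40)   # smokers drink more coffee
    disease = rng.random() < (0.30 if smoker else 0.10)  # coffee plays no role
    return smoker, coffee, disease


subjects = [simulate_subject() for _ in range(N_SUBJECTS)]


def disease_rate(rows):
    """Fraction of the given subjects who have the disease."""
    rows = list(rows)
    return sum(disease for _, _, disease in rows) / len(rows)


# Comparison that ignores the unmeasured factor.
crude_gap = (disease_rate(s for s in subjects if s[1])
             - disease_rate(s for s in subjects if not s[1]))

# The same comparison within each level of the factor we pretended not to see.
smoker_gap = (disease_rate(s for s in subjects if s[0] and s[1])
              - disease_rate(s for s in subjects if s[0] and not s[1]))
nonsmoker_gap = (disease_rate(s for s in subjects if not s[0] and s[1])
                 - disease_rate(s for s in subjects if not s[0] and not s[1]))

print(f"coffee vs no coffee, all subjects: {crude_gap:+.3f}")
print(f"coffee vs no coffee, smokers only: {smoker_gap:+.3f}")
print(f"coffee vs no coffee, non-smokers:  {nonsmoker_gap:+.3f}")
```

Coffee drinkers show clearly more disease overall, yet within each group the gap is close to zero. In a real study we could not run the second comparison, because by assumption the confounding variable is unmeasured or unknown.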
The problem is subtle
When we compare cases and controls and insist on a study-wide 5% significance level to avoid a slew of false-positive associations, we know we're being conservative as described above, but at least those variables that do pass the adjusted test criterion are really causal with their effect strengths accurately estimated. Right? No!
When you do gobs of tests, some very weak causal factor may by good luck pass your test. But of those many contributing causal factors, the estimated effect size of the lucky one that passes the conservative test is something of a fluke. The estimated effect size may well be inflated, as experience in follow-up studies often or even typically shows.
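A minimal simulation can illustrate why the lucky factor's effect size is inflated. The setup is assumed for illustration only: 20,000 hypothetical factors that all share the same weak true effect (0.10), each estimated with the same sampling noise (standard error 0.05), and a stringent cutoff of 4 standard errors standing in for a multiple-test-corrected threshold.

```python
import random
from statistics import fmean

TRUE_EFFECT = 0.10     # hypothetical: every factor has this same weak effect
STANDARD_ERROR = 0.05  # hypothetical sampling noise of each estimate
Z_CUTOFF = 4.0         # stand-in for a stringent, corrected significance cutoff
N_FACTORS = 20_000

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Each study estimate is the true effect plus random sampling error.
estimates = [rng.gauss(TRUE_EFFECT, STANDARD_ERROR) for _ in range(N_FACTORS)]

# Keep only the estimates that clear the conservative test.
hits = [e for e in estimates if e / STANDARD_ERROR > Z_CUTOFF]

mean_all = fmean(estimates)
mean_hits = fmean(hits)

print(f"true effect of every factor:     {TRUE_EFFECT:.3f}")
print(f"mean estimate over all factors:  {mean_all:.3f}")
print(f"factors passing the cutoff:      {len(hits)}")
print(f"mean estimate among those 'hits': {mean_hits:.3f}")
```

To pass the cutoff here an estimate must be at least about twice the true effect, so every reported 'hit' overstates the truth even though each factor is genuinely causal. The average over all factors is unbiased; it is the act of selecting on significance that produces the inflation.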
In this sense it's not just p-values that are the problem, and providing ancillary values like CIs and effect sizes in study reports is something of a false pretense of openness, because all of these values are vulnerable to similar problems. The promise to require these other data is a stopgap, or even a strategy to avoid adequate scrutiny of the statistical inference enterprise itself.
It is nobody's fault if we don't have adequate theory. The fault, dear Brutus, is in ourselves, for using Promissory Science, and feigning far deeper knowledge than we actually have. We do that rather than come clean about the seriousness of the problems. Perhaps we are reaching a point where the let-down from over-claiming is so common that the secret can't be kept in the bag, and the paying public may get restless. Leaking out a few bits of recognition and promising reform is very different from letting it all out and facing the problem bluntly and directly. The core problem is not whether a reported association is strong or meaningful, but, more importantly, that we don't know, or know how to know.
This can be seen in a different way. If all studies including negative ones were reported in the literature, then it would be only right that the major journals should carry those findings that are most likely true, positive, and important. That's the actionable knowledge we want, and a top journal is where the most important results should appear. But the first occurrence of a finding, even if it turns out later to be a lucky fluke, is after all a new finding! So shouldn't investigators report it, even though lots of other similar studies haven't yet been done? That could take many years or, as in the example of Finnish studies, be impossible. We should expect that negative results would be far more numerous and less interesting in themselves if we just tested every variable we could think of willy-nilly, but in fact we usually have at least some reason to look, so it is far from clear what fraction of negative results would undermine the traditional way of doing business. Should we wait for years before publishing anything? That's not realistic.
If the big-name journals are still seen as the place to publish, and their every press conference and issue announcement is covered by the splashy press, why should they change? Investigators may feel that if they don't stretch things to get into these journals, or just publish negative results, they'll be thought to have wasted their time or done poorly designed studies. Besides normal human vanity, the risk is that they will not be able to get grants or tenure. That feeling is the fault of the research, reputation, university, and granting systems, not the investigator. Everyone knows the game we're playing. As it is, investigators and their labs have champagne celebrations when they get a paper in one of these journals, like winning a yacht race, which is a reflection of what one could call the bourgeois nature of the profession these days.
How serious is the problem? Is it appropriate to characterize what's going on as fraud, hoax, or silent conspiracy? Probably in some senses yes; at least there is certainly culpability among those who do understand the epistemological nature of statistics and their application. 'Plow ahead anyway' is not a legitimate response to fundamental problems.
When reality is closely enough approximated by statistical assumptions, causation can be identified, and we don't need to worry about the details. Many biomedical and genetic, and probably even some sociological problems are like that. The methods work very well in those cases. But this doesn't gainsay the accusation that there is widespread over-claiming taking place and that the problem is a deep lack of sufficient theoretical understanding of our fields of interest, and a rush to do more of the same year after year.
It's all understandable, but it needs fixing. To be properly addressed, an entrenched problem requires more criticism even than this one has been getting recently. Until better approaches come along, we will continue wasting a lot of money in the rather socialistic support of research establishments that keep on doing science that has well-known problems.
Or maybe the problem isn't the statistics, after all?
The world really does, after all, seem to involve causation and at its basis seems to be law-like. There is truth to be discovered. We know this because when causation is simple or strong enough to be really important, anyone can find it, so to speak, without big samples or costly gear and software. Under those conditions, numerous details that modify the effect are minor by comparison to the major signals. Hundreds or even thousands of clear, mainly single-gene based disorders are known, for example. What is needed is remediation, hard-core engineering to do something about the known causation.
However, these are not the areas where the p-value and related problems have arisen. That happens when very large and SASsy studies seem to be needed, and the reason is that there the causal factors are weak and/or complex. Along with trying to root out misrepresentation and failure to report the truth adequately, we should ask whether, perhaps, the results showing frustrating complexity are correct.
Maybe there is not a need for better theory after all. In a sense the defining aspect of life is that it evolves not by the application of external forces as in physics, but by internal comparison--which is just what survey methods assess. Life is the result of billions of years of differential reproduction, by chance and various forms of selection--that is, continual relative comparison by local natural circumstances. 'Differential' is the key word here. It is the relative success among peers today that determines the genomes and their effects that will be here tomorrow. In effect, if often unwittingly and for lack of better ideas, that's just the sort of comparison made in statistical studies.
From that point of view, the problem is that we don't want to face up to the resulting truth, which is that a plethora of changeable, individually trivial causal factors is what we find because that's what exists. That we don't like that, don't report it cleanly, and want strong individual causation is our problem, not Nature's.
Is reporting p-values really the problem?
Yesterday I said that statistical inference is only as good as the correspondence between the mathematical assumptions of the methods and what is being tested in the real world. I think the issues at stake rest on a deep disparity between them. Worse, we don't and often cannot know which assumptions are violated, or how seriously. We can make guesses and do all auxiliary tests and the like, but as decades of experience in the social, behavioral, biomedical, epidemiological, and even evolutionary and ecological worlds show us, we typically have no serious way to check these things.
The problem is not just that significance is not the same as importance. A somewhat different problem with standard p-value cutoff criteria is that many of the studies in question involve many test variables, such as complex epidemiological investigations based on long questionnaires, or genomewide association studies (GWAS) of disease. Normally, p=0.05 means that by chance one test in 20 will seem to be significant, even if there's nothing causal going on in the data (e.g., if no genetic variant actually contributes to the trait). If you do hundreds or even many thousands of 0.05 tests (e.g., of sequence variants across the genome), even if some of the variables really are causative, you'll get so many false positive results that follow-up will be impossible. A standard way to avoid that is to correct for multiple testing by using only p-values that would be achieved by chance only once in 20 times of doing a whole multivariable (e.g., whole genome) scan. That is a good, conservative approach, but means that to avoid a litter of weak, false positives, you only claim those 'hits' that pass that standard.
You know you're only accounting for a fraction of the truly causal elements you're searching for, but they're the litter of weakly associated variables that you're willing to ignore to identify the mostly likely true ones. This is good conservative science, but if your problem is to understand the beach, you are forced to ignore all the sand, though you know it's there. The beach cannot really be understood by noting its few detectable big stones.
Sandy beach; Wikipedia, Lewis Clark |
But even this sensible play-it-conservative strategy has deeper problems.
How 'accurate' are even these preferred estimates?
The metrics like CIs and effect sizes that critics are properly insisting be (clearly) presented along with or instead of p-values face exactly the same issues as the p-value: the degree to which what is modeled fits the underlying mathematical assumptions on which test statistics rest.
To illustrate this point, the Pythagorean Theorem in plane geometry applies exactly and universally to right triangles. But in the real world there are no right triangles! There are approximations to right triangles, and the value of the Theorem is that the more carefully we construct our triangle the closer the square of the hypotenuse is to the sum of the squares of the other sides. If your result doesn't fit, then you know something is wrong and you have ideas of what to check (e.g., you might be on a curved surface).
Right triangle; Wikipedia |
In our statistical study case, knowing an estimated effect size and how unusual it is seems to be meaningful, but we should ask how accurate these estimates are. But that question often has almost no testable meaning: accurate relative to what? If we were testing a truth derived from a rigorous causal theory, we could ask by how many decimal places our answers differ from that truth. We could replicate samples and increase accuracy, because the signal to noise ratio would systematically improve. Were that to fail, we would know something was amiss, in our theory or our instrumentation, and have ideas how to find out what that was. But we are far, indeed unknowably far, from that situation. That is because we don't have such an externally derived theory, no analog to the Pythagorean Theorem, in important areas where statistical study techniques are being used.
In the absence of adequate theory, we have to concoct a kind of data that rests almost entirely on internal comparison to reveal whether 'something' of interest (often that we don't or cannot specify) is going on. We compare data such as cases vs controls, which forces us to make statistical assumptions such as that, other than (say) exposure to coffee, our sample of diseased vs normal subjects differ only in their coffee consumption, or that the distribution of other variation in unmeasured variables is random with regard to coffee consumption among our cases and controls subjects. This is one reason, for example, that even statistically significant correlation does not imply causation or importance. The underlying, often unstated assumptions are often impossible to evaluate. The same problem relates to replicability: for example, in genetics, you can't assume that some other population is the same as the population you first studied. Failure to replicate in this situation does not undermine a first positive study. For example, a result of a genetic study in Finland cannot be replicated properly elsewhere because there's only one Finland! Even another study sample within Finland won't necessarily replicate the original sample. In my opinion, the need for internally based comparison is the core problem, and a major reason why theory-poor fields often do so poorly.
The problem is subtle
When we compare cases and controls and insist on a study-wide 5% significance level to avoid a slew of false-positive associations, we know we're being conservative as described above, but at least those variables that do pass the adjusted test criterion are really causal with their effect strengths accurately estimated. Right? No!
When you do gobs of tests, some very weak causal factor may by good luck pass your test. But of those many contributing causal factors, the estimated effect size of the lucky one that passes the conservative test is something of a fluke. The estimated effect size may well be inflated, as experience in follow-up studies often or even typically shows.
In this sense it's not just p-values that are the problem, and providing ancillary values like CIs and effect sizes in study reports is something of a false pretense of openness, because all of these values are vulnerable to similar problems. The promise to require these other data is a stopgap, or even a strategy to avoid adequate scrutiny of the statistical inference enterprise itself.
It is nobody's fault if we don't have adequate theory. The fault, dear Brutus, is in ourselves, for using Promissory Science, and feigning far deeper knowledge than we actually have. We do that rather than come clean about the seriousness of the problems. Perhaps we are reaching a point where the let-down from over-claiming is so common that the secret can't be kept in the bag, and the paying public may get restless. Leaking out a few bits of recognition and promising reform is very different from letting all it all out and facing the problem bluntly and directly. The core problem is not whether a reported association is strong or meaningful, but, more importantly, that we don't know or know how to know.
This can be seen in a different way. If all studies including negative ones were reported in the literature, then it would be only right that the major journals should carry those findings that are most likely true, positive, and important. That's the actionable knowledge we want, and a top journal is where the most important results should appear. But the first occurrence of a finding, even if it turns out later to be a lucky fluke, is after all a new finding! So shouldn't investigators report it, even though lots of other similar studies haven't yet been done? That could take many years or, as in the example of Finnish studies, be impossible. We should expect negative results should be far more numerous and less interesting in themselves, if we just tested every variable we could think of willy-nilly, but in fact we usually have at least some reason to look, so it is far from clear what fraction of negative results would undermine the traditional way of doing business. Should we wait for years before publishing anything? That's not realistic.
If the big-name journals are still seen as the place to publish, and their every press conference and issue announcement is covered by the splashy press, why should they change? Investigators may feel that if they don't stretch things to get into these journals, or just publish negative results, they'll be thought to have wasted their time or done poorly designed studies. Besides normal human vanity, the risk is that they will not be able to get grants or tenure. That feeling is the fault of the research, reputation, university, and granting systems, not the investigator. Everyone knows the game we're playing. As it is, investigators and their labs have champagne celebrations when they get a paper in one of these journals, like winning a yacht race, which is a reflection of what one could call the bourgeois nature of the profession these days.
How serious is the problem? Is it appropriate to characterize what's going on as fraud, hoax, or silent conspiracy? Probably in some senses yes; at least there is certainly culpability among those who do understand the epistemological nature of statistics and their application. Plow ahead anyway is not a legitimate response to fundamental problems.
When reality is closely enough approximated by statistical assumptions, causation can be identified, and we don't need to worry about the details. Many biomedical and genetic, and probably even some sociological problems are like that. The methods work very well in those cases. But this doesn't gainsay the accusation that there is widespread over-claiming taking place and that the problem is a deep lack of sufficient theoretical understanding of our fields of interest, and a rush to do more of the same year after year.
It's all understandable, but it needs fixing. To be properly addressed, an entrenched problem requires more criticism even than this one has been getting recently. Until better approaches come along, we will continue wasting a lot of money in the rather socialistic support of research establishments that keep on doing science that has well-known problems.
Or maybe the problem isn't the statistics, after all?
The world really does, after all, seem to involve causation and at its basis seems to be law-like. There is truth to be discovered. We know this because when causation is simple or strong enough to be really important, anyone can find it, so to speak, without big samples or costly gear and software. Under those conditions, numerous details that modify the effect are minor by comparison to the major signals. Hundreds or even thousands of clear, mainly single-gene based disorders are known, for example. What is needed is remediation, hard-core engineering to do something about the known causation.
However, these are not the areas where the p-value and related problems have arisen. Those arise when very large and SASsy studies seem to be needed, and the reason is that there the causal factors are weak and/or very complex. Along with trying to root out misrepresentation and failure to report the truth adequately, we should ask whether, perhaps, the results showing frustrating complexity are correct.
Maybe there is not a need for better theory after all. In a sense the defining aspect of life is that it evolves not by the application of external forces as in physics, but by internal comparison--which is just what survey methods assess. Life is the result of billions of years of differential reproduction, by chance and various forms of selection--that is, continual relative comparison by local natural circumstances. 'Differential' is the key word here. It is the relative success among peers today that determines the genomes and their effects that will be here tomorrow. In effect, if often unwittingly and for lack of better ideas, that's just the sort of comparison made in statistical studies.
From that point of view, the problem is that we don't want to face up to the resulting truth, which is that a plethora of changeable, individually trivial causal factors is what we find because that's what exists. That we don't like that, don't report it cleanly, and want strong individual causation is our problem, not Nature's.
The statistics of Promissory Science. Part I: Making non-sense with statistical methods
Statistics is a form of mathematics, a way devised by humans for representing abstract relationships. Mathematics comprises axiomatic systems, which make assumptions about basic units such as numbers; basic relationships like adding and subtracting; and rules of inference (deductive logic); and then elaborates these to draw conclusions that are typically too intricate to reason out in other less formal ways. Mathematics is an awesomely powerful way of doing this abstract mental reasoning, but when applied to the real world it is only as true or precise as the correspondence between its assumptions and real-world entities or relationships. When that correspondence is high, mathematics is very precise indeed, a strong testament to the true orderliness of Nature. But when the correspondence is not good, mathematical applications verge on fiction, and this occurs in many important applied areas of probability and statistics.
You can't drive without a license, but anyone with R or SAS can be a push-button scientist. Anybody with a keyboard and some survey generating software can monkey around with asking people a bunch of questions and then 'analyze' the results. You can construct a complex, long, intricate, jargon-dense, expansive survey. You then choose who to subject to the survey--your 'sample'. You can grace the results with the term 'data', implying true representation of the world, and be off and running. Sample and survey designers may be intelligent, skilled, well-trained in survey design, and of wholly noble intent. There's only one little problem: if the empirical fit is poor, much of what you do will be non-sense (and some of it nonsense).
Population sciences, including biomedical, evolutionary, social and political fields, are experiencing an increasingly widely recognized crisis of credibility. The fault is not in the statistical methods on which these fields heavily depend, but in the degree of fit (or not) to the assumptions--with the emphasis these days on the 'or not', and a frequent dismissal of the underlying issues in favor of a patina of technical, formalized results. Every capable statistician knows this, but of course might be out of business if openly paying it enough attention. And many statisticians may be rather uninterested in, or too foggy about, the philosophy of science to understand what goes beyond the methodological technicalities. Jobs and journals depend on not being too self-critical. And therein lie rather serious problems.
Promissory science
There is the problem of the problems--the problems we want to solve, such as in understanding the cause of disease so that we can do something about it. When causal factors fit the assumptions, statistical or survey study methods work very well. But when causation is far from fitting the assumptions, the impulse of the professional community seems mainly to increase the size, scale, cost, and duration of studies, rather than to slow down and rethink the question itself. There may be plenty of careful attention paid to refining statistical design, but basically this stays safely within the boundaries of current methods and beliefs, and the need for research continuity. It may be very understandable, because one can't just quickly uproot everything or order up deep new insights. But it may be viewed as abuse of public trust as well as of the science itself.
The BBC Radio 4 program called More Or Less keeps a watchful eye on sociopolitical and scientific statistical claims, revealing what is really known (or not) about them. Here is a recent installment on the efficacy (or believability, or neither) of dietary surveys. And here is a FiveThirtyEight link to what was the basis of the podcast.
The promotion of statistical survey studies to assert fundamental discovery has been referred to as 'promissory science'. We are barraged daily with promises that if we just invest in this or that Big Data study, we will put an end to all human ills. It's a strategy, a tactic, and at least the top investigators are very well aware of it. Big long-term studies are a way to secure reliable funding and to defer delivering on promises into the vague future. The funding agencies, wanting to seem prudent and responsible to taxpayers with their resources, demand some 'societal impact' section on grant applications. But there is in fact little if any accountability in this regard, so one can say they are essentially bureaucratic window-dressing exercises.
Promissory science is an old game, practiced since time immemorial by preachers. It boils down to promising future bliss if you'll just pay up now. We needn't be (totally) cynical about this. When we set up a system that depends on public decisions about resources, we will get what we've got. But having said that, let's take a look at what is a growing recognition of the problem, and some suggestions as to how to fix it--and whether even these are really the Emperor of promissory science dressed in less gaudy clothing.
A growing, at least partial, awareness
The problem of results that are announced by the media, journals, universities, and so on, but that don't deliver on the advertised promises, is complex but widespread. In part because research has become so costly, some warning sirens are sounding as it becomes clear that the promised goods are not being delivered.
One widely known issue is the lack of reporting of negative results, or their burial in minor journals. Drug-testing research is notorious for this under-reporting. It's too bad because a negative result on a well-designed test is legitimately valuable and informative. A concern, besides corporate secretiveness, is that if the cost is high, taxpayers or share-holders may tire of funding yet more negative studies. Among other efforts, including by NIH, there is a formal attempt called AllTrials to rectify the under-reporting of drug trials, and this does seem at least to be thriving and growing if incomplete and not enforceable. But this non-reporting problem has been written about so much that we won't deal with it here.
Instead, there is a different sort of problem. The American Statistical Association has recently noted an important issue, which is the use and (often) misuse of p-values to support claims of identified causation (we've written several posts in the past about these issues; search on 'p-value' if you're interested, and the post by Jim Wood is especially pertinent). FiveThirtyEight has a good discussion of the p-value statement.
The usual interpretation is that p represents the probability that, if there is in fact no causation by the test variable, an apparent effect at least as large as the one observed would arise just by chance. So if the observed p in a study is less than some arbitrary cutoff, such as 0.05, it means essentially that if no causation were involved the chance you'd see an association this strong anyway is no greater than 5%; that is, there is some evidence for a causal connection.
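To make that interpretation concrete, here is a minimal Python sketch (not anything from the ASA statement or the studies cited here). It simulates many hypothetical two-group studies in which the test variable has no effect at all, and counts how often p still falls below 0.05. The function names, group size, baseline risk, and the normal-approximation test are all illustrative assumptions.

```python
import math
import random


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (x1 / n1 - x2 / n2) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal
    return math.erfc(abs(z) / math.sqrt(2))


def fraction_significant(n_studies=2000, n_per_group=200, risk=0.3,
                         alpha=0.05, seed=1):
    """Simulate studies where both groups share the same true risk (no effect)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Both groups are drawn with the SAME risk: the null hypothesis is true.
        x1 = sum(rng.random() < risk for _ in range(n_per_group))
        x2 = sum(rng.random() < risk for _ in range(n_per_group))
        if two_proportion_p_value(x1, n_per_group, x2, n_per_group) < alpha:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    print(f"Fraction of null studies with p < 0.05: {fraction_significant():.3f}")
```

Under these assumptions roughly one null study in twenty clears the 0.05 cutoff purely by chance, which is all that the cutoff itself guarantees.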
Trashing p-values is becoming a new cottage industry! Now JAMA is on the bandwagon, with an article showing that, in a survey of the biomedical literature from the past 25 years, including well over a million papers, a highly disproportionate and increasing number of studies reported statistically significant results. Here is the study on the JAMA web page, though it is not public domain yet.
Besides the apparent reporting bias, the JAMA study found that those papers generally failed to provide adequate fleshing out of that result. Where are all the negative studies that statistical principles would lead us to expect? We don't see them, especially in the 'major' journals, as has been noted many times in recent years. Just as importantly, authors often did not report confidence intervals or other measures of the degree of 'convincingness' that might illuminate the p-value. In a sense that means authors didn't say what range of effects is consistent with the data. They reported a non-random effect, but often didn't give the effect size; that is, they didn't say how large the effect was, even assuming that it was unusual enough to support a causal explanation. So, for example, a statistically significant increase of risk from 1% to 1.01% is trivial, even if one could accept all the assumptions of the sampling and analysis.
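The 1% versus 1.01% example can be worked through numerically. The sketch below is illustrative only: the group sizes and case counts are made-up numbers chosen to match those two risks exactly, and the function name, the normal-approximation test, and the interval are my assumptions, not anything taken from the JAMA paper.

```python
import math


def compare_risks(cases_a, n_a, cases_b, n_b):
    """Compare two observed risks: p-value, risk difference, and a 95% interval."""
    risk_a = cases_a / n_a
    risk_b = cases_b / n_b
    diff = risk_b - risk_a

    # Pooled standard error, used for the test of "no difference".
    pooled = (cases_a + cases_b) / (n_a + n_b)
    se_test = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se_test == 0:
        p_value = 1.0
    else:
        p_value = math.erfc(abs(diff) / se_test / math.sqrt(2))

    # Unpooled standard error, used for the interval around the difference.
    se_ci = math.sqrt(risk_a * (1 - risk_a) / n_a + risk_b * (1 - risk_b) / n_b)
    ci = (diff - 1.96 * se_ci, diff + 1.96 * se_ci)
    return {"p_value": p_value, "risk_difference": diff, "ci_95": ci}


# Hypothetical counts: 1% vs 1.01% risk, 20 million people per group.
huge = compare_risks(200_000, 20_000_000, 202_000, 20_000_000)
# The same two risks, but only 20 thousand people per group.
modest = compare_risks(200, 20_000, 202, 20_000)

for label, result in (("20 million per group", huge),
                      ("20 thousand per group", modest)):
    print(label, result)
```

With the huge sample the p-value falls well below 0.05, yet the estimated difference is still only about one extra case per 10,000 people; with the modest sample the very same difference is nowhere near significant. The p-value tracks sample size as much as it tracks importance, which is why the effect size and its interval need to be reported alongside it.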
Another vocal critic of what's afoot is John Ioannidis; in a recent article he levels both barrels against the misuse and mis- or over-representation of statistical results in biomedical sciences, including meta-analysis (the pooling of many diverse small studies into a single large analysis to gain sufficient statistical power to detect effects and test for their consistency). This paper is a rant, but a well-deserved one, about how 'evidence-based' medicine has been 'hijacked' as he puts it. The same must be said of 'precision genomic' or 'personalized' medicine, or 'Big Data', and other sorts of imitative sloganeering going on from many quarters who obviously see this sort of promissory science as what you have to do to get major funding. We have set ourselves a professional trap, and it's hard to escape. For example, the same author has been leading the charge against misrepresentative statistics for many years, and he and others have shown that the 'major' journals have in a sense the least reliable results in terms of their replicability. But he's been raising these points in the same journals that he shows are culpable for the problem, rather than boycotting those journals. We're in a trap!
These critiques of current statistical practice are the points getting most of the ink and e-ink. There may be a lot of cover-ups of known issues, and even hypocrisy, in all of this, and perhaps more open or understandable tacit avoidance. The industry (e.g., drug, statistics, and research equipment) has a vested interest in keeping the motor running. Authors need to keep their careers on track. And, in the fairest and non-political sense, the problems are severe.
But while these issues are real and must be openly addressed, I think the problems are much deeper. In a nutshell, I think they relate to the nature of mathematics relative to the real world, and the nature and importance of theory in science. We'll discuss this tomorrow.