
Statistics controversy: missing the p-oint.

There is a valuable discussion in Nature about the problems that have arisen from the (mis)use of statistics for decision-making.  To simplify the issue, it is that a rather subjectively chosen cutoff p-value leads to dichotomizing our inferences, when the underlying phenomena may or may not be dichotomous.  For example, to explain things in a simplistic way, if a study's results pass such a cutoff test, it means that the chance the observed result would arise if nothing is going on (as opposed to the hypothesized effect) is so small--less than the cutoff fraction of the time, conventionally 5%--that we accept the data as showing that our suggested something is going on.  In other words, rare results (using our cutoff criterion for what 'rare' means) are taken to support our idea of what's afoot.  The chosen cutoff level is arbitrary and used by convention, and it doesn't reflect the various aspects of uncertainty or alternative interpretations that may abound in the actual data.
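To make the dichotomy concrete, here is a minimal sketch (in Python, with purely hypothetical test statistics) of how two studies carrying essentially the same amount of evidence can land on opposite sides of the conventional 5% cutoff, and so end up being reported as opposite conclusions:

```python
import math

def p_value_one_sided(z):
    """One-sided p-value for a standard-normal test statistic."""
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical example: two studies with nearly identical evidence
# fall on opposite sides of the conventional 0.05 cutoff.
for z in (1.64, 1.66):
    p = p_value_one_sided(z)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"z = {z:.2f}  p = {p:.4f}  -> {verdict}")
```

Nothing about the evidence differs meaningfully between the two cases; only the side of an arbitrary line does.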

The Nature commentaries address these issues in various ways, and suggestions are made.  These are helpful and thoughtful in themselves, but they miss what I think is a very important, indeed often the critical, point when it comes to their application in many areas of biology and social science.

Instrumentation errors
In these (as in other) sciences, various measurements and technologies are used to collect data.  These are mechanical, so to speak, and are always imperfect.  Sometimes it is reasonable to assume that the errors are unrelated to what is being measured (for example, their distribution doesn't depend on the value of a given instance) and don't affect what is being measured (as quantum measurements can).  In that case, correcting for them in some reasonably systematic way, such as by assuming normally distributed errors, clearly helps adjust findings for inadvertent but causally unconnected errors.

Such corrections seem to apply quite validly to social and biological, including evolutionary and genetic, sciences.  We'll never have perfect instrumentation or measurement, and often don't know the nature of our imperfections.  Assuming errors uncorrelated with what is being sought seems reasonable even if approximate to some unknown degree. It's worked so well in the past that this sort of probabilistic treatment of results seems wholly appropriate.
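As a rough illustration of why that assumption is so convenient, here is a small simulation (hypothetical trait values and error sizes, nothing more) in which instrument error is additive, normally distributed, and uncorrelated with the true values; the noise inflates the scatter but leaves the estimate of the mean essentially unbiased, which is what makes systematic correction possible:

```python
import random

random.seed(1)

# Hypothetical trait values, and an instrument that adds independent,
# normally distributed error uncorrelated with the true value.
true_values = [random.gauss(100.0, 15.0) for _ in range(10_000)]
measured = [x + random.gauss(0.0, 5.0) for x in true_values]

mean_true = sum(true_values) / len(true_values)
mean_meas = sum(measured) / len(measured)

# With error uncorrelated with the signal, the sample mean is (nearly)
# unbiased, and the extra scatter can be modeled and adjusted for.
print(f"true mean     {mean_true:.2f}")
print(f"measured mean {mean_meas:.2f}")
```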

But instrumentation errors are not the only possible errors in some sciences.

Conceptual errors: you can't 'correct' for them in inappropriate studies
Statistics is, properly, a branch of mathematics.  That means it is an axiomatic system, an if-then way to make deductions or inductions.  When and if the 'if' conditions are met, the 'then' consequences must follow.  Statistics rests on probabilism rather than determinism, in the sense that it relates to, and is developed around, the idea that some phenomena only occur with a given probability, say p, and that such a value somehow exists in Nature.

It may have to do with the practicalities of sampling by us, or by some natural screening phenomenon (as in, say, mutation, Mendelian transmission, natural selection). But it basically always rests on some version or other of an assumption that the sampling is parametric, that is, that our 'p' value somehow exists 'out there' in Nature.  If we are, say, sampling 10% of a population (and the latter is actually well-defined!) then each draw has the same properties.  For example, if it is a 'random' sample, then no property of a potential samplee affects whether or not it is actually sampled.

But note there is a big 'if' here: Sampling or whatever process is treated as probabilistic needs to have a parameter value!  It is that which is used to compute significance measures and so on, from which we draw conclusions based on the results of our sample.
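A toy sketch of what that parametric assumption buys us, using a made-up 'true' frequency: because a fixed p is assumed to exist 'out there', repeated samples scatter around it in a way the theory predicts exactly, and significance calculations lean on that predictability:

```python
import random

random.seed(2)

p_true = 0.10          # the assumed fixed parameter 'out there' in Nature
n = 500                # sample size per hypothetical study

# Repeated studies drawing from the same fixed parameter: the estimates
# scatter around p_true with variance p(1-p)/n, exactly as theory says.
estimates = [sum(random.random() < p_true for _ in range(n)) / n
             for _ in range(5)]
print([round(e, 3) for e in estimates])
print("theoretical SD of an estimate:",
      round((p_true * (1 - p_true) / n) ** 0.5, 4))
```

The whole apparatus of standard errors and significance levels is built on that fixed, shared p.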

Is the universe parametric?  Is life?
In physics, for example, the universe is assumed to be parametric.  It is, universally, assumed to have some properties, like the gravitational constant, Planck's constant, the speed of light, and so on.  We can estimate the parameters here on earth (as, for example, Newton himself suggested), but assume they're the same elsewhere.  If observation challenges that, we assume the cosmos is regular enough that there are at least some regularities, even if we've not figured them all out yet.

A key feature of a parametric universe is replicability.  When things are replicable because they are parametric--that is, have fixed universal properties--then statistical estimates and their standard deviations etc. make sense and should reflect the human-introduced (e.g., measurement) sources of variation, not Nature's.  Statistics is a field largely developed for this sort of context, or others in which sampling was reasonably assumed to represent the major source of error.

In my view it is more than incidental, but profound, that 'science' as we know it was an enterprise developed to study the 'laws' of Nature.  Maybe this was the product of the theological beliefs that had preceded the Enlightenment or, as I think at least Newton said, 'science' was trying to understand God's laws.

In this spirit, in his Principia Mathematica (his most famous book), Newton stated the idea that if you understand how Nature works in some local example, what you learn applies to the entire cosmos.  This is how science, usually implicitly, works today.  Chemistry here is assumed to be the same as chemistry in any distant galaxy, even those we cannot see.  Consistency is the foundation upon which our idea of the cosmos, and in that sense classical science, has been built.

Darwin was, in this sense, very clearly a Newtonian.  Natural selection was a 'force' he likened to gravity, and his idea of 'chance' was not the formal one we use today.  But what he did observe, though implicitly, was that evolution was about competing differences.  In this sense, evolution is inherently not parametric.

Not only does evolution rest heavily on probability--chance aspects of reproductive success, which Darwin only minimally acknowledged--but it rests on each individual's own reproductive success being unique.  Without variation, and that means variation in the traits that affect success, not just 'neutral' ones, there would be no evolution.

In this sense, the application of statistics and statistical inference in life sciences is legitimate relative to measurement and sampling issues, but is not relevant in terms of the underlying assumptions of its inferences.  Each study subject is not identical except for randomly distributed 'noise', whether in our measurement or in its fate.

Life has properties we can measure and assign average values to, like the average reproductive success of AA, Aa, and aa genotypes at a given gene. But that is a retrospective average, and it is contrary to what we know about evolution to assume that, say, all AA's have the same fitness parameter and their reproductive variation is only due to chance sampling from that parameter.
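A small simulation may make the distinction clearer.  The numbers and distributions below are purely illustrative, not biological estimates: in one model every 'AA' carrier shares a single expected offspring number, in the other each individual has its own; the retrospective averages look the same, but the spread of outcomes does not, and a single fitted parameter would hide that difference:

```python
import random

random.seed(3)
n = 20_000

# Model 1: every 'AA' individual shares the same expected offspring number.
fixed = [random.expovariate(1 / 2.0) for _ in range(n)]      # mean 2.0 for all

# Model 2: each individual has its own expected value (hypothetical spread),
# so the population mean is the same but the variance is inflated.
hetero = [random.expovariate(1 / random.uniform(0.5, 3.5)) for _ in range(n)]

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

print("shared parameter    mean %.2f  var %.2f" % mean_var(fixed))
print("unique 'parameters' mean %.2f  var %.2f" % mean_var(hetero))
```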

Thinking of life in parametric terms is a convenience, but is an approximation of unknown and often unknowable inaccuracy.  Evolution occurs over countless millennia, in which the non-parametric aspects can be dominating.  We can estimate, say, recombination or mutation or fitness values from retrospective data, but they are not parameters that we can rigorously apply to the future and they typically are averages among sampled individuals.

Genetic effects are unique to each background and environmental experience, and we should honor that uniqueness as such!  The statistical crisis that many are trying valiantly to explain away, so they can return to business as usual (even if no longer reporting p-values), is a crisis of convenience, because it makes us think that a bit of different reportage (confidence limits rather than p-values, for example) will cure all ills.  That is a band-aid, a convenient port in a storm, but an illusory fix.  It does not recognize the important, or even central, degree to which life is not a parametric phenomenon.

Understanding Obesity? Fat Chance!

Obesity is one of our more widespread and serious health-threatening traits.  Many large-scale mapping studies of obesity, as well as extensive environmental/behavioral epidemiological studies, have been done over recent decades.  But if anything, the obesity epidemic seems to be getting worse.

There's deep meaning in that last sentence: the prevalence of obesity is changing rapidly.  This is being documented globally, and happening before our eyes.  Perhaps the most obvious implication is that this serious problem is not due to genetics!  That is, it is not due to genotypes that in themselves make you obese.  Although everyone's genotype is different, the changes are happening within lifetimes, so we can't attribute them to the different details of each generation's genotypes or their evolution over time.  Instead, the trend is clearly due to lifestyle changes during those lifetimes.

Of course, if you see everything through gene-colored lenses, you might argue (as people have) that sure, it's lifestyles, but only some key nutrient-responding genes are responsible for the surge in obesity.  These are the 'druggable' targets that we ought to be finding, and it should be rather easy since the change is so rapid that the genes must be few, so that even if we can't rein in McD and KFC toxicity, or passive TV-addiction, we can at least medicate the result.  That was always, at best, wishful thinking, and at worst, rationalization for funding Big Data studies.  Such a simple explanation would be good for KFC, and an income flood for BigPharma, the GWAS industry, DNA sequencer makers, and more.....except not so good for  those paying the medical price, and those who are trying to think about the problem in a disinterested scientific way.  Unfortunately, even when it is entirely sincere, that convenient hope for a simple genetic cause is being shown to be false.

A serious parody?
Year by year, more factors are identified that, by statistical association at least and sometimes by experimental testing, contribute to obesity.  A very fine review of this subject has appeared in the mid-October 2017 Nature Reviews Genetics, by Ghosh and Bouchard, which takes seriously not just genetics but all the plausible causes of obesity, including behavior and environment, and their relationships as best we know them, and outlines the current state of knowledge.

Ghosh and Bouchard provide a well-caveated assessment of these various threads of evidence now in hand, and though they do end up with the pro forma plea for yet more funding to identify yet more details, they provide a clear picture that a serious reader can take seriously on its own merits.  However, we think that the proper message is not the usual one.  It is that we need to rethink what we've been investing so heavily on.

To their great credit, the authors melded behavioral, environmental, and genetic causation in their analysis. This is shown in this figure, from their summary; it is probably the best current causal map of obesity based on the studies the authors included in their analysis:



If this diagram were being discussed by John Cleese on Monty Python, we'd roar with laughter at what was an obvious parody of science.  But nobody's laughing, and this isn't a parody!  Nor is it of unusual shape or complexity: diagrams like this (but with little if any environmental component) have been produced by analyzing gene expression patterns even just of the early development of the simple sea urchin.  We don't laugh, understandably, because these are serious diagrams.  On the other hand, we don't seem to be reacting other than by saying we need more of the same.  I think that is rather weird, for scientists, whose job it is to understand, not just list, the nature of Nature.

We said at the outset of this post that 'the obesity epidemic seems to be getting worse'.  There's a deep message there, but one essentially missing even from this careful obesity paper: many of the causal factors, including genetic variants, are changing before our eyes.  The frequency of genetic variants changes from population to population and generation to generation, so that all samples will look different.  And mutations happen in every meiosis, adding new variants to a population every time a baby is born.  The results of many studies, as reflected in the current summary by Ghosh and Bouchard, show the many gene regions that contribute to obesity, but their total net contribution is still minor.  It is possible, though perhaps very difficult to demonstrate, that an individual site might account for more than a minimal share of risk in some individual carriers, in ways GWAS results can't really identify.  And the authors do cite published opinions that claim a higher efficacy of GWAS relative to obesity than we think is seriously defensible; but even if we're wrong, causation is very complex, as the figure shows.

The individual genomic variants will vary in their presence or absence or frequency or average effect among studies, not to mention populations.  In addition, most contributing genetic variants are too rare or weak to be detected by the methods used in mapping studies, because of the constraints on statistical significance criteria, which is why so much of the trait's heritability in GWAS is typically unaccounted for by mapping.  These aspects and their details will differ greatly among samples and studies.

Relevant risk factors will come or go or change in exposure levels in the future--but these cannot be predicted, not even in principle.  Their interactions and contributions are also manifestly context-specific, as secular trends clearly show.  Even with the set of known genetic variants and other contributing factors, there are essentially an unmanageable number of possible combinations, so that each person is genetically and environmentally unique, and the complex combinations of future individuals are not predictable.

Risk assessment is essentially based on replicability, which in a sense is why statistical testing (on which these sorts of results heavily rely) can be used.  However, because these risk factor combinations are each unique, they're not replicable.  At best, as some advocate, the individual effects are additive, so that if we just measure each factor in an individual we can add up the effects and predict the person's obesity (if the effects are not additive, this won't work).  We can probably predict, if perhaps not control, at least some of the major risk factors (people will still down pizzas or fried chicken while sitting in front of a TV).  But even the known genetic factors in total only account for a small percentage of the trait's variance (the authors' Table 2), though the paper cites more optimistic authors.
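For what it's worth, the additive assumption just mentioned amounts to something like the following sketch, with entirely hypothetical factor names and effect sizes; the point is that the prediction is just a sum, which is only as good as the assumption that effects neither interact nor vary by genetic background or environment:

```python
# A minimal sketch of the purely additive model described above:
# per-factor effects (hypothetical numbers) are simply summed.
effect_sizes = {                      # change in a BMI-like score per unit
    "risk_allele_count_snp1": 0.15,
    "risk_allele_count_snp2": 0.08,
    "hours_tv_per_day": 0.40,
    "fast_food_meals_per_week": 0.30,
}

def additive_prediction(baseline, exposures):
    """Sum each factor's effect onto a baseline; valid only if effects
    neither interact nor vary by background or environment."""
    return baseline + sum(effect_sizes[k] * v for k, v in exposures.items())

person = {"risk_allele_count_snp1": 2, "risk_allele_count_snp2": 1,
          "hours_tv_per_day": 3, "fast_food_meals_per_week": 4}
print(round(additive_prediction(25.0, person), 2))
```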

The result of these indisputable facts is that as long as our eyes are focused, for research strategic reasons or lack of better ideas, on the litter of countless minor factors, even those we can identify, we have a fat chance of really addressing the problem this way.

If you pick any of the arrows (links) in this diagram, you can ask how strong or necessary that link is, how much it may vary among samples or depend on the European nature of the data used here, or to what extent even its identification could be a sampling or statistical artifact.  Links like 'smoking' or 'medication', not to mention specific genes, even if they're wholly correct, surely have quantitative effects that vary among people even within the sample, and the effect sizes probably often have very large variance.  Many exposures are notoriously inaccurately reported or measured, or change in unmeasured ways.  Some are quite vague, like 'lifestyle', 'eating behavior', and many others--both hard to define and hard to assess with knowable precision, much less predictability.  Whether their many effects are additive or interact in more complex ways is another issue, and the connectivity diagram may be tentative in many places.  Maybe--probably?--in such traits simple behavioral changes would over-ride most of these factors, leaving those persons for whom obesity really is due to their genotype, who would then be amenable to gene-focused approaches.

If this is a friable diagram, that is, if the items, strengths, connections and so on are highly changeable, even if through no fault of the authors whatever, we can ask when and where and how this complex map is actually useful, no matter how carefully it was assembled.  Indeed, even if this is a rigidly accurate diagram for the samples used, how applicable is it to other samples or to the future?  Or how useful is it in predicting not just group patterns, but individual risk?

Our personal view is that the rather ritual plea for more and more and bigger and bigger statistical association studies is misplaced, and, in truth, a way of maintaining funding and the status quo, something we've written much about--the sociopolitical economics of science today.  With obesity rising at a continuing rate and about a third of the US population recently reported as obese, we know that the future health care costs for the consequences will dwarf even the mega-scale genome mapping on which so much is currently being spent, if not largely wasted.  We know how to prevent much or most obesity in behavioral terms, and we think it is entirely fair to ask why we still pour resources into genetic mapping of this particular problem.

There are many papers on other complex traits that might seem simple, like stature and blood pressure, not to mention more mysterious ones like schizophrenia or intelligence, in which hundreds of genomewide sites are implicated, strewn across the genome.  Different studies find different sites, and in most cases most of the heritability is not accounted for, meaning that many more sites are at work (and this doesn't include environmental effects).  In many instances, even the trait's definition itself may be comparably vague, or may change over time.  This is a landscape 'shape' in which every detail is different, within and between traits, but whose overall form is common to complex traits.  That in itself is a tipoff that there is something consistent about these landscapes, but we've not yet really awakened to it or learned how to approach it.

Rather than being skeptical about Ghosh and Bouchard's careful analysis or their underlying findings, I think we should accept their general nature, even if the details in any given study or analysis may not individually be so rigid and replicable, and ask: OK, this is the landscape--what do we do now?

Is there a different way to think about biological causation?  If not, what is the use or point of this kind of complexity enumeration, in which every person is different and the risks for the future may not be those estimated from past data to produce figures like the one above?  The rapid change in prevalence shows how unreliable these factors must be at prediction--they are retrospective of the particular patterns of the study subjects.  Since we cannot predict the strengths or even presence of these or other new factors, what should we do?  How can we rethink the problem?

These are the harder questions, much harder than analyzing the data; but they are in our view the real scientific questions that need to be asked.

The state of play in science

I've just read a new book that MT readers would benefit from reading as well.  It's Rigor Mortis, by Richard Harris (2017: Basic Books).  His subtitle is How sloppy science creates worthless cures, crushes hope, and wastes billions.  One might suspect that this title is stridently overstated, but while it is quite forthright--and its argument well-supported--I think the case is actually understated, for reasons I'll explain below.

Harris, science reporter for National Public Radio, goes over many different problems that plague biomedical research.  At the core is the reproducibility problem, that is, the number of claims in research papers that are not reproducible by subsequent studies.  This particular problem made the news within the last couple of years in regard to the use of statistical criteria like p-values (significance cutoffs), and because of the major effort in psychology to replicate published studies, with a lot of failure to do so.  But there are other issues.

The typical scientific method assumes that there is a truth out there, and a good study should detect its features.  But if it's a truth, then some other study should get similar results.  Yet many, many times in biomedical research, despite huge ballyhoo, with cheerleading by the investigators as well as the media, studies' breakthrough!! findings can't be supported by further examination.

As Harris extensively documents, this phenomenon is seen in claims of treatments or cures, or use of animal models (e.g., lab mice), or antibodies, or cell lines, or statistical 'significance' values.  It isn't a long book, so you can quickly see the examples for yourself.  Harris also accounts for the problems, quite properly I think, by documenting sloppy science but also the careerist pressures on investigators to find things they can publish in 'major' journals, so they can get jobs, promotions, high 'impact factor' pubs, and grants. In our obviously over-crowded market, it can be no surprise to anyone that there is shading of the truth, a tad of downright dishonesty, conveniently imprecise work, and so on.

Since scientists feed at the public trough (or depend on profits and sales for biomedical products to grant-funded investigators), they naturally have to compete and don't want to be shown up, and they have to work fast to keep the funds flowing in.  Rigor Mortis properly homes in on an important fact, that if our jobs depend on 'productivity' and bringing in grants, we will do what it takes, shading the truth or whatever else (even the occasional outright cheating) to stay in the game.

Why share data with your potential competitors who might, after all, find fault with your work or use it to get the jump on you for the next stage?  For that matter, why describe what you did in enough actual detail that someone (a rival or enemy!) might attempt to replicate your work.....or fail to do so? Why wait to publish until you've got a really adequate explanation of what you suggest is going on, with all the i's dotted and t's crossed?  Haste makes credit!  Harris very clearly shows these issues in the all-too human arena of our science research establishment today.  He calls what we have now, appropriately enough, a "broken culture" of science.

Part of that I think is a 'Malthusian' problem.  We are credited, in score-counting ways, by chairs and deans, for how many graduate students we turn (or churn) out.  Is our lab 'productive' in that way?  Of course, we need that army of what often are treated as drones because real faculty members are too busy writing grants or traveling to present their (students') latest research to waste--er, spend--much time in their labs themselves.  The result is the cruel excess of PhDs who can't find good jobs, wandering from post-doc to post-doc (another form of labor pool), or to instructorships rather than tenure-track jobs, or who simply drop out of the system after their PhD and post-docs.  We know of many who are in that boat; don't you?  A recent report showed that the mean age of first grant from NIH was about 45: enough said.

A reproducibility mirage
If there were one central technical problem that Harris stresses, it is the number of results that fail to be reproducible in other studies.  Irreproducible results leave us in limbo-land: how are we to interpret them?   What are we supposed to believe?  Which study--if any of them--is correct?  Why are so many studies proudly claiming dramatic findings that can't be reproduced, and/or why are the news media and university PR offices so loudly proclaiming these reported results?  What's wrong with our practices and standards?

Rigor Mortis goes through many of these issues, forthrightly and convincingly--showing that there is a problem.  But a solution is not so easy to come by, because it would require major shifting of and reform in research funding.  Naturally, that would be greatly resisted by hungry universities and those whom they employ to set up a shopping-mall on their campus (i.e., faculty).

One purpose of this post is to draw attention to the wealth of reasons Harris presents for why we should be concerned about the state of play in biomedical research (and, indeed, in science more generally).  I do have some caveats, which I'll discuss below, but they are in no way intended to diminish the points Harris makes in his book.  What I want to add is a reason why I think that, if anything, Harris' presentation, strong and clear as it is, understates the problem.  I say this because to me there is a deeper issue, beyond the many Harris enumerates: a deeper scientific problem.

Reproducibility is only the tip of the iceberg!
Harris stresses or even focuses on the problem of irreproducible results.  He suggests that if we were to hold far higher evidentiary standards, our work would be reproducible, and the next study down the line wouldn't routinely disagree with its predecessors.  From the point of view of careful science and proper inferential methods and the like, this is clearly true.  Many kinds of studies in biomedical and psychological sciences should have a standard of reporting that leads to at least some level of reproducibility.

However, I think that the situation is far more problematic than sloppy and hasty standards, or questionable statistics, even if these are clearly prominent issues.  My view is that no matter how high our methodological standards are, the expectation of reproducibility flies in the face of what we know about life.  That is because life is not a reproducible phenomenon in the way physics and chemistry are!

Life is the product of evolution.  Nobody with open eyes can fail to understand that, and this applies to biological, biomedical, psychological and social scientists.  Evolution is at its very core a phenomenon that rests essentially on variation--on not being reproducible.  Each organism, indeed each cell, is different. Not even 'identical' twins are identical.

One reason for this is that genetic mutations are always occurring, even among the cells within our bodies. Another reason is that no two organisms are experiencing the same environment, and environmental factors affect and interact with the genomes of each individual organism of any species.  Organisms affect their environments in turn. These are dynamic phenomena and are not replicable!

This means that, in general, we should not be expecting reproducibility of results.  But one shouldn't overstate this: the fact that two humans are different doesn't mean they are entirely different.  Similarity is correlated with kinship, from first-degree relatives to members of populations, species, and different species.  The problem is not that there is similarity; it is that we have no formal theory about how much similarity to expect.  We know two samples of people will differ, both among those in each sample and between samples.  And even the same people sampled at separate times will be different, due to aging, exposure to different environments, and so on.  Proper statistical criteria can answer questions about whether differences seem due only to sampling from variation or to causal differences.  But that is a traditional assumption from the origin of statistics and probability, and it isn't entirely apt for biology: since we cannot assume identity of individuals, much less of samples or populations (or species, as in using mouse models for human disease), our work requires some understanding of how much difference, or what sort of difference, we should expect--and build that into our models and tests.
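The kind of question standard statistics can answer is illustrated by a simple permutation test (a generic sketch with simulated measurements, not any particular study's data): it asks whether an observed difference between two samples is larger than reshuffling the group labels would routinely produce.  What it cannot tell us is how much difference we should have expected between non-identical individuals or samples in the first place:

```python
import random

random.seed(4)

def mean(xs):
    return sum(xs) / len(xs)

def permutation_p(sample_a, sample_b, n_perm=10_000):
    """How often does reshuffling group labels produce a mean difference
    at least as large as the observed one?"""
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = sample_a + sample_b
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical measurements from two samples of people.
a = [random.gauss(100, 10) for _ in range(50)]
b = [random.gauss(104, 10) for _ in range(50)]
print(permutation_p(a, b))
```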

Evolution is by its very nature an ad hoc phenomenon in both time and place, meaning that there are no fixed rules about this, as there are laws of gravity or of chemical reactions. That means that reproducibility is not, in itself, even a valid criterion for judging scientific results.  Some reproducibility should be expected, but we have no rule for how much and, indeed, evolution tells us that there is no real rule for that.

One obvious and not speculative exemplar of the problem is the redundancy in our systems.  Genomewide mapping has documented this exquisitely well: if variation at tens, hundreds, or sometimes even thousands of genome sites affects a trait, like blood pressure, stature, or 'intelligence', and no two people have the same genotype, then no two people, even with the same trait measure, have that measure for the same reason.  And as is very well known, mapping only accounts for a fraction of the estimated heritability of the studied traits, meaning that much or usually most of the contributing genetic variation is unidentified.  And then there's the environment. . . . .
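A toy count makes the redundancy point concrete.  With just 10 hypothetical loci, each adding one unit to a trait when the risk allele is present, hundreds of distinct genotypes produce exactly the same trait value; real traits involve far more sites, plus environments:

```python
import itertools

# Toy example: 10 hypothetical loci, each adding 1 unit when the risk
# allele is present.  Count how many distinct genotypes give a trait
# value of exactly 5: many different genetic routes to one measure.
n_loci = 10
genotypes_with_value_5 = sum(
    1 for g in itertools.product((0, 1), repeat=n_loci) if sum(g) == 5
)
print(genotypes_with_value_5)   # 252 distinct genotypes, one trait value
```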

It's a major problem. It's an inconvenient truth.  The sausage-grinder system of science 'productivity' cannot deal with it.  We need reform.  Where can that come from?

Rare Disease Day and the promises of personalized medicine

Our daughter Ellen wrote the post that I republish below 3 years ago, and we've reposted it in commemoration of Rare Disease Day, Febru...