
The state of play in science

I've just read a new book that MT readers would benefit from reading as well.  It's Rigor Mortis, by Richard Harris (2017: Basic Books).  His subtitle is How sloppy science creates worthless cures, crushes hope, and wastes billions.  One might suspect that this title is stridently overstated, but while it is quite forthright--and its argument well-supported--I think the case is actually understated, for reasons I'll explain below.

Harris, a science reporter for National Public Radio, goes over many different problems that plague biomedical research. At the core is the reproducibility problem, that is, the number of claims in research papers that cannot be reproduced by subsequent studies.  This particular problem has made the news within the last couple of years, in regard to statistical criteria like p-values (significance cutoffs) and because of the major effort in psychology to replicate published studies, much of which failed.  But there are other issues.

The typical scientific method assumes that there is a truth out there, and a good study should detect its features.  But if it's a truth, then some other study should get similar results.  Yet time and again in biomedical research, despite huge ballyhoo and cheerleading by investigators and media alike, studies' 'breakthrough' findings can't be supported by further examination.
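As a rough, back-of-the-envelope illustration of how a p < 0.05 cutoff plus small, underpowered studies can fill a literature with 'significant' findings that then fail to replicate, here is a minimal simulation sketch; the sample sizes, the effect size, and the assumption that only 10% of tested effects are real are invented for illustration, not taken from Harris's book.

```python
# Illustrative sketch (assumed numbers, not from the book): many small studies test
# candidate effects, only some of which are real, using a p < 0.05 cutoff.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_per_arm = 1000, 20           # many small studies (assumed)
true_effect = 0.3                          # modest real effect, in SD units (assumed)
is_real = rng.random(n_studies) < 0.10     # assume only 10% of tested effects are real

def run_study(real):
    a = rng.normal(0, 1, n_per_arm)
    b = rng.normal(true_effect if real else 0, 1, n_per_arm)
    return stats.ttest_ind(b, a).pvalue < 0.05

original = np.array([run_study(r) for r in is_real])      # the "discoveries"
replication = np.array([run_study(r) for r in is_real])   # independent repeats

discovered = original
replicated = discovered & replication
print(f"'significant' original findings: {discovered.sum()}")
print(f"of those, also significant on replication: {replicated.sum()}")
```

Even in this cartoon, a large share of the original 'discoveries' are false positives, and many of the true ones don't clear the significance bar a second time, without anyone having done anything dishonest.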

As Harris extensively documents, this phenomenon is seen in claims of treatments or cures, or use of animal models (e.g., lab mice), or antibodies, or cell lines, or statistical 'significance' values.  It isn't a long book, so you can quickly see the examples for yourself.  Harris also accounts for the problems, quite properly I think, by documenting sloppy science but also the careerist pressures on investigators to find things they can publish in 'major' journals, so they can get jobs, promotions, high 'impact factor' pubs, and grants. In our obviously over-crowded market, it can be no surprise to anyone that there is shading of the truth, a tad of downright dishonesty, conveniently imprecise work, and so on.

Since scientists feed at the public trough (or depend on profits from sales of biomedical products to grant-funded investigators), they naturally have to compete and don't want to be shown up, and they have to work fast to keep the funds flowing in.  Rigor Mortis properly homes in on an important fact: if our jobs depend on 'productivity' and bringing in grants, we will do what it takes, shading the truth or whatever else (even the occasional outright cheating) to stay in the game.

Why share data with your potential competitors who might, after all, find fault with your work or use it to get the jump on you for the next stage?  For that matter, why describe what you did in enough actual detail that someone (a rival or enemy!) might attempt to replicate your work.....or fail to do so? Why wait to publish until you've got a really adequate explanation of what you suggest is going on, with all the i's dotted and t's crossed?  Haste makes credit!  Harris very clearly shows these issues in the all-too human arena of our science research establishment today.  He calls what we have now, appropriately enough, a "broken culture" of science.

Part of that I think is a 'Malthusian' problem.  We are credited, in score-counting ways, by chairs and deans, for how many graduate students we turn (or churn) out.  Is our lab 'productive' in that way?  Of course, we need that army of what often are treated as drones because real faculty members are too busy writing grants or traveling to present their (students') latest research to waste--er, spend--much time in their labs themselves.  The result is the cruel excess of PhDs who can't find good jobs, wandering from post-doc to post-doc (another form of labor pool), or to instructorships rather than tenure-track jobs, or who simply drop out of the system after their PhD and post-docs.  We know of many who are in that boat; don't you?  A recent report showed that the mean age of first grant from NIH was about 45: enough said.

A reproducibility mirage
If there were one central technical problem that Harris stresses, it is the number of results that fail to be reproducible in other studies.  Irreproducible results leave us in limbo-land: how are we to interpret them?   What are we supposed to believe?  Which study--if any of them--is correct?  Why are so many studies proudly claiming dramatic findings that can't be reproduced, and/or why are the news media and university PR offices so loudly proclaiming these reported results?  What's wrong with our practices and standards?

Rigor Mortis goes through many of these issues, forthrightly and convincingly--showing that there is a problem.  But a solution is not so easy to come by, because it would require major shifts in, and reform of, research funding.  Naturally, that would be greatly resisted by hungry universities and those whom they employ to set up a shopping mall on their campus (i.e., faculty).

One purpose of this post is to draw attention to the wealth of reasons Harris presents for why we should be concerned about the state of play in biomedical research (and, indeed, in science more generally).  I do have some caveats, which I'll discuss below, but they are in no way intended to diminish the points Harris makes in his book.  What I want to add is a reason why I think that, if anything, Harris' presentation, strong and clear as it is, understates the problem.  I say this because to me there is a deeper issue, beyond the many Harris enumerates: a deeper scientific problem.

Reproducibility is only the tip of the iceberg!
Harris stresses or even focuses on the problem of irreproducible results.  He suggests that if we were to hold far higher evidentiary standards, our work would be reproducible, and the next study down the line wouldn't routinely disagree with its predecessors.  From the point of view of careful science and proper inferential methods and the like, this is clearly true.  Many kinds of studies in biomedical and psychological sciences should have a standard of reporting that leads to at least some level of reproducibility.

However, I think that the situation is far more problematic than sloppy and hasty standards, or questionable statistics, even if these are clearly prominent issues.  My view is that no matter how high our methodological standards are, the expectation of reproducibility flies in the face of what we know about life.  That is because life is not a reproducible phenomenon in the way physics and chemistry are!

Life is the product of evolution.  Nobody with open eyes can fail to understand that, and this applies to biological, biomedical, psychological and social scientists.  Evolution is at its very core a phenomenon that rests essentially on variation--on not being reproducible.  Each organism, indeed each cell, is different. Not even 'identical' twins are identical.

One reason for this is that genetic mutations are always occurring, even among the cells within our bodies. Another reason is that no two organisms are experiencing the same environment, and environmental factors affect and interact with the genomes of each individual organism of any species.  Organisms affect their environments in turn. These are dynamic phenomena and are not replicable!

This means that, in general, we should not be expecting reproducibility of results.  But one shouldn't overstate this: the fact that two humans are different obviously doesn't mean they are entirely different.  Similarity is correlated with kinship, from first-degree relatives to members of populations, species, and different species.  The problem is not that there is similarity; it is that we have no formal theory about how much similarity to expect.  We know two samples of people will differ, both among those in each sample and between samples.  And even the same people sampled at separate times will be different, due to aging, exposure to different environments, and so on.  Proper statistical criteria can address whether differences appear to be due only to sampling variation or to causal differences.  But that is a traditional assumption from the origins of statistics and probability, and it isn't entirely apt for biology: since we cannot assume identity of individuals, much less of samples or populations (or species, as in using mouse models for human disease), our work requires some understanding of how much difference, or what sort of difference, we should expect--and build into our models and tests.

Evolution is by its very nature an ad hoc phenomenon in both time and place, meaning that there are no fixed rules about this, as there are laws of gravity or of chemical reactions. That means that reproducibility is not, in itself, even a valid criterion for judging scientific results.  Some reproducibility should be expected, but we have no rule for how much and, indeed, evolution tells us that there is no real rule for that.

One obvious and not speculative exemplar of the problem is the redundancy in our systems.  Genomewide mapping has documented this exquisitely well: if variation at tens, hundreds, or sometimes even thousands of genome sites affects a trait, like blood pressure, stature, or 'intelligence', and no two people have the same genotype, then no two people, even with the same trait measure, have that measure for the same reason.  And as is very well known, mapping only accounts for a fraction of the estimated heritability of the studied traits, meaning that much or usually most of the contributing genetic variation is unidentified.  And then there's the environment...

It's a major problem. It's an inconvenient truth.  The sausage-grinder system of science 'productivity' cannot deal with it.  We need reform.  Where can that come from?

When scientific theory constrains

It's good from time to time to reflect on how we know what we think we know.  And to remember that, as at any time in history, much of what we now think is true will sooner or later be found to be false or, often, only inaccurately or partially correct.  Some of this is because values change -- not so long ago, for example, homosexuality was considered to be an illness.  Some is because of new discoveries -- when archaea were first discovered they were thought to be exotic microbes that inhabited extreme environments, but now they're known to live in all environments, even in and on us.  And of course these are just two of countless examples.

But what we think we know can be influenced by our assumptions about what we think is true, too. It's all too easy to look at data and interpret it in a way that makes sense to us, even if there are multiple possible interpretations.  This can be a particular problem in social science, when we've got a favorite theory and the data can be seen to confirm it; this is perhaps easiest to notice if you yourself aren't wedded to any of the theories.  But it's also true in biology. It is understandable that we want to assert that we now know something, and are rewarded for insight and discoveries, rather than more humbly hesitating to make claims.

Charitable giving
The other day I was listening to the BBC Radio 4 program Analysis on the charitable impulse.  Why do people give to charity?  It turns out that a lot of psychological research has been done on this, to the point that charities are now able to manipulate us into giving.  If you call your favorite NPR station to donate during a fund drive, e.g., if you're told that the caller just before you gave a lot of money, you're more likely to make a larger donation than if you're told the previous caller pledged a small amount.

A 1931 advertisement for the British charity, Barnardo's Homes; Wikipedia

Or, if an advertisement pictures one child, and tells us the story of that one child, we're more likely to donate than if we're told about 30,000 needy children.  This works even if we're told the story of two children, one after the other.  But, according to one of the researchers, if we're shown two children at once, and told that if we give, the money will randomly go to just one of the children, we're less likely to give.  This researcher interpreted this to mean that two is too many.

But there seem to me to be other possible interpretations, given that the experiment changes more than one variable.  Perhaps it's that we don't like the idea that someone else will choose who gets our money.  Or that we feel uncomfortable knowing that we've helped only one child when two are needy.  But surely something other than 'two is too many' is going on, given that in 2004 so many people around the world donated so much money to organizations helping tsunami victims that many had to start turning down donations.  These were anonymous victims, in great numbers.  Though, as the program noted, people weren't nearly as generous to the great number of victims of the earthquake in Nepal in 2015, with no obvious explanation.

The researcher did seem to be wedded to his one-versus-too-many interpretation, despite the contradictory data.  In fact, I would suggest that the methods, as presented, don't allow him to legitimately draw any conclusion.  Yet he readily did.

Thinness microbes?
The Food Programme on BBC Radio 4 is on to the microbiome in a big way.  Two recent episodes (here and here) explore the connection between gut microbes, food, and health and the program promises to update us as new understanding develops.  As we all know by now, the microbiome, the bug intimates that accompany us through life, in and on our body, may affect our health, our weight, our behavior, and perhaps much more.  Or not.


Pseudomonas aeruginosa, Enterococcus faecalis and Staphylococcus aureus on Tryptic Soy Agar.  Wikipedia

Obesity, asthma, atopy, periodontal health, rheumatoid arthritis, Parkinson's, Alzheimer's, autism, and many many more conditions have been linked with, or are suggested to be linked with, in one way or another, our microbiome.  Perhaps we're hosting the wrong microbes, or not a diverse enough set of microbes, or we wipe the good ones out with antibiotics along with the bad, or with alcohol, and what we eat may have a lot to do with this.

One of the researchers interviewed for the program was experimenting with a set of identical twins in Scotland.  He varied their diets, having them consume, for example, lots of junk food and alcohol, or a very fibrous diet, and documented changes in their gut microbiomes, which apparently can change pretty quickly with changes in diet.  The most diverse microbiome was associated with the high-fiber diet.  Researchers seem to feel that diversity is good.

Along with a lot of enthusiasm and hype, though, mostly what we've got in microbiome research so far is correlations.  Thin people tend to have a different set of microbes than obese people, and people with a given neurological disease might statistically share a specific subset of microbes.  But this tells us nothing about cause and effect -- which came first, the microbiome or the condition?  And because the microbiome can change quickly and often, how long and how consistently would an organism have to reside in our gut before it causes a disease?

There was some discussion of probiotics in the second program, the assumption being that controlling our microbiome affects our health.  Perhaps we'll soon have probiotic yogurt or kefir or even a pill that keeps us thin, or prevents Alzheimer's disease.  Indeed, this was the logical conclusion from all the preceding discussion.

But one of the researchers, inadvertently I think, suggested that perhaps this reductionist conclusion was unwarranted.  He cautioned that thinking about probiotic pills rather than lifestyle might be counterproductive.  But except for factors with large effects such as smoking, the effect of "lifestyle" on health is rarely obvious.  We know that poverty, for example, is associated with ill health, but it's not so easy to tease out how and why.  And, if the microbiome really does directly influence our health, as so many are promising, the only interesting relevant thing about lifestyle would be how it changes our microbiomic makeup.  Otherwise, we're talking about complexity, multiple factors with small effects -- genes, environmental factors, diet, and so on, and all bets about probiotics and "the thinness microbiome" are off.  But, the caution was, to my mind, an important warning about the problem of assuming we know what we think we know; in this case, that the microbiome is the ultimate cause of disease.

The problem of theory
These are just two examples of the problem of assumption-driven science. They are fairly trivial, but if you are primed to notice, you'll see it all around you. Social science research is essentially the interpretation of observational data from within a theoretical framework. Psychologists might interpret observations from the perspective of behavioral, or cognitive, or biological psychology, e.g., and anthropologists, at least historically, from, say, a functionalist or materialist or biological or post-modernist perspective. Even physicists interpret data based on whether they are string theorists or particle physicists.

And biologists' theoretical framework? I would suggest that two big assumptions biologists make are reductionism and, let's call it, biological uniformitarianism. We believe we can reduce causation to a single factor, and we assume that we can extrapolate our findings from the mouse or zebrafish we're working on to other mice, fish and species, or from one or some people to all people. That is, we assume invariance rather than the variation we should expect. There is plenty of evidence by now to show that we should know better.

True, most biologists would probably say that evolutionary theory is their theoretical framework, and many would add that traits are here because they're adaptive, because of natural selection. Evolution does connect people to each other and people to other species, but it has done so by working on differences, not replicated identity, and there is no rule for the nature or number of those differences or for extrapolating from one species or individual to another. We know nothing that contradicts evolutionary theory, but that every trait is adaptive is an assumption, and a pervasive one.

Theory and assumption can guide us, but they can also improperly constrain how we think about our data, which is why it's good to remind ourselves from time to time to think about how we know what we think we know. As scientists we should always be challenging and testing our assumptions and theories, not depending on them to tell us that we're right.

The Blind Men and the Elephant -- a post-modern parable

It's an ancient parable: a group of blind men are led to an elephant and asked to describe what they feel.  One feels a tusk, another a foot, a third the tail, and so on, and of course they disagree entirely about what it is they are feeling. This tale is usually used as an illustration of the subjectivity of our view of reality, but I think it's more than that.

I heard a talk by the anthropologist Agustin Fuentes here at Penn State the other day, on blurring the boundaries between science and the humanities.  He used the parable to illustrate why science needs the humanities and vice versa; each restricted view of the world is enhanced by the other to become complete.

But, this assumes that the tales that science tells, and the tales that the humanities tell are separate but equally true -- scientists feel the tail, humanities feel the tusk and accurately report what they feel.  Once they listen to each other's tales, they can describe the whole elephant.

"Blind monks examining an elephant" by Hanabusa Itchō (Wikipedia)

But I don't think so.  I don't think that all that scientists are missing is a humanities perspective, and vice versa.  I think in a very real sense we're all blind all of the time, and there's no way to know what we're missing and when it matters.  You feel the tusk, and you might be able to describe it, but you have no clue what it's made of.  Or, you feel the tail but you have no idea what the elephant uses it for, if anything.

Here's my own personal version of the same parable -- some years ago we purchased a new landline with answering machine.  Oddly, we have a lot of power outages here, and it seemed that every time I set the time and day on the answering machine, we'd have another outage and the time and day would disappear, having to be set once again.  I decided that was a nuisance, and I stopped setting time and day.

The next time the machine said we had a message, I listened to it, but it was blank. There was no message!  Naturally enough (I thought), I concluded that the time and day had to be set for the machine to record a message.  Unhappy consumers, we contacted the maker, and they said no, the machine should record the message anyway.  Which of course it would have if the caller had left a message, as was proven the next time someone called on unknown day at unknown time and ... left a message.

My conclusion was reasonable enough for the data I had, right? It just happened not to be based on adequate data (aka reality).  But, we always think we've got enough data to draw a conclusion, no matter how much we're in fact missing.  This is true in epidemiology, genetics, medical testing, the humanities, interpersonal relationships; we think we know enough about our partner to commit to marrying him or her, but half of us turn out to be wrong.  Indeed, if all you've seen are white swans, you'll conclude that all swans are white -- until you see your first black one.

No, you say, we did power tests and we know we've got enough subjects to conclude that gene X causes disease Y.  But, it's possible that all your subjects are from western Europe, or even better, England, say, and what you've done is identify a gene everyone shares because they share a demographic history.  You won't know that until you look at people with the same disease from a different part of the world -- until you collect more data.  Until you see your first black swan.

But, you say, no one would make such an elementary mistake now -- you've drawn your controls from the same population, and they will share the same population-specific allele, so differences between cases and controls will be disease-specific.  But western Europe is a big area, and even England is heterogeneous, and it's possible that everyone with your disease is more closely related than people without it.  So you really might have identified population structure rather than a disease allele, but you can't know until you collect more data -- you look at additional populations, or more people in the same population.
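As a toy illustration of how population structure alone can masquerade as a 'disease allele', here is a minimal sketch; all the allele frequencies and sampling proportions below are invented for the example, and the allele has no causal effect at all.

```python
# Toy confounding-by-ancestry sketch: the allele does nothing, but cases are drawn
# mostly from the subpopulation where it happens to be common. All numbers assumed.
import numpy as np

rng = np.random.default_rng(7)
freq = {"north": 0.60, "south": 0.20}     # allele frequency differs by region (assumed)

def sample_allele_counts(region, n):
    # diploid genotypes: the allele count per person is Binomial(2, freq)
    return rng.binomial(2, freq[region], n)

# Disease has nothing to do with the allele, but (for historical/social reasons)
# 80% of cases and only 20% of controls come from the north (assumed).
cases = np.concatenate([sample_allele_counts("north", 800), sample_allele_counts("south", 200)])
controls = np.concatenate([sample_allele_counts("north", 200), sample_allele_counts("south", 800)])

case_freq = cases.mean() / 2
control_freq = controls.mean() / 2
odds = lambda p: p / (1 - p)
print(f"allele frequency in cases:    {case_freq:.2f}")
print(f"allele frequency in controls: {control_freq:.2f}")
print(f"apparent allelic odds ratio:  {odds(case_freq) / odds(control_freq):.2f}")
# A clearly elevated odds ratio appears even though the allele is causally inert;
# only data from differently structured populations would expose the artifact.
```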

Even then, say you look at additional populations and you don't find the same supposedly causal allele.  You can't know why -- is it causal in one population and not another?  Is it not causal in any population, and your initial finding merely an artifact of ill-conceived study design?

Without belaboring this particular example any further, I hope the point is clear.  You feel the tail, but that doesn't tell you everything about the tail.  But you can't know what you're missing until you ask more questions, and gather more data.

Darwin explained inheritance with his idea of gemmules.  He was wrong, of course, but he had no way to know how or why, and it wasn't until Mendel's work was rediscovered in 1900 that people could move on.  Everything we know about genetics we've learned since then, but that doesn't mean we know everything about genetics.  But theories of inheritance (and much else) don't include acknowledgement of glaring holes: "My theory is obviously inadequate because, as always, there is a lot we don't yet understand but we don't know what that is so I'm leaving gaps, but I don't know how big or how many."  And, in a related issue that we write about frequently here, it's also true that instead of coming clean, we often claim more than we know (and often we know what we're doing in doing so).

Even very sophisticated theories just 15 or 20 years ago had no way to include, say, epigenetics, or the importance of transcribed but untranslated RNAs (that is, RNA that doesn't code for protein but does a variety of other things, some of them still unknown), or interfering RNAs, and so on, and we have no idea today what we'll learn tomorrow.  But, like the blind men, we act as though we can draw adequate conclusions from the data we've got.

Science is about pushing into the unknown.  But, because it's unknown, we have no idea how far we need to push.  I think in most cases, there's always further, we're never done, but we often labor under the illusion that we are.  Or, that we're close.

But, should ductal cancer in situ, a form of breast cancer, be treated?  And how will we know for sure?  Systems biology sounds like a great idea, but how will we ever know we've taken enough of a given system into account to explain what we're trying to explain?  Will physicists ever know whether the multiverse, or the symmetry theory is correct (whatever those elusive ideas actually mean!)?

Phlogiston was once real, as were miasma and phrenology, the four humors, and the health benefits of smoking.  It's not that we don't make progress -- we do now know that smoking is bad for our health (even if only 10% of smokers get lung cancer; ok, smoking is associated with a lot of other diseases as well, so better not to smoke) -- but we've always got the modern equivalent of phlogiston and phrenology.  We just don't know which they are.  We're still groping the elephant in the dark.

How do we know what we think we know?

Two stories collided yesterday to make me wonder, yet again, how we know what we think we know.  The first was from the latest BBC Radio 4 program The Inquiry, an episode called "Can we learn to live with nuclear power?" which discusses the repercussions of the 2011 disaster in the Fukushima nuclear power plant in Japan. It seems that some of us can live with nuclear power and some of us can't, even when we're looking at the same events and the same facts.  So, for example, Germans were convinced by the disaster that nuclear power isn't reliably safe and so they are abandoning it, but in France, nuclear power is still an acceptable option.  Indeed most of the electricity in France comes from nuclear power.

Why didn't the disaster convince everyone that nuclear power is unsafe?  Indeed, some saw the fact that there were no confirmed deaths attributable to the disaster as proof that nuclear power is safe, while others saw the whole event as confirmation that nuclear power is a disaster waiting to happen.  According to The Inquiry, a nation's history has a lot to do with how it reads the facts.  Germany's history is one of division and war, with nuclear power associated with bombs, but French researchers and engineers have long been involved in the development of nuclear power, so there's a certain amount of national pride in this form of energy.  It may not be an unrelated point that many people in France therefore have vested interests in nuclear power.  Still, same picture, different reading of it.

Cattenom nuclear power plant, France; Wikipedia


Reading ability is entirely genetic
And, I was alerted to yet another paper reporting that intelligence is genetic (h/t Mel Bartley); this time it's reading ability, for which no environmental effect was found (or acknowledged).  (This idea of little to no environmental effect is an interesting one, though, given that the authors, who are Dutch, report that heritability of dyslexia and reading fluency is higher among Dutch readers -- 80% compared with 45-70% elsewhere -- because, they suggest, Dutch orthography is simpler than that of English.  This sounds like an environmental effect to me.)

The authors assessed reading scores for twins, parents and siblings, and used these to evaluate additive and non-additive genetic effects, and family environmental factors.  As far as I can tell, subjects were asked to read aloud from a list of Dutch words, and the number they read correctly within a minute constituted their score.  And again, as far as I can tell, they did not test for nor select for children or parents with dyslexia, but they seem to be reporting results as though they apply to dyslexia.

The authors report a high correlation in reading ability between monozygotic twins, a lower correlation between dizygotic twins, and between twins and siblings, and a higher correlation between spouses, which to the authors is evidence of assortative mating (choice of mate based on traits associated with reading ability).  They conclude:
Such a pattern of correlation among family members is consistent with a model that attributes resemblance to additive genetic factors, these are the factors that contribute to resemblance among all biological relatives, and to non-additive genetic factors. Non-additive genetic factors, or genetic dominance, contributes to resemblance among siblings, but not to the resemblance of parents and offspring.  Maximum likelihood estimates for the additive genetic factors were 28% (CI: 0–43%) and for dominant genetic factors 36% (CI: 18–65%), resulting in a broad-sense heritability estimate of 64%. The remainder of the variance is attributed to unique environmental factors and measurement error (35%, CI: 29–44%).
Despite this evidence for environmental effect (right?), the authors conclude, "Our results suggest that the precursors for reading disability observed in familial risk studies are caused by genetic, not environmental, liability from parents. That is, having family risk does not reflect experiencing a less favorable literacy environment, but receiving less favorable genetic variants."
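For readers who want the bookkeeping behind the percentages quoted above, here is the standard quantitative-genetic decomposition in textbook notation (a sketch, not the authors' own equations):

$$
V_P = V_A + V_D + V_E, \qquad h^2 = \frac{V_A}{V_P}, \qquad H^2 = \frac{V_A + V_D}{V_P}
$$

With the reported estimates, the narrow-sense (additive) heritability is $h^2 \approx 0.28$, the dominance share is $V_D/V_P \approx 0.36$, so the broad-sense heritability is $H^2 \approx 0.28 + 0.36 = 0.64$, leaving roughly $0.35$ of the variance to unique environment and measurement error (the figures don't sum exactly to 1 because of rounding and the wide confidence intervals).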

The ideas about additivity are technical and subtle.  Dominance effects, that is, non-additive interactions among the alleles within a gene in an individual's two (diploid) copies, are not inherited the way additive effects are: if you are Dd and that combination determines your trait, only one of those alleles--and hence not enough to determine the trait--is transmitted to any given offspring.  Likewise, interactions between loci, called epistasis, are also not directly transmitted.

There are practical as well as political incentives to assume that interactions can be ignored.  In a practical sense, even testing multiple 2-way interactions makes impossible demands on sample size and study structure.  But in a political sense, additive effects mean that traits can be reliably predicted from genotype data (meaning, even at birth): you estimate the effect of each allele at each place in the genome, and add them up to get the predicted phenotype.  There is money to be made by that, so to speak.  But it doesn't really work with complex interactions.  Strong incentives, indeed, to report additive effects, and very understandable ones!
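To make the 'estimate and add' idea concrete, here is a minimal sketch of purely additive prediction from genotypes; the number of sites, the effect sizes, the intercept, and the genotypes are all made up for illustration, not taken from any particular study.

```python
# Purely additive model: predicted trait = baseline + sum over sites of
# (count of effect allele x estimated per-allele effect). All numbers hypothetical.
import numpy as np

estimated_effects = np.array([0.12, -0.05, 0.30, 0.02])   # per-allele effects at 4 sites (assumed)
intercept = 10.0                                           # baseline trait value (assumed)

# genotypes coded as counts (0, 1, or 2) of the effect allele at each site
person_1 = np.array([0, 2, 1, 1])
person_2 = np.array([2, 0, 0, 1])

for name, g in [("person_1", person_1), ("person_2", person_2)]:
    prediction = intercept + g @ estimated_effects
    print(name, round(prediction, 3))

# Under strict additivity this works from genotype alone, even at birth; with
# dominance or gene-gene/gene-environment interaction, simply summing
# per-allele effects no longer gives the right answer.
```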

Secondly, all these various effects are estimated from samples, not derived from basic theory about molecular-level physiology, and often they are hardly informed by the latter at all.  This means that replication is not to be expected in any rigorous sense.  For example, dominance is estimated by the deviation of the average trait values in AA, Aa, and aa individuals from the 0, 1, 2 pattern expected if (say) each 'a' allele contributed one unit of trait measure.  Dominance deviations are thoroughly sample-dependent.  It is not easy to interpret those results when samples cannot be replicated (the concepts are very useful in agricultural and experimental breeding contexts, but far less so in natural human populations).  And this conveniently overlooks the environmental effects.
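To make 'deviation from the 0, 1, 2 pattern' concrete, in standard textbook notation (not the paper's own formulas): with genotype trait means $\mu_{AA}$, $\mu_{Aa}$, and $\mu_{aa}$,

$$
a = \frac{\mu_{aa} - \mu_{AA}}{2}, \qquad d = \mu_{Aa} - \frac{\mu_{AA} + \mu_{aa}}{2}
$$

so $d = 0$ means the heterozygote sits exactly midway between the two homozygotes (purely additive), and any nonzero $d$ is scored as dominance.  Because the $\mu$'s are sample averages, $d$ inherits all the quirks of the particular sample, which is the sample-dependence referred to above.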

This study is of a small sample, especially since for many traits it now seems de rigueur to have samples of hundreds of thousands to get reliable mapping results, not to mention a confusingly defined trait, so it's difficult, at least for me, to make sense of the results.  In theory, it wouldn't be terribly surprising to find a genetic component to risk of reading disability, but it would be surprising, particularly since disability is defined only by test score in this study, if none of that ability was  substantially affected by environment.  In the extreme, if a child hasn't been to school or otherwise learned to read, that inability would be largely determined by environmental factors, right?  Even if an entire family couldn't read, it's not possible to know whether it's because no one ever had the chance to learn, or they share some genetic risk allele.

In people, unlike in other animals, assortative mating has a huge cultural component, so, again, it wouldn't be surprising if two illiterate adults married, or if they then had no books in the house, and didn't teach their children that reading was valuable.  But this doesn't mean either reading or their mate-choice necessarily has any genetic component.  

So, again, same data, different interpretations  
But why?  Indeed, what makes some Americans hear Donald Trump and resonate with his message, while others cringe?  Why do we need 9 Supreme Court justices if the idea is that evidence for determination of the constitutionality of a law is to be found in the Constitution?  Why doesn't just one justice suffice?  And, why do they look at the same evidence and reliably and predictably vote along political lines?

Or, more uncomfortably for scientists, why did some people consider it good news when it was announced that only 34% of replicated psychology experiments agreed with the original results, while others considered this unfortunate?  Again, same facts, different conclusions.

Why do our beliefs determine our opinions, even in science, which is supposed to be based on the scientific method, and sober, unbiased assessment of the data?  Statistics, like anything, can be manipulated, but done properly they at least don't lie.  But, is IQ real or isn't it?  Are behavioral traits genetically determined or aren't they?  Have genome wide association studies been successful or not?

As Ken often writes, much of how we view these things is certainly determined by vested interest and careerism, not to mention the emotional positions we inevitably take on human affairs.  If your lab spends its time and money on GWAS, you're more likely to see them as successful.  That's undeniable if you are candid.  But I think it's more than that.  I think we're too often prisoners of induction, based on our experience, training, and predilections about which observations we make or count as significant; our conclusions are often underdetermined, but we don't know it.  Underdetermined conclusions are those for which we don't have enough evidence to decide among the alternatives.  It's the all-swans-are-white problem; they're all white until we see a black one.  At which point we either conclude we were wrong, or give the black swan a different species name.  But we never know if or when we're going to see a black one.  Or a purple one.

John Snow determined to his own satisfaction during the cholera epidemic in London in 1854 that cholera was transmitted by a contagion in the water.  But in fact he didn't prove it.  The miasmatists, who believed cholera was caused by bad air, had stacks of evidence of their own -- e.g., infection was more common in smoggy, smelly cities, and in fact in the dirtier sections of cities.  But both Snow and the miasmatists had only circumstantial evidence, correlations, not enough data to definitively prove they were right.  Both arguments were underdetermined.  As it happened, John Snow was right, but that wasn't to be widely known for another few decades, when Vibrio cholerae was identified under Robert Koch's microscope.

"The scent lies strong here; do you see anything?"; Wikipedia

Both sides strongly (emotionally!) believed they were right, believed they had the evidence to support their argument. They weren't cherry-picking the data to better support their side, they were looking at the same data and drawing different conclusions.  They based their conclusions on the data they had, but they had no idea it wasn't enough.  

But it's not just that, either.  It's also that we're predisposed by our beliefs to form our opinions.  And that's when we're likely to cherry pick the evidence that supports our beliefs.  Who's right about immigrants to the US, Donald Trump or Bernie Sanders?  Who's right about whether corporations are people or not?  Who's right about genetically modified organisms?  Or climate change?  Who's right about behavior and genetic determinism?  

And it's even more than that! If genetics and evolutionary biology have taught us anything, they've taught us about complexity.  Even simple traits turn out to be complex.  There are multiple pathways to most traits, most traits are due to interacting polygenes and environmental factors, and so on.  Simple explanations are less likely to be correct than explanations that acknowledge complexity, and that's because evolution doesn't follow rules, except that what works works, and to an important degree what worked is what is here for us to examine today.

Simplistic explanations are probably wrong.   But they are so appealing. 

It's not just about psych studies; it's about core aspects of inference.

A paper just published in Science by the "Open Science Collaboration" reports the results of a multi-year multi-institution effort to replicate 100 psychology studies published in three top psychology journals in 2008.  This effort has often been discussed since it began in 2011, in large part because the importance of replicability in confirming scientific results is integral to the 'scientific method,' but replicability studies aren't a terribly creative use of a researcher's time, and they're difficult to publish so they aren't often on researchers' To-Do lists.  So, this was unusual.

Les Twins; Wikipedia
There are many reasons a study can't be replicated.  Sometimes the study was poorly conceived or carried out (assumptions and biases not taken into account), sometimes the results pertain only to the particular sample reported (a single family or population), sometimes the methods in an original study aren't described well enough to be replicated, sometimes random or even systematic error (instrument behaving badly) skews the results.

Because there's no such thing as a perfect study, replication studies can be victims of any of the same issues, so interpreting lack of replication isn't necessarily straightforward, and certainly doesn't always mean that the original study was flawed.

The Open Science Collaboration was scrupulous in its efforts to replicate original studies as carefully and faithfully as possible.  Still, the results weren't pretty.  The authors write:
Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.
Interestingly enough, the authors aren't quite sure what any of this means.  First, as they point out, direct replication doesn't verify the theoretical interpretation of the result, which could have been flawed originally and remain flawed.  And when a study is not replicated, it's impossible to know why: whether the original was flawed, or the replication effort was flawed, or both.

This effort has been the subject of much discussion, naturally enough.  In a piece published last week in The Atlantic, Ed Yong quotes several psychologists, including the project's lead author, saying that this project has been a welcome learning experience for the field.  There are plans afoot to change how things are done, including pre-registration of hypotheses so that the reported results can't be cherry-picked, or increasing the size of studies to increase their power, as has been done in the field of genetics.

We'll see whether this is just a predictable wagon-circling welcome, or really means something.  One has every reason to be skeptical, and to wonder if these fields really are sciences in the proper sense of the term.  Indeed, it's quite interesting to see genetics held up as an exemplar of good and reliable study design.  After billions of dollars spent on studies large and small of the genetics of asthma, heart disease, type 2 diabetes, obesity, hypertension, stroke, and so on, we've got not only a lot of contradictory findings, but most of what has been found are genes with small effects.  And epidemiology, many of the 'omics fields, evolutionary biology, and others haven't done any better.

Why?  The vagueness of the social and behavioral sciences is only part of the problem (unlike, say, force, outcome variables such as stress, aggression, crime, or intelligence are hard to consistently define, and can vary according to the instrument with which they are measured).  Biomedical outcomes can be vague and hard to define as well (autism, schizophrenia, high blood pressure).  We don't understand enough about how genes interact with each other or with the environment to understand complex causality.

Statistics and science
The problem may be much deeper than any of this discussion of non-replicable results suggests.  From an evolutionary point of view, we expect organisms to be different, not replicates.  This is because, first, mutational changes (and recombination) are always making each individual organism's genotype unique and, second, the need to adapt--Darwin's central claim or observation--means that organisms have to be different so that their 'struggle for life' can occur.

We have only a general theory for this, since life is an ad hoc adaptive/evolutionary phenomenon.  Far more broadly than just the behavioral or social sciences, our investigative methods are based on 'internal' comparisons (e.g., cases vs controls, various levels of blood pressure and stroke, fitness relative to different trait values) that evaluate samples against each other, rather than against an externally derived, a priori theory.  When we rely on statistics and p-value significance tests and probabilities and so on, we are implicitly confessing that we don't in fact really know what's going on, and all we can get is a kind of shadow of the underlying process, cast by the differences we detect--and we detect them with generic (not to mention subjective) rather than specific criteria.  We've written about these things several times in the past here.

The issue is not just weakly defined terms and study designs. As Freeman Dyson (in "A meeting with Enrico Fermi") wrote in 2004:
In desperation I asked Fermi whether he was not impressed by the agreement between our calculated numbers and his measured numbers. He replied, "How many arbitrary parameters did you use for your calculations?" I thought for a moment about our cut-off procedures and said, "Four." He said, "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." With that, the conversation was over. . . .
Here, parameters refers to what are more properly called 'free parameters', that is, ones not fixed in advance, but that are estimated from data. By contrast, for example, in physics the speed of light and gravitational constant are known, fixed values a priori, not estimated from data (though data were used to establish those values). We just lack such understanding in many areas of science, not just behavioral sciences.
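In the spirit of von Neumann's quip, here is a generic curve-fitting sketch (not anything Fermi or Dyson actually computed): with as many free parameters as data points, a model can 'explain' pure noise perfectly, which tells us nothing about the underlying process.

```python
# With enough free parameters, any handful of points can be fit exactly.
# The "data" below are pure noise (assumed), with no underlying signal at all.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(5.0)
y = rng.normal(0, 1, 5)                   # five random "measurements"

for n_params in (2, 5):                   # 2 = straight line, 5 = quartic (one per point)
    coeffs = np.polyfit(x, y, deg=n_params - 1)
    residuals = y - np.polyval(coeffs, x)
    print(f"{n_params} free parameters -> max residual {np.abs(residuals).max():.2e}")

# The 5-parameter fit reproduces the noise essentially exactly (residuals ~ 0),
# yet it has learned nothing that would predict a sixth measurement.
```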

In a sense we are using Ptolemaic tinkering to fit a theory that doesn't really fit, in the absence of a better (e.g., Copernican or Newtonian) theoretical understanding.  Social and behavioral sciences are far behind the at least considerably more rigorous genetic and evolutionary sciences when the latter are done at their best (which isn't always).  Like the shadows of reality seen in Plato's cave, statistical inference reflects the shadows of the reality we want to understand, but for many cultural and practical reasons we don't recognize that, or don't want to acknowledge it.  The weaknesses and frangibility of our predictive 'powers' could, if properly understood by the general public, be a threat to our business, and our culture doesn't reward candor when it comes to that business.  The pressures (including from the public media, with their own agenda and interests) necessarily lead to reducing complexity to simpler models and to claims far beyond what has legitimately been understood.

The problem is not just with weakly measured variables or poorly defined terms of, for example, outcomes.  Nor is the problem if, when, or that people use methods wrongly.  The problem is that statistical inference is based on a sample and is often retrospective, or mainly empirical and based on only rather generic theory.  No matter how well chosen and rigorously defined, in these various areas (unlike much of physics and chemistry) the estimates of parameters and the like are fitted to data that are necessarily about the subjects' past, such as their culture, upbringing, or lifestyles; in the absence of adequate formal theory, such findings cannot be used to predict the future with knowable accuracy.  That is because the same conditions can't be repeated, say, decades from now, and we don't know what future conditions will be, and so on.

Rather alarmingly, we were recently discussing this with a colleague who works in very physics- and chemistry-rigorous materials science.  She immediately told us that they, too, face problems in data evaluation with the number of variables they have to deal with, even under what the rest of us would enviously call very well-controlled conditions, where the assumptions of statistics--basically amounting to replicability of some underlying mathematical process--should really apply well.

So the social and related sciences may be far weaker than other fields, and should acknowledge that.  But the rest of us, in various purportedly 'harder' biological, biomedical, and epidemiological sciences, are often not so much better off.  Statistical methods and theory work wonderfully well when their assumptions are closely met.  But there is too much out-of-the-box analytic toolware that lures us into thinking that quick and definitive answers are possible.  Those methods never promised that, because what statistics does is account for repeated phenomena following the same rules, and the rule of many sciences is that, in their essence, they do not follow such rules.

But the lure of easy-answer statistics, and the understandable lack of deeply better ideas, perpetuates the expensive and misleading games that we are playing in many areas of science.

Rare Disease Day and the promises of personalized medicine

Our daughter Ellen wrote the post that I republish below 3 years ago, and we've reposted it in commemoration of Rare Disease Day, Febru...