Director’s Blog: P-Hacking
Later this week, more than 30,000 neuroscientists will descend on Washington, DC, for the annual Society for Neuroscience meeting. The meeting will likely reveal extraordinary progress in studies of brain structure and function, with thousands of posters and short presentations from a diverse crowd, ranging from excited graduate students to the venerable leaders of brain science. Amidst the excitement and sheer volume of activity, it’s easy to forget about a sobering set of recent reports that as much as 80 percent of the science from academic labs, even science published in the best journals, cannot be replicated.
This problem in replication, or what is now called the “reproducibility problem,” has received a lot of attention at NIH.1 Over the past year, we have held a series of meetings with some tense discussions about the nature of the problem and the best solutions. As one outcome, last week NIH published some principles and guidelines for reporting preclinical research. These guidelines aim to improve the rigor of experimental design, with the intention of improving the odds that results in one lab could be replicated in another lab.
It’s easy to misunderstand the reproducibility problem. Non-scientists often assume it reflects fraud or fabrication of results. While science is not immune to fraudulent behavior, the vast majority of the time the reproducibility problem can be explained by three other factors, none of which involves intentional misrepresentation or fraud: biological variability, flawed experimental design, and flawed analysis.
Biological variability is especially difficult to address. Biological systems are inherently complex and dynamic, often in ways we do not understand. The challenge of reproducibility in biology is still less than the problem in behavioral research, which may involve some of the most complex and dynamic variables for study. Some years ago, three labs doing behavioral experiments with the same inbred mouse strains discovered huge differences in results even when they intentionally tried to follow precisely the same experimental protocols.2 Of course, biological and behavioral variability can be viewed as an opportunity as well as a challenge—understanding why the results differ between labs may reveal environmental factors that are important for behavior.
Flawed experimental design should be an easier problem to solve. Many of the common standards for reducing bias in clinical research, such as randomization and keeping raters “blind” to the experimental condition, are not always adhered to in preclinical studies. As a result, bias can creep into the experiment, leading the investigator to find what she or he is looking for and ignore results that are not consistent. The new NIH principles and guidelines are intended to address these issues. NIMH published similar guidelines earlier this year.
Flawed analysis may be the problem that has received the least attention. In my own lab, when studying neuroanatomical differences between species or between individuals, we used to joke that if we needed statistics to find a difference, it was not an important difference. In fact, statistics are critical for defining group differences. But sometimes what we see in practice is the use of statistics to fish for a difference that is not present.
Which brings me, at last, to “P-hacking.” P-hacking is a term coined by Simmons and colleagues at the University of Pennsylvania; it refers to the practice of reanalyzing data in many different ways until a target result appears. They, and more recently Motulsky, have described the variations on P-hacking and its hazards, notably the likelihood of false positives: findings that statistics suggest are meaningful when they are not.3,4 For most studies, statistical significance is defined as a “P” value less than 0.05, meaning that if there were no real difference between two groups, a difference at least as large as the one observed would turn up by chance less than 1 time in 20. That seems like a pretty high bar to prove that a difference is real. But what if 20 comparisons are done and only the one that looks “significant” is presented? Or what if a trend is apparent in the data and samples are dropped or added to achieve this magic number of 0.05? And what if none of this is apparent in the publication and raw data are not available to allow for an unbiased analysis? Welcome to the world of P-hacking.
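To put a rough number on the “20 comparisons” scenario, here is a minimal sketch in Python. It assumes that none of the 20 comparisons reflects a real difference and that the comparisons are independent, so each P value is equally likely to fall anywhere between 0 and 1; real analyses are usually correlated, so the exact figure will differ. The function names are hypothetical and chosen only for illustration.

```python
import random

ALPHA = 0.05  # conventional significance threshold


def chance_of_any_false_positive(n_comparisons, alpha=ALPHA):
    """Probability that at least one of n independent null comparisons has P < alpha."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


def simulate_any_false_positive(n_comparisons, n_experiments=20_000,
                                alpha=ALPHA, seed=0):
    """Monte Carlo check of the formula above.

    Assumption: with no real difference, each P value is uniform on [0, 1),
    so rng.random() stands in for running one null comparison.
    """
    rng = random.Random(seed)
    experiments_with_a_hit = 0
    for _ in range(n_experiments):
        # One "experiment" = n_comparisons null tests; count it if any looks significant.
        if any(rng.random() < alpha for _ in range(n_comparisons)):
            experiments_with_a_hit += 1
    return experiments_with_a_hit / n_experiments


if __name__ == "__main__":
    for n in (1, 5, 20):
        exact = chance_of_any_false_positive(n)
        simulated = simulate_any_false_positive(n)
        print(f"{n:2d} comparisons: exact {exact:.3f}, simulated {simulated:.3f}")
```

Under these assumptions, a single comparison yields a false positive about 5 percent of the time, but 20 comparisons yield at least one “significant” result roughly 64 percent of the time (1 − 0.95^20 ≈ 0.64). Reporting only that one comparison hides the other 19.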
As Motulsky points out, the relentless quest for a significant “P” value is only one of the many problems with data analysis that could contribute to the reproducibility problem. Many mistakenly believe that “P” values convey information about the size of the difference between two groups. P values are actually only a way of estimating how likely it is that a difference at least as large as the one you observe could have occurred by chance if no real difference existed. In science, “significance” usually means a P value of less than 0.05 or 1 in 20, but this does not mean that the difference observed between two groups is functionally important. Perhaps the biggest problem is the tendency for scientists to report data that have been heavily processed rather than showing or explaining the details. This suggests one of the solutions for P-hacking and other problems in data analysis: provide the details, including what comparisons were planned prior to running the experiment.
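The gap between a small P value and an important difference can be illustrated with a short Python sketch. It uses a two-sample z-test computed from summary numbers, which assumes a known, shared standard deviation and equal group sizes; that is a simplification chosen to keep the example dependency-free, not a recommendation for real data. The two scenarios and all of their numbers are invented for illustration.

```python
import math


def two_sample_z_test(mean_diff, sd, n_per_group):
    """Return (z, two-sided P) for a difference in means.

    Assumption: both groups share a known standard deviation `sd`
    and have `n_per_group` observations each.
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def standardized_effect(mean_diff, sd):
    """Difference in means expressed in standard-deviation units (Cohen's d)."""
    return mean_diff / sd


if __name__ == "__main__":
    # Hypothetical scenarios: (label, difference in means, SD, samples per group).
    scenarios = [
        ("tiny difference, huge sample", 0.02, 1.0, 100_000),
        ("large difference, small sample", 0.80, 1.0, 5),
    ]
    for label, diff, sd, n in scenarios:
        _, p = two_sample_z_test(diff, sd, n)
        d = standardized_effect(diff, sd)
        verdict = "significant" if p < 0.05 else "not significant"
        print(f"{label}: effect size {d:.2f} SD, P = {p:.2g} ({verdict})")
```

In this sketch, the difference of 0.02 standard deviations comes out “significant” only because the sample is enormous, while the difference of 0.8 standard deviations does not reach 0.05 with five samples per group. The P value reflects the sample size as much as the size of the effect, which is why reporting the effect size and the planned comparisons alongside it matters.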
Does P-hacking explain the problem with reproducibility? My guess is that the misuse of statistics is only a small part of the problem, but it is one that has received too little attention in our attempts to improve rigor and replication. An important step towards fixing it is transparent and complete reporting of methods and data analysis, including any data collection or analysis that diverged from what was planned. One could also argue that this is a call to improve the teaching of experimental design and statistics for the next generation of researchers. I know, I know—it seems churlish to raise this issue as the year’s biggest celebration of neuroscience is about to begin. But maybe this is exactly the time to remember that great science requires great attention to the details of experimental design and data analysis.
1 Collins FS, Tabak LA. Policy: NIH plans to enhance reproducibility. Nature. 2014 Jan 30;505(7485):612-3.
2 Crabbe JC, Wahlsten D, Dudek BC. Genetics of mouse behavior: interactions with laboratory environment. Science. 1999 Jun 4;284(5420):1670-2.
3 Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011 Nov;22(11):1359-66.
4 Motulsky HJ. Common misconceptions about data analysis and statistics. J Pharmacol Exp Ther. 2014 Oct;351(1):200-5.