Post by Former NIMH Director Thomas Insel: From My Data to Mined Data
When I started my career in research, data were kept as a private reserve. We shared results in journal articles, but never the raw data. In essence we treated our data like most people treat their toothbrushes—everyone has one, but you’d never let someone use your own. Thankfully, we’re in the midst of a sea change in this attitude.
Last month NIH released a new policy on genomic data sharing. This policy sets new expectations for the broad and responsible sharing of DNA and RNA data from large-scale studies. This means that scientists doing a wide range of research, from epigenomic and genomic studies of patients to studies in model organisms, whether funded by an NIH grant or contract, will need to submit their results to a database that can be accessed by other investigators. The policy sets out expectations for the use of the data as well, enforced through data access requirements.
Why is data sharing important? For large-scale studies of genomic variation, the number of variables explored is so huge that statistical significance requires many thousands of subjects. Combining data from multiple projects allows scientists to find significant associations that cannot be detected by any individual lab. The Psychiatric Genomics Consortium combines genomic association results from more than 80 labs in 25 countries to survey DNA from over 170,000 people. The impact of this approach was apparent in August with the identification of 108 genetic loci associated with risk for schizophrenia.1
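The statistical argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by the Psychiatric Genomics Consortium: the number of tests (one million), the 20 labs, and each lab's effect size and standard error are hypothetical values chosen for the example, and the pooling uses a simple inverse-variance fixed-effect meta-analysis with a Bonferroni-corrected threshold.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def fixed_effect_meta(effects, std_errors):
    """Inverse-variance weighted pooling of per-lab effect estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, effects)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se, two_sided_p(pooled / pooled_se)


# Assumption: one million independent variants are tested, so a
# Bonferroni correction divides the usual 0.05 level by that count.
N_TESTS = 1_000_000
ALPHA = 0.05
threshold = ALPHA / N_TESTS

# Assumption: 20 hypothetical labs each estimate the same small effect
# with the same standard error (illustrative numbers, not real data).
N_LABS = 20
effects = [0.05] * N_LABS
std_errors = [0.03] * N_LABS

single_lab_p = [two_sided_p(b / se) for b, se in zip(effects, std_errors)]
pooled_effect, pooled_se, pooled_p = fixed_effect_meta(effects, std_errors)

print("corrected threshold:", threshold)
print("any single lab significant:", any(p < threshold for p in single_lab_p))
print("pooled result significant:", pooled_p < threshold)
```

With these made-up numbers, no single lab comes close to the corrected threshold, while the pooled estimate clears it. The point is only that pooling shrinks the standard error roughly with the square root of the number of comparable studies.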
Even beyond the power of “big data” projects, data sharing can be important for smaller-scale science. The recent concern over lack of reproducibility of both basic and clinical research supported by NIH suggests the need for greater transparency.2 Data sharing not only brings transparency; it provides the detail required to conduct replication studies. With sharing comes a need to standardize results, using common data elements so that results can be compared or integrated across studies. While this approach to big and small science has long been familiar in physics and information science, data sharing calls for a culture change in biomedical science.
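As a rough illustration of what standardizing on common data elements involves, the sketch below maps records from two labs onto one shared set of field names so they can be combined. Every field name, the two lab mappings, and the `harmonize` helper are hypothetical, invented for this example; real common data elements are defined with the research community.

```python
# Hypothetical shared field names: subject_id, age_years, diagnosis.
LAB_A_MAP = {"subj": "subject_id", "age_yrs": "age_years", "dx": "diagnosis"}
LAB_B_MAP = {
    "participant": "subject_id",
    "age_months": "age_years",
    "diagnosis_code": "diagnosis",
}


def harmonize(record, field_map, converters=None):
    """Rename a lab's fields to the shared names, converting units if needed."""
    converters = converters or {}
    out = {}
    for source_name, value in record.items():
        if source_name not in field_map:
            continue  # drop fields that have no shared equivalent
        convert = converters.get(source_name, lambda v: v)
        out[field_map[source_name]] = convert(value)
    return out


# Made-up example records from each lab.
record_a = {"subj": "A1", "age_yrs": 9.5, "dx": "ASD", "scanner": "3T"}
record_b = {"participant": "B7", "age_months": 144, "diagnosis_code": "ASD"}

combined = [
    harmonize(record_a, LAB_A_MAP),
    harmonize(record_b, LAB_B_MAP, {"age_months": lambda m: m / 12}),
]
print(combined)
```

Once both records share field names and units, they can be pooled or compared directly, which is the step that makes replication and integration across studies practical.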
Even before the new NIH policy, NIMH changed its data sharing policy and began building the essential infrastructure, especially for clinical research. Beginning this summer, NIMH-funded scientists involved in clinical trials were expected to enter individual-level data into our National Database for Clinical Trials. Federal policy already requires posting the results of clinical trials in ClinicalTrials.gov. Our new policy goes further by (a) focusing on individual-level data and (b) expecting data sharing at regular intervals during the trial, not just at the conclusion of the project. At the same time, we are developing common data elements with the community to ensure that trials can be compared with fidelity. The importance of having individual-level data to share became apparent in a new paper reviewing 37 reanalyses of published clinical trials. In 35% of them, the reanalysis led to a different interpretation than the original paper, with implications for the types and numbers of patients who should be treated.3
Autism researchers made this culture change a few years ago. Virtually all data from autism research involving human subjects are expected to be deposited in the National Database for Autism Research, which now holds genomic sequences, brain images, and clinical data from over 77,000 subjects. These data provide a platform for discovery through secondary analysis and data sharing specific to a publication.4,5 We expect similar opportunities to emerge from the Human Connectome Project, which is making brain imaging and genomic data broadly available from 1,200 healthy volunteers, including 300 twins. But one of the most interesting data sharing efforts will be the Research Domain Criteria (RDoC) project. Most people have focused on RDoC as a matrix in which domains of function are arrayed against the units of analysis used to study those domains. In fact, it is a big data project built on an information commons for compiling and integrating data through the recently established RDoC Database. New diagnostic clusters for classifying illness are expected to emerge from this information commons, much as we have seen genomic signals emerge from large amounts of sequencing data.
This shift from the “data are mine” to treasure troves of “mined data” will be disruptive for many. Academic culture is built on individual promotion, often dependent on holding on to data until results can be published in the maximum number of papers in the highest impact journals. For some investigators, sharing unpublished data will feel like giving away the crown jewels. For others, the work of sharing—and it is considerable additional work—will require supplemental funding. Some have asked about the value of sharing when they have too little time or funding to analyze their own data fully. Others have worried about the quality of data that will be shared.
We hear the concerns, especially from early-stage investigators who are already facing unprecedented competition for funding. But we also hear the frustration from the public and from Congress, which funds NIH research. Their concerns: progress is slow, there is too little collaboration across labs, and scientists seem to be funded to study a problem, not to solve it. Data, not just the results, generated as a consequence of taxpayer-funded projects should be a resource for all. Add to this frustration the recent evidence that some NIH-funded research cannot be replicated and you have a recipe for change. Data sharing may not solve all of these issues, but if it creates an information culture with more people empowered to work on NIMH problems, we may see this sea change as the catalyst to finally solving some of the most complex scientific issues we face.
1 Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014 Jul 24;511(7510):421-427.
2 Collins FS and Tabak LA. NIH plans to enhance reproducibility. Nature. 2014 Jan 30;505(7485):612-613.
3 Ebrahim S et al. Reanalyses of randomized clinical trial data. JAMA. 2014;312(10):1024-1032.
4 Supekar K et al. Brain hyperconnectivity in children with autism and its links to social deficits. Cell Reports. 2013 Nov 14;5(3):738-747.
5 Gaugler T et al. Most genetic risk for autism resides with common variation. Nature Genetics. 2014;46(8):881-885.