Pioneering NIMH Data Sharing

Science Update

NIMH's first major effort to share clinical trial research data – established before many current data registries existed – is still doing a brisk business. The NIMH Limited Access Datasets (LAD) project, which includes data from 23 large NIMH-supported clinical trials, recently sent out its 300th dataset. The datasets are referred to as "limited access" because, to protect the human study participants from whom the data were obtained, only qualified researchers may obtain them, and only upon approval of a Data Use Certification (DUC). The DUC stipulates the terms and conditions under which the data may be used, including terms for data security and confidentiality and for acknowledgement of the original data submitters in publications. The datasets have provided the raw material for at least 160 – and likely many more – published scientific papers, an example of how data sharing multiplies the return on investment and the benefit from clinical research.

The benefits of data sharing have been acknowledged for years; making raw data accessible to other scientists working in the same field enables them to confirm research findings and to mine the same data, making new comparisons and asking new questions without doing the costly and resource-intensive work of carrying out new research. Data sharing has been gaining momentum, with NIMH and other NIH Institutes and Centers creating searchable online repositories for data. The creation of the LAD, however, was an important milestone in the effort to make it possible to share raw research data.

In 2003, NIH-wide policy called for grants receiving over $500,000 in funding to include a plan for sharing data. Nevertheless, investigators wishing to comply with the policy may have been stymied by questions such as where the data generated via their grants would be stored and managed in the long term, after funding ceased. With the LAD, NIMH has provided infrastructure for archiving data from large clinical trials and making them available to requestors, thereby giving funded investigators the means to comply with the NIH directive. Investigators are given guidance for documenting and preparing the datasets for submission to NIMH, including guidance for stripping personally identifiable information from the datasets prior to submission and for organizing associated metadata, so that secondary users have a clear understanding of how the data were generated and what the variables were designed to measure. De-identified datasets and associated metadata are then submitted to NIMH, which clarifies, organizes, and distributes them to requestors via CD-ROM.

In the course of clinical trials, investigators collect data on various measures: scores on depression scales, blood test results, results of genetic analyses, and so on. The LAD project provides all of them. Whereas scientists relying on aggregated, group-level results reported by multiple studies can conduct only descriptive meta-analyses, scientists using the LAD can integrate datasets and examine item-level data and specific measures recorded for individual patients, not just aggregates. Clinical trials are complex and expensive to conduct; data mining via the LAD allows scientists not involved in a trial to explore its data without the expense of repeating it. The return on investment of research dollars is thereby expanded.

Advances in big data and informatics since the establishment of the LAD project have created new means for encouraging widespread data sharing. NIMH has developed a series of federated data repositories to store data from a variety of studies: the National Database for Autism Research (NDAR); the NIH Pediatric MRI Data Repository (PedsMRI); the National Database for Clinical Trials related to Mental Illness (NDCT, see NOT-MH-14-015); and the Research Domain Criteria Database (RDoCdb). NDAR and PedsMRI have been facilitating data sharing for several years now, whereas NDCT and RDoCdb are new initiatives. NIMH intends to incorporate the LAD into the NDCT, in order to provide researchers with a "one-stop shopping" experience for clinical trial data from NIMH-supported research; in the meantime, the LAD continues to provide datasets that haven't yet been transitioned to the NDCT, and the momentum of requests continues to grow.

NIMH's Clinical Trials Operations and Biostatistics Branch (now a component of the Office of Clinical Research) has been managing the LAD since its inception in 2003. Branch staff members work with data coordinating centers while studies are ongoing to provide a data structure and, after study closeout, review data submitted. This quality assurance review is an important feature of the LAD; it ensures that the data are de-identified, thoroughly documented, and error-free.

The LAD includes datasets from NIMH-funded studies such as the Multimodal Treatment Study of Children with Attention Deficit Hyperactivity Disorder (MTA), Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE), Systematic Treatment Enhancement Program for Bipolar Disorder (STEP-BD), and Sequenced Treatment Alternatives to Relieve Depression (STAR*D). Using the data supplied from these and other studies, requestors have tested their own hypotheses and published on issues such as genetic variation and treatment response in schizophrenia and major depression, the impact of anti-inflammatory medications on response to antidepressants, predictors of suicide events in patients with bipolar disorder, medication interaction with dopamine receptors, and remission in schizophrenia.

The experience of investigators who have made use of archived datasets illustrates the utility of this resource. L. Eugene Arnold, M.D., at Ohio State University, who has worked extensively with the MTA, notes that the data allow investigators to pursue "thought experiments." The MTA has 151 measures, many of which were retained through the longitudinal study. "You can wonder what the correlation would be between two different scales – and check to see what the relationship is, without carrying out a big study."

Amber Bahorik, a doctoral candidate at the University of Pittsburgh, points out that as a graduate student, her research questions would have been limited without access to a large dataset. "Small clinical trials are powered for specific things. The CATIE data have a lot of different measures, so you can ask a lot of questions." In addition, access to this dataset gave her the opportunity to get experience working with data and establish a research agenda and trajectory, eventually paving the way to being an independent investigator.

"The possibilities the datasets present for asking questions without the added, significant expense of original clinical research is valuable in and of itself," said Adam Haim, Ph.D., Chief, Clinical Trials Operations and Biostatistics Branch. "For investigators whose access to research funding is limited, for example, scientists in the developing world, it can enable research that would otherwise be impossible." The LAD makes these datasets available to scientists anywhere in the world whose institutions maintain an approved Federalwide Assurance for the protection of human subjects.

Investigators interested in information on the available datasets can go to the LAD website. Those interested in submitting datasets can now go to the NDCT website. Everyone, from study participants to providers of the data to recipients, gains.