Director’s Blog: An Emerging Era of Big Data
Every era has its technological icons. In the early 20th century, it was the airplane and the automobile. In the mid-20th century, it was the television and the telephone. Now in the early 21st century, the smart phone and social networking appear to be the defining technologies. And increasingly, the current era is beginning to look like the era of “big data” — a term that refers to the explosion of available information, a byproduct of the digital revolution.
“Explosion” is not too strong a term, especially in the mental health field. Watching Google or Wikipedia, it’s easy to accept the estimates that the amount of information doubles every two years. But in biomedical research, the rate of growth is much faster. One of the drivers is the advent of inexpensive, fast sequencing of DNA and RNA. In 2002, if you wanted to sequence a megabase (1 million bases) of DNA, you needed $5,292 and several weeks of manual labor to do it. Today, you only need $0.19 and a few hours of machine time.1 This difference is even more striking at the level of the human genome. While the first human genome was a $3B project requiring over a decade to complete in 2002, we are now close to being able to sequence an entire genome in a few days for only $1,000.2
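To put the per-megabase figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes "today" is about 10 years after 2002 (the `YEARS_ELAPSED` value is an assumption, not a figure from the cost table), and it compares a drop in cost with a rise in information volume, which are related but not identical quantities. The function names are illustrative only.

```python
import math

# Figures quoted in the text: cost to sequence one megabase of DNA.
COST_2002_USD = 5292.00
COST_TODAY_USD = 0.19

# Assumption (not from the source): "today" is roughly 10 years after 2002.
YEARS_ELAPSED = 10.0


def fold_change(before: float, after: float) -> float:
    """How many times cheaper 'after' is than 'before'."""
    return before / after


def halving_time_years(before: float, after: float, years: float) -> float:
    """Average time for the cost to halve, assuming a steady exponential decline."""
    halvings = math.log2(before / after)
    return years / halvings


def fold_from_doubling(years: float, doubling_period_years: float) -> float:
    """Growth factor for a quantity that doubles every 'doubling_period_years'."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    cost_drop = fold_change(COST_2002_USD, COST_TODAY_USD)
    halving = halving_time_years(COST_2002_USD, COST_TODAY_USD, YEARS_ELAPSED)
    general_growth = fold_from_doubling(YEARS_ELAPSED, 2.0)

    print(f"Sequencing cost fell about {cost_drop:,.0f}-fold")
    print(f"Implied cost halving time: about {halving * 12:.0f} months")
    print(f"Doubling every two years over the same span: {general_growth:.0f}-fold")
```

Under these assumptions, the per-megabase cost fell by a factor in the tens of thousands, with an implied halving time of well under a year, while a quantity that doubles every two years grows only 32-fold over the same decade.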
We’ve seen similar changes in brain imaging, where higher-resolution instruments are generating massive data sets that can provide more precise pictures of brain structures. In fact, the big data revolution is found at every level of NIMH research, including clinical studies that are now able to capture inputs from digital sensors. In addition, studies of social networks can, for the first time, combine information from millions of people, surveying what some have called “humanity’s dashboard,” a tool that may help us combat many diseases and other social ills.3 A now-famous example of social network studies tied a spike in emergency room visits for the flu to an increase in the number of people searching Google for “flu symptoms” and “flu treatments” two weeks before the ER spike.4
These revolutionary changes in data acquisition create profound challenges for storage. Indeed, it may now be less expensive to generate the data than to store it. The National Center for Biotechnology Information (NCBI) has been our control tower for directing big data efforts in biomedical science, but neither the NCBI nor anyone in the private sector has a comprehensive, inexpensive, and secure solution to the problem of data storage.
Even more challenging than storage is the task of translating big data into better knowledge. Sequencing the genome or mapping the brain gives us the opportunity to discover new, important frontiers, including genes and brain areas we did not even know existed. But vast data sets may also invite faulty science, tempting an investigator to search for only the data that support his or her own theory. There are safeguards to preclude such “false discoveries,” but even these may fail to prevent a biased use of selective data sets.
These caveats notwithstanding, the big data revolution can be transformative for mental health research, but only if much of this data becomes public. After all, if knowledge is power, then making scientific and health data public can be empowering. We are already seeing this happen with “public access” scientific and medical journals, as well as PubMed Central, which was created to make the results of all publications from NIH-funded studies available for free.
Of course, having places in which to share information only helps if scientists are willing to share. Biomedical science has a proprietary tradition that has been slow to change in the face of NIH’s increasing focus on data sharing.5,6 But as more scientists see the successes of sharing, such as the Psychiatric Genomics Consortium and the 1000 Connectomes Project, the proprietary culture will become more transparent and collaborative.
Some of the most innovative vanguard efforts to harness the power of big data are found outside of government and outside of mental health. The Personal Genome Project, Patients Like Me, NextBIO, and some of the projects within Sage Bionetworks are among the current efforts connecting individuals to big data related to their health. As these crowd-sourced efforts give individuals information about their own health, they are also creating knowledge for all of us. In a classic example, data registered on Patients Like Me indicated that using lithium to treat amyotrophic lateral sclerosis (ALS) was futile, years before the completion of prospective trials.7
The mental health community has been slower to join this revolution, but this could change. It just requires a passion to share information, a capacity to develop data repositories, and a vision for turning individual data into collective knowledge. We have some unique challenges in the mental health community: lack of a central organization, inconsistent quality of information, and in some cases, a denial of illness. But — as we are seeing in areas as diverse as robotics and baseball — big data has a way of overcoming big challenges. In fact, big data may be the solution for a field that has been lacking in metrics of performance or success.
1. DNA Sequencing costs table.
2. International Human Genome Sequencing Consortium et al. Initial sequencing and analysis of the human genome. Nature. 15 Feb 2001. 409:860-921.
3. Quote attributed to Rick Smolan, in Lohr, S. 11 Feb 2012. The Age of Big Data. New York Times. Accessed http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html .
4. Carneiro HA and Mylonakis E. Google Trends: A web-based tool for real-time surveillance of disease outbreaks. Clinical Infectious Diseases. 2009. 49:1557-64.
5. Tenopir C et al. Data sharing by scientists: practices and perceptions. PLoS ONE. 29 Jun 2011. 6(6):e21101.
6. Savage C and Vickers A. Empirical study of data sharing by authors publishing in PLoS journals. PLoS ONE. 18 Sept 2009. 4(9):e7078.
7. Wicks P, et al. Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm. Nature Biotechnology. Apr 2011. 29:411-414.