# Workshop on Advanced Statistical Methods and Dynamic Data Visualizations for Mental Health Studies: Day One

## Transcript

**DR. WOUHIB**: Thank you for joining us for the Workshop on Advanced Statistical Methods and Dynamic Data Visualization. My name is Abera Wouhib. I am a mathematical statistician and program officer at the National Institute of Mental Health. NIMH is the lead federal agency for research on mental disorders and the host of this workshop. It is one of the 27 institutes and centers that make up the National Institutes of Health, NIH.

I would like to invite NIMH’s Director, Dr. Joshua Gordon, for his opening remarks. Dr. Gordon is a great supporter of statistical applications in mental health research. Grant applications to NIMH are held to high statistical standards. As a result, it is difficult to secure funds without checking all the boxes for statistical validity. More importantly, NIMH established a program called Statistical Methods in Psychiatry three years ago. As a statistician, it is always good to work at a place where statistics blends with research.

Without further delay, here is Dr. Gordon.

**DR. GORDON**: Thank you, Abera, for that introduction, and thank you and the rest of the team as well as the external members for organizing this workshop. I am really excited to be here and to follow along.

Statistical and mathematical approaches (inaudible) -- It is especially true as we move forward with two important initiatives across the NIH and especially at NIMH. One is the effort to improve rigor and reproducibility throughout our research portfolio. As all of you know, several years ago many new initiatives were instituted to ensure rigor and reproducibility of our clinical studies. Some of you may know that a recent workgroup of the Advisory Council for the Director set about to begin consideration of what we might need to do to encourage similar efforts with animal studies.

These approaches uniformly recognize the importance of statistical expertise not just after the data is acquired but in the design of the studies themselves. So we recognize the need not only to use the expertise we have but to also support the development of novel techniques, novel approaches, in order to ensure the rigor and reproducibility of the work that we fund.

A second area where this work is incredibly crucial is as we continue to expand the emphasis on computational approaches to psychiatry in general, and that especially means developing ways of evaluating the rigor and reproducibility of modeling approaches and of machine learning approaches to big data. These are all important new areas that we are continuing to expand at NIMH, and it is the work of people (inaudible) virtually to listen in, to follow and to ask questions and discuss. It’s that work that is going to enable us to make sure that, as we expand in these areas, we do it properly.

Thank you all for coming. I am looking forward to the day.

**DR. WOUHIB**: Thank you very much, Josh, for really uplifting remarks. I apologize, for clarity -- because Dr. Gordon was really committed to give his remarks and he was at the same time on other duties, so he was providing this talk from his car. So I really appreciate that.

Next we go to the introduction and vision, for which I have very few slides and brief comments covering statistics in mental health in general, NIMH’s statistical program and today’s workshop. I will just highlight what it covers and what our vision will be from the workshop.

Investigators and researchers know the role statistics plays, particularly in the biomedical field. Descriptive methods summarize biomedical data by highlighting trends and support valid diagnostic methods and therapies. Inferential methods enable researchers to summarize findings and conduct testing. Biomedical agencies including NIH review the research process and scrutinize new proposals from the statistical perspective for validity. Because of that, greater knowledge of statistics has become crucial. Researchers and investigators are either getting direct training in statistics or starting to collaborate with statisticians.

In general, statistical thinking is critical in basic biomedical research, including study planning, experimental design, and sample size and power determination. Data collection, data analysis and interpretation are part of it, as is manuscript preparation. Although the nature of the data makes it more challenging, that is also important in mental health studies.

Due to the nature of the data, it is common to see statistical applications in mental health which are uncommon in the methodological approach. Advances in statistical methods and applications are lagging behind the data collection in mental health studies. For example, mobile psychiatric data collected from smartphones meet the criteria of big data with respect to volume, velocity and variety, just like what Josh mentioned a while ago.

Although this kind of data has the potential to be clinically useful, developing statistically sound methods to analyze it could be even harder. Reproducibility is defining the future of science as we know it, and obviously, lack of reproducibility creates distrust in scientific findings among the scientific communities.

Inadequately designed studies are hard to reproduce and have inflated errors, what we call false positives and false negatives. Treatment effects usually happen to be small, with a great degree of individual response variability and heterogeneity, and the data are noisy and unstable. Although there are many reasons for lack of reproducibility, a workshop given in 2016 at the National Academy of Sciences identified some statistical factors. These include improper data management and analysis, lack of statistical expertise, incomplete data, and difficulties applying sound statistical inference to available data.

And even in our own initiatives we have seen that, once a study is funded, a very optimistic statistical analysis and power calculation in the grant application may end up with marginal effect sizes and insufficient statistical power. An investigational treatment may not demonstrate statistical superiority over its comparator. Poorly designed studies fail to separate experimental treatments from established treatments or placebos.
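To illustrate how an optimistic power calculation can unravel, the following is a minimal sketch using a normal approximation for a two-sided, two-sample comparison. The effect sizes (0.5 planned, 0.25 realized) and the group size of 64 are hypothetical values chosen for illustration; they do not come from any NIMH-funded study.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d is the standardized mean difference (Cohen's d), assumed here
    for illustration. Uses a normal approximation and ignores the
    negligible chance of rejecting in the wrong direction.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    noncentrality = abs(d) * sqrt(n_per_group / 2)
    return z.cdf(noncentrality - z_crit)


# Hypothetical scenario: the application assumed d = 0.5,
# but the true effect is half that size.
planned = two_sample_power(0.5, 64)
realized = two_sample_power(0.25, 64)
print(f"planned power:  {planned:.2f}")
print(f"realized power: {realized:.2f}")
```

Under these assumptions, a design planned for roughly 80 percent power has well under 50 percent power once the true effect is half the assumed size.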

I just would like to bring this example regarding reproducibility. It was done in 2015, and 270 authors contributed to this study, and 100 studies were conducted. The original authors of this publication were also part of the reproducibility study.

They selected relatively high-powered studies from three psychological journals, Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory and Cognition. What they found in these 100 studies in three journals is that not all the effect sizes were even in uniform units, so they converted all of them to a common effect size, the correlation coefficient. So all the effects were converted to correlations to make a very simple comparison, and inferences about reproducibility were based on the original and replication study effect sizes.
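As a rough illustration of putting different test statistics on a common correlation scale, here is a minimal sketch using standard textbook conversions. The transcript does not say which formulas the replication study used, so the function names and the choice of conversions are assumptions.

```python
from math import copysign, sqrt


def t_to_r(t, df):
    """Convert a t statistic with df degrees of freedom to r, keeping the sign."""
    return copysign(sqrt(t * t / (t * t + df)), t)


def f_to_r(f, df_error):
    """Convert an F statistic with 1 numerator df to r (magnitude only)."""
    return sqrt(f / (f + df_error))


def d_to_r(d):
    """Convert Cohen's d to r, assuming two groups of equal size."""
    return d / sqrt(d * d + 4)


# Hypothetical inputs: t(16) = 2.0, F(1, 16) = 4.0, d = 1.0
print(round(t_to_r(2.0, 16), 3))
print(round(f_to_r(4.0, 16), 3))
print(round(d_to_r(1.0), 3))
```

Because F(1, df) equals t squared, the first two calls give the same magnitude, which is a quick sanity check on the conversions.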

On this slide you see on the Y-axis the replication effect size and on the X-axis the original effect size. The original effect sizes, as correlations, ran from zero to 1, but in the replications of the same studies the correlations ran from negative 0.5 to 1. As we see it, all those grids are supposed to be -- If everything is perfect, if the original studies were really replicating themselves, all the correlations are supposed to be on this diagonal axis. But as you see, most of the replications are really lower than what they were supposed to be, they are lower than the axis, which is telling you what I said a while ago. Most of the studies which we think in the original proposal have a strong effect size are getting smaller when they are reproduced. This is telling you what these people studied in 2015.

What they concluded from this replication study results is the study showed that replication effects were half the magnitude of original effects in general. When they take overall 100 studies, they found that the replication is half the size of the original effect size.

Ninety-seven percent of the original studies had statistically significant results, while only 36 percent of the replications did. Forty-seven percent of original effect sizes were in the 95 percent confidence interval of the replication effect size. Only 39 percent of effects were subjectively rated to have replicated the original result.

From this study, what the investigators are saying is, if the original study is not good enough, we don’t have to expect a good result from the replication.

As I said in my earlier remarks, we have a program called Statistical Methods in Psychiatry. You see the link and probably you can find it on NIMH’s website. There were many reasons for this program to be established. As I said, the data coming from mental health studies are overwhelming compared to the methodology. This program was established at NIMH in the Adult Psychopathology and Psychosocial Interventions Research Branch of the Division of Translational Research.

The fundamental goal of this program is to find statistical methods and analytical plans which can validate biomarkers and novel treatment targets. And it encourages new methods and applications which are really suitable for psychiatric research and encourage development of innovative statistical challenges to come from statistical experts in mental health research.

The program emphasizes developing novel methods of classifying mental disorders using integrative methods, just using different data sources, bringing small but many studies together, and also encouraging meta-analysis. Also, inventive research to understand the psychological, behavioral, cognitive and natural mechanisms that cause mental disorders.

And particularly we would like to see more research in the area of psychiatric studies with data anomalies, like most studies in mental health have the problem of multiple testing, correlational problems which in most cases we took for granted independent of (inaudible). Clustering is another issue, and heterogeneity is a major issue in mental health studies, and missing data. Just like in any other research data, those are the problems.

So what do we cover really in today’s workshop? The workshop covers a wide range of statistical applications in mental health research from the highlights I see, and recent statistical advances are also incorporated in this particular workshop for today.

I consider four different areas where we are looking at the gap in statistical research to be filled. Those four are: recent advances in statistical methods for mental health services research. The second highlights statistical methods for generating reliable and reproducible findings from neuroimaging data, and it can be applied also for other parts of mental health data.

Third, statistical testing and power analysis for high-dimensional neuroimaging mental health data is another topic we are looking at in today’s sessions. And recent statistical developments in imaging genetics is also one of the areas where we see major gaps, and we will see what we can learn today.

Finally, based on these four topic areas, including other issues which are really major to us, we will have a panel discussion as the final session.

Some of the highlights in today’s workshop, like linking different clinical sites through common data elements. These are the kind of things we would like to see forward. We would like to see statistical studies in that area.

Patient-specific characteristics in treatment decisions. The use of social media information. These are particularly in the service areas. I am assuming these are the areas we would like to see some research going ahead.

Determining the lifecycle of data -- how good are those data, particularly in this age of information. Probably the validity of the data depends upon just the lifecycle.

The issue of multiple comparisons and confounding in high-dimensional genetic and neuroimaging data. Those are also discussion points in today’s sessions. Interactions of genetic and environmental variables. And if difference in effect size is attributable to differences in treatment efficacy or differences in methodology -- those are also for discussion in today’s talk.

More details are coming from the speakers.

What is our vision? Really so many questions to be answered by this workshop. These are the kind of things we always see. From the statistical perspective, what are the visible and easily achievable methodological adjustments that should be considered to increase reproducibility in mental health studies? This was the main point in Josh’s talk.

What should be the role of pilot studies? Are all pilot studies good in our studies, particularly in getting information from these small pilot studies? Are those really useful for larger clinical trials and studies? Are effect sizes from the pilot studies useful, because they are coming from very small studies?

Are there statistical advances to get around low statistical power? Those are sticking points in our statistical research, and we get most of our studies low in statistical power. Is there any way we can get around it by involving small study plans, Bayesian approaches? What new statistical advances can help us to get around it?

And what is the benefit of health information linked to electronics? This is information which is available these days. And is there any way we can exploit the electronic health record in future mental health studies and in clinical studies, because we can use that humongous information from electronic health records. These are the kind of areas we are looking forward to go on.

Are there methods for studying effectiveness of adapted interventions? I am sure we will have those questions when the time is appropriate.

What are the methods for identifying effect moderators of prevention and treatment interventions in mental health studies? How can we deal with common data anomalies, just like I said, multiple comparisons, correlation and clustering and heterogeneity? These are common anomalies we see in mental health studies, and what are effective ways, other than always increasing the sample size? We look forward to see how we can deal with those kind of things.

What is the long-term effect of uninformed use of toolboxes in neuroimaging? We have multiple toolboxes in fMRI and I am not sure all of them would produce the same results from the same data. And what are those effects in the long run? We would like to see those kind of studies because our researchers are using those toolboxes all the time and I am not sure if really, especially the studies that are done, which of these toolboxes are appropriate for which kind of data and why. We would like to see some studies in the future incorporating those kinds of questions.

How can we encourage statisticians and applied researchers to collaborate more? And what are the training areas that can really help to get ahead our statisticians and other researchers in mental health studies? These are the kind of things we would like to see and we will address some of these questions in the afternoon.

Finally, this is a time to give special thanks to the organizers, the invited panelists, the speakers and to everyone who made this workshop possible. I really appreciate it from the bottom of my heart.

To mention a few housekeeping items before starting the first session, we would like to make sure your microphone is muted when you are not talking, and it would be good also if your cameras are off. I would appreciate it if people are really mindful to keep the background noise at a minimum. When I am introducing the moderators of each session I don’t think I will go through the detailed bios because those are already posted on the website. I will mention their name and affiliation and for the rest you can refer to the website.

Each session will cover about 80 minutes, and it will be followed either by a small break or lunch. If you have questions they should come through the Q&A. For each question coming through the Q&A the chair of that session will handle it appropriately and bring it to the floor when it is submitted.

Do the event planners want to say anything?

**PARTICIPANT**: There are a few questions in the Q&A.

**DR. WOUHIB**: What is the context of clustering in this instance? Clustering could be data which is coming from the same source, from the same school. That would be clustering. It could be from the same hospital in our context. Because they are coming from the same source they have some kind of common characteristics. That is what we call clustering.
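One common way to quantify why clustering matters is the design effect, which shrinks the effective sample size when observations from the same source are correlated. The sketch below uses the standard formula for equal-sized clusters; the numbers (1,000 patients, 20 per hospital, intraclass correlation of 0.05) are hypothetical and not taken from the workshop.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal-sized clusters.

    icc is the intraclass correlation: the share of outcome variance
    that lies between clusters (for example, between hospitals).
    """
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)


# Hypothetical example: 1,000 patients, 20 per hospital, ICC = 0.05
deff = design_effect(20, 0.05)
n_eff = effective_sample_size(1000, 20, 0.05)
print(f"design effect: {deff:.2f}")
print(f"effective sample size: {n_eff:.0f}")
```

Even a small intraclass correlation roughly halves the information in this hypothetical sample, which is why treating clustered observations as independent overstates precision.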

Let us go to the first session. The session moderator is Dr. Elizabeth Stuart, and she is from Johns Hopkins University. I would like to really highlight the organizers. I can mention Dr. Stuart, Dr. Bhaumik, Dr. Guo and Dr. Thompson, and of course my co-organizer, Michele Ferrante, and the moderators, Drs. Stuart, Guo, Bhaumik and Thompson. We really work as a team throughout this day and I really thank everybody for that.

Now we will turn to Liz Stuart. Liz is from Johns Hopkins University. She is a Professor there in statistics, and her bio is already listed. Without taking further time, Liz, take it away.

**Agenda Item: SESSION 1: Recent Advances in Statistical Methods for Mental Health Research**

**DR. STUART**: Thank you so much. It is really a pleasure to participate in this session. Thank you, Abera, and all of the NIMH team for organizing this. I am really excited about this day as someone who is a Professor of Mental Health and Biostatistics and Health Policy and Management, so the topics we’re discussing today are right at the interface of my interests.

I am particularly excited about this session, which is going to focus on mental health services research, so, investigating questions such as how do we make the best use of, quote, “real-world data” to answer questions about the efficacy or effectiveness of prevention or treatment interventions, how those interventions might be expanded, implemented or adapted to meet the needs of people with mental illness or to prevent mental illness in the first place.

The statistical issues you will hear about include how to deal with confounding, how to make sense of or integrate complex data sources such as electronic health records, and how to measure outcomes of interest in these real-world settings.

I also want to highlight that the work discussed today and much of what you will hear especially I think in this first session builds on a long tradition of support by the National Institute of Mental Health for methodological work related to mental health services work. This includes training programs.

It also includes the fact that multiple statisticians, multiple of us on this morning’s panel, serve on or have served on the Mental Health Services Review Panel, and so I think that integration of statisticians and the applied researchers has really helped move this field forward and helped develop advanced and rigorous statistical methods that answer these key questions in mental health services research.

This session is not for you to hear from me. We are lucky to have three speakers and I am really thrilled that all three agreed. Each will speak for about 15 minutes and then we will have a discussant.

Just to give you a heads-up and a sort of highlight of the session, the first speaker will be Melanie Wall from Columbia University, who will discuss data science as relevant for a learning healthcare system for first episode psychosis. Then Yuanjia Wang, also from Columbia, will discuss machine learning approaches for optimizing treatment strategies. Third, Munmun De Choudhury from Georgia Tech will discuss how to use social media to study mental health.

Finally, Ben Cook from Harvard Medical School will provide a discussion. I am particularly excited that Ben agreed because he is the current Chair of the Mental Health Services Review Panel, and he is not a statistician; however, he embodies these researchers who have a real appreciation for the interface between the statistical methods and the applied research. I will introduce each of them in more detail before their talks.

As Abera noted, we are going to have time at the end of the session for open Q&A, so I really do welcome you to put questions into the Q&A. If they are for a specific speaker please note that, and I will be keeping track and using those to moderate the discussion at the end of the session. I can answer questions as well in the chat as we go along.

Without further ado let’s turn to our first speaker, Melanie Wall. Melanie is the Director of Mental Health Data Science at the New York State Psychiatric Institute and the Columbia University Psychiatry Department where she oversees a team of 14 biostatisticians collaborating on predominantly NIH-funded projects related to psychiatry. She has worked extensively with modeling complex, multilevel and multimodal data on a wide variety of psychosocial public health and psychiatric research questions in both clinical studies and large epidemiologic studies. So you can see she is the perfect person to kick us off in the session. Melanie, I will turn it over to you.

**DR. WALL**: I, too, want to join the chorus of people who will be thanking NIMH for sponsoring such an event. As a biostatistician I am very pleased to see the support that NIMH puts behind the need for good methodology, and I am very happy to contribute to that.

What I will be discussing today is actually a project that comes out of some of the funding we have from NIMH and I will talk more about that in a second, but I want to highlight, too, my affiliation with the New York State Psychiatric Institute as well as Columbia University. But it is really through NYSPI that a lot of the work that I am going to present today is grounded, as well as much of our Division of Mental Health Data Science.

For this particular audience I might not need to go through a background about first episode psychosis, but nevertheless, just to make sure and not knowing the diversity of the people we have on the call here, I am going to go through what is first episode psychosis. I’ll talk a bit about what is coordinated specialty care and then specifically what is OnTrack New York and how that is motivated by a learning healthcare system, and really how data science plays a key role within the concept of a learning healthcare system.

And I will demonstrate for you an initial test case using data from OnTrack New York where we began to build a prediction model of patient outcomes that could then be fed back within the system to learn better about how to care for people with first episode psychosis.

First, what is first episode psychosis? Psychosis is a condition that affects the mind where there has been some loss of contact with reality. Common symptoms include delusions, which are false beliefs, and hallucinations where someone might see or hear things that others do not see or hear. When someone becomes ill in this way it’s called having a psychotic episode. In people ultimately diagnosed with schizophrenia, often the first episode of psychosis most commonly occurs early in life between the ages of 15 to, say, 30 years old. Identifying adolescents or young adults with a psychotic disorder early and connecting them with care has been shown to lead to better outcomes.

Back in 2008 NIMH launched the RAISE project to look at the efficacy of coordinated specialty care and showed that indeed this sort of team-based multi-component approach to the treatment of first episode psychosis did in fact lead to better outcomes.

What is coordinated specialty care? It is really a wraparound approach to caring for an individual who has experienced a first episode of psychosis that includes supportive education and employment, case management, psychotherapy, family support and education for the development of the disorder, as well as evidence-based psychopharmacology.

My colleague, Lisa Dixon, at the New York State Psychiatric Institute is one of the leaders in this particular area of coordinated specialty care and was very much a part of the RAISE project. A few years back she wrote a viewpoint on what it would really take, now that we have shown the efficacy of this kind of care, to bring it to scale and get it over the hump in terms of getting it into care.

In this viewpoint she talked about financing and workforce development needs as well as community activation that was needed. But in particular she also talked about a need to focus on measuring outcomes and fidelity to the model, as well as having youth and consumer involvement in their care.

As part of that motivation within the state of New York in collaboration with the Office of Mental Health, which is the sponsor for the New York State Psychiatric Institute, they were able to develop the OnTrack New York program, which is a set of 20 sites across the state of New York to implement this coordinated specialty care model for first episode psychosis. Just in New York City alone there are 15 of these sites that people can go to who had a first episode.

That idea of having sort of a network of sites also was part of the initiative NIMH put together a few years ago for the EPINET research program, and our program OnTrack in the New York State Psychiatric Institute also won one of these awards from NIMH to look at building, if you will, a learning healthcare system that would incorporate data and evaluation into the coordinated specialty care for these early psychosis programs. And so the OnTrack New York site, which is actually 21 sites, is part of a hub of sites across the nation that are looking into ways of evaluating this type of care.

The OnTrack New York data collection system is really a contractual obligation for particular sites. To be part of the program they have to agree to collect data, and they have to agree to be part of continuous quality improvement using that data. If we think about a patient who has had a first episode, at any kind of intake there is first a collection of the pathways that person took to get there in the first place, because we know that an important part of the process is actually getting into care in terms of help-seeking.

And then, when they are in the program there is sort of this back-and-forth model between the types of services provided and measuring the fidelity of those services, and then monitoring the outcomes in the person with psychosis to those services. And that somehow feeds back to suggest other services. Eventually, once they are discharged from the program, being able to monitor outcomes longer term, say, using potentially administrative records or labor force records, in order to keep an eye on how things are working after they leave the program.

So what is a learning health system? This is a very nice article from a few years ago sort of describing the general idea, but it really begins and ends with the clinician-patient interaction. It aspires to provide that continuous improvement in quality, outcomes and healthcare efficiency, and the way that is really done is through data. So, what can data science do here? I would say actually a lot because data is being collected -- or I should say wrangled, gathered -- at each step of this care process.

One of the things that we had proposed and are beginning to do is to utilize machine learning methods for prediction, in order to predict outcomes that might help to make better decisions along the way. And not just those predictions, but also how to bring the youth and consumers into the process through that feedback of information, which really requires data visualization and data literacy between the clinician and the patient in terms of decision-making.

Prior to our involvement with the current project of EPINET we had used OnTrack data before to show sort of group-level improvement over the course of one year of treatment within the service system. This is from a paper we published a few years back and on the left is time along the X-axis here. This is baseline enrollments every quarter up to one year, and this is using data that is administratively collected across the 21 OTNY sites. These three lines represent three different functioning measures from the GAF for the OnTrack cohort, and we can see there is improvement in three months, six months and all the way out to one year.

Down here the circles are education and employment. One of the goals of the program is to have people be in school or be working and not to sort of drop out of those life-functioning skills. We see that we can improve on average up to 80 percent of people being in school or work, and this is decrease in hospitalizations.

We previously reported at the population level, but what we really want to do now with the EPINET study is move into predicting for individuals and be able to have a sense for where someone is heading and to potentially make changes in their services based on that. For example, in this test case we take individuals from 1300 members of the OnTrack care system, and here are some of their demographics, and we would like to be able to see if we can do prediction for their trajectory over time at the individual level.

The OnTrack team collects a lot of data at admission and every three months follow-up. As I mentioned, they are contractually obligated to report this type of data on every one of the clients across time. We have a long list of measures that are collected, over 200 clinical variables, and we would like to see if we could use those with a data-driven approach to do prediction of particular outcomes.

We have done that and actually developed a secure R Shiny app to explore those visualizations of predictive outcomes. We used a balanced random forest for model development. Validation was done using both cross-validation and holdout samples of different types, where we would hold out randomly chosen people at random time points, and then also hold out a most recent site and see if we could actually predict that most recent site.
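The two ideas in this passage, balanced resampling per tree and holding out an entire site, can be sketched in a few lines. This is not the OnTrack team's implementation: the helper names, the toy labels and the site codes are all hypothetical, and fitting the trees themselves is omitted.

```python
import random


def balanced_bootstrap(labels, rng):
    """Draw one tree's training indices with equal counts per class.

    Each class is sampled with replacement down to the size of the
    smallest class, which is the resampling step behind a balanced
    random forest.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n_min))
    return sample


def hold_out_site(sites, held_out):
    """Split row indices so one whole site is reserved for testing."""
    train = [i for i, s in enumerate(sites) if s != held_out]
    test = [i for i, s in enumerate(sites) if s == held_out]
    return train, test


# Toy data (hypothetical): 8 clients in work or school, 2 not, across 3 sites
rng = random.Random(0)
labels = [1] * 8 + [0] * 2
sites = ["A", "A", "B", "B", "C", "C", "A", "B", "C", "C"]

sample = balanced_bootstrap(labels, rng)
train, test = hold_out_site(sites, "C")
print("class counts in one bootstrap:", sum(labels[i] for i in sample), "vs",
      len(sample) - sum(labels[i] for i in sample))
print("held-out rows:", test)
```

Holding out a whole site is a stricter check than random holdout, because none of that site's clients or practices inform the fitted model.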

The app was made so that it can take in varying amounts of client-level inputs and then predict any number of time points in the future. So think of it as a flexible input/flexible output prediction model.

And just to give you an idea of what that looks like, the app, you select how much data you want to use -- in this case we were going to only use the data we have up to baseline from an individual -- and then you pick which client you want to predict for. Here we are predicting the probability of them being in education or work at the follow-up time. At baseline they were indeed in work. They were a 1. You can either be a 1 or a zero on this. And so we would predict at three months a 79 percent chance that they would still be working, and then even out to one year a 79 percent chance. Pretty flat here.

We can take in more than one time point. The visualization on the right here actually took a lot of care and iteration with the team members to think about what would be a useful way to present this. Here we are showing that we know, with the black dots, at these first three time points they were not in work or school in the first two quarters, and then they were in work and school, but we still predict only about a 50 percent chance that by 18 months they would still be in work or school.

So you can see the trajectory based on what we know about their current work or school, as well as all of those background variables we have on the person at each of those time points. And just to give another example, knowing baseline, three months and then follow-up, we can see different predictions of where they are going to be.

Also in the app you can click on the future prediction, because we got a lot of feedback that people want to know what is driving those predictions. You can click on a future time point and a pop-up window will come up showing you the variable importance of what were the top variables that went into predicting at that particular time point for that particular person. Also, you can see the visuals of, say, the AUC curve for all those future time points as well as the distribution of the probability of being a yes or no.

I want to emphasize that in terms of the statistical method, we were using a pretty straightforward machine learning approach, which I consider quite standard these days, but what really took a lot of innovation here was how to think about presenting the results in a way that was meaningful.

So here we also developed this way of showing the sequence of how someone gets to a particular probability, thinking that might be useful for the clinician as well as the consumer. At baseline you only knew this about the person and so you would predict this. And then you knew two time points and in fact, maybe strangely, the prediction goes down, but that’s because there are other variables that the model knows about the person. Then they were in work or school, and then they were not, so it goes down a little bit further. So that type of dynamic visualization of the prediction seems to be something that is important.

To summarize the predictability across the cross-validation: overall, for predicting work or school -- this is actually for predicting at 12 months -- we had a .88 AUC, which I guess is typically considered good, but as I will come back to in a moment, what’s considered good enough might be a question we want to think harder about.
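For readers less familiar with the AUC, the sketch below computes it directly from its definition: the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The labels and scores are made-up illustrative values, not data from the study.

```python
def auc(y_true, scores):
    """Area under the ROC curve as the probability that a randomly chosen
    positive case is scored above a randomly chosen negative case
    (ties count one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = in work or school) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
example_auc = auc(y_true, scores)  # 8 of 9 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking; it says nothing about whether the predicted probabilities are well calibrated.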

And here, just looking at the three holdout samples for new clients, a new site and new follow-up times, the predictability, the AUCs were quite good in those holdout samples.

Still to be done is really questions of how to deploy this. How good should those predictions be before we really recommend to the clinician that they need to take a different action? I think that is not an easy question to answer or to think about how to even build evidence for, although I think there are some ways to do that. And what is the best way to provide the information? I showed you some visualizations but perhaps there are better ways to do that.

And how often should those prediction models be updated? We built that model on 1300 people that we had access to. Should we keep updating that model as more and more data comes in? And how to make that decision of what data can be included?

Finally, I just want to end with this graph that I think is really useful when starting out to try and build something that is going to be useful for deployment. You have this expectation that a lot of the time when you’re going to build one of these models you are going to spend in the yellow, which is optimizing that machine learning algorithm, and then a little bit of time will be for building the infrastructure, collecting the data and the deployment.

But actually, I think now that we have gone through it, it’s really much more like this graph here. In reality, we spent some time but not that much time on the actual optimization of the algorithm, and really it’s about collecting and cleaning that data, having what Lisa had already done, building the infrastructure to collect the data in the first place. And now really we are in this phase here of deployment. What is the best way to utilize this within the care setting?

Finally, I want to just thank my partners. Cale Basaraba was the lead data analyst helping to build the statistical model. Jenn Scodes contributed much to this work, and of course, Lisa Dixon. I will stop there.

**DR. STUART**: Great. Thank you so much, Melanie. A great talk to start with. I think you nicely covered that whole spectrum of some advanced statistical methods addressing a really important clinical and services question and, given this two-day workshop on statistical methods today and data visualization on Wednesday, some nice examples also of how to visualize and show these results. Great. Thank you for kicking us off so well.

We are now going to turn to our second speaker, also from Columbia, Yuanjia Wang. She is a professor in the Department of Biostatistics and the Department of Psychiatry at Columbia University and is a core member of the Division of Biostatistics at the New York State Psychiatric Institute. She was elected a fellow of the American Statistical Association in 2016 and, as you will see a bit today, works on developing data-driven approaches to explore relationships between biomarkers, clinical markers and health outcomes to assist discoveries in disease etiology and increase diagnostic capabilities.

Yuanjia, I will turn it over to you.

**DR. WANG**: Thank you, Liz, for the introduction. I want to join Melanie and Liz in thanking the NIMH for funding statistical methodology research.

My talk will be grounded in the context of precision medicine. Precision medicine stems from a paradigm different from one-size-fits-many care models. These one-size-fits-many care models are very inefficient. Meta-analysis shows that the top high-grossing drugs in the US have very poor NNT (number needed to treat). For one treatment responder there would be anywhere from three to 24 patients who did not respond. So, many sources of variation would contribute to this inefficiency.

We heard from Abera’s opening remarks that there is heterogeneity of diseases, heterogeneity between patients and over time, so precision medicine would propose to use a more targeted and tailored approach to improve efficiency and also reduce cost. This audience will be very familiar with RDoC, which proposes to characterize mental disorders more precisely based on biological, behavioral and psychological measures at different domains and also at different levels of measurement.

This particular conceptual model of precision psychiatry would propose to use neural dysfunction at the circuit level to tailor treatment for depression. However, these conceptual models will need empirical evidence to support decision-making, so there are empirical studies that can provide the data for us to test these conceptual models. One example is the EMBARC study which aimed at looking for bio-signatures of antidepressant response.

We also worked on the treatment study randomized trial to look at optimal treatment of complicated grief. So this provides a wealth of data for us to learn some targeted strategy.

Why do we need machine learning for this, though? The current practice is to test for interaction between treatment and covariates. However, if we fall into this testing paradigm, we know that each covariate may only contribute a very small effect, so we would need to test over a high-dimensional covariate space. We run into the multiplicity issue pretty quickly. Also, there is no guarantee that the treatment response and the covariate would fall into a linear relationship. In this very ideal and conceptualized model they do, but in practice there would be many covariates contributing to this effect, and the response curve may be highly non-linear.

Also, when we look for interactions, the linear regression model would aim at predicting outcomes. It would not aim at predicting which subgroup would respond better under the new treatment and which subgroup would not. However, when we’re talking about tailoring treatment and selecting optimal care among several options, we are really predicting the direction of those conditional treatment effects. We are looking for the sub-population in which the new treatment is better than the usual, so it has a positive impact, and the sub-population in which the new treatment is worse, so it has a negative impact. So a linear regression would not directly do that.

We worked on the tree-based approach to get at, first, non-linearity; secondly, to distinguish those qualitative interactions from quantitative interactions so we can predict the size of the treatment effect as well as whether there is a group with a large magnitude of treatment effect.

Another group, Petkova et al., looked at a non-linear treatment decision function as a function of a combination of covariates.

These machine learning approaches move us to tackle some of the issues. They would provide us some powerful approaches to estimate nonparametric decision functions. They are powerful to handle large, diverse health data; for example, those data collected at the precision medicine initiative which proposes to look at electronic health records, measures from mobile and wearable devices, and also biospecimens. So machine learning approaches are well known to be able to handle some of the non-structured data.

They are also goal-oriented. If you can define your objective function there is a wealth of advanced computational tools that you can use to optimize your objective function. However, they are not a panacea, especially for health-related applications. There are many challenges.

The first is that many of our research questions are causal, so the direct use of machine learning is not sufficient. We don’t get to observe both potential outcomes under both treatments. Applications have high impacts on human health, and a lot of applications are high stakes, so we really emphasize reproducibility and robustness and generalizability.

We have a combination of supervised and unsupervised learning to reduce disease heterogeneity -- the discovery of disease subtype -- that would be unsupervised, looking at the clustering and so on. But also we hope that these subtypes are associated with a treatment response, so there is a supervised component to it. How do you optimally combine these two?

Also, data are expensive. We don’t have the luxury of having much big data if we are working with trials. Some of the studies are of moderate size. So, when we are designing our machine learning architecture we might want to keep that in mind for the super-complex models. We want to balance interpretability -- we probably prefer this to the black box.

So how do we address some of these? First, let’s consider using individualized treatment rules to tailor our treatment, so, using these ITRs, defined as D(X). You can think of these as a rule book. They take patient characteristics as input and output is optimal treatment for a given patient.

The objective function you can define to tailor treatment is what’s called the value function. It is just the expected outcome: you define your outcome of interest and take the expected value, given that you assign treatment based on this rule, D(X). Instead of assigning everybody to an antidepressant, or instead of randomly assigning treatment as in a randomized trial, you assign treatment using this D(X) you learned from some data.

So, what is the expected value of the outcome if you use this rule? Assuming a higher outcome is more desirable, your optimal rule would be the ITR that maximizes your value function, giving you the highest reward.
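In symbols, the value function and the optimal rule described here can be written as follows. The notation is an assumption added for illustration and was not shown in the talk: $Y^{*}(a)$ is the potential outcome under treatment $a$, $A$ is the treatment actually assigned, $X$ are the patient characteristics, and $\pi(a \mid x)$ is the known randomization probability.

$$
V(D) = E\left[\, Y^{*}\big(D(X)\big) \,\right], \qquad D^{\mathrm{opt}} = \arg\max_{D} V(D).
$$

In a randomized trial the value of any rule can be expressed through observed data alone by inverse probability weighting:

$$
V(D) = E\left[ \frac{\mathbb{1}\{A = D(X)\}}{\pi(A \mid X)} \, Y \right].
$$

Only patients whose assigned treatment happens to agree with the rule contribute, and each is weighted up by the inverse of the probability of having received that treatment.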

Most existing methods focus on observed variables and a single observed outcome, but that is assuming that all the heterogeneity across individuals can be entirely explained by these observed factors, which is often not the case for mental disorders. If you look at this plot of patient-based characteristics you can see there is a lot of heterogeneity going on and a lot of hidden patterns. So we want to account for both those observed and hidden patterns when we tailor the treatment.

Let’s first look at the outcomes. Our outcome measure for the primary endpoint in an antidepressant trial might be the HAMD scale used to measure depression symptoms. A clustering analysis puts these HAMD items into three domains, so, if you are only looking at the HAMD total as your outcome measure you might be obscuring differences between patients with more atypical symptoms and patients with more core emotional symptoms. They would have the same total but they can be very different and have completely different patterns across their outcome measures.

So we propose, instead of looking just at a single thing, you analyze multiple outcomes, analyze outcome data as they are measured and acknowledge that there is measurement uncertainty. We could integrate those measurement models with computational models to take advantage of this theory and reduce your model search space and reduce complexity. We would view these observed measures as arising from latent constructs or latent states. This is one of the principles of the RDoC. Multiple measures would have the same latent states, so there you could borrow information, borrow the correlation, borrow the structure to improve your learning efficiency.

Also and more importantly, we assume that the ITRs will target the underlying mental state instead of directly targeting the observed measures, which can be more noisy, so we can reduce some of the noise uncertainty when we target latent domains to bring our ITRs to be more reproducible.

To be a little more specific, this is the architecture we use. We are assuming that these observed outcomes, observed HAMD items, arise from latent mental states, Z: K binary mental states. You can think of them as activation of negative emotion. If there is one negative emotion node that is activated, associated negative symptoms are also activated. So the core assumptions are, first, treatment changes these underlying mental states so they could, for example, reduce the likelihood of activating Z1 after treatment; and also these latent states would depend on the external variables you want to use to tailor treatment.

But importantly, treatment does not change the measurement model. In other words, those paths you see connecting latent states to the observed symptoms are shared before treatment and after treatment. So, how we are measuring our mental disorders is not changed by the treatment but shows an essential internal core property of the measurement instrument. This we call the invariance structure.

Also we assume conditional independence, also stemming from measurement model theory. So, given the latent states, the observed symptoms, Y, are independent. You don’t see connections between those observed nodes. There, we use those measurement theories to reduce our model complexity, cut the parameters you have to estimate for your instrument in half, and also remove those edges that are unnecessary.

If we break this down a little bit, to learn these connections between latent states and observed symptoms we use this RBM model. They are probabilistic. They are also very interpretable because the weights can be interpreted as log odds ratios, just as in an adjacent-category model for modeling outcomes. But what is more flexible about RBM is that it can be used to model the multivariate joint distribution of high-dimensional observed ordinal nodes and also high-dimensional binary nodes, and there is an approximate learning algorithm that easily scales up.
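The sketch below illustrates two properties of a restricted Boltzmann machine (RBM) mentioned here: items are conditionally independent given the latent states, and each weight is a log odds ratio. For brevity it assumes binary symptom items rather than the ordinal HAMD items of the talk, and all parameter values are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def symptom_probs(z, weights, biases):
    """P(Y_j = 1 | Z = z) for each observed item j."""
    return [
        sigmoid(b + sum(w_jk * z_k for w_jk, z_k in zip(w_j, z)))
        for w_j, b in zip(weights, biases)
    ]

def joint_prob(y, z, weights, biases):
    """P(Y = y | Z = z): a product over items, since items are
    conditionally independent given the latent states."""
    p = 1.0
    for y_j, p_j in zip(y, symptom_probs(z, weights, biases)):
        p *= p_j if y_j == 1 else 1.0 - p_j
    return p

def odds(p):
    return p / (1.0 - p)

# Hypothetical parameters: 3 items (rows), 2 latent states (columns).
weights = [[1.2, 0.0], [0.8, 0.3], [0.0, 1.5]]
biases = [-1.0, -0.5, -2.0]

# Switching latent state 1 on multiplies the odds of item 0 by exp(1.2).
p_off = symptom_probs([0, 0], weights, biases)[0]
p_on = symptom_probs([1, 0], weights, biases)[0]
log_odds_ratio = math.log(odds(p_on) / odds(p_off))  # equals weights[0][0]
```

A zero weight, such as `weights[0][1]`, corresponds to an absent edge between that item and that latent state.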

However, in many of our applications we found that we don’t need the many latent nodes to describe major variation of (inaudible), so there we also view Z as a lower dimensional representation, reducing the model complexity. We can use this RBM model to borrow baseline measures from multiple studies to improve our generalizability.

In the middle, when we are going from Z-0 baseline to the after-treatment latent state, Z-1, we would use a nonparametric model to learn the ITRs for targeting these underlying after-treatment latent nodes. G is a known aggregation function. The latent states are never observed. Of those after-treatment potential outcomes, half are observed in a randomized trial.

We will first focus on trials in this talk. In those trials, half of them were observed so we can devise a transformation that takes us from Y to Z, while ensuring that we have the same conditional mean whether we based our learning on the latent mental states or based our learning on the potential outcome for the measures, Ys.

Also, by randomization we can replace those potential outcomes for Ys by the observed measures of Ys, even when conditioning on treatment. So there, we would not be subject to unmeasured confounding. Also, we extract the baseline latent states as the tailoring variable and put our algorithms into a weighted classification based on the previous approach to learn how to maximize those empirical value functions directly.

So you can think we are classifying patients into their optimal treatment instead of predicting their outcomes, which is not our direct objective. Our direct objective is to classify patients into their optimal treatment.
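To make the "classify into the optimal treatment" idea concrete, here is a minimal sketch that scores candidate rules by an inverse-probability-weighted estimate of their value and keeps the best one. It assumes a 1:1 randomized trial, a single tailoring variable and invented data; an exhaustive search over three threshold rules stands in for the weighted classifier described in the talk.

```python
def empirical_value(rule, data, p_treat=0.5):
    """IPW estimate of the mean outcome if everyone followed `rule`.

    data: list of (x, a, y) with a in {0, 1} assigned at random with
    P(a = 1) = p_treat; higher y is better.
    """
    total = 0.0
    for x, a, y in data:
        if rule(x) == a:
            prob = p_treat if a == 1 else 1.0 - p_treat
            total += y / prob
    return total / len(data)

def threshold_rule(c):
    """Treat (1) when the tailoring variable x exceeds c, else control (0)."""
    return lambda x: 1 if x > c else 0

# Toy trial (hypothetical numbers): treatment helps when x is high.
data = [
    (0.1, 1, 2.0), (0.2, 0, 5.0), (0.3, 1, 3.0), (0.4, 0, 6.0),
    (0.6, 1, 8.0), (0.7, 0, 4.0), (0.8, 1, 9.0), (0.9, 0, 3.0),
]
candidates = [0.0, 0.5, 1.0]   # treat all, treat if x > 0.5, treat none
values = {c: empirical_value(threshold_rule(c), data) for c in candidates}
best_c = max(values, key=values.get)
```

In this toy example the tailored rule has a higher estimated value than either treating everyone or treating no one. Note that the target is the treatment assignment itself, weighted by the outcome, rather than a prediction of the outcome.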

We applied this approach to the EMBARC study. The outcome measure is HAMD, and we are bringing in the STAR*D trial, also a depression study, a larger study with 3,800 patients. EMBARC has 240 patients.

RBM used BIC to select the dimensions of latent states. We selected three. The first latent state is characterized by sleep problems, the second by anxiety and core emotional symptoms, and the third by atypical symptoms including suicidal thoughts and so on. You can see that this RBM grouped patients into homogeneous groups: a mild group, a severe group, and also, for example, this group has severe impairment in the first two domains and less impairment in suicidality and those atypical symptoms.

They are quite interpretable, the subgroups; however, the bigger question is do they relate to treatment response. We used treatment responder status in evaluation and showed that these subgroups are highly predictive of treatment response. The average response rate is only 49 percent compared (inaudible). But on the subgroups we identified a much higher treatment response rate.

Interestingly, we identified a group of placebo responders and a group who never responded, the most severe group, so we need to do something better for them. So we extended this to account for some continuous measures in the imaging domain and so on.

I will just jump to the discussion. The main takeaway from this presentation is to look at how your outcomes and your tailoring variables are measured, accounting for measurement uncertainty and looking at the finer level of the outcomes.

Our next steps: one is to have a better analysis of biological measures in other modalities. There, RDoC would be more helpful in designing architecture for these theoretical properties, and also better design. A two-arm randomized trial is not ideal to learn many different options. In the HEAL study we looked at a factorial design looking at both psychotherapy and pharmacotherapy. And we are considering validating in observational studies and working with the CUIMC EHRs and also the Precision Medicine Initiative. So you see the real-world antidepressant prescription for these real-world PMI participants. There you immediately see many different combinations from this (inaudible) plot.

Last thing, I want to thank many colleagues and students and thank the funding agency and my program officer. Thank you all.

**DR. STUART**: Thank you so much. Really impressive work and clear clinical implications that are great to see. I think we will jump into some of the implications and things later.

I also want to just quickly note that someone in the Q&A had asked if there are existing packages, so it sounds like maybe you can put the GitHub links in the chat in response to that. Wonderful contribution.

**DR. WANG**: Thank you.

**DR. STUART**: Our third speaker in this morning’s session is Munmun De Choudhury. She is an Associate Professor in the School of Interactive Computing at Georgia Tech. Trained as a computer scientist, she has developed methods to use social media as a mechanism to understand mental health and to improve access to mental health care. She leads the Social Dynamics and Wellbeing Lab at Georgia Tech. The lab studies, analyzes and appropriates social media responsibly and ethically to derive computational large-scale, data-driven insights and to develop mechanisms and technologies for improving our wellbeing, particularly our mental health.

I will turn now to Munmun. Thank you.

**DR. DE CHOUDHURY**: Thank you very much, Liz. It’s my pleasure to be here today and it is an honor to have a chance to speak with you and share the research that we have been doing for a little while now. I also, like everybody else, deeply appreciate NIMH for organizing this workshop on statistical methods today.

We all know that we have a lot of digital data traces that are being left behind today by millions of individuals on different online and social media platforms. I would say over the past decade and a half, we have seen many new research directions emerge that have made significant strides into understanding human behavior broadly speaking.

To give you some examples of that, these data that we can collect from social media and similar platforms provide us new ways to measure our social interactions, our moods and emotions and our collective action. We have made many important contributions in the past decade and a half as a community to improve our understanding of a variety of different societal outcomes as well, and the most relevant of which for this conversation is mental health and wellbeing.

In the initial part of this talk I will share with you some of the highlights of the work that my collaborator and myself have been doing over the past few years in showcasing the computational use of social media data for mental health and how we can do that in an ethical and responsible manner and what that would mean from the perspective of services research.

I am going to spend large parts of the conversation today discussing the next frontier of this work, some of our ongoing work about how synergistic and participatory approaches that combine social science with domain expertise in mental health can allow us to fully realize the potential that social media data provides us and how that maximizes the benefits and reduces the risk of harm to all possible stakeholders involved.

Let me quickly highlight, using the social ecological model, a series of work that we have been doing to tell you how social media data can be helpful. We will focus on the individual, and in this work we focused on ongoing crisis events so in this case we are talking about the COVID-19 crisis.

One unique thing about crises and sort of the digital era and sort of the post-internet era is that, obviously, we get a lot of valuable information on the internet. But one significant challenge that I would say has been exacerbating over the last few years has been misleading information that circulates on social media platforms. While we are trying to get a good handle on the impact of a crisis directly on people in the context of COVID-19 -- that would be infections or deaths -- a considerable number of people are also being indirectly impacted as they are exposed to misinformation online.

So, what we were interested to study in this work was how exposure to misinformation -- and we were operationalizing that by way of people’s sharing behaviors on social media -- impacts people’s mental health. To answer that, we conducted a large-scale observational study based on a propensity score matching causal framework, and we found that people who shared COVID-19 misinformation on the Twitter platform experienced about two times additional increase in their anxiety level when compared with similar people who did not share COVID-19 misinformation on Twitter.
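As a rough illustration of matching on a propensity score, the sketch below pairs each "sharer" with the nearest "non-sharer" within a caliper and averages the paired differences in outcome change. The propensity scores are taken as given here (in practice they would be estimated from covariates), and every number, the caliper and the greedy 1:1 scheme are assumptions for illustration, not details of the study.

```python
def match_nearest(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the propensity score,
    without replacement. Each unit is (propensity, outcome_change)."""
    available = list(controls)
    pairs = []
    for t in sorted(treated, key=lambda u: u[0]):
        if not available:
            break
        c = min(available, key=lambda u: abs(u[0] - t[0]))
        if abs(c[0] - t[0]) <= caliper:
            pairs.append((t, c))
            available.remove(c)
    return pairs

def mean_matched_difference(pairs):
    """Average treated-minus-control difference in outcome change."""
    return sum(t[1] - c[1] for t, c in pairs) / len(pairs)

# Hypothetical units: (propensity score, change in an anxiety measure).
sharers = [(0.30, 0.20), (0.55, 0.30), (0.80, 0.25)]
non_sharers = [(0.28, 0.10), (0.50, 0.15), (0.57, 0.10), (0.95, 0.05)]

pairs = match_nearest(sharers, non_sharers)
effect = mean_matched_difference(pairs)
```

The third sharer has no comparable non-sharer within the caliper and is dropped, which is the usual trade-off in matching: better comparability at the cost of a smaller sample.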

What is even more interesting to us were the sociodemographic differences. We found that, for instance, people who identified to be female on Twitter, racially minoritized individuals, and people with lower levels of education experienced actually a disproportionately higher increase in their anxiety levels when compared to other individuals.

So this work allowed us to gather new evidence about the effects of misinformation around crisis events and, importantly, the effects from the perspective of mental health and how we need to think about crisis in a more holistic way that thinks about the direct as well as these indirect impacts that use of internet and social media platforms are having in these trying times.

Speaking of communities and also focusing on crisis events, here we are focusing on situated communities, and this was particularly college campuses, and we are looking at a different kind of crisis event; that is, gun violence incidents on college campuses. Our question of interest here was to see how social media data can allow us to develop new kinds of computational techniques that can allow us to both quantify and then use that quantification to understand stress responses after these gun violence incidents.

We focused on 12 universities that experienced a gun violence incident over a five-year period in the recent past, and then we looked at dedicated Reddit communities corresponding to each of these campuses. We developed an interrupted time series-based framework to compare the before and aftereffects of these events, and what we found is that in the aftermath of these events, expectedly, we saw amplified levels of stress, but that deviated from usual stress patterns that we saw on these campuses in other times of the year.
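A simplified sketch of the before-and-after comparison in an interrupted time series: fit a trend to the pre-event period, project it forward, and measure how far the observed post-event values sit above that projection. The weekly series is invented, and a full segmented regression (with level and slope changes and proper uncertainty estimates) would go further than this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def post_event_excess(series, event_index):
    """Mean gap between observed post-event values and the values
    projected from the pre-event trend."""
    pre_x = list(range(event_index))
    intercept, slope = fit_line(pre_x, series[:event_index])
    gaps = [
        series[t] - (intercept + slope * t)
        for t in range(event_index, len(series))
    ]
    return sum(gaps) / len(gaps)

# Hypothetical weekly share of posts expressing high stress;
# the event occurs at index 5.
stress = [0.10, 0.11, 0.12, 0.13, 0.14, 0.25, 0.24, 0.23]
excess = post_event_excess(stress, event_index=5)
```

Projecting the pre-event trend, rather than comparing raw averages, keeps an ordinary seasonal rise in stress from being mistaken for an effect of the event.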

We were also able to glean many meaningful attributes of these stress responses by looking at the language that people use on social media. We found that there were many unique linguistic changes that were evidenced in this post-gun violence incident here, and that included attributes like reduced cognition, higher self-preoccupation and death-related conversations.

In summary, this study allowed us to re-think our intervention and policy approaches that can bring timely help to these kinds of situated communities, and being more equipped to handle sort of the unique crisis that these communities face from time to time.

Finally, here we are again returning to the COVID-19 pandemic but now, using the social-ecological model, we are interested in population-level insights. One of the things that was clear from early on in the pandemic was that it caused so many disruptions in our personal and collective lives, but what we didn’t quite understand so well is the impact it was having on mental health, whether it was due to quarantining, the uncertainty of the pandemic and so on.

Here we looked at Twitter and we wanted to gather population-level insights into people’s psychosocial concerns. We again conducted an observational study and looked at several different psychosocial expressions on Twitter such as depression, anxiety, suicidal ideation and stress. What we found was that, across the board, all of these measures showed significant increases during the crisis, and this was the first 18 months of the pandemic.

But what was more interesting to us is the fact that there was actually a steady decline and eventual plateauing of these effects, which we conjectured and hypothesized to indicate that people were settling into some kind of a new normal. But when we looked at the language analysis and language attributes that were associated with these increases and then plateauing of these concerns, we saw how the conversations on social media and these psychosocial concerns shifted from personal and professional challenges, healthcare access, pandemic-related concerns to sort of dealing and coping with this sort of new life that we have had to lead for the last 18 months.

What the study shows is the potential that could be brought in by social media data for public health purposes in order to equip mental health stakeholders and policymakers so that they can better plan and implement measures to sort of deal with this twin crisis that we are experiencing parallel to COVID-19, a crisis that is mental health, which is likely to persist for longer than the actual pandemic itself.

You might be wondering, this has a lot of potential, but the question is what is the next step in this line of work and are we ready for deployment; if not, what is really preventing us from pursuing those directions?

I would note that there are three issues that we need to pay attention to before we can realize the potential of social media data and these kinds of algorithmic insights in the context of service, treatment and prevention. These are agency and power that surround patients and other mental health stakeholders when algorithms are made a part of the paradigm; unintended negative consequences that we can have because we are appropriating a source of data that was not created to be used for mental health purposes; and of course questions of ethics about how do you secure the privacy of patients, how do you ensure that there are no ruptures in the therapeutic relationship and so on.

To address that, I am going to present a theoretical lens we can adopt that can allow us to navigate these questions as a part of the research process itself instead of as an afterthought.

This particular lens is called an action research framework. It is popular in the qualitative social science field. It is essentially a methodology that kind of seeks to have transformative change, and it involves the simultaneous process of taking action and doing research, and these two are linked together by critical reflection.

In our work I am going to showcase a case study where we have adopted it as a way to collaborate on a shared goal and shared mission between our team and our partnering organization towards a broader mental health outcome.

Specifically, I am talking about collaborative work that is still ongoing. It has been going on since 2017, and it is called the THRIVE project. It is in partnership with Northwell Health, and we are designing and deploying various kinds of clinical tools and potentially technologies that can influence clinical decision-making that are powered by patient social media data. And our idea is to see if these data can lead to improved treatment outcomes.

We are looking at many questions within this project but one question I will highlight here is the question of relapse. I don’t need to introduce that question as much for this audience, but we all know that relapse is a huge issue for individuals with schizophrenia spectrum disorders because a lot of people, even when on a treatment, relapse over a few years’ time. So clinicians always emphasize that there is a need for early identification of indicators of relapse, but that currently is incredibly hard. Clinicians only get to know about relapses after the relapses have actually happened and a patient has been hospitalized.

So to change the status and to see if social media could give us a way to find these early indicators of relapse, we are reporting in this work an analysis of a little over 100 patients, one-half of whom had a relapse and hospitalization, and these individuals shared with us their archives of Facebook data.

Let me tell you a bit about the modeling considerations here. Supervised machine learning approaches were not appropriate for our work because relapse events are multifactorial; they manifest in many different ways in many different people, they have clinical heterogeneity. And the other challenge was that it is a fairly rare event so a lot of the standard machine learning techniques were not suitable here.

To counteract this reality, we modeled the relapse prediction problem as sort of an anomaly detection problem where we were looking at a single patient’s Facebook data and we were looking for aberrations in their behavior that would have clinical meaning. So we developed this anomaly detection framework which was a one-class support vector machine model here, and I am going to report a couple of performance measures.
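The general idea of looking for aberrations relative to a single patient's own history can be sketched with a much simpler detector than the one used in the study. The example below flags time points by z-score against the patient's own baseline; it is a plainly named stand-in for the one-class support vector machine, and the feature, baseline length, threshold and counts are all hypothetical.

```python
import statistics

def flag_anomalies(values, baseline_len, threshold=3.0):
    """Flag time points that deviate from the patient's own baseline.

    The first `baseline_len` values define the baseline mean and
    standard deviation; later points whose z-score magnitude exceeds
    `threshold` are returned as (index, z) pairs.
    """
    baseline = values[:baseline_len]
    mu = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    if sd == 0:
        raise ValueError("baseline has no variation")
    flags = []
    for i in range(baseline_len, len(values)):
        z = (values[i] - mu) / sd
        if abs(z) > threshold:
            flags.append((i, z))
    return flags

# Hypothetical monthly counts of one patient's posts.
posts = [20, 22, 19, 21, 20, 23, 21, 4, 22]
flags = flag_anomalies(posts, baseline_len=6)
```

Like the one-class approach, this needs no labeled relapse examples: it learns only what is typical for the individual and reports departures from it, which suits an outcome that is both rare and heterogeneous.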

We found that we do pretty well in terms of specificity. Our model detects relapses correctly to be relapse for about 79 percent of the cases, but where we don’t do very well is the measure of sensitivity. We find that for a lot of cases our model thinks that certain periods in a person’s trajectory are relapses but in our ground truth data they are not reported as one.

So we conducted a chart review to investigate this problem and we found that for a vast majority of these periods incorrectly identified as relapses, there were actually notes left by the clinician at the time of the appointment that the patients had exacerbated psychotic symptoms, but at that time the clinician didn’t deem them severe enough to warrant hospitalization. But because we were looking at the patient’s data from Facebook we were able to pick up some of these early warning indicators of exacerbated symptoms and that is why our model performed the way it has.

The initial research and the case study on relapse that I talked about showcases one way that we can adopt this participatory and iterative approach that involves computer scientists and statisticians and mental health clinicians and clinical researchers in a collaborative way and that we can now appropriate these big data from social media, and we can support hopefully a better assessment of mental health concerns.

But beyond those specific issues and specific case studies that I discussed, there are many questions that still we need to address in order to truly see the value of social media in services research. Some of those issues that we need to address are things like social media was not created or didn’t start as a source of health data, so how do we ensure we have sufficient algorithmic performance on social media to support real-world clinical use?

We know that these algorithms are never going to be perfect, so the question is how do we support graceful failures when these models of social media data actually can stand up to potential use cases. And there are questions of doing no harm and ensuring goals of social justice when we deploy them.

That brings me to the end of this conversation. Thank you very much for listening. I would like to thank all my students and collaborators and sponsors including NIMH for this work. I can take questions at the end of the session.

**DR. STUART**: Thank you so much, really interesting work, and we saw a nice spectrum of types of studies this morning. As someone who studies causal inference and non-experimental studies, I was particularly happy to see the use of some of those methods within this context.

We now have about 20 minutes left. Munmun, by the way, there are some questions in the Q&A for you that you can try to respond to by chat, or else we will hopefully have time to talk about some of them after our discussant.

I am happy to turn now to our discussant, Ben Le Cook. Ben is Director of the Health Equity Research Lab at the Cambridge Health Alliance and Associate Professor in the Department of Psychiatry at Harvard Medical School and visiting clinical Assistant Professor at the Albert Einstein College of Medicine in the Bronx. Dr. Cook is a health services researcher focused on improving quality-of-life and access to quality of treatment for individuals living with mental illness and substance use disorders.

Particularly relevant today, as I mentioned earlier, he currently serves as Chair of the Mental Health Services Research Committee for NIMH. Again, I was thrilled when Ben agreed to participate as a discussant because he provides a really good perspective of someone who deeply understands mental health services research and deeply supports the need for high-quality and rigorous statistical methods. I will turn it over to Ben now.

**DR. COOK**: Thanks for the opportunity to discuss these really incredibly important projects from three of our field’s preeminent mental health services quantitative scholars, so it is a pleasure to be here.

I just have a few minutes so I am going to highlight the areas that struck me as being really important to the field and then I might make a few broad comments related to how these tools can be used to improve patient care and enhance equity in mental health treatment.

Dr. Wall’s presentation is a great starter to this session, as Liz mentioned, given the importance of the role of advanced statistical methods in learning health systems, and I think this is an idea that is so necessary. I think we have done a lot of work envisioning what learning health systems might look like, and the work that they have done at Columbia and OnTrack data in New York is starting to create the learning health system. It is starting to put it into place.

We know we have this extraordinary amount of data at our fingertips and this comes with really high potential but also the possibility of unintended consequences -- Munmun started to get at this and I think that is so important -- if we are not rigorous and thoughtful on the implementation of these methods, if we don’t do this kind of work across disciplines. And Dr. Wall’s work is a really good demonstration of how to do this with prediction modeling, how it can support treatment decision-making and practice, and so I think it is just so significant.

I want to also applaud Dr. Wall’s team for focusing on educational and occupational outcomes. I really appreciate the patient-centeredness of those outcomes. As we work and move from population reporting to individual prediction, those are so important.

Her R-Shiny app is so innovative. This is going to allow end users to explore multiple outputs and multiple time periods and allows for exploration of the saliency of all of these different model inputs that are going into the models. It’s really innovative. And AI models, I think in order to be relevant for clinicians and to be relevant for patients, they need to come out of their black boxes, and this is a good example of doing that.

I would have liked to see some confidence intervals on the predictions. I’m sure you can do that. I think especially when you are in groups -- and this is often vulnerable groups -- the confidence intervals get really wide, and I think that will be important to know the kind of reliability that you have, and the width of those confidence intervals is going to be important.
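The point about intervals widening in smaller groups can be sketched with a Wilson score interval for a proportion. This is a simplification: the predictions discussed here come from a fitted model, not a raw proportion, and the rate of 0.3 and the group sizes below are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Same observed rate (30%) in groups of very different size (hypothetical).
for n in (20, 200, 2000):
    low, high = wilson_interval(round(0.3 * n), n)
    print(f"n={n:5d}  interval=({low:.3f}, {high:.3f})  width={high - low:.3f}")
```

With the observed rate held fixed, the interval narrows as the group grows, which is why reporting the width alongside a subgroup prediction indicates how much to rely on it.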

My suggestion is actually in the last slide that Melanie had in terms of deployment. First is the tool’s potential for reducing or exacerbating racial and ethnic disparities. Here I want to think about how these models are going to be used when we allocate treatment resources. I listed two papers here that are an example of some of the recent papers that represent these two thorny issues that we will have to overcome in implementation.

There’s this recent JAMA paper by Rebecca Coley and colleagues which demonstrates poor performance for suicide risk prediction modeling using a huge healthcare system dataset, and it still shows poor performance for black and American Indian and Alaska Native patients. And what’s relevant and highly predictive for majority groups doesn’t seem to be as meaningful for black and American Indian and Alaska Native patients, those more vulnerable groups.

The second paper is this Obermeyer and colleagues Science paper that many of you have likely seen that shows how prediction models relying on prior utilization data actually reduce the number of black patients who needed extra treatment and were identified for that extra care. And the key issue here is that this type of prediction modeling can -- I mean, we are going to be using prediction modeling. We are going to use it to allocate resources, we’re going to use it to identify high-risk patients, and if those models lack relevant data for certain racial and ethnic groups, certain linguistic groups, or there are large amounts of missing data for certain vulnerable groups, then we are going to have decision-making about wraparound services, about intensive resources that may continue to benefit majority groups that historically already have greater access to those resources.

One more suggestion that I will put forward is I want to really second this idea of incorporating provider feedback and patient feedback into how the team will use prediction for clinical practice. The clinicians that I work with at Cambridge Health Alliance have let us know they already have a lot of alert fatigue from the EHR. They are already really reticent to use these tools as part of their decision-making when they are in front of a patient.

The patients we work with have real questions about the source of the data, the privacy of the data, why all of a sudden there’s all this data that they didn’t know the clinician had that is being used to predict their higher risk. So we really need collaborative community-engaged work here with patients and with providers to make sure these predictions translate into clinical practice.

Dr. Wang’s work on optimizing treatment strategies for mental health disorders -- I want to say that Dr. Wang makes really an excellent case for why machine learning adds value to these traditional methods for identifying who should receive tailored treatments. I love how she boils down an individualized treatment rule into an objective function that can be optimized. This turns an incredibly complex problem into a solvable mathematical problem with clear inputs, with assumptions clearly laid out. That is just so important.
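One common way to write an individualized treatment rule as an objective function is sketched below. The notation is an assumption for illustration and is not necessarily the formulation used in Dr. Wang's talk: $X$ denotes patient characteristics, $A \in \{-1, 1\}$ the treatment received, $Y$ an outcome where larger is better, $\pi(A \mid X)$ the probability of receiving treatment $A$ given $X$ (known in a randomized trial), and $d$ a rule mapping $X$ to a treatment.

$$
V(d) = \mathbb{E}\left[ \frac{Y \, \mathbf{1}\{A = d(X)\}}{\pi(A \mid X)} \right],
\qquad
d^{*} = \arg\max_{d} V(d)
$$

Here $V(d)$ is the expected outcome if everyone were treated according to $d$. Under the usual assumptions (consistency, no unmeasured confounding, and every treatment having positive probability for every patient), it can be estimated from trial data by inverse probability weighting, and the optimal rule $d^{*}$ is the one that maximizes it.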

It is really exciting also to think about simultaneously modeling multiple outcomes, multiple unobserved latent states. This is cutting edge structural equation modeling that she’s doing there, and that is really what’s going to be needed given the complexity of the heterogeneity of treatment effects.

The latent groups in the EMBARC SSRI RCT trial, identifying those was really interesting to me. The model results suggest it can increase positive responses by 6 percent to 14 percent among certain groups. Like Yuanjia mentioned a little bit, I am really interested in placebo effects here. If I’m getting the data right, there are placebo effects that appear to increase dramatically some of the positive returns on placebo when you target the right individuals who are kind of most amenable to receiving placebo. That’s very interesting, the high levels of impact that a placebo can have, and it seems worth exploring.

The other area of real interest to me is the estimation of latent states and what are those latent states. These are groups that are going to receive tailored intervention, so I would recommend spending a lot of resources and energy identifying who those patients are in those latent groups. I would propose qualitatively understanding those groups through follow-up with patients in those states and expert panels of patient representatives and providers to pore over that data to try and understand similarities and differences within and between groups.

And then more work can be done to quantitatively characterize those groups through looking at their notes, doing NLP scraping of those notes, sentiment analyses of their visit audio recordings. I have done a little bit of this work in kind of naming latent variables and I find it to be really reductive, so I would encourage that kind of intense work on understanding those latent categories.

Much of the clinical benefits of these methods stem from the correct identification of the constellation of variables and patient characteristics and symptoms and dispositions, et cetera, that underlie each latent state, so I really propose that we dig in here to better understand and label all the latent states and adapt models based on those results.

Let me turn last to Dr. De Choudhury’s work on social media. There is a lot of speculation and a lot of expertise around dinner tables, including mine, about the effects of social media on mental health, and her work is really some of the best I have seen in terms of rigorously quantifying social media and its association with mental illness and trying to predict mental illness. That is really important as it is. Her focus on misinformation is really significant. And now she is turning to this harder work, in some ways, moving from those descriptive studies to prevention and interventions, which is really important.

I was also interested to see the work in predicting relapse in schizophrenia using Facebook posts. These anomaly prediction methods that she used are important for rare outcomes, and that was really appropriate and important. I think she might agree that we need more data here, and thinking about applying some of those methods to the millions of observations that we have in EPINET and MHRN sites I think would go a long way to improving those methods. Understandably, though, how do you get Facebook data, how do you get all of the social media data linked to those patients is going to be a real issue.

I was also really struck in the error analysis part of the presentation by the importance of the availability of the patient medical record. If I’m reading the slide right, I think when you have the patient medical record the prediction accuracy goes up. Maybe that is not a surprise, that when you have that data with some historic or electronic medical record data it becomes more predictive, and those clinical assessments end up being really important.

I think that leads to a point that applies to a lot of this work. While we continue to advance our statistical methods to improve mental healthcare, we need to do this kind of old work of increasing outreach, community awareness of services and trust-building to bring people into the learning health system, so more reliable information can be collected, and we get these provider-based evaluations into the data that we use.

So, once in the system there needs to be more time, more resources and understanding what was said by patients in the visit. These are the precious informative times where we get really good data, and so accessing treatment is doubly important here. Then we can understand what happened. I would bring in the perspective of the therapists, ask therapists to review this data in a more intense way, and then use the big data analytics that we’re talking about to disentangle and prioritize all of this evidence.

And then kind of a last tangent here is that the business side of mental health treatment in the US is really pushing therapists towards increasing relative value units, increasing numbers of patients, increasing numbers of visits, as opposed to improving the quality and the understanding of the visit that they have. So, teaming up these kind of new statistical methods with a higher intensity of reflection by therapists and patients, and more information-sharing from trained therapists. I think that is really needed.

It also would be more consistent with the commitment that’s happening across the country to having health systems provide accountable care and value-based treatment. There is kind of a disconnect between some of that accountable care work and the value-based treatment and the amount of time we’re spending on patients and patient data, and I think there is a nexus there that’s important.

I will stop there. I want to really commend Drs. Wall and Wang and De Choudhury for their work. These are really crucial, important and interesting projects. Thanks for letting me discuss them.

**DR. STUART**: Thank you so much. Just as I had hoped, you did a great job bringing the practical applications of the methods and thinking through what does this look like for clinical or public health practice.

We have about eight minutes left in the session. I am a stickler for ending on time and giving people a break. The speakers have been doing a great job responding to questions in the Q&A, so I have a couple things teed up if needed, but why don’t we first give each of you a chance to respond to any of Dr. Cook’s comments. We can go in order and start with Melanie.

**DR. WALL**: Thanks, Ben, for summarizing so well. You highlighted absolutely one -- I forgot to say the anecdote about what happened the first time we showed the R-Shiny tool to the clinicians, but it was just like you said. They were like, what? You want more predictions? More things you want us to show? It was exactly like you said. First we had something much more elaborate, and then we really scaled it back to have something simpler.

But more importantly, one of the pushbacks was why do you think a prediction would even be useful. We had started this whole project thinking that somehow with more information that would be somehow better, and with an algorithm that hopefully you could prove in some way mathematically did a good job of predicting outcomes, and that information would be useful to them. But we have a lot of work to do now with our partners to figure out maybe this is not useful to them, or somehow convincing them that it could be useful.

I really think that is our next big challenge, is to take these tools and try to help people see that it could be used -- Or maybe we are wrong. Maybe they are not useful. I don’t know. But your reaction just made me think I needed to tell that anecdote because we were all happy. Statisticians get happy looking at high AUCs and seeing predictive values that look really good, but the clinicians were not like that at all. They were like, why do I care. How is this going to be helpful? Also, there’s skepticism that an algorithm would be able to do well at predicting.

And honestly, that is kind of where we are right now. We are planning meetings now to do focus groups to try to show this to different groups of clinicians. We have a really strong qualitative team as part of EPINET who are helping us, and they also believe that there’s a role for data in a learning health system. But I think convincing the consumer of that is something I certainly need to build more muscle for how to do that well, because I presently feel like I am in an area I am not so competent with. That’s the one thing I wanted to say.

**DR. STUART**: Actually, there’s a little exchange in the Q&A about the communication between statisticians and applied researchers and clinicians and others, and I’m hoping that will be a theme throughout the day and maybe something we can come back to at the end of the day.

Thank you. Yuanjia, do you want to respond?

**DR. WANG**: Sure. Thank you, Ben, for summarizing my talk. I really love your idea of using some qualitative research to try to understand those latent states better. That is such a great idea. I remember when we did some more traditional factor analysis with the clinicians, they really liked to label the latent groups they found. They really enjoyed that. So I think if we can do some of this qualitative research trying to understand the subgroups it would be really useful.

Also, I struggled a little with explaining to them about machine learning algorithms. They really appreciate interpretability, so something that they can understand better. They like tree models, random forest importance measures, and they like those interpretable latent subgroups. So that is my experience.

Also about placebo responders, we also identified a subgroup who did not respond. For the most severe group their response rate is 38 percent, so clearly, there is a subgroup of patients where one cycle of antidepressant is not enough. That would make a case for learning dynamic treatment regimes. What do you do for that group of patients who didn’t respond for one cycle? Next step?

A lot of studies can be used for that type of research.

**DR. STUART**: Great, thank you. Munmun, anything you would like to respond to?

**DR. DE CHOUDHURY**: Thank you, Ben, for the nice discussion. Indeed, the two points that you raised for some of the work I discussed have been questions that we have been wrestling with for the past few years as well.

The first point you had is a really valid point about data size. The irony here is that, as a computer scientist who also talks and hangs out inside computer scientist circles, the data size is like a problem of the 1990s. In those communities, these are no longer real problems because you have tons of data from the internet.

But when I come to these circles, data is still an issue. High-quality data -- It’s challenging because you do have powerful methods like anomaly detection approaches, like deep learning. All of those, however, need access to tons of data to provide the level of performance that we see them providing in other contexts.

So this is a challenge we are going to have to continue to wrestle with from a statistical methodologic perspective. Yes, there are methods that exist; yes, they can be helpful to us. The question is how do we get data. And it is not just data; it’s data that is high-quality data that obviously is sensitive to the privacy expectations and other ethical concerns around it because a lot of these data are not data that are coming from, let’s say, the medical records or data that were coming from systems that do collect health information. They are coming from these digital traces on the internet and on social media, from text messages, from smartphones and whatnot. This is something that we think about.

One of the ways to circumvent that issue leads to the second point you raised which is about combining different kinds of data together. There is a real possibility for doing some fantastic stuff by looking at these datasets together in concert. We did that for some work that I didn’t cover in my presentation on forecasting suicide fatalities on the national level in collaboration with the CDC, and there we have combined health services data with social media data and we saw a tremendous boost in performance.

So, when it comes to services research, thinking about impacting treatment at the individual level, I think we need to start thinking about what other types of data that already exist out there can be used, and hopefully we can cover multiple facets of a person’s life of what is going on, what the clinician has observed, and how do we put those pieces of information together at the same time also overcoming some of the challenges of the scale of the data.

**DR. STUART**: Thank you so much. I think great points to end on. I think we have kicked the day off very well. Again, I am hoping that many of the themes from this morning will come back. I made some notes of things I want to come back to in the panel at the end of the day.

Abera, did you want to say anything before we take a break?

**DR. WOUHIB**: I was wondering if you have seen the Q&A. I saw a couple of questions here.

**DR. STUART**: I do want to end on time so people can get a break, so I think maybe the Q&A questions we can try to respond to by chat or later in the day if that’s okay.

**DR. WOUHIB**: That sounds good. We will take a break until 11:00 o’clock for the second session. Thank you very much. This was a wonderful session. This is a good way of starting today, and we look forward to the remaining sessions.

(Break)

**Agenda Item: SESSION II: Statistical Methods for Generating Reliable and Reproducible Findings from Neuroimaging Data**

**DR. WOUHIB**: Hello. I think it is 11 Eastern time, and we are moving to the second session. I would like to introduce our moderator for the second session, Dr. Ying Guo, professor of biostatistics at Emory University. Our session is titled Statistical Methods for Generating Reliable and Reproducible Findings from Neuroimaging Data.

This session is packed with three great speakers and a discussant. Take it away, Ying.

**DR. GUO**: Thank you. Welcome everyone to the second session of the workshop. My name is Ying Guo, and I'm professor of biostatistics at Emory University, where I direct the Center for Biomedical Imaging Statistics.

I'd like to first thank NIMH for organizing this workshop, which provides a very helpful forum for us to discuss these important challenges in mental health studies and how we can use statistical measures to solve these questions. And as the director, Dr. Gordon, has mentioned in today's opening remarks, in recent years there has been a significant amount of interest and emphasis from NIMH, as well as the research community, to improve the rigor in mental health studies. So this need is especially strong for neuroimaging studies, given the well-known complexity and challenges in imaging data. An important avenue for us to achieve that goal is through the development and implementation of robust and advanced statistical methods.

Today we have gathered a panel of leading experts in the field to share with us their perspectives and some cutting-edge tools they have developed to help us move towards more reproducible findings. We will have three speakers today, Dr. Bin Yu, from UC Berkeley, Dr. Thomas Nichols from the University of Oxford, and Dr. Martin Lindquist from Johns Hopkins University. Our panel discussant today is Dr. Todd Ogden from Columbia University.

So if you have any questions, please type in the Q&A sessions, and we will try to relay some of the questions in the end of the session, and also the speakers, please feel free to type in your answers during the session.

Without further ado, let me introduce our first speaker, Dr. Bin Yu. It's my great honor and pleasure to introduce Dr. Bin Yu today. She is the Chancellor's Distinguished Professor and Class of 1936 Second Chair in the departments of statistics and EECS at UC Berkeley. She was formally trained as a statistician, but her research extends beyond that. She has led her research group in developing novel statistical machine learning approaches to address important scientific problems in various domains, including neuroscience, genomics, and precision medicine.

Dr. Yu has received numerous prestigious awards and recognitions during her career. She's a member of the U.S. National Academy of Sciences and of the American Academy of Arts and Sciences. She's the past president of the Institute of Mathematical Statistics, a Guggenheim Fellow, and Tukey Memorial Lecturer of the Bernoulli Society. She's also the winner of the COPSS E.L. Scott prize.

Without further ado, I will pass it to Dr. Bin Yu.

**DR. YU**: Thank you, Ying, very much for the very kind introduction and for having me here. This is very exciting for me because NIH is a pretty new community for me, and let me just first say something about Weill Neurohub, which is now part of this community. It's a consortium between UC Berkeley, UCSF, and UW. So I hope you guys will get familiar with that.

So what I will talk about today is really sharing a framework my group has developed over the last 10 years, with application for more basic neuroscience. I think all of us here have been working on biomedical problems, and they are very important. And there's a bigger umbrella now about all these data science work we do, statistical work we do, is kind of under AI. And Bill Gates said in 2019 that AI is like nuclear energy, both promising and dangerous. So I think we should realize the potential at the same time mitigate the dangers.

Data science is a key element of AI, because under the hood basically we're doing data work of traditional statistics.

I'd like to get us thinking about the data science as a lifecycle, not just like modeling. We have to have this holistic view of the whole system from problem formulation, data cleaning, and visualization, at every step as an integrated system, and we need the quality control and standardization process so that we can mitigate the dangers of AI or data science.

So we define veridical data science as extracting reliable, reproducible information from data, with an enriched technical language to communicate and evaluate empirical evidence in the context of human decision and domain knowledge. So my philosophy is very much focused on solving problems. Methods, theory, domain knowledge, serve the same purpose. I like to think we take on the role of all three instead of just developing methods.

For the rest of the talk, I will introduce a framework called PCS, standing for predictability, computability, and stability, for veridical data science. Veridical means truthful. And also the example, actually one of the motivating case studies, for understanding V4 neurons.

So there were different talks last session on machine learning. Leo Breiman 20 years ago really contrasted two cultures, really brought our attention to the differences and similarities. So PCS is really following Leo Breiman for this task to really instead of contrasting, actually integrating the two fields in a most productive way.

So the PCS paper came out with my former student, Karl Kumbier, now at UCSF as postdoc, to really integrate predictability which is from machine learning and statistics and computability at the heart of machine learning, and also stability, as expansion, significant expansion, on the concept of uncertainty assessment in statistics, and integrate them under this new framework called PCS.

Many of us will recognize the ingredients of this framework, because we really tried to share the best practices and ideas and then build a unified platform to push forward. PCS in a nutshell is really emphasizing predictability for reality check. This should be the first thing we worry about. Any model, it doesn't matter where you come from, which philosophy, we have to have some reality check to make things comparable, and stability expands statistical inference in the most specialized case when you only have to worry about sample variability to the entire data science lifecycle. Especially we want to evaluate the changes because of the human judgment calls we make in the entire process.

So you can think of this as shaking the system, if you think of lifecycle as a hardware, so that the hardware doesn't break, and the perturbation should be defined and documented by the practitioner using domain knowledge, and you define appropriate stability metric to measure. It's very broad and conceptual and flexible, including bootstrap, like more traditional accepted perturbation under of course assumptions (indiscernible).
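As a minimal sketch of "shaking the system", the code below applies one kind of perturbation, the bootstrap mentioned above, and reports how much an estimate moves across perturbed copies of the data. The function names, the choice of the mean as the estimator, and the use of the standard deviation as the stability metric are all hypothetical; in the PCS framing the practitioner would define and document perturbations and metrics using domain knowledge.

```python
import random
from statistics import mean, stdev


def bootstrap_estimates(data, estimator, n_boot=500, seed=0):
    """Re-estimate on n_boot resamples of the data drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]


def stability_summary(data, estimator, n_boot=500, seed=0):
    """Summarize how much the estimate varies under the bootstrap perturbation."""
    estimates = bootstrap_estimates(data, estimator, n_boot=n_boot, seed=seed)
    return {
        "estimate": estimator(data),
        "spread": stdev(estimates),  # hypothetical stability metric
        "min": min(estimates),
        "max": max(estimates),
    }


# Hypothetical measurements; any estimator taking a list could be plugged in.
measurements = [2.1, 2.4, 1.9, 2.8, 2.2, 3.5, 2.0, 2.6, 2.3, 2.7]
print(stability_summary(measurements, mean))
```

Other perturbations named in the talk, such as alternative data cleaning choices or alternative methods, could be summarized the same way by replacing the resampling step with the relevant variation.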

So I've been working with quite a few doctors from UCSF, and this particular doctor, I asked him why do you like PCS, and he answered, I think, in a way which was very insightful. He said, the PCS framework builds a working relationship between data and the clinical world. PCS is a look under the hood to ensure that the conclusions found are what the data genuinely suggest. In all, PCS is a holistic approach to helping the clinician understand, interpret, and build the science we need to help our patients.

So I actually used PCS for a stress test for some existing clinical decision rule called PECARN, and it's really looking under the hood, and we want the process not as messy and we need to make it reliable on the left, but that's too ideal, but not on the right.

So the first thing, as in the previous session people already talked about you can choose different data, and you might end up with different conclusions, for this particular audience, right? Can you choose different imaging data? Would you reach different conclusions? Would they answer the same questions? So that's a huge, I would say, perturbation to the process, and humans are making judgment calls, and we need to discuss these issues instead of, well, that's the data I have; therefore, what I get must be truth.

The other thing is data processing or cleaning. I didn't mean that you just sit at a computer and say this looks good. You really have to do shoe leather work and to really understand the process. MRI, we're in discussion with the Neurohub people about maybe calibrating the MRI images from UCSF and UW, and from the people I work with, it's not clear whether if you don't do any calibration, things will turn out the same.

So this is just starting from the raw data, tuning the machine, we can create different versions of the data. Maybe we should keep multiple versions, depending on different processes, and really judge later whether we reach different scientific conclusions. I mean, qualitatively different conclusions.

And data split, to do machine learning and also in the PCS framework, you want to assess the reality check, that's whether your model at least captures some realities through prediction assessment, and how do you split? Do you do random split, want things symmetric and vector dependent? It's a good idea. And maybe in a medical situation, where you always think about the future patient, you should do a time split. So it's unclear, or we should discuss whether the random split used a lot in machine learning is the right thing to do in the health context. I don't think so. I think we should do time split.
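To make the contrast between a random split and a time split concrete, here is a minimal sketch. The record layout, the `visit_date` field, and the function names are assumptions made for illustration; they do not come from the talk.

```python
import random


def random_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and hold out a random fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def time_split(records, test_fraction=0.2, time_key="visit_date"):
    """Train on the earliest records and test on the most recent ones."""
    ordered = sorted(records, key=lambda r: r[time_key])
    n_test = int(round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical patient records; ISO date strings sort chronologically.
months = [5, 1, 9, 3, 12, 7, 2, 11, 4, 8]
patients = [{"id": i, "visit_date": f"2020-{m:02d}-01"}
            for i, m in enumerate(months)]

train, test = time_split(patients)
print([r["visit_date"] for r in test])
```

With the time split, every test record is later than every training record, which mimics predicting for a future patient. The random split gives no such guarantee.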

And data perturbations. This is a well-known example of how deep learning can be adversarially attacked, and you get different medical diagnoses.

We have so many -- it's great that we have so many different methods, even just for supervised learning, and different people prefer different algorithms. I'd really like to advocate for multi-method researchers, instead of different groups each using only one particular method. They all have their limitations, but we should really branch out and try different methods, which offer as much variability as possible, instead of different methods only used traditionally where they're comparable. Because science shouldn't depend on the method we choose.

And researcher to researcher perturbation. For climate studies, the next talk by Tom definitely will address this question, but also in climate study we often see plots like this. We have nine different climate models for global mean temperature change, and it ranges from 1.5 to 5.5. So all of these are human judgment calls, and we have to use documentation to record these judgment calls.

Reality is outside our mind, and the models are mental constructs. They don't have to connect. It looks like a bridge, that's why. But we have to do the work to really show there is a bridge, instead of it just looking like there's a bridge. We can put them together, and we need good quantitative and qualitative narratives, interpretable results, and to document it and make a case for why we made certain judgment calls, so that the result will be as trustworthy as we can make it. Otherwise, we're not on solid ground.

Now taking this more expanded view of data science and look at statistical inference. It's still very important, but I am of the view that p-value should not drive a decision whether, say, a follow-up experiment should be done. It's part of the evidence, and it's our job to make the evidence as transparent as possible and through interpretable models and we should have ways to evaluate the strengths, say, from different model, different groups. That's part of what we should do and present it as part of the decision-making instead of the sole decisionmaker.

If we were working on the frontier for PCS, working on PCS inference, we have one paper called epiTree detecting epistasis, it is on my website, but it's still ongoing. It's really first -- we first have to do model checking. You look at traditional -- actually we use the same training -- at least we're thinking about diagnosis before we do testing. But now, that step has been kind of neglected, because it's difficult.

We at least should really show there's signal in the data. You have to do something better than random guessing before we worry about p-values. That step is not done often, testing model biases. And then we can put different steps together through these perturbation intervals that we can evaluate through test data under idealized conditions, you know, get the confidence level. So this is very much still being developed. And we can say that PCS inference is already practiced by climate scientists, because they are already showing these nine different models and giving you an interval, and it's a lot more realistic than just giving you one model.

For the second part, I will quickly go through one of the motivating examples. The paper has been revised for the last few years, and still we put it on bioRxiv like three years ago now, led by Reza, my UCSF student, now faculty at UCSF, Yuansi now faculty at Duke, with my long-term collaborator Jack Gallant's group. So this is really trying to understand a very difficult, elusive area in the human visual "what" pathway. We try to figure out what we're looking at. V1 is like (indiscernible) detection, and V4 has been very elusive and unsafe. On the other hand, deep learning has really been catching up and bringing a lot of attention to kind of pseudo kind of neural network like supervised machine learning, and unsupervised.

So Hubel and Wiesel figured out, through very ingeniously designed experiments, putting bars and showing that the neurons really care about the different orientation, different frequency, so that's how they figured out V1. V4 has long been challenging. People have made very complex beautiful geometric shapes to figure out what V4 cares about.

So with this methodology, we might miss a lot of patterns that V4 cares for but that we didn't think to design. So instead we turned the tables and tried to use natural images to develop models, and use the model to design the next-step experiment. So many, many PCS framework we have done has really tried across scientific machine learning or recommendation system to really help scientists to do better design for the next step experiments or validation.

So the questions we faced were: how do we characterize V4 neurons? And can we generate data-driven hypotheses instead of designing these beautiful geometric shapes, with which we might miss important patterns? And then try to connect V4 with new neural networks.

So we had earlier work, we designed our own neural network. So now I'll talk about that's why of the benchmarks, we got the state of art performance in terms of prediction relative to V1 performance. Later, we moved to deep learning, just to see how it works with our own other work.

So what we did is we took AlexNet, which was trained by other people on ImageNet, about 1 million images, 1,000 categories, and took it as a feature extractor driven by other data, and we didn't make any change, we just fed our black and white images to the feature extractor, and did linear regression. So the data was randomly selected black and white images collected by Jack's lab, and we show 71 macaque, we found neurons and their average neuron activities, like firing rate.

So this is really transfer learning, in the sense that we transferred something developed using color images to black and white, and then the data from ImageNet to the very micro level, in the sense of the neuron level instead of the task level, and from human to macaque. So there are three different levels of transfer, and I was so surprised that it works so well, across 71 neurons.

Now we want to -- remember, our goal is to help neuroscientists design stimuli to understand V4. So this is a key slide. So we already have a model. Suppose we use AlexNet and (indiscernible) to do the regression part. And then we start with random white noise, and we maximize this model, regularized, and then we say if this model captures reality, then this suggests this neuron, neuron 1, likes this curve, this direction, and then we can put it back into an experiment to see whether it's actually the case that it makes the neuron fire.

So this is what is called DeepTune, very much inspired by DeepDream, from the deep learning community, but we are doing neuroscience, so tuning is an important concept, so we call it DeepTune. So this is basically the key development, and how the model came from.

You can say this is completely artifact of the model, which is very fair, but we use a smaller dataset which from the experiment and also use a model and see whether this curve thing is preferred. So you can see all the high responses in the model if you use it, restrict it, images, you see the curve type of thing coming up. So that's kind of confirmation.

But that's still model based. So now we're looking we have test data, which is higher signal-to-noise ratio, 10 to 1, like 10 replicates. This is just raw responses, and we look at where the neuron get excited. You see the curvature kind of thing again. So curve again.

So this is kind of qualitative confirmation that what the DeepTune image suggests is capturing qualitative reality. The position is kind of similar in the middle, but of course it's not identical.

And then we have to confront the question that we have many different neural networks. We try the GoogleNet and VGG. That's the early ones. And then you can do regression and (indiscernible), and they give you very, very similar predictive performance.

So we suddenly have 18 different models, giving very similar performance, and this is a perturbation of human choice. How do we interpret? So we decided to use the stability principle to find some consensus. So these are 18 different models giving you the DeepTunes, and they all share this common curve in the middle, and this kind of regular curve, like a parallel curve, that's actually an artifact of the deep learning, as we learned. You can see GoogleNet has smaller features, convolutional features, and it has smaller spacings.

And then actually I was giving this talk, some of you might have heard it earlier, that people were saying, well, I want one image. Actually, random initialization for any of this doesn't make too much differences. So that's why I didn't discuss it, and we decided to aggregate the different 18 models at the gradient level. We do gradient ascent to maximize the surfaces, you work the -- take the gradient as the smallest in terms of absolute value, among the 18 different gradients, and then that's what we call consensus DeepTune.
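The aggregation rule just described can be sketched in a few lines. This follows only the verbal description in the talk (elementwise, keep the gradient with the smallest absolute value across models); the function names and the toy quadratic "models" are assumptions for illustration, not the published implementation.

```python
def consensus_gradient(gradients):
    """Elementwise, keep the entry with the smallest absolute value.

    `gradients` is a list of equal-length lists, one per model.
    """
    if not gradients:
        raise ValueError("need at least one gradient")
    n = len(gradients[0])
    if any(len(g) != n for g in gradients):
        raise ValueError("all gradients must have the same length")
    return [min((g[i] for g in gradients), key=abs) for i in range(n)]


def consensus_ascent(image, gradient_fns, step=0.1, n_steps=100):
    """Gradient ascent on a flattened image using the consensus gradient."""
    x = list(image)
    for _ in range(n_steps):
        grads = [fn(x) for fn in gradient_fns]
        g = consensus_gradient(grads)
        x = [xi + step * gi for xi, gi in zip(x, g)]
    return x


# Toy stand-ins for two fitted models: f_k(x) = -(x - c_k)^2 per pixel,
# so the gradient is -2 * (x - c_k). Real models would be CNN + regression.
def make_gradient_fn(center):
    return lambda x: [-2.0 * (xi - center) for xi in x]


result = consensus_ascent([5.0], [make_gradient_fn(0.0), make_gradient_fn(1.0)])
print(result)
```

Taking the smallest-magnitude gradient means the image only moves in a direction as far as the most cautious model allows, which is one way to suppress features that only a single architecture favors.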

If you do this, this is ten different initializations. They become very, very similar. So now we have a good representation for each neuron, and now you can cluster them. You can really understand the diversity of V4 neurons, what they care about, and this is just a normal natural that we see people design with geometry and complex shapes. You see long contours. People already know curves are being favored by V4 neurons, and then texture also has been in the literature, but you can see there's also some outlier like V1 like, the one on the most left. And then there's small complex patterns, very hard to design for humans.

So the goal is really to try to feed back into the closed loop, and unfortunately I stopped doing physiology, but Jim DiCarlo's lab at MIT, in a parallel line, did something similar, and they had a Science paper. So I'm seeing proof of concept in Jim's paper that instead of doing the previous very geometric careful design, we can use data to give us suggestions and then confirm with experiments. So we want to crop it, because we don't try this spacing, so it's not like this is -- it's kind of shrink down the scale of the scope of the possibilities you can try, and beyond the geometric designed stimuli.

So to summarize, we propose this framework will be useful for many of the audiences here and data documentation is hugely important, and I showed you one case study. Actually my group has done seven different case studies from precision medicine and genomics, and the principles are very useful. You still have to do a lot of work to make the choices and stability and how you evaluate the predictability, for example, for causal inferences, you have to go through categorization, because there's no direct prediction at a data unit level.

Domain knowledge is hugely important. I don't feel like -- I think one of the most effective ways to develop statistical methodology is actually to embed it in and solve a scientific problem and then generalize. I'm not in the camp of designing things first and then trying them out. I don't think the chance of that succeeding is very high. But you never say never.

So I'd like to thank my group for their unwavering support for the type of I'll call slow research we do, and we really try to solve problems and everything, critical thinking, algorithms, interpretable machine learning and relevant theory, we're now doing deep learning and also random forest relevant theory, all part of solving the problem.

I want to have a shoutout for a book I'm finishing with my postdoc, Rebecca Barter, with MIT Press, we hope will finish in the summer, and will be a free online copy later this year. So it's really using the PCS guiding principle for the entire data lifecycle. For example, we have a whole chapter on data cleaning, and emphasis on narratives, how do I make the connections between domain, human knowledge, and communication, with the symbols we use in mathematical models.

Without that narrative bridge, the symbols don't exist in the real world. This is a gap, I think, in our teaching that we hope to fill.

Another shoutout for our new division called CDSS at Berkeley. We have our wise provost, Jennifer Chayes, leading the charge, and statistics department is in the process of moving into this new division, leaving letters and science. I will have many different courses I hope you guys have heard, Data 8, Data 100, Data 102, and hope you have a chance to take a look.

So the papers, I want to have -- because the previous session talked about interpretable machine learning, actually the other paper tried to emphasize the relevance to audience into problems. So you might find it interesting to take a look. Appeared in PNAS last year.

Thank you very much.

**DR. GUO**: Thank you, Bin, for the wonderful talk and for introducing this PCS framework, and I think this is exactly the kind of modern statistical framework we need now to address the challenges we are facing in those research studies as well as clinical trials, and thank you for sharing your insights on that.

It's great to know you have a book coming out on this topic so we can learn more about it. Thank you.

So in the interests of time, we are moving to our second speaker, Dr. Thomas Nichols. Tom is the professor of neuroimaging statistics at Oxford Big Data Institute. He is a statistician with a solitary focus on modeling and inference methods for brain imaging research. Tom has a unique background, with both industrial and academic experience, and he has a diverse training in both statistics as well as cognitive neuroscience. So Tom has received the Wiley Young Investigator Award, which is a high honor from the Organization for Human Brain Mapping, in recognition of his contributions to statistical modeling and inference of neuroimaging data. He's a developer of both SPM and FSL tools, which are the two most popular tools for analyzing imaging data, and is well-known for bringing advanced statistical methodology to brain imaging.

So in summary, I think Tom has done some great and high impact work in bridging statistics and neuroscience. So, Tom, please take it away.

**DR. NICHOLS**: Thank you, Ying, for that warm introduction. Continuing on the theme of reproducibility and stability, my talk will be on the impact of methodological variation on fMRI. This is a joint work with my former postdoc, Camille Maumet, and my current postdoc, former student, Alex Bowring.

The goal of this work was to basically answer this question, the question we often are asked when doing fMRI analysis, which is which software to use. If you're familiar with fMRI analysis, you might know that there are a couple of dominant software packages out there, but in fact, there are a number of tools that each have their own strengths and advantages. I've put a couple of them up here, FreeSurfer is used a lot, the tools in nilearn I use a lot, and of course NITRC is a repository for just a huge collection of tools. So it does come to the question of which software should we be using? What are all the pluses and minuses?

Let me say that there have been some study of this, this question of methodological variation in fMRI. So the first person who took this up was a student in Michigan, Joshua Carp. He looked at the number of different possible analysis pipelines and enumerated them all, and showed on one data set how different the results could be.

We've also shown that just the version of your analysis software -- I believe this Gronenschild paper here was manipulating the version of the FreeSurfer software, and showed nontrivial differences. And perhaps most disturbingly taking the same software, same version, but on different operating systems, and finding that the results are not the same.

So there's been a number of people who have looked at these, the fact that there are differences and that perhaps we should be concerned about them. But what we really wanted to know was what was the actual impact on not just one dataset, or one simulated dataset, but what would actually be the impact on actual fMRI data that has been used in more than one dataset?

Just for background for people who are not familiar with fMRI, fMRI data processing is hard, because there's a huge number of analysis preprocessing steps, and to be honest, that's probably shared with many types of modern biomedical data -- it's not the fMRI alone that has all these kind of non-statistical preprocessing steps that have to be done. I've just listed them here.

Just for example, you have to account for the fact that people move their heads in the scanners, you have to account for the fact that different people have different shaped brains and you have to align them to a common atlas. And then there are a number of agree-to-disagree analysis choices on how the data is analyzed, and different people can make different choices on these.

I think one of the reasons why fMRI has succeeded is because there are end-to-end software packages available that say, listen, we're a team, we've thought hard about each one of these problems, we're going to give you a turnkey, end-to-end solution, you put your data in here, you specify your model, and out comes the answers. And that's great, and that I think has really been key to supporting the growth of fMRI.

But that does mean that there's a whole set of choices, analytical choices, that have been made that are kind of baked into these. And they're probably very carefully thought of at the time, but they represent different pipelines for each different package.

We set out to ask the question of what impact does this -- if we just took one dataset, how much impact would there be by using the defaults, and of course there are different choices you could use around the defaults, but our first work on this, back in 2016, we just said what if we just took one dataset whose data was available, put it through three different software packages with minor variations of the default analysis pipeline? And that was a masters student's project and it was interesting, and we found that actually the statistical maps -- these are t-statistical maps out of each package -- look considerably different.

But of course, we quickly realized well that's not really surprising because there are some fundamental differences in the defaults between these packages. For example, AFNI specifies a relatively lower smoothing than the other two packages. So why would they be similar if we accept the defaults?

So what motivated the current work was basically taking a more harmonized approach and also looking at more datasets. So we took three different datasets and put them through three different analysis software packages -- again, that have end-to-end solutions for treated analysis, and then also at the very tail end we allowed for either parametric or non-parametric permutation inference at the group level. Our goal was to replicate the original published papers, except we didn't want to get caught in the last problem of having sort of completely incomparable results across pipelines.

So we made some hard choices. Here are the three datasets, and I'll go through them; I'll only have time to talk about the first two. But here's some tough choices that we had to do. Some of these papers had some very well-intended, probably good methods for removing outliers, but they weren't implemented in any standard software. So we said nope, we're not going to do that.

Some software tools extract the brain, some software tools manage to do the analysis without that brain extraction. Some tools' default was to use linear brain extraction. So just a bunch of things. We said, okay, we're not going to just accept the defaults, we're going to make some active choices so that across the three different software packages they will be as comparable as possible to what was published in the original work. So our goal was to then qualitatively and quantitatively evaluate how different are the results when we take the same data through these three different packages?

For better or for worse, the main way that researchers in neuroscience interpret fMRI data analysis results is through thresholded statistic maps. They take the p-values, determine a level of significance, and they examine where these suprathreshold regions are. When we do this for this first balloon analog risk task, you see these three maps. This, on the bottom here, shows you what was published in the original study, and so we would expect the FSL would be the same. Now of course, you might say well, why isn't this exactly the same? There are some minor differences that we made to make them all comparable, and also -- and this is very common -- even though the data is shared, the study reported n = 17, but only shared 16 maps. So we had to do, we took what we did. So in some ways it's less interesting comparing to this and it's more interesting comparing among the different -- we worked as hard as possible to make these analyses as similar as possible across three packages. Here are the parametric results, here are the non-parametric results.

And broadly they're picking up similar things, there's perhaps less activation here in the end, and interestingly, this goes away altogether (indiscernible) permutation. Now, you may be aware that there is some work that says, and I was involved in that work, that this cluster-forming threshold is probably too low and shouldn't be trusted. But that said, it should mean the same thing across these three packages. So again, this is kind of upsetting that you don't get something similar.

You might focus in on this and say, well, how is it possible we lost this? So here is the same data, just a different view, a coronal view instead of a sagittal view. And what's interesting is that what you would not see on the map here, but when we looked at the unthresholded map, is that this cluster here is one big cluster. The insular and the anterior cingulate regions all were connected to be one gigantic and hence very significant cluster. And obviously what happened was there was just a very minor perturbation of the data that gave rise to different results that broke the cluster into smaller clusters, and then that was no longer significant.

So this is something telling us something about fMRI analysis in general, but maybe specifically about cluster analysis as a very fragile aspect to it. I can show you what happened on the next dataset, which had more subjects -- certainly a low n, by any count -- but less fiddly. Whereas the last analysis was an event-related design with a very complex and subtle experimental design analysis, this is a more robust block design, and you see then very similar -- we don't always get the cingulate up here. The permutation. But we do generally see greater agreement across these for this analysis here.

Qualitatively, we can look at this and we can try to say to ourselves how would the interpretation be different? But can we make this quantitative? And we can do this. So the thing we focused on was two things, correlation, just correlation of the Z-maps, the T-maps. And then the Dice coefficient. For the chosen thresholding method, how much overlap was there between the different methods?

This was just for the first study. This is the Dice overlap. So this should be 1 if the different regions of activation perfectly overlapped or 0 if there's no overlap. And what you can see is that among the different softwares, it's quite disappointing. This is basically saying, at best, we're under 50 percent overlap by this Dice metric. And this is quite sobering. So this is a quantitative assessment of what we saw in the last one. And in some ways I think the last one may make you more optimistic than when you actually see it laid out like this.
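For reference, the Dice coefficient of two thresholded maps A and B is 2|A ∩ B| / (|A| + |B|). A minimal sketch follows, with maps represented as flat lists of voxel values; the function names and the choice to return NaN when both maps are empty are assumptions for illustration.

```python
def threshold(stat_map, cutoff):
    """Binarize a statistic map: True where the value exceeds the cutoff."""
    return [value > cutoff for value in stat_map]


def dice_coefficient(mask_a, mask_b):
    """Dice overlap of two binary masks: 2|A and B| / (|A| + |B|)."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        return float("nan")  # undefined when neither map has any activation
    return 2.0 * overlap / (size_a + size_b)


# Two hypothetical t-maps over the same four voxels.
map_one = [4.2, 3.5, 0.3, 1.1]
map_two = [3.9, 1.0, 3.3, 0.2]
print(dice_coefficient(threshold(map_one, 3.1), threshold(map_two, 3.1)))
```

Here each map has two suprathreshold voxels and they share one, so the overlap is 2 × 1 / (2 + 2) = 0.5.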

As noted, the FSL permutation analysis actually just had no overlap and missed that positive activation, had no significant results. But what is notable is the comparison that's strictly within pipelines. So these results here are AFNI parametric to AFNI nonparametric. They're quite similar, because basically they have the entire pipeline behind them that's all the same, and they only vary on how the final group stats were derived and hence they're the most similar.

The last thing I want to show on this first version of the data analysis was Bland-Altman plots. You might be familiar with Bland-Altman plots, they're a plot whenever you are comparing two things, it's often better to compare not just a scatterplot, but a plot of the average versus the difference, with the y-axis showing the difference and the x-axis showing the average.
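A minimal sketch of the quantities behind a Bland-Altman plot, for two statistic maps over the same voxels. The function names are illustrative, and the mean ± 1.96 SD limits of agreement are the conventional summary for such plots rather than something stated in the talk.

```python
import statistics


def bland_altman_points(x, y):
    """Return (average, difference) pairs: x-axis average, y-axis difference."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return [((a + b) / 2.0, a - b) for a, b in zip(x, y)]


def limits_of_agreement(x, y):
    """Mean difference with conventional limits of mean +/- 1.96 SD."""
    diffs = [d for _, d in bland_altman_points(x, y)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    return mean_diff - 1.96 * sd_diff, mean_diff, mean_diff + 1.96 * sd_diff


# Hypothetical t-statistics for the same voxels from two pipelines.
pipeline_a = [2.0, 4.0, 6.0]
pipeline_b = [1.0, 2.0, 3.0]
print(bland_altman_points(pipeline_a, pipeline_b))
print(limits_of_agreement(pipeline_a, pipeline_b))
```

If two pipelines produce identical maps, every difference is zero and the points fall on a flat line at zero.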

The first thing to point out is this one here. It turns out that SPM uses the exact same statistical model for its nonparametric and for its parametric statistics, hence you get a perfectly flat line. There is no difference in the T-statistic images. Fine. That's good.

And with comparing within software packages, between parametric and nonparametric, yeah, there's pretty good similarity, there's some variation, but a little less with FSL, but this is reflecting different models. The nonparametric methods are using a one-sample T-test, where both of these parametric methods are using a more fancy mixed-effects model.

But what's really sobering was when we looked at the difference in the T-statistics between the software packages, and what you're seeing here is just to note, these are T-statistics. You can interpret these. These are roughly Z statistics. So you can see the differences are up to 4 Z units. I've put on this ellipse here, would be the confidence ellipsoid for what you would get if you had unrelated, completely independent Gaussian variates, when you plotted their average versus their difference. Now, this is not as broad as an independent difference, difference of independent random variables, but it's getting up there. This is sobering. This is saying that the difference we see between packages is approaching what you see between independent random noise.
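The reference ellipse for independent noise can be worked out directly. The sketch below assumes two independent standard normal statistics and a 95 percent level; the level is an assumption, since the talk does not state which one was plotted.

```latex
% Average and difference of two independent standard normal statistics
A = \frac{X + Y}{2}, \qquad D = X - Y, \qquad X, Y \overset{\text{iid}}{\sim} N(0, 1)

% Their variances, and the fact that they are uncorrelated
\operatorname{Var}(A) = \tfrac{1}{2}, \qquad
\operatorname{Var}(D) = 2, \qquad
\operatorname{Cov}(A, D) = \tfrac{1}{2}\bigl(\operatorname{Var}(X) - \operatorname{Var}(Y)\bigr) = 0

% 95 percent confidence ellipse (chi-square with 2 degrees of freedom)
\frac{A^2}{1/2} + \frac{D^2}{2} \le \chi^2_{2,\,0.95} = -2 \ln(0.05) \approx 5.99
```

Under these assumptions the ellipse is axis-aligned, with a half-width of about √(5.99 × 0.5) ≈ 1.73 along the average axis and √(5.99 × 2) ≈ 3.46 along the difference axis. Between-package differences of up to 4 Z units are therefore on the scale expected from unrelated noise.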

So that's pretty sobering, that's ds001, the first dataset. When we do the second dataset, the ds109, similar pattern, just really substantial variation, and it's really sobering. Different for some pairs of software, and you might say this is a systematic, maybe a bias, saying AFNI is better for highly statistics than FSL, but still really sobering, the amounts of variation.

That work was published, and the reviewers gave us a lot of hassle, basically asking what is the nature of the differences? What is the difference? And we were just setting out to do a very systematic comparison, and we just didn't have the time at the time to sort of isolate what are the differences. So that's what we did with the current project that's currently in preparation.

We had a feeling that maybe it was preprocessing that was explaining most of this difference. So we wanted to create, we were going to take our same three datasets, and then we're going to put them through a collection of hybrid pipelines, and we're going to use a common, basically a fourth approach to doing the preprocessing, something called fMRIPrep. It's a package that was developed trying to find best-consensus solutions for preprocessing.

What we'll then do is interchange analysis steps between the different models, and then we'll do the same sort of comparisons. So we've identified five key steps. There's preprocessing, that's head motion correction, atlas registration. Details of the first level signal model; different packages make different assumptions about the HRF, that's the Hemodynamic Response Function. The first level noise model; some software, for example, assumes that the temporal autocorrelation is homogeneous across the brain.

Different software packages make different assumptions about the deterministic model for the drift. And also how the group model is conducted, it differs; some packages just use equivalent of ordinary least squares and don't use the first level standard errors. But this alone would be 243 different options, just among the three software packages. So we needed a practical way forward.
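The count of 243 is simply 3 packages raised to the power of 5 steps. A short sketch that enumerates the full space; the step labels are shorthand chosen here for illustration.

```python
from itertools import product

PACKAGES = ("AFNI", "FSL", "SPM")
STEPS = ("preprocessing", "signal", "noise", "drift", "group")


def all_hybrid_pipelines(packages=PACKAGES, steps=STEPS):
    """Every assignment of one package to each analysis step."""
    return [dict(zip(steps, combo))
            for combo in product(packages, repeat=len(steps))]


pipelines = all_hybrid_pipelines()
print(len(pipelines))  # 3 ** 5 = 243
print(pipelines[0])
```

Adding a fourth preprocessing option such as fMRIPrep would enlarge the space further, which is why an exhaustive comparison is impractical.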

So our approach was basically find paths through the analysis space, changing one option at a time. So this is a set of glyphs that we created to demonstrate, to record how we're changing the analysis. So this is an example of an SPM for preprocessing signal, noise, drift, and a group model, and these two little bits down here represent parametric and nonparametric.

And then when we can change one thing at a time, you'll see this glyph changes here. We have over here, pure FSL, pure SPM, and if we just change the preprocessing, get this glyph, we change preprocessing and we also change group model; then we change the group model and the drift model; change the group model, drift model, and the noise model. And here, this is basically identical to FSL except with fMRIPrep preprocessing.
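The one-change-at-a-time walk just described can be sketched as follows. The dictionary representation, the function name, and the order of changes are assumptions that mirror the verbal description, not the project's actual code.

```python
STEPS = ("preprocessing", "signal", "noise", "drift", "group")


def one_at_a_time_path(start, end, order):
    """List of pipelines from `start` to `end`, changing one step per move."""
    current = dict(start)
    path = [dict(current)]
    for step in order:
        if current[step] != end[step]:
            current[step] = end[step]
            path.append(dict(current))
    return path


pure_spm = {step: "SPM" for step in STEPS}
# End point: FSL everywhere except fMRIPrep preprocessing.
fsl_with_fmriprep = {step: "FSL" for step in STEPS}
fsl_with_fmriprep["preprocessing"] = "fMRIPrep"

order = ("preprocessing", "group", "drift", "noise", "signal")
for pipeline in one_at_a_time_path(pure_spm, fsl_with_fmriprep, order):
    print(pipeline)
```

Because consecutive pipelines differ in exactly one step, any jump in the results between neighbors can be attributed to that step. The path needs only 6 analyses instead of the full grid.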

So now we can finally try to isolate these differences, and what we find here is that similar, similar, similar -- so, not very much change -- until we get to right here, and that big step here is the change in the signal model. And then there's again some similars -- whereas SPM doesn't seem to be that different in terms of preprocessing, switching from SPM preprocessing to fMRIPrep, FSL does maybe have some differences. In fact actually the region that disappeared comes back. So this is suggesting that that signal model matters.

We can do this now between AFNI and FSL, and interestingly we see the same issue. Basically, right where we change the signal model that differs. And this is perhaps this particular experiment is a very subtle cognitive experiment, and you have to model it, you have to use complex parametric modulations, and perhaps subtle differences in the hemodynamic response functions used in the different packages are resulting in different result patterns.

We can also then compute correlations, Dice for positive activations, and Dice for negative activations, to further quantify that. And these maps are really interesting, you don't have time to stare at them, but when you see blocky structure, that's telling you where there's basically no difference, and then when you see these bright changes you're like, ah, yes, there's something here, and this is the difference between 5 and 6 I've identified before.

And this, we can do the same thing for the next pipeline here, and again, less changes here, so it looks like it's actually good, although we do actually miss sometimes the cingulate finding up here. So we could map those out, but we can also just, again, quantify these. And again, this dataset is doing much better than 109. In general, these correlations are higher. And if anything, there's maybe a break here, slightly, between 3 and 4, but also again between the FSL preprocessing.

There's a lot here, but what we have done is by taking a systematic exploration of the analysis space, we can try to start isolating where are these differences. And what have we found, in general? Basically there is lots of variation, unfortunately, and it's really sobering what these are.

Partly you could say this is a pessimistic view, because we are focusing on low-n datasets and the trend in the field is of course towards larger n. But a large amount of cognitive neuroscience still relies on these small-n datasets, and also it is informative on how we view the historical literature that has relied on these small n's.

Our current work with the hybrid pipelines has found that preprocessing, and I didn't really show, but the drift model doesn't seem to make much difference. But it just depends on each different dataset. It definitely seems like the HRF makes some big difference, but not so much on a block design study.

I promised I'd try to have some sort of positive spin here on how can we deal with this going forward. I see two ways forward. One is basically saying, well, we need more validation. Each of these packages were the result of peer-reviewed research, but they weren't necessarily validated as a whole. They were validated by different pieces over, literally, decades, at this point. And maybe we need to now step back and do some sort of comprehensive analysis and try to consider all these different variants and find the best analysis, and then provide a sort of a suggested consensus pipeline that mixes and matches.

And that sounds daunting, and it is, and I don't know whether that's feasible at all for all of task fMRI. But I will say that was the project of fMRIPrep. A team sat down and took an ecumenical view across all the different software packages and tried to find the different pieces that worked best, and glued them all together, and that's what fMRIPrep is. However, task fMRI design is more complex and there's more involved.

So maybe a more practical way forward is a multiverse analysis approach where we acknowledge that there's substantial methodological variation, and we try to systematically sample that variation and then average over, come up with consensus inferences. Now, there's a lot of work needed on this. First, right now, all this work that we did right here was fairly manual, and we need ways to automate the provenance of basically what was done and the description of how to manipulate tasks, so it doesn't have to be done in a manual, hand-created way. And then ideally, if I did, say, 500 different analyses, I could then find that actually maybe only 20 or maybe only 50 of those variants were responsible for a substantial amount of variation, and so going forward I could use a smaller number of analyses.

And then finally there is the whole notion of, well, okay, once I have my 500 or 50 different analyses, how do I combine them? Do I want to just take an average, do I want to penalize for being different, in a kind of mixed effects meta-analysis, or do I not care about that, and I just want a consensus? I think that's a really interesting area going forward, because all meta-analysis previously has all been about independent data, and this is really looking at the same data meta-analysis.

So I'll stop there, and thanks for your attention.

**DR. GUO**: Thank you, Tom, for the great talk. Especially for those of us working in imaging, we all know how much impact when you change a small thing in your pipeline, how that (indiscernible) reaching your analysis results. Thank you for doing this very important, I think, I assume very time-consuming work, to formally quantify the influence from various sources. Similar results are both revealing and also worrying for us, and this also shows how important it is that we should be really careful in the statistical analysis and be transparent about the procedures that we used to arrive at certain results, and I really like your idea in the end. Maybe we can think of a way to automate the process to consider different measures and software that you can analyze the data. Thank you, Tom.

We are coming to our third speaker, Dr. Martin Lindquist. Martin is a professor of biostatistics at Johns Hopkins University. His research focuses on mathematical statistical problems relating to fMRI data, and Martin is actually involved in developing new statistical methods that can help us generate new insights in brain function using functional neuroimaging.

He has been very productive in his career and published over 100 research articles and serves on multiple editorial boards on both statistical and neuroimaging research. Martin is a fellow of the American Statistical Association, and he was awarded the Organization for Human Brain Mapping's Education in Neuroimaging Award in 2018. So Martin has very popular online courses teaching fMRI methods. I think I recommend all my students take his class, actually. This class has been taken by more than 100,000 students worldwide.

Martin, please take it away.

**DR. LINDQUIST**: Thank you, Ying, for the kind introduction, and thanks for inviting me. This is really a great workshop and I'm glad to be a part of it.

I'm going to build on some of the -- at the end, Tom, in the previous talk, was talking about the movement from small n to large n studies, and so one of the things I wanted to sort of touch upon here is how does that impact, that movement from small n to large n, impact statistical analysis and reproducibility and the like. So those are kind of the things I wanted to explore in this talk.

In the past decade or so, there's been a lot of discussion about how studies with small samples undermine the reliability of neuroimaging research. I think a survey paper from a few years ago showed that the median sample size for an fMRI study, I think, was around 28 or 29, so on the low end. And you saw in Tom's lecture that the studies that he was looking at were around 20. So that was a normal type of sample size for a long time. And so there's been this argument that these small sample studies both have low statistical power to detect true effects, but also reduced likelihood for statistically significant results to be true effects. There's been a whole lot of debate about that.

But we're sort of in the midst of a paradigm shift where there's been a substantial increase in the availability of largescale, very diverse lifespan datasets consisting of well over 1,000 subjects, and here I put a couple.

The classic is the Human Connectome Project, and of interest to this crowd, there's also disease connectomes, which are using the same paradigm and using similar experiments, but on sort of disease population, and there's multiple mental health studies related to that. There's ABIDE, ADNI, ABCD, UK Biobank, which is going to image up to 100,000 subjects eventually, which is quite amazing. I put in here also the Acute to Chronic Pain Signatures program, because I'm heavily involved with that, and we're going to be scanning 3,600 people longitudinally, just to see who recovers from acute pain episode and who develops chronic pain.

With this increase, and now that we have these datasets consisting of thousands of people, most of the sort of methods that we have were sort of validated and developed with 20 subjects in mind, and so there's a need to reevaluate which methods to use and where, and there's a need for principled statistical methods to analyze these big datasets. Some of the commonly used statistical methods that we have will carry over to this new paradigm, while others will probably need some sort of refinement. In this talk I'm going to try to highlight some of the opportunities and challenges that may lay ahead in this area.

This has already been a discussion in the bread-and-butter brain mapping type approaches, and you saw some of this, example of brain mapping in Tom's talk, where you have people perform a task, and you try to find activated regions. In these types of studies, traditional brain mapping approaches, they treat the brain as the outcome and the task as predictor. So we have a bunch of voxels over the brain, each voxel has a time series associated tracking brain activation, and we have a bunch of mental events that the subject goes through, and we have sort of hypothesized activation corresponding to each of these mental events, and we fit a big regression model where we try to determine the significance of each of these mental events using the time series data. And we want to do that at each voxel of the brain.

So the goal of this type of analysis is to assess at each voxel if there's a nonzero effect. So the null hypothesis is that this mental effect doesn't play a role here, and we see whether we can reject, in that case we give it a big fat blob in that region. So each of these voxels are separate, independent outcome in a mass univariate analysis, and voxel-wise effects are aggregated into these statistical maps that we see.
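A minimal sketch of the mass univariate idea described above: the same regression is fit separately at every voxel, and the resulting test statistics are collected into a map. Everything here is an assumption made for illustration: a single boxcar regressor stands in for the hypothesized activation (no hemodynamic response convolution), there are only two simulated voxels, and the noise is independent Gaussian.

```python
import math
import random

def voxelwise_t(data, regressor):
    """Fit y = b0 + b1 * regressor at each voxel; return the t-statistic for b1."""
    n = len(regressor)
    x_mean = sum(regressor) / n
    sxx = sum((x - x_mean) ** 2 for x in regressor)
    t_stats = []
    for series in data:  # one time series per voxel
        y_mean = sum(series) / n
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(regressor, series))
        slope = sxy / sxx
        intercept = y_mean - slope * x_mean
        rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(regressor, series))
        standard_error = math.sqrt(rss / (n - 2) / sxx)
        t_stats.append(slope / standard_error)
    return t_stats

rng = random.Random(0)
# Hypothetical task design: alternating blocks of rest (0) and task (1), 40 scans.
regressor = ([0.0] * 5 + [1.0] * 5) * 4
active_voxel = [2.0 * x + rng.gauss(0, 1) for x in regressor]  # true effect of 2
null_voxel = [rng.gauss(0, 1) for _ in regressor]              # no effect

t_map = voxelwise_t([active_voxel, null_voxel], regressor)
print(t_map)
```

In a real analysis the list of voxels would run into the hundreds of thousands, which is where the multiplicity issue discussed later comes from.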

There's been a fair amount of debate related to sample size and brain mapping already, and one of the things that have driven this is that people found that as sample sizes increase, even small effects become well-powered and tend to be easier to detect, which is sort of -- that's true of these statistical models. However, it's been sort of presented as something as the fallacy of null hypothesis testing, and one researcher went so far as to say that you shouldn't have more than 16 subjects in your study, because you might get sort of trivial effects be significant.

This is really a question of power, in that you'd have an increasing amount of power, and you're able to detect even small effects. We wrote an article pushing back on that saying you don't want to limit the power of these types of studies. You might just want to change the question that you're asking. But in general, and I think this is an important thing, is that if you detect more significant effects, this is ultimately going to require more domain knowledge to determine whether they're practically important or not, and that's an important thing. But I also think that you shouldn't get mad at your method if it does what it's designed to do. It's not what you want, right. There's been a lot of people thinking that the p-value is not working well. It works perfectly fine for what it's designed to do, and that just might not be what you want it to do, so in that case you should change the methodology or the metrics that you look at.

We've seen that in the literature. There's a lot more, for example, predictive modeling approaches that are taking place, and alternative metrics such as effect sizes will increasingly find widespread use. And that's a fine thing.

For example, predictive modeling, this links back to Bin's talk on machine learning and deep learning and the like. Here, we sort of flip the equation oftentimes, and have the voxels be the features, and we want to predict some task or condition or behavior or whatnot, and that's the outcome. This has a lot of benefits, as it gets rid of some of the multiplicity issues related to the brain mapping, and there's going to be a whole session on multiplicity this afternoon, so this ties into that as well.

These predictive models tend to work better if you have large sample size than these standard statistical approaches. So with this increased sample sizes, these methods will hopefully start working better, in particular the deep learning type approaches that Bin talked about need a very substantial sample size to work well. So these larger sample sizes help avoid overfitting and the like.

In work that Tom didn't mention, but he was involved with, there was a big study of -- this NARPS study, where they gave the same dataset to I think 78 different groups, and asked them to preprocess them as they wanted to, so what they found is that the different groups preprocessed things very differently, and that gave rise to different results, particularly in the thresholded statistical maps.

One of the nice things about that paper was that even though people preprocessed the data quite differently, the spatial patterns of the task activation maps were relatively stable, and that I think is very beneficial for the predictive modeling approach, because that's what's going to be, oftentimes, if you're using predictive modeling on fMRI data, that's usually where the features are. So that indicates there might be some stability in the features that will be useful for these types of approaches.

Also, another thing that is becoming increasingly popular sort of is the idea of using effect sizes as a way of determining whether an effect is interesting or not, and that promises to play an important role in the analysis of largescale data. So we have a paper from 2017 with Reddan et al, and just recently, just this week, my Twitter feed showed up a new paper by Dick et al, with the ABCD group, were talking about how to analyze big datasets, so that was very serendipitous. That's a great paper, you guys should take a look at that. They also talked about effect size estimation and the like.

And there's lots of different types of methods that you can use, and these are beneficial here as they're sort of unit-free descriptions of the strength of an effect that are independent of the sample size. Now, this should ideally be presented together with confidence intervals, and those intervals will be dependent on the sample size, of course. Again, small effect sizes can be important, and again, scientific knowledge is required to evaluate the relative importance.
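To make the point about intervals concrete, here is a minimal sketch that reports the same correlation with a confidence interval at two sample sizes. It uses the standard Fisher z approximation; the correlation of 0.15 and the sample sizes of 30 and 3,000 are illustrative choices, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def correlation_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via the Fisher z transform."""
    z = math.atanh(r)
    standard_error = 1 / math.sqrt(n - 3)
    critical = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - critical * standard_error), math.tanh(z + critical * standard_error)

R_OBSERVED = 0.15  # assumed effect size, identical in both scenarios
low_small, high_small = correlation_ci(R_OBSERVED, n=30)
low_large, high_large = correlation_ci(R_OBSERVED, n=3000)

print(f"n=30:   r={R_OBSERVED}, 95% CI=({low_small:.3f}, {high_small:.3f})")
print(f"n=3000: r={R_OBSERVED}, 95% CI=({low_large:.3f}, {high_large:.3f})")
```

The point estimate does not move with the sample size, but the interval around it narrows, which is why the two are best reported together.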

So this is, as one idea, one area where the brain mapping, where sample size discussions have already played a pretty big role, have been discussed a lot, and another area where this is important that's been controversial is in looking at individual differences in brain structure function, and behavioral phenotypes. So for example, you might want to look at brain activation and how it links to fluid intelligence or some other behavior or method, and you want to look at that across the brain. Uncovering reproducible associations in individual differences, and between brain structure, function, and behavioral phenotypes, is quite difficult using these standard small sample sizes. There's been a lot of debate about inflated effects. You might recall some of the voodoo correlation debate and it's sort of related to the fact that oftentimes the way it's reported, that there's a selection bias involved, and I'll show you in a second why that's the case. And this is sort of a thing that these large effects are necessary basically to survive multiplicity correction.

As an example, let's say that we have a correlation between a phenotype and brain activation across subjects that's, say, 0.15. And all the voxels with that correlation, this might be the distribution of those voxels, taking always into consideration where sample size is 30. Because we have to adjust for multiplicity and the like, our threshold might be somewhere here, and we say that you have a significant correlation if your correlation is above, say, 0.7 or something like that. Then they tend to report these large correlations and, say, look it's very highly correlated between this, but this is a selection bias issue, which is now sort of well-understood, but it's sort of, people have grown accustomed to seeing these large individual differences between brain and behavior, which are sort of driven a lot by this selection bias.

Now what happens when the sample size increases, let's say, to 3,000? Well, of course this tends to, the threshold becomes closer to the true effect, so the selection bias will similarly decrease. So what's going to happen in these cases is when we're looking at these big datasets, we're going to tend to see more reproducible brain-behavior relationships, but they're going to be smaller, right? Because we're going to be closer to the true effect, which is probably closer to 0.15 than to 0.7, which is quite large.
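The selection bias argument can be sketched with a small simulation. Assumptions: every voxel has the same true correlation of 0.15, sample correlations are drawn with the Fisher z approximation rather than from simulated subjects, and a two-sided alpha of 0.001 stands in for a multiplicity-adjusted threshold. None of these numbers come from the slides.

```python
import math
import random
from statistics import NormalDist, mean

def critical_r(n, alpha):
    """Smallest correlation declared significant at two-sided level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return math.tanh(z_crit / math.sqrt(n - 3))

def mean_reported_r(true_r, n, alpha, n_voxels=20000, seed=0):
    """Return the threshold and the average correlation among voxels that pass it."""
    rng = random.Random(seed)
    center = math.atanh(true_r)
    spread = 1 / math.sqrt(n - 3)
    # Approximation: atanh(sample r) ~ Normal(atanh(true r), 1 / (n - 3)).
    observed = [math.tanh(rng.gauss(center, spread)) for _ in range(n_voxels)]
    threshold = critical_r(n, alpha)
    survivors = [r for r in observed if r > threshold]
    return threshold, (mean(survivors) if survivors else float("nan"))

TRUE_R = 0.15
ALPHA = 0.001  # hypothetical stand-in for a corrected threshold

threshold_small, reported_small = mean_reported_r(TRUE_R, 30, ALPHA)
threshold_large, reported_large = mean_reported_r(TRUE_R, 3000, ALPHA)

print(f"n=30:   threshold={threshold_small:.2f}, mean reported r={reported_small:.2f}")
print(f"n=3000: threshold={threshold_large:.2f}, mean reported r={reported_large:.2f}")
```

With 30 subjects only the voxels that happened to overshoot the threshold get reported, so the average reported correlation sits well above the true value; with 3,000 subjects the threshold drops below the true effect and the reported values settle close to it.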

So I think when analyzing large datasets, one thing that we should expect is that we're going to find more reliable associations, but they're going to be smaller than we've grown accustomed to. So correlations of 0.9 between brain and behavior, which is great for getting a Nature paper, probably not so realistic, because most real-world associations are probably small. But this of course doesn't make them uninteresting. We have to just sort of change our expectations a little bit.

I think that this was shown in another excellent paper from the ABCD group, Marek et al, which showed that in that big cohort of data that the largest replicable brain-wide associations for univariate measures was around 0.14. So these are of course very interesting things, but they're not these 0.9s that we've grown accustomed to in the small sample paradigm. So I think we're going to see smaller but more reproducible measures, and we should set our expectations appropriately.

Those are two things where sample size and brain imaging have already started intersecting and people have started asking questions about things, and it's really good. So where are some other areas where we might look at changes? One of the big benefits of these largescale datasets is that they provide us sufficient power to investigate higher-order interactions between variables, but also perform data fusion across modalities. So imaging genetics, which I think is going to be talked about at the very end of today, is one example of that. So you have a big data problem in genetics, and you have a big data problem in imaging, and then you combine them into a huge data problem, and that's very, very difficult. You need very big sample sizes to do this correctly.

Here's sort of a cartoon about how the statistical analysis in these multimodal worlds has worked. You're looking at one modality here, which is maybe genetics, and we have another one here that's imaging. There's different ways that we can do that. We can take a phenotype from the genetics and a phenotype from the imaging and just correlate them. That's a very simple analysis. You might not need a very big sample size for that. But more common has been that you've taken a phenotype, say, from the genomics, and looked for associations whole-brain-wide. So you have a univariate-to-multivariate sort of mapping. Or you can do it opposite -- you can take an imaging-based phenotype, and map it in a GWAS analysis.

So those are where a lot of data fusion has been going on, but increasingly, and I think importantly, people with these large datasets, people are able to do these multivariate-to-multivariate analysis, and I think that there's a huge space there for sophisticated statistical methods for doing that type of analysis.

Today, in sort of the imaging world, a lot of this has been done with, say, ICA independent component analysis, or some variants of CCA, canonical correlation analysis, but this is sort of a very interesting area of research, I think, for statisticians. Here's one area linking work by Smith et al, linking behavior and imaging, using CCA, but there's a lot of other type of things that people do.

That's an area where a lot of analysis will take place, moving forward. Another area is these datasets are big and they're diverse, and this will allow us to estimate flexible models and study covariation with demographic factors with sufficient precision. Is the same model good for young adults as for an aging population? Probably not. It's known that the hemodynamic response function changes as you age, so presumably you should tailor your analysis to that. So these largescale datasets allow for the possibility of maybe being able to do that.

Also, it allows researchers to perform analysis on subsets of data with a particular set of characteristics. And this is one of the cool things about, say, the UK Biobank, where you have 100,000 subjects with imaging. Let's say that you have a mental health disorder with a prevalence of 0.1 percent. Well, if you look at the UK Biobank, you can get 100 subjects with that, and then you can find matched controls and then suddenly you have imaging data and case control on 200 subjects, which is great, and opens up lots of new doors and interesting avenues. So I'm really excited about those types of things.

Also, these large datasets could be used to validate results from smaller studies. Do they hold up? The results from these small n studies hold up in these databases? They could also, probably not together with the first, be used as priors for the analysis of smaller sample studies. So we've used that quite a bit, that we've used the big data to get normative results that we use in smaller sample studies, and it's sort of like in meta-analysis and whatnot, but now we have these big databases that allow us to do this.

One thing that we've done a lot, and I see this as catching up is to test the reproducibility of new methods in smaller sample studies using big data. My colleague, Ciprian Crainiceanu, has dubbed this downstrapping, and basically the idea of downstrapping is that you treat, say, the human connectome data as your population, and then you take bootstrap samples of size, say, 20, and use that to sort of mimic what a small n dataset would look like and how your method would work in that setting.
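A minimal sketch of the downstrapping idea as described here: treat a large dataset as the population, repeatedly draw small samples from it, and look at how much an estimate varies at that sample size. The population below is simulated with an assumed true correlation of 0.15 rather than taken from real imaging data, and all function names are hypothetical.

```python
import math
import random
from statistics import pstdev

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def make_population(size, rho, rng):
    """Simulate (brain measure, behavior) pairs with correlation rho."""
    population = []
    for _ in range(size):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        population.append((x, y))
    return population

def downstrap(population, sample_size, n_draws, rng):
    """Draw bootstrap samples of a reduced size and estimate r in each one."""
    estimates = []
    for _ in range(n_draws):
        sample = rng.choices(population, k=sample_size)  # sampling with replacement
        xs, ys = zip(*sample)
        estimates.append(pearson(xs, ys))
    return estimates

rng = random.Random(1)
population = make_population(1000, rho=0.15, rng=rng)  # stand-in for a large cohort
small_n = downstrap(population, sample_size=20, n_draws=500, rng=rng)
larger_n = downstrap(population, sample_size=200, n_draws=500, rng=rng)

spread_small = pstdev(small_n)
spread_larger = pstdev(larger_n)
print(f"n=20:  max r={max(small_n):.2f}, spread={spread_small:.2f}")
print(f"n=200: max r={max(larger_n):.2f}, spread={spread_larger:.2f}")
```

The same loop can wrap any method in place of the correlation, which is how it can be used to mimic what a small n dataset would look like for a new technique.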

Here's the work by postdoc Stephan Geuter showing this selection bias that I talked about earlier as a function of sample size in the HCP data. Again, the Marek paper that I also talked about. Here they're showing how resting state functional connectivity correlates with the cognitive ability as a function of sample size, and as you see, if you have this small n, you can hit the jackpot and get a correlation of 0.8, but as the sample size increases, you tend to get smaller, but again, more reproducible results. So I think this figure sort of captures that point as well.

Finally, I've been talking about these sort of big funded studies, but there's also several grassroots initiatives geared towards sharing data, collected under similar experimental paradigms, so that's been -- the 1000 Functional Connectomes is one example, where they shared resting state data. Another area which I think is an alternative approach is to share models across different groups using a sort of federated approach. Here the models are run locally and data derivatives are passed forward. And this is a nice way of sharing data, because you don't actually have to share the data, you just have to share the output of the data. I think the ENIGMA Consortium has used this to great benefit in imaging genetics, but you could do this for other things. An example I'll show on the next slide here. But both of these are important for reproducibility and generalizability of stability of results across different settings.

So here's an example. This is work with Wani Woo and Luke Chang and Tor Wager and myself about building better biomarkers, and so one of the things that we found, we did survey of the literature of people doing predictive modeling in neuroimaging, and we looked at like 500 studies, and we found that of those 500 studies, only 9 percent of them actually -- you know, they developed these really cool machine learning techniques, but only 9 percent of them applied it to independent datasets or did anything more with the model. So basically what people are doing is they published these results, and then they were sort of done with it, and they didn't test these things and try to see whether or not they were generalizable to a broader setting and whatnot.

So what we proposed is this biomarker development procedure where you could perform a broad exploration followed by increasingly rigorous assessment of the method. So you start out by developing your predictive model, and then you would apply it to new samples, you'd apply it to new samples from different labs using different scanners, you would apply it to different populations, and whatnot. Here we found that this federated approach is quite useful because what we did is once you have these predictive models, it's just a matrix of numbers, and you can send these and some code to labs that are doing similar experiments as yourself, and then they can run it and they can give you some results back.
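A minimal sketch of what "a matrix of numbers and some code" could look like in practice, assuming a linear predictive model whose output is the dot product of a shared weight map with each local brain map. The three-voxel weights, the condition labels, and the function names are hypothetical; this is not the actual code shared with collaborating labs. Only the summary leaves the site, not the images.

```python
from statistics import mean

def signature_response(weights, brain_map):
    """Apply a shared linear model: dot product of weights and one brain map."""
    if len(weights) != len(brain_map):
        raise ValueError("weights and brain map must cover the same voxels")
    return sum(w * v for w, v in zip(weights, brain_map))

def site_report(weights, maps_by_condition):
    """Run the model locally and return only derived summaries per condition."""
    report = {}
    for condition, maps in maps_by_condition.items():
        scores = [signature_response(weights, m) for m in maps]
        report[condition] = {"n": len(scores), "mean_score": mean(scores)}
    return report

# Hypothetical shared model (one weight per voxel) and hypothetical local data.
shared_weights = [0.5, -1.0, 2.0]
local_data = {
    "painful": [[1.0, 1.0, 1.0], [2.0, 0.0, 1.0]],
    "non_painful": [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
}

report = site_report(shared_weights, local_data)
print(report)
```

Comparing the returned scores between conditions at each site is one simple way the originating lab could learn about sensitivity and specificity without ever receiving raw data.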

So we have this thing called the neurologic pain signature, where we try to predict physical pain from brain data, and we were able to send it to collaborators at test sites across the world, and that allowed us to learn a lot about sensitivity and specificity of the method for pain. It's sort of an easy thing to do, but it's not a free thing, it takes time and effort to do this. So whenever I present this, people say how do you fund that kind of work? I guess that's sort of difficult, but now that I have NIH on the line here, this might not be a good -- this is a good bang for your buck, if you have ten studies that you've financed with $2 million each and they're all smallish, it might be worth a couple of hundred thousand dollars to fund some study that sort of combines them into a nice way. So that's my little plug there.

In general, I think I'm running out of time here, so the points here, increasingly large sample sizes are becoming available, lots of interesting new statistical challenges and opportunities, so it's a great time to be a statistician in this new environment. Again, don't expect performing status quo statistical analysis will give equivalent results. I think that's true. For example, we'll expect more significant results, they may be smaller, but more reliable effects. And I think there's plenty of opportunity for new analytic approaches.

I think that is it. Thank you so much.

**DR. GUO**: Thank you, Martin. A lot of great insights and suggestions.

Now, we have ten minutes left and also I think a lot of the Q&A questions have already been addressed in the Q&A box. So if you have questions, please keep sending them in and the speakers can type the answers to you.

Now we have Dr. Todd Ogden, to give the discussion of the session. So just an introduction of Todd. He's a professor and vice-chair in the department of biostatistics at Columbia University, and currently is an expert in method development involving functional data and high-dimensional data settings in precision medicine. He has had a longstanding interest in imaging data, particularly with brain imaging using PET and has contributed to that field in many ways.

So, Todd.

**DR. OGDEN**: Thank you for three really fascinating talks. The session is all about reliability and reproducibility, and I think, Ying, you did a great job selecting these three speakers, because I think they really covered a lot of the range that goes into this. As they were speaking, I just jotted down a few notes of things that occurred to me that I thought might be worthy of a bit more discussion.

The first thing is, you know, we think a lot and we talk a lot about effect sizes, small effect sizes and large effect sizes, and so forth. And I think it's sort of natural to think when we see a small effect size, like some of the ones Martin showed near the end of his talk, if we see an effect size of r equals .11, I think it's natural to think well, that's junk, it's not meaningful. If we're finding things like that we're really wasting our time, we should be looking in other directions.

But like Martin and others have said, they might be smaller, they might be more reliable, but what does it even mean for an effect size to be small? And if we think about it in terms of an r, a lot of what we're trying to do is we've got a bunch of numbers that come out of the imaging scanner, and we're trying to correlate that to a bunch of numbers that come out of some measures of some psychiatric condition.

What would a big effect look like? How big would that be? Because in the denominators of both of these is some measure of variance, so we know -- what are the sources of variance? We've talked a lot about the sources of variance in imaging, and I'll run through those in a minute. But what about the sources of variance on the other side? The measurement of symptoms and the rating scales in psychiatry. These don't have, just because of their very nature, they don't have a whole ton of reliability on their own. There are all these rating scales that have been validated in various ways, and often used in ways they were never intended to be used. But they're indirect, they're reported by a patient describing his or her own symptoms, and then all of this is based on either DSM criteria or the RDoC construct.

But I think everyone would agree that all these psychiatric mental health measurements we make are subject to lots and lots of uncertainty, day-to-day fluctuation in terms of the subjective nature of measuring these things, and huge heterogeneity in the populations, even when we restrict things in terms of RDoC ideas.

So that's the mental health side of things, and then the other side of things is imaging. We've talked a lot already in this session about some of the sources of variance in imaging, including calibrating the scanner, the scanner itself, the sequences that are used, and certainly the preprocessing, even just motion correction and so forth, all of these contribute a lot of noise, variance, to the imaging side of these measurements.

Then Tom talked a lot about the modeling that's used, the algorithm that's used to do the modeling, the software, and even the version of the software, that's being used to try to figure out these correlations. That's not to mention model hacking, if we're really searching for a high correlation.

And then beyond that, just the multiplicity adjustments that we need to make. There might be a very, very strong signal, but it's isolated to a small number of voxels or maybe it's a widespread pattern, but it's relatively small compared to all the noise and uncertainty. So in order to get some meaningful, reliable results, I think it's really important to think about and really to account for all these different sources of variation, and I think among our clinical counterparts it might seem very natural to just covary for which scanner was used, or covary for which motion correction algorithm was used, if we're trying to combine different studies. But I don't think that usually makes a lot of sense. It's a much more difficult problem than that.

Finally, I'll just add, it's quite difficult, I think, given all these constraints, the variability on the mental health side of things and the variability on the imaging side of things, to make even like population level claims. This population differs from that population on average based on whatever.

Ultimately, I think what we would like to do in psychiatry is individual-level statements, and Martin talked a little bit about this, as well. So all of these issues with the variance have a big effect on population-level inference, and I think it makes it even more challenging to do individual-level stuff. With the individual level, we can't just get an easy out by combining more studies, right? We really need to try to exclude a lot of that variance, the noise, so that we can really understand what's going on at an individual level.

So a lot of this seems a little discouraging, but what sort of keeps me going is even if these effects are relatively small, and I sort of clouded what I even mean by small effects, but even if an effect exists but is small, the incidence of mental health conditions is so widespread that if this eventually reaches clinical practice, it can have a tremendously huge impact, because so many people have the condition that we're studying.

So anyway, that's just a few things that occurred to me during these presentations, so Ying, I'll throw it back to you.

**DR. GUO**: Thank you, Todd. Now I'd like to welcome our panelists in the session to open the cameras and mic again and just to respond to Todd's discussion, and also there are still some questions in the chat, in the Q&A session. So if you want to address that here, that would be great, too.

**DR. NICHOLS**: I've already responded to a couple of questions. I think there's a really important distinction to be made between the big n studies and small n studies. I think we all want to be going there, but I think there's still a role in cognitive neuroscience for modest n studies. So I think a lot of what I talked about is hopefully not so important when you have thousands of subjects and you've extracted a hippocampal volume and you're going to find an association or not. I guess we could look at that, but I think you're going to see the greatest fragility of results when you have the smallest n.

So we could either decide that we should never do that stuff and minimum n should be 150, which the Marek paper that Martin references I think says maybe even more than that, or if we are going to be in the small n, that we do need to pay attention to this kind of stability fragility.**DR. YU**: I like to make a comment. I think it's really interesting educational for me to see other more detailed work on neuroimaging, but one thing more like a still outsider is that I think when we focus on a specific mono-focus problem, maybe there are a lot of the issues would become more clear than do the general inference, that I think there's just so many sorts of variabilities, as all of the speakers and discussants have pointed out, but maybe there's some sweet spot that for a particular problem there would be consensus, and the variabilities will be not as overwhelming as some other problems, and we can make progress.

The other thing, what Todd was saying, my thoughts is that it's true that individual level is very difficult. I'm very interested in like precision medicine, but of course it's expensive. But I think there's also a sweet spot in the sense of when we can afford, let's have a sequence of neuroimaging for particular individual, and again, on the bio issue, when we look for change, hopefully that will give something that easy for us to detect, right? I think if you look at biological beings, the aggregation cross-sectional aggregation, is problematic. I believe that we'll have an online dynamic system running and we have different kind of equilibrium, and if we compare our paths, we can see, oh, somebody is not doing well, but if you put in relative to other individuals, that signal get lost.

But there's a cost problem, I understand that. So more individual, more like time-series, dynamic analysis, might also be a sweet spot that things can be, progress can be made.**DR. GUO**: Thank you, Bin, for the wonderful input. Unfortunately we have to wrap it up now because our 80 minutes is up. But I think this is a great start of an important discussion among our statisticians working in this field, and I think a lot of the things we talked about today, very briefly, I think we just touched the bases briefly, but I feel like we're going towards the same direction with a lot of similar approaches. So if we as a statistical community can combine our brain, our minds together, I feel like we can really push out some new framework and new solutions to help address this rigor and reproducibility issue.

So thank you again for all the panelists, the speakers, and Todd for the great discussion, and just a reminder, we do have an end of the day session at the end of I think today. So we're going to come back and summarize for the whole day, all the workshops. And thank you again for all the speakers and Todd and also thank you all for attending the workshop.

Abera, back to you.**DR. WOUHIB**: Thank you, Ying. It is another great session with really great speakers and it is very important subject areas to our institute.

Just some housekeeping, some attendees are expressing interest in getting presentation slides. I don't think we have slides from all the presenters, and it will be more appropriate if provided by the presenter himself or herself, and it would be easy if you contact the presenter whom you are interested in for his or her slides. Please, go to the website and try to figure out just -- there would be name and affiliated institute, and it would be much easier to get their email address. Unfortunately we don't have the email address to provide to presenters.

That said, we will convene at the top of the hour, which is 1 p.m. Eastern time, and thank you again. I really appreciate for your being here.**DR. GUO**: Before we go, just one final thing for some of the speakers in the sessions, there's still some questions in Q&A. For example, there's a question for Martin and Todd. If you can, if the speakers in the session can look through those questions and provide some answers in the Q&A, that would be great. Thank you.**DR. WOUHIB**: Thanks.

(Luncheon Break)

**AFTERNOON SESSION**

**Agenda Item: SESSION III: Statistical Testing and Power Analysis for High-Dimensional Neuroimaging Mental Health Data**

**DR. WOUHIB**: Hello, everybody. This is the third session, and I would like to introduce the moderator for the third session, Dr. Dulal Bhaumik, professor of biostatistics at the University of Illinois, Chicago. The topic of this session is statistical testing and power analysis for high-dimensional neuroimaging mental health data.

The session has wonderful speakers and two discussants. Also, Dr. Bhaumik is one of the presenters in the session, with the title Power Analysis for High-dimensional Neuroimaging Studies, and it's a very hot topic area for mental health studies.

Dr. Bhaumik, please, take it away. Thanks.**DR. BHAUMIK**: Thank you, Abera. Thank you very much for two reasons. One is for organizing this huge conference and in the last almost 18 months felt that we are not part of the world, which again came back through the participation and organization of this wonderful workshop. And thank you for the introduction, also.

Yes, I'll be very happy to moderate this third session. We had two extremely powerful, interesting, important sessions in the morning. The third session, as Abera mentioned, is statistical testing and power analysis for high dimensional neuroimaging mental health data. So all three talks will address hypothesis testing, basically multiple comparison, for high dimensional data. There are some common things in the three talks: the illustration in all three will be via resting state functional neuroimaging data, and as the talks go on, we'll see that, in different ways, we tried to address the correlation and the heterogeneity.

Sometimes it is spatial correlations, sometimes it is temporal correlation, and so on. So hypothesis testing should not ignore those different types of correlation. Otherwise, the results will be questionable and most likely be wrong.

The second question is that we cannot ignore the hypothesis testing or the type I error rate; specifically for big data or multiple comparison, we will be addressing the false discovery rate. We cannot ignore that while determining the sample size for power analysis. The truth is that the traditional concept of power, the statistical power, reject the null and reject the alternative, and it is true, that is the type II error.

That kind of thing is not simply defined for multiple comparison. So we should look for alternatives. There are several alternatives; we'll be looking at those things. So combine those, and different types of correlation and different types of model of course come into the picture, and many previous speakers mentioned that the heterogeneity should come into the picture, the effect size should come into the picture, but what is meant by that effect size when there are so many alternatives, when there are so many nulls and so on?

So we'd like to see how those factors are playing a role in hypothesis testing as well as in power analysis. I am the first speaker, and let me see my talk first. Okay.

The power analysis for high dimensional neuroimaging studies. The research question that we are asking: for traditional power analysis, we do first control the type I error rate, then work on type II error rate. Basically 1 minus the probability of that is the power. An ingredient is most likely effect size; there are many other things that can be addressed for the effect size, especially in the modeling system. That we'll see.

And for power analysis, another thing: for large-scale multiple testing, should it be the type I error rate alpha? Or the false discovery rate? Then for the power, false negative rate or the type II error? Now, homogeneous effect sizes or heterogeneous effect sizes: there are many, many alternatives. That all the alternatives have the same effect size is basically impossible. So we should shift from homogeneous effect size to the heterogeneous effect size, every hypothesis under the alternative can have a different effect size, and so on. But the general question is what are those factors if we really want to address the power analysis or the false discovery rate?

The outline. Introduction. We are bringing linear mixed effects model, largescale multiple statistical inference, power sample size, summary and conclusion.

Now for neuroimaging studies, especially fMRI, we all are very much familiar with that, and most likely I'm not spending my time over that. But for the illustration purpose of my talk, I'll be talking about two groups, one is the late life depression and the other one is the healthy control group. I have ten subjects in LLD and 13 subjects in the control group. The fMRI measure that we'll be talking about, the whole brain basically, the broad brain mapping, which has 87 brain regions and we are talking about the connectivity. So connectivity is the link from one region to the other region by the Pearson correlation coefficient, and we are talking about a total of 3,741 links or the connectivity, and that brings the notion of multiple comparison. We are comparing each link of the LLD with the first (indiscernible) link of the healthy control to figure out what disruptions are there.
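As a rough illustration of where the 3,741 links come from, the sketch below builds one Pearson correlation per unordered pair of regions. It assumes 87 regions (the count shown on the speaker's region slide, and the one that yields 3,741 pairs); the simulated time series and the 120 time points are placeholders, not data from the study.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
n_regions = 87        # assumed region count
n_timepoints = 120    # arbitrary assumption for this illustration

# Simulated stand-in for one subject's regional time series.
series = [[random.gauss(0, 1) for _ in range(n_timepoints)]
          for _ in range(n_regions)]

# One link (connectivity value) per unordered pair of regions.
links = {(i, j): pearson(series[i], series[j])
         for i, j in combinations(range(n_regions), 2)}

print(len(links))  # 3741, the same as math.comb(87, 2)
```

Each of these links is then compared between the two groups, which is what turns the analysis into a multiple comparison problem.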

We can put all these disruptions into several components, and that will be called the different types of networks, like default mode network, then salience network, and so on. But the first question is that for this kind of comparison, what should be the sample size for power analysis and even before that, how can we control the false discovery rate. So these are the different brain regions, the 87 left and right. So one is the central and the other ones are like that.

Now as I mentioned, we put that -- while analyzing this kind of data or testing hypothesis, we should not ignore the correlation. As these voxels are nested within the same brain or the regions are nested within the same brain, so some kind of spatial type of correlation is there, and we are taking the measurements over time. So temporal measurements should be there. So spatiotemporal is a very natural kind of correlation that we should bring while analyzing this type of data.

The model that we have used or I'm presenting here to compare the two groups, I am assuming that the heterogeneity, spatial heterogeneity, I'm going to try to incorporate or address that using a random effect that has a variance and also from one link to the other link, the error variances are not the same. They are changing and that is a huge kind of heterogeneity we are assuming from every region, those errors are different.

On the top of that, we are also assuming that the error variances for one group is different from the other group at every link. So that means that we are talking about a large set of parameters. First of all, the beta 1 for the control group is there will be 3,700 for that, then the study group, another 3,700, then the variance, 3,700, 3,700, and so on and so on. So we are introducing a large number of parameters in the study to incorporate basically the heterogeneity not homogeneous all over the brain, but it is changing from one region to the other region, and that is a very, very kind of big parametric consideration instead of simplifying, just oversimplifying the problem.

And to compare the two groups, we are testing the intercept parameter of one group compared to the other group or the difference of those intercept parameters. I should mention that this is not a longitudinal study. This is a kind of cross-sectional study, just only one point, so there is no trend effect of the slope parameter involved. But the model can be generalized, just like that, and if longitudinal data are available, and we have done that, but that is not what I'm going to present here.

So incorporating those type of variances, we are first defining a test statistic. We can see that under this Z and Z-prime, those all types of variances will be there, and testing the hypothesis, this is the testing of hypothesis, beta-naught minus beta-1, and there are so many parameters, so many hypothesis testing, 3,741, the null hypothesis, the alternative obviously are not relatively zero, and we'd like to find out why those are different.

Now, anybody who's familiar with that, when we are talking about large-scale comparison or high dimensional comparison, basically we are depending on what is called the false discovery. By discovery we mean significant, and nondiscovery means not significant, and there is declared nonsignificant and declared significant. So this V is extremely important, declared significant. That means we are telling that it is a discovery, whereas it is not. This is from the null. The total number of declared discoveries or the significants is R. So V is a bad thing to do. But it happens.

T on the other hand, those are all true alternatives, but we said that, no, those are not significant. So that is another mistake what this T.

Now playing the role of V and R, we define basically the false discovery, and T and M minus R, we define the false negative, FN, and then rate and so on.

So the first question is if we do not address the multiplicity issue with the large comparisons, then what will happen? The answer is very simple. Our type I error rate may be close to 1. It depends on the number of hypotheses we are testing, and that is a very, very nonresult. So we can see that in this graph that when the number of hypotheses is only 100, the general type I error rate is reaching close to 1, and that is a very nonresult.

And we can see numerical figure also under the assumption that the tests are independent, and the type I error rate is .994, which is close to 1. So we should address the multiplicity issue.
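The .994 figure can be reproduced with the standard formula for the family-wise type I error rate of m independent tests, 1 - (1 - alpha)^m. The sketch below assumes alpha = .05 per test and m = 100, which is consistent with the numbers quoted but is not stated explicitly in the talk.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false rejection among m independent
    tests, each carried out at level alpha with every null true."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 10, 50, 100):
    print(m, round(familywise_error_rate(0.05, m), 3))
# For m = 100 this gives 0.994.
```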

Now, when we are talking about the false discovery rate, basically it was started in 1995, the first generation false discovery rate, by Benjamini and Hochberg, and I'm fairly sure all of you know this, then the adaptive Benjamini-Hochberg, and then Efron gave the flavor of a Bayesian kind of false discovery rate; most of the time it is called local false discovery rate.

In order to understand the basic difference between the Benjamini Hochberg type of approach and Efron's approach, Efron basically used more information from the data to develop his local false discovery rate. More information means he's using that what is the distribution of the null, null p-values. What is the distribution of all the p-values? How many -- what is the proportion of the null hypothesis? So all types of ingredients is bringing up before giving it the flavor of the Bayesian approach or the Bayesian flavor to control the false discovery rate.
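The ingredients listed here (the null distribution, the distribution of all the test statistics, and the null proportion) combine in the usual two-group form of the local false discovery rate, fdr(z) = p0 * f0(z) / f(z). The sketch below is a simplified illustration of that idea, not Efron's actual estimation procedure: it assumes a theoretical N(0, 1) null, treats the null proportion `p0` as known (in practice it is estimated from the data), and estimates the mixture density with a basic Gaussian kernel smoother. The simulated z-scores are hypothetical.

```python
import random
from statistics import NormalDist, stdev

random.seed(1)
m, p0 = 5000, 0.9   # number of tests and null proportion (both assumed)

# Simulated z-scores: nulls from N(0, 1), alternatives from N(3, 1).
z = [random.gauss(0, 1) if random.random() < p0 else random.gauss(3, 1)
     for _ in range(m)]

f0 = NormalDist(0, 1).pdf                 # theoretical null density
h = 1.06 * stdev(z) * m ** (-1 / 5)       # Silverman's rule-of-thumb bandwidth
kernel = NormalDist(0, h).pdf

def mixture_density(x):
    """Kernel density estimate of f, the density of all z-scores."""
    return sum(kernel(x - zi) for zi in z) / m

def local_fdr(x):
    """Two-group local fdr, p0 * f0(x) / f(x), capped at 1."""
    return min(1.0, p0 * f0(x) / mixture_density(x))

for x in (0.0, 2.0, 3.0, 4.0):
    print(x, round(local_fdr(x), 3))
```

Values of z near zero get a local fdr close to 1 (almost certainly null), while values far in the tail get a local fdr close to 0.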

So in our data, when we are using different types of the FDR approach, we got, as obviously expected, different significant connectivities; sometimes this is underconnectivity by negative value, sometimes overconnectivity, and so on. So the empirical based approach, kind of Bayesian type of thing by Efron, and the empirical null also by using an Oracle approach by Sun and Cai. These are in 2002 to 2010 or 11, 12, so in that decade. In those years, they developed this kind of thing.

So now that we have done the simulation study, the total number of subjects in the study are 23, and we are talking about now controlling the false discovery rate. So what is the influence of the false discovery rate on sample sizes if we increase, can we get a smaller false discovery rate? That was the question.

And the other question was that there are so many different approaches so how these methods are working in terms of controlling the false discovery rate. All of you have worked on Benjamini Hochberg approach, you know that there is a Q value, which is a kind of type of thing I'll talk about, but not really, but Q value, it has become more or less kind of subjective type of thing. What should be the Q value? There is no specific answer to that, unlike the alpha should be .05 all over the world for at least medical data.

But here, the Q value can vary, and in fact, it should vary. I have a publication, I showed that how the Q value should be chosen based on the number of null hypotheses, based on the concept of alpha and based on the power of .80, how from that data extraction, what should be an appropriate value of Q? Anyway, so when we chose different values of Q, but this Q is not that much subjective. It can be very much well-defined for Q values. We see that as sample size is increasing, our false discovery rate is basically decreasing, and when we are comparing across the different methods, adaptive, theoretical null, empirical null, and so on, now we see that for a 100 sample size N and for say Q value .05, then Efron is doing very, very good, the false discovery rate by Efron is .051. Whereas Benjamini, adaptive Benjamini, we put that Benjamini is that rate is .122, whereas it should have been .05, and the empirical null, that means under the normal distribution, it is .086.

So there is a reason and we have done many, many simulation studies for other types of (indiscernible) data also, and all that time, we have got better results by Efron and I believe that because of incorporating so much information from the data, while computing the local false discovery rate, probably that's the reason for better results for Efron's Bayesian type of approach. So we have done this kind of thing.

Then we have identified what are the significant regions compared to the control group and 56, what is that, right caudal middle frontal. That is the kind of hub of disruptions for the data. But it is just the data analysis.

Now the power and sample size. So the power and sample size on the other hand, it takes talking about the error rate, the type II error rate basically the nondiscovery rate, and that is denoted by (indiscernible) T over m1, or so on. Okay.

Now, let's go through this results and see how the power is -- how different things, different players, are playing the role in computational power. This is the first picture. The nondiscovery rate and the false discovery rate. So that the m1 is the positive discovery rate as it is increasing. That means the type I error rate is increasing, then we can see that the nondiscovery rate, not the power, but 1 minus power, which is the nondiscovery rate, it is also decreasing; as it is increasing, the other one is decreasing. That means as the false discovery rate will increase, the power will be increased. Yes, the power will be also increasing. Just the traditional thing, what we have learned, it is following that. But the y-axis here is the 1 minus power of that nondiscovery rate.

Okay. So the marginal false discovery rate, as you can see, basically this is a kind of approach by Efron, is taking care of the proportion of null, proportion of alternative and so on. Now this is another interesting picture. So all throughout our power analysis, we are talking about what is the effect size, what is the heterogeneity, but we never talked about in multiple comparison, what is the role of the null proportion?

That is very important. Now, heuristically, it is very easy to understand. If we have more and more null, then we'll take the whole night to find out an alternative or significant. That means we need more sample to get the power, because most of them are null hypothesis.

For mental health studies, late life depression, our proportion of null is very, very high, like .98. So we see that the null is playing a big role for the power analysis, and Q also, what value of Q we should set up for the power analysis.

So all these things are important. Now this one is saying that the effect size in a way, delta and FDR, both of them are playing a role for power analysis. This is the first picture, first graph of our analysis. Sample size and power, and we have to -- if we have a delta, we should do all types of basic analysis to figure out what is the proportion, what is the null, distribution of the p-values, what is the alternative distribution of the p-values, what kind of mixed distribution we should incorporate, what are the variances, bring all those things for computation of the power.

And then, if we use one of the formula like the negative discovery rate 1 minus of that power and bring those, this delta is basically playing the role of the effect size, and FDR is also like that. Then we can see that sample size will, as increasing, three different things: first of all, more effect size will require less sample size, now more FDR will be giving us the more power, but that may be often misleading, so it should be set to a reasonable value for the study after exploring that data.

And this also saying that the role of P-naught, the proportion of null, if the proportion of null is very, very high, then it will require larger sample size to achieve power, and the blue one is the proportion of null is very high. Proportion of null is very high means what? There are only few alternatives, few discoveries, and you need a big data, big sample size, to discover that. So that is basically ignored in the literature more or less, while doing the computation.
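The effect of the null proportion on power can be seen in a small simulation. The sketch below is a deliberately simplified stand-in for the analysis in the talk: it assumes independent two-sample z-tests, a single homogeneous effect size `delta`, and the Benjamini-Hochberg step-up procedure at level `q`, none of which is the speaker's mixed-effects model or Efron's local FDR. Power is taken as the fraction of true alternatives that are declared significant, that is, 1 minus the nondiscovery rate.

```python
import math
import random
from statistics import NormalDist

def bh_reject(pvals, q):
    """Benjamini-Hochberg step-up: indices of hypotheses rejected at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank   # largest rank meeting the step-up condition
    return set(order[:k])

def average_power(p0, n_per_group, delta, q=0.05, m=2000, reps=5, seed=0):
    """Fraction of true alternatives discovered, averaged over reps."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Mean of a two-sample z-statistic for standardized effect delta.
    shift = delta * math.sqrt(n_per_group / 2.0)
    m1 = round(m * (1 - p0))   # number of true alternatives
    found = 0
    for _ in range(reps):
        # The first m1 hypotheses are alternatives, the rest are nulls.
        zs = [rng.gauss(shift if i < m1 else 0.0, 1.0) for i in range(m)]
        pvals = [2 * (1 - std_normal.cdf(abs(zv))) for zv in zs]
        rejected = bh_reject(pvals, q)
        found += sum(1 for i in rejected if i < m1)
    return found / (reps * m1)

for p0 in (0.5, 0.9, 0.98):
    for n in (20, 60):
        print(p0, n, round(average_power(p0, n, delta=0.8), 3))
```

Under these assumptions, a null proportion of .98 gives much lower power than .5 at the same sample size, and a larger sample is needed to recover it.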

So what is the summary and the conclusion of that? The conclusion of that, we should be very careful about false discovery rate. Efron's procedure at least in my many simulations looks better to control by FDR or local FDR. Non-discovery rate I used as the basis for power, and the effect sizes, instead of a fixed effect sizes, I assumed it has a distribution, so while doing the simulation, we had basically randomly choosing an effect size from its lower and upper percentile from the inside thing.

And sample sizes decrease as effect sizes increase, as you all know that, and the new thing here is the null proportion increases, the sample size also increases.

Thank you very much for listening to my talk.

Okay, so let me go to the next speaker. Our next speaker is Rajesh Nandy. He is from University of North Texas. He'll be also touching the multiple comparison in the context of fMRI data, but his modeling system is semiparametric approach, and let's listen to him. Thank you.

Rajesh?

**DR. NANDY**: Thank you for inviting me to present. It's a pleasure to be part of this. So, today what we are talking about, the problem is kind of from the early days of fMRI data analysis. However, the solution that I'm presenting is a bit unconventional, and we will go over that. So in a very classical fMRI data analysis framework, in the early days, typically a subject would be presented a task paradigm inside the scanner, and so what we do as statisticians, we look at the expected brain response to the presented stimuli and then we try to identify which voxels are providing signals which correspond to the expected response, and we essentially are looking for or testing for a possible fit with the modeled basis functions or the regression functions.

However, even though this approach is pretty simple, we all know that there are three issues with making the classical inference challenging in functional MRI inference-based approach, and also I should add that that's what also makes the area so fascinating, because most of the things that we have learned in conventional statistics courses do not directly apply here, and that's what makes it a fertile area of research and new frontiers in terms of how to do statistical inference.

So the three key things that we face here are the first being a strong temporal autocorrelation in the fMRI data, which is inherent. So that means in regression, we kind of always assume that the errors are uncorrelated so that has to be adjusted. Then obviously, there is this issue of multiple testing, because we have these hundreds of thousands of voxels potentially. So these two are well-known. However, the third issue which actually is pretty strong, but rarely addressed in the context of fMRI data analysis, is the inherent low frequency processes in the human brain.

These processes are there irrespective of whether we are performing any active function or not, and these follow certain low frequency patterns. So what that means is that when we are doing an inference on activation data, we need to take into account what's going on with these low frequency processes.

Now, sometimes people do use some correction using some kind of filters to kind of get rid of the low frequency processes, but still, when we do that, we lose some information and also that itself is problematic when we have, like, periodic tasks, which we will discuss in the next slide.

Actually, not in next slide, in one of the subsequent slides. So the approach that I will present in this particular talk would actually address all these three problems in one shot and which I consider to be the appeal of this approach. Now, with multiple comparisons problems, the simplest solution at least in the context of classical fMRI inference is to do a Bonferroni correction. However, because of the sheer number of voxels involved, that can be really conservative, because if we simply divide the alpha by number of voxels, that is too conservative.
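To give a sense of how demanding the Bonferroni correction becomes, the sketch below computes the threshold a test statistic must exceed when alpha is divided by the number of voxels. It assumes a two-sided z-test and illustrative voxel counts; the talk itself works with F statistics, so the numbers here are not comparable to the thresholds quoted later.

```python
from statistics import NormalDist

def bonferroni_z_threshold(alpha, n_tests):
    """Two-sided z threshold after dividing alpha by the number of tests."""
    per_test_alpha = alpha / n_tests
    return NormalDist().inv_cdf(1 - per_test_alpha / 2)

for n_voxels in (1, 1_000, 100_000):
    print(n_voxels, round(bonferroni_z_threshold(0.05, n_voxels), 2))
```

With a single test the threshold is the familiar 1.96; with voxel counts in the hundreds of thousands it moves far into the tail, which is why the correction costs so much power.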

So a better approach or a more accepted approach is to estimate the family-wise error rate using the theory of random fields, and with that approach, we also need some kind of presmoothing. Now what I have observed is that most of the time, interestingly, the random field estimates do not offer any improved result compared to Bonferroni even though it's a more sophisticated approach.

Okay, now let's talk about the third issue that I started with. These are the low frequency processes in human brain, and what is well-known is that even in resting state, there are many processes which has frequency range less than 0.1 hertz, and not only that, these processes are often synchronous, which actually can be a good thing if we are doing functional connectivity studies and Dr. Bhaumik presented his work in the previous presentation, where that was key. However, that could be a problem in a more conventional inference risk analysis where you have a specific task that the subject is performing.

This problem is most severe when we have these block-design paradigms where the subject is performing tasks in a periodic on and off fashion. Now we know that the power of this low frequency processes are characterized by 1 over f, so that means lower the frequency, stronger its effect. Now, the usual parametric estimation procedures used in popular software like SPM or FSL, they don't really take into account the effect of these low frequency processes, and also what we will see that the phase of these processes also play an important role.

Now as I said earlier, sometimes we can do some kind of filtering, but that would also filter out activation if we are doing these on/off type of designs. Now with event-related designs, that solution works reasonably well if you get rid of the low frequency processes, but the approach that I am presenting will work in all types of design.

Here I am giving just example of the role that the low frequency processes plays, and also its phase. So what we have done here is in this particular figure, so the fMRI data is still resting state data. However, I am fitting like a model with an on/off paradigm, and you can see it on the top right. So that is the design matrix, and of course anything detected to be active here would be a false positive, and what we see here is that we have about 11 clusters and a bunch of positives, and these are obviously all false positives, because there is no active task involved.

Now, in the next slide, it's the same thing, but it's for a different subject. So again, it's still resting state data, and we are fitting the same model, but as you can see for this particular subject, what we have is we have a lot more active voxels being detected, and that's probably due to the fact that because of these low frequency processes, if the phase is matched then of course we will have more false positives, whereas if there is a mismatch in the phase, then there will be fewer false positives, and there is no way for us to know whether there is a phase match or a phase mismatch.

So this kind of illustrates the problem of kind of calculating parametric (indiscernible) distribution as is conventionally done, because that might mean that we will have a lot of false positives simply because of these low frequency effects, and there should be some correction for that.

So what we are proposing here is that instead of parametrically calculating the null distribution of course, without being able to calculate the null distribution, there is no inference, so that's the reason people need to make the assumptions, all the assumptions for parametric inference in terms of the distribution. So here what we are doing is that we want to use resting state data as null.

Now of course I already said, and it's well known, that resting state brain have some inherent brain activities, and they don't disappear in the activation state. So that's the key thing. So if we are calculating or trying to control the type I error rate, then we also have to control for that, and so that's why I am a proponent of using resting-state data as null whenever it's possible.

So how do we then correct for multiple testing using null data or resting state data as null data? For that, what I will do is I will also calculate the family-wise error rate, and there are many techniques for that. Each has its strengths and weaknesses, which are listed here. So there are some methods that are purely nonparametric, and generally those methods are very intensive computationally and can take a really long time, even with modern-day computers, to come up with this error rate calculation. So I am proposing a method which is a hybrid approach, and that is a semiparametric approach. Which actually solves a lot of these problems and also not intensive computational.

So my method is also a bootstrap or resampling type techniques. However, instead of resampling raw data, I would be resampling order statistics. So obviously, if we do like a simple resampling of all the observed values of the statistic and all the voxels, then there would be a large point mass on the maximum value. So what we will do is we will define something called normalized spacings, shown in the next slide, which actually has a behavior of i.i.d. random variables, and we will be resampling the normalized spacings.

So how do we define normalized spacings? What we do is we first calculate the relevant test statistic, which can be a simple t statistic and can be any statistic, B, F, doesn't really matter. So we calculate the statistic using parametric methods at all voxels, and let's label them X1 through XN, where N is the number of voxels considered in the brain, and then we order them in descending order, where X1 is like the largest and we order them in descending order, and then the normalized spacings are defined as d-I, equal to i where i is the index for the order statistic, and then we multiply by the difference of the ith order statistic and the next order statistic. It can be theoretically shown that if the observations are i.i.d. exponential, so are the normalized spacings.

However, the actual statistics are unlikely to be exponentially distributed. In fact, most of the times they won't be. However, if we know the parametric distribution, we can make the observations approximately exponential by taking a suitable transformation. For example, taking negative log transformation of the p-values.
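The definition of the normalized spacings, d_i = i * (X_(i) - X_(i+1)) with the statistics sorted in descending order, can be sketched as follows. The example uses simulated null p-values and the negative log transformation mentioned above, so the transformed values are exponential with rate 1. The last part, which resamples the spacings with replacement and rebuilds the maximum from X_(1) = X_(n) + sum of d_i / i to obtain a family-wise threshold, is one plausible reading of the resampling step; whether it matches the speaker's exact scheme is an assumption, and in the talk the null values would come from resting-state data rather than a simulation.

```python
import math
import random

def normalized_spacings(values):
    """d_i = i * (X_(i) - X_(i+1)), with X_(1) >= X_(2) >= ... >= X_(n)."""
    x = sorted(values, reverse=True)
    return [i * (x[i - 1] - x[i]) for i in range(1, len(x))]

random.seed(2)
n_voxels = 5000   # illustrative voxel count

# Null p-values are uniform, so -log(p) is exponential with rate 1.
pvals = [1.0 - random.random() for _ in range(n_voxels)]  # in (0, 1], avoids log(0)
stats = [-math.log(p) for p in pvals]

d = normalized_spacings(stats)
mean_d = sum(d) / len(d)   # near 1 if the spacings behave like Exp(1)

def resampled_maximum(spacings, minimum, rng):
    """Rebuild a maximum from spacings drawn with replacement,
    using X_(1) = X_(n) + sum_i d_i / i."""
    n = len(spacings)
    return minimum + sum(rng.choice(spacings) / i for i in range(1, n + 1))

rng = random.Random(3)
maxima = sorted(resampled_maximum(d, min(stats), rng) for _ in range(200))
threshold = maxima[int(0.95 * len(maxima))]   # rough 95th percentile of the maximum

print(round(mean_d, 3), round(threshold, 2))
```

Resampling the spacings rather than the raw statistics avoids the point mass on the observed maximum that a plain bootstrap would produce.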

And also there are some asymptotic results where we said that this distribution of the normalized spacings is somewhat robust to the violation of the exponential distribution assumption as long as they belong to type 1 Gumbel extreme value distribution.

Now, are the normalized spacings i.i.d.? That's a key assumption for exchangeability if you are going to implement the bootstrap. So what we have seen is that of course strictly speaking they're not i.i.d. However, the violation is minimal, and even if it's violated, it can be shown that the estimate is more conservative, which means that we would still be able to control the type I error rate at the appropriate level.
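One way the resampling step could look is sketched below. This is an illustrative reading, not necessarily the exact algorithm from the talk: it relies on the identity that the largest order statistic equals the sum of d_i / i over all i, resamples the spacings with replacement, rebuilds a bootstrap maximum each time, and takes an upper quantile as the family-wise threshold. The function names, `alpha`, `n_boot`, and the simulated statistics are assumptions.

```python
import numpy as np

def normalized_spacings(x):
    """d_i = i * (X_(i) - X_(i+1)) in descending order, with X_(N+1) := 0 (assumed)."""
    x_desc = np.sort(np.asarray(x, dtype=float))[::-1]
    i = np.arange(1, x_desc.size + 1)
    return i * (x_desc - np.append(x_desc[1:], 0.0))

def spacing_bootstrap_threshold(x, alpha=0.05, n_boot=1000, seed=0):
    """Upper (1 - alpha) quantile of the bootstrap distribution of the maximum."""
    rng = np.random.default_rng(seed)
    d = normalized_spacings(x)
    n = d.size
    i = np.arange(1, n + 1)
    # Each row is one bootstrap sample of the spacings, drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    # Rebuild the maximum from resampled spacings: X_(1) = sum_i d_i / i.
    boot_max = (d[idx] / i).sum(axis=1)
    return float(np.quantile(boot_max, 1.0 - alpha))

# Hypothetical null statistics on the -log(p) scale for 2000 voxels.
rng = np.random.default_rng(0)
x = -np.log(1.0 - rng.uniform(size=2000))
threshold = spacing_bootstrap_threshold(x)
```

The threshold here is on the transformed scale, so it would be compared against the negative log p-values rather than the raw statistics.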

So here is an example, I will just give the details on the imaging.

Here what we are doing is we are comparing the threshold estimates using the semiparametric method we proposed and the usual random field theory, which is implemented in SPM and FSL and other popular software.

So what we did here, for the sake of simplicity and lack of time, we actually used Fourier basis functions, which are phase invariant. So that way we wouldn't have to worry about phase mismatch, because the phase will automatically be accounted for by the Fourier basis functions. So in this case, we had resting state data, and from the same subjects we also had activation data from a periodic phoneme matching task with four periods, where each period is 36 seconds on/off. That's how the periods were constructed.

Now if we use the random field threshold, and here it would be an F statistic because we have multiple predictors, the corrected threshold would be 6.67 for unsmoothed data and 6.69 for smoothed data, and there were 20 false positives for the unsmoothed and 28 false positives for the smoothed data. The Bonferroni corrected threshold is 6.674, which is essentially the same as the random field threshold. However, our order statistics thresholds are much higher, 12.47 and 12.55, which essentially control the false positives much better.

So here is essentially from SPM2 what we did using unsmoothed resting state data, and you can see all the false positives detected using the Fourier basis functions depicted in the design matrix on the top right.

So this is the actual activation map, and this is using random field approach and you can see these large blobs, and there are also some obvious examples of false detection.

This is what we did when we used our proposed method using order statistic, and the map is much cleaner, as you can see here.

I think that is the last slide. So thank you for the opportunity.

**DR. BHAUMIK**: Thank you, Rajesh. I think before going for any kind of discussion, let's have the talk from the third speaker, who is Deepak N. Ayyala, University of Augusta at Georgia, and who will be talking about adjusting for confounders in cross-correlation analysis of resting-state networks.

Deepak, are you ready?

**DR. AYYALA**: Thank you. I'd like to thank Dr. Bhaumik and Dr. Wouhib for the wonderful workshop and opportunity to present my research.

So, today I'll be talking about confounders in cross-correlation analysis of resting state networks. The main idea of this study was to do a test-retest reliability of resting-state networks, and this was joint work done with my PhD mentors, Dr. Roy, Dr. Park, and Dr. Gullapalli from University of Maryland School of Medicine provided us with the data.

So as you all know, functional connectivity MRI identifies regions of the brain which demonstrate fluctuations in the blood oxygen level, which can be used to study the functional connectivity between different regions. The spontaneous connectivity pattern when the brain is not being subjected to any task is called the resting state network, which can be used to study the baseline connectivity between different regions in the brain. This has enormous clinical potential in studying diseases such as traumatic brain injury, Alzheimer's disease, and many other neurological diseases. So any deviations from the pattern of connectivity of normal subjects can be seen as a potential biomarker of the particular disease.

So the reason why resting state networks are useful is because of their high degree of reliability, both at the individual level and also at the group level. So many studies have studied the reliability of these resting state networks in memory networks, motor networks, and also in comparison between healthy elderly subjects and subjects with amnestic mild cognitive impairment.

In studying resting state networks, there are two main ways of identifying the network itself. The first method is to use seed-based correlation analysis, which calculates the correlation of the time series of any given voxel with respect to all the remaining voxels to construct a connectivity network between voxels which are simultaneously activated. The other method is to use independent component analysis, which identifies regions of voxels in the brain which are activated using the source separation method, and you're all familiar with both methods. I'll note that these methods are also very reliable and they're consistent for identifying the resting state networks.

But using a voxel-level map is very computationally challenging, because when we are talking about connectivity between the voxels, the number of computations grows as the square of the number of voxels. So if you have 100 voxels, we are looking at 10,000 connections, which is the reason why region-level connectivity maps are much easier to study. So at the region-level analysis, the cross-correlations are not calculated between individual voxels. Instead, a group of voxels is combined together and carved out as a region of interest, and the signal within the region of interest is used to study the cross-correlation between different regions.

So when reducing the data from the voxel level, which is high-dimensional, to the regions of interest -- to the region-level data, popular dimension reduction techniques such as principal component analysis can be used to compress all the information in these voxels into a single value, which represents the signal of the region of interest. Consistent strong correlations between these different region-level signals are an indication that the resting-state network between these regions of interest is reproducible.
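A minimal sketch of that reduction step, assuming each region of interest is stored as a timepoints-by-voxels array. It takes the first principal component score at each timepoint as the region-level signal. The function name, the array layout, and the simulated data are assumptions of the sketch; the talk does not say which implementation was used.

```python
import numpy as np

def roi_signal_pca(voxel_ts):
    """Compress a (T, n_voxels) array into one time series of length T.

    Returns the scores on the first principal component of the
    column-centered data. The sign of a principal component is arbitrary.
    """
    centered = voxel_ts - voxel_ts.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]

# Simulated region: 40 voxels sharing one underlying signal plus noise.
rng = np.random.default_rng(1)
T, n_vox = 171, 40
shared = rng.standard_normal(T)
loadings = rng.uniform(0.5, 1.5, n_vox)
roi = np.outer(shared, loadings) + 0.3 * rng.standard_normal((T, n_vox))

signal = roi_signal_pca(roi)
```

Because the sign of a principal component is arbitrary, correlations between region-level signals computed this way should be interpreted with that in mind, or the sign should be fixed by some convention.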

So this differs from the two talks that we had in the session so far, because we are not trying to identify regions which are activated. For example, as we see in these scans from different visits and different sessions of a given individual, for given slices, we are not particularly interested in identifying the regions which are activated, for example these regions, but what we are trying to study is the reliability of the connectivity between two activated regions or any two given regions of the brain. So we are looking at the connectivity map rather than the activation map itself.

So while these methods are known to be reproducible and are also reliable in studying the resting state network, there have been studies which looked at the effect of any confounders that may be introduced into the experimental design while studying these resting state networks. So common study design confounders are visits. So when an individual comes for different visits on different days, or if an individual -- different scans are taken on the same individual, for a given visit, and we can also have other confounders like gender and other things, but we are not getting into subject-specific confounders. We're only trying to investigate study design based or the experimental design-based confounders, and the other aspect of studying the reliability of these models is that the temporal dependence in these region-level time series may inflate the standard error and can lead to incorrect results.

So what we looked at was to build a comprehensive method to test for the reliability and reproducibility of resting state networks by taking into account the confounders and looking at this region-level correlation in the resting state neuroimaging study. The key challenge here is that there is no existing method -- no distribution for the cross-correlations between the different regions. So what we basically did is calculate the covariance structure of these cross-correlations and derive the asymptotic distribution of the cross-correlations, so that we can build a multivariate analysis of variance model, a standard linear model, to test for the reliability by testing for the effects of the different confounders.

So the motivating data for this work came from a study which was conducted at the University of Maryland School of Medicine in 2011. So seven subjects, all right-handed, were enrolled in the study, which involved three male and four female, and scans were recorded on three different visits. So each subject was recorded on three different visits with a seven-day gap between the first and second visit and a 14-day gap between the second and third visit, and during each visit, three scans were taken, where each scan consisted of images from 171 timepoints with 2 second spacing between images.

The regions of interest were drawn based on anatomical landmarks and functional coherence to the motor network. So the five regions of interest were the supplementary motor area and the left and right primary motor areas, and the left and right dorsal premotor areas.

So when we have a collection of N regions of interest in the resting state network, taking all (indiscernible) connectivities, we would have a total of N choose 2 different connections, but not all of them may be physiologically important or meaningful. So here we only considered a network with C connections. In this particular study, we took six connections, which connected all the primary motor and all the premotor areas to the supplementary motor network, which in a way works like the center, plus the connection between the left and right primary motor areas and the connection between the left and right premotor areas. So those are the six connections that we are interested in. But this can be done using all the N choose 2 connections as well.

So to measure the connectivity between different regions of interest, we used the time series of the signals recorded over all the timepoints within a scan, and say for any given specific visit v, and a specific scan s, during the vth visit, we denote by Xqt(v,s) the time series of length T timepoints for the qth region of interest, where v and s denote the visits and the scans of the data.

Since it's a time series, we are using the cross-correlation at lag zero, which is a common measure for studying the strength of connectivity between two regions of interest. So any typical element of the correlation matrix can be easily calculated using the cross-correlation at lag zero. This will result in an N by N matrix which is symmetric and whose diagonal elements are 1. So we have N choose 2 unique elements. As we said, the N choose 2 unique elements correspond to all the connections, but since we are interested in only C of them, we are going to represent the data as a vector of length C, where each element measures the correlation at lag zero between the kth and the lth region of interest, as long as this connection k,l is one of the C connections that we are interested in.

And to normalize the data and avoid any boundary conditions, we used a Fisher z-transformation, and for any subject, we can concatenate the data across all the visits and scans to give us a single vector of length C times S times V, which is the final data.
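A minimal sketch of how one subject's data vector could be assembled, assuming each scan is stored as a timepoints-by-regions array. The region ordering, the index pairs for the six connections, and the simulated scans are assumptions made for illustration only.

```python
import numpy as np

def fisher_z_connections(roi_ts, connections):
    """Fisher z-transformed lag-zero cross-correlations for selected region pairs.

    roi_ts: (T, N) array of region-level time series.
    connections: list of (k, l) index pairs of interest.
    """
    r = np.corrcoef(roi_ts, rowvar=False)            # N x N lag-zero correlation matrix
    rho = np.array([r[k, l] for k, l in connections])
    return np.arctanh(rho)                           # z = 0.5 * log((1 + r) / (1 - r))

# Assumed ordering: 0 = SMA, 1 = left M1, 2 = right M1, 3 = left PMd, 4 = right PMd.
connections = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (3, 4)]

rng = np.random.default_rng(2)
n_visits, n_scans, T, N = 3, 3, 171, 5

# Concatenate over visits and scans; simulated noise stands in for real scans.
y = np.concatenate([
    fisher_z_connections(rng.standard_normal((T, N)), connections)
    for v in range(n_visits)
    for s in range(n_scans)
])
```

With six connections, three visits, and three scans, the concatenated vector has 54 elements for each subject.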

So the main thing we need now to build a model and test for the effect of the confounders is a distributional assumption on this vector of cross-correlations. So there have been results which studied the distribution of the correlations for independent observations, and also for dependent observations, there are some results for the autocovariances at lag zero and some for autocorrelations. But there is no existing result for the distribution of the cross-correlations at lag zero when the groups are dependent and we have multiple groups, the way we have in our setup with multiple visits and scans.

So that brings us to our main contribution, which is to derive the asymptotic distribution of the Fisher z-transformed cross-correlations at lag zero for any given visit and scan, and the result is that it's asymptotically normal as the number of timepoints goes to infinity. This asymptotic normality is based on standard assumptions which are commonly used in fMRI analysis, which are second order stationarity for the time series of all the voxel-level signals, and having a spectral density which is square integrable.

This result holds for any finite collection of covariances, and the correlation or the covariance between any two of these normalized values is given by this formula, which involves not just the correlation at lag zero, but through this function delta, the entire autocorrelation function for all the lags, from negative infinity to infinity. In practice we have a finite sample, so our inference has to be based on a finite sample estimate.

So we use window-based estimates to replace the infinite summation here, and all the different choices of windows, like Bartlett windows or the Tukey-Hanning or Parzen and quadratic spectral smoothing kernel windows, gave similar results. So there is no specific choice of windowing method which works better than the other choices. And this now allows us to use only the sample autocorrelation function, which can be estimated for lags from negative (T minus 1) to T minus 1.
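The transcript does not reproduce the covariance formula from the slide, so the sketch below only illustrates the windowing idea on a stand-in: the classical sum over lags of the product of two autocorrelation functions, which is the large-sample variance term for the lag-zero cross-correlation of two independent stationary series. The function names, the truncation lag, and the choice of this stand-in sum are all assumptions.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations of x at lags 0, 1, ..., max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:x.size - h], x[h:]) / denom
                     for h in range(max_lag + 1)])

def bartlett_weights(max_lag):
    """Bartlett (triangular) window weights for lags 0, 1, ..., max_lag."""
    h = np.arange(max_lag + 1)
    return 1.0 - h / (max_lag + 1.0)

def windowed_acf_product_sum(x, y, max_lag):
    """Bartlett-weighted, truncated version of sum over all lags of rho_x(h) * rho_y(h)."""
    w = bartlett_weights(max_lag)
    prod = sample_acf(x, max_lag) * sample_acf(y, max_lag)
    # Lag 0 is counted once; lags h and -h contribute equally by symmetry.
    return w[0] * prod[0] + 2.0 * np.sum(w[1:] * prod[1:])

rng = np.random.default_rng(4)
x = rng.standard_normal(171)
y = rng.standard_normal(171)
value = windowed_acf_product_sum(x, y, max_lag=10)
```

Swapping `bartlett_weights` for another kernel (Parzen, Tukey-Hanning, quadratic spectral) would change only the weights, which matches the remark that the choice of window made little difference.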

So coming to the analysis to model the effect of visit and scan, which are the main confounders in the particular data that we are analyzing, we assumed that the observations across subjects are independent and identically distributed, because the individuals are independent, and the cross-correlations are normally distributed by the previous theorem, where the covariance matrix sigma is as presented previously. And the resting state data for any two scans during the same visit or different visits can be correlated. So we are assuming that the scans for the vth visit and sth scan and for the v prime visit and the s prime scan are not independent. And we also test for this to see whether or not the assumption makes a difference in the results.

So we fit a simple linear model to study the connection-specific main effects of visits and scans by writing the mean as an intercept plus the visit-specific effect plus the scan-specific effect, and we also had an interaction term to see if there is a significant interaction between the visits and scans.

So using this model, if we look at the distribution of the concatenated signal vector of length C times S times V as I presented previously, we can write it as Z times theta, where theta is the joint parameter vector involving all the main effects and the interaction effects, plus epsilon-i, which is a normally distributed random vector with a covariance matrix obtained from the main result that we have.
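Written out, the model described in words above might look as follows. The symbols for the intercept, visit, scan, and interaction effects are chosen here for illustration and are not taken from the slides.

$$
\mathrm{E}\left[z_{c}(v,s)\right] = \alpha_{c} + \beta_{c,v} + \gamma_{c,s} + (\beta\gamma)_{c,vs},
\qquad
Y_i = Z\theta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \Sigma),
$$

where $z_c(v,s)$ is the Fisher z-transformed cross-correlation for connection $c$ at visit $v$ and scan $s$, $Y_i$ is the concatenated vector of length $C \times S \times V$ for subject $i$, $Z$ is the design matrix, $\theta$ stacks the intercepts, main effects, and interactions, and $\Sigma$ is the covariance matrix from the asymptotic result.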

So the parameter vector theta involves the intercept, the visit-specific, scan-specific effects, and the interactions, and this can be estimated easily using the feasible generalized least squares estimate. So the main advantage of having a model written in this form is that this allows for missing data in the sense that if a subject misses a visit or a scan, then the subject's data can still be used to estimate the other parameters while adjusting for this missingness.

So the feasible generalized least squares estimate can be easily calculated using a standard result, and the covariance matrix estimate sigma hat here is from the theorem, with the true correlations rho replaced by the sample autocorrelation function r, and to test for the significance of the visit, scan, and interaction effects, we use a Wald-type chi-squared test statistic using this theta hat as the sample estimate.

So the different hypotheses can be written in linear forms, and the tests will have specified degrees of freedom, which are just based on the number of connections and the number of parameters based on the number of visits, scans, or interactions. The Wald-type test statistics can be easily constructed using the quadratic form of the estimate for each effect.
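A minimal sketch of the estimation and testing steps, under simplifying assumptions: one connection, a design matrix shared by all subjects, no interaction terms, no missing visits or scans, and a covariance matrix that is simply supplied rather than estimated from the autocorrelation functions. The function names, the treatment coding, and the simulated data are assumptions of the sketch.

```python
import numpy as np
from scipy import stats

def fgls(y, Z, sigma_hat):
    """Generalized least squares with a plugged-in covariance estimate.

    y: (n_subjects, p) responses; Z: (p, q) design; sigma_hat: (p, p) covariance.
    Returns the estimate of theta and its covariance matrix.
    """
    n = y.shape[0]
    sigma_inv_Z = np.linalg.solve(sigma_hat, Z)          # Sigma^{-1} Z
    info = n * Z.T @ sigma_inv_Z                         # sum over subjects of Z' Sigma^{-1} Z
    theta_hat = np.linalg.solve(info, sigma_inv_Z.T @ y.sum(axis=0))
    return theta_hat, np.linalg.inv(info)

def wald_test(theta_hat, cov_theta, L):
    """Wald chi-squared test of the linear hypothesis L theta = 0."""
    est = L @ theta_hat
    stat = float(est @ np.linalg.solve(L @ cov_theta @ L.T, est))
    df = int(np.linalg.matrix_rank(L))
    return stat, df, float(stats.chi2.sf(stat, df))

# Design for 3 visits x 3 scans: intercept, visit 2, visit 3, scan 2, scan 3.
V, S = 3, 3
Z = np.array([[1.0, v == 1, v == 2, s == 1, s == 2]
              for v in range(V) for s in range(S)], dtype=float)

theta_true = np.array([0.5, 0.0, 0.0, 0.3, 0.6])        # scan effect only (simulated)
sigma = 0.01 * np.eye(V * S)                             # stand-in for sigma hat
rng = np.random.default_rng(3)
y = Z @ theta_true + rng.multivariate_normal(np.zeros(V * S), sigma, size=7)

theta_hat, cov_theta = fgls(y, Z, sigma)
L_scan = np.array([[0, 0, 0, 1, 0], [0, 0, 0, 0, 1]], dtype=float)
stat, df, p_value = wald_test(theta_hat, cov_theta, L_scan)
```

Handling a subject with a missing visit or scan would mean giving that subject its own design matrix and covariance block with the corresponding rows removed, which this sketch does not do.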

So in addition to looking at the three effects, visit, scan, and interaction, the resting state data that we analyzed was also processed using three different methods, and this is one of the things that Dr. Nandy was talking about in the previous presentation: the result depends on how we filter the data. So the three methods used for filtering were physiological filtering, filtering to remove the average time series trend of the white matter and the cerebrospinal fluid, and the overall global filtering, which removes the average time series of all the brain voxels, and we did the analysis using two assumptions, assuming that the visits and scans are independent and that the visits and scans are not independent.

So the overall results indicate that there is a very significant visit effect across the three visits, and even the scan effect is very significant across the three scans adjusted for the visit, and when we look at the interactions, while all the interactions are significant, the only type of filtering which was able to get rid of the interactions was the physiological filtering. The results based on the second assumption, which is independence of visits and scans, also resulted in a similar conclusion with the physiological filtering removing the interactions, although it's not very strong evidence.

With this in mind, we went ahead with the physiologically filtered data, and since it did not show any evidence of significant interaction, we tried to analyze the data to look for significant scan and visit effects for each individual pair of regions of interest, for each connection. So for each of the connections, when we looked at the overall visit and scan effect, there was a significant scan effect for all the six connections that we took, but when we looked at the visit effect, it's interesting to note that all but one are not significantly varying across the different visits. There was only this connection between the dorsal premotor areas which was significantly varying across the visits.

The next analysis we did was to look at the different scans to see if there was a declining effect of the scan or a decaying effect of the visit, and what this result shows is that the scan effect is still very significant, strongly significant, even when comparing between the first and second scans or the first and third scans, but when we look at the visit effect, since the first and second visits are only seven days apart, and the first and third visits are 21 days apart in total, we can see that the visit effect becomes weaker, comparing the p-values, for most of the connections, but there is no continuous trend here. That is why we cannot say for sure that this trend is valid.

But to conclude, what we have observed is that the method that we developed provides a robust framework to test for reproducibility and reliability of resting state networks, allowing for missing visits and scans, and the physiologically filtered data was able to remove the interaction effect and in the motor network data that we analyzed, there is a significant scan effect which induces the residual effect on the -- I think in the interest of time, I'll go to the last one.

So the overall effect of the visit was actually driven by only a single connection, as opposed to the scan, which was very significant overall, but I want to conclude with one restriction of this data that we are currently working on, which is that the number of replicates was very small, so we are currently looking at other databases to see if we can apply the same method to test for reproducibility.

The last point is with respect to the sample size and the number of connections that we can analyze using this method. This can only accommodate a very small number of connections, because the asymptotics are not with respect to the number of subjects, but with respect to the number of timepoints. So if we want to have a network with a large number of regions of interest, say, 10 regions of interest, which will have 45 connections, it will require the subjects to be in the machine for a longer amount of time to take full scans. So this is not really feasible. That's why we are working on studying the asymptotic properties of the estimator which can accommodate larger networks.

**DR. BHAUMIK**: Deepak, thank you very much. Now we are going to the second phase, which is the discussion. There are two discussants. The first one is Olu Ajilore from University of Illinois at Chicago. Dr. Ajilore is a professor of psychiatry, and he is not a statistician. He is a psychiatrist. And at the same university, he is also director of many centers, like Mood and Anxiety Disorder, Clinical Research Core Center, and associate director of residency training and education and associate director of medical scientist training program.

And the next discussant will be Nicole Lazar, introduced a little later, but this is time for Dr. Ajilore. Olu, are you ready, please?

**DR. AJILORE**: Thank you, Dulal. I don't have slides. So I'll just be talking.

As Dulal mentioned, unlike my esteemed colleagues on the panel, I'm not a statistician. So I thought it would be useful to have sort of a high-level overview of the panel as well as discuss some questions that are relevant to our work that I hope will be of use to folks in the workshop audience.

First, we heard from Dr. Bhaumik, who talked about power analysis and sample size determination, accounting for issues like multiple comparisons and null proportions in high dimensional neuroimaging data, and then Dr. Nandy also talked about multiple comparisons and talked about some novel approaches on how to address the issue of multiple comparisons in fMRI data, and then finally we heard from Dr. Ayyala about methods to assess reliability of resting state networks, taking into account sources of variability across different timescales.

So the uniting themes among all of these talks are that these were very interesting, innovative methods for addressing sources of noise and variability in neuroimaging data. The other sort of uniting theme was that they all focused on functional neuroimaging, whether resting state fMRI or task-based fMRI, and so I had a couple of questions that I wanted to bring up for general discussion, in part selfishly inspired by our own work where we often do multimodal neuroimaging.

So in addition to functional neuroimaging, we also will use diffusion-weighted imaging to create structural connectivity networks or we'll do joint modeling of connectivity, combining both structural and functional neuroimaging data. So my first question for the panel is whether these methods, which were described in the last hour or so, could be applied to multimodal neuroimaging data as this is increasingly an approach taken on by many labs.

The second piece relates to a couple of our projects where we're combining neuroimaging data with other types of data, and this bridges in themes that we heard from the earlier panels. So among the things that we're collecting are ecological momentary assessment data from smartphones. We're also collecting passively and unobtrusively obtained data from smartphones that we're interested in combining with neuroimaging data. So I'm wondering about the applicability or the generalizability of these methods to looking at neuroimaging in combination with data acquired across different timescales from different sources.

And then the last question that I have relates to what was discussed in the previous panel, where we have tremendous analytical flexibility in how we approach our neuroimaging data, which can affect outcomes. So for example, the NARPS study, the Neuroimaging Analysis Replication and Prediction Study, was mentioned as an example of how you can take the same type of data and get very different results, depending on what parameters you selected and what analysis pipelines you used, and that's also a very important source of variability that needs to be addressed. So I would love to hear from the panel their thoughts on this issue and whether some of the methods described could be generalized to address those problems.

So with that, I want to make sure I have enough time for my colleague, Dr. Lazar. Thank you.

**DR. BHAUMIK**: Thank you, Olu. Our next discussant is Dr. Lazar, Nicole Lazar, who's a professor of statistics at Penn State University. Whether you know or don't know, I don't know, but what I know for sure is that she has written an excellent book on fMRI study. The name of the book is The Statistical Analysis of Functional MRI Data, published by Springer. I cannot remember the exact year, probably 2007 or 2008. That was her excellent contribution in the neuroimaging data, and she will be talking about basically I believe the imaging processing data and maybe statistical inferences, also. Nicole, thank you very much for sharing.

**DR. LAZAR**: Thanks, Dulal, for inviting me to present today and discuss the session. So I had prepared some slides, but I think I'm just going to, like Olu, talk through some things instead.

As was mentioned in several of the talks, this question of multiple testing, multiplicity, is a longstanding issue in fMRI data analysis and functional neuroimaging data analysis in general, and I was really pleased and happy to see that the panelists are addressing these questions in new and innovative ways, which are really what's necessary and supported by the data that we currently have.

So I'm not going to go through and summarize again. Olu did a nice job of summarizing the key issues. But I do also have a few questions for each of the panelists, and so I'm just going to go through those, and then hopefully we can open up the rest of the time, we'll have some for discussion, in general, from the panel.

So starting with Bhaumik's talk, I was really interested to see this focus not just on type I error, which we're all used to thinking about, those false discoveries, but also the type II errors. You can think about those as being missed discoveries, things that you should have found but you didn't, and we talked about -- he talked about the tradeoff between these two types of error, and so my first question for Dulal and for the other panelists is which is the more critical type of error to be thinking about when we're thinking about connectivity in particular? So to put this a little bit more precisely in the context of Bhaumik's work, is it more important if we have a false finding of a connectivity difference, that type I error, or is it more important if we miss an important difference, that type II error?

And of course, as we know, we can increase sample size to balance those competing goals, but increasing sample size isn't always possible. So if we had to make the choice between that type I error and that type II error, in this particular context of connectivity differences, which do we think is more important and which do we focus on?

Obviously, I think that's going to depend on the context and the scientific questions and maybe even, as Dulal mentioned, how those errors are defined, it would be interesting to have a little bit of discussion about that, and then also he showed us some work on comparing two groups. I'm just wondering how hard it would be to extend this idea to multiple groups. So if you had multiple groups of subjects that you were thinking of comparing the connectivity networks on, or even within the same groups of networks, could we do this longitudinally and see perhaps deterioration in networks, in certain networks over time. So some of those issues I think I'd like to hear a little bit more about.

Moving to Nandy's talk, I really liked this idea of resampling the normalized spacings. I thought that was a very interesting approach. I was really struck by the fact that the thresholds with this new approach are so much higher than the others. So we saw some results from the random fields and the Bonferroni adjustment, which were pretty similar to each other, but the thresholds with this new approach were much higher, just about double.

So of course, as a result, the activation maps that we see were correspondingly much sparser, especially if there was no correction for autoregressive behavior, and so again, sort of tying this back to the first presentation in the session, I'm wondering about the power tradeoff here and is there any way that that can be assessed in the context of this new method? It's one thing to think about controlling the family-wise error rate or the false discovery rate, but I think it's also interesting to think about what happens to power and what's the price that we're paying in terms of power when we make those kinds of changes.

And then finally, the third presentation, oh, thank goodness, I'm so happy to see a talk about reproducibility and replicability. This is a huge problem as I'm sure many of you know, and looking at it within resting state networks across the scans, within a visit, and across visits, which gives us a longitudinal aspect, I think is really, really important, and so some key points here. Can we infer those connections in the resting state network in a stable way, and the effect of the different preprocessing streams and the different assumptions that we make on that?

One thing -- so one thing I would point out here is that there are different ways to define reproducibility and replicability, and so Ayyala touched on that a little bit. I think it's really important to keep that in our minds, that different communities define these things in different ways, and so having our own definitions clear and clean is important.

So my questions to the last speaker, and of course any of the panelists please should feel free, and jump in on these. So this was a very hypothesis-driven approach, it struck me, with a very strong focus on p-values and whether certain results were statistically significant or not, and I found myself when I was thinking about this talk just wondering about effect sizes and their standard errors and might that be a more stable criterion for replicability rather than this focus on statistically significant results, or not statistically significant results.

And the other thing that I was struck by, and I'd be really interested in hearing what the panelists think about this is when I did look at the effect sizes, it seemed that the scan effects were generally quite a bit larger than the visit effects, which seemed to be a little bit counterintuitive, and so I was wondering if there was something -- if there was some intuition as to why the scan effects were typically larger than the visit effects, especially because the scans are nested within visits, but that didn't seem to be the way that the analysis was carried out.

So I'm just curious from the panelists and particularly I guess Ayyala since this is his analysis, if he could say a little bit more about that and enlighten us as to what things we need to be paying attention to in this type of analysis.

I will stop now and leave the panelists time to discuss.

**DR. BHAUMIK**: Thank you, Nicole. Thank you, Olu, and the other two speakers. How much time do I have? I have two minutes. So let me try to address some of the questions that Nicole pointed out. Number one is which kind of error should be prioritized, the type I or type II error, or the false discovery rate or the nondiscovery rate. The answer to that question depends on the context. I mean, if there is any confirmatory test besides the statistical test, then a false discovery may not be that bad. An example is breast cancer detection. There are many ways they measure that, but the clinicians will go through the biopsy test, no matter what happens before.

So that kind of false discovery is not really that bad. Rather the confirmatory test will be followed. But the nondiscovery is really bad. It means that the subject has the breast cancer and it is not detected. So this is just an example, and I think it depends in the context what kind of priority should be given the first type error I rate type of thing or the type II.

The second question she asked is that can it be extended for longitudinal design. The answer is yes, by the modeling system, and if we talk about that trend, the trend analysis, then the appropriate parameters should be tested accordingly.

The third question she raised is that can it be extended for multiple groups? The answer in principle is probably yes, but I don't know mathematically how hard it will be.

Then our first discussant, Dr. Ajilore, he basically mentioned that different types of heterogeneity, we try, but that doesn't mean that we have incorporated everything, provided we know the sources. And to the sources, we have to depend on the people from whom we are getting data, and Dr. Ajilore all throughout he helped me, most of the times I analyzed his data. So if we know probably we can incorporate in the modeling system as Deepak has also incorporated.

What type of -- how to control the multiple comparisons, that is always a debated issue. Rajesh has done it one way, I have done it a different way, and Deepak also has a different way. So there is no unique answer to that question. But I tried to compare some of those multiple comparison methods and tried to figure it out. But one thing is clear: the more information you extract from the data, the better it is for your decision or conclusion.

Thank you all for participating and listening to us, the speakers, as well as the discussants.

**DR. WOUHIB**: All right, that's really another great session, and more into the technical aspects of statistics, and thank you, Dulal, for organizing that. We will convene at the top of the hour, which is around 9 minutes from now. Thank you very much.

(Break)

**Agenda Item: SESSION IV: Recent Statistical Developments in Imaging Genetics**

**DR. WOUHIB**: Hello everybody. This morning I was kind of blanking when I was attempting to call Dr. Thompson, one of our organizers on this workshop, and moderator for this session. Dr. Thompson is a professor of biostatistics at the University of California San Diego. The topic of this session is the recent statistical developments in imaging genetics, another important subject area in mental health research. Wes, please take it away.

**DR. THOMPSON**: Thanks, Abera. I would like to thank you and Michael and the NIH for hosting this important workshop. I think the talks have been fascinating so far. I think our session is probably closest in spirit to the second session, in particular Martin and Todd already beat me to the punch on several of the messages that I was going to try to communicate, so I’ll keep my preliminary remarks a little brief.

I just want to say that you can interpret imaging genetics narrowly as being methods for actually using genetic data to predict brain images, structural or functional brain images, or to try to assess credibility of imaging data and so forth.

But our session is going to take the interpretation a little more broadly in terms of what are the lessons that we’ve learned from the field of genetics research in GWAS studies, which have gone through this whole pathway of having candidate gene studies, finding that the candidate genes have severely inflated effect sizes, that they didn’t replicate, and then going to genome-wide studies and finding that a very small percentage of the heritability was actually explained by quote significant effects.

And then they went to this whole path where they got huge sample sizes, a lot of that in our field was driven by the Psychiatric Genomics Consortium, now they have samples upwards of a million people in meta-analyses, and now they’re finding highly replicable results. They’re widely distributed across the genome, but each one is tiny.

And I firmly believe that’s where we’re going to have to go in imaging, that effects are smaller than we thought, they’ve been really inflated in the literature because of small samples, publication bias, and p-hacking, and that large samples are going to be a partway solution to this problem. So advanced statistical methods are important.

As Martin said we need to adapt our methods now to handle this scenario of large samples where maybe effect sizes with confidence intervals are more important as Nicole was saying, in that they’re probably widely distributed across the brain. There may be no such thing as a truly null voxel or vertex if you have a large enough sample size. And so we can borrow some methods from genomics research, but not blindly, because there are significant differences between imaging and genetics as we all know. And so that’s going to be partly the topic of this session.

And so we’ll start off with a colleague of mine who I’ve worked with now for a number of years who is a professor of biostatistics at the University of California San Diego, Dr. Armin Schwartzman. He received his PhD in statistics from Stanford in 2006, and his research centers on the development of statistical methods for signal and image analysis, with biomedical and environmental applications. So I’m going to hand it off to Armin.

**DR. SCHWARTZMAN**: Thank you for the invitation. It’s exciting to see how much interest there is in doing good statistics for mental health. So I want to pick up exactly where you left off, and the question I am going to answer here is the question of how much of the cognitive traits that we see in let’s say something like IQ, memory, or anything else related to mental health, can be explained by genetics or brain anatomy and activity.

And the reason why I want to answer this is because when I came into this topic when I started working with Wes, all these effects are tiny and highly distributed, and we just want to know first of all how much information there is overall in the data.

Now I need to explain that I’m not a neuroscientist nor a geneticist, so I cannot tell you exactly how much of the variance of these traits are really explained by neuroimaging measures or by genetics, but at least I can tell you in the data how we can estimate those numbers.

So these are the two domains, neuroimaging and genetics, and I’m going to go one by one, and I will start with genomics. So genomics is in the context of GWAS. For those of you who may not know, just to say briefly, genome-wide association studies, what they do is they collect for a large number of subjects, let’s say on the order of 10,000 or even more. For each one of them they will collect several traits. For an observed trait that could be cognitive, something like IQ, or a memory task.

And then also collect a big panel of single nucleotide polymorphisms. So this would be a long list of maybe a million or more of SNP allele counts. What does that mean? Our genome is very long, but most of us share our genome, and most of the differences between us are concentrated in these particular base pairs that have different base pairs that show up, for example I have an A in a particular location and you may have a C in that particular location. So we have these allele counts. And that’s our data. We have a column vector, so length 10,000, which I’m going to call Y, contains the trait, and then a big matrix which may be 10,000 by let’s say a million of these allele counts.

And the question is how much of the variance in Y can I explain by X? The simple way to do this is to pose what we call the polygenic linear model where we just express the trait as being a linear combination of these allele counts by some coefficients beta.

Now these coefficients beta represent the biology or the mechanisms by which these allele counts get transferred or translated into the trait, and therefore I’m going to be treating these coefficients beta as being fixed but completely unknown, while X is what is random because X changes from subject to subject.

Based on this model we can define now what the fraction of variance is. So from this equation if we consider epsilon to be some additional effect, perhaps environmental or some other source, that is independent from the data that we have, we can decompose the variance of Y as the sum of two components.

Beta transpose Sigma beta, which corresponds to the variance that is explained by the genotype X, plus sigma squared, which is the additional variance. Based on that decomposition we can simply define the fraction of variance explained as the ratio of the variance explained by genetics to the total variance.

In genetics literature this quantity is called SNP heritability, and I want to distinguish simply between this quantity and the general heritability. The reason why I call this SNP heritability is because I’m only trying to calculate the amount of information that is in the data that I have. This would not include other genetic factors that are not in the data that is available to us.
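Written out, the model and the variance decomposition described above take roughly the following form. The slides are not reproduced in the transcript, so the symbols $\Sigma$ (the covariance matrix of one subject's allele counts) and $h^2$ (the fraction of variance explained) are notation assumed here for illustration.

$$
y = X\beta + \varepsilon, \qquad
\operatorname{Var}(y_i) = \beta^\top \Sigma\, \beta + \sigma^2, \qquad
h^2 = \frac{\beta^\top \Sigma\, \beta}{\beta^\top \Sigma\, \beta + \sigma^2}.
$$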

Now you may look at this linear model and say well the fraction of variance explained by X is simply R squared, the coefficient of determination. Yes, that’s all it is, that’s what we’re trying to do. What’s so hard about this? Well, the coefficient of determination, R squared, or adjusted R squared which would be the better estimator, that works if the number of subjects is much larger than the number of features or predictors, M. But we’re in a situation which is the opposite, where we have many more predictors, many more SNPs than we have subjects, even though both numbers are large.
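The failure of R squared when predictors outnumber subjects can be seen in a small simulation. This is an illustrative sketch, not the speaker's analysis: the sizes `n = 50`, `m = 5`, and `m = 200` and the helper `ols_r_squared` are hypothetical, and the trait is pure noise, so the true fraction of variance explained is zero.

```python
import numpy as np

def ols_r_squared(X, y):
    """R^2 of a least-squares fit of y on the columns of X (with an intercept)."""
    Xc = X - X.mean(axis=0)          # centering absorbs the intercept
    yc = y - y.mean()
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)  # minimum-norm solution if m > n
    resid = yc - Xc @ beta
    return 1.0 - (resid @ resid) / (yc @ yc)

rng = np.random.default_rng(0)
n = 50
y = rng.standard_normal(n)           # trait unrelated to every predictor

r2_few = ols_r_squared(rng.standard_normal((n, 5)), y)     # m much smaller than n
r2_many = ols_r_squared(rng.standard_normal((n, 200)), y)  # m much larger than n
print(f"m=5: R^2 = {r2_few:.3f}   m=200: R^2 = {r2_many:.3f}")
```

With more predictors than subjects the fit interpolates the data, so R squared is essentially one even though nothing is explained, which is why a different estimator is needed in the GWAS setting.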

So what do we do in that case? This turns it into a high dimensional data problem. And for this, Wes and I and others in the group developed an estimator, the GWAS heritability estimator, in short GWASH, and this came out in a paper in the Annals of Applied Statistics in 2019, and I’m going to tell you the details about this in a moment.

But before I do that, just to give an idea of what kinds of numbers we are talking about, here is a summary of some analyses that we did on data from other papers about height, you see there are two cognitive traits, IQ and educational attainment, that’s how many years of education the person has achieved, versus others that are not cognitive like height and BMI. The numbers as you see are pretty big, number of subjects, number of SNPs right here. This particular analysis just received a number of SNPs about 800,000 for technical reasons, but you can see the numbers.

So these are the estimates given by GWASH, and these are the standard errors that our estimator provides. It’s interesting to see that for example IQ may be a little bit more heritable, about 20 percent heritable according to this list of things at least. The educational attainment, not so much, less.

BMI, not very heritable according to this. These numbers are important. And why is that? You can see the height for example is highly heritable, maybe 40 percent. These numbers matter because for example if BMI and educational attainment are not very heritable, that means that what we do as a society matters.

And if we try to give education to people, and if we try to teach them how to eat correctly perhaps, this can have a big effect on their lives. The point is that things that are less heritable are more under our control. Therefore, I think it is important that we know how to estimate these numbers correctly, and that’s what this work is about.

What is the GWASH estimator? The formula for it is actually quite simple it turns out. And I like it because it’s interpretable in a very nice way. If you think of Y tilde and X tilde as being standardized versions of Y and X, where all I did is just standardize by columns, standardized here means mean zero and variance one, it just makes the expression simpler, then the estimator of the fraction of variance explained is given by this formula.

Let’s parse it out. M and N, M is the number of SNPs, N is the number of subjects. S squared is the second moment or the variance of these so-called correlation scores. So what are the correlation scores? You just take every X, every column, the data for every SNP, and just take the correlation between every SNP and the trait. So the u_j's are the list of all the correlation scores that say how correlated each SNP is with the trait, and just take the variance of all that.

So why is this interesting? That is because of what happens in the case that, suppose, the SNP counts of the genetics are completely uncorrelated with the outcome, completely, that would be our null hypothesis. So under our null hypothesis, all these correlation scores because they’re properly standardized will have mean zero and variance one.

When you estimate the variance that number will be about one. So under the null hypothesis we expect S squared to be equal to one, and if we take S squared minus one what this captures is the additional variance that is produced precisely by the relationship between X and Y.

And there’s another quantity that is interesting here, this quantity mu2, which is an estimator of this quantity, which is called the second spectral moment of the correlation matrix. So the correlation matrix in the case of SNPs would be what is called the linkage disequilibrium matrix, and this captures the total amount of correlations that there is in the SNPs.
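Putting the spoken description together, the estimator has approximately the following form. This is a reconstruction from the talk rather than a copy of the slide, so the exact normalizing constants (for example $n$ versus $n-1$) are assumptions and may differ from the published paper.

$$
u_j = \frac{\tilde{x}_j^\top \tilde{y}}{\sqrt{n}}, \qquad
s^2 = \frac{1}{m}\sum_{j=1}^{m} u_j^2, \qquad
\hat{h}^2_{\mathrm{GWASH}} = \frac{m}{n}\cdot\frac{s^2 - 1}{\hat{\mu}_2}, \qquad
\mu_2 = \frac{1}{m}\operatorname{tr}\!\left(\Sigma^2\right),
$$

where $\tilde{x}_j$ is the standardized column for SNP $j$, $\tilde{y}$ is the standardized trait, and $\Sigma$ is the SNP correlation (linkage disequilibrium) matrix. When the SNPs are uncorrelated, $\mu_2 = 1$ and the estimator reduces to $(m/n)(s^2 - 1)$.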

The fact that we have to divide by this number is because there is redundancy in SNPs because of the correlation. And so this sort of adjusts the number of effective SNPs that really are indicated. And to illustrate that with data you can see that as you increase the number of SNPs for each one of these traits, at some point adding more SNPs doesn’t really help very much, and the estimates stabilize.

So this really shows that the amount of information that there is in the data can be captured by an effective smaller number of SNPs. Of course, mu2 also has to be estimated, and here is just a formula that shows how you can estimate it from the sample correlation matrix as long as we adjust for a finite population factor. Even though m is large it’s still important and needs to be accounted for.
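The pieces described so far can be put together in a short simulation. This is a sketch under stated assumptions, not the authors' code: the function names `standardize` and `gwash_sketch`, the simulated sizes, the use of independent normal predictors, and in particular the form of the finite-sample correction for mu2 (subtracting `(m - 1) / (n - 1)`) are assumptions made for illustration.

```python
import numpy as np

def standardize(a):
    """Center each column and scale it to unit sample variance."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

def gwash_sketch(X, y):
    """Return (h2_hat, mu2_hat) for an n-by-m predictor matrix X and trait y."""
    n, m = X.shape
    Xs = standardize(X)
    ys = standardize(y.reshape(-1, 1)).ravel()
    u = Xs.T @ ys / np.sqrt(n - 1)     # correlation scores, variance about 1 under the null
    s2 = np.mean(u ** 2)               # second moment of the scores
    # tr(R^2) for the sample correlation matrix R, computed from the n-by-n Gram matrix
    G = Xs @ Xs.T / (n - 1)
    mu2_raw = np.sum(G ** 2) / m
    mu2_hat = mu2_raw - (m - 1) / (n - 1)   # assumed finite-sample correction
    h2_hat = m * (s2 - 1.0) / (n * mu2_hat)
    return h2_hat, mu2_hat

# Simulated polygenic trait: many tiny effects, true fraction of variance 0.5.
rng = np.random.default_rng(0)
n, m, h2_true = 1500, 2000, 0.5
X = rng.standard_normal((n, m))
beta = rng.normal(0.0, np.sqrt(h2_true / m), m)
y = X @ beta + rng.normal(0.0, np.sqrt(1.0 - h2_true), n)

h2_hat, mu2_hat = gwash_sketch(X, y)
print(f"h2_hat = {h2_hat:.3f}, mu2_hat = {mu2_hat:.3f}")
```

Because the simulated predictors are independent, `mu2_hat` should come out close to one; with correlated SNPs it would be larger, shrinking the effective number of SNPs as described above.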

And in terms of theoretical results, I just want to point out that in this paper we showed how this estimator has strong theory behind it showing that under some conditions, at least in the case of the paper where we show that the data scales and the correlation is weak. Weak in the sense that it’s local, between SNPs, which usually is true, because SNPs are usually correlated if they’re close enough in the genome, then our estimate is consistent and it has a normal distribution with a variance that can be calculated.

And we have simulations that show that under different conditions whether the data is a mixture, not a mixture, whether X is normal or binomial as it should be in the case of SNPs, different correlation structures, we tend to estimate the fraction of variance explained.

So this is about the genetics part. Can we do something similar when we have neuroimaging measures? So the way to do that is just to pose a similar model. So in this case, let me just remind you, if I go back up, we posed in the case of genetics this model where the trait was equal to a linear combination of the SNP counts with the coefficients beta. If we are looking at the brain, it’s not quite correct to look at the brain as just a list of pixels or a list of voxels, because voxels have an anatomical structure.

Therefore, maybe a better way to do it is to pose it as an integral model, and this is called a functional regression model in the functional data analysis literature, but it is linear nevertheless where s here is an index that goes over the shape of the anatomy. And the expressions get a little bit more complicated, but essentially, we can make this extension where now we can calculate or define the fraction of variance explained again as the ratio of the variance explained by the neuroimaging measures divided by the total variance.
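In symbols, a functional linear model of this kind can be written as follows. The notation is assumed for illustration: $\mathcal{S}$ denotes the anatomical domain, $X_i(s)$ the imaging measure of subject $i$ at location $s$, and $\Sigma(s,t) = \operatorname{Cov}\{X_i(s), X_i(t)\}$.

$$
y_i = \int_{\mathcal{S}} X_i(s)\,\beta(s)\,ds + \varepsilon_i, \qquad
h^2 = \frac{\displaystyle\int_{\mathcal{S}}\!\int_{\mathcal{S}} \beta(s)\,\Sigma(s,t)\,\beta(t)\,ds\,dt}
{\displaystyle\int_{\mathcal{S}}\!\int_{\mathcal{S}} \beta(s)\,\Sigma(s,t)\,\beta(t)\,ds\,dt + \sigma^2}.
$$

This mirrors the SNP case term by term, with the sum over SNPs replaced by an integral over the anatomy.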

And the GWASH estimator can be adapted in this case, essentially pretty much replacing sums with integrals but in a very simple way, although the technical details need to be worked out. And Wes and I have been joking that we could call these brain-wide association study estimates possibly. I’m not sure if we’re going to continue these seriously, but it’s a good way to refer to them.

However, when we want to adapt and use this estimator for brain imaging, something fails. And it turns out if you look closely at the assumptions that are required for the GWASH heritability estimator to really converge, as I said be consistent and work great, I said the correlation between predictors must be weak.

And in GWAS this holds again because SNPs that are close together in the genome tend to be more correlated, as opposed to distant ones, since the genome breaks into pieces. But in the brain this is not quite true. For example, let’s say that we’re looking at some X that is cortical thickness.

Well, if I for some reason have a thin cortex, my cortex is going to be thin, not only in this part but it’s going to be thin everywhere, so there are long-range correlations. And this is also known for fMRI and other modalities. Therefore, what do we do? Our assumptions are being broken. So this has required us to look really closely at what are the assumptions that we need, and how can we address them and fix them.

So in this case, to address the long-range correlations, one idea that we’ve been exploring is to see whether those correlations have been captured by the first components of the data. Here is an example of something we tried where we did a simulation, this was done by Chun Fan, one of our panelists as well.

And he randomly sampled individuals from the ABCD study, it’s large enough, and these are the sample sizes, a thousand up to 5,000 subjects sampled from there, and in samples we take the entire images using 5,000 vertices of cortical surface area. What we see here is that we apply the original GWASH estimator, and the results are quite biased with respect to what they should be in the simulation, the bias, large variance and everything else. However, if we try this idea where we remove the first principal component of the data, that seems to very much fix the bias and reduce the variance quite nicely. Which is a very encouraging result.
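The idea of removing the first principal component to suppress long-range correlation can be sketched as follows. This is only an illustration: the helper `remove_top_components`, the simulated "global factor" data, and the choice to remove the component from the predictors alone are assumptions, since the talk does not spell out those details.

```python
import numpy as np

def remove_top_components(X, k=1):
    """Project the top-k principal components out of the column-centered matrix X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    s_trunc = s.copy()
    s_trunc[:k] = 0.0                  # zero out the leading components
    return (U * s_trunc) @ Vt

def mean_abs_offdiag_corr(X):
    """Average absolute correlation between distinct columns of X."""
    R = np.corrcoef(X, rowvar=False)
    p = R.shape[0]
    return (np.abs(R).sum() - p) / (p * (p - 1))

# Toy data: a subject-level global factor (e.g. an overall thin cortex)
# induces correlation between every pair of vertices.
rng = np.random.default_rng(0)
n_subj, n_vert = 300, 100
global_factor = rng.standard_normal(n_subj)
X = 0.8 * global_factor[:, None] + rng.standard_normal((n_subj, n_vert))

X_resid = remove_top_components(X, k=1)
before = mean_abs_offdiag_corr(X)
after = mean_abs_offdiag_corr(X_resid)
print(f"mean |corr| before: {before:.3f}, after: {after:.3f}")
```

In this toy setting one component captures the global factor, so the remaining correlations are close to what independent noise would produce. Whether one component is enough in real data is the diagnostic question raised later in the talk.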

And to give you an idea again of the numbers, this is now using the ABCD data itself using 8,000 subjects for the GWAS analysis of ABCD and about 3,000 for the imaging measures for surface area, mean diffusivity and resting state fMRI density and for different traits that were measured. This is just a preliminary analysis, but it’s interesting to see too that surface area which is purely anatomical doesn’t explain very much of the outcome. However, diffusivity does quite a bit more.

And as we know diffusivity has to do with the structure of the neural fibers, therefore higher diffusivity means better conductivity in these neural fibers, and that probably explains why that contributes to intelligence and problem solving. Also resting state fMRI the amount of activity is an important component. So understanding these numbers will give us a lot of insight into what are the mental processes or the anatomical characteristics of the brain that may relate to these outcomes.

So to summarize as a main message here, the approach that we are trying to promote here is one that will allow us to consistently estimate the fraction of variance explained, and it’s general enough that it allows us to use different types of predictors, like for example as I showed already SNPs and genetics or different modalities of neuroimaging, but by the same token we could potentially extend this to other forms of high dimensional predictors, like for example metabolomics or microbiome, many other types of high dimensional panels that we could be incorporating into this framework, and then try to explain how these cognitive traits work, to what types of sources they can be attributed. However, it is not so simple, therefore future work is going to concentrate on the following things.

First of all, the conditions are actually quite important. This is technical work, but I think it is important because if you want to feel confident about the estimation or the numbers that we produce we need to know that these methods actually work. And to know that they work we need to be able to diagnose in the real data situation whether these conditions actually hold.

One of these issues as I mentioned is the presence of short-range or long-range correlations. So this needs to be calibrated carefully. And then this idea of removing the principal components also needs to be formalized and we need to prove that it actually works and that we will get consistency, and also check, if you remove two principal components, have you removed the correlation or have you not, that’s where we need diagnosis again.

And another big topic is that this method, and most others that we know about, they assume homogeneous populations, and populations are not homogeneous, even the ABCD data is not. So another big topic that we’re working on is how to take into consideration the fact that there’s a mixture of different populations both in terms of the traits and in terms of the genetics and the brain characteristics, and how to take those variables into account. Thank you very much.

**DR. THOMPSON**: Thank you. That was a great talk. So we’re out of time, so I don’t think we have time for questions. So let’s move to the second speaker, Dr. Chun Fan, who is an Assistant Professor of Radiology also at UC San Diego. Dr. Fan is the co-director of the Population Neuroscience and Genetics Lab there. His research interests focus broadly on the domains of psychiatry, neuroimaging and genetics.

And so I would say that Chun actually has an incredibly unique background, being a certified psychiatrist with a formal epidemiology background and a PhD in cognitive science and a deep knowledge of both genetics and imaging. I’d like to welcome Dr. Chun Fan as the next speaker.

**DR. FAN**: Thank you so much Wes for the kind introduction. In a way because of my very heterogeneous background, I can say I’m more like a bioinformatician rather than a good statistician. So a lot of my discussion here is about targeting the critical side of dealing with imaging genetic data. Most of the themes that I’m talking about are from the collaborative work in PoNG Lab and the broad range of our collaborators within UCSD and also across international borders.

So today I’m going to focus on, the main thing is highlighting our effort to do voxel-wise GWAS analysis, imaging data. And the main challenge to deal with that is really to have an efficient algorithm that can quickly do so and also have several different ways to really improve the reproducibility and the interpretability of that voxel-wise result.

So the first thing, to motivate our effort, is to look at, as Wes and Armin mentioned earlier, now we know we observe a lot of effects that are small and distributed. And especially for imaging genetics it goes in both ways. For imaging side the effects are well distributed and small, for SNPs we do know every SNP contributes a very small amount of the trait variability, and combined together we have this kind of double curse, not just dimensionality but also effect size which is small.

And just to show you, on the righthand side is our recent preprint paper that we put on the archive, where we performed a diffusivity GWAS using the diffusion metrics. On the top panel here this table has several different features, basically trying to capture the tissue component for white matter, and we try to look at what we discover as genetic loci significant both in discovery and replication set, and we look at how those loci overlap with other imaging GWAS that have been reported before.

And when you look at it, even though our imaging modality mainly focuses on white matter, you can see a lot of loci overlap with the loci found in cortical surface measurement, surface volumes, subcortical volumes, or white matter metrics as well. So even though the effects are strictly from one single modality, when you go on to find regions you have this kind of pleiotropic effect that can be found in all other brain imaging modalities and brain regions.

So this puts into question the so-called more classical approach, the region-of-interest approach, looking at one modality and one region at a time. And just to further illustrate this kind of distributed effect in the genetic association on brain imaging study, on the righthand side is one locus we found consistently replicated across the datasets and that also has the biological plausibility to support the experiment.

This particular locus, you can see the voxelwise association map here, it’s really scattered across the whole brain, and probably some regions have clusters around certain cortical striata regions, but indeed these are just scattered across, and a lot of them just right beneath the cortical surface.

So the thing is, if we want to look at these kinds of perceived effects, probably stuck between to give us a balance, one thing is the power, how we can detect these particular genetic effects, genetic is such kind of good causal end curve because the relationship to the (indiscernible) process, but on the other hand if we just look at one voxel at a time, yes, we have better spatial resolution but we might lose power. And we also want to ensure that we can reproduce our result. The effect size per voxel is so small, how can we reproduce the effect that we discover?

Finally, if we find a way of combining all these voxel-wise analyses into an inference similar to what Armin did to get at the SNP, the issue is how do we integrate the result, because for neuroscientists doing the region-of-interest approach provides a good heuristic to integrate certain brain function. So in this talk I’ll mainly focus on what is the practical solution using (indiscernible) to deal with largescale imaging genetic analysis, and starting with voxel-wise linear mixed effect model.

The main thing, adopting the linear mixed effects model, is the response to the complex study design. Take ABCD for example. On the righthand side, the top panel is showing the ABCD study genetic ancestry proportions, each color represents one genetic ancestry group inferred by genotype data. So the blue one is European descent.

So even though a large portion of ABCD subjects are of European descent, we can also see a majority of them also have a very complex mix with other groups. So this kind of background heterogeneity is already built in with ABCD. The reason is that ABCD is trying to capture the diversity of the sample so that the analysis can be more generalizable, but on the other hand in terms of genetics(?) it introduces complexity you need to deal with.

Beyond that, ABCD also has mixed relatedness. As the lower panel shows here, this is genetic relatedness inferred by genetic data in ABCD. You can see the distribution from zero to one here, zero meaning that there’s no relatedness between pairs of individuals, and one meaning that they’re 100 percent related, as with monozygotic twins.

The majority of ABCD individuals hover around below 0.25, meaning that they are unrelated in a way, some of them probably cousins, but not much, but still they have the heavy tail end as you can see here circled by red, some of them are twins or siblings or some of them are monozygotic twins.

So to further complicate the picture, because ABCD is designed for longitudinal analysis, there’s also another component of longitudinal repeated measures. So the thing is, adopting this kind of complex nesting structure, one possibility is using a linear mixed effect model because it’s quite versatile and well supported in the genetic field so that we can combine together, taking into account those different relatedness in the constellation(?) meanwhile performing voxel-wise analysis to data to perceive the effect without the SNP effect being biased by those complex structures.

But how doing the (indiscernible) solution is just expensive, because think about the voxelwise LME, for each voxel, like the median amount of hair, each voxel K you need to estimate the related(?) part, and then getting the related part and then you have to look over all the seeds(?) in the X right here. So what we do is using the approach that has been utilized in the imaging field and the genetic field and we synthesize them together, have a two-pronged approach.

On the left-hand side we’re dealing with imaging, K, we assume there is a smaller set, C set of configurations where the component is much smaller than the K of voxels. First, we do the moment estimator to get rough voxel-wise parameters first, and this can be done very quickly as a linear operation because these moment estimators basically work as a linear equation, but then we try to bin the average of them based on the nearest neighbor clustering into finite C set, and we restore those C sets of inverse matrices.

On the genetic side, because we already (indiscernible word) then we can just use a score test instead of the full-blown LME, and we can quickly get a testing statistic with a chi-squared distribution. And taken together they can even further speed up by avoiding doing the denominator part during GRAMMAR-Gamma approximation, which has been used in some (indiscernible word) genetic field, but because if you just forego the environmental sequelae part then the statistic will be biased downward as conservative, so you need an additional gamma term to scale the testing statistic, but a scale in all the other SNPs for this purpose.
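The score-test idea mentioned above can be sketched in a generic form. This is the textbook score statistic for a single SNP given a fitted null covariance, not the speaker's implementation: the function `score_test`, the toy covariance, and the assumption that `snp` and `resid` are already adjusted for fixed-effect covariates are all introduced here for illustration, and the GRAMMAR-Gamma scaling is not included.

```python
import math
import numpy as np

def score_test(snp, resid, V_inv):
    """Score statistic and p-value for a zero SNP effect, given null-model quantities.

    snp   : covariate-adjusted allele counts, shape (n,)
    resid : residuals of the phenotype under the null model, shape (n,)
    V_inv : inverse of the null-model covariance of the phenotype, shape (n, n)
    """
    u = snp @ V_inv @ resid          # score for the SNP effect
    info = snp @ V_inv @ snp         # variance of the score under the null
    stat = u * u / info              # approximately chi-squared with 1 df under the null
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-squared(1) upper tail
    return stat, p_value

# Toy usage with an independent-errors covariance V = sigma2 * I.
rng = np.random.default_rng(0)
n, sigma2 = 200, 2.0
snp = rng.binomial(2, 0.3, n).astype(float)
snp -= snp.mean()
resid = rng.normal(0.0, math.sqrt(sigma2), n)
V_inv = np.eye(n) / sigma2

stat, p_value = score_test(snp, resid, V_inv)
print(f"score statistic = {stat:.3f}, p = {p_value:.3f}")
```

The appeal is that `V_inv` comes from the null model, so it is computed once (or once per cluster of voxels) and reused for every SNP instead of refitting a mixed model each time.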

To showcase this two-pronged approach, how they help, we do simulations, on the left-hand side, the result here is coming from 1,000 simulations, and it’s showing the effect, there is effect between the SNPs and across voxels. So the X axis is from the LME using (indiscernible) and Y axis is our estimation coming from our two-pronged approach, and you can see for the beta coefficient, a key statistic, they are basically identical. And surprisingly given all this approximation still it seems that the fixed effect part for imaging GWAS is quite stable.

And the benefit of it is mainly time, computation resource is reduced, cost is greatly reduced, and time needed to finish the calculation is improved dramatically as the right panel shows here compared to the full-blown LME (indiscernible) it just takes so much time but (indiscernible) but this is just purely a critical reason to quickly get (indiscernible) statistics, across voxels, but the main interest still we want the inference, and we don’t want inference on each voxel, because indeed we will suffer the burden of compression.

But fortunately, there is a lot of inference based on combining the different voxel-wise statistics, like min P, the traditional min P approach, we can take the most significant region and do a test on it, or test statistics independent with each other and do Fisher combine, or other groups have proposed using therapy in the meta-analysis context, MANOVA, or even using a permutation test, some non-parametric approach to test each different combination function.
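Two of the combination functions named above, Fisher's method and min P, have simple closed forms when the tests are independent. The sketch below uses hypothetical helper names and assumes independence, which voxel-wise statistics generally violate; that is the motivation for the permutation-based calibration discussed in the talk.

```python
import math

def fisher_combine(pvals):
    """Fisher's method: -2 * sum(log p) is chi-squared with 2k df under independence."""
    k = len(pvals)
    half = -sum(math.log(p) for p in pvals)   # half of the Fisher statistic
    # Chi-squared survival function with an even number (2k) of df has a closed form:
    # exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def min_p_combine(pvals):
    """Min-P with a Sidak-type correction, valid for k independent tests."""
    k = len(pvals)
    return 1.0 - (1.0 - min(pvals)) ** k

voxel_pvals = [0.04, 0.20, 0.01, 0.65]   # hypothetical per-voxel p-values for one SNP
print("Fisher:", fisher_combine(voxel_pvals))
print("min P :", min_p_combine(voxel_pvals))
```

Fisher's method pools evidence from many small effects, whereas min P is driven by the single strongest voxel, which is one way to see why the two can differ in power for distributed effects.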

And there’s a kind of clinical work with UCSD, we recently also explored a more semi-parametric approach, basically it is (indiscernible) min P but in the known distributions defined by permuting the null SNPs that define the cumulative distribution function, and these kinds of pooled permutation strategies have been used before, and recently we also published in Nature Communications, and now there’s one paper under review. And you can see the right panel here are the (indiscernible word) plot from the voxel-wise analysis using these parametric approaches combining voxelwise summary statistics altogether.

The (indiscernible word) permuted the result, and type one error is well controlled. Looking at the actual observed data results, the orange line is the voxelwise results, the blue line is using the regional (indiscernible) several hundred regional (indiscernible) combined together, and the pink line comes from the min P approach. So the power is improved when you take the voxel-wise data and combine them all together.

On the other hand, since we have the voxelwise map, how can we say that the overall pattern can be replicated in an independent study? The intuition is, starting with the pattern, why don’t we just turn this into a prediction framework: let’s do a kind of scoring of the imaging data in the independent set, and get a score based on the linear sum of the effect size estimates (indiscernible) poly-voxel, poly-vertex score in the context of (indiscernible) the cerebral cortex, and also another one strictly in the genetic context, under review right now.

So the steps are: the first one starts from GWAS, let’s assume the effect(?) per voxel RID; then pre-multiply the estimate with the imaging data and do the regression; then demonstrate, basically (indiscernible word), look at the correlation between two statistical maps. And the benefit of the precalculated score is that you save computation time. You can use an independent dataset without worrying that you have to estimate the per-voxel effect, even when your dataset is relatively small.
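The scoring steps described here can be sketched as follows. This is a hedged illustration on simulated data, not the published poly-voxel method: the array names, sizes, and the simple slope test are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_voxels = 200, 500

# Hypothetical inputs: per-voxel effect estimates for one SNP from a discovery
# GWAS, plus genotypes and imaging data from an independent replication set.
beta_discovery = rng.normal(0.0, 0.05, size=n_voxels)
genotype = rng.integers(0, 3, size=n_subjects).astype(float)
images = np.outer(genotype, beta_discovery) + rng.normal(size=(n_subjects, n_voxels))

# Poly-voxel score: one number per subject, the weighted sum across voxels.
score = images @ beta_discovery

# Replication test: simple regression of the score on genotype.
g = genotype - genotype.mean()
slope = (g @ (score - score.mean())) / (g @ g)
r = np.corrcoef(genotype, score)[0, 1]
print(f"slope {slope:.3f}, correlation {r:.3f}")
```

Because the weights are fixed in advance, the replication sample only has to support a single regression rather than one estimate per voxel.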

And you can see in the right panel here a result comparing to the min P approach: using this voxel-wise map to do a kind of poly-voxel scoring for replication, the number of replicated loci increases from 50 percent to 70 percent. But on the other hand, we also hope to improve on the implementation.

As I said (indiscernible) multivariate inference with the voxel-wise map, we just try to get appropriate power and have a practical solution, put together as a statistic. But the problem is we still hope to see which regions are more responsible, or show on average a larger effect size, so that we can probably infer, even for deeper(?) genetic loci, what kind of function might be related to brick(?). So what we think of is whether we can just look at the average effect size, given the anatomic region.

Because from the very beginning, when doing voxelwise analysis, we have to register our image onto a (?) which already contains some segmentation of the regions of interest. So we define the atlas space, and we know the probability map for a given voxel. So when we calculate the expectation of the data given the anatomic region, we can approximate that by just taking the weighted average of the data, weighted according to the region of interest. And the variance, again because this is in genetics, in the GWAS space, the variance of those particular scores, we can approximate by just sampling from all the other SNPs.
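A minimal sketch of this regional averaging, under stated assumptions: the atlas is a hypothetical voxel-by-region probability matrix, the effect maps are simulated, and squared effects are used as the magnitude to average (the talk does not specify this choice). The spread of each regional score is taken from the other SNPs, as described.

```python
import numpy as np

rng = np.random.default_rng(1)
n_voxels, n_regions, n_null_snps = 300, 4, 1000

# Hypothetical atlas: row v gives the probability that voxel v belongs to each region.
atlas = rng.dirichlet(np.ones(n_regions), size=n_voxels)

def regional_mean(effect_map, atlas):
    """Probability-weighted average effect per region."""
    return (atlas.T @ effect_map) / atlas.sum(axis=0)

# Squared voxel-wise effects for the SNP of interest and for many other SNPs.
target = rng.normal(size=n_voxels) ** 2
null_maps = rng.normal(size=(n_null_snps, n_voxels)) ** 2

target_regional = regional_mean(target, atlas)
null_regional = (null_maps @ atlas) / atlas.sum(axis=0)

# Standardize each regional score against the spread seen across the other SNPs.
z = (target_regional - null_regional.mean(axis=0)) / null_regional.std(axis=0)
ranking = np.argsort(-z)   # regions ordered from largest to smallest standardized score
print(ranking, np.round(z, 2))
```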

As shown right here in our pre-print paper, the first row of this analysis is the one we mentioned earlier in the first slide (indiscernible). We ranked the top seven ones for the arrangement, and you can see they spatially landed in the white matter region. In the second row is another locus that we found. We know from animal studies this is related to motor control, and indeed when we do this ranking based on original(?) analysis, all effects can indeed be seen within the cerebellum white matter.

So in general this is a quick overview. I went through all of our efforts to improve the power for discovery, enhance reproducibility, and provide some heuristics to improve interpretability for imaging genetics. Of course, those are very heuristic; we still need to know the limitations and the bounds of this, so we need more harmonization(?) to know the task limitations. Right now we really rely on the participation of this approach based on biologic function is fine and also based on (indiscernible) controlled but still a diagnostic to know when you actually failed.

And finally, just to note, we tried to integrate all these things into a web-based approach we call the Data Exploration and Analysis Portal, because we want to improve accessibility for those researchers who just want to use the tool, the public resource, or have to navigate all this complex data. We put it in a web portal we call the DEAP, pioneered by ABCD study in JAMA, and on the righthand side here is just a schema of how we do it: the ABCD study, UK biobank in the future, a curated data repository, with all the analytic models that we mentioned in that area, and researchers can hop into the web portal and do their calculation and basically decide. So I want to thank all of our collaborators at UCSD, also UCLA, and Damion and Deana but particularly our great trainees and postdocs in PONG lab, Robert Loughnan, Zoey Huang, Shu-Ju Lin, Juntin Ren, and Rong Zablocki. Thank you.

**DR. THOMPSON**: Thank you. That was a great overview in 20 minutes of a rapidly expanding field that is really crying out for novel methods. As to your last point for providing informatics tools, that I think feeds into our next talk, I think that’s really important that we don’t forget that aspect of making these tools available.

And so our next speaker is Dr. Kevin Anderson, he’s a postdoctoral fellow at Harvard working with Dr. Randy Buckner. His work investigates the biology of normative brain aging, incorporating tools from psychology, neuroscience and genetics, and so Kevin please take it away for your talk.

**DR. ANDERSON**: I will be presenting some ongoing work from my postdoc with Randy Buckner, constructing an analytic platform for studying the biological bases of brain aging. So there are a few things I’d like to convey in this talk, the first of which is that there is a tremendous opportunity at the given moment to study the biology of brain aging using existing public repositories like the UK biobank and gene expression data repositories like (indiscernible) which I’ll unpack later.

In the second part of the talk, I’ll describe an integrative platform that we are constructing to facilitate dynamic analyses and visualization of brain aging effects across these levels of analysis, from neuroimaging to gene expression and genetic variation. And at the end I’ll show you some actual videos of a working version of the platform in action and walk you through some results.

Before going further I’d like to give you a visual sense of what is meant by brain aging. Even in the absence of disorders like Alzheimer’s and dementia, there are stereotyped patterns of neurodegeneration that occur as a course of normal aging.

So this neurodegeneration is not a unitary(?) construct, and it varies widely across individuals, but nonetheless follows certain stereotyped patterns, one form of which is general atrophy involving loss of both grey and white matter, amounting to about half a percent per year in the later decades of life.

We also observe an increase in the presence of these fluid-filled cavities called lacunes, which are typically subcortical and thought to be vascular in origin, as well as the increased presence of white matter lesions, which show up as these hyperintense voxels on a structural MRI, reflect demyelination, and are linked to cardiovascular risk factors like obesity and smoking.

And so a critical goal of the field at large is to identify factors both genetic and environmental that influence the trajectories of these age-related changes occurring over the course of normal aging. And I want to highlight that the work I’m presenting is one component of a multi-institutional collaboration funded by the Simons Foundation which is bringing together researchers who study diverse model systems from C. elegans to killifish up through rodents, primates, and humans.

And this is all in an effort to identify mechanisms of risk and resilience in the aging brain and establish baselines for these age-related changes across model systems and datatypes, with the ultimate goal being to identify potential targets to minimize the associated cognitive decline that co-occurs with these neurodegenerative processes.

So this collaborative will, as it matures over the coming years, produce an immense amount of new data, but the work I’m presenting today attempts to integrate existing public and widely available data in order to make progress on these important questions, and to construct a framework for integrative styles of insight.

So in particular we focused heavily on data from the UK biobank, which as many of you know is a prospective cohort study of around 500,000 individuals in the UK, and which provides extensive measures of cognition, behavior, mental health, medical healthcare records, as well as genetic variation, and importantly for our purposes in vivo measures of both brain anatomy and function obtained using MRI.

And so the size of this collection allows for uniquely powered and comprehensive studies of brain aging, as well as its genetic and environmental modifiers. And what I want to highlight here with this distribution is that the individuals who have undergone MRI imaging fall between 50 and 80 years of age, which is when you expect to see these neurodegenerative processes begin to emerge as opposed to normal aging.

A second critical arm of our ongoing work is to integrate large biobanks of brain gene expression data. So these are actually gene expression estimates obtained from postmortem samples of brain tissue. And this will allow for the identification of genes and gene sets and biologic processes that are differentially expressed at various points in later life.

I want to highlight that many of these datasets were not collected for the express purpose of studying brain aging; rather they were collected, for instance, to identify genes differentially expressed in patients with schizophrenia versus, say, matched controls, or maybe autism versus controls. Nevertheless, there’s major variation in terms of the age at which these brain samples were obtained from these donors, which again falls within this 50-to-80-year age band where we are particularly interested in identifying age-related trajectories in, for instance, gene expression.

And these age distributions somewhat fortuitously align with that of the previously mentioned UK biobank neuroimaging sample, where we have data on both genetic variation and non-invasive measures of anatomy and behavior, which opens the door for potential cross-level styles of analysis, to for example link genetic variation to intermediate gene expression through in vivo measures of anatomy and subsequent aging.

And so this frames the goals for the current work, which is to aggregate these genetic, phenotypic and associated brain gene expression data into a unified database, as well as build an analytical platform or front end that allows for dynamic visualization and analyses of these really complex, hard to wrangle, large and extremely multivariate datasets.

And so we intend to use this analytic tool to quantify age associated patterns in both neuroimaging as well as gene expression to potentially reveal interesting clues about the aging process, such as nonlinear age trends, patterns of covariation, and most importantly to identify moderators of these age-related trajectories, such as specific forms of genetic variation or lifestyle factors.

And one thing that’s absolutely critical to highlight here is that this tool or this database in no way bypasses the regulatory requirements and data applications that a researcher must go through in order to access each of these constituent resources. Rather this is a complementary tool that we hope is going to facilitate exploration and discovery of data that an individual researcher would have gone through the process of obtaining himself.

And so here I’d like to now jump into a demonstration of this application. Here we have actually a screen capture of a local version of this platform. And the first thing I’ll show you is that we’re able to explore an extensive amount of data from the UK biobank across thousands of individual brain MR phenotypes as well as non-imaging phenotypes, which we can intersect with around 12 million individual genetic variants which passed QC.

And the sample sizes for these data range anywhere between 42,000 for neuroimaging phenotypes upwards to the full half a million subjects for more extensively collected phenotypes. So here we’re going to jump into some videos. And the easiest way to interact with the application is to just type in the name of the phenotype that you would like to explore.

Here I’m typing in volume of the hippocampus, which is known to have robust age differences across late life. And this immediately jumps us to a data dashboard. So what actually happened (this is a video from my laptop) is that the data were retrieved from the data backend, a linear regression was computed, and the results are presented here relating age to the volume of the hippocampus in the UK data, covarying for a set of commonly used MR variables.

And here we want to make the operations, transformations, and any filters that happen in performing this analysis as transparent and as reproducible as possible. And so one way in which we do this is to allow for exploration of the phenotype distributions that went into this particular analysis.

And we also want to allow for building of intuition of how different covariates may influence an observed relationship. So they can be added or taken away over here on the left, although we try to make a smart initial decision about which covariates are included for any given analysis.

The most important feature I want to highlight though is that we can run interaction analyses with any of the genetic or phenotypic variables within the UK biobank. So here I’m typing in the RSID, which is a SNP ID for a well-known risk locus for Alzheimer’s, so this is the APOE4 locus, which is associated with later development of Alzheimer’s.

And so a new analysis has been run and the data have been served up to the dashboard, and you can see that individuals with two copies of this risk allele, so this dark line here, two copies are associated with reductions in overall hippocampal volume, but that there is a significant interaction such that the effect of this risk allele becomes more exaggerated across the lifespan. And so this potentially could indicate for instance pre-symptomatic processes contributing to later development of risk for Alzheimer’s.
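To make the kind of model behind such a dashboard view concrete, here is a hedged sketch of an ordinary least squares regression with an age by allele-count interaction. The data are simulated, the covariate set and effect sizes are invented for illustration, and nothing here is taken from the platform's actual backend.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
age = rng.uniform(50, 80, size=n)
copies = rng.choice([0, 1, 2], size=n, p=[0.6, 0.3, 0.1]).astype(float)  # risk-allele count
sex = rng.integers(0, 2, size=n).astype(float)                           # example covariate

# Simulated volumes: decline with age, steeper per copy of the risk allele.
age_c = age - age.mean()
volume = (4000 - 20 * age_c - 30 * copies - 3 * age_c * copies
          + 50 * sex + rng.normal(0, 100, size=n))

# Design matrix: intercept, centered age, allele count, age x allele interaction, covariate.
X = np.column_stack([np.ones(n), age_c, copies, age_c * copies, sex])
coef, *_ = np.linalg.lstsq(X, volume, rcond=None)

# Standard errors from the usual OLS formula.
resid = volume - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_interaction = coef[3] / se[3]
print(f"interaction coefficient {coef[3]:.2f}, t = {t_interaction:.1f}")
```

Centering age keeps the allele-count coefficient interpretable as the difference at the mean age, while the interaction coefficient captures how that difference changes per year.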

What I next want to highlight is other ways to interact and generate insight into these immense datasets. So here another thing we want to do is view the association of all the neuroimaging phenotypes. In this Manhattan plot each dot here is an individual brain imaging phenotype. Its height on the Y axis denotes its association to age.

And if you hover a mouse over it you can view the results of the precomputed regression linking that phenotype to age. And if you click on it it would again bring you to the by age dashboard. Those of you who have worked with the UK biobank will find this category hierarchy familiar. You’re able to add or subtract whole categories of imaging phenotypes and change what’s being plotted in this Manhattan plot.

Next, I’m going to jump to a related measure to the hippocampus. This is a measure from diffusion MRI which affects brain white matter, and we’re going to examine the hippocampal cingulum, which is a major upper structure of the hippocampus. This is again video being captured in real-time, and you can see the relationship of this diffusion measure to age here, potentially showing a nonlinear curve, which can be unpacked and may warrant later analysis. We again pull up this APOE4 RSID, and we observe that individuals with two copies of this risk allele have reductions in this diffusion measure, which is related to white matter integrity of the hippocampal cingulum. And this is a phenotype that isn’t as extensively studied in the Alzheimer’s literature as, for instance, hippocampal volume.

And what’s nice about this is if my advisor Randy were to ask for an analysis like this I might go into my office and return a few days later with the results, and here what we’re doing is allowing for the exploration of the data at a much faster pace. So here I’m going to show you how a related variant, this is a SNP close to a gene called VCAN which has been related to white matter repair. And this SNP is broadly associated to overall diffusion measures.

And what I want to highlight here is the difference between the types of moderators which change the slope of the age-related relationship and those that are more associated to global differences in a particular phenotype and may not be moderators of age-related differences. And so you can see that this particular allele has a strong main effect on this hippocampal cingulum diffusion measure, but is not strongly interactive with age.

And so many of the talks have correctly highlighted that the effects of individual variants are vanishingly small, particularly also within the imaging genetic literature, and so polygenic analyses are absolutely critical in order to aggregate signal across biological processes in multiple variants, and from points across the entire genome.

And so I want to highlight how we are building out the ability to conduct polygenic analyses, here using a phenotype reflecting overall brain volume, which shows a linear decrease with age. This is volume of brain white matter. And here what we’ll do is upload a set of SNPs with associated weights that reflect their previous linkage to overall brain volume.

So here is a set of example weights where each SNP is given a numeric value reflecting whether it’s positively or negatively associated to brain volume. Again, upload these data. We give our new polygenic score a nice descriptive name. And the backend will actually compute a polygenic score and run an interactive analysis to test whether individuals with a higher polygenic score have correspondingly higher overall brain volumes.

And I want to mention that this reference GWAS is purely a demonstration because it also included data from the UK biobank, which obviously has problems with circularity that you want to avoid, but I just want to highlight the mechanisms of a polygenic analysis in this way. You can see that individuals as expected who have a higher polygenic score for brain volume have correspondingly higher (indiscernible word) values.
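The polygenic score computation described here amounts to a weighted sum of allele counts. The following is a minimal sketch with made-up SNP IDs, weights, and dosages; the platform's own input format and any quality-control steps are not shown in the talk and are not assumed here.

```python
import numpy as np

# Hypothetical uploaded weights: SNP ID -> effect of each copy of the counted allele.
weights = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}

# Hypothetical genotype dosages (0, 1, or 2 copies), one row per individual.
snp_order = ["rs0001", "rs0002", "rs0003"]
dosages = np.array([
    [0, 1, 2],
    [2, 0, 1],
    [1, 2, 0],
], dtype=float)

w = np.array([weights[s] for s in snp_order])
polygenic_score = dosages @ w          # weighted sum of allele counts per person

# Standardize so the score can be entered into a regression on a common scale.
pgs_z = (polygenic_score - polygenic_score.mean()) / polygenic_score.std()
print(polygenic_score, np.round(pgs_z, 2))
```

As the speaker notes, the weights should come from a reference GWAS that does not overlap the target sample; otherwise the association between score and phenotype is circular.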

And the final thing I’d like to highlight is that we are not constrained to only examine genotypic moderators using something like this. It’s really important, absolutely critical, to understand the role of environmental and lifestyle factors in these age-related processes. So what I’m now plotting is total volume of white matter hyperintensities, which is thought to reflect demyelination, which increases with age, and has been linked to cardiovascular outcomes.

And so with this interaction analysis we can test whether individuals with higher BMI, which itself is associated with poor cardiovascular outcomes, have correspondingly higher white matter hyperintensities. And this reveals a significant interaction with age, such that BMI is linked to increased white matter hyperintensities, and this relationship becomes more exaggerated with age.

So with the final couple minutes I’d like to show some ongoing work building the gene expression platform to allow for quantification of genes and gene sets and their change with age. So this, just to remind you, is data separate from the UK biobank, integrating sources of gene expression from postmortem brain tissue.

So here we can type in the name of an individual gene, which will bring us to a visualization platform showing how that gene is expressed by age. In this case this is plotting data from frontal cortex(?) from the NIH GTEx repository. Other brain regions can be incorporated here, for instance by clicking on this button on the left-hand side.

I’ll note that the C4 gene is part of a complement cascade related to immune function. Here we’re now plotting BDNF, brain-derived neurotrophic factor, which is important for neuronal plasticity, and this particular gene is showing decreased expression by age.

The last thing I want to highlight is it’s absolutely important to look at gene sets in order to make inferences about the types of biological processes that are showing increases or decreases across the lifespan. And so we’re able to search for what in this case is called gene ontology categories. Here this is showing the expression of genes related to pro inflammatory responses listed at the bottom of this page, which show an increase in expression with age.

So to conclude I want to highlight that age is a major source of variation in these largescale biobanks, and whether or not you’re interested in questions about brain aging, it’s absolutely critical to understand how it influences your data in order to draw conclusions about whatever other phenotypes you would be interested in for a given repository. I’ll also suggest that these dynamic analyses increase the accessibility and pace of discovery; for instance, working with these data requires a lot of investment to get to know how to even load them onto an analysis environment.

And also our hope is that we are able to facilitate integrative insights that bridge multiple data types and biological levels of analysis. So with that I want to thank everyone for listening, the organizers, our funding from the Simons Foundation, the collaborative and all the data sources as well. Thank you.

**DR. THOMPSON**: Thanks Kevin. That was a really great talk, it looks like a super cool resource you put together for the community. So Dr. Yu has a couple questions for you in the chat, which I’ll let you respond to in the text. But somebody also asked when this tool is going to be publicly available, or is it already?

**DR. ANDERSON**: I don’t have a timeline for that. It’s still under heavy development.

**DR. THOMPSON**: And again, Dr. Yu has a couple questions that she sent to the panelists plus the participants that maybe you can respond to. So our next discussant needs no introduction, literally because he’s been introduced, but it’s Dr. Tom Nichols, Professor of Neuroimaging Statistics and a Wellcome Trust Senior Research Fellow in Basic Biomedical Science at the Oxford Big Data Institute. Tom, I’ll let you take it away with the discussion.

**DR. NICHOLS**: I really enjoyed all these talks, there is some really prescient work that’s going on. So I just have a few slides to summarize what I got out of this. I have to say I used to think genetic heritability was boring, because really if you’re trying to do inference on heritability, if you can measure it, and it’s from the body, it’s probably got nonzero heritability. So doing inference on heritability is kind of not so interesting. But as I think Armin pointed out there is really useful information here.

I just love this figure he had from his talk, what better way of showing this, we actually had these side-by-side measures of variance explained, and directly comparing them in terms of intelligence, yes, it is heritable, but actually way more variance is explained with brain imaging. Also I find this interesting because sometimes there’s a notion that static measures of brain structure like thickness, like you see here, or area might explain more in the investigational case, these are variants.

I do wonder that Armin was very careful to describe this and to me it just looked like heritability, and I don’t know if that was an important distinction that he was making to distinguish it from heritability in some way, but I really enjoyed seeing that. Having read the statistical genetics literature, I really enjoy seeing the rigor behind where that comes from, especially if you’re familiar with LD score regression, there’s a lot of links between LD scores and this work, but LD score has a lot of arbitrary (indiscernible). I think in their construction, this is a much more principled way to get excellence(?).

So I’m very happy he’s talked about doing the imaging, but something else to think about is that you’ve fixed a problem with your analysis, which is its dependence across the measures, but perhaps there could be interest in the variance in those PCs that you removed. So I get it that you need to remove them to make your method work accurately, but maybe there should be a side analysis saying how much variance can be explained by those two principal components, and add it on or what have you.

And then also very interesting in your talk, Armin, you talked about population, and I wonder if what will dovetail into that is the complementary imaging where you actually know about the structure of the links between individuals because it’s a family study, or you can get the empirical relatedness from (indiscernible). I was wondering if there’s links in there, but you will enjoy that work.

In Chun’s work, really impressive volume of work, really showing that there are, what I thought again might not be very practical or useful, genome wide or voxel wide association actually can be done practically.

And just to summarize tricks, one of the things is do an approximate analysis, and use that approximate analysis to cluster states (indiscernible). I’ve seen that in other work; score tests do a fine job especially when these effects are small and general anyway. And then another trick that I like using in my work is using distributional approximations and permutations. Permutation is great, but too expensive to do exhaustively or even nearly exhaustively, so you then use distributional approximations. We use this term (indiscernible) it is strange that 0.4 is a really common relatedness in the ABCD, I would have felt a 0.5 is the most common spike because they’re siblings, so maybe you can comment on that.

And then Kevin, really impressive work. I think what wasn’t said was while this is really cool, I think having a platform that’s available, really this is a way a lot of things can be going, because when the data is so massive, I don’t think it was directly mentioned, but the genetics data in UK biobank, especially as you get whole sequence exome data available, no individual is going to want that. Maybe one person at an institution and sharing with them, the data is getting so massive, it is really not practical. Very few people are going to want to access the whole dense information, and platforms like this to make important analyses available to people are going to be really incredibly useful and important.

Of course, Wes and Chun are involved in DEAP, other initiatives are trying to do some things in common for the ABCD data, and I think it’s really important work and good to see that. I had a question of heritability, Kevin answered that, but one thing I really think you got about provenance and communication of results, to say, hey, I found this really cool analysis, now how if I came back in six months could I know that I would get the exact same analysis. Maybe with new updated data, or maybe with a historical snapshot, data that was there six months ago.

So I really challenge you to think very carefully about provenance and ways of capturing exact analyses, because if you are going to be the source -- I feel like there are two different modes to these platforms, there’s kind of playing around, for example in UK biobank it has its churches(phonetic) platform, and that gives you deciles and medians, this is playing around, those are just descriptive. But if you actually want to take analysis and write a paper, are we getting to that point?

Are you going to be able to use those platforms to write a paper? And if so it’s got to be reproducible, you’ve got to be able to exactly describe the data that went into it. And I think that’s where we need to be going given the size of the data, but it’s an extra challenge for people who build these platforms to have that level of provenance and documentation so the person using it will know exactly what is the analysis that’s been done.
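One possible way to capture the exact analysis for later reproduction, offered only as a sketch and not as something either platform is known to do, is to hash a canonical description of the analysis together with a label for the data snapshot. All field names and values below are hypothetical.

```python
import hashlib
import json

def analysis_fingerprint(spec):
    """Hash a canonical JSON form of the analysis description."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

spec = {
    "data_snapshot": "2021-06-01",          # hypothetical data release label
    "phenotype": "hippocampus_volume",      # hypothetical variable names
    "predictor": "age",
    "moderator": "rs_example",
    "covariates": ["sex", "scanner_site"],
    "model": "ols_with_interaction",
}
fingerprint = analysis_fingerprint(spec)
print(fingerprint[:16])
```

Reporting such a fingerprint alongside a result would let a reader check six months later whether a rerun used the same variables, model, and data release, or a newer snapshot.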

Also one thing to think about is maybe we’ll be able to switch so people can see the standard deviations as well. These are really impressive, but of course when you have such nice data it’s always nice to kick in the standard deviations and standard errors: oh wow, there’s a lot of variability. But because you have 4,000 subjects, actually these are tiny standard errors. I think you do like to hear reports showing the coefficients, but making sure there are interpretable(?) units like partial R squared or something, some interpretable way of expressing how much variance is explained by the effect of interest.
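For reference, partial R squared for an effect of interest can be computed by comparing the residual sum of squares of nested models with and without that term. The sketch below uses simulated data and hypothetical variable names; it illustrates the standard definition rather than any particular platform's output.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def partial_r2(X_reduced, X_full, y):
    """Share of the reduced model's residual variance explained by the added terms."""
    sse_reduced, sse_full = sse(X_reduced, y), sse(X_full, y)
    return (sse_reduced - sse_full) / sse_reduced

rng = np.random.default_rng(3)
n = 4000
age = rng.normal(size=n)
snp = rng.integers(0, 3, size=n).astype(float)
y = 0.5 * age + 0.05 * snp + rng.normal(size=n)

ones = np.ones(n)
X_reduced = np.column_stack([ones, age])        # covariates only
X_full = np.column_stack([ones, age, snp])      # covariates plus the effect of interest
pr2 = partial_r2(X_reduced, X_full, y)
print(f"partial R squared for the SNP: {pr2:.4f}")
```

With thousands of subjects, an effect can have a very small standard error and still explain well under one percent of the variance, which is the contrast the discussant is pointing to.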

So those are just some of my highlights, my questions, maybe we can have discussion among the panelists, pick up some of the questions I threw out there.

**DR. THOMPSON**: Can all the speakers join Tom and respond to some of Tom’s questions?

**DR. SCHWARTZMAN**: I can start by answering Tom’s comments. The first one about the comparison with LDSC and other methods in the genetics literature, yes you are right, one of our main motivations has been, besides -- there have been so many methods proposed in the genetic literature, they haven’t been really studied theoretically. And that’s why I emphasized so much in the talk that checking conditions and having diagnostics for those conditions is important. So I think it’s important to go over these things on a strong theoretical basis so we can trust our results. So that’s the first comment.

And the second one, about the principal components that are removed, you’re totally right about this, and I failed to explain it. Stephanie in the audience also brought it up. Of course, the long-range correlations, both pieces contain part of the signal that may be important. So what we’re trying to do here, we’re thinking of an approach where just as we decomposed the variance of the trait into the component of brain imaging, we could think of a tree where we decompose it into part of the variance as explained by long-range correlation, maybe captured by networks, things like that, and the other residual part that corresponds to more localized effects but possibly distributed.

**DR. THOMPSON**: There is an interesting difference between genetics and imaging there I think, in that any long-range correlation say across chromosomes in genetics is usually thought of as due to population stratification, which is usually considered a confounding factor. Whereas long range correlation in the brain probably really is there and is really part of the signal.

So it’s controlled away in genetics by regressing on the principal components of genetic ancestry, whereas you don’t necessarily want to do that in imaging because you could be getting rid of actually a pretty good chunk of the true signal. So how to differentiate those two cases is not clear to me. But that is an interesting difference, I think, between genetics and imaging.

**DR. SCHWARTZMAN**: This could be estimated precisely I think into decomposing these two components, one being low dimensional, one being high dimensional.

**DR. FAN**: I can probably go next for the question that Tom raised. When looking at this relatedness distribution, IBS estimated the functional type data. I think this highlighted the issue of how you can accurately estimate the empirical relatedness matrix. Because in ABCD study, it’s quite complex, you’ve got a mixed population in that you have a large chunk of haplotypes transmit very differently, and meanwhile they have different level, within probably some of these jittering amounts, IBS is due to a combination that we don’t know.

So combined together, the (indiscernible) estimates look at independent SNP across genome, ABCD as a whole, estimated within this, there has to be an impact by these complex relationships between individuals, whether they are in the same genetic ancestral group or are indeed from the same family, or even sometimes we do know like in UK biobank we observe the spatial distributions where they are, it is also within this journal, so that’s the thinking. But I think this just highlights the complexity of having those things.

Any of those things can be misspecified and cause failure to a certain degree, so I think it’s highlighting the importance to really understand the complexity of it and try to see the limitation of each different approach. But the good thing is when looking at these very quick approximations, as triage to see what loci are most significantly associated with that, some misclassification can be tolerated.

**DR. ANDERSON**: Your point about reproducibility is extremely well taken, and we want to make the analyses that are being computed on this platform, for instance, as fully reproducible and transparent as possible. For instance, the number of different variables within the UK Biobank is immense. You need to be able to handle for instance ordinal versus continuous and all types of different regressions and showing users what specific regression models were selected is going to be absolutely critical.

So one thing that we’re working toward is a way to save the instantiation of any given dashboard view but also to provide, as one potential approach, reproducible code that a user could run offline, independent from the analysis, and then begin to iterate towards a publication-quality analysis for that set of findings.

Because while the current situation is incredibly useful for sort of an atlas style lookup of effects, it doesn’t replace the kind of deep insight that you need in order to make scientific discovery, and also for extracting information from patterns and covariation across these huge amounts of multivariate data.

**DR. NICHOLS**: I think that’s a great idea, that idea of a handoff between the platform and (indiscernible). Are there any questions? I know someone early on in the comments talked through connections between psychology where there’s been a lot of work on sub measurements, and the idea that in brains we’re trying to combine all these measurements. I’m not sure that was answered.

**DR. THOMPSON**: It’s in the answered chat -- I didn’t really answer it though, I just said there were more fine grain neuropsych measurements in ABCD.

**DR. NICHOLS**: I think it’s also a situation where we don’t have good first principle explanations for how every single voxel is related to each other in the brain, unlike as you may in an instrument, a battery of tests, so I think the chance of it being more on the empirical structure. Are we supposed to be done at ten to?

**DR. THOMPSON**: I think we are supposed to be done now. I think we are finished. I just want to thank the speakers and Tom once more for what I thought was a really interesting session on imaging and genetics and their confluence and novel methods to try to address this new era of large data in imaging as well. Thank you very much.

**DR. WOUHIB**: That is great. A very interesting session. Thanks, Wes, for organizing such a great session. We will come back to the final session of the day after a brief break to stretch in maybe seven or eight minutes from now.

(Break)

**DR. WOUHIB**: Our last but not least session will be moderated by my own Division Director, Dr. Sarah Hollingsworth Lisanby, Director of the Division of Translational Research here at NIMH. I would like to take this opportunity to thank her for enlightening thoughts when we were thinking about this workshop and, also, everybody who was involved from NIMH in this workshop.

We have four panelists, Drs. Stuart, Guo, Bhaumik and Thompson, and I believe also from NIMH we will have Dr. Gordon, Dr. Heinssen, Dr. Brouwers and Dr. Freed, to be involved if they want in the discussion.

This is a discussion about the role of statistical methods in improving mental health studies, and I am assuming it will be a very hot topic. Holly, please take it away.

**Agenda Item: SESSION V: Panel Discussion on the Roles of Statistical Methods to Improving Mental Health Studies**

**DR. LISANBY**: Thank you, Abera, and thank you everyone for an exciting day. I will just make some introductory remarks and then we will go into our general discussion.

We want to ensure that the research studies that we support are rigorous, reproducible and adequately powered, but how do we do that? With finite funds, how do we approach the inherent tradeoffs, many of which were discussed by the panelists today?

Increasingly complex data streams and high dimensionality data really demand novel approaches to transform this type of data into insights that can advance science and ultimately inform care, as we heard from some of our speakers. We really rely on peer review to evaluate the quality and rigor of experimental design, and we recognize that the quality and rigor of the analytic approach plays a big role in the overall impact that the research can have. This is where engaging statistical experts and the data science community at the earliest stages of study design can really make a difference as we seek novel insights into the causes and treatments for mental illness.

I want to thank everyone. It has been such an exciting day so far with stimulating presentations that have really plumbed the depth of the challenges we face and statistical methods for mental health. We have heard about some possible solutions, we’ve heard considerations about tradeoffs, and we have heard a few teasers about the focus of the Day 2 session which will be on data visualization, I believe some lovely examples of that.

Before passing it on to the session chairs I will just mention that in one of the discussions reference was made to funding for the development of analytic methods, and I would like to draw your attention to two relevant programs that NIMH has that do fund grants in this area. You already heard from Abera about the program that he directs on statistical methods in psychiatry, and also Dr. Michaela Ferrante directs the programs on computational neuroscience and computational psychiatry. So everyone needs to come back for Day 2 to hear more from Michaela’s program and about data visualization.

We have asked each of our four session chairs to give a brief synopsis of their panels and then we are going to open it up to questions. I would like to start with Dr. Elizabeth Stuart for Session I.

**DR. STUART**: Thank you so much. I am happy to give a little overview of some of the themes that I heard. Actually, just building on your last comments about the existing programs, one of the things that I have appreciated and the reason I work in this field at all really, is that NIMH has long supported methodological work in services research, maybe not necessarily through a specific PAR but the review panel has had a long history of that which I very much appreciate. I will come back to that at the end of my remarks.

I think one of the themes I heard in the first session this morning, which was focused on mental health services research, is around how to harness complex data. We heard about electronic health records, extensive longitudinal data, social media data. Sometimes these are in the context of a randomized experiment, maybe randomizing different individuals, maybe randomizing clinical practices, usually larger scale than some of the other studies that we were hearing about but needed in order to have the sufficient power.

The range of questions that these mental health services might be asking include prediction, causal inference, descriptive analyses. We heard about even the value of just documenting rises in mental distress during the pandemic and having high-quality data to be able to track that.

And I think a broad theme is generally, again, prediction in terms of identifying people who are at risk or to help identify optimal treatments for people and to help them find the treatment that is going to work for them.

Of course, there are lots of complications to doing all of this grand scheme. Again, sort of the complex data complications and confounding and how to deal with the many, many measures we might have. We might not really know, for example, what are the exact predictors that are going to be the most relevant. That is what some of the speakers were talking about.

I want to end with a couple themes that I have been thinking about moving forward. One is the statistical issues in terms of how do we best integrate data across different sources. Could you imagine a world where we can bring together fine-grained data on individuals and brain imaging, for example, with services information, electronic health records? So, how do we really best harness all of the data and bring it together?

The other theme -- and this came up in the Q&A -- is uncertainty. With a lot of these things, especially for prediction or trying to identify an optimal treatment regime, there is a lot of uncertainty in that, and so how do we express that uncertainty appropriately but also give guidelines or information that is going to be useful?

Finally, I just wanted to say that a theme I also heard is to build bridges between clinicians and public health practitioners and statisticians. There are three ways we can think about doing that -- and a theme from today was the need for that. One way is through statisticians serving on review panels, and I don’t mean just one token statistician to raise concerns about the power analysis, but statisticians who become deeply involved in the subject area and so can really understand it and contribute to the discussion in a really thoughtful way.

Second is meetings like this. I think this has been such a great meeting, and the Q&A has been really great. I hope we can do something like this in person where we can really bring together these different audiences.

Third is training, and maybe we can come back to this. Both how do we train statisticians to communicate with applied mental health researchers, but also how do we train applied mental health researchers to understand the statistical methods and maybe know their strengths and limitations and potential use? So maybe we can come back to some of that. I will now pass it along. Thank you.

**DR. LISANBY**: Thank you. Now I would like to invite Dr. Ying Guo for Session II.

**DR. GUO**: Thank you. First of all, I would like to thank NIMH and Abera and Michaela and Sarah and Dr. Gordon for organizing this workshop. I think it is really timely and it has been a fantastic workshop with a lot of great ideas that we heard today. I will just give a summary of my session which was focusing on generating reliable and reproducible findings. I would like to acknowledge the panelists who spoke at this session.

First of all, from these talks one of the big things we heard is it’s very important for us to evaluate the stability and reproducibility of the findings we generated from the studies. This should be something that is like built in as a standard step in future studies. Instead of stopping at generating the results from current data, we should always think about how reliable, how reproducible these results will be across data and the perturbations.

Dr. Bin Yu talked about the concept of perturbations. There are two kinds of perturbations we should consider. Data perturbations, basically, how well findings will be reproduced in different studies, and how the findings will be generalized to different data sources. In neuroimaging, for example, different modalities of images. Also, in imaging we know that preprocessing is a big component in an analysis procedure and there are many different ways to do preprocessing. Tom Nichols touched on that. So, how do the results stand when you preprocess your imaging data using different pipelines?

All of these kinds of potential perturbations of the data need to be considered when we are thinking about how stable the results will be across those perturbations.

Another type of perturbation we should consider is methodological perturbation. Tom Nichols spent quite some time in his talk on some important findings where he tried different software and different pipelines looking at the same kind of data, and it is really amazing to see how different the results are coming out of just using different software. I think that is kind of revealing and, at the same time, it really worries us. And everyone is doing the right thing. All this analysis is legitimate analysis, but how can the results be so different coming out of it? So this is something that we should really seriously take into account and keep in mind.

Also, when we report the results coming out of these studies, we should really be transparent about what are the procedures and pipelines that were used to generate the results.

Taking it to the next step, how can we use advanced statistical methods to improve rigor and reproducibility? I think our panelists have come up with some great suggestions. The first thing is that we should really start to promote advanced and best statistical practices for generating reliable findings.

Dr. Bin Yu talked about the PCS framework. I think this is a really great modern statistical and machine learning framework that is a big umbrella framework encompassing a lot of key ideas and best practices in both machine learning and statistics. She didn’t have time to talk about all the details of it, but this is really a major framework that I think can help to improve practice in terms of how we generate results in studies.

Also, Tom talked about how we should do a lot of validation to identify some best methods for every step of our imaging analysis and build a consensus pipeline, such as fMRIPrep, which is a really great example. Also, multiverse analysis. Basically, we should acknowledge the methodological variation coming out from it, and really take into account what your results will be if we sample over those variations.

Martin talked about leveraging large-scale datasets. I really like that. NIH spent a lot of resources obtaining those datasets and they are really rich resources. One of the important uses of it is that we can use it to validate the results from smaller studies. We can use it to build priors which can help to improve the power and validity of smaller sample studies in local clinical trials.

We can also use these large-scale datasets to test reproducibility of new measures or new findings. For example, the new concept of down-strapping means you sub-sample these big datasets, generate some smaller samples that mimic your local studies and see how variable your results will be in those local studies.
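The sub-sampling idea described here can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the "large study" is simulated data, the statistic is a simple mean, and the function name `downstrap` is hypothetical rather than taken from any talk or package.

```python
import random
import statistics


def downstrap(values, subsample_size, n_draws=1000, seed=0):
    """Draw repeated small subsamples (without replacement) from a large
    dataset and return the estimate (here, the mean) from each one."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(values, subsample_size))
            for _ in range(n_draws)]


# Hypothetical "large study": 20,000 simulated measurements, true mean 0.1.
_rng = random.Random(42)
large_study = [_rng.gauss(0.1, 1.0) for _ in range(20_000)]

# Subsample sizes chosen to mimic local studies of different sizes.
for n in (25, 100, 1000):
    estimates = downstrap(large_study, n)
    print(f"n={n:5d}  SD of estimate across subsamples = "
          f"{statistics.stdev(estimates):.3f}")
```

The spread of the estimates across subsamples gives a rough sense of how variable a finding would be in a local study of that size.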

Martin also talked about this great concept using a federated approach to share models across different groups.

I would like to end with this really nice plot from Martin’s talk. He talked about the different stages of the study. Usually I think most of the studies end here. We collect one data sample, and we do our analysis, do cross-validation, get the results, get a paper published and then that is it, basically. There is really a strong need to take it to the next level. For example, how will you be able to generalize if you collect new samples from the study, and what about when you take the model to different studies, different labs, different scanners where you collect the images; then will your results still stand?

And then, to the last stage, can you reproduce your results in large-scale diverse populations? I really like this plot that Martin has come up with. And to be fair, to take this study to the following stages takes a lot of time and effort after the initial study ends, so I think it is very important for us to recognize the need for doing this extra work.

One final thing I thought about is this analogy that if we just do the first-step approach it’s kind of like everybody contributes to a pile of sand, and that sand pile will grow bigger but it will not really get taller. But if we do this and take the extra step, we are mixing -- you know, concrete and sand together, and then we mold them, we rotate them and we can actually build high-rise buildings by using these more concrete findings. I will end here, thank you.

**DR. LISANBY**: Thank you very much. Wonderful analogy. Now I would like to invite Dr. Dulal Bhaumik to summarize Session IV.

**DR. BHAUMIK**: Thank you. My session was basically statistical inferences for neuroimaging data. Things like the power analysis, what size sample do we need for power analysis. In order to answer that question, we should be aware of what is called the false discovery rate, and if it is too much then it will not serve the purpose.

We tried to see what the key players are for the power analysis, in the sense that when multiple comparisons are involved, for instance across the many voxels of the brain, then the traditional concept of power analysis doesn’t go through. So we need some different kind of definition for power itself before we even start. But just like Type 1 error rate, the false discovery rate also plays a big role before even starting the power analysis.

We compared different methods for controlling that false discovery rate, and what we wanted to show is that the more we explore the data, once we have the data that is needed, the more we can find the ingredients from the data itself and then incorporate those to control the false discovery rate.
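For readers unfamiliar with false discovery rate control, one standard procedure is Benjamini-Hochberg. The sketch below is a generic illustration with made-up p-values; it is not necessarily one of the methods compared in the session.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    while controlling the false discovery rate at level q."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Illustrative p-values only (for example, one per brain region).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, q=0.05))
```

With these ten values, the two smallest p-values are rejected at q = 0.05, whereas a Bonferroni threshold of 0.05/10 would reject only the first.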

The second part is that once you are happy or some kind of happiness is there, then we can think about the power analysis. The very traditional concept of effect size really doesn’t work that much here because there are so many tests involved. So, instead of a fixed effect size we can think about varying effect size, which means that effect size will be changing from this value to the other value, it can take any value, and so on.

Then it also depends on the proportion of null hypotheses. In a comparison study, mental health patients can be compared with controls, or it can be a different kind of comparison; then, when the (indiscernible) are not different, that is the null hypothesis. Some regions or some networks will be different (indiscernible).

The nature of different networks may be different, so we should explore the variation of those networks, and that is the heterogeneity. Also, the effect size is based on this kind of heterogeneity and try to find out the distribution.

These are the ingredients for the power analysis, and once you have that then there are a couple of definitions that have been developed in recent years for power analysis and you can use that.

The whole focus was that it is not a blind kind of prediction for power analysis, rather we need some relevant data and try to explore that even if it is a pilot or some standard data -- use those kind of data to figure out what are the parameters that are needed both for false discovery controlling and also for power analysis. That was the first talk.

The second talk had the same kind of problem, the multiple comparisons problem, but from a different angle, power and controlling the false discovery. Also extremely important is what is called low frequency, which was brought into the picture in that analysis. That plays a major role in controlling the false discovery.

The third talk was based on if we have confounders or covariates, then how to incorporate those and then go through the testing of hypothesis.

This whole session was focused mainly on inferential problems of fMRI data, all three talks illustrated with resting fMRI data, but that can be active fMRI data also, but focused mainly on different types of hypothesis.

At the same time, the big question in my study, what should be the sample size to get 80 percent power. That kind of problem we also tried to address.

**DR. LISANBY**: Thank you very much. Last but not least, Dr. Wes Thompson for the last session.

**DR. THOMPSON**: Thanks. I am going to focus on what I think the lessons are from genetic studies to imaging, and I think all three talks broached this topic.

First of all, there are substantial similarities I think between genetics and imaging data, and some of the most germane ones are that effects are likely widely distributed but small. There may be no such thing as a null voxel or null vertex. If you have a large enough sample size, your confidence interval and your effect size may not cross zero for any vertex or voxel. I think that is quite likely to be true.

In that scenario you need large samples. The ones that were focused on mostly in the talks today were the ABCD study and the UK Biobank, which are truly large samples. But if you really are in that paradigm, there is no substitute for having large studies. There is just no way around it. No statistical method is going to get you where you need to go if the effects are that small and distributed; you need a lot of data. That is the lesson that was learned from genetic studies where maybe the methods aren’t always so sophisticated, but they made huge progress once they started getting the sample sizes in the hundreds of thousands of people.

Along the same lines, small studies combined with publication bias have likely resulted in hugely exaggerated effect sizes in the literature, and that is for reasons that Martin brought up. If you have a lot of variability in your effect size estimate and you have a large number of phenotypes in your small study, the temptation is there, almost irresistible -- I am an applied statistician and have been working in the field for 20 years, and it is almost irresistible to publish a result that has a P less than .05 even if it’s fishing data, even if you are not intentionally doing it. So that has almost certainly resulted in inflated effect sizes in the neuroimaging literature.
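The inflation mechanism described here can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis from any talk: the true standardized effect (0.2), the group size (20 per group), and the use of a normal approximation for the significance test are all illustrative choices.

```python
import random
import statistics


def simulate_published_effects(true_d, n_per_group, n_studies=2000, seed=1):
    """Simulate many small two-group studies and return (all estimated
    effect sizes, the subset that reached nominal significance)."""
    rng = random.Random(seed)
    all_d, significant_d = [], []
    for _ in range(n_studies):
        a = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
        d = (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd
        # Two-sample test statistic for equal group sizes; the 1.96 cutoff
        # is a normal approximation to the t critical value (assumption).
        z = d * (n_per_group / 2) ** 0.5
        all_d.append(d)
        if abs(z) > 1.96:
            significant_d.append(d)
    return all_d, significant_d


all_d, sig_d = simulate_published_effects(true_d=0.2, n_per_group=20)
print(f"mean estimated d, all studies:       {statistics.fmean(all_d):.2f}")
print(f"mean |d| among 'significant' ones:   "
      f"{statistics.fmean(abs(d) for d in sig_d):.2f}")
print(f"share of studies reaching P < .05:   {len(sig_d) / len(all_d):.2f}")
```

Averaged over all simulated studies the estimate stays close to the true value, but the subset that crosses the significance threshold necessarily reports effects several times larger, which is what a literature filtered on P < .05 would show.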

A fourth thing which Chun touched on is the issue of population stratification causing bias. That is saying that people with different genetic ancestries, if you don’t control for that in genetics analyses it will cause biases in your search for GWAS associations. The same thing is true in imaging. Genetic ancestry impacts brain morphology and that results in spurious relationships, and so that needs to be taken care of in, realistically, ethnically diverse samples such as we have in the United States, or we should be having in the United States.

But there are also meaningful differences between genetics and imaging, and that means that we can’t just port methods over en masse from genetics to imaging. Three really relevant ones are, for imaging, effects are spatially correlated; they are not IID. GCTA and (indiscernible) regression assume that the generative effects are IID. The correlation in the effects is due to the LD, which is kind of a spurious way to introduce correlation in effects. At least that is the model that they use. That is not an appropriate model in imaging; it is not even close to appropriate.

Imaging phenotypes can be noisy, and we have seen that. Martin and Tom mentioned reasons why.

Both of those methods in genetics don’t assume that the genotype data is noisy; they assume it is not noisy. Maybe there is some noise in that but I think the noise is much lower than it is in imaging. Imaging data we know is a lot noisier, and so that has to be built into our models. If we don’t account for that we are going to make incorrect inferences.

The third which is obvious is there are a lot of imaging modalities. We have structural and functional, we have diffusion, we have resting state, we have task. They will provide different estimates for variance explained, as Armin showed, but then how to combine across modalities? It could be, and is almost certainly true, that no imaging modality is completely capturing every aspect. No single imaging modality is capturing every single relevant aspect of the brain.

We know that behavior is seated in the brain. At least I think that is the scientific viewpoint. But there is no guarantee that MRI is capturing all of that relevant variation. And so hopefully it is correlated with the relevant variation but it is not going to be exactly the right thing. If we had the perfect tool that was noise-free then maybe the effect sizes would be a lot larger.

So the effect sizes being small in imaging doesn’t mean they are not important because it could reflect issues of measurement rather than underlying issues of biology, or both. So, small effect sizes need to be interpreted in terms of measurement and biology, not one or the other. I think that is a relevant message that may be somewhat different from genetics.

What are the lessons, how do we progress from here? First of all, I think it is increasingly clear that we need large studies. We are on a treadmill currently, or we have been, where lots of small studies publish results that don’t replicate but they do publish effect sizes that are large, and then they are used as the basis for a power calculation in another study that justifies a power calculation that justifies an effect size of .8, and that study doesn’t replicate. But then they find another effect that is not related to that but is large, because they are fishing for results, and this results in us never making progress. It’s like a treadmill that never goes anywhere.
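To make the stakes of the assumed effect size concrete, the standard normal-approximation formula for a two-group comparison gives the sample size per group as n = 2(z_{1-α/2} + z_{power})² / d². The sketch below applies it; the function name is hypothetical, and the approximation slightly understates what an exact t-test calculation would require.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means with standardized effect size d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # about 0.84 for 80 percent power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


# Effect sizes chosen for illustration: 0.8 as mentioned above, then smaller.
for d in (0.8, 0.5, 0.2):
    print(f"assumed d = {d}: about {n_per_group(d)} participants per group")
```

A power section built on an assumed effect size of .8 justifies roughly 25 participants per group, while a true effect of .2 would need close to 400 per group for the same 80 percent power.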

And so we need two things in these large studies, because we are not going to get rid of small studies and we shouldn’t get rid of small studies because they will always have a place. We need statistical methods that leverage the large datasets. GWAS is one, but I should have made this more general. We need methods that are built on this paradigm of small but widely distributed effects. And then we need to be able to use these large studies somewhat to help with this power problem with small studies.

I think there are two things that could do this. I don’t know what the answer is but there are two things that I think could do this. One is using -- just as in genetics they have the polygenic score which is a way that you can take large studies and then leverage that to come up with a simple measure of genetic merit or propensity for a disorder in your small study, and that can help power small studies.

The second thing I think has to be like a harmonization approach, and this could be how the NIH funds small studies. A small study has to have measures that are harmonizable with large studies so that you can add them in somehow, compare them to large studies or create a meta-analysis or something. I don’t know what the answer is but there has got to be some way to escape from this treadmill going nowhere with neuroimaging research.

Those are I think the main messages that I wanted to point out.

**DR. LISANBY**: Thank you. I would like to now invite all of the session chairs to put your cameras on, as well as our NIMH colleagues, and we would like to have a discussion amongst the panelists. You have each summarized really important take-home points from each of your sessions and now we would like to look across these. We welcome Dr. Heinssen, Michael Freed and Dr. Brouwers. It is wonderful to have trans-NIH institute representation here.

Let’s open it up for discussion amongst the panelists. We will also be monitoring the Q&A box for comments from the audience.

**DR. GORDON**: If you don’t mind, Holly, I would love to start. I don’t know if this applies to all of the chairs but I think some of them made comments that this might apply to. Where we can identify approaches that we think ought to be universal or at least ought to be considered part of good scientific practice for most studies, how best do we identify them and disseminate them? How can we make sure that good practices are used?

Obviously, we can do what we can during review, but I am curious about what you all might say in your individual areas of expertise.

**DR. LISANBY**: I see Elizabeth Stuart’s hand.

**DR. STUART**: It is a really great question. I wish I had a magic answer like this is how we do it. You mentioned review panels and I think that is right, sort of having a diverse set of people on those to be able to bring different perspectives and raise red flags from their own expertise. That is where I think interdisciplinary review panels are incredibly useful because everyone brings their own piece.

I will flag, just as an analogy, that the Patient-Centered Outcomes Research Institute, PCORI, has their methodology standards which have tried to do this. It has been a little challenging, and I think methodologists have become more comfortable with them, thinking of them as like a bare minimum standard. I think one of the challenges is that once we get into more advanced things, some people will say, I want to do multiple imputation, and other people will say but if you do, I would prefer this. And there are some fights. But I think there is some low-hanging fruit that everyone should agree on, and PCORI has sort of tried to articulate those.

One of the points I would make is that I think finding the right level of those guidelines is the low-hanging fruit that should be sort of clear standards for everything and not get sucked into the debate about the fine specifics of exactly how things get implemented. And the PCORI methodology standards might be one model for thinking about that.

**DR. YU**: Just to follow up on that, I think the NIH panels could also have some guidelines that incorporate a lot of the good ideas here. And for the reviewers who already use that. And then have the evaluation scores; that should be part of it.

And the other thing is some work with UCSF doctors put out this kind of minimum requirement saying what everyone will agree on, and that is the minimum. And they can have different tiers of this criteria that someone can debate, but should still have people (inaudible). I think that will help push things forward.

**DR. THOMPSON**: I have been on a lot of NIH grant applications, and I have served on a lot of study panels. I am actually a sitting member of the Addiction, Risk and Mechanisms Study Section. One thing as a statistician that bugs the hell out of me is the power section. It is an exercise in people justifying the sample size that they would have gone with anyway. It’s a moral hazard.

It is often based on pilot data with highly exaggerated effect sizes justifying yet another small effect size. I don’t know what the answer is to fix this. Maybe they should be forced to show confidence intervals for the effect sizes they’re using and show the range. Maybe that should be a requirement or something? I don’t know. But the current power section is not helpful in any review process that I have ever seen.

**DR. LISANBY**: Thank you for that comment. Ying and then Dulal.

**DR. GUO**: If I may add my two cents, I think it sometimes starts organically. I will give an example. There is this method we call the ComBat method. This is a method that neuro-imagers borrowed from genetic studies to do data harmonization across different studies, so it is essentially a mixed model, basically. A group of neuroimaging statisticians kind of popularized this method.

I am a standing member of the emergent imaging technology section. Nowadays when we review sections whenever it is a multicenter study, all the members know -- do they do data harmonization, do they use ComBat? Sometimes I think the field grows organically when they recognize a great method.

Also when review panels review such sections they can give feedback in their response, so that is also a great way to let the applicants know what is the better practice that is expected from your application.

Another way that I think we can broadcast these kind of best practices is that NIH has funded some great software and tool platforms such as (indiscernible), so a lot of methodologies are posted there. I think that is a great forum for people to see all the different methods that can be used to address a similar question like statistic connectivity, and NIH can potentially consider highlighting some of the best methods or consider consensus type of methods there, and that can encourage more people to use it.

**DR. BHAUMIK**: There are many different answers to the question that Dr. Gordon just pointed out. We are agreeing that variability or heterogeneity of the data should be addressed. We also agree that good modeling should be there, the study should be large enough, though there is no complete definition of that. But many of the central issues we all agree that should be addressed in any kind of analysis.

Now, whether we are up to that level depends on how good our prediction is or whether we can do what we want to do. Cross-validation of prediction, many times over, is important and many of us addressed that issue. Reproducibility is another thing.

So, piece by piece we basically try to address these. Now, what is the best method? Nobody knows. But these are essential ingredients, essential components of the study that we try to address. The power analysis, how helpful is it? Most probably it is not that helpful. And I completely agree with Tom. I have also evaluated many grants and the power analysis section is very weak, and that is basically because of the complexity of the problem. It is not an easy problem when we are talking about a neuroimaging study and doing the power analysis.
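
To illustrate the suggestion raised in this discussion that applicants show confidence intervals for the effect sizes behind a power calculation, here is a minimal Python sketch. It is not a method presented at the workshop. The pilot effect size, pilot sample size, and planned sample size are hypothetical numbers, the interval uses a common large-sample approximation to the standard error of Cohen's d, and power uses a normal approximation rather than the noncentral t distribution.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def cohen_d_interval(d, n1, n2, level=0.95):
    """Approximate large-sample confidence interval for Cohen's d."""
    se = sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


def two_sample_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-group comparison."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# Hypothetical pilot study: d = 0.8 observed with 10 participants per group.
pilot_d, pilot_n = 0.8, 10
# Hypothetical planned study size per group, chosen from the pilot estimate.
planned_n = 26

low, high = cohen_d_interval(pilot_d, pilot_n, pilot_n)
for label, d in [("lower", low), ("pilot", pilot_d), ("upper", high)]:
    power = two_sample_power(d, planned_n)
    print(f"{label:>5}: d = {d:+.2f}, power = {power:.2f}")
```

Reporting power across the whole interval, rather than only at the pilot point estimate, makes visible how much the planned sample size depends on an imprecise effect size.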

I think we have taken care of the complexity whether we agree or not, but we addressed those issues. Can we provide a single method that will solve all the problems? The answer probably is not yet, so that kind of discussion will still be going on. But it is good that at least we have a platform now where we are discussing and getting to know each other and addressing those problems. At least I am happy about that.

**DR. LISANBY**: Thank you. Bin, you had a comment?

**DR. YU**: I wonder whether people would like to explore more well-designed simulation studies. In my own research we find that very careful, data-driven, domain knowledge-driven simulation studies have been extremely helpful. I feel that is kind of under-utilized in the physical world. In chemistry and physics people have really serious simulation. It is not like something you pull out of your hat. You really have to design the simulation studies. And that is where a lot of discussion about power can be compared across different methods.

**DR. LISANBY**: Would anyone like to comment on the topic of simulation?

**DR. THOMPSON**: I think that is incredibly necessary for imaging and genetics methods development. If you are using simulations that don’t start from a basis in real data, like the realistic correlation structure and LD structure in GWAS or the realistic correlation structure across the brain in imaging, then your simulations don’t mean squat.

For those you probably need large sample sizes, too, to get realistic estimates of, say, the correlation structure of the brain, for example.

**DR. LISANBY**: We certainly do see projects that take a simulation approach, but, as Wes is pointing out, the devil is in the details about what assumptions you’re making and how well they fit what the actual data are.
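
As a rough illustration of the point about grounding simulations in a realistic correlation structure, the following Python sketch draws simulated Gaussian data from a given covariance matrix using a Cholesky factor. It is a sketch under stated assumptions, not a method described by the panelists: `reference_cov` is a hypothetical stand-in for a covariance matrix that would, in practice, be estimated from a large real dataset (for example with the `covariance` helper shown here), and all function names are illustrative.

```python
import random
from math import sqrt


def covariance(rows):
    """Sample covariance matrix of a list of equal-length observation rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def cholesky(a):
    """Lower-triangular factor such that factor times its transpose equals a.

    The matrix a must be symmetric and positive definite.
    """
    p = len(a)
    factor = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(factor[i][k] * factor[j][k] for k in range(j))
            if i == j:
                factor[i][j] = sqrt(a[i][i] - s)
            else:
                factor[i][j] = (a[i][j] - s) / factor[j][j]
    return factor


def simulate(factor, n, rng):
    """Draw n zero-mean Gaussian rows with covariance factor @ factor.T."""
    p = len(factor)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        rows.append([sum(factor[i][k] * z[k] for k in range(i + 1))
                     for i in range(p)])
    return rows


# Hypothetical stand-in for a covariance estimated from real data,
# e.g. three measurements with unit variance and moderate correlation.
reference_cov = [[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.5],
                 [0.3, 0.5, 1.0]]

rng = random.Random(0)  # fixed seed so the run is repeatable
factor = cholesky(reference_cov)
simulated = simulate(factor, 5000, rng)
simulated_cov = covariance(simulated)  # should be close to reference_cov
```

Comparing `simulated_cov` against the reference matrix is a simple check that the simulated data reproduce the dependence structure they were meant to mimic before any method is evaluated on them.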

I would like to provide an opportunity for some of our NIMH colleagues who are on the panel and have joined us, if they would like to make any comments or if they might have questions for the panelists.

**DR. FREED**: Thank you very much for inviting me. This has been a wonderful day today. Just a couple of things.

First, I just want to share with the panel that the Division of Services and Intervention Research actually does support a program of methods research that studies the development, testing, and refinement of designs, measures and analytic approaches inherent in conducting services research, so we certainly find this type of research very important in its own right.

And what speaks a lot to what Liz and her panel talked about is the translatability of findings to end users, and the idea of having an approach that others can understand and make use of. Sometimes that means using a simpler approach if the end users are going to adopt the findings and make changes in their practice according to those findings; they may choose a simpler approach because they can understand it.

What we have encouraged for many of our applicants, even in methods-focused studies, is really to try to involve end users in some way. That can come in the form of letters of support, it could come in the form of some type of advisory board or some type of dissemination approach, but perhaps also involving people as part of the study team who would make sense out of those findings and help to disseminate those findings to the practice community.

**DR. LISANBY**: Thank you. Any other closing remarks? We have about one more minute if any of the other panelists would like to make a comment.

**DR. HEINSSEN**: Thank you, Holly. This was just a really terrific day, very exciting presentations and a lot of food for thought. Building upon what Mike Freed just said, NIMH is always looking at the problem of implementing science to improve practice and how we can shorten that interval between scientific study and implementation into practice. You know, traditionally, the process has been experimental studies, randomized controlled trials, replication, trying to move that into the service system and observing the impact.

What was really exciting in the first session today was using different perspectives on establishing causal inference in non-experimental studies, which I think is really important for us to capitalize on the fact that learning healthcare systems are generating a lot of data but not in an experimental framework. And if we can have tools that give us the kind of rigor that we have become used to in experimental designs to be able to understand ways in which we might improve treatments in a nearer-term perspective, that would really be terrific. I just was so excited to hear in that first session the numerous examples of that.

And then in later sessions, although the content was different, the creativity and the rigor that was being proposed could possibly translate from neuroimaging back to services and then forward.

I just thought this was a tremendous day with a lot of food for thought, and I want to thank all of the presenters for those wonderful contributions.

**DR. LISANBY**: Thank you, Bob. We are at the end of our time but I don’t want to close us out before giving a chance for Pim Brouwers to make a comment.

**DR. BROUWERS**: Thank you. I am speaking on behalf of the Center for Global Mental Health in the Division of AIDS Research and so particularly the services part is really interesting for us.

But Mike was talking about expanding the prediction models to include the providers as well. For us it would also be really interesting if reviewers took into account how what you do on the basis of these predictions can potentially change policy. Because if we are going to the international sphere, in order to really affect mental health as well as some of the other integrated diseases that we are working with, really the big change that we can make is if we can change policy.

I think some of these prediction studies that you are doing and expanding on can really help us to talk with departments of health in the various countries to promote a better integration of mental health services into their general health systems. From that perspective, I was really impressed and glad to see the work that is going on. Thanks for giving me the last minute.

**DR. LISANBY**: Thank you. And on that note, I would like to give a big shout-out to Abera Wouhib and also to Michele Ferrante, who are our Co-Chairs. And everyone needs to come back for Day 2, which will be on Wednesday, on data visualization tools.

I would like to thank all of our panelists and our session chairs, the speakers, the discussants and, of course, all of you out there, the attendees, who have made today very stimulating. I learned a lot and I hope you all did as well, so thank you.