
NIMH Research Domain Criteria Roundtable - Data-Driven Refinement of Psychopathology: Toward Precision Diagnostics

Transcript

BRUCE CUTHBERT: Welcome, everyone.  I am Bruce Cuthbert of the Research Domain Criteria Unit at NIMH, and I am pleased to start off this roundtable webinar.

Before we begin, some acknowledgements are in order.  First, I want to thank my colleagues in the NIMH RDoC unit: Sarah Morris, Jenni Pacheco, Rebecca Berman, and Syed Rizvi, for all their efforts to make this meeting happen in a short period of time. 

We also thank TaRaena Yates and her staff at the Bizzell group, who have been so efficient in organizing every aspect of the meeting.

Next, we particularly want to express deep thanks to our panelists and moderators who have graciously volunteered to share their time and expertise today to lead the discussions.

Finally, thanks to all of you in attendance, and we look forward to your questions and comments.

Just a very brief introduction about the background for our meeting today. As most of you know, the idea for the RDoC project arose over 10 years ago from the increasing realization that our current disorder categories are heterogeneous syndromes that lack validity. RDoC was thus developed as a novel approach to encourage multisystem research on basic functional dimensions such as cognitive control or reward valuation, and on how dysregulation in these dimensions could result in psychopathology.

As our former director Steve Hyman noted in a recent paper, modern studies are consistent with psychiatric disorders as heterogeneous quantitative deviations from health. Fast forward over a decade, and now very large databases can be created with cohorts of patients containing multiple measurement classes.

A key point is that these new data infrastructures are built to enable computational models. We now have the possibility of implementing machine learning algorithms for data-driven precision diagnostics that could identify transdiagnostic mechanisms and predict optimal treatments. This is, to be sure, a daunting and long-term enterprise, and today our sessions will focus on two significant initial challenges.

Session 1 will address ways to optimize behavioral tasks, both because they represent direct measures of behavioral functioning and also because the data are relatively inexpensive to obtain and analyze.

Session 2 will concern the many issues involved with building appropriate data platforms that can accommodate all our future tools and analytic strategies.

Session 3 will cover topics from both of the first two sessions, considering some existing datasets and tools already under way.

To provide some context for these complex topics, we start with two experts who will provide overviews to introduce some of the broad themes that we need to keep in mind as we move along.

So with this brief orientation, it's my pleasure to introduce Jenni Pacheco, who will serve as our emcee for today's roundtable. Jenni, take it away.

JENNI PACHECO: Thank you, Bruce.  As you just very nicely went through, for our agenda today, we have three main sessions. Each session will start with a few brief talks by our distinguished speakers, who will be followed by what we hope will be a lively and informative discussion. We are encouraging all speakers to participate in all of the discussions, and we encourage questions from the virtual audience to be entered into the Q&A function.

The RDoC staff will be monitoring those questions, and we will interject them into the discussion as we see pertinent themes arise. We apologize that we likely cannot answer each specifically, but we would be happy for you to follow up with us after the roundtable today, if you would like more information.

Before we get to the three main sessions, we have two wonderful talks to start us off.  While we have not built in time for dedicated discussion around these talks, there may be just a few minutes at the end for some quick questions. However, our hope is that these main ideas and themes will find their way into all of the later discussions.

So to start us off, we have Dr. Abhishek Pratap, who is now at Biogen Digital Health. Dr. Pratap will share his thoughts about identifying and mitigating bias in precision approaches to mental health. Thank you, Abhi.

ABHISHEK PRATAP: Thank you to the NIMH RDoC team for this kind invitation. I am super excited. This topic doesn't get talked about in a lot of places, so I'm thankful that you are thinking about it; it is close to my heart. And thank you to all the attendees. It's lovely to see so many of you here.

I'm Abhi. I currently lead Data Innovation at Biogen Digital Health. Today I'm going to be talking about what it means to collect data in real-world settings for mental health.

Here are some quick disclosures. I am a Biogen employee, but the data presented today has no connection to Biogen and all opinions are my own; my other disclosures, collaborations, et cetera, are listed here.

So with that out of the way, I'm going to get to the main point right away. We have a problem that is well acknowledged: in trials and studies, there is a lack of representation, and digital approaches are aimed at bridging that divide. I want to probe that assumption, look at the current state of things, and ask how we can do better at understanding someone's cognitive functioning in a representative and equitable manner. I'm going to open that box for the next 15 to 20 minutes or so.

I also want to leave you with a take-home message right away, which is underrepresentation of the right population in real-world data can actually impact the evidence and make it inapplicable to those who were not included in the studies. It's a mouthful, but it's a key concept of the discussion and presentation I want to have today.

Here are three case studies that I'm going to be talking about. It will be rather fast, but hopefully you get a sense of where I'm coming from and what are the key problems, as well as how do we fix them.

I also want to make sure everyone is around the same table, so I'm going to quickly set the stage: this is a hard problem. I'm going to go back 100 years, and I would highly recommend the book by Emil Kraepelin: we really have no objective yardstick to assess someone's emotional feelings and behavior in real-world settings, and that's true even 100 years later. Our assessments, our tools, are directed more toward what someone is doing or saying, probably not what they are actually thinking, their internal phenotype.

And there are multiple reasons for it. I'm going to keep it very abstract and high level: how people live their lives and how different factors could impact their health and outcomes, ranging from individual behavior, social circumstances, genetics obviously, environment, and medical care. Something to keep in mind is that more than 50 percent of this comes from day-to-day interactions, behavior, social life, et cetera. That's a key message, and that's where technology can help, right? So the current state is that we get snapshots of people's lives in the clinic, which is amazing in terms of what we can learn from the various assays and probes, for example MRI, but we also have to realize that the context under which they are captured has certain limitations; it's a very small slice.

The goal where we are sort of heading overall in the field is how can we mix these together so we can assess in the moment and complement it with clinical assessments. So it's not replacement. It's augmentation.

And that's where the clinical applications of various technologies, again a very abstract slide, come in. We are talking about devices, portals, wearables, voice, AR/VR, remote monitoring, and telemedicine, which have boomed especially in the last few years with the pandemic, along with remote enrollment, et cetera. And this is nonexhaustive.

However, what it all means is that we need to learn about these new data modalities to understand various behavioral states and to infer certain clinical outcomes. That's broadly where the focus is: learning, in a precise, individualized manner, how behavior could be impacting outcomes, predicting them early, or stratifying people based on risk factors or future prognosis.

What is important is to think about where we meet people, right? This, to me, is what getting closer to a patient or a participant means. It's their life. They might be sick. They might be shopping. They might be on vacation. We are trying to say: here is the digital protocol to collect data in your day-to-day life, with all the rationale that we just talked about.

So this is where I segue into the core of the presentation. The hope is to have data that are well represented across various sociodemographic and technical variables, such as age, gender, race, et cetera. There is growing evidence now, though a lot of research still needs to be done, that it's not like this; it's more like this. I'm going to go back and forth; hopefully you can see the difference now.

It also could be nonuniform. That's where some of the challenges start, and they need to be transparently understood so we can do something about them. It's almost like if it's not in your EHR or medical record, you never had that disease. If the problem is not well understood, we'll never be able to fix it.

So this is where I give you a warm start. What does it mean? Building and deploying digital tech is not enough. This is my own hand; I have used this picture many times, and some of you might have seen it. It's probably dated, but that's where the problem started for me, and it's not just these two particular devices. It's a broader problem: this should be unacceptable, or at least flagged, because we don't have enough confidence in the data being produced.

That's one. The second is the plot you will see on the left side: who is staying and giving us data, for how long, and is this uniform or nonuniform? If it's nonuniform, what does it mean? Who are we losing, and is that leading to more challenges in the future related to the digital divide?

So I'm going to give you a very high-level view of this paper. Bias alert: this is my own past work. Here is a QR code, which will also be on the next screen. I'm not going to get into the gory details, but here's the take-home message. This is by all means nonexhaustive and nonrepresentative in terms of the many studies that have been done since then, but to give you a broad sense, the takeaway is that we were heavily biased, back in the early days of digital health and probably even today, in where our participants come from, which is heavily toward the two coasts and probably not the center of the country. My takeaway from this graph is that if you are an obesity researcher, or you worry about cardiovascular disease in general, you should be worried about this map, because you probably are not capturing people from the areas where the prevalence is higher. The bottom row shows the race and ethnicity differences relative to the 2010 Census, which should be 2020 now, and the median income; again, not causation, just correlations.

And then, more importantly, the state mortality rank with respect to cardiovascular or chronic diseases versus where the proportion of recruitment is happening, and that's what I was trying to say. If you are a cardiovascular or obesity researcher, you are likely to underrecruit from the southern part of the country, which is known to have a higher prevalence of these conditions. So there is a very high chance of being biased.

So that's my sort of level setting. This is just a warmup, and now I'm going to go into some specific details.  Some of it is published; some of it is not. So take it with a grain of salt, but hopefully it's provocative enough to make you think what's going on.

Let's talk about Case 1: real-world data collection could be impacted by participants' condition severity. The N is 600; it's a large study out of the EU in three countries, with smartphone data modalities as well as wearables. So let's see what's going on. The first three rows are basically showing the different kinds of data collected: phone active, which is survey data; phone passive, which is monitoring how much you're walking and your accelerometer or gyroscope, that sort of stuff; and Fitbit, which is the Fitbit wearable data.

We are not looking at outcomes analysis today. We are looking at what the patterns in the data might be and what they mean for outcomes. So there are three clusters; each dot here, green, orange, or pink, shows the density of the data collected on a given day, and it's a long window, about 10 months.

The take-home message is that if you look at the C3 cluster, yes, the clusters differ in different proportions, but across the board, so three orthogonal validations in that sense, you see a significant difference in baseline onboarding severity of the disease, such as PHQ-8. We can question why it's not PHQ-9, but that's for a separate time.

There is a 4-point difference. That is huge. There are also behavioral differences and sociodemographic differences that were known in the past, but now we have evidence that people with more clinically severe symptoms might be less likely to give or share data, which could actually bias the studies in a big way, whether you take a regression approach or a prediction approach. At the end of the day, you're trying to see how severity is associated with certain outcomes or certain features.

So that's case 1. Let's talk about case 2, wearables. This is work in progress. It's a very large study with 10,000 healthy kids, representatively sampled throughout the United States; it's one of the NIMH- and NIH-funded studies. I'm not sure exactly who the funder is, but one of those.

There are ePROs, so that's surveys, and various wearables. I'm going to cut to the chase and summarize the key message here. Y-axis is participants, x-axis is days in the study.  It's a very short observation period. The point is wearable devices may not be worn equitably by a diverse population.

What do I mean by that? First, it's really nice to see the C1 cluster. It's all red; that means people participated pretty much uniformly. C2 is also good, right? You have pretty decent participation up to week 2.5. C3 and C4 are where the problems are. However, it's nonuniform: the proportion of minorities in C3 and C4 is significantly higher. Let me show you that.

These are called survival plots, borrowed from oncology work, but it's the same idea. You're looking at how long people are retained, in this case in the study, and you can see that across the board, in a very short observation period of just three weeks, the effect size for Black participants, for example, or for people with lower socioeconomic status or education, is significantly different, and if this were to continue for a longer period, the slope and the delta would increase from there. That's an assumption, right? And that should be a problem.
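To make that kind of retention analysis concrete, here is a minimal sketch in Python of a Kaplan-Meier survival comparison across subgroups. It is my own illustration rather than the study's code, and the file and column names (days_in_study, dropped_out, group) are hypothetical.

```python
# A minimal retention ("survival") analysis sketch, written for illustration.
# Assumes a table with hypothetical columns:
#   days_in_study - last day the participant contributed wearable data
#   dropped_out   - 1 if the participant stopped contributing, 0 if censored
#   group         - subgroup label (e.g., race/ethnicity or SES category)
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

df = pd.read_csv("wearable_retention.csv")  # hypothetical file

kmf = KaplanMeierFitter()
for name, sub in df.groupby("group"):
    kmf.fit(sub["days_in_study"], event_observed=sub["dropped_out"], label=str(name))
    kmf.plot_survival_function()  # proportion still contributing data over time

# Formal two-group comparison of retention curves (log-rank test)
g1 = df[df["group"] == "group_A"]
g2 = df[df["group"] == "group_B"]
result = logrank_test(
    g1["days_in_study"], g2["days_in_study"],
    event_observed_A=g1["dropped_out"], event_observed_B=g2["dropped_out"],
)
print(result.p_value)
```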

Here are some more details. If you look at race and ethnicity, with wear time in minutes on the axis, we don't need statistics to make the point: you can see white kids are wearing the devices for longer. Why? We need to understand that better. There could be multiple reasons, and our work is getting into what they are. There are site-specific differences that are also confounded by region. So I'm getting a little more complex there, but all these things matter at the end of the day when we are trying to use these digital modalities for health outcome prediction.

There is also temporal variation. So much for remote monitoring: if you look at the last figure here, the purple highlight shows wearable wear time during the pandemic, which was significantly lower. There could be many reasons for it, but at the end of the day, if you are collecting wearable data, this needs to be contextualized.

All right, I might be going a little fast, but the upside is that you will hopefully have some time for discussion. Let's do case 3 now.

So this is about understanding sensor data to look at health equitably and consistently. We have taken data from a DARPA-funded study with 3,000 to 10,000 people, depending on the particular question of interest, and here we are asking a very simple question. There is a whole lot of work in the field on passive data collection from bring-your-own devices; people bring their own devices, mostly Android and iOS, to the study. We wanted to see, if we are capturing similar sensor modalities, how much of the difference in data is driven by the difference in ecosystem, that is, whether it's coming from iOS or Android. A very simple question.

Before we say someone's gait or bradykinesia or their cognitive capacity is linked to features X, Y, and Z, is it driven partly by these technical differences?

So there is a pipeline that my group developed in the past in which we looked at completeness, correctness, and consistency. Today I'm not going to get into the specific details, but these are all standard signal processing workflows that go into looking at the underlying data quality, consistency, et cetera. So apologies, I'm not going to describe all these features today. But I'm going to show you some key take-home messages.

So here there are three columns.  First is accel, followed by gyro, and GPS, three pretty common standard sensors available in most devices these days. We are comparing them between Android and iOS, and I think my legend is missing.  One of them is Android, the other one is iOS.

And you can see, without running any gory hardcore statistics, that there is a clear shift, based on missing data ratio, based on anomalous points, or based on signal-to-noise ratio. And, to drive the point home, the predictive power of using just these data to tell whether the device is Android or iOS is very, very high, comparatively speaking.

So some of the signal that we might be learning might be confounded by the underlying data generation ecosystem, in this case Android and iOS, and we need to worry about it, not for all areas, but for some areas which are dependent on some of these sensors.
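As a concrete illustration of that last point (my own sketch, not the study's pipeline), one could ask how well the data-quality metrics alone classify the device ecosystem; if they do so with high accuracy, the ecosystem is a plausible confound for anything learned from those sensors. File and column names below are hypothetical.

```python
# Sketch: can per-recording data-quality features alone identify the ecosystem?
# Hypothetical columns: missing_data_ratio, anomalous_point_rate, snr_db, os.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("sensor_quality_features.csv")  # hypothetical file
X = df[["missing_data_ratio", "anomalous_point_rate", "snr_db"]]
y = (df["os"] == "android").astype(int)

# A high AUC here means quality metrics separate Android from iOS, so any
# downstream behavioral "signal" from these sensors may partly reflect the
# device ecosystem rather than the person.
auc = cross_val_score(GradientBoostingClassifier(), X, y, cv=5, scoring="roc_auc")
print(f"Device-ecosystem AUC from quality features alone: {auc.mean():.2f}")
```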

So I've talked about some of the challenges. Let me spend a couple of minutes on the way out of this and how we can address it in a transparent manner. Our home-to-clinic monitoring idea is about understanding what's happening in the clinic and contextualizing it with how that might relate to people's living environments. There are some common technological denominators that could be helpful, and I think some of the speakers later today will touch on amazing work done in that space in terms of using technology for understanding mental health.

Broadly speaking, what we are going after are the features on the y-axis. There could be multiple; these are nonexhaustive: smartphones, wearables, online data, EHR, et cetera. The goal is to look at what kinds of signals there might be and to generate evidence that matters to people and goes beyond what is captured in subjective surveys.

But it's not about deploying a whole lot of measures; it's about deploying the right set of measures. What do I mean by that? We have to think about what we are trying to capture, how we are trying to capture it, what the clinical outcome assessment is, and how that fits into all these data modalities. It should not be a kitchen sink, in my opinion. Then we can see which tech is fit for purpose; there are a lot of bullet points there. The main things are how it fits into the ecosystem, what the participant burden is, concerns about sharing this data, how much data we need and for how long, what the data governance and privacy arrangements are, et cetera.

Ultimately, we need all of this figured out to bring it together in the environment so we can do standardized assessments, measurements, and analysis to ultimately generate an integrated experience for clinicians and patients and their families. We are lacking some of this because of the silos that are created.  It's incredibly hard to find some of these datasets, even to this day.

Which means we need to think about where the data is created, how it is curated, and what the plan for consumption is. A checkmark button releasing the data online is not sufficient, in my opinion. We need to make it FAIR, findable, accessible, interoperable, and reusable, in spirit, not as a checkmark.

And ultimately, we also need to get from here to here. What I mean is, this photo was actually taken at an Airbnb I stayed at recently; I did not use the first two remotes, and that's what our patients and participants are doing too.

We create complex technologies that are hard to navigate and complex apps where you can't really understand what's going on. That's not a blanket statement about all of it, but there is a lot of that happening, and we need to fine-tune and understand what is needed by whom, so the tools are easier to use and consume.

Which requires a lot of co-creation. A lot of work has been done by folks in this area, and more is coming up, which is really good; it's qualitative, mixed-methods work.

We need to think about people's concerns and perceptions, sometimes perceptions more importantly, about why someone is collecting this data. There is a spectrum of what is more and less sensitive, and we need to think about what is being collected, why, what the burden is, and who is asking for it. It's different depending on who the sponsor is, because perceptions actually change, and there is research showing that. The social affordance of it also matters: you can't ask people to complete voice diaries when they are going about their day. It's not going to work.

Who shares the data, and for how long, is what we talked about. After all the work you have done, you still need to worry about whether you collected the right data from the right target population. It doesn't have to match the U.S. Census if you are dealing with a certain disease, but it has to match your target population.

And then be proactive in addressing some of these concerns just in time, rather than post hoc, because by then the damage has somewhat been done.

So, in all, there are a bunch of issues that are solvable, whether you look at them from a global south or a global west lens. This is a high-level picture, there are multiple things going on, and I think we need all hands on deck to understand how to quantify and address this. It's not just qualitative work: we can quantify these biases, and then we can do qualitative work to address those biases in the field.

For all of the work I talked about, I'm mainly a messenger. This is an amazing group I had the pleasure of working with until a few months ago, and most of the work I talked about was done by these folks.

Thank you so much.

JENNI PACHECO: Thank you so much.  Unfortunately, we don't have time for questions.  I did see at least one come into the Q&A. So we can try to answer that online, and hopefully if any of our speakers have questions, they'll feel free to bring these ideas up in our future discussions.  I think they're important and kind of come into all of our discussions later.

For our next presentation, we're delighted to have Dr. Brenden Tervo-Clemmens from Massachusetts General Hospital, who will offer us a clinical perspective on big data and computational psychiatry. Thank you, Brenden.

BRENDEN TERVO-CLEMMENS: Thank you so much for that introduction; it is truly a pleasure to be here and an honor to give this talk. To give you a little bit of background on who I am, I am a postdoctoral fellow at Massachusetts General Hospital and Harvard Medical School (MGH/HMS), and I primarily spend most of my research time in the psychiatry department, in the Center for Addiction Medicine, in collaboration with the Martinos Center.

But I'm also an active clinician, where I work as a psychotherapist for adolescents with mental health and substance use disorders at one of our clinics, and my background and research interests kind of spans some integrative areas that I hope will be relevant to set up some of these talks, including clinical psychology, cognitive neuroscience, and quantitative and computational methods. 

As one of these overview talks today, I have the distinct pleasure of previewing and providing a clinical context of some of the coming sections in this afternoon's series. But before I jump into that, just to give you a little bit of background of myself and scientifically where I come from and how I view the field, chief among my interests is understanding the neurobehavioral processes of the adolescent period and how these may be related to risk factors and the etiology of substance use disorders and related mental health outcomes.

And as a clinician scientist, I'm particularly interested not just in understanding these phenomena in terms of research, but also in developing markers that we can intervene upon. So a challenge, then, in holding these two worlds together, doing more basic science, neuroimaging, and largescale computational analysis of behavior alongside my work as a clinician scientist, has been that translating the neurobehavioral markers from my research, and observations from the relatively new field of noninvasive neuroimaging (for example, fMRI and MRI), into substantive clinical intervention with adolescents has been very difficult.

We could probably have several days' worth of talks on why this is and on all the complex, interacting factors that go into translation. However, one preview of the planned discussion from this roundtable, and one that I'm particularly interested in, seeks to contextualize challenges in clinical translation within a broader framework of challenges to the reproducibility and robustness of psychiatry research.

These were already touched upon a bit in the talk before. So today I'm going to give you a clinical perspective and ways in which I believe big data and computational approaches may really move the needle on translation in this neurobehavioral research that I'm particularly interested in, and as someone more junior and standing on the shoulders of giants in my field, who you'll hear from today, this is also somewhat of a roadmap of the type of science that I'm working towards.

So I'm going to briefly speak to three bigger themes that my clinical colleagues and I who see patients often discuss in terms of potential challenges and opportunities for clinical translation. To provide some structure, I'll talk about reproducibility, incremental validity, and the consideration of research designs that have real treatment relevance.

So just to provide a bit of a potential pathway towards translation and heuristic, we might ask does a given clinical neurobehavioral marker reproduce.  If so, does it tell us something new about our patients and participants that we didn't already know, and if both of those are true, how does it set us up to intervene?

The first idea here is the premise that reproducibility in research is absolutely a necessary precursor to any lasting consideration of clinical translation. For example, if we look at something like a broad heuristic adapted from biomarker staging, multiple levels of validation, and in this case I'll argue reproducibility, are required prior to any consideration of real clinical application.

The way I think about this in my own work, then, is that a necessary first step toward potential clinical utility is research utility, validity, and reproducibility. This might seem somewhat obvious to us; as a colleague of mine once quipped, reproducibility should be the air that we breathe as scientists. However, it's worth noting that recent and emerging concerns across medicine, psychology, and neuroscience should give us pause and make us consider how we might effectively translate our science toward clinical care.

Many in the field, and again you'll hear from some of the experts in this area today, have sought to use increasingly large datasets or databases, what I'll refer to here as big data, and complementary computational approaches to first evaluate and then hopefully improve reproducibility. To give you a sense of how I view this emerging literature as a clinician scientist, I'll highlight some of my own work: a paper I co-first-authored with Scott Marek, now out at Nature and supported by a massive multi-institutional team of investigators, which took on this question in our own area of interest: how do brain metrics potentially tell us information that's clinically relevant to individual participants and, ultimately by extension, individual patients?

So this is looking at brain-phenotype associations, or what we have taken to calling brain-wide association studies, which examine cross-sectional correlations between interindividual differences in brain structure and function and out-of-scanner psychological phenotypes or quantitative traits. Again, these are clinically relevant because these cognitive and mental health phenotypes are often targets for our interventions.

To give you a sense of scale, we used close to 50,000 neuroimaging participants from three independent datasets. What we found, which speaks to reproducibility in this area and converges with a number of reports that have come out even in the past year, is that there are very small effect sizes on average linking interindividual, person-specific differences in brain structure and function to these stable psychological phenotypes or quantitative traits. This really highlights the necessity of what we've already heard about today, large sample sizes and aggregation of data, sometimes clearly an order of magnitude larger than we have had before, and also highlights the need for refined metrics for these types of cross-sectional brain-wide association studies, or what I'm defining here as brain-phenotype linkage.

Again, this makes clear why, as someone whose career has sort of followed the emergence of big data and neuroscience, why we might struggle to translate this type of research towards the clinic. Of course, this project is for brain imaging and one very specific type of brain imaging, cross-sectional associations with phenotypes. 

But more broadly, I want to point out and kind of highlight some of the excellent work done by the members of the panel, that this work occurs in a broader context of largescale measurement studies that have become increasingly common with the availability of largescale data, both by way of consortia data funded by the NIH and the NIMH, among others. So for example, I showed you work that used data from the Adolescent Brain Cognitive Development study, or ABCD. We can also think of existing largescale health data sources, such as the electronic medical records that we just heard about, and the new and refined computational tools that you'll hear about that set up this type of work.

Together, I would argue, this type of measurement work is really necessary to move the needle on clinical translation, at least in my own area that I can speak to in depth in adolescent neurodevelopment.

So for example, studies examining test-retest reliability of neurobehavioral metrics or how various analytic approaches lead us to have different inferences, such as multiverse analysis or the related concept of specification curve analysis, as well as generalizability across contexts, groups, and outcomes that we've heard about already today, and then of course the gold standard of mega- and meta-analyses that aggregate results across the literature and provide a conclusive answer, hopefully, regarding clinically relevant neurobehavioral markers and their reproducibility.

Continuing down the potential translational pipeline, the second theme I want to discuss and what other clinicians and I often look for, is incremental validity of a neurobehavioral marker.  That is, if we can find a reproducible neurobehavioral marker, what information, additional clinically relevant information, would this given neurobehavioral marker tell us about one of our patients, or perhaps to make it more direct to me, tell me about one of my patients?  Ideally this is something about one of my patients that I don't immediately already have access to in terms of clinically available information.

To unpack this a bit, it might be useful to provide context for what I'm going to define here, as a heuristic, as a distinction between clinical post-diction and clinical prediction. Often what we have had historically in neurobehavioral research, for example the brain-wide association studies I told you about, is associations between neurobehavioral metrics and concurrently assessed diagnoses or clinical symptom manifestations. This is what I'm going to refer to here as clinical post-diction, because these neurobehavioral measurements are being correlated with diagnoses or symptoms at the same timepoint.

However, often what we really want in the clinic is a reliable prediction of future diagnoses or clinical symptoms, so we can provide early prevention and intervention. To make this a bit more clear, if we look at this fairly predominant model of what I'm referring to as clinical post-diction, we can see that the concurrent, same-timepoint association with a neurobehavioral marker is often, or at least historically, benchmarked against an existing diagnosis or symptom scale, that is, one at the same timepoint, or one that a patient already has.

Therefore, as a clinician, it's not always readily clear, given that we likely already know such a diagnosis or symptom presentation, how our clinical team would change the treatment strategy with respect to this concurrent neurobehavioral marker.

So in contrast, and highlighting some of the original ideas of RDoC, as my career has followed that trajectory, we as clinicians would benefit enormously, although this remains somewhat elusive in my own area of research, from neurobehavioral markers that provide incremental validity over other known psychiatric risk factors and that together may help us refine diagnoses and symptom presentations in ways that actually, fundamentally change how we provide intervention and treatment.

Taking this a step further, particularly given my own clinical interests, often what we want in the clinic is not what I've been calling post-diction at all, but prediction of future diagnoses or future changes in clinical symptom manifestation that can ultimately lead to early prevention and intervention. Given this wonderful opportunity, I'll again highlight some of the work that has shaped my own career, using adolescent substance use as an example. We know that childhood impulsivity, both as a clinical manifestation (ADHD) and as a dimensional assessment in the broad putative externalizing spectrum, is a robust risk factor for subsequent adolescent substance use. The clinical advantage of this being a prospective prediction, and not post-diction, is that it clarifies why both psychotherapeutic and pharmacological interventions related to impulsivity can mitigate and ultimately reduce the risk of subsequent substance use in adolescents.

Of course, there are many, many more examples where we lack such clarity, and we would really benefit from all of you on this call in identifying predictive neurobehavioral markers, now that we're gaining greater access to big data and these refined computational approaches, so they can ultimately direct our treatment.

And to point out something that I'm particularly hoping for, and am using this time on the call to solicit from the community: what we would ultimately love in a translational context is to leverage neurobehavioral research to directly inform treatment, for example by using a given neurobehavioral marker to reproducibly stratify patients into treatments and know who might be a better fit for which existing evidence-based treatment.

So with that in mind, I'm going to go through this last section with some speed, given it was so well introduced in the last talk, but if we have a reproducible and incrementally-valid neurobehavioral marker, how might we align research then towards potential intervention and treatment? 

I want to point out an obvious but important clinical observation of mine, one that I assume is shared by almost everyone who sees patients in psychiatry: my patients report mental health and substance use symptoms for so many different reasons that it would be hard to summarize them well. But if I were forced to, I would say that, universally, they report symptoms related to the complexity of their lives and the interaction between their life history, their moment-to-moment and day-to-day stressors, and their current circumstances.

That is to say that the symptom presentation and the challenges that they face are ones that unfold over time as a process and are interacting with their environment and are not a static level of risk.  So this is, to be truthful to my own research that I have done and published, is that rarely do I have a patient who understands, for example, their substance use purely through the lens of a trait-level hyperactive reward system or high levels of age-adjusted impulsivity.  Obviously, these symptoms are far more complex than that.

So this brings to mind an important distinction that I know we'll hear about today: between-person research focused on interindividual variation, what we might think of as a nomothetic approach to psychiatry, versus within-person research focused on intraindividual variation, which potentially leads to an idiographic approach to psychiatry.

I want to highlight, and this was well spoken to in the last talk, why we as clinicians are so excited about this latter type of intraindividual or within-person research. The simplest way I can think of to explain it is that this type of data, which follows the same patient over time and builds within-person, ultimately personalized, models, aligns with how we already meet patients and how we already deliver care, in terms of both timescale and inference.

So for example, this idea of within-person approaches can facilitate movement away from static in-lab estimates to real-world longitudinal predictions, which has been very exciting in my own area and excellent work from groups I really admire has used, for example, smartphone assessments, which I know we'll hear about today, to link real-world in the moment or daily fluctuations and changes in mood to changes in substance use.

Part of the precise reason this is so exciting for us as clinicians is that this moment-to-moment and day-to-day variability is how patients understand their symptoms, and it is how we discuss those symptoms and work to intervene with patients in the clinic.

So again, just to make this more explicit, another way of saying that is that this personalized, longitudinal tracking of participants or patients over time, through, for example, smartphone assessment or the broader electronic medical records we've heard about, aligns the timescale and practices of research with those of the clinic. Perhaps the ultimate, real hope we have, and important work has already started on this, again highlighting work from my own research area, is that it will allow us not only to align research with the clinic but ultimately to bring the clinic to the patients. For example, this is an excellent review looking at potential in-the-moment, adaptive interventions based on mobile tracking in substance use.

So with that, I want to end by zooming out and again acknowledging that I'm standing on the shoulders of giants, as I am doing here, but I would humbly suggest that, across all of the themes I discussed today, clarifying the potential translational end goal of neurobehavioral research, and of translational neuroscience, early and often might help us bring this research closer to the clinic.

So what do I mean by that? This is a conceptual point inspired by an excellent commentary by Caterina Gratton, Steve Nelson, and Evan Gordon that came out in Neuron this year, focusing on neuroimaging and brain-behavior linkage in human neuroscience. More broadly, we might think of translational neuroscience in these types of studies as approaching an important divergence between two different types of studies that, I would argue, mirror two different translational end goals in psychiatry. On the left we have the use of very large samples, or big data, whose translational end goal is likely diagnostic or prognostic screening: as we build increasingly reproducible and generalizable models through largescale data and advanced computational methods, we will be better able to refine diagnoses and symptom manifestations of mental health across the population, hopefully make reliable predictions about future risk, and potentially stratify patients toward existing evidence-based treatments.

In many ways, I would suggest this mirrors observational population studies in epidemiology and genomics. But of course, I want to make the point that there's a fully complementary path whose translational end goal is mechanism identification and treatment development, one that uses smaller, focused studies, perhaps with longitudinal data and/or interventions, and introduces the intersection of randomized clinical trials with these types of metrics to examine the basic science of symptom processes over time, perhaps even within a given individual, and to help us develop targets for new and even personalized treatments.

So as an early career scientist, clinician scientist, given this fantastic opportunity to speak with you all today, I would say this is exactly why this is the most exciting time to be doing this type of research and thinking about clinical implications, as both of these paths forward towards translation can be pursued in a complementary fashion, and, as we've already heard are gaining momentum and broad support with new data and new methods and new platforms to analyze and translate these data.

With that, hopefully I've laid the groundwork here for some of the other sessions where there is discussion planned.  So there won't be questions here, but I'd be happy to follow up via email with any questions, comments, or ideas.

Thank you so much.

JENNI PACHECO: Fantastic. Thank you, Brenden. We did have a couple of questions come in through the Q&A box, so we'll try to get to those before the end of the day.  But now we'll turn things over and try to get started with Session 1, where we will focus on behavioral task optimization and consider what the value added by behavioral tasks is in clinical decision-making, what we need to do to move towards individual prediction, and how we should think about the relationship between task performance and neurobehavioral mechanism.

To start us off, we have a great lineup of experts. Dr. Craig Hedge from Aston University, Dr. Zoe Hawks and Laura Thi Germine from Harvard Medical School, and Dr. Zeynep Enkavi from California Institute of Technology.

In addition, we're lucky to be joined by Dr. Russ Poldrack from Stanford University who will help moderate the discussion after these talks.  Thank you to all five of you, and take it away, Craig.

CRAIG HEDGE: Thank you, Jenni, for the introduction.  Thank you to the team for inviting me to speak.  So an underlying theme of what I'm going to talk about today is whether we can teach an old dog some new tricks.  So this is my old dog.

I'm going to talk about why some of the tasks we've perhaps tried to use for measuring individual differences might not be as good as we would like them to be. That's one of the key points I want to make today. It's not all doom and gloom, in the sense that there are some things we can do to improve the way we use these tasks, increasing trial numbers being one of them, but there may be limits to the extent to which that can help us.

And also a comment that we don't want to lose sight of validity when we're talking about optimizing our tasks.  I think it's easy to chase after the things that we can measure quite well and perhaps lose sight of what we were trying to measure with them in the first place.

There is an old paper which I really like by Lee Cronbach, where he talks about the two disciplines of scientific psychology. What he suggested was that psychology had split into two: there was one branch of psychologists who studied individual differences, and they tended to study things like intelligence and personality and went one way, and then there were experimental psychologists who tended to study things like attention and perception. Cronbach referred to there being a desert in between these two branches of psychology.

I would like to think that nowadays we are trying to cross this desert, at least from growing up in the experimental psychology background as I did, that we are looking at individual differences more in some of these processes that traditionally have been in the realm of experimental psychology. But I think there are still some consequences to the way that we have gone about measuring these things.

For example, if you were to ask someone who traditionally does individual differences research and uses neuropsychological batteries a lot of the time what makes a good task, they might refer to things like reliability, for example test-retest reliability, and predictive validity. An IQ test needs to predict things in the real world, otherwise it's not really doing what we want it to do.

You might have people who use cognitive assessments as screening tools, say, for example, the Montreal Cognitive Assessment. We don't really expect much reliability or variability in a healthy population. We expect that healthy adults should all be at ceiling on this kind of assessment.

But what we want it to do is to be sensitive to when cognition starts to decline, such as in the case of dementia. So we're not necessarily expecting the same things from it as we are from the intelligence test.

If you were to ask an experimental psychologist what makes a good task, then they might say that it's a task that consistently shows the expected effect. So I hope a lot of people will be familiar with the classic Stroop task in psychology where people are asked to name the color of the written word and ignore what the written word actually says.

What we have shown very robustly over decades is that people are slower and they make more errors when the stimuli conflict, so when you see the word blue written in a red or yellow font. This is a highly robust task. We can use it in undergraduate demonstrations and demonstrations to the public, and it pretty much always shows the effect, which makes it a very good experimental task. But I think one of the consequences is that it doesn't necessarily make a good individual differences task, for reasons I'll get into.

So a few years ago, we published a paper titled the reliability paradox, in which we were trying to explain what could be seen as a paradox: these tasks, which work very well in experimental contexts, seem to be letting us down somewhat when we try to translate them into individual differences research. We focused a lot on inhibition tasks, response inhibition or cognitive control tasks. There were some other ones in there, but this was the main focus of our work.

Why should we care about the test-retest reliability of inhibition? So there are some theories of inhibition which suggest that it should be this dynamic thing which is context dependent, and we should be flexible and in some situations, we might need to enact more control than others.

But I think if we look across the literature at some of the areas, health outcomes and developmental outcomes, in which people have tried to apply inhibition tasks, then it doesn't really make sense unless we assume that there is some kind of stability to people's ability to inhibit responses. Otherwise we wouldn't try to look at it in the context of these applications. So that's why we focused on it, and at the time we were doing this work, there wasn't a lot of test-retest reliability information out there on some of these tasks.

So rather than take you through the nitty-gritty of each individual task, I'm just going to give you the overview, which is that a lot of these task measures are not as reliable as we would like them to be. To try to quantify that a little bit: if you are looking at correlations between task performance and, say, a measure of neuropsychological symptoms, then the reliability of your measure will attenuate what you can observe. If you look at this red diagonal line, where you assume that the true underlying correlation between task performance and your neuropsychological symptoms is what we would traditionally call strong, a correlation of .5, then if our reliability is around .5, that will roughly cut the correlation we can observe in half.

So the consequence is that with a lot of these tasks, we will always be chasing after low correlations between task performance and the things we're interested in in our research, which isn't really a good place to be. Similarly, we can look at statistics for individual change; the reliable change index is one of these. You have someone's baseline score; this black dot in the center of the distribution represents someone with average performance on the task, and then we can ask how large a change we would need to see in order to detect a significant change for that individual.

And with the reliable change index, the reliability of the measure is taken into account. So with poorer reliability, where you come out at this red dot, you would need to see a larger change in order to be able to detect a significant improvement or decline for that individual. So the reliability of our measures is very important for their use in individual differences research and in clinical applications, and a lot of these tasks aren't providing measures as reliable as we would like.
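To make those two points concrete, here is a small numeric sketch of my own, not from the talk, showing classical attenuation of correlations and the Jacobson and Truax reliable change index; the example numbers are illustrative only.

```python
# Numeric sketch of (1) attenuation of an observed correlation by unreliable
# measures and (2) the reliable change index threshold, which grows as
# reliability drops. Example numbers are illustrative only.
import math

def observed_correlation(r_true, rel_x, rel_y):
    """Classical attenuation: r_obs = r_true * sqrt(rel_x * rel_y)."""
    return r_true * math.sqrt(rel_x * rel_y)

def reliable_change_threshold(sd_baseline, reliability, z=1.96):
    """Smallest change detectable at ~95% confidence (Jacobson & Truax RCI)."""
    se_measurement = sd_baseline * math.sqrt(1 - reliability)
    sd_difference = math.sqrt(2) * se_measurement
    return z * sd_difference

# A "strong" true correlation of .5, measured with two reliability-.5 instruments:
print(observed_correlation(0.5, 0.5, 0.5))   # 0.25, i.e., cut in half
# Change needed to call an individual's shift reliable, baseline SD of 100 ms:
print(reliable_change_threshold(100, 0.5))   # ~196 ms
print(reliable_change_threshold(100, 0.9))   # ~88 ms with better reliability
```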

One thing we found in our research is we administered more trials than perhaps a lot of studies did. So for example in Flanker and Stroop tasks we had 240 trials per condition, and what we found is that generally the reliability of the measure will increase the more trials you administer. There are diminishing returns after a point, but this is a general trend that emerged across all of our tasks.

So that is something we can do when we're thinking about using existing tasks or developing new tasks where we can try to optimize them to some extent. Obviously, this introduces other tradeoffs, especially when it comes to clinical populations in terms of perhaps fatigue or motivational differences.  But it's something that is relatively tractable for us to do.
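One standard way to reason about that trial-count trade-off is the Spearman-Brown prophecy formula, which projects how reliability changes as a measure is lengthened. The sketch below is my own illustration with made-up numbers; actual gains for reaction-time difference scores can deviate from this idealized projection.

```python
# Sketch of the Spearman-Brown prophecy formula, a standard psychometric
# projection of reliability as a task is lengthened. Numbers are illustrative;
# real gains for difference scores can fall short of this idealized curve.
def spearman_brown(reliability, length_factor):
    """Projected reliability when the task is lengthened by `length_factor`."""
    k, r = length_factor, reliability
    return (k * r) / (1 + (k - 1) * r)

# A difference score with reliability .45 at 100 trials per condition:
for k in (1, 2, 4):  # 100, 200, 400 trials per condition
    print(100 * k, round(spearman_brown(0.45, k), 2))
# prints: 100 0.45, 200 0.62, 400 0.77; gains shrink as trials grow
```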

In terms of the explanation for why this happens: we weren't suggesting that these are bad tasks. Like I said, the Stroop task in many contexts is a very good task. But perhaps we have optimized it to do something that works in opposition to what we want for individual differences. For decades, we've been using these tasks in experimental psychology because they produce robust within-subject effects; that's why we continue to use them and why they became popular. And it's actually beneficial for producing robust within-subject effects to have low levels of individual differences: if everyone gets a Stroop effect of around the same size, that is more likely to give you a significant within-subject effect, because the standard deviation of the effect goes into your effect size calculation and into a t test.

So, if for decades we've been selecting these tasks because they produce robust within-subject effects, then we may have actually been working against ourselves in terms of producing reliable individual differences measures. So this is perhaps a general lesson that we can take forward into the development of new tasks and the selection of new tasks that we might need to think differently about the way in which we optimize them and the criteria that we evaluate them by during task development.

And another thing I think we can stop and think about here is which measures should we be optimizing. I showed you that the reliability of a reaction time difference score does improve when you run more trials. But that doesn't mean that we should be optimizing reaction time-based scores when it comes to thinking about how these tasks might be used for clinical decision-making.

These are three real participants from some of my data, and I would argue that it is not clear from these data who has the poorest inhibition. Participant A has, for example, the largest reaction time effect but the smallest error effect. Conversely, participant B has the largest effect in errors and the smallest effect in reaction times, and then you have participant C, who is somewhere in the middle on both.

If you were using this task to identify someone with better or poorer inhibition, I don't think there is necessarily anything about our theory, on the surface, that tells us which measure we should be using and which measure we should be optimizing. I think this is something we can stop and think about when we're doing this process.

JENNI PACHECO: I just want to give you a warning, if you could maybe finish up in the next minute or so.  That would be great.

CRAIG HEDGE: Sure.  And we've shown that this matters. So, we've shown that the reaction time-based effects and error-based effects don't correlate, and also you can see in meta-analyses that you often will see effects in one and not the other.  So it matters when you're deciding which measure to use.

And so I shall wrap up with some concluding thoughts, which is should we ditch difference scores altogether?  I would argue that we shouldn't, or at least we don't want to lose sight of using tasks which focus in on these specific cognitive processes that we're interested in.  We know that simple reaction times and simple measures may have better reliability, but they don't necessarily offer us this incremental validity that Brenden talked about.

There may be some measures that we can take from these tasks which provide us more validity and account for some of the problems, like the lack of process purity. For example, we have used evidence accumulation models such as Roger Ratcliff's drift-diffusion model. They're not the only way, and I think Zeynep will talk a bit more about this, but they are one way forward to get a bit more from the tasks we're using.
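For orientation, here is what a model-based measure can look like in its simplest form: the closed-form EZ-diffusion approximation attributed to Wagenmakers and colleagues, which maps a condition's accuracy and correct-response reaction time summaries onto drift rate, boundary separation, and nondecision time. This is my own sketch, not the approach used in the work just described; full drift-diffusion models are typically fit with dedicated hierarchical software rather than this shortcut.

```python
# The closed-form EZ-diffusion approximation (after Wagenmakers et al.), which
# maps one condition's accuracy and correct-RT summaries onto drift rate,
# boundary separation, and nondecision time. Illustrative sketch only.
import math

def ez_diffusion(prop_correct, rt_variance, rt_mean, s=0.1):
    """Return (drift rate v, boundary separation a, nondecision time Ter).
    prop_correct must not be exactly 0, 0.5, or 1; RTs are in seconds."""
    L = math.log(prop_correct / (1 - prop_correct))  # logit of accuracy
    x = L * (L * prop_correct**2 - L * prop_correct + prop_correct - 0.5) / rt_variance
    v = math.copysign(1.0, prop_correct - 0.5) * s * x**0.25   # drift rate
    a = s**2 * L / v                                           # boundary separation
    y = -v * a / s**2
    mean_decision_time = (a / (2 * v)) * (1 - math.exp(y)) / (1 + math.exp(y))
    ter = rt_mean - mean_decision_time                         # nondecision time
    return v, a, ter

# e.g., 80% correct, correct-RT variance 0.10 s^2, mean correct RT 0.70 s
print(ez_diffusion(0.80, 0.10, 0.70))  # roughly v=0.10, a=0.14, Ter=0.30
```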

And the last thing, that if new tasks are needed then I think we need to start thinking about some of these things earlier in the development process.  So thinking about reliability earlier on and what's the best measure to take from that task earlier on and then use those to optimize and move forward.

I shall finish by thanking my collaborators and pointing you in the directions for papers if you need more details. From there, I shall hand you over to Zoe Hawks for the next talk.

Thank you, everybody.

ZOE HAWKS: Thank you so much.  So as mentioned, my name is Zoe Hawks. I am a postdoc working with Dr. Laura Germine at McLean Hospital and Harvard Medical School, and I'll be continuing our conversation on behavioral task optimization. Specifically, I'll be talking about considerations in developing, evaluating, and scaling behavioral assessments for precision diagnostics.

Today when I say behavioral, I'm going to be referring specifically to smartphone-based digital cognitive assessments, but the issues and considerations I'll be discussing are relevant to behavioral tasks as well.

I'll also be continuing the discussion of reliability, this time in relation to time series data, and so I'll be focusing on how behavioral time series can advance precision diagnostics.  There are two main ways in which this may occur. First, by repeatedly sampling data from the same individual, we can get a highly reliable estimate of stable cognitive ability, which may support diagnostic evaluations and risk determinations.

Second, time series data allow us to start asking questions about environmental factors that influence within-person fluctuations in cognition, which may inform longitudinal risk monitoring, provide insights into behavioral compensation, and enable personalized clinical recommendations.

So importantly, as we've heard, a task that was developed to identify stable between-person differences may perform poorly if used to detect within-person fluctuations, and vice versa. To detect stable differences, behavioral tasks have to demonstrate between-person reliability.  That is, the proportion of variance due to differences between individuals must be greater than the proportion due to variation within individuals. Without high between-person reliability, we can't hope to use a behavioral task for clinical decision-making.

Reliability is also important if we hope to identify environmental and psychological factors that exert systematic influences on cognition.  So, here we turn attention to within-person reliability, that is, the proportion of variation in scores that's due to differences within individuals across measurement occasions, rather than differences within a measurement occasion.

High within-person reliability indicates variance across occasions isn't just noise.  Rather, it's meaningful and we can seek to explain it and we can seek to predict it.
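
As an illustration of that variance decomposition, here is a hedged sketch with simulated numbers (not data from the study described here): between-person reliability is essentially an intraclass correlation, the share of total variance attributable to stable differences between people.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_sessions = 100, 30
stable = rng.normal(0, 1.0, n_people)                    # stable between-person differences
scores = stable[:, None] + rng.normal(0, 0.7, (n_people, n_sessions))  # plus occasion-level fluctuation

within_var = scores.var(axis=1, ddof=1).mean()           # variation across occasions, within people
between_var = scores.mean(axis=1).var(ddof=1) - within_var / n_sessions
icc = between_var / (between_var + within_var)
print(f"between-person reliability (ICC) ~ {icc:.2f}")
```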

So in a few minutes, I'm going to hand the mike over to Laura and she'll discuss the process of developing and scaling tasks that allow us to estimate reliable between-person cognitive differences and within-person cognitive fluctuations.

Briefly, I'm going to discuss our lab's recent efforts to understand factors that influence performance on these reliable tasks.  So, much of our ongoing work has been in a sample with type 1 diabetes, and this is a sample in which the association between blood glucose and cognition is hypothesized to be strong, both between and within individuals, and therefore advances in precision diagnostics have important clinical implications for glucose self-management.  This image here depicts a continuous glucose monitor or CGM device, which takes glucose readings every five minutes.  Those recordings are plotted over time in blue, and in tandem we administered brief cognitive tasks to participants' phones three times per day for 15 days.

Using random forest regression with nested cross validation, we observed that clinical characteristics predicted between-person differences in cognition, specifically clinical blood glucose and demographic variables predicted almost 80 percent of the variation in processing speed, and diabetes variables alone extracted from the CGM device predicted almost 40 percent of the variation in processing speed.

However, when we attempted to predict within-person variation using a random forest regression as above, as well as a hierarchical Bayesian approach that better accounted for nested data structures, performance fell dramatically.
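
As a hedged sketch of the shape of such an analysis (the feature names and data below are simulated placeholders, not the study's actual variables), a random forest with nested cross-validation might look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 6))                  # e.g., demographic and CGM-derived features (simulated)
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=120)  # simulated processing-speed scores

# Inner loop tunes hyperparameters; outer loop gives an unbiased performance estimate.
inner = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [3, 5, None], "n_estimators": [200, 500]},
    cv=KFold(5, shuffle=True, random_state=0),
)
outer_r2 = cross_val_score(inner, X, y, cv=KFold(5, shuffle=True, random_state=1), scoring="r2")
print("nested-CV R^2:", outer_r2.mean().round(2))
```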

So how do we start thinking about new approaches to improve within-person prediction?

Many of our analyses to date will fail to predict within-person variation to the extent that there are between-person differences in the magnitude or strength of the within-person associations.  Using type 1 diabetes again as an example, most individuals show a quadratic association between glucose and cognition. So here we have the CGM glucose readings on the x-axis and working memory on the y-axis.

But the strength of that association varies. So we can see that for some individuals the acceleration of those curves is strong, whereas for others they're showing more of a linear effect.  If we want to allow the strength of the IV-DV associations to vary across multiple predictors and if we potentially want to model interactions, it becomes computationally intractable to do so using traditional group modeling approaches.

So we're starting to think that accurate prediction of within-person cognitive fluctuations may require a person-centered modeling approach, or an idiographic approach, as we heard about earlier.  For example, we might build a separate predictive model for each participant in our sample, split their time series into training and validation sets, and then, using permutation testing, compare performance in the validation set to chance.  This would provide an empirical check on any sort of spurious finding.
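
A minimal sketch of that idea, assuming a single participant's time series of simulated predictors and cognitive scores (none of the variables below are the study's actual data): fit a person-specific model on the earlier portion, evaluate it on the later portion, and compare against a permutation-based chance distribution.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
n_obs = 150                                  # roughly the per-person sample size mentioned below
X = rng.normal(size=(n_obs, 4))              # e.g., time of day, stress, prior-night glucose (simulated)
y = X[:, 0] * 0.8 + rng.normal(scale=1.0, size=n_obs)

split = int(n_obs * 0.7)                     # chronological train/validation split
model = RandomForestRegressor(random_state=0).fit(X[:split], y[:split])
observed_mse = mean_squared_error(y[split:], model.predict(X[split:]))

# Chance distribution: refit after shuffling the outcome within the training set.
chance_mse = []
for _ in range(100):
    y_perm = rng.permutation(y[:split])
    m = RandomForestRegressor(random_state=0).fit(X[:split], y_perm)
    chance_mse.append(mean_squared_error(y[split:], m.predict(X[split:])))
p_value = np.mean(np.array(chance_mse) <= observed_mse)
print(f"validation MSE = {observed_mse:.2f}, permutation p ~ {p_value:.2f}")
```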

I will note that this requires a lot of data; in fact, we're starting to estimate that we need roughly three times as much data as we currently have, so maybe close to 150 observations per person.  That is based on some simulation testing, the results of which I'm going to show now.

So again, these are simulated results, but we think that they are promising.  They suggest that person-centered models, here in pink, consistently achieve lower prediction error than chance models, and this is particularly true for our measures of processing speed.

Further, this sort of approach allows us to start unpacking person-specific influences on cognitive performance.  These may include things like practice effects, time of day, stress, and prior-night glucose from the CGM.  We might expect some of these effects to be common across participants; for example, the practice effect is fairly strong and robust, and I was seeing it in many of these person-specific models. But others may be more variable.  For example, whereas this participant showed a strong association between prior-night glucose and cognitive performance, another participant might show a stronger association between glucose variability preceding testing and cognitive performance.

And so this sort of individualized modeling can have clinical implications for clinicians who wish to make behavioral recommendations for glucose self-management.

Given that we're seeing these potential clinical implications, again in the simulated set and pending further testing, the next key question becomes whether we can scale these approaches.  So with that, I am going to turn the mike over to Laura, who will wrap up our shared presentation here.

LAURA GERMINE: Thanks, Zoe. So, in the final couple of minutes I just wanted to talk briefly about how we measure behavior at scale.

When thinking about within-person cognitive assessment, as well as between-person cognitive assessment, and the talks from Dr. Hedge and Hawks, I think scale is kind of the elephant in the room in a place with many, many elephants.  How do we get to the frequency of assessment and the sample sizes that we need to make the sorts of nuanced and precision medicine inferences that we want to be able to make?  And this is a hard problem and I think it involves a reorientation towards how we think about assessment and how we think about the way we do our science.

So, of course, thinking about psychometrics in all the ways that you've heard from previous talks, measures need to be sensitive and reliable, but they also need to be efficient, and that's something that we haven't been thinking about too much. As Dr. Hedge mentioned, efficiency is often in direct opposition to considerations around reliability: longer tests are needed for more reliable data, but shorter tests are needed to minimize burden when we think at scale.

We also need to be thinking about engagement.  Commonly, when we think about our research and our science, we think about our own scientific goals, the goals of the community, hopefully about patient care and clinical needs, but rarely about the goals of our research participants.  If we're going to get to a point where we have many people doing many measures, we have to think about the goals of individuals. The way we've approached it has been through return of research results at the individual level, but also through things like user-centered or universal design principles.

So think about engaging people, because that is an important and critical barrier to getting to the sort of scale we want, and this is especially hard to do longitudinally.

Then finally, thinking about the technology itself. I think we often consider technology an opportunity, but as Dr. Pratap mentioned in his talk, technology is often a source of potential confounds: the demographics of device ownership are things to consider, devices influence our measurements in important ways, especially in a bring-your-own-device model, and devices change over time. Anyone who has developed for modern hardware and software knows that it is a moving target. It changes all the time, and our measurements need to keep pace with that.

So I think to get to scale, digital technology is absolutely required, but we need to consider some of these other things that maybe we're not used to considering.

When we think about task design, I think we also need to reapproach task design from the traditional approach.  So you might think that the way you design a task is you make a new task, you do some pilot data collection, you might evaluate the psychometrics of that task based on the pilot data, maybe make some changes, but then move on to a larger scale study and maybe, as part of that larger scale study, you report the reliability and validity of this new task.

Given that so many of our tasks aren't going to translate to scale well, they're not going to translate to within-person measurement, they might not even translate to between-person measurement, as Dr. Hedge mentioned, we really need to think about tasks as prototypes, right? So you start with a task prototype.

Then you do data collection with randomization. You might not know exactly which parameters are going to optimize reliability, efficiency, engagement, and accessibility across technology types.  These are all empirical questions, so A/B testing, or testing with randomization, needs to be part of the development flow of tasks.  At that point, you evaluate the psychometrics, engagement, and accessibility across all the different parameters you tested, figure out which ones are priorities for a given use case or a given study, and make modifications based on the results of that A/B testing, selecting parameters based on the considerations I mentioned, but also things like user feedback, user behavior, and item-level psychometric analysis.  Then you go through this again: data collection with randomization continues, with modifications wherever you can find them, until you meet some criteria for sensitivity, reliability, engagement, and accessibility and no further modifications are identified.
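
As a purely illustrative example of building randomization into that development flow, the sketch below assigns each incoming participant to one of several hypothetical task-parameter variants at random, so that psychometrics, engagement, and accessibility can later be compared across variants; the parameter names are placeholders, not a real task's settings.

```python
import random

# Hypothetical task-parameter variants to be compared in an A/B fashion.
variants = [
    {"n_trials": 40, "stimulus_ms": 500, "feedback": True},
    {"n_trials": 40, "stimulus_ms": 250, "feedback": True},
    {"n_trials": 80, "stimulus_ms": 500, "feedback": False},
]

def assign_variant(participant_id: str) -> dict:
    # Seeding on the participant ID keeps the assignment reproducible across sessions.
    return random.Random(participant_id).choice(variants)

print(assign_variant("participant-001"))
```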

This is always going to be with respect to some purpose. So is the purpose within-person assessment in a certain group, because you're developing for that purpose?  This is going to get us hopefully tasks that are more suited to the sort of scalable assessment, and sensitive to the types of variability that we really need moving forward.

So I'll go ahead and stop there and turn it over to Dr. Enkavi, who will continue our discussion of task development and psychometrics.

ZEYNEP ENKAVI: Thanks for having me.  I'm Zeynep.  I am currently a postdoc at Caltech, but previously I was a grad student with Russ Poldrack at Stanford, and today I am going to talk about a few things we've learned on how to use cognitive tasks to gain some mechanistic insights into some underlying cognitive processes.

I will primarily discuss data from healthy controls, but I will highlight some follow-ups that have been done and can be done to understand psychopathological conditions as well.

Around the time Craig's reliability paradox came out, we were interested in self-regulation and collecting a large dataset using pretty much all the measures we could think of. For the purposes of this talk, self-regulation is only an example cognitive process, though many of the tasks listed here might be relevant to our research, too. Inspired by Craig's work, one of the first things we did with this dataset was to document the lack of retest reliability in this wide array of tasks.

In addition to finding low reliability in our new data, we also delved into the literature and showed that reliability was similarly low in the little bit of data we could find there. In this key figure from the paper, the red line denotes a reliability of zero, the violin plots denote reliability estimates from our dataset, and the yellow diamonds are reliabilities from the literature. As you can see from the ranges here, many of them were well below an acceptable level of stability if they are to be used as individual difference characteristics.

But having this array of tasks also allowed us to examine other things as well, such as comparing the reliability of different types of metrics and studying the latent structure of, in this case, the construct of self-regulation.  To foreshadow what's to come, these are going to be the two approaches I will present as two pathways to potential mechanistic insights.

But let me begin with the first point, looking at different types of metrics, and explain what I mean by different types of metric.  As Craig also mentioned in his talk, in cognitive psychology we have a long tradition of measuring behavior using tasks such as the Stroop, where we ask participants to respond, for example, based on the ink color of the word they see, and measure things like their response times and accuracies. The idea is that by comparing changes, for example in response times across conditions, we measure proxies of certain cognitive processes, like cognitive control or inhibition.

In this task, for example, we could look at how much slower one is in the incongruent condition compared to the congruent condition and call that the Stroop RT.  Again, as Craig mentioned, interpreting performance in this task using Stroop RT could be tricky, because you can have a participant who is very fast but not very accurate and another who is very accurate but not very fast.  Yet they might have the same Stroop RT, and this, as many of you are probably aware, is the speed-accuracy tradeoff.

Cognitive psychology has a rich history of developing insightful computational models to extract interpretable metrics from tasks using a model-based approach.  In this case, a commonly used framework is sequential sampling models, or evidence accumulation models, and these models construe each response as resulting from an evidence sampling procedure.  The idea is that during a trial, a subject samples evidence for one or the other possible response and accumulates it in a decision variable, and when that decision variable reaches a boundary, the subject executes the decision for that boundary.

In this modeling framework, we analyze response times and accuracies simultaneously and capture underlying processes with the parameters of the model. The parameters are crucially designed to map onto separable mechanisms that give rise to the observed performance. This is in contrast to accuracies and response times, which are outputs of any underlying mechanism.

In this case, the drift rate is designed to measure processing speed, boundary separation is designed to measure caution, and starting point variability can measure some sort of response bias. More importantly, from this perspective, if we have a participant with low accuracy and fast response times, we can say that they are not responding cautiously, or, in terms of parameters, that they have a low response threshold, and we can distinguish that from a participant with fast processing speed, or a high drift rate, because a high drift rate would lead to a different prediction: not just fast responses, but fast correct responses in this task.
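
To make those parameters tangible, here is a minimal simulation sketch of a simple diffusion process (a toy version, not Ratcliff's full model and not the speaker's code), showing how drift rate and boundary separation jointly shape accuracy and response time.

```python
import numpy as np

def simulate_ddm(drift, boundary, n_trials=500, dt=0.001, noise=1.0, seed=0):
    # Evidence starts midway between the two boundaries and accumulates with noise.
    rng = np.random.default_rng(seed)
    rts, correct = [], []
    for _ in range(n_trials):
        x, t = 0.0, 0.0
        while abs(x) < boundary / 2:                       # accumulate until a boundary is hit
            x += drift * dt + noise * np.sqrt(dt) * rng.normal()
            t += dt
        rts.append(t)
        correct.append(x > 0)                              # upper boundary = correct response
    return np.mean(rts), np.mean(correct)

for drift, bound in [(2.0, 1.0), (2.0, 2.0), (4.0, 1.0)]:  # larger boundary = more caution
    rt, acc = simulate_ddm(drift, bound)
    print(f"drift={drift}, boundary={bound}: mean RT={rt:.2f}s, accuracy={acc:.2f}")
```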

Still, the details of the specific model family and its parameters are less relevant for this talk than the larger point that from cognitive tasks we can both directly measure raw performance metrics and also derive model-based variables informed by our hypotheses about the generative cognitive processes underlying observed performance.

And parameters from these models are designed to map onto separable mechanisms.  Using these sorts of model-based metrics to relate to psychopathology is what computational psychiatry has been doing as well, as mentioned in previous talks and the introduction, too.

Now, our dataset contains multiple tasks that could be modeled using this framework, so we could compare the reliability of these different types of metrics, raw and model-based. The good news was that the model-based metrics were no less reliable than the raw performance metrics, but the less good news was that none of the variables were actually at an acceptable level of reliability for individual difference analyses. But our analyses were data-driven.

We didn't focus on any specific tasks to do a more systematic comparison of whether it was possible to find better metrics or better ways of estimating reliability. But we did make all of our data and code openly available, which allowed others to dive deeper, which was great, because one study for example looked closer at the selection of tasks and did find acceptable levels of reliability for drift rates.  So this again was looking at using the same modeling framework, same model-based approach.

Another study -- and I believe one of the authors will be talking at another session, and others are in the attendees now -- focused on another set of measures from a different modeling framework, specifically reinforcement learning models, which also have an extensive literature associated with them. In addition to analyzing some of our data, they showed how higher reliability estimates can be computed, and also how the relationship between a model-based variable and a clinical measure, in this case a measure related to compulsivity, can be strengthened with more data and different parameter estimation methods.

So to conclude this first point, our largescale analysis of multiple tasks allowed us and others to examine and improve on the viability of using model-based metrics for mechanistic insights, but this wasn't the only thing it allowed.  It also allowed us to examine the latent construct structure. 

So that's the second path that I'd argue might help us toward mechanistic insights.  In fact, this was actually our first goal with our datasets, and one common way of looking into the latent construct structure of a cognitive process that we are interested in is using a method like factor analysis, which is very common in the literature.

We used factor analysis to see which metrics from this wide array of tasks covary together, and the hope was that if we can find clusters of metrics that covary, we might be able to interpret them as capturing some overlapping underlying cognitive process.

So, for example, the model-based metrics I mentioned before, the drift rates and thresholds, et cetera, loaded onto three separable clusters. This confirmed that the model-based measures and the factor analysis worked as designed to measure separable processes, and it also highlights how these model-based metrics can make performance comparable across tasks.

In our battery, however, not all of the tasks could be analyzed within this framework.  In fact, there were almost as many other types of metrics, and the full space that we measured was captured by a five-factor solution.  In the interest of time, I won't go into the details of these factors, but the larger takeaway here is that using a large battery of tasks allowed us to distill a potentially large and unmanageable number of metrics into a smaller list of meaningful latent constructs with mechanism-related interpretations.
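
For orientation, here is a hedged sketch of that kind of analysis on simulated data (the dimensions and loadings are made up, not the study's battery): factor-analyze a participants-by-metrics matrix to find clusters of metrics that covary.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
n_subjects, n_factors, n_metrics = 300, 3, 12
latent = rng.normal(size=(n_subjects, n_factors))             # hypothetical latent constructs
loadings = rng.normal(scale=0.8, size=(n_factors, n_metrics))
metrics = latent @ loadings + rng.normal(scale=0.5, size=(n_subjects, n_metrics))

fa = FactorAnalysis(n_components=3, rotation="varimax").fit(metrics)
print(np.round(fa.components_, 2))   # rows = factors, columns = loadings of each task metric
```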

We can do even more with this kind of data. Using methods like hierarchical clustering, we can get a sense of the distances between the metrics as well.  In our example, this revealed that the drift rates that loaded onto the same factor were not all equidistant from each other; the drift rates from different tasks were not equidistant from each other. This kind of information can help us design studies that maximize our ability to sample the space of the cognitive process we are interested in, and it might also help us build hypotheses about what processes might be secondarily affected by treatments designed to target specific processes.
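
And a minimal sketch of that clustering step, again on a small simulated participants-by-metrics matrix rather than the actual battery: hierarchical clustering on correlation distances groups metrics that rank participants similarly.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
latent = rng.normal(size=(300, 3))
metrics = np.repeat(latent, 4, axis=1) + rng.normal(scale=0.6, size=(300, 12))  # 3 blocks of 4 metrics

dist = pdist(metrics.T, metric="correlation")     # 1 - correlation between metrics
tree = linkage(dist, method="average")
print(fcluster(tree, t=3, criterion="maxclust"))  # cluster label for each of the 12 metrics
```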

As follow-up studies, and there is certainly some work out there already, it would be very interesting to see whether the latent structure of any cognitive process that we are interested in, both in terms of the factors and in terms of the distances between the metrics, is similar across clinical and healthy populations.

Okay, so, to wrap up, one of the questions posed to us in the agenda was whether we need different tasks for clinical decision-making versus identifying mechanisms for treatment, and I might have to punt on that one and say that it is too soon to tell.

I think to be able to answer that question, we need to know whether we can estimate computational model parameters similarly well in clinical populations compared to healthy controls, and we also need to know whether the latent structure of a cognitive process is similar across healthy controls and clinical populations, too.

If the answer to both of these questions is no, then we can go back to the drawing board and see if we need different tasks. 

So to summarize, I presented two paths towards mechanistic insights using cognitive tasks, the first relying on computational modeling and model-based metrics, and the second analyzing and comparing the latent structure of the cognitive process that we are trying to capture. Batteries of tasks, rather than single tasks, are more powerful for both of these approaches.

So that's all I had. Thanks again for having me, and with that, I will pass it on to Russ to moderate the discussion.

RUSS POLDRACK: Thank you very much.  This is a great set of talks.  We're going to open up to questions now.  Please go ahead and type your questions into the Q&A, and I see there's already one, which is actually relevant to Zeynep's talk.  So why don't we address that one first?  This is from Roger Ratcliff, who says that they have a TICS paper on the mismatch between neuropsych testing and cognitive modeling. They mention that reliability is high when enough data are collected, for example correlations in main model parameters from .7 to .9, or between tasks of .6 to .7.  They also have tricks to increase power on difference measures by a factor of 2.

Zeynep, do you want to -- do you have anything to say about sort of this issue of like number of trials and reliability?  I know Craig also mentioned it in his talk.

ZEYNEP ENKAVI: This is a question that we have received a few times now, and, yes, increasing the number of trials, as Craig's analysis showed as well, can lead to improved or higher reliability estimates for some of these measures.  But because we have the larger array of tasks and the various kinds of measures, when we did an analysis looking into the effect of increasing the number of trials for not just drift-diffusion model parameters but also different kinds of measures, we didn't see a consistent pattern.

So I don't think we can say broadly that increasing the number of trials will always lead to higher reliability estimates.  I think it is possible, but then we also get into these issues of how many trials we need, at what point a task is still useful for a clinical battery with, say, 600 trials per participant, and with how many trials we can actually get reliable estimates.

So my answer would be that it is possible to get better, more reliable estimates with more trials, even though that sometimes brings you to a point where it is no longer feasible to administer the task, and we also did not find a consistent relationship between trial numbers and reliability estimates for all types of measures.

RUSS POLDRACK: Great, thanks.  Craig, did you have anything to add on that point?

CRAIG HEDGE: Thanks, Russ.  I largely agree.  I think Roger is right in the sense that you need to get enough trials, whatever measure you're taking from these tasks, whether it's reaction times or drift rate measures.  I think there have been some papers which have shown that the reliability of drift rates generally is about the same as the reliability of reaction times, and actually it tends not to exceed them, and we shouldn't necessarily expect the computational measure that is derived from a behavioral measure to exceed the reliability of it, and similar to what Zeynep says, you have a challenge of depth versus breadth, right?  If you have one task and you can spend 15, 20 minutes doing it, then you need to really know that you're using the right task as opposed to a battery approach where it might make sense to have shorter versions of multiple tasks in order to get across the range sort of thing.

So I think there are benefits to running more trials, but yeah, whether they are practical or not I think is still an open question in a lot of cases.

RUSS POLDRACK: Laura?

LAURA GERMINE: So, I think that the increasing-trials-for-reliability point is interesting, because in theory it should help, but I think there's the engagement piece, especially when we start thinking about scale and many timepoints. The longer you do the assessment, the more people are going to stop paying attention, or stop having their brains be engaged even if they're nominally engaged in the task, and that's going to vary with clinical conditions.  It's going to vary with cognitive reserve; it's going to vary with all the things that we care about and that we want to predict.

And to the extent that you have a group that now is performing at chance because they've checked out, that actually inflates your reliability.  So I think when we think about reliability, it's not just psychometrics; the people who are doing the tasks and their burden are going to affect the validity and reliability if we make those tasks too long.

RUSS POLDRACK: I think that's a great point: classical test theory assumes that there's this unchanging mechanism giving rise to responses on the task, and we know that that's not really true.  It's been interesting for me to be involved in this over the last few years, because as an experimental psychologist, I didn't learn anything about psychometrics.  I probably learned 20 minutes of reliability in my statistics class in graduate school and then haven't thought about psychometrics since, because as experimentalists we assume we don't have to worry about it.  The rise of these issues, which I think was really spurred initially by Craig's paper, has really driven us all to think about not just how we can address these issues in our own work, but also how we can start training the next generation of researchers in experimental psychology to have a better appreciation of the importance of psychometrics as well.

So I think there's a great next question from Adam Kepecs.  Excellent and thought-provoking talks.  My main question is what would success look like?  Should we be looking for metrics for individual differences that are large and reproducible, or for ones that predict clinical variables, or are those goals the same?  Does one of you want to jump in and tackle that?

LAURA GERMINE: I suppose I can say something.  I think we have to think about prerequisites, right?  So there are prerequisites to good measurement.  There are prerequisites to clinical prediction.  There are prerequisites to large and reproducible effects.  One of those prerequisites is basic reliability, right?  Another prerequisite is some sort of validity, and maybe the validity is whether it predicts clinical things. But thinking about measurement at the most basic level -- what are the foundational things that we absolutely need to do anything else -- I think is really important, and that's in large part what many of our talks were about.

And then I think that the individual differences in clinical prediction are associated things, but, you know, what is the purpose?  What is the research question then becomes the driving force, once you've got the foundations in place. But I think those foundations are really important for all of these questions.

CRAIG HEDGE: I think one of the questions that we had with the task reliability issue is that -- so, the context was there have been some debates about whether response inhibition or self-control, however you want to phrase it, is an underlying construct, whether these tasks all should correlate with each other, and I think the way we approached it is that you can't really answer that question until you have solved the reliability issue, that if you see low correlations between tasks that are supposed to measure the same thing, you don't know if that thing doesn't exist or your tasks are just rubbish.

So I think to me the reliability was a barrier to answering some of those more substantial questions like can we answer theoretical questions about whether these constructs exist or whether these constructs will predict clinical symptoms.  That is kind of how we came at it.

RUSS POLDRACK: Yeah, and this whole discussion dovetails with a larger discussion that has been going on for a while, but I think has gotten a lot of play lately in the context of deep learning models, which has to do with this tradeoff between explanation and prediction.  You can get really complex models that are very good at prediction but very difficult to explain in terms of what they're actually doing, and I think that we start to merge into some of those same questions here.

So, Luke Stoeckel asks, for Zeynep, the latent structure analysis focused on self-regulation and its measures, are there features of constructs or performance or behavioral measure types that will perform or fit better or worse in this analysis?

ZEYNEP ENKAVI: So I've been rereading Luke's question a few times trying to understand it, and I am not entirely sure my answer will address what he wants me to get at.  One thing that popped into my mind was that in our factor analyses, even though we didn't know how all the other tasks were going to relate to each other, all of the delay discounting tasks clustered nicely with each other, without necessarily relating too well to everything else.  Some delay discounting researchers took that as a good thing, also because that factor happened to be the most successful one in predicting other real-world self-regulation-related behaviors. But on the other hand, researchers from more experimental psychology backgrounds were more discouraged by that, because these were more survey-like tasks and not necessarily telling us as much about the mechanism.

So that was one feature of those tasks, their survey-likeness, and perhaps the participants' ability to monitor their responses, because response consistency in those tasks was one thing that made them more reliable, and perhaps more predictive as well.  But in terms of helping us understand mechanism, that wasn't a very satisfying set of measures.

So again, I'm not entirely sure if I'm getting at what Luke was asking, but that was the one thing I could think of: how participants' ability to monitor their responses in a set of tasks seemed to help with reliability in that sense.  But please email me if I didn't get at that.

RUSS POLDRACK: Cool, thanks.  So there is a question from Michael Trang. What do you make of the increase in trials affecting the validity of the measure?  I think what that really meant to say was affecting the reliability of the measure, because I haven't seen anybody talk really about the relationship between number of trials and validity. 

Somebody can jump in and tell me if I'm wrong, but we've actually known this since Spearman published his prophecy formula, right?  It doesn't always hold exactly with noisy data, but in general there's a lawful relationship between the amount of data and the reliability of the measure.
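
For reference, the Spearman-Brown prophecy formula gives the predicted reliability when a test is lengthened by a factor k; a small illustration with arbitrary starting reliabilities follows.

```python
def spearman_brown(r: float, k: float) -> float:
    # Predicted reliability of a test lengthened by factor k, given current reliability r.
    return k * r / (1 + (k - 1) * r)

print(spearman_brown(0.5, 2))   # doubling the trials: 0.5 -> ~0.67
print(spearman_brown(0.5, 4))   # quadrupling the trials: 0.5 -> 0.80
```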

But there's a second part of this question that I think is maybe worth a little more discussion, which is how we think about the difference between, say, peak performance and typical performance for the purposes of the kinds of questions that we're trying to ask.  We often think about the number that we estimate as being a single stable measure, but as we've already talked about, even within a session you can have variability as people's attention flags.  We also know that if you look at reaction times, you can see rhythmic effects, 1/f noise in reaction times.

Do panel members have thoughts on any of those points?

LAURA GERMINE: So for neuropsychologists, the idea is that you're getting at best performance, right? That's a big assumption in neuropsychological assessment, that we are trying to create conditions to get at people's best performance, because best performance is the thing that's the most diagnostically useful.  I think that's an assumption that is not necessarily backed up by a lot of literature, and in fact, in psychiatry and mental health, typical performance -- so not how well someone can do, but how well they do on a daily basis when they're grappling with all the stresses and potentially increases in symptoms of their daily lives -- is perhaps more important, or maybe it's poorest performance or variability in performance.

So I think which type of performance is the most important largely depends on what the research question is and what the thing is you're trying to get at.  What is the person's potential or how are they doing in their everyday life?  From a psychometric standpoint, the way I think the thing that always has to be asked is what is the reliability of the thing that you're trying to measure.  What is the reliability of that peak?  What is the reliability of the typical performance?  It's much easier to get a reliable measure of typical performance if you have lots of timepoints. That's not too hard.

If you're doing things like the longest reaction time in a cognitive test, that might be predictive of things, but is it reliable; if you did that again would the same people be rank ordered in the same way as the highest versus lowest?  I think those are important questions.  So again, keeping the foundations and which of the measures is important is dependent on the clinical or the research question.

RUSS POLDRACK: I think that's a really important point. Estimating things about the tails of distributions is a lot harder than estimating the mean of the distribution.

Joel Nigg has a question or a comment.

JOEL NIGG: Just one comment here. The variability and inconsistency itself is information, and sometimes is predictive of clinical outcomes. Just a reminder of that. Of course, things like reaction time and variability are often correlated. Not always, but often.

We've noticed, for example, in neuroimaging data that the amount of movement, which is usually considered a problem, is a nice predictor of ADHD, and so there's a variety of these kinds of examples.  So just a reminder that what we consider a problem sometimes is information.

RUSS POLDRACK: I think that's a really great point.  If you look at the aging literature, there's a ton of work on changes in variability and reaction time along with changes in mean.

Zeynep?

ZEYNEP ENKAVI: I think I was just going to make the point that Craig's paper and all these other papers highlighting lack of reliability made us rethink a lot of things, but I think what Laura has been bringing up as well is perhaps the more important thing it is forcing us to do: to really think carefully about what our research question is and what statistical properties the measure we are using needs to have for that.

It was very straightforward to think, oh, if you have some sort of significant correlation, you have a paper out of it, but whether that significant correlation is meaningful for anything wasn't something that we thought about, or at least that I was instructed to think about, very carefully.

Now when designing the study, we look into, okay, what are we going to measure, what are the properties of the thing that we're going to measure, and for the question that we are trying to answer, is it going to mean anything?  Because yes, that variability, that lack of stability, can be informative as well.  Sometimes that might be what we need to get at, too.

So yeah, this emphasis on thinking about the research questions has been the most informative thing for me at least.

CRAIG HEDGE: So I would echo the comments that the variability is useful information.  This is a lot more explicit when you use evidence accumulation models, because that reaction time variability is used in the estimation of certain parameters.  Obviously, you still perhaps assume some stability throughout the session when you're doing that, but it is useful information.

Also, on the question about the relationship between trial number and validity, if that was the intended question: Laura already mentioned the motivation issue, where the more trials you run, the more you might expect that towards the end you are actually measuring people's ability to stay motivated as opposed to what the task is supposed to measure.

There are also suggestions, when it comes to response inhibition and cognitive control, that what you really care about is cognitive control in a novel situation.  That's where it should be most meaningful. When you have had people doing a task for a long period of time, what you're measuring is the degree to which some of those processes may have become automatic, and it's not the executive functioning stuff we're interested in anymore.

We did try to look at this a little bit in the paper.  I am not sure if it ended up in our trial-level analysis, but we tried looking at reliability as a function of the number of trials, depending on whether you took the first trial and worked forward or the last trial and worked backwards.  Basically, what we found is that more trials helped regardless of where you took them from.

So it's not to say that those mechanisms, like practice effects and fatigue effects and motivation effects, aren't at play; they just weren't strong in our data, at least when we tried to look for them, and we did try to look for them.

RUSS POLDRACK: Zoe.

ZOE HAWKS: Just a follow-up point, thinking about implications for clinical translation of these within-person variations we may see from typical to peak performance and so forth.  I think to the extent that we're able to measure that variation, again, reliably, then we can ask questions about what sort of environments, what sorts of moods, what sorts of social contexts allow a given person to perform at their peak and is that something that is supportive for them from a functional standpoint?

And then related to variation, I think there's the question of if it is an important predictor of some clinical risk factor or diagnosis, like, what -- it's a proxy of sorts.  So what is it approximating?  What gives rise to that variability, and is that another black box that we need to unpack?

RUSS POLDRACK: So we have got a bunch of questions.  I'm going to take advantage of my prerogative and pick one that looks particularly interesting, which comes from Michele Ferrante, who says: task optimization has to be question-dependent.  Are we using the task, for example, to do treatment assignment or to identify diagnostic biotypes?  As a field, do we need an environment, sort of like a gym for training task parameters, where we take basic tasks and intelligently optimize them for the specific question of interest?  I think this is a really cool idea.  This idea of gyms has gotten a lot of play in the machine learning world, the ability to develop whole environments to test methods, and I think that something like this could be really valuable here, because I agree that in thinking about how we might actually go about optimizing tasks, just the amount of data that we would need to go out and collect would be really challenging.

So thinking about some ways that we might -- I don't know if this is actually where Michele was going, but thinking about taking advantage of machine learning methods to give us tools, like kind of proxy subjects to run our tasks on, to help figure out how things might work.  I'm interested to hear if any of you have thoughts on that.

LAURA GERMINE: So, I think that -- I'm not going to answer the machine learning question, I'll let someone else take that.  But I think that the fact that there are so many different research questions, and you could design a task for every single type of research question, and you have to optimize each one, and also make it generalizable and also make it engaging and also make it technically accessible.  Like it's impossible, right?

But I think a lot of the work can be done by understanding a few things about a given task.  So understanding like the basic reliability and what it correlates with, but also things like what part of the normal distribution is that task sensitive to, and that one does a lot of work.  So there are tasks that are sensitive to the tails, to really extraordinary ability or to impairment minus two standard deviations.  There are tasks that are really good right around the mean.

And often I think when we assume something is related to it, like a task is related to a disorder or some function is related to a disorder, what we're actually looking at is that the tasks that get at working memory and dual tasking or something tend to have a certain difficulty level, and that difficulty level corresponds to the level of average impairment in a certain clinical group.

So the flip side of that is understanding the basic psychometrics of things like what part of the normal distribution is this task sensitive to, because that corresponds to level of impairment we expect in a clinical group or given a certain question or the area of the normal distribution where we expect change to be from and to.  I think those basic sort of facts do a lot of work in helping select tasks that will be appropriate for certain questions.

ZEYNEP ENKAVI: This question on the machine learning part, or optimizing design, made me think of two things.  One, yes, there has been a bunch of work recently on adaptive optimization of tasks: while collecting data, you keep a continuous estimate of the performance metric that you're interested in and decide at what point you can stop collecting data from a participant.
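
One way such an adaptive rule could look, as a hedged sketch rather than any published procedure: keep collecting trials until the running estimate of a participant's performance metric is stable enough (standard error below a cutoff), within some minimum and maximum.

```python
import numpy as np

def collect_until_stable(draw_trial, se_cutoff=10.0, min_trials=20, max_trials=400):
    # Collect trials until the standard error of the running mean drops below se_cutoff.
    scores = []
    while len(scores) < max_trials:
        scores.append(draw_trial())
        if len(scores) >= min_trials:
            se = np.std(scores, ddof=1) / np.sqrt(len(scores))
            if se < se_cutoff:
                break
    return np.mean(scores), len(scores)

# Example: a simulated participant whose trial RTs are noisy around 600 ms.
rng = np.random.default_rng(6)
estimate, n_used = collect_until_stable(lambda: rng.normal(600, 100))
print(f"estimate ~ {estimate:.0f} ms after {n_used} trials")
```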

But then on the other question, about having sort of virtual participants, the one thing that came to mind, even though it's not directly related to this, is the issue of estimating parameters for models that we don't know the likelihood functions for. There is now work where people train neural networks by simulating a bunch of data from a data-generating-process simulator, and you can think of this as giving you an approximate likelihood function: if that is your generative process, then these are all the possible ways your data can look.

We could make use of those kinds of methods, even though currently we are using them to estimate parameters for arbitrarily complex models.  If we have the lay of the land, understanding what the variability of the data can look like and where a participant's data distribution falls within that possible realm, then that might help us understand better where our measure falls and how to pick our metrics based on that as well.

So, this is to say that there are neural network methods that are now using arbitrarily complex models to generate these likelihood distributions, but we haven't necessarily made too much use of that to pick metrics from studies yet.

RUSS POLDRACK: Thanks.  So there is a question from Mario Luiz Furlanetto, Jr., which was a specific question for Zeynep, but I'm going to generalize it a bit.  Let me give you my interpretation of the question, or a general version of it: we spend a lot of time having people do these particular cognitive tasks that are very different from anything they do in their real lives, and they often do them in the context of a physician or a neuropsychologist, or sometimes they're just doing them at their computer at home.  But throughout cognitive neuroscience, certainly, and I think neuroscience in general, there has been an increase in interest in using more naturalistic types of tasks and experiences to try to understand the neural basis of how people engage in those things, in part because in certain areas of neuroscience it has become really clear that an understanding gained from very simple, nondynamic stimuli, for example of how early vision works, just didn't scale when you wanted to go understand how an animal makes sense of a complex, dynamic environment.

So I'm interested to hear thoughts from any of you about how you see the move from these very stripped-down, oversimplified cognitive tasks to more naturalistic types of experiences, which are obviously more challenging, but also might be better at actually eliciting the kinds of things we're trying to understand.

LAURA GERMINE: I think this is really important.  I think one of the big frustrations I have as someone who primarily does cognitive testing is just this constant feeling of like this is not cognition in the real world, right?  No good solutions to that.  But I think what we do need to decide is when we try to make things that are more, let's say, ecologically valid, how do we judge that something is more ecologically valid? 

Like, often what you see happening is they'll take the standard stimulus in a working memory task and replace it with objects that are real-world objects.  Like, okay, I guess that's real-world objects.  Or they'll try to take a task that's task switching and have it be two tasks that people actually do, but does that actually get you to ecological validity, or is it just swapping out one stimulus for another stimulus, and the task is actually equally unnatural?

I think one of the angles that I haven't seen explored, which I think would be really interesting, is the idea that the degree to which something is ecologically valid is the degree to which the task is natural for people, in that they don't need a lot of instructions.  There is something intuitive about the task, and if it's intuitive, that means people bring to it an understanding from the experience they already have, and I think good user design gets at this.

I think that's a version of ecological validity.  So if something is so complicated that you need expert participants to do it, that's probably not very realistic in real-world terms, whereas something that is more intuitive might get you closer.

I do think, though, that I hope digital technology enables us to have more naturalistic ways of measuring cognition, but again it has to be this foundation of psychometrics that I think often comes much later in the development process.

JOEL NIGG: I'm just thinking that this question raises so many interesting issues, and I appreciate this discussion in general.  But I think about the tradeoffs of experimental control, mechanistic validity, ecological validity.  You can probably have two out of three, but not three out of three.  So what do we want to trade off there?

But also, this always raises the challenge for me: we work both with these computational tasks like I've been talking about here and with a lot of rating scales data, and then we try to look at clinical prediction -- I'll talk a little bit about it this afternoon -- but it always raises the question of when we want these tasks and what they are for, versus rating scales, which clearly can measure real-world behavior very quickly and often very validly, but don't get us that mechanism at all.

So I think there's kind of a question of whether we want tasks to get us ecological validity or to get us mechanism, what we want rating scales to get us, and then how you combine them.  This is obviously a fundamental problem and challenge in the field, but I just wanted to highlight it.

RUSS POLDRACK: Thanks.  My wife is an interior designer, and she has this thing that she says to clients: fast, cheap, good -- pick two.  I think that speaks exactly to what you're getting at: there are going to be fundamental tradeoffs here.

Zeynep.

ZEYNEP ENKAVI: One thing about ecological validity, and this just occurred to me when Laura was speaking: a task that requires fewer instructions and feels more intuitive might be more ecologically valid.  But on the other hand, there are things in this world that we do that are not very intuitive, like our retirement savings and where to put all of that; figuring that out is extremely difficult, and so, for example, how quickly a participant can understand very convoluted instructions might be ecologically valid for that kind of decision-making.  Now, again, yes, this is sort of a joke and not necessarily directly related to clinical decision-making.  But I think ecological validity can mean a lot of things, and while, yes, how intuitive a task is can get at some of the cognitive processes that we might want to capture, other processes, like the ability to pick things up quickly, might need a task that is a little convoluted in order to be ecologically valid.

CRAIG HEDGE: What Joel was saying reminded me of a paper, I think by Tal Yarkoni, where he argues that psychologists should focus more on prediction and less on explanation.  I think as cognitive psychologists, we like our theories, and our tasks, like the Stroop task, are very much built around honing in on a very specific, narrow mechanism that our theories tell us will be there.

And there is an argument that perhaps we should start from the other end and just see what works, and then try to understand why it works and what it affects.  I think probably most of us would say we need a bit of both.  We don't want to throw away what seems to work because we don't understand it, but we also don't want to throw away what we understand, at least to some extent, just because it's not immediately working how we'd want it to.  Our theories give us the ability to predict, and I think if we dropped the theoretical, mechanistic approaches, we might lose that foresight a little bit.

CLAIRE GILLAN: I just want to say it's a really interesting topic, and I wanted to flag the role of passive data in some of this ecologically valid work that we're talking about.  So you've mentioned fast, cheap, good, but I think there's also "lots": we can get lots and lots of data from people living in the real world, which adds another dimension to this.  Aaron Heller has a really interesting paper with Kate Hartley looking at GPS diversity and how that relates to daily fluctuations in mood.  People have to self-report their mood, but you can start making proxies for that using things like the language they use on social media and other indicators.

So I think adding the scale that we can get and reducing the amount of burden on participants is another win for this sort of ecologically valid research. Even pointing to Zoe's work, we need a lot of data if we're going to make personalized models for every individual, and that's too much for a participant to give us. So the more we do ecologically valid research and use passive data, the more I think we can get there.

RUSS POLDRACK: Thanks. I think that's a really good point. So there is an interesting question in the Q&A from Agatha Lenartowicz, which points out that for many medical tests, like a glucose test, you're required to fast to ensure the context for the measurement is valid.  Is there any analogue for behavioral tests that might help us set the context of the individual for task optimization?  Any thoughts on that?

Zoe.

ZOE HAWKS: I think in the neuropsychological assessment tradition, for example, you very often have patients in a room without windows, with very little stimulation, and the attempt there is to really standardize the experience such that everyone taking the test is in the same sort of context.  So that's one way to go about it, to really control the context at each testing session, and then you're essentially removing those as variables that could be influential.

I think the other approach as we're starting to collect these data in more naturalistic environments is trying to measure both using passive sensing as well as using self-report scales, the factors that we think are important that could be influencing variations in the cognitive outcome that we care about, and we can both say, okay, can we experimentally or statistically account or control for these things if they're not of interest or on the other hand, can we understand the ways in which they're impacting performance.  So again, that gets at the research question in terms of how you set up your design, collect your data, and think about the context in which data collection is taking place.

RUSS POLDRACK: Thanks.  I want to actually ask a question that ties back to the first talk that we heard, which builds on a point that Laura mentioned and that I think is important -- it also relates to this naturalistic data collection issue -- which is that there are all these confounds lurking, right?  Like correlations between the operating system that one uses and various demographic features, even the browser that you use.

And I'm wondering, you've done a lot of online data collection.  Do you think that these are like fundamentally sort of changing results?  Are there interactions between, for example, the operating system and my performance on a task, or is it more of like a main effect?  I'm just wondering how much we have to worry about this.

LAURA GERMINE: I am trying to figure out whether I should give my real view or a more optimistic one.  I think we have to worry about it a lot, and I think it's true in almost every domain: when you move to a new method, you have no idea what all the various sources of systematic bias are going to be.  But there are huge ones. I mean, reaction time is a good one, where there are systematic differences in how quickly a device can register a reaction time. You can't norm by device, because Android, for example, runs on all different hardware, and that hardware is where a lot of that variability comes from. You can't say it's noise, because people systematically choose devices based on their economic means, which relates to the degree of latency.

And even worse, let's say you were able to capture it all perfectly today, Wednesday, September 28, 2022; by next month the landscape would have changed.  And then you add in COVID, for example, where all of a sudden whole demographics became much more technically enabled and technically fluent than they were before, which of course changes how they interact with devices, in a way that is dependent on demographics, and whether you can work from home, and what your age is, and how isolated you are, and your risk-taking.  It's easy to just become overwhelmed, which I often am.

But I think the most important issue is that people are not aware these problems exist.

So the number one thing is like we have to be aware that devices are factors in all of this. 

It's going to be different for different groups. It's going to be different for different research questions. But going in blindly and considering a smartphone just a window, a portal to someone's brain, without accounting for all that hardware and software in between, I think is the biggest mistake. So if we can start bearing that in mind -- and knowing that digital technology replaces sources of variance that we couldn't account for before: test administrators, things like whether a neuropsychological evaluator had their coffee this morning. We could never measure that, right?

At least here, in theory, we have basic parameters of the operating system and the hardware that we could have access to, so we can at least say, okay, here are the potential sources; we don't know what to do with them yet, but we can at least measure them. So I'll stop there. I think there's a lot of work to be done, and awareness is the most important piece.

RUSS POLDRACK: Thanks. We are unfortunately at time. We're going to have to cut the discussion here.  But I want to thank you all for a really great discussion, and we'll see everybody back after the break.

JENNI PACHECO: Welcome back, everybody. We're about ready to get started with session 2, where we'll focus on platforms to accommodate future tools and discuss the key principles for making data friendly to computational analysis, the challenges related to accuracy and sample size, and some thoughts about the application of these ideas in mobile apps.

To start off this session, we have Dr. Laura Biven from NIH's Office of Data Science Strategy, Dr. Danilo Bzdok from McGill University, and Dr. Claire Gillan from Trinity College, Dublin.  We're also delighted to have Dr. Joel Nigg from Oregon Health & Science University to moderate the discussion following these talks. 

Thank you, and welcome, Laura.

LAURA BIVEN: Thank you so much.  First of all, thank you very much for the invitation.  I'm excited to be here, and excited to hear -- I was able to listen in to some of the talks previously.  It's really interesting for me, especially not being an expert in this domain.

What I'd like to talk to you a little bit about is at a very high level, some of the things that you can think about as you think about establishing new data assets, new data repositories, putting yourself into the context of the broader NIH data ecosystem, and some of the best practices that have been developed from data scientists more broadly.

I want to talk to you first of all about the FAIR principles and concepts around the NIH biomedical data ecosystem.  I want to talk to you about the importance of repositories.  Repositories are really sort of the building blocks of this ecosystem.  I want to talk to you also about the realities of a federated ecosystem and what this means in terms of thinking through interoperability and how you want to prioritize that.  I also want to stress and put ethics on your radar right from the beginning.  It sounds like it already is, but I'll sort of explain how this might percolate into your thinking around infrastructure, and then also highlight workforce development.

First, a little bit about our office, actually, the NIH Office of Data Science Strategy.  We sit in the Office of the Director, and one of our roles is to implement the NIH strategic plan for data science.  So first of all, we have one.  You can find it online, and I've put a link here to our website.

The goal and vision of this strategic plan is a modernized, integrated, FAIR biomedical data ecosystem, and we attack this goal in a number of different ways, focusing on community engagement, workforce development, the data ecosystem, data infrastructure, as well as tools and analytics.  And I think you're going to be touching on many of these topics today. 

Let me focus in now on the FAIR principles. These are really key, high-level guiding principles for what it means to have an ecosystem of data, and the focus here is on the R, which is reusability of data. The FAIR principles are laid out in a paper, but also in many different sorts of guides and standards, and they give high-level best practices and requirements for how to make data findable, for example through the use of persistent identifiers; how to make them accessible, for example through the use of standard communication protocols; how to make them interoperable, with the use of, for example, shared standard metadata and ontologies; and then how to make them reusable through documentation, licensing, provenance, and so on.

So the FAIR principles give a pretty high-level best-practices framework for how to make your data reusable, and this is a really key concept in terms of wanting to have the biggest return on investment for the data that are being generated through the various studies that you have.
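To make that concrete, here is a minimal, hypothetical sketch of what a FAIR-oriented metadata record might look like for a behavioral dataset. The field names, identifiers, and URLs are illustrative assumptions, not an NIH-mandated schema; the point is simply that findability, accessibility, interoperability, and reusability each correspond to concrete pieces of metadata that can be checked programmatically.

```python
# A minimal sketch of a FAIR-style metadata record for a hypothetical behavioral dataset.
# All field names and values are illustrative assumptions, not a mandated schema.

fair_record = {
    # Findable: a persistent identifier and descriptive metadata
    "identifier": "doi:10.xxxx/example-cognitive-battery-2022",   # placeholder DOI
    "title": "Example web-based cognitive battery, wave 1",
    "keywords": ["cognitive control", "reaction time", "RDoC"],
    # Accessible: a standard protocol and an explicit access procedure
    "access_url": "https://repository.example.org/api/datasets/1234",
    "access_protocol": "HTTPS",
    "access_conditions": "controlled; data use agreement required",
    # Interoperable: shared formats, vocabularies, and ontology terms
    "format": "text/tab-separated-values",
    "ontology_terms": {"task": "cogatlas:trm_example", "phenotype": "snomed:example"},
    # Reusable: license, provenance, and documentation
    "license": "CC-BY-4.0",
    "provenance": "collected 2021-2022 via web platform; preprocessing script v1.2",
    "documentation": "https://repository.example.org/docs/1234",
}

def missing_fair_fields(record, required=("identifier", "access_url", "license", "provenance")):
    """Return any core FAIR-oriented fields that are absent or empty."""
    return [key for key in required if not record.get(key)]

print(missing_fair_fields(fair_record))  # [] if the core fields are present
```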

I will say that many communities take it upon themselves to go further along the FAIR principles, for example, being more specific about what sorts of expectations that community should have, or different specific ontologies or metadata standards, and so that's something to watch out for, and one of the main forums for having that conversation is the Research Data Alliance or the GO FAIR organization.  Both of those have opportunities for communities to come together and kind of work on some of these details of what it means for data to be FAIR.

Another hallmark of the NIH biomedical data ecosystem is putting data in the cloud.  We've been doing this with most of our data research assets, and NIH has a program called STRIDES, which is a way in which NIH has negotiated discounted rates for researchers for storage and computing in the cloud, and so this is something you could possibly take advantage of as you think about making your data accessible to the most number of researchers and really facilitating largescale collaborations around data and largescale access to data.

Let me now just highlight the importance of repositories in this biomedical data ecosystem.  There was a workshop from the big data interagency working group about a year-and-a-half ago, and it talked about the importance of federally sponsored data research repositories as the building blocks of this ecosystem, and it invited the community to envision a future, to think about key next steps for research repositories, and many of these observations of the opportunities that were surfaced in that conversation have to do with thinking about national infrastructure for repositories, also sort of a national community of repositories that spans domains.  There are many different ways in which repositories have common challenges around thinking about impact or measuring usage, for example, and they could come together to share best practices and tools and techniques.

Then also having more proactive and greater engagement between the repositories and the user community.  So thinking about different hackathons or ways in which the data could be presented to the user community that cuts down on some of that processing or cleaning up of data that is maybe ubiquitous across the user community.

There are also principles for repositories.  I talked about the FAIR principles which are principles for the data themselves.  There's sort of an analogue to that for repositories, and one of the ones that's pretty well-known across the NIH community in particular are the TRUST principles, so these are laid out here. 

More broadly, many different agencies came together and they looked at the FAIR principles and the TRUST principles and CoreTrustSeal, which are other standards for repositories, and said we have to help our repositories find a path to some of these best practices.  So the interagency community came out with a list of desirable characteristics for data repositories, and this is a way to think about how to mature your repository into some of these best practice states, and also, I think is pretty comprehensive in terms of the types of things that repositories should be paying attention to to make their data most reusable and facilitate, for example, reproducible science.

I want to talk to you now about federation and interoperability. Federation is just a reality of the world that we live in. This is a really simplified org chart of the NIH institutes and centers.  There's the Office of the Director at the top, and then all the different institutes and centers, and if you can imagine each institute and center owns and sponsors and operates a number of different data repositories and data resources, and what we in ODSS and through the implementation of the strategic plan are trying to do is provide some of the connective tissue to weave all of these things into an ecosystem.

So we're not one big monolithic repository, we have to think about how these different assets will talk to each other and we do that by thinking about the data themselves and the FAIR principles help us with that, but we also have to think about that on a platform level, more from a computational level of how those different platforms can communicate.

One of the efforts that we have that is trying to bridge some of those gaps and create an ecosystem is our NCPI cloud platform interoperability effort, and this is a coming-together of a number of different primarily genomics-based platforms, they're shown here in this little cloud bubble. Each of those is sponsored, owned, and operated by a different institute and center, and they're coming together and saying what can we do in terms of implementing common standards for APIs or search capabilities that will really help users in terms of facilitating workflows that run across all of those different repositories. 

And they are making use of some of what I would call common data services that form this connective tissue across the NIH ecosystem, one of which is our researcher authentication service, or RAS. This is a single sign-on capability that different repositories can use, so that users can come into any one of those repositories and then move around the data ecosystem without having to log in again to the controlled-access data that they have access to. By logging in with RAS, you can see all of the data that you have access to and access it seamlessly without having to log in multiple times.

Another data service that we are thinking about in ODSS is search.  So we held a workshop on this earlier this year, thinking about, first of all, what are the key use cases for search that NIH researchers need?  How are they looking for data?  How are they creating new data products across the different NIH assets?  So now we're thinking, now that we sort of have input from the community on that, we're thinking about the new tools and technologies that might need to be built to facilitate that.  So that's another data service layer that will live across the NIH repositories to help facilitate these cross-repository workflows. 

Let me just point out the overarching question of ethics. I did want to point out, this is a report, this diagram is from a report of the advisory committee to the director.  It's their AI working group, and it was a report from 2019.  This actually predates me coming to NIH, but I remember this report because it was the first time I saw data, people, and ethics at the same top-billing level in a report in terms of priorities. 

Even though this report is specific to artificial intelligence and machine learning, I think it is applicable across data science that ethics needs to be a consideration from the very beginning, throughout the data lifecycle, throughout the tool development lifecycle.  It has to be something that's baked into the way that you think about how you want to be governing and managing and using data.  So this report gives some ways in which the NIH should be thinking about that, at least specifically to the AI and machine learning.

Now I want to talk about workforce development, and I did just want to highlight a couple of programs in ODSS. The first one is our data scholars program. This is an opportunity for researchers and mid-career or later career level people to come to NIH for one to two years to work on key problems and really help us with strategy or help us with kicking off new programs, for example. And here are the data scholars that are currently with us.  It's a wonderful program, especially for academics seeking a sabbatical year, for example, but it's open also to people from industry, and I've worked with many of these people during the year and they're really wonderful sources of expertise that can help breathe fresh life into programs and really help us make sure that we are starting out new data science efforts with knowledge of the most current capabilities and technologies from outside of government.

With that, I will just mention that I put a lot of slides at the back of this slide deck, so I hope it's made available to you with a number of different pointers to, for example, funding opportunities in each of the spaces around FAIR data, interoperability, workforce development, and ethics, as well as some pointers to other policy documents that are more broadly available for the government.

With that, I'll say thank you very much, and I will pass it over to Danilo Bzdok. Thank you.

DANILO BZDOK: My name is Danilo Bzdok. I'm a medical doctor and computer scientist by training, and the mandate of my research team is to bridge the machine learning activities at the Mila Quebec AI Institute here in Montreal and the neuroscience activities at McGill's Montreal Neurological Institute.

Today I have the pleasure to point out some of the challenges and potential solutions, maybe, in how we can graduate machine-learning solutions to enable single-subject prediction in the mental health space.

I work with and am compensated by different companies.

Why is this a particularly pertinent question today? I think we are really seeing the confluence of three different megatrends over the last years. On the left, we have the neuroscience base, which in my opinion has been accelerating a lot. How can I see this? Well, for example, in the number of NIH-funded PhD positions. As far as I know, there are few other research areas that are as well funded as the neurosciences put together. Also, we've seen over the last ten years or so the emergence of national and international largescale brain research initiatives, starting with the Human Brain Project and similar initiatives in the United States and other countries, and one important consequence of this is that we have seen the emergence of treasure troves, titanic datasets, such as the Human Connectome Project, the UK Biobank with half a million deeply phenotyped subjects, or the ABCD cohort. So we have far more data than we ever had before in the history of neuroscience, and we have a huge workforce of active scientists in this space.

So in the middle, we have machine learning, a meandering history, but since 2010, 2012, we saw an exponential increase of innovation and application of machine learning and deep learning solutions. 

Why is this particularly pertinent to psychiatry and medicine? For one, because we can uncover general principles underlying a series of observations without explicit instructions.

On the right, precision medicine, and there is very large momentum in this space from a number of perspectives. One statistic I like, from a Nature tutorial a few years back, is that health data double every two months. Every two months, the world has twice as much health data, so this is an exponential trend. Also, from an economic perspective, the market related to personalized medicine has exceeded expectations and is probably several billion dollars big. So this is why there's also a lot of interest from big-five tech companies such as Google, Facebook/Meta, and so forth.

So we have higher granularity, more precision, as you heard in the previous talks, about single human individuals, more than we ever had before, and I'm going to argue that this will probably lead to a paradigm shift, in not only how we analyze data but perhaps also what types of questions we ask. 

Why? Because over the last decades, 20th-century biomedical research in general has perhaps been largely focused on the notion of group contrasts between what we call a control group and a target group, and perhaps we will transition ever more to what I'm going to call single-subject prediction. That could lead to a new era of evidence-based medicine tailored to single individuals.

To say it in yet another way, current research is largely about diagnostic categories, as defined in DSM and ICD, that are based on the opinions of experts who met somewhere in the world to decide what the diagnostic criteria and items exactly are, and oftentimes in the mental health space those are rooted in human experiential terms and not so much in objective markers of disease, as in many other branches of medicine.

What this means is that a clinically distinct mental disease is not always underpinned by an identical biology, insofar as we can tell with the neuroscience instruments that we have. Maybe that is the reason why, in roughly half of the cases when we try to treat psychiatric patients with a drug treatment, we fail and resort to trial-and-error exploration.

How could we do this differently? Instead of testing predefined categories rooted in experiential terms, in a clinical group versus control group setting, what we could do instead is what you can see in this schematic here. Perhaps we could use machine learning tools and tailor them to harness a combination of many different types of data. We heard about genetic data, behavioral data, longitudinal smartphone data, questionnaires, life experience markers, and so on. Machine learning tools are particularly well-suited to work with high-dimensional and very rich datasets of the type that we see more and more now.

So that could help us to really, quote-unquote, carve nature more at its joints and in a bottom-up fashion really use large amounts of data to derive biologically valid subgroup definitions, as we see in panel C here, based on algorithms, so by really letting the data speak for themselves, asking the data first, rather than imposing human-defined diagnostic categories from the get-go.
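As one way to picture that bottom-up approach, here is a minimal sketch, on purely synthetic data, of deriving subgroups by clustering multimodal features rather than imposing diagnostic labels up front. The Gaussian mixture model, the BIC-based choice of the number of subgroups, and all the numbers are illustrative assumptions, not the specific pipeline used in Dr. Bzdok's work.

```python
# A minimal sketch of deriving data-driven subgroups from multimodal features using
# synthetic data. This illustrates the general idea of "letting the data speak" rather
# than reproducing any specific published pipeline.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Pretend we have 600 individuals with features from several measurement classes
# (e.g., behavior, questionnaires, passive sensing), concatenated into one matrix.
n_per_group, n_features = 200, 20
latent_groups = [rng.normal(loc=mu, scale=1.0, size=(n_per_group, n_features))
                 for mu in (-1.0, 0.0, 1.0)]
X = np.vstack(latent_groups)

X_scaled = StandardScaler().fit_transform(X)

# Fit mixture models with different numbers of components and compare them via BIC.
bics = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm.fit(X_scaled)
    bics[k] = gmm.bic(X_scaled)

best_k = min(bics, key=bics.get)
labels = GaussianMixture(n_components=best_k, random_state=0).fit_predict(X_scaled)
print("BIC-preferred number of subgroups:", best_k)
print("Subgroup sizes:", np.bincount(labels))
```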

If we succeed in this, it is possible that we will be better at achieving goals such as early detection, treatment choice, predicting how people will react to a particular treatment intervention, and prognosis of disease risk in the future. Machine learning tools are also easier to ship to other clinical institutions, for example.

A lot of the tools we're still using today, legacy solutions in research but also in clinical practice, were actually invented before the Second World War, before we had calculators, and at a time when we were mostly using experimental data. Over the last decades, in the second half of the 20th century, a lot of the tools that we now call machine learning have emerged, and those are much more computationally expensive; they have ever more parameters, they're ever harder to interpret, they have ever larger memory footprints. So they're demanding from a number of perspectives, including the amount of data that we need to really usefully benefit from this new data real estate.

Challenge number one, related to what I just said, is that, and we alluded to this, in my opinion there is really a fundamental tradeoff between how interpretable a particular analysis tool is and what the model complexity is.  So how complicated the predictions can be that a particular machine learning tool can actually identify and exploit for the purpose of single-subject prediction. 

At the top left of this schematic, you see linear regression types of tools. It's always useful to remind ourselves that more than 90 percent, maybe more than 95 percent, of quantitative analyses in biomedicine, still today, are probably linear regression-based. Why does this make sense? Because we have naturally tried to identify single variables that we have picked based on existing literature and carefully thought-out theories of potential biological mechanisms, and in that space it makes a lot of sense to use models such as linear regression.

However, as we transition to ever more applications of machine learning, we enter a different regime. An extreme case is probably deep learning, on the bottom right here, where we can derive ever more complicated, empirically successful predictions for single datapoints, for single individuals. However, it's fair to say that the explainability tools to unpack the meaning behind deep learning predictions are still in their infancy.
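The tradeoff can be illustrated with a small sketch on synthetic data: an interpretable linear model versus a more flexible, harder-to-interpret model, compared with cross-validation. The data-generating process and the choice of a random forest as the flexible model are assumptions made only for illustration.

```python
# A minimal sketch of the interpretability/complexity tradeoff on synthetic data:
# a linear model versus a more flexible model, compared via cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 1000, 10
X = rng.normal(size=(n, p))
# The outcome depends on a nonlinear interaction, which a linear model cannot capture well.
logits = 1.5 * X[:, 0] * X[:, 1] - 0.5 * X[:, 2]
y = (logits + rng.normal(scale=1.0, size=n)) > 0

linear = LogisticRegression(max_iter=1000)                            # interpretable coefficients
flexible = RandomForestClassifier(n_estimators=200, random_state=0)   # harder to interpret

for name, model in [("linear", linear), ("flexible", flexible)]:
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name:8s} cross-validated accuracy: {acc:.2f}")
```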

So we have a little bit of a tradeoff that you need to accept there, and it may in a certain way perhaps change the global reserve currency of why we actually run research projects.

The second vexing observation is that ever larger datasets unfortunately do not lead to ever better classification accuracies of patients versus controls. Here you see what accuracies have been reached for different mental health conditions, and you see that, at larger sample sizes, there is a trend toward lower achieved cross-validated prediction accuracies.

So why is that? I think we don't really know entirely why this is the case. One way to think about it, I propose here, is that we have mostly aggregated datasets retrospectively from different studies to arrive at some of these large data consortia that we have today, and there are heterogeneities in them that are perhaps very hard to tackle.

Newer types of largescale data acquisition initiatives may put much more emphasis on homogenizing, at the technical and various other levels, to have data that are as comparable as possible. But there's a second aspect here, in my opinion, and that is, again, the framing of the purpose: the style in which we do research in the big-data computational psychiatry realm is just completely different, because you're usually trying to look at largescale, representative population cohorts, and that is completely different from the idea of having very rigorous inclusion and exclusion criteria. The types of clinical cohorts that we have traditionally been studying may have been artificially more homogeneous, and that is why it was easier to derive high accuracies in small clinical cohorts that have been heavily pruned.

Just almost done here. A third challenge, which I'll summarize at a high level, is that we can show that the diversity of these populations -- I'm really talking in terms of social identity markers, demographic markers, socioeconomic status, and so forth -- these differences between human populations have a huge impact on the accuracy of the machine learning prediction models that we can achieve, and we show in this research that we cannot use the incumbent deconfounding strategies to solve these problems. They are insufficient. There are probably known unknowns and unknown unknowns that we do not typically measure in the datasets that we have, and it's not entirely clear how exactly to integrate these external sources of information into our datasets to deal with sources of population stratification in our single-subject prediction studies.
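For readers unfamiliar with what an "incumbent" deconfounding strategy looks like in practice, here is a minimal sketch on synthetic data: each feature is residualized on measured confounds before a classifier is fit. As the comments note, this only removes variance linearly related to confounds you actually measured, which is exactly the limitation being raised; the data and model choices are illustrative assumptions, not the analysis from any specific paper.

```python
# A minimal sketch of a common deconfounding strategy: residualize each feature on
# measured confounds (e.g., age, sex) before prediction. This removes only variance
# linearly related to confounds you actually measured. Synthetic data throughout.
# Note: in rigorous practice the residualization model should be fit within each
# cross-validation training fold; it is fit once here purely for brevity.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n, p = 800, 15
confounds = np.column_stack([rng.normal(size=n),                 # e.g., standardized age
                             rng.integers(0, 2, size=n)])        # e.g., sex
X = rng.normal(size=(n, p)) + 0.8 * confounds[:, [0]]            # features contaminated by a confound
y = (confounds[:, 0] + rng.normal(scale=1.0, size=n)) > 0        # outcome driven largely by the confound

def residualize(features, nuisance):
    """Remove the part of each feature that is linearly predictable from nuisance variables."""
    fitted = LinearRegression().fit(nuisance, features)
    return features - fitted.predict(nuisance)

X_deconf = residualize(X, confounds)

clf = LogisticRegression(max_iter=1000)
print("accuracy, raw features:         ", cross_val_score(clf, X, y, cv=5).mean().round(2))
print("accuracy, deconfounded features:", cross_val_score(clf, X_deconf, y, cv=5).mean().round(2))
```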

To summarize, I think we need research and platforms to enable clinical translation of empirically justified single-subject prediction models, that we hope are going to be fast, cost-effective, and pragmatic.

Thanks.  The next speaker is Claire Gillan.

CLAIRE GILLAN: This is a prerecorded talk, a short presentation about smartphone science. We live in an age of online research. There's been a massive growth in studies using online crowdsourcing platforms like Amazon's Mechanical Turk, where real people, essentially inside the computer, complete tasks in exchange for money and help us scale up psychology research. I think this has been a massive net positive for psychiatry research and psychology research, because it fundamentally increases sample size, and we know there are major issues with reproducibility given the median sample sizes that have been characteristic of research until now.

It also gives you other gains. Access, for example, to populations who are geographically or racially diverse, or who have rare clinical conditions, or who are just typically not represented in research: people who don't live within a metropolitan area, people who can't come and participate in a study between 9 and 5 on a college campus.

Other advantages of this online boom for research include anonymous participation. In particular, I think people with psychological problems may not always be willing or comfortable to talk about them face-to-face, and this gives another avenue for people to take part in the research process. As I alluded to before, I think it's really good for reproducibility in our field. I've seen a massive increase in exploration-confirmation-style approaches, which, absent preregistration, is another really powerful way to test new theories and really make sure that our findings are reproducible.

It also facilitates standardization, which is really important, and sometimes effects don't replicate, we don't know why, different labs, different procedures, this can eliminate all those sorts of issues, so that's another gain.

But of course, there are problems. There are ethical issues that have been raised with the use of crowdsourcing platforms in particular. Workers may be financially insecure, and there are also few checks and balances or conduct standards for requestors, in other words researchers, on these platforms. And this has led to some, not all, workers on these platforms being dissatisfied, saying, for example, that it's not what it used to be two to three years ago, the pay is lower, researchers don't treat you with courtesy, and some will cheat you.

Issues around this, with the incentives or motives of the people participating in research not being aligned with the mission of the project, can lead to problems like data loss. In a quite important study this year by Burnette and colleagues, they showed that when they started implementing data quality checks throughout their study -- really basic ones, like whether you report the same age and gender as you did two minutes ago, or two screens ago -- they had to exclude 90 percent of the participants in their study. And that's because the way it was set up actually incentivized participants to lie and say they were a member of a minority group in order to be allowed to participate in what was otherwise a well-paid study.

I used to think that issues around data quality might be unsystematic in the large samples we would get, and that online studies could somewhat mitigate against this. There's a really good preprint from Sam Zorowitz and colleagues that suggests otherwise. In particular, when we look at aspects of mental health that are skewed, which many are, like obsessive-compulsive disorder, for example, we can see that people who respond randomly, in purple here, are systematically more likely to score higher on those scales, and then of course they're going to score worse on any number of cognitive tasks that you give them, and that introduces spurious correlation.
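A small simulation can illustrate the mechanism described here: careless responders answering a skewed symptom scale at random drift toward higher scores and also perform at chance on a task, which by itself induces a negative symptom-performance correlation. All numbers below are illustrative assumptions, not values from the Zorowitz preprint.

```python
# A minimal simulation of the mechanism described above: careless responders answer a
# skewed symptom scale at random (pushing them toward higher scores) and also perform
# at chance on a task, inducing a spurious symptom-performance correlation.
import numpy as np

rng = np.random.default_rng(3)
n_attentive, n_careless = 950, 50

# Attentive participants: low, right-skewed symptom scores; accuracy unrelated to symptoms.
symptoms_attentive = rng.exponential(scale=3.0, size=n_attentive)
accuracy_attentive = rng.normal(loc=0.85, scale=0.05, size=n_attentive)

# Careless responders: random item endorsement lands near the scale midpoint (high for a
# skewed scale) together with chance-level task accuracy.
symptoms_careless = rng.normal(loc=20.0, scale=4.0, size=n_careless)
accuracy_careless = rng.normal(loc=0.50, scale=0.05, size=n_careless)

symptoms = np.concatenate([symptoms_attentive, symptoms_careless])
accuracy = np.concatenate([accuracy_attentive, accuracy_careless])

r_all = np.corrcoef(symptoms, accuracy)[0, 1]
r_clean = np.corrcoef(symptoms_attentive, accuracy_attentive)[0, 1]
print(f"correlation with careless responders included: {r_all:.2f}")   # spuriously negative
print(f"correlation among attentive responders only:   {r_clean:.2f}") # near zero
```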

The final issue is data waste, and I think this is less discussed in the online space, but we know there are many workers online who are completing the same tasks for different research groups around the world, or these super-participants who take part in a lot of psychology studies, but we don't have the capacity to link their data and to learn the maximum that we can from the time that they've already put into doing these tasks.

So if you'll excuse the disgraceful pun, this was our "Neureka!" moment, and it motivated us to try and think of a new way to do online research, scale up online research, and maybe we can leverage the people power of individuals around the world who are interested in contributing to research for research sake, because they care about the problem and they want to advance research in that area.

So what our app, Neureka, does is allow the public to act as citizen scientists or unpaid research participants. They donate really rich information about themselves in the thousands, so this is information about cognition from different tasks they play, health, lifestyle, attitudes, and more.

What we've learned from this and other studies that we've done outside of crowdsourcing platforms, but still online, is that when we can align the incentives with the people participating in the study, we get really good data quality. I talked about a study that had to exclude 90 percent of people for inconsistent age and gender information. In our case we measure height, for no good reason, once and then four weeks later, and we see really excellent reliability of that measure in the 500 people who had gone through a treatment prediction study. The reason that works very well, we think, is because everyone who participates in the study resonates with the cause and really wants to advance it.
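A minimal sketch of that kind of consistency check might look like the following: the same self-report collected twice, several weeks apart, summarized as a test-retest correlation and a simple agreement rate. The simulated data and the 2 cm tolerance are assumptions for illustration, not the Neureka analysis itself.

```python
# A minimal sketch of a consistency check: the same self-report (here, height) collected
# twice, weeks apart, summarized as a test-retest correlation and an agreement rate.
import numpy as np

rng = np.random.default_rng(4)
n = 500
true_height = rng.normal(loc=170.0, scale=10.0, size=n)       # cm
report_t1 = true_height + rng.normal(scale=1.0, size=n)       # small reporting noise
report_t2 = true_height + rng.normal(scale=1.0, size=n)       # four weeks later

test_retest_r = np.corrcoef(report_t1, report_t2)[0, 1]
agreement = np.mean(np.abs(report_t1 - report_t2) <= 2.0)     # within 2 cm

print(f"test-retest correlation: {test_retest_r:.2f}")
print(f"proportion within 2 cm:  {agreement:.2f}")
```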

With our smartphone app, we can amass really large samples of the sort that are necessary for us to build predictive models. With that we can get rich per-person phenotypes -- so, lots and lots of subjects, but also lots of data per subject. The way we have structured our app is that there are numerous different science challenges, or what you would consider experiments, and people can participate in as many as they like at their leisure, and new ones are typically introduced maybe every quarter. This means over time we can develop quite a really detailed picture of an individual. 

Put another way, many of our science challenges actually are set up to specifically collect repeated within-person assessments of cognition.  So, moving away from overreliance, I think, in our field on cross-sectional methods, towards looking at how cognition, symptomatology, mental health, might change, and change in concert over time within a person.

This also allows us to do, I think, more ecologically valid research. We're studying cognition in the real world -- mood, affect, compulsiveness -- in people's real-world settings and not in the confines of a dark, windowless testing room on a university campus.

The final thing is that this app allows us, again, to standardize the tools that we're using across the many different sorts of experiments that we do in the lab, looking at different interventions and different populations, and also to share this technology with other people very easily and for free.

So, how are we doing? We launched our app in June 2020, and we have over 18,000 registered citizen scientists currently across 100 countries of the world.  The app includes many different cognitive tasks. These are some examples, measuring things like model-based planning, processing speed, short-term memory, and metacognition.  

This will give you a flavor of one of the tasks, and I'll show you a more difficult level here, that measures memory. You see these symbols on the screen, kind of a classic gamification.  And then you have a certain amount of time to identify the color and the correct symbol that you saw before, and then you get a score, and there are some lives implemented there. So this gamification encourages people to play these games for free and to play them multiple times.

Alongside these games, crucially, we get really rich self-report data concerning people's lifestyle, environment, mental and physical health, and we also take the opportunity to give something back to these citizen scientists, so we do brain health advocacy and also science literacy training within the app, as well.

The final thing I think is really interesting with this app is that we have a focus now particularly on repeated within-subject assessments. We do momentary assessments, and we give people graphs of how their mood or their compulsivity changes, day-by-day and hour-by-hour, and this really helps people to stay motivated and want to continue to contribute.

So how are we doing? We have some games, I showed you on the last slide, this is a more extreme example of gamification of a well-known model-based planning task.  So we can have a game like this where people are shooting diamonds, but we're really interested in which side of the cannon they're shooting from. Using this extreme gamification, we can replicate classic effects from the field, so it's a classic behavior of this task, versus our Neureka subjects, and crucially we can show some external validity of that. 

So we can find associations with transdiagnostic dimension compulsivity here, so the more compulsive you are, the less model-based planning you exhibit, and this is something we've seen in many different datasets.

Originally on Mechanical Turk in 2014, later the same effect in students studied in person at Trinity, and then finally in patients studied online in 2016 who were screened for psychiatric disorders.

So, to ask what I think is possible for smartphone science: what's really interesting, as was dealt with in an earlier session, is the idea of optimizing tasks for problems that we care about, detecting illness, or discerning what treatment might be best for an individual. So we can look at things like task parameters, optimal trial numbers, and more. And crucially, we can use these large data to develop multivariable risk models -- lots of data per subject and lots of subjects -- that allow us to separate the signal from the noise and implement some of these things in practice.

Part of what makes that possible, and this is very important I think for today's discussion, is moving again from this cross-sectional methodology towards treatment-oriented research, and there are particularly scalable forms of treatment, like internet-delivered CBT or at-home tDCS, that we're using the same kinds of challenges to understand.

Finally, we're interested in sharing it with the community, and we can talk about this more in the session, in a way that's safe, but also promotes other people using this for great research projects.

With that, thanks very much for your time. These are the people who do all the work and my funders.  Thanks very much for listening. 

JOEL NIGG: Thanks, all three of you, for really rich talks and lots to think about here and some really fantastic information about some of the cutting-edge possibilities. 

We're open for discussion now, so if anybody has questions, please type them in the Q&A box. If you're a panel member, go ahead and raise your hand. I have one question that I'll start with since there are no questions here -- actually, there is one question in the box, too.

I'll throw a question out actually to Claire, which is a question I'm sure you've thought about a lot, but you didn't mention it, and I wonder if you could comment.  Anytime you have a self-selected sample, you obviously have the issue of who do they represent and who do they generalize to.  Could you just comment on that in terms of this approach you've taken which is really a fascinating approach?

CLAIRE GILLAN: Yes, I think it is an enormous problem. The way we approach it is that we accept there are limitations to every method and we try to get convergence across the different samples that we study. So the Neureka sample has a mean age of 45 and a different gender composition than the students who come in at 21 and whom we observe in person. But it doesn't solve the problem. It helps -- we can see the same effects across different populations, and that helps -- but we do have to really start thinking about this. I really enjoyed Abhi's talk at the start of today, and that's something that we're trying to take onboard. Not so much in the case of racial or ethnic diversity within a country or population, but actually how much the findings that we get in our Western samples apply in low- and middle-income settings, and I think that's a really interesting potential that smartphone apps offer. So we're undergoing the process now of doing a translation and having a Chilean version of this app: understanding whether the same kinds of risk factors hold in countries that have different demographics, different rates of cardiovascular health, and all these other sorts of issues, and whether the models that we develop in our WEIRD datasets can apply equally to those settings.

Apps are great because you can just extract all of the text, you can translate it relatively easily, and in theory we could have people all over the world doing really comparable experiments, we can get at that.

JOEL NIGG: Thank you. There's two other questions in the chat right now.  One question here, could the inverse relation between accuracy and sample size be some kind of an artifact, perhaps related to averaging across multiple samples of different variance?

I wonder if, Danilo, you want to tackle that.

DANILO BZDOK: The question is sample size and accuracy?

JOEL NIGG: Is that lack of relationship that you've shown, or trend toward a negative one, due to a potential artifact of some kind from averaging across multiple samples? For example, multiple samples with different variances.

DANILO BZDOK: Artifact can mean all sorts of things. As I said, it's just an observation at this point. The observation has been made a couple of times, by Tor Wager's lab, the Klune lab, and also by a couple of other labs. So we are just in the process of trying to uncover what exactly contributes to these findings. As I've also said, just think about most datasets that we have: what do we typically use as, quote-unquote, covariates of no interest? From my limited experience, most of the time we have age, sex, and then perhaps some measure of socioeconomic status, perhaps IQ, and that's pretty much it. And it stands to reason that as we move to ever larger samples, this is not enough.

There are a lot of other variables that we could have measured. Those could be variables related to demographic social identity factors, but they can also be technical measures, as we heard just a couple of minutes ago in the previous session -- what (inaudible) do people use, when do they use a certain tool, and so on and so forth, contextual information. The more pessimistic answer would be that, as I said, there may be a lot of known unknowns and unknown unknowns that you could call sources of variance, but we just don't have a lot of the information that is perhaps necessary to really faithfully estimate these sources of variation that contribute to the prediction models that we try to estimate.

JOEL NIGG: Thank you very much.

CLAIRE GILLAN: Can I just ask a follow-up, Danilo?  I think it's maybe a silly question with a quick answer.  That study that you presented is based on analysis of published data.  Is it not possible that all the small-sample research was just done and overfit, and there's some extra bad practices that weren't included in the original papers, and that just explains it?  Like they tested a few more models than they included, more space to overfit?

DANILO BZDOK: If you assume that there's as many bad practices in the small datasets as the analysis of the big datasets, then it cannot explain that.

CLAIRE GILLAN: But I think like Brenden pointed out, with small samples you have the potential to be much more wrong than you do in larger samples, right?

DANILO BZDOK: Wrong? 

CLAIRE GILLAN: As in you can overfit, you're more likely to find some nonsense variable with a ridiculously high correlation than you can possibly if you have a really large dataset.

DANILO BZDOK: Yes and no. It really depends on the particular analysis setting and the analysis paradigm. One aspect that contributes to this, that I personally see, is how we actually pick the subjects. For example, mental health patients tend to have certain comorbidities, and so if you try to have a clean sample by having a lot of inclusion and exclusion criteria, you are kind of making the cohort artificially more homogeneous, and you don't actually have these various comorbidities that this patient group actually has in the real world.

That would be one reason why, in a sense, you do overfit, because there are not all these challenges from all these concomitant conditions that tend to also be present.  Whereas if you go to a population dataset where 1 percent of the population carries a diagnosis of schizophrenia, as we know, across cultures, you will have a sample that also has all these comorbidities if it's a really representative sample.  In that sense, yes, you're going to be likely to overfit more in a smaller sample, especially if you had a lot of inclusion-exclusion criteria, which is a common practice.

JOEL NIGG: Maybe here's a follow-up question for you, Danilo, from Claire's question: if you think about this from a mathematical point of view, is there a mathematical explanation as opposed to a sampling explanation? We know from linear statistics that you generally have the principle of effect size shrinkage with larger samples, as you know, and of course greater variability of results with small samples. So if the true effect size is zero, you're going to have small samples all over the place, from -0.2 to 0.2, or whatever, for your effect size, whereas the biggest datasets should have a smaller variance across samples, in general, all other things being equal.

Does the same hold for machine learning, or is machine learning a more complicated predicament in terms of how you'd mathematically expect shrinkage with large samples -- in terms of getting closer to the true population variance and away from that wide range of effects that you see in small samples?
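To illustrate the classical sampling intuition in the question, here is a small simulation: when the true effect is zero, correlation estimates from small samples scatter widely, while estimates from large samples concentrate near zero. This sketch speaks only to the classical framing; it makes no claim about how machine learning accuracy behaves, which is what the answer below turns to.

```python
# A minimal simulation of the classical sampling intuition: with a true effect of zero,
# correlation estimates from small samples scatter widely, while estimates from large
# samples concentrate near zero.
import numpy as np

rng = np.random.default_rng(5)

def sample_correlations(n, n_repeats=2000):
    """Draw repeated samples of size n from two independent variables; return the correlations."""
    rs = []
    for _ in range(n_repeats):
        x = rng.normal(size=n)
        y = rng.normal(size=n)          # true correlation is zero
        rs.append(np.corrcoef(x, y)[0, 1])
    return np.array(rs)

for n in (30, 300, 3000):
    rs = sample_correlations(n)
    print(f"n={n:5d}: 2.5th-97.5th percentile of observed r = "
          f"[{np.percentile(rs, 2.5):+.2f}, {np.percentile(rs, 97.5):+.2f}]")
```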

DANILO BZDOK: It's useful to recognize, first of all, that the theory that backs up machine learning tools is entirely different. It's just that people try to see machine learning tools through the eyes of what they have been taught in undergrad: null hypothesis testing, t-tests, and so on and so forth. And that is, I think, one reason why it's difficult to ask and answer these questions, because the frameworks that really underpin why predictive models work are things like Vapnik-Chervonenkis dimensions, probably approximately correct learning, and things like that, and they really have nothing to do with notions of null hypothesis testing or p-values. There's just very little of that kind of focus on confidence intervals and everything you can derive from confidence intervals. Multiple comparisons is not in the table of contents of a machine learning textbook, just as one example. So it's just an entirely different thing, and you can't really look at it through the lens of classical statistics.

JOEL NIGG: I recognize that, which is why I asked the question.  So I'm wondering if you're thinking the solution here requires simulation modeling, or how are you planning to get at this?

DANILO BZDOK: What do you mean by simulation?

JOEL NIGG: How do you think that this mystery that you've identified would be best addressed?

DANILO BZDOK: One idea would be to use population datasets such as the UK Biobank and to try to empirically get at at least some of these sources of variation. Why the UK Biobank? Because it's the largest existing biomedical dataset, not only for mental health. It's probably also one of the cleanest datasets, in my personal opinion, and it has more different types of biological data than pretty much any other dataset. Just imagine: you have more than 15,000 variables that you can use in various analyses of behavior, genetics, and brain imaging.

However, to really get at these questions, unfortunately, it will be hard to completely ignore causal relationships, and unfortunately, machine learning tools are not super good at identifying causal relationships in data. So I think it will be tough to tackle the dark matter of this question, the unknown unknowns, because we would have to pre-assume that we really know the causal graphs of how diseases arise mechanistically and so on. And I think we are far away from that standpoint.

JOEL NIGG: Thank you.  Any other questions at this point?  Brenden, go ahead.

BRENDEN TERVO-CLEMMENS: Such fantastic talks. I just wanted to make a point and highlight a possible link between what Claire was talking about and what Danilo was talking about, which is that I think the target population for the inference really matters, and how we hope to deploy these models in clinical care determines how we think about them.

So if it's a homogenized group and you happen to see those patients, then perhaps these complexities of all of the possible confounding factors don't matter as much. But if we want to make predictions at the population level, with all those complexities, then I would think we need larger and more representative samples as well. Perhaps the two ideas are related. We should particularly try to keep in mind what the target inference is, and what target population we're aiming toward in the modeling frameworks.

JOEL NIGG: Thank you.  Other questions or hands up that I'm not seeing?

BRUCE CUTHBERT: Dr. Biven, I have a question for you. Thank you for your talk about all the database issues. Are there any particular snags or things that you particularly need to think about in setting up these kinds of machine learning-compatible databases? It seems a little bit daunting to figure out how to get started, what you need to worry about, and so forth. Are there any straightforward, practical things that you can mention that would be the first things to think about?

LAURA BIVEN: Yes, sure. I hear a lot about research groups that have done extensive studies, sometimes over a decade or so, and they want to gift their research dataset to the machine-learning community, but there isn't enough information in that dataset to make it usable with machine learning technologies. So I would say: think early, and interact with the machine learning community as early as possible. Have an idea of at least some of the use cases and applications that you'd want to make those data accessible for, and iterate and test early to make sure that you have the information that you need. If you don't collect the information up front, it's very hard to reverse engineer.

Doing those sorts of early tests, even in the research planning stage, I think is really important.  I would also say obviously there are computational efficiencies, so also working with the community at the point of data sharing and making sure that you have the data in the most computationally efficient form, that's also important as well.  But I think really early planning is the key.

EIKO FRIED: Thank you. Danilo hinted at the point already, so I just wanted to reiterate it and pitch the work of Na Kai(ph.) on the topic of deep phenotyping. One reason for the accuracy/sample size trade-off -- and again, Danilo alluded to this already -- is the quality of the measurement. Going back to the UK Biobank, for example, I don't think they have the best, most detailed assessment of the phenotypic variation that we are interested in: depression symptoms, anxiety symptoms, and so forth. They used rather short scales that don't have the highest resolution. And going back to Na's fantastic work, largely published on the depression phenotype, it shows that the genetic signal is quite different if you use a one- or two-item questionnaire, as 23andMe have done, for example, than if you use a really deep phenotyping approach and get at a much more detailed picture of the phenotype.

So that might -- Danilo, you sort of said that already, I just wanted to pitch Na's work here quickly, which is quite relevant in this discussion, I think.  Thank you.

JOEL NIGG: A very quick answer, maybe?

DANILO BZDOK: One is that I agree that the UK Biobank is not the best we can do, necessarily, for mental health.  I really meant the question in general, biomedicine in general. 

From what I can see, the UK Biobank just wasn't designed for mental health. It has a much more population-based and kind of general-physician type of focus. So, as everybody who works with this resource knows, the measurements that touch on psychology and cognition are not the ones we usually like, or they are not at the granularity and quality that we are typically used to, so I agree with that point.

The other point was, yes, the genetic markers, the GWAS hits, the significant single nucleotide polymorphisms -- I'm sure they are not the same if you change the target phenotype. It's interesting to just remind ourselves that even across the lifespan, the genetic markers of a given phenotype are not constant. So if you look at educational attainment in teenagers, in mid-age, and in retired individuals, they are not necessarily exactly the same, and as far as I know, it's not entirely clear why. So yes, that's not as stable as we would like, maybe.

JOEL NIGG: There's another question in the Q&A that's going back to this issue, I think, of the sample and the sample size. This question reads as follows: the sample should be a reflection of a given target population, obviously on a partial or small scale. The larger the sample size, the closer the parameter estimates are to the target population parameters; I think this is a basic assumption of sampling in statistics. If the sample is not drawn from the target population, then the estimates could be anybody's guess, regardless of the sample size. I was wondering how this sampling assumption would be different in machine learning or AI.

DANILO BZDOK: It's a very, very broad question, but just very quickly: if you only sample people in a certain city, it is easier to get high-quality predictions than if you sample people from the entire country, or from all continents, because the so-called target function complexity in machine learning is just higher. In particular, it's useful to think about the European-descent population, which we study most of the time; they're actually genetically much less diverse than people from other continents. The African population is much more genetically diverse, so in a certain way, if we study European ancestries, which we do most of the time, we are already making the problem much simpler than it actually is.

For example, the GWAS hits on something like alcohol dependence in the African population are not necessarily identical to the ones in European descendants. So in that sense, yes, if you have a sample that is more narrow in terms of genetic background, it will be easier to converge to the true estimates as the sample size increases. But if we increase the diversity of the sample in terms of genetic background, we need ever larger datasets to converge to the true parameters, and those parameters may be very different between these genetic-background cohorts.

JOEL NIGG: Other questions or follow-ups?

JENNI PACHECO: There is no more hands up, so if there are no more questions from our panelists or any comments that anyone wants to add, we can save ourselves a few minutes and go to our break now.  We'll keep the break at about five minutes, so we can come back at 3:15 and get started for session 3.  Thank you, everyone.

JENNI PACHECO: Hi, everyone. We are ready to get started on our last session of the day, where we'll focus on existing data sets and tools and what data we currently have available from health system records and research data sets. What's the feasibility of these data sources for clinical prediction and decision-making, and where do we need to focus with new efforts of data collection and tool development?

To start the conversation around these topics, we'll hear from Dr. Greg Simon from Kaiser Permanente in Washington, Dr. Eiko Fried from Leiden University, Dr. Raquel Gur from the University of Pennsylvania, and Dr. Joel Nigg from Oregon Health and Science University.

Following these talks, Dr. Laura Germine will join us again to help moderate the discussion.  I want to one more time thank all of our speakers and turn it over to Greg to get started.

GREG SIMON: Thank you very much.  So I will be talking today about use of health records data and whether those records could help at all with this broader question of informing precision diagnostics.

The work I'll be describing has been conducted within our Mental Health Research Network which is a network funded primarily by National Institute of Mental Health and involves literally more than 100 people across 14 health systems.  So I won't list all of their names here, but certainly many people have contributed to what I'll describe to you.

To outline, I'm going to talk about the current landscape of health information or records of health data that might be used in this area, what we can learn from the data that are now available, what has happened with the shift toward virtual or online delivery of mental health care prompted by the COVID-19 pandemic, and then some thoughts about collaborating with health systems to improve measurement to inform both clinical practice and advanced research.

So I'll be talking about what I call the data exhaust of health care delivery and I use the term exhaust to mean that which is spun off by the normal healthcare operations, healthcare delivery and billing, which might then be available for researchers like myself and others to use.

There are some areas of health data, or of the data exhaust of healthcare, that are well established. For decades, people have used billing diagnoses and procedures, pharmacy or medication dispensing data, and clinical laboratory data.

I'd say we're now fairly comfortable with, and have fairly well explored, what it may no longer be right to call emerging data sources: patient-reported outcome measures extracted from electronic health records, and mining of clinical text created by clinicians, and sometimes by patients, to try to identify clinically meaningful constructs.

Then I'll talk a little about what I call the new territory -- the leading edge, or bleeding edge, or frontier territory -- which is whether we might be able to utilize keystroke or mouse-click data, or audio and video streams, that are created by this new shift to virtual delivery of healthcare.

I'm most familiar with something called the HCSRN common data model, but the general point of this slide is certainly not to have you memorize all the different data categories; it's to say that health records data that come from different sources within a healthcare system somehow need to be wrangled into some common data model that represents them, so that they can be used or harmonized across different health systems.

The HCSRN common data model is a common one. There are others that are all fine, but certainly someone who hopes to use these data would need to become familiar with the common data model that represents the data they hope to take advantage of. As you can see here, there are many different categories, which represent some of the major clinical areas and original data sources.

It's also typical, in the environment I work in, that these data are housed in what we call a federated data structure, which means that the source data, the original data, remain with each healthcare system but are organized in a common or harmonized format. That way, any research project might either execute the same program across databases at multiple sites to extract data for centralized analysis, or, in some cases, run a program that actually conducts distributed analyses and then only stitches together the findings or results. The idea is retaining the original data at the source to protect privacy, so that only the data necessary for completion of any research project would ever be shared.
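As a toy illustration of that federated pattern, the sketch below has each "site" run the same summary computation locally and share only aggregate statistics, which are then pooled centrally; no row-level data cross site boundaries. The data, field names, and pooling functions are hypothetical, not the HCSRN implementation.

```python
# A toy sketch of the federated pattern described above: each site runs the same program
# on its own data and shares only aggregate statistics, which are then pooled centrally.
import numpy as np

def local_summary(phq9_scores):
    """Run at each site: return only counts and sums, never row-level data."""
    scores = np.asarray(phq9_scores, dtype=float)
    return {"n": scores.size, "sum": scores.sum(), "sum_sq": (scores ** 2).sum()}

def pooled_mean_sd(summaries):
    """Run centrally: combine per-site aggregates into a pooled mean and SD."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    var = (total_sq - n * mean ** 2) / (n - 1)
    return mean, var ** 0.5

# Simulated PHQ-9 scores held at three different health systems (never pooled as raw data).
rng = np.random.default_rng(6)
site_data = [rng.integers(0, 28, size=size) for size in (400, 250, 350)]

summaries = [local_summary(scores) for scores in site_data]   # only these cross site boundaries
mean, sd = pooled_mean_sd(summaries)
print(f"pooled PHQ-9 mean = {mean:.1f}, SD = {sd:.1f} "
      f"across {sum(s['n'] for s in summaries)} patients")
```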

There are now several of these, you could say, national federated data networks.  The one that I work in, the Mental Health Research Network, which is sort of a subset of something called the Healthcare Systems Research Network, now includes data on about 19 million patients.

The FDA Sentinel Initiative, which may be familiar to some of you, which is somewhat overlapping, covers now about 74 million people across the United States.  The PCORnet, which was funded by PCORI, includes about 31 million people, partially overlapping with the Sentinel network and the MHRN network.  The PCORnet network is primarily based in electronic health records data, the FDA Sentinel network is based primarily in insurance claims-type data, the HCSRN or MHRN network incorporates both of those, so both data sources are available for most people.

So that's sort of the general data structure or data architecture that we're working with.  One thing that's become increasingly important in the research we do and I think a potentially valuable resource is the systematic use of patient-reported outcome data in these healthcare systems.  So to be clear, these are not questionnaires that are administered for research purposes.  They're questionnaires that are routinely administered in the course of care according to health systems policies and then recorded in the electronic health record. 

So just to give you a sense of the volume: in 10 of the 14 healthcare systems in our network that are now systematically collecting and recording those patient-reported outcome data, these are the volumes, and these are per month.  So PHQ-9 depression scales, about half a million measures per month for about 400,000 unique people, some people being measured more than once in a month.  GAD-7 Generalized Anxiety Disorder scales, almost 400,000 measures per month for about 240,000 people.  AUDIT alcohol questionnaires, similarly very large numbers. So these are per month.

If you blew those up to per year, the volume of measures would be times 12. The number of people would not be times 12, because some people might be measured many times over the course of a year. But clearly, compared to what we often think about in terms of clinical research, these are two to three orders of magnitude larger than the volume of data we're usually thinking about.

So having described the data that are currently available, let's think about what we might be able to learn from them.  I'll be very honest that we can learn a lot more from practice-based or practice-sourced data about things that must be lumped.  We can learn a lot about patterns of diagnostic overlap and drift.  How do people receive multiple diagnoses, or how do diagnoses change over time?

We can certainly learn about therapeutic overlap in terms of drug classes or even specific individual drugs and how those cross people, how they change over time, how they relate or don't relate to particular diagnoses. This certainly would tell us a lot about our current diagnostic categories and which boundaries, to be honest with you, are false or simply don't exist, those things which should be lumped.

Right now, I'd have to say these data are probably not that useful about things that should be split.  The basic patient-reported outcome measures that we have are fairly coarse and might not tell us much about being able to, say, parse the different things which are called depression into different categories.

One of the things we're very interested in doing with the data that are now available is using response to different drug treatments as potential probes to try to dissect diagnostic categories.  One advantage of the kind of data that are available in records is that we can observe any individual's response to multiple similar and different therapeutics over time.  If this person who has a diagnosis of depression was exposed to one treatment at one point and another treatment at another point, how do we look at the patterns of similarities and differences of response to similar and different treatments to try to parse those?  We can also use this same strategy to look at off-target effects of nonpsychiatric medications on mental health symptoms.
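
As a rough illustration of the probe idea, one could summarize each patient's change in a routinely collected measure around each treatment episode and then group patients by the similarity of their response profiles.  This is a hypothetical sketch, not the network's analysis; the columns, change scores, and the choice of k-means are all assumptions.

```python
# Illustrative only: cluster patients by their profile of response to
# different drug classes, using pre-to-post change in a routine measure.
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical long-format records: one row per patient-by-treatment episode.
episodes = pd.DataFrame({
    "patient_id":  [1, 1, 2, 2, 3, 3],
    "drug_class":  ["SSRI", "SNRI", "SSRI", "bupropion", "SNRI", "bupropion"],
    "phq9_change": [-6, -2, -1, -7, -5, -4],   # change in PHQ-9 after treatment
})

# One response profile per patient; unobserved drug classes crudely set to 0.
profiles = (episodes
            .pivot_table(index="patient_id", columns="drug_class",
                         values="phq9_change")
            .fillna(0))

# Group patients whose response patterns look alike.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
print(dict(zip(profiles.index, labels)))
```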

So this was all sort of where we were up until early 2020, and then -- well, one more thing before I talk about the pandemic change: a couple of cautions about things that we have learned not to trust very much.  If we're looking at ICD-10 diagnostic categories, I'd say we've learned not to trust things that go past the decimal point; those of you who are informatics nerds will know what I mean by that.  For instance, the distinction between whether recurrent depression is currently moderate or in partial remission is one that I would not be very confident the billing diagnoses recorded at encounters can reliably make.

Electronic health records have created marvelous efficiencies for clinicians to carry forward text from one encounter to another, which means often the text which might be present for any clinical encounter has nothing to do with that encounter.  It's been carried forward for years.  I sometimes joke in my clinical life that I would like the electronic health record to actually highlight in a different color the text that was actually entered during that encounter, because that's what I would like to read.  But I think we really need to pay attention to this sort of cloning or templated text as being sometimes uninformative.

Electronic health records often use problem lists.  They're often not up to date.  And I've heard about people trying to use problem lists as being able to tell whether a condition is improving or not or has resolved.  I wouldn't trust that myself.  Similarly data on medication reconciliation recording when, why, or how a medication was stopped is often not very reliable.  Those of you familiar with more day-to-day healthcare operations recognize that problem lists and medication reconciliation data are not tied to billing or reimbursement, and things which are not tied to billing or reimbursement often get less attention.

Now to March 2020.  What this graph shows is what happened to mental health specialty visits in Kaiser Permanente Washington, the health system where I work directly, through 2020.  What you see there is the number of visits that happened in person, and that precipitous drop from February to April 2020, it's actually more precipitous than that -- we could make a graph by week, and essentially the in-person mental health visits went to just about zero over the course of four days, literally.

What we saw, though, was a pretty rapid and dramatic increase in video visits and some telephone visits, so the total volume of visitation was actually somewhat higher at the end of 2020.  What we've seen, if we carry that picture forward, is increasing video visits, some return of in-person visits, and telephone visits shrinking some.  What we're seeing -- and what the leaders of these healthcare systems expect -- is that, depending on the healthcare system, video visits especially will likely continue to be 50 percent or more of mental health specialty visits for the foreseeable future.

What this picture shows is what happened to collection of patient-reported outcome data during that time.  Similarly, what you see is that the proportion of PHQ-9 depression questionnaires collected in person dropped precipitously in spring of 2020.  We saw more and more of those being collected during telephone visits and being collected or recorded in video visits, where they were primarily presented online prior to the start of the video visit, so that the total volume had not quite completely recovered by late 2020, but was getting close to where it was prior to the pandemic.

What does this big shift to virtual, primarily video delivery, of mental healthcare, mean?  What we're seeing is much larger volumes of patient-generated text, that there's much more communication happening by secure online messaging, asynchronous chat or even live chat, with mental health providers.  So this is potentially a very interesting and important data source.

While patient-reported outcome measures were being used systematically prior to the pandemic, what we're seeing health systems at least start to consider is more personalized PRO measures.  When people completed these measures on a piece of paper on a clipboard in the waiting room prior to a visit, it was not possible to say each individual could complete a slightly different set of measures, but once these are being delivered online prior to a visit, it's certainly possible to personalize. 

There is also in this -- and I think maybe this is the most exciting part to this group -- at least the potential for task-based assessments.  If people are already completing assessments online prior to visits, they're already in an environment and using a tool that might be able to be used for some sort of task-based assessment.

There also is the potential for processing audio and video streams. This to me is a very interesting possibility, but we have to be very careful about how we would ever start to do something like this and explain it to people. It is interesting that when our health system first switched to video visits, there was very specific scripting and communication with our members that we are not recording any of these visits.  And certainly if we were ever to say that we should record and process recordings of visits, we would want to do that only with clear communication and likely explicit permission.

Although, as many of you know, there certainly is the possibility for processing certain aspects of audio and video streams without the necessity of keeping recordings.

So certainly some big opportunities here.  If we are going to realize some of this potential, how do we work with health systems to expand and improve measurement?  I think of this in terms of what motivates healthcare systems to improve and do more and better measurement, and what might be the barriers.  Most important, health systems are interested in improving care.  Healthcare systems do care about more accurate and efficient matching of patients and treatments.  So this is sort of a paradox.  If we were able to show that better measures could better match people with treatments, healthcare systems would get very excited about that, but we will probably need to start using better measures before we can prove that those better measures would lead to better treatment matching.

Healthcare systems certainly care about clinical communication.  Across the healthcare system, how do people communicate with each other about severity of depression, about anxiety symptoms, about substance use?  So there is value in developing common metrics and a common language.  And, no small thing, quality report cards: HEDIS, which is probably known to some of you and is one of the U.S. national quality report cards, will now start to report the rate of measuring depression symptoms using standard measures as an indicator of overall quality of mental health care.  Healthcare systems care about that.

In terms of barriers, it's really important to recognize that healthcare systems, and especially mental health care systems, and especially mental health systems since the pandemic, are really overwhelmed.  Even small amounts of extra time asked of clinicians are a real barrier.  The simple arithmetic here: these healthcare systems are generating half a million depression measures per month.  If we said, wouldn't it be great if we could add something to every one of those measures, and we could absolutely guarantee it took only 60 seconds, one minute of extra time, that one minute per assessment would add up to about 54 new full-time clinicians.  And to be honest with you, I'm not sure we have 54 mental health clinicians to spare across the entire United States, so we have to be really careful about what we ask in terms of clinician time.
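
The arithmetic behind that figure can be checked on the back of an envelope; the clinical-hours-per-year figure below is my assumption, chosen only to show how one minute per measure scales to roughly the number of clinicians the speaker cites.

```python
# Back-of-the-envelope check of the "54 full-time clinicians" arithmetic.
measures_per_month = 500_000      # PHQ-9 measures across the network, per month
extra_minutes_each = 1            # one guaranteed extra minute per measure
hours_per_fte_year = 1_850        # assumed clinical hours in a full-time year

extra_hours_per_year = measures_per_month * 12 * extra_minutes_each / 60
ftes_needed = extra_hours_per_year / hours_per_fte_year
print(f"{extra_hours_per_year:,.0f} clinician-hours/year ≈ {ftes_needed:.0f} FTEs")
# -> 100,000 clinician-hours/year ≈ 54 FTEs
```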

Integration with the electronic health record is really mandatory. There have been opportunities I think when people have sometimes approached us and said we've got this great new measurement tool, we'd like to try it out, and all it will take is for your clinicians to have a separate login that they use during the visits, and a different password.  That really is a nonstarter.

Similarly, security concerns are a real deal breaker, whether those are real or just imagined.  Similarly, when people have contacted us and say we have this really great assessment tool, and all you would need to do is build us a backdoor through the security of your electronic health record system for it to work.  That really is a nonstarter. 

These health systems are very risk averse, and it's not just a concern about the fine of a HIPAA violation, but really the reputational risk of the healthcare systems in terms of being good stewards of their members' health data, especially mental health data.

So that gets me to my final point, which is illustrated by this slide here, which is we always need to remember as researchers or as mental health researchers, in a large, complex health system, what is our role.  We are not the dog; we are just the tail.  However, it could be a pretty good ride as long as we don't upset the dog too much; the dog might take us to some very interesting places.

I'll stop there, turn it over.  I think Eiko is our next speaker, and then we'll come back around for some questions and discussions at the end.

EIKO FRIED: Thank you so much for having me today.  I am going to move from the amazing data that Greg talked about to momentary data: ecological momentary assessment and smartphone data of the kind that Claire talked about before.  I'm looking forward to giving a short update on the work we and others have been doing.

Many of you know who this is.  This is Ken Kendler, scholar of psychiatry, history, genetics, and other things, but many of you will not know three other things.  First, that I am a huge fan of Dr. Kendler's work, so everybody else get in line, I was here first.  I get nervous around him in a ridiculous way, so just to put this out there.  I love his work.

Second, Ken Kendler is an avid cyclist, and third, when I visited him in Virginia a couple years ago, he was willing to give me, of all people, his secondary bicycle.  He said, Eiko, it's a vintage bike from the 1960s or something, take really good care of it, and you see where this story is going.  I was heartbroken to tell Ken after a couple of weeks that I had broken his amazing expensive vintage bicycle.

Luckily for me, and for the scientific community, reductionism is a powerful framework that works exceptionally well in simple, for example, mechanical systems.  So bicycles consist of a number of parts, of components, and you can decompose the bicycle system into these parts.  The pedal moves the cogwheel, which moves the chain, and so forth.  And you can figure out the macro level of the bicycle by looking at all micro components.  And in that case, I could repair the bicycle, because once you analyze and repair every single component, the whole system works again. 

So Ken Kendler and I are still friends, or colleagues, and everything is great.  However, there are limits to reductionism in complex systems.  That's one of the things we've learned over the last three or four decades -- the stock market, the global climate, the internet.  Things are much more difficult there than in bicycles.  So one example from where I live right now in the Netherlands, we have many beautiful blue lakes and ecologists have been working tirelessly for many decades here, there's an entire discipline in the Netherlands on this, on forecasting transitions in lakes that turn from these blue beautiful states into these green turbid smelly gross states, that many of you all know.

These folks worked for a long time on the components of these lakes: oxygen content, sunlight exposure, pollution coming in, external influences, the number of fish, and so forth.  But only once they took into account the dynamics of the system -- and by that I mean the relationships between these components, the causal processes, the causal mechanisms between components -- were they able both to forecast transitions of the lakes successfully and to implement in part paradoxical interventions.  Interventions in complex systems to achieve desired outcomes are not always intuitive and cannot always be figured out by just thinking about it.

This goes to the heart of my talk. I think that mental health problems that we're talking about are much more like lakes and not very much like bicycles.  I think they are complex systems, they have many biopsychosocial components, and RDoC has been doing a fantastic job in highlighting those and working on a framework to organize these components into groups and pillars. These systems are complex. These systems are dynamic. And my claim is that understanding, predicting, explaining, and treating those systems will require studying the systems from which these problems emerge, and that means components and the relations.

But for that, and this goes to the heart of this symposium today, we need to measure the systems from which these problems emerge.  And this is so important because it is often the system dynamics, I think -- or at least there is some evidence for this -- that give us important information for, for example, forecasting, and not just the severity of problems.  I'm going to give you an example.

I hope everybody can see my camera.  I'll describe it a little bit for those who can't, because I'm sure people are using this as a bit of a podcast, maybe they're cooking right now, so I hope you enjoy this, but I'm holding up a playing card here.  And I'm pushing two fingers down on the playing card so the card bends.  This is a bistable system.  Right?  There's only two possible states.  Either this state or when I push hard enough against it, the other side of the playing card flips over, you have a bistable system, two states.

And there's evidence that we can measure the resilience of systems by monitoring them carefully.  If I push down my fingers very hard, this is an extremely stable state.  It's very resilient.  I need to push incredibly hard to flip it to the other state.  But if I only push very slightly, I can flip the system very easily to the other state.

And we can measure the vulnerability of these systems -- I'm just going to give you one of many examples -- if I were to push very slightly here, these are the daily life events that hit people who might get depressed or might not get depressed, minor perturbations.  And if you were to take a slo-mo video, if the system is very resilient, the card would flip very quickly back.  It would go to its resting state immediately.  

But if I have a very non-resilient system, very vulnerable, and I take a slomo video of me pushing it around, the card would bobble around a little longer.  It would take longer for the system to recover from the minor perturbation.  And that is what we can measure.

So in these low resilience states, you have higher autocorrelations, and you have critical slowing down and other early warning signals.  I can't get into detail here.  But these really come from the system dynamics, not from the severity of problems.
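
To make "critical slowing down" a bit more tangible, here is a small sketch, using simulated data and an arbitrary window size, of one common early-warning indicator: the lag-1 autocorrelation of a daily mood series computed in a rolling window, which rises as the system becomes more sluggish.

```python
# Sketch of an early-warning indicator: rolling lag-1 autocorrelation of a
# simulated daily mood series whose "inertia" slowly increases over time.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = 200
phi = np.linspace(0.2, 0.9, days)          # slowly increasing inertia
x = np.zeros(days)
for t in range(1, days):
    x[t] = phi[t] * x[t - 1] + rng.normal(scale=1.0)

mood = pd.Series(x)
window = 30
# Correlation of each 30-day window with itself shifted by one day.
rolling_ac1 = mood.rolling(window).apply(lambda w: pd.Series(w).autocorr(lag=1))
early, late = rolling_ac1.dropna().iloc[0], rolling_ac1.dropna().iloc[-1]
print(f"lag-1 autocorrelation, early window: {early:.2f}, late window: {late:.2f}")
```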

Here's a paper on forecasting depression transitions by Marieke Wichers and colleagues, 2016.  You see one person here tracked over many months, and this person transitions on day 145 into depression.  The y-axis here is just depression severity measured by the SCL-90, x-axis is days and time.

You see that there are no real early warning signals at the severity level where you could say, oh, this person will soon get depressed.  But there are system-level dynamics that forecast this transition, in this case an increase in autocorrelations.  Now, we need to replicate this first, and so forth, but that's why I think we need to measure system dynamics and use them for personalized medicine and precision diagnostics.

So there are two main ways to do this, to keep this talk very short and Claire talked about this a little bit before. The first is more on the self-report side, for example, people call it ESM or EMA, ecological momentary assessment, where we can get at people's moods and systems, impairments, life stressors, self-reported sleep, and so forth.  And we can do that quite easily with, for example, smartphone apps. 

This is from our current study, where we ask about moods and affect states such as being cheerful or happy. We ask about who people spend time with, social contacts, which we think is quite relevant for the mental health system. And we ask about current activity.  Again, these are just three example items from a bigger study.

And then more importantly, we can get at system dynamics over time.  So you can see that this person, for example, had a bit of a bad week here, or two weeks.  You can see the positive mood goes down, tiredness goes up and negative mood goes up.  But these are minor fluctuations. This person returns to their own baseline very quickly, no big deal.

The third step we can take with these dynamic data is to estimate system-level information.  People use system science or network science tools for this.  These are simple network models from a tutorial paper we wrote on these topics, and this network-model or system-model information provides you with tools to monitor things like early warning signals, critical slowing down, and these sorts of things that might help us in precision diagnostics.
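
One common way to get such system-level information from EMA data is to fit a simple lag-1 vector autoregression, whose coefficient matrix can be read as a directed temporal network between momentary states.  The sketch below uses simulated data and made-up variable names; it illustrates the general technique, not the models from the tutorial paper.

```python
# Sketch: estimate a lag-1 temporal network from (simulated) EMA time series.
import numpy as np

rng = np.random.default_rng(1)
T, variables = 120, ["cheerful", "tired", "stressed"]

# Simulated momentary states with some true cross-lagged structure.
A_true = np.array([[ 0.5, -0.2,  0.0],
                   [ 0.0,  0.6,  0.2],
                   [-0.1,  0.1,  0.5]])
X = np.zeros((T, 3))
for t in range(1, T):
    X[t] = A_true @ X[t - 1] + rng.normal(scale=0.5, size=3)

# Least-squares fit of X[t] = A @ X[t-1]; entries of A_hat are directed edges.
B, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
A_hat = B.T
for i, target in enumerate(variables):
    for j, source in enumerate(variables):
        print(f"{source:8s} -> {target:8s}: {A_hat[i, j]: .2f}")
```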

Going back to what I talked about before: I'm skipping over a lot of details here; I'm just trying to convey the core message that this system-level information and these parameters might be relevant for what we're trying to achieve here.

The second type of data source that I think we could utilize more, and it's certainly happening in the field, and I just wanted to touch on it again briefly, is digital phenotyping data, or actigraphy data, using a wearable device.  We can get much of this from smartphones these days.  People have also used dedicated devices; for example, we use a smartwatch in our work, where you can get lots of information that, again, Claire talked about -- light exposure, heart rate, stress, sleep, and so forth.

These are not objective data.  They're sort of more objective to some degree.  There's huge measurement error in these devices, and there's a whole decade of measurement work to be done, but in principle these are interesting data sources, I think.

Now, I should spend 10 minutes on this.  I don't have the time for it, so I'm just going to give you two examples of why these data sources offer utility for description, explanation, prediction, or control.  I have a paper on this that I'm going to show in a second, and that paper tries to make the point better than I can in 30 seconds now.  Wolfgang Lutz and his team in Germany use system-level predictors to forecast outcomes in a clinical sample much better than your typical data sources and models, and Nick Jacobson and his team at Dartmouth have a paper using actigraphy to achieve 89 percent prediction accuracy in identifying folks with depression.

I love both Wolfgang and Nick and their work, but I'm very skeptical if this will replicate exactly in this manner.  I don't think Wolfgang and Nick would believe that, so there's work to be done.  This needs to be replicated and extended, but I think this is very, very cool and promising work.  Again, I have a short summary paper on these things if anybody's interested.

Last point, to not just talk the talk but also walk the walk here, how do we use this for personalized precision diagnostics?  The European Union was kind enough to give me some money for a big study here, in which we utilize 2,000 students and we follow them for two years, to build an early warning system for depression.

We have a large baseline battery.  This is all online.  It's all self-report, because this is meant to be an app later on, and I just can't get genetic testing for these sorts of things.  So it's a self-report online tool where we measure lots of transdiagnostic risk factors, and we follow these folks for three months using smartphone and smartwatch data.  It's a quite intense period for people, but we have about 350 measurement points using the smartphones, and people on average complete 290 to 300 of those 350, so it's quite rich data.

And then we follow folks for two years, every three months, using a short online questionnaire to see if they transitioned into a mental health problem, and then the idea is that we'll build an early warning system from that -- like a weather forecasting app or a period-forecasting app -- called WARN-D.

I'm going to wrap it up here.  Thank you so much to the other speakers and the organizers.  I just wanted to highlight how impressed I am by all the early career scholars here, so many postdocs doing fantastic work, really humbling, and I am going to hand it off to Professor Gur for the next talk.

RAQUEL GUR: Thank you very much.  I am going to emphasize the need for integration and convergence across levels of analysis in order to achieve greater clarity about the underlying neurobiology and neurogenetics of the disorders we deal with.

So these are the major challenges of advancing precision psychiatry, which I don't have to convince this group of.  The brain is very complex.  We're never going to be able to transplant it.  It's who we are.  There are multiple interconnected regions, the necessity for multimodal parameters of structure and function, protracted maturation -- hence longitudinal studies are important -- and critical periods of development, such as adolescence, which were mentioned in earlier talks.

The behavior that we got a good presentation of today, the product of the brain, is also very complex.  There are multiple domains.  You choose one -- how does it relate to other domains?  They're all related, and how do we measure them efficiently and in a way that will not be a burden to study participants?  To meet the challenge, there's not one way.  It requires a multipronged approach, capitalizing on advances in genomics and advances in phenomics that were beautifully highlighted today -- and I will add a little bit more on the exposome -- and it requires collaborative research with multiple kinds of expertise to create the mosaic, as coherent a picture as possible.

I will illustrate the approach, emphasizing the computerized neurocognitive battery, based on the Philadelphia Neurodevelopmental Cohort.  This is a cohort of children aged 8 to 21 from the Children's Hospital of Philadelphia on our campus who did not come through psychiatry; they came through pediatrics.

They were all genotyped and gave permission to be followed with access to electronic medical records. So it is capitalizing on the points that were raised before, how do you work within a health system?  How do you obtain consent and guard their confidentiality and work with them carefully?

Because they were young -- and this was pre-pandemic -- they all underwent face-to-face structured and semi-structured clinical interviews for children and a computerized battery, and 1,600 of them underwent multimodal neuroimaging.  All the data are in the public domain.

The computerized cognitive battery was based on over a decade of research applying functional imaging.  Multiple tests were applied to many individuals -- healthy, carefully assessed individuals, and individuals with schizophrenia, depression, et cetera -- through collaborative efforts at Penn and outside Penn.  At the end of it, we said, you know, we wish we could image every person we study, but that is often prohibitive.  Can we use all the tasks in a computerized fashion and give them as a behavioral test, because we already know the underlying circuitry?

This is how the battery was created. It taps major domains of cognition that relate to executive function, episodic memory, social cognition -- we spent a lot of time on that -- and sensorimotor function. Kids as young as seven years old can participate with appropriate instruction and mild modifications.

So the battery probes circuitry based on functional imaging, and multiple domains are assessed, although an investigator can be selective. It measures both accuracy and response time for each item, with norms from age 7 to people in their 80s.  So it can be applied in developmental research and in aging research.  Because it's computerized, it's highly efficient, and there's an automated algorithm that provides immediate quality assessment.

Hence, if a participant keeps pushing impulsively at one bar, yes, yes, yes, to be done, we detect it immediately and we can look at it.

It has been applied globally in multiple studies, primarily NIMH-supported but NIA studies as well, and it has been translated into 20 languages with careful back-and-forth translation.  So studies that are done globally can all apply the same measures, and as I said, there are extensive normative data.

It has been applied, and I will illustrate this, in collaborative genomic studies.  It is currently being applied in experimental therapeutics studies.  I will not have time to go into rare structural variants, but it is applied with kids in ongoing studies there as well.

The pandemic forced remote administration, and I can say happily that in the samples studied, both healthy people and individuals with neurogenetic syndromes, remote and in-person administration give the same results.  You get the same data.  The remote administration is proctored; it's not self-administered.  Somebody needs to be there to administer the test and make sure that the data are obtained with fidelity -- no dogs, no cats, no phone ringing.  The individual is directly assessed and encouraged to move on.

Because, at least in the United States, cognitive assessment for psychiatric patients is not reimbursed in many settings -- it costs thousands of dollars -- we provide a report prepared by neuropsychologists and give it back to the referring physicians and, with permission, to the participating individual and family, depending on age, with assent and consent clearly adhered to.  It's very helpful to them.

We are now finishing up a study supported by NIMH with an adaptive version, which will cut the administration time in half.  So the entire battery can be done rapidly, in about half an hour.

This is just to show that with a factorial structure we measure memory, complex cognition, executive function, and social cognition, and of course we can probe more carefully if a specific domain is of interest.  We can also measure efficiency, which is correct responses over the time taken to respond correctly, with age- and sex-appropriate norms for each of the domains.  This is an illustration of an NIMH-supported study covering the Commonwealth of Pennsylvania, with the University of Pittsburgh, where we evaluated multiplex families -- two first-degree relatives with schizophrenia -- and the rest of the family, going down to kids aged 15 and older.  What you can see is that across the domains, the individuals with schizophrenia, in red, function more poorly.  First-degree relatives -- it's highly heritable -- are similar, then other-degree relatives, and these are the controls.  So it can be administered in very large-scale studies, and the cognitive domains, we know, are heritable but can also be integrated, as we are doing now, with GWAS studies.

In an ongoing experimental therapeutics study, we are running a multisite trial between Columbia University, Penn, Stony Brook, and Yale.  We are examining a D1 partial agonist in individuals with schizophrenia on the letter n-back.  What you can see here are the repeated measures that each individual gets, and you can look at site effects and individual effects.  So it provides very extensive data that can be related to the potential effect of the partial agonist.

At the same time the cognitive battery is administered, there is a screener for psychopathology, because we are interested in relating -- this is a topic that has come up before -- cognition and psychopathology.  With this, bifactor modeling yields four major components: fear or phobia; anxious misery -- anxiety and depression, which are highly correlated; psychosis; and externalizing behavior.

This is a recent analysis that has been submitted for publication.  It's a busy slide.  But it shows how we can look dimensionally at the domains in females and males -- at psychosis, anxious misery, externalizing, and anxiety or phobia -- for each cognitive domain.  For example, you can see for psychosis, in both males and females, that the higher the level of psychotic symptoms, the more impaired the individuals are across domains.  It provides a detailed analysis integrating psychopathology and cognition.

In an effort to be more efficient and reduce burden on participants, we are now completing a study with computerized adaptive testing on 300 participants, and it will be available as soon as we are done with the analysis.  It will be in the public domain.  After thousands of participants have undergone the battery -- a form of crowdsourcing -- we can look at items by level of difficulty, analogous to the GREs, for example, so that people who are highly capable don't get bored and people who are challenged don't get frustrated.  So we move along the dimension of difficulty.

This just shows that we're getting pretty good preliminary results comparing the complete battery and the item-response-based adaptive battery -- not in the entire sample, but this half of the sample looks good, and now we are completing the rest.

So the computerized battery can be applied and the norms can be applied widely; it will shortly be available in a shorter form; and it can be done virtually, unproctored.

Another domain that I will quickly illustrate is the exposome.  Where we are born and where we live are very, very important, and this is something that we assessed and continue to assess in our study participants.  We also assess trauma: exposure to traumatic events such as being attacked, seeing somebody killed, et cetera.

Natural disasters, being part of a bad accident -- so these are personal exposures.  But there is also the neighborhood: in the United States, every ten years there is a census, and you get block-by-block data on how many people live on the block, crime on the block, and so on.  So we are able to derive an environmental risk factor.

Don't get scared by the next slides.  I will just show you that it's busy.  These are several papers that have examined the relationship of the exposome to imaging, to neurocognition, and to diverse groups of psychopathology.  So it relates to obsessive-compulsive symptoms, to suicide, depending on the interest of the investigators, and as I said, it's all in the public domain.

This is just to show you how complex it can get, but highly informative, and it is important to add the G-by-E to all studies that we conduct, and many of my colleagues in genetics agree to that now.

This is just to show the domains of symptoms -- the clusters of symptoms.  Low socioeconomic status doesn't necessarily drive an increase in symptoms. You can be poor but have a loving and supportive family, and you'll do just fine.  However, trauma -- traumatic, stressful events -- in males and in females, is critical across all diagnostic categories: depression, anxiety, phobias, externalizing behavior, and psychosis.

It impacts puberty.  There's a very large and growing literature showing that kids who are exposed to traumatic events and low socioeconomic status mature faster.  Cognition is impacted by socioeconomic status, while traumatic events impact psychopathology: the more traumatic events, the worse the psychopathology and the lower the cognition.  So look at the difference between how cognition relates to socioeconomic status versus trauma.

It's similar in imaging.  If you do multimodal imaging and you look at volume or gray matter density, mean diffusivity, FA measures, et cetera, you can break it down by socioeconomic status and trauma -- if it's not done, of course, we won't know.  But these are important measures that can be obtained from a person's address and a few questions that are part of the assessment.  And of course, it does have therapeutic implications.

So when we think about precision psychiatry and its impact on diagnosis and treatment: the phenome, as was beautifully shown today, has multiple components.  Not all studies can measure everything, but there's a lot of commonality across studies, which creates large datasets, big datasets, for novel computational approaches.  The genome is of course important -- common variants, rare variants, and epigenetics -- and the exposome is important.  It calls for a dynamic approach to what we measure and to levels of integration that will help us move forward, and from the get-go -- namely from pregnancy on -- the brain undergoes important development: intrauterine, the first two years of life, and adolescence as well.

So all of these measures are available, the literature is growing, and they should make life interesting and can serve the field.  All of this effort has been done with NIH support and with many, many colleagues over the years.  Thank you for your attention.

I'm going to pass it on to Joel Nigg now.

JOEL NIGG: Thank you, Dr. Gur.  I appreciate all the prior talks.  They've really been very interesting and helpful, and because of those, I'll probably skip over a few things I was going to talk about, because they've already been explained even better by other speakers.  I'm going to try to highlight a few different issues that come up when you actually try to do some of this work from different perspectives, particularly using relatively modest-sized samples that are deeply phenotyped, which is one of the many tradeoffs that have been talked about today, and one that I'm going to pick up on here a little bit.

So I'm just going to try to make the following points.  First, we've talked a bit about laboratory measures of cognition, which we also are very interested in and think are very promising.  I'm going to introduce some additional features here related to personality and temperament traits, which is another approach to phenotype refinement that is growing in use in the field.  I'm going to try to highlight the issue of using machine learning and other approaches in samples with different qualities, and an approach that we're using in terms of comparative modeling and Bayesian approaches in a relatively modest dataset with relatively few features, just to illustrate some of what we see, with a highlight on the potential for reproducibility in a way that might be directly translatable into various clinical settings that are not large, but that are very situation-specific and therefore could potentially be tailored to particular clinical contexts.

I want to highlight that in our process here of solving these problems, we have both easy problems and hard problems that we can try to tackle, and one of the strategies in the field may be to look at some of the low hanging fruit.  Then I'll conclude with just a few challenges and perhaps more questions than conclusions.

There's lots of kinds of publicly available datasets for this type of work, and we've heard about many of them.  I'm going to, following Dr. Gur, stay on the theme of distinctive local datasets that have depth and breadth of exhaustive phenotyping to them, in this case the Oregon ADHD 1000 cohort where I'll illustrate some features of what we're doing in that cohort.

I want to note here that in looking at the talks today and at the initiative here, there's lots of elements to the goal that are partially overlapping.  So any one study will pick up two or three of these features but not all of them, and I'm going to highlight a couple of different angles that we're taking that may capture two or three of these elements at a time and illustrate how progress may be made without having to use all the elements at once.

This is just an example of getting at mechanism using a trait perspective.  Here we looked primarily at ADHD as the complicated phenotype that needs to be parsed better, and this just illustrates that the genetic liability for ADHD extends to other traits that overlap with ADHD and that can be used to refine the phenotype.  We have done that by using a variety of mathematical and computational techniques to try to create more homogeneous subgroups of children who all have well-characterized DSM ADHD, all of whom have high impulsivity and poor inhibitory control, but who vary a great deal on some of their temperament traits; these are children in a 7-to-13 age range.  Things like their activity level, their irritability, their anger outbursts, and, on this issue of resilience that Eiko highlighted, near the right-hand side of the screen there's a circle around the scale for anger -- it's actually off center a little bit -- and then soothability.

The group that we have here that we call irritable, the green group, are distinctive for having very high anger, but also not being able to recover, what they call soothability is the speed at which you recover from anger.  And you can see the red group has high anger, but they recover quickly.  So this is the issue of system resilience that was so nicely highlighted in Eiko's talk and is a feature of children with ADHD that we're very interested in.

When we differentiate the children this way on a little bit more refined phenotype using some temperament traits that are examples of RDoC features, things like sadness or negative affect or anger that can be readily related to hypothesized neural networks from various functional imaging studies, we do this simply to enrich the phenotypic description.  We can now get quite a bit of mechanistic clarity that we didn't get just with DSM diagnosis.

This particular subgrouping yields clearer pictures in terms of ERPs, in terms of eye tracking to a negative affect task, in terms of functional circuitry in the brain.  This difference applies to all group comparisons and, as far as clinical prediction, we looked out a year or two later and we see that one group has twice as much risk for developing new psychiatric problems, a level of differentiation that we don't yet get through other methods of differentiating these children, and that outperforms typical clinical ways of predicting future problems, like current impairment or current symptom severity or current comorbidity.  All of that taken into account isn't nearly as powerful as simply looking at these trait profiles within the ADHD population.

So this illustrates the idea that we don't necessarily always have to have very fancy phenotype refinement to get significant steps forward in both mechanism and clinical prediction at the same time and potential treatment targets here that are mechanistic.  So I think that's hopeful.  Obviously, the mechanisms here are crudely described, and the prediction is also very simplistic.  So it could probably be improved upon significantly by bringing further tools to bear, but it illustrates again the many tools that are being used in this approach.  We don't always need the computational models to be advanced.

At the same time, there are a number of challenges that we see when we do go to doing clinical prediction.  These are issues -- and this illustrates the point made earlier by Danilo -- that machine learning models are very different from standard statistical models.  For one thing, how to handle missing data is not agreed upon or resolved, unlike in linear models.  The right ratio of sample size to feature number remains a matter of considerable discussion.  How should cross-validation and hold-out be done?  There is the problem of base rates for the diagnosis.  And no matter how we do our analysis, how do we define what's good enough replication and good enough generalizability?

What are the pros and cons of big data versus small data, and I'll talk about that a bit more here.  A key issue that I want to talk about that's come up in the literature recently is how good is the phenotype as a gold standard for training the machine learning algorithm?  Is it important to have a really good ground truth phenotype that then allows the algorithm to really know what it's going after?  This gets back to some of the questions and discussion earlier.

And then there's the question of which algorithm, of course.  The machine learning people in the group are aware of this, but just to remind the group, one of the issues that's been talked about in the machine learning literature is that with small samples and small datasets, the more traditional statistical or ML models may outperform something like a deep neural network, and the advantages of the more advanced models really grow with larger samples.  We're working here, in the case I'm going to illustrate, with relatively small samples, so we don't actually know in advance which model is going to work best.

So in this first example I'm going to give, though, we're still using a relatively large sample.  This is the National Survey of Children's Health, just to illustrate yet another point, which is the modest size of effect increments.  This is a sample of 65,000 children.  We made it very, very simple, just to illustrate a very simplistic three-layer artificial neural network.  Of course, this is not modern deep learning; modern models often have 100 layers.  But this is just a simplistic illustration here.

The base rate of ADHD is 12 percent.  Can we detect it in this case using health and demographic features available in the dataset?  To make the problem a little harder, we excluded some of the strong predictors like conduct disorder that heavily overlap with ADHD.  So we made the problem a little harder by taking out obvious correlates and just using mostly other general demographic and health features that might be available in many clinics.

You can see here that with this base rate, here's what chance prediction looks like.  A simple one-layer network, which is not that different from a spline regression, improves on that a lot and brings sensitivity to 68 percent.  So here's an example where a relatively simple model gets you a long way.

What can a simple three-layer neural network model add?  In this case, it brings the prediction -- the sensitivity -- up slightly, by about 3 or 4 percent, which is pretty typical in a lot of these types of applications.  You get a small percentage increase.  If you're Google or Facebook, this is worth a lot of money.  In population-level clinical policymaking, it's probably extremely important.
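
The shape of that comparison can be sketched with synthetic data: a linear baseline versus a small multilayer network on an imbalanced outcome, scored on sensitivity.  This is not the NSCH analysis; the simulated data, features, and network size are all placeholders, and the point is only that the increment from the deeper model is typically modest.

```python
# Illustrative comparison on synthetic data: logistic baseline vs. small MLP,
# scored on sensitivity (recall for the positive class) with a ~12% base rate.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=20000, n_features=25, n_informative=8,
                           weights=[0.88, 0.12], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(32, 16, 8), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)

for name, model in [("logistic", baseline), ("3-layer MLP", mlp)]:
    print(f"{name:12s} sensitivity: {recall_score(y_te, model.predict(X_te)):.2f}")
```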

How big a gain is enough to make a difference in an individual clinical practice or an individual clinic is going to be an important topic for our field -- deciding how much gain we really want to see as we look at these more and more advanced models.

I'm going to move on now to a second issue, which is the issue I highlighted earlier about what's the gold standard of reproducibility across sites.  This is the ABCD sample.  This just illustrates the correlation of the ADHD polygenic risk score with ADHD across all of the ABCD sites; you can see there's some degree of variability here.  Some of the sites are showing no association, and a couple even have a small nonsignificant negative association.  There's obviously variation in sample size here.

Two independent sites that are not part of ABCD -- the Oregon-1000 and the Michigan-1000 at the top, which I'm going to talk about, here just partial subsamples of them -- also show values that are perhaps similar to the composite estimate here.  But the question that comes up is: to what extent would a model created at one site generalize to another site?  This goes back to the earlier talks about the degree to which we want site-specific models or person-specific models.  Generalizability is an important problem when we get to real precision clinical prediction, and a limitation of a large sample is that it may not generalize to a particular site.  The ABCD composite correlation here of about .11 isn't what we see at some of the sites.

So this also is a question for our field.  A related point, another one of the tradeoffs that's analogous to some of those that have been mentioned earlier is there's often a tradeoff between the sample size and the depth of the phenotyping.  Of necessity, because cost is -- our dollars to spend are limited.

So if you want a giant sample, you're going to have to settle for more simplistic phenotype description.  If you want to get a deep phenotype or perfect validation of a phenotype, you're going to have a smaller sample.

So where's the sweet spot, to make sure you have a good enough gold-standard phenotype for training a good machine learning model, and at the same time a big enough sample that you can actually run a machine learning model and get something out of it, with a reasonable number of features that might apply to a real-world clinic?

So this is something we have been thinking about a lot.  I'm going to illustrate this here with a fairly easy problem, a fairly small sample, and a fairly small feature set.  An easy problem is: can we identify ADHD right now, cheaply and at low cost, with high accuracy against a strong ground-truth clinical assignment?  These are our two cohorts, including the Oregon ADHD-1000; you can see the sample sizes here, fairly modest by typical machine learning standards, but the phenotyping was extremely detailed and careful.  It used the best methods in the field that we could think of: a case-finding procedure in the community to minimize sampling bias; multimethod, multiformat assessment; trained interviews; multiple reliability checks and drift-validity checks on interviewing and application of tests; and then multiple-expert clinician best-estimate diagnostic assignment after review of all data.

Fairly standard methods here.  We did, however, emphasize a Bayesian classifier as our approach, the logic being that the way a clinician proceeds, even if it's implicit, is typically to say they get a little bit of information and then decide what other information to obtain, and they go and can repeat that process over and over.

So the clinicians are implicitly creating priors, achieving posterior estimates of what's likely to be going on, creating a new prior, getting new data, until they arrive at a decision.  So we wanted to use a multistep Bayesian approach, because we thought it would help us model when a clinician has to get more information and when a clinician can stop.  We also did competitive modeling, however, for reasons I explained earlier, that with these modest size samples, it's not clear in a particular sample which model is necessarily the best algorithm to do prediction.

We didn't just look at accuracy but also at the confidence of the prediction, because a clinician is going to make their decision based not just on the accuracy of their information but also on their confidence that it's right.  So we set a threshold of 90 percent probability that the prediction was right to call it a confident prediction, and then looked at accuracy across our classifiers so that we could try to mimic some of this.
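
The stepwise logic can be sketched very simply on the odds scale: update a prior with each new measure and stop once the posterior clears the confidence threshold in either direction, otherwise request the next, more expensive assessment.  The base rate, likelihood ratios, and measure names below are hypothetical, and this plain Bayes update stands in for the tree-augmented naive Bayes classifier actually used.

```python
# Sketch of stepwise Bayesian prediction with a 90% confidence stopping rule.
def update(prior, likelihood_ratio):
    """One Bayes update on the odds scale."""
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

def stepwise_diagnosis(prior, measures, threshold=0.90):
    """measures: ordered (name, likelihood_ratio) pairs, cheapest first."""
    p = prior
    for name, lr in measures:
        p = update(p, lr)
        if p >= threshold:
            return f"confident positive after {name} (p={p:.2f})"
        if p <= 1 - threshold:
            return f"confident negative after {name} (p={p:.2f})"
    return f"still uncertain after all measures (p={p:.2f}); refer for full evaluation"

# Hypothetical case: two parent scales, teacher ratings, then a cognitive battery.
case = [("parent scale 1", 4.0), ("parent scale 2", 2.5),
        ("teacher ratings", 1.8), ("cognitive battery", 3.0)]
print(stepwise_diagnosis(prior=0.30, measures=case))
```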

Here, before we proceeded, we looked at the missing data issue.  As I mentioned, there's no consensus in the field for how to do this with machine learning, unlike with the general linear model, where quite a bit of practice has developed.  This shows that we examined seven different methods of imputing missing data in a simulated dataset that mimicked the pattern of missingness in our data.

The bottom line here is that there was no obvious winner.  We ended up settling on one of the simpler approaches because it was easier to explain, but they all were relatively similar in their effectiveness.
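
One generic way to run that kind of comparison is to put each imputer in front of the same downstream classifier and score the whole pipeline by cross-validation.  The sketch below uses simulated data and only four of scikit-learn's imputers, so it illustrates the approach rather than the seven methods actually evaluated.

```python
# Sketch: compare imputation strategies by downstream cross-validated accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X[rng.random(X.shape) < 0.15] = np.nan   # knock out ~15% of values

imputers = {
    "mean":      SimpleImputer(strategy="mean"),
    "median":    SimpleImputer(strategy="median"),
    "kNN":       KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}
for name, imputer in imputers.items():
    pipeline = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    accuracy = cross_val_score(pipeline, X, y, cv=5).mean()
    print(f"{name:10s} accuracy: {accuracy:.3f}")
```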

The competitive models that we used are just listed here.  We went from relatively simple models, including logistic regression, a decision tree, again reminiscent of some earlier slides here that were used, to relatively complex models.  We didn't think we seriously had enough sample size to do multilayer neural network models.  So we skipped that, but we had some other models here.  And then we did include an ensemble.
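
For the competitive-modeling step itself, the generic pattern is to run the same cross-validation scheme over every candidate, from a plain logistic regression up to an ensemble.  The models, data, and scoring below are placeholders meant only to show the pattern, not the actual model list or results.

```python
# Sketch of competitive modeling: same CV scheme over several candidates.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=20, random_state=0)

candidates = {
    "logistic":      LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "naive_bayes":   GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
candidates["ensemble"] = VotingClassifier(
    estimators=list(candidates.items()), voting="soft")

for name, model in candidates.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:13s} AUC: {auc:.3f}")
```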

So first of all, in the first cohort, which was the Oregon cohort, what the left-hand panel shows is that in terms of accuracy alone, if you disregard confidence, all the models performed very similarly.  The logistic regression was practically as good as all the other machine learning models.  So this illustrates, perhaps to some extent, that earlier point that with relatively modest-sized samples, the gain from a very advanced model may not be very great.

Again, this isn't going to be true of all small samples, but it happened to be the case here, and there's a lot of sample specificity to what the right algorithm is.

However, when we brought confidence into the equation, there was no comparison.  The Bayesian model was far more accurate if you required confidence as well.  There were a lot of low-confidence predictions from the other classifiers, so they might have been correct, but the clinician would not have had confidence in them.  Here they did.

So I want to go a little deeper on the Bayesian model results.  This was just a simple tree-augmented naïve Bayesian classifier, for those interested in those details.  The model at step one was just a parent ADHD rating scale, and here about 70 percent were predicted.  At step 2 we added a second parent rating scale, and at that point we had a pretty good rate of confident, accurate predictions.

With step 3, the accuracy increased further, by about 15 percent, from about 70 percent accurate to about 85 or 86 percent accurate.  Then, very interestingly, there was a small number of cases where the accurate and confident prediction was achieved only after adding a cognitive testing battery -- not as sophisticated or advanced as the Philadelphia battery that Raquel Gur explained, but nonetheless a fairly comprehensive neurocognitive battery -- which resolved the cases that were difficult to diagnose correctly.

So the purple line illustrates that additional gain.  So what was interesting to us is that there are easy to classify and hard to classify cases, again getting back to the theme of very personalized prediction, and that additional measures were very helpful in a small subset.

How well did this generalize to a completely different population?  You can see some inevitable shrinkage when we moved over to the Michigan sample.  That's the dotted red line on the graph.  This is a completely new population.  Central Michigan and the valley in Oregon are very different populations in terms of socioeconomic status, economic stress, racial mix, and severity of clinical problems and comorbidities.

So this is a good example of generalizability, because we had a whole new population, but we used virtually identical methods to select our cases and to evaluate them, and we ran the exact same algorithm.  So in that sense, it's encouraging that we get reasonably good prediction in the generalizability sample, and this is the test sample here.  We had a holdout sample, of course, in Oregon, in addition to a training set, but it's also notable and important that there's also shrinkage and that that has to be quantified across sites.

A hard problem is to predict several years into the future whether someone is going to clinically deteriorate, and this is just a preliminary look at this, and you can see here though that there is here some differentiation across models, the regularized models in this particular case doing a little bit better.

But the bottom line is that this is a hard problem where we aren't nearly as good at predicting.  We can beat chance, but hardly well enough to write home about, and so we're still working on these models.  We think we can do better than is shown here, but I want to highlight again that some problems are relatively easy and some are relatively hard, and a strategic decision for the field is whether we should pick off a bunch of these easy problems right away and make some quick gains in clinical accuracy and practice, while setting our sights on some harder and maybe bigger problems for the longer term.

I'm going to conclude this by mentioning a couple of overarching challenges and some concluding questions.  First, the field has to think about this question of how much to invest in incremental gains.  They may be very important.  On the other hand, maybe they cost too much.  What do we want to gain and on what problems do we want to apply our tools?

There's already been a lot of discussion here about the critical importance of data sharing and the difficulties there. We have the challenge of getting datasets that are both large and clearly labeled. How do we get high-quality, gold-standard phenotypes? One strategy I've thought about is to take a subset of one of these monster datasets, deeply phenotype it, and then use things like propensity weighting to try to generalize to the whole dataset. That can of course be done in a planned, a priori way in future data collection.
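
A toy sketch of the propensity-weighting idea mentioned above, with simulated data and hypothetical variable names: model selection into the deeply phenotyped subset from variables measured on everyone, then reweight the subset by the inverse of that probability so its estimates better approximate the full cohort.

```python
# Sketch: inverse-probability weighting of a deep-phenotyped subsample.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
full = pd.DataFrame({
    "age": rng.normal(12, 3, 5000),
    "symptom_score": rng.normal(0, 1, 5000),
})
# Suppose older, more symptomatic participants were likelier to be deep-phenotyped.
p_select = 1 / (1 + np.exp(-(0.2 * (full.age - 12) + 0.5 * full.symptom_score - 2)))
full["deep_phenotyped"] = rng.random(5000) < p_select

# Model selection into the subsample from variables measured on everyone.
ps_model = LogisticRegression(max_iter=1000).fit(
    full[["age", "symptom_score"]], full["deep_phenotyped"])
ps = ps_model.predict_proba(full[["age", "symptom_score"]])[:, 1]

# Inverse-probability weights for the deep-phenotyped subsample.
sub = full[full.deep_phenotyped].copy()
sub["weight"] = 1 / ps[full.deep_phenotyped.values]

print(f"unweighted subsample mean symptom score: {sub.symptom_score.mean():.2f}")
print(f"weighted subsample mean symptom score:   "
      f"{np.average(sub.symptom_score, weights=sub.weight):.2f}")
print(f"full-cohort mean symptom score:          {full.symptom_score.mean():.2f}")
```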

It's already been said that you need multidisciplinary expertise to do this work. These are some other well-known issues. Then I think a real challenge is: what's the best algorithm, and what's the best way to handle missing data? These things are often going to be sample-specific, so how to generalize is, I think, an important question for the field to keep talking about. And then again, what are the problems that we care about? Which phenotypes do we care about?

I will conclude with what I was going to call conclusions, though I realize most of these are actually questions. First, I will say that I think there are different ways to make progress. We can refine clinical features; that alone can make progress, even without changing our algorithms. We can also improve the algorithms without changing the clinical features, or try to improve both at the same time, of course. I think there are ways to improve the labeling on subsets of our large samples to make them more powerful.

And multiple small datasets can be useful because they can show sampling variability and generalizability in ways that a single large dataset cannot, while a single large dataset offers a different set of strengths. So there are pros and cons to the different approaches.

I've already highlighted the contrast between diagnosis and future prediction, and between easy and hard problems. One of the challenges, I think, is that we don't really know a lot about how accurate clinicians are now, about their current performance, although people like Greg Simon and others may already have some good data on that. But I think this is an important question: what are we trying to beat? What's the benchmark we have to do better than to say that we have real progress to offer the clinical world? And what should the feature set be, the value of small versus large feature sets?

And then, how much improvement do we care about? I think this is going to vary for different questions and different problems, obviously, but it still requires some reflection.

And then to what extent is it important to mimic the clinician, to support the clinician, versus replace the clinician?  We have a literature in psychology dating back 50 years trying to identify which clinical problems are best handled by an algorithm and which clinical problems are best handled by a human decision maker, and I think that's a good place for us to think about augmentation of human decision-making versus places where an algorithm can do the job.

I'll kind of conclude with the question: what does a clinician need? We have talked from the beginning of today about usability and accessibility, and about the burden on clinicians, which Greg Simon mentioned; that's extremely important. Is a simple decision tree the best? Is a plug-and-play machine algorithm the best? I think looking at the end game of what we want to end up with is going to be valuable.

I'm going to conclude there, thank a couple of key collaborators, although there are many more, and thank everyone for a fascinating day, and I'll hand it over now to our discussant, Laura.

LAURA GERMINE: Thanks for that.

So, a broad series of talks, all really wonderful, and I think highlighting the many different ways we can approach questions related to precision diagnostics using many, many different tools at hand.  I'm going to try to process some of the questions that are coming from the audience, but also use my moderator's discretion to generalize them a little bit.

So the first one I think was directed at Eiko, but I'm going to make it more general. It's about acknowledging that when it comes to diagnostics and psychopathology, there is complexity and nonlinearity. But what about adaptability? What about the fact that people do change, so those models might not be true for a person over time? Is that something we can include in our models, or is it a challenge to generalizability, and how do we approach the fact that people do adapt and change over time?

EIKO FRIED: I can just start by saying that there are other adaptive systems that respond this way. There's fantastic work on colonies of bees, for example, that respond adaptively to behavior, and lakes do as well. I've learned a ton from my colleagues in ecology over the years. I can only recommend that everybody here talk to your resident ecologist about systems.

So yes, I think we can take that into account. We can add constraints to these models. We know that many people respond in adaptive ways to many things, but we also know, and there's a lot of cybernetics literature from the 1990s that started this, that some adaptive systems get stuck in nonadaptive states. Fever is highly adaptive in most cases, but some people die of fever. So adaptivity is, at least ontogenetically, an important feature of these systems, and tricky to study, but there is some work in ecology and biology that shows us how it can be done. That's at least my perspective.

JOEL NIGG: I just had one very simplistic comment here, which is that also adaptability may be one of the features we measure in people, and one of the predictors that we look at.  So I think that at least from personality point of view, but also from a learning point of view, it may be a rich target.

RAQUEL GUR: I agree with that, and I would add that resilience can be measured. There are articles now on the heritability of resilience, and in longitudinal studies, when we follow individuals at risk for psychosis, those who do not transition to schizophrenia have higher resilience from the get-go.

And also, in following teenagers with suicidal ideation, those who make no attempt have higher resilience. So it's something that is dynamic, can be carefully assessed, and can add to clinical prediction.

LAURA GERMINE: And when we think about adaptability and resilience that way, they become fundamental to the concept of psychopathology, not just another complication.

So I'm going to add to that question a specific type of adaptability that came to mind when I was reading it, one specific to the research setting and to measurement: what do we do about the fact that many research participants are also observers and consumers of the fact that we're conducting a research study?

The two situations I thought of are, first, in the context of EMA, where we've seen behavior change over the course of an EMA protocol in what seems to be a response to the EMA itself, and second, in the EHR setting, the transition to open notes, for example, where patients are now reading the things that are being put in the record and are much more aware of it. How does that change the way we think about measurement, how we generalize, and the sorts of models we build?

Raquel, did you have your hand up from before?

RAQUEL GUR: I will just say that you need multiple sources of information when you collect the data. Sometimes it's quite amazing to see the discrepancy between a child report and a parent report. We go back and check it, and the children let us know, but they do not tell their parents. Similarly, women will not tell postpartum how depressed they are, but they tell us in research, because they're concerned about their role as mothers and what will be done. So you need multiple sources and sensitivity to the topic under investigation.

LAURA GERMINE: Eiko, I saw you had your hand manually up before.

EIKO FRIED: I think your first question goes toward reactivity of these assessments, to some degree. We see that in research, and it's not an easy problem to solve. As I said before, we're running an observational study, collecting data to forecast transitions into depression, and our participants use smartwatches. The very first thing, one of my coordinators called and said, Eiko, I've been wearing this watch for three weeks now, and you know, I've really changed my habits. Like, no, no, it's not meant to be an intervention. And we get the same from some participants. We give people personalized data reports after the three months, just to motivate them to keep participating, as sort of a citizen science thing as well. But yes, for some participants, although we only feed back data that they literally told us in the app, nothing new, seeing it from a bit of a distance shapes their perspective: oh, on Mondays I don't seem to do very well. This can have an intervention effect.

What circumvents this as a problem in our situation is that when people use the forecasting app in three or four years, they will be in the same situation, where the observation is an intervention, and in that way the models we build now will cross-validate in whatever people are doing then, because the intervention effect is the same. But it is still reasonably uncontrolled, and I think we need a lot more work on this sort of reactivity.

Just to add one more sentence here: colleagues of mine in the Netherlands are doing a lot of work on EMA tracking in clinical populations, and it's usually quite well tolerated, but I recently saw a report on CBT for rumination in particular, and let me just tell you, it was not nice for people who are trying not to ruminate to be asked eight times a day, have you been ruminating in the last two hours, because of course they started to ruminate when they got the prompt. So it's a really tricky thing to navigate right now.

LAURA GERMINE: Joel?

JOEL NIGG: Thank you. I wanted to comment on that last point, because as I heard this question and then Eiko's comment, I was thinking that research can be both: it can be iatrogenic, it can be harmful, or it can be beneficial, even therapeutic, even though we intend it to have no effect. It's the metaphor of the Heisenberg uncertainty principle, or whatever you want to call it: we intend to have no effect, but of course we do, and participants respond to the research as an intervention, although we don't want them to. But I think this is a really valuable question to raise, because as we engage in more observation and more reporting with participants in our research and hope to apply it to the clinic, it raises the question of which kinds of observations are actually beneficial in and of themselves, making patients more aware, more self-monitoring, and so on, versus which ones may be negative for some patients, as Eiko astutely pointed out. Again, potentially really useful. I just think a lot of these things can be turned to our advantage with the right thought, but it's a really important point to raise.

LAURA GERMINE: Yeah, I like the reframe of the problem as thinking about the fact that research is almost necessarily some sort of intervention and thinking of that as an explicit part of it.

So I apologize to the person asking this question if I get it wrong; this is my best parsing of it. One problem that perhaps we have in research, and I think we'd variably agree with this, is an excess of information. There are many different methods we might use to approach the same question, and they've yielded different nosologies: DSM, RDoC, HiTOP. What do we do when these different approaches, such as the ones you all discussed here, yield different results, different answers? Not just from the perspective of reproducibility, but from the perspective of how you would carve nature at its joints. How do we integrate across those different types of results, which seems like a likely thing to happen if we have different approaches?

Joel?

JOEL NIGG: I would say that this goes back to what's our ground truth, what's our final goal?  We want the patients to -- we want people with clinical problems to feel better, to function better, to be happier, clinicians to be more successful and so on, and obviously there's lots of secondary goals like lower cost and that kind of stuff, but it all serves the larger goal of improving population health.

So I think we can always return to those fundamental larger goals. Is that being achieved? Things like whether we have a better nosology I find fascinating, but ultimately the question is whether it's useful. That becomes the rubber-meets-the-road question that I think all these methods are designed to address. So I would go back to that as one way out of that dilemma, because I agree it's the problem that if you want to do classification on a sample, there are of course an infinite number of solutions. All of them are right. Which one is most useful becomes the important question. So that's the perspective I would bring.

LAURA GERMINE: Greg?

GREG SIMON: I would say pretty much the same thing. Ultimately it comes down to utility. What we hope to see is a virtuous cycle where a better nosology prompts better use and development of new therapeutics, and then better use of new therapeutics prompts a better nosology.

You know, to me it's interesting to look at the areas of medicine where we've seen really fantastic progress over the last decade or so, mostly in immunologic disorders and cancer, where the breaking of the nosology has been really essential to progress: the discovery that the old nosologies, where you described diseases by their surface manifestations or cancers by their organ of origin, really make no sense anymore.

And the new therapeutics break the nosology, which leads to better therapeutics. The question I often ask myself, and this goes back to the cancer analogy, is: if there were such a thing as a Gleevec for a particular type of bipolar disorder, how would we ever know?

Are our means of understanding subtypes and of actually assessing outcomes good enough that, if there were an existing or a new therapeutic that was highly effective for a category of people we have not yet discovered, we would be able to find it? That's the challenge.

LAURA GERMINE: Eiko.

EIKO FRIED: Just to highlight, from the perspective of somebody who's still a sort of early-ish career researcher, how happy I am to see that so many authoritative, important people in the field take a pragmatic angle on this and acknowledge that we don't need to fight between DSM and ICD, perhaps, because these systems serve practical purposes for a reason. There's a great rebuttal from Geoffrey Reed to a HiTOP position paper from a couple of years ago where Geoffrey says, well, no, HiTOP is not ready to replace the ICD, because the ICD literally has the job in Europe of estimating prevalence rates, and for that you need a categorical system. You can't do that with severity, right?

So this is not just science; this is also science policy. And I think it's really great for me to see that this is more and more acknowledged. Allen Frances always talks about this with DSM-IV: it was very clear to everybody working on it that these are pragmatic categories in some sense, aimed at maximizing clinical utility. So I just wanted to highlight how nice it is to hear that said publicly and very outspokenly, and I fully agree, of course.

LAURA GERMINE: I feel the same way, and actually I'm surprised at how consistent the view is across this panel.  So that reflects I think maybe some of the changes, too, in the way we think about the purpose of the nosology.

Anyone disagree with that? 

RAQUEL GUR: I would just add that having data on treatment response to integrate will be very helpful, if the goal is really to parse some of the heterogeneity and come up with better treatments, and at times, for this, you need good EHRs.

LAURA GERMINE: That actually leads nicely to another question that was in the audience questions.  So it was a question about using EHR notes as a way of validating the RDoC, but really maybe the broader question is what is the role of EHR notes in thinking about nosology, and of course I think in clinical prediction maybe it's a little bit clearer.  But anyway, I'll stop there.  How do we use EHR notes when thinking about nosology or thinking about classification?

GREG SIMON: I'll start. The Mass General group, Roy Perlis and colleagues, actually have a couple of papers, I'll try to put one in the chat, about trying to use processing of clinical text to detect RDoC-type constructs. It's really more proof-of-concept work, interesting, but I think it's telling that they did this with inpatient clinical notes, likely because those include much richer and more varied information and, interestingly, multiple observers. Someone who is hospitalized for a mental health condition typically has clinical text recorded by multiple observers and over some period of time, so you're probably more likely to get useful information.

You heard my lament earlier about the templating, carrying forward, and cloning of notes from one clinical encounter to another in our current EHRs, and how in some ways we magnify the word count without necessarily increasing, and possibly while reducing, the novel or true information content.

But as I say, I am actually more interested in some of these other sources, such as patient-generated text, because the more we can use information that was directly created and recorded by patients, the less it has been processed through the preconceptions of a clinician. If clinicians are using a DSM checklist in their notes, you're obviously going to get DSM-formatted information.

Patients, by and large, when they're sending messages to their clinicians about how they feel, or chatting with someone about how they feel, are probably not using a DSM checklist. So you might be more likely to find information that would actually be useful and tell you something beyond the taxonomy we're currently using.
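
For illustration only, and not the Mass General group's method: a minimal sketch of flagging a construct of interest in patient-generated text with a simple bag-of-words classifier, using toy messages and toy labels.

```python
# Sketch: TF-IDF features plus logistic regression to flag a construct in free text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I have not been sleeping and I can't stop worrying",
    "Feeling pretty good this week, back at work",
    "Everything feels pointless and I am exhausted",
    "Enjoyed seeing friends over the weekend",
]
labels = [1, 0, 1, 0]   # 1 = construct of interest endorsed (toy labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict_proba(["can't sleep, worried all the time"])[:, 1])
```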

LAURA GERMINE: That's an excellent point.  Joel.

JOEL NIGG: I want to weave those two questions together a little bit to maybe highlight a little more the distinction between mechanism and prediction, and so we can predict ADHD, but it doesn't tell us what's causing the ADHD or how to treat it.  Or we can predict that somebody is going to get worse, but it doesn't tell us why.  We can even predict what treatment they'll respond to, but it doesn't tell us why or whether they'd respond even better to something we haven't discovered yet.

So there may be ceilings on the utility of prediction without mechanism, and in that sense we can look to utility. This may be the balancing point that, Laura, you were kind of surprised not to hear in the comments: we can look to utility, but recognize that it may have a ceiling if we don't know the mechanisms. So we need the basic discovery to understand mechanisms to be part of that virtuous cycle that Greg described, and we need the prediction work to see whether it works, but we also need to know what we're targeting for treatment to make the treatments more effective. We are all unsatisfied with the effectiveness and the side effects and everything else. Some of our best treatments are psychological treatments, because there's a mechanism that's theoretically known and understood, whereas some of the pharmacological treatments remain imprecise, partly because the mechanism isn't well understood. Of course that's rapidly changing with our rapid advances, but that's been the status quo we've struggled with.

So I just want to highlight that we need a kind of nice dialectic of mechanism and prediction, I think, here.

LAURA GERMINE: A sort of related point, audience question: how do we address the challenge of using one biomarker to validate another?  By definition, comparison to a gold standard will degrade the standard even as we need to shift from categories of disease to domains of disease.  This idea of using one thing to validate another thing that we're not totally sure of either.

Joel, is your hand raised for this one?

JOEL NIGG: It was raised from before, but I'm happy to raise it again if nobody else is eager to jump in. I'll go ahead and start, just to say that part of the challenge here is that I use the term gold standard, and we use the term ground truth now at NIMH, but these are rather misleading terms, because we really are in this kind of nomological net where it's a bootstrapping process all the time, and we hope that we're incrementing toward, or approximating, the truth. I think that's the challenge with the biomarkers: understanding their correlation and cross-validating them means recognizing that neither one may be a true gold standard. So you have to have that incremental process of bootstrapping forward.

RAQUEL GUR: Ultimately it comes down to advancing mechanistic understanding, because otherwise we will have lots of correlations and no basic understanding of how pathological behavior emerged, in whom, what course it will take, and how people will respond to treatment. These are tough questions.

EIKO FRIED: I think this goes back a bit to the earlier question about converging data sources and how all of a sudden we disagree about the pragmatic summaries we superimpose on these landscapes. There's a book by Hasok Chang called Inventing Temperature that I highly recommend on this idea of bootstrapping; it's about epistemic iteration as a cycle between measurement and theory development. From that perspective, temperature wasn't discovered; it was invented, to some degree, because we had neither thermometers nor a concept of temperature. So how do you come up with it? It took hundreds of years until the first thermometer was built. From this perspective of how we validate biomarkers with biomarkers, we need to epistemically iterate to improve our measures and to take these things seriously.

To go back to my field, everybody knows this is my pet peeve, I'm sorry to bring it up, I waited five hours, but, yeah, the Hamilton is still the gold-standard scale in our field. It's from 1960. We have not taken what we have learned about depression seriously enough to epistemically iterate our measures. Yes, there's the PHQ-9 and so forth. But anyway, I think epistemic iteration is a really important concept here, to actually change the measures based on what we've learned from the data we've collected.

LAURA GERMINE: So we have one more minute, and one, I'll say, simple question, though maybe it's a very complicated one. Greg, in talking about the EHRs, you mentioned the idea that if we added one extra minute of burden to clinicians across the system, I can't remember which system it was, it would amount to 54 FTEs. When I heard that, I thought, well, then we can't ask anything, and we can't; that's unacceptable. But is there anything you think would be worth including in the EHR, if you had that one minute of extra burden, or maybe it would just be better notes, but what would or should that thing be?

GREG SIMON: Since we only have one minute, there can't be an answer to this live. But the question I would put back to this group to think about is: imagine a future world where most mental health care is delivered by video visits. People are sitting in a video waiting room, essentially. What is the equivalent of the fish tank in the waiting room for the video waiting room, the thing that people stare at while they're just waiting for their clinician to show up to start the visit?

If there were something we asked people to do during that time that we're interested in, and we were willing to accept that they might do it for 30 seconds or for 10 minutes, depending on how much time they have, and that some might do it and some might not, it would be really cool if we could do that.

If you were assigned the waiting room task of the month, what would it be?

LAURA GERMINE: Anyone want to take that briefly?

GREG SIMON: It could happen.

LAURA GERMINE: Well, in that case, why don't we go ahead and stop there, and I'll turn it over to Sarah, who will be closing out the workshop.  Thank you, everyone.

EIKO FRIED: Thanks, Laura, for moderating so well. Thank you so much for staying the whole time and doing a fantastic job.

SARAH MORRIS: Thank you very much, Laura. I will tell you, the RDoC Unit was really quite giddy as the responses to the invitations to present at this event came in. There were fast and furious emails of excitement when each of you accepted the invitation. So it's clear that that excitement was well founded.

So thank you all very much for your very thoughtful presentations.  You've not shied away from highlighting the complexity and challenges of this work, including the importance of looking at dynamics over time, the importance of integrative approaches, and the challenges of incorporating novel measures into point-of-care settings, even before their clinical value has been proven.

It's easy to feel overwhelmed by these complexities and challenges.  So I appreciate Joel's reminder that it's not possible to do everything all at once, and it's okay to focus on a subset of goals.  It's exciting to look down the road at the dual lanes of precision psychiatry with clinical prediction and decision-making on one side and identification of novel mechanisms on the other, and as they intersect and inform each other going forward.

So thank you again to all of the presenters.  Thank you to the attendees for joining today and for your excellent questions, and I will close out there.