The BRAIN Initiative® Cell Atlas Workshop Day 3: From Single-Cell Genomics to Brain Function and Disorders—Data Integration and Annotation

Transcript

ERIN GRAY: Good morning, everyone. It's 11:30, so let's go ahead and get started. So I'm Erin Gray. I'm a program officer at the National Institute on Aging here at NIH. And it's my great pleasure to welcome you to day three of this workshop. We've already had two really great days of talks and discussion. And I just want to give my thanks to all of you for your contributions towards making this a really valuable and truly insightful workshop. I'd like to give a special shoutout to Laura Reyes for all of her hard work behind the scenes, and also a big thank you to Yong Yao for organizing. So our first session today is on brain cell atlases for neuroscience and health and disorders. So this will include one keynote presentation and one panel discussion.

ERIN GRAY: So our keynote presentation is entitled Decoding the Regulatory Code of Brain Cell Diversity, and this will be led by Drs. Joseph Ecker, Bing Ren, and Nelson Johansen. And just briefly to introduce the speakers, Dr. Ecker is an investigator at the Howard Hughes Medical Institute, a professor and the international council chair in genetics, and director of the Genomics Analysis Laboratory at the Salk Institute. His research focuses on genomic and epigenomic gene regulation and on the application of DNA sequencing technologies for genome-wide analysis of DNA methylation, transcription, and gene function. And Dr. Bing Ren is the director of the Center for Epigenomics and the professor of Cellular and Molecular Medicine at the University of California in San Diego. His research is focused on discovery and characterization of the transcriptional regulatory sequences in the human genome. And Dr. Nelson Johansen is a scientist at the Allen Institute. And his work is focusing on using machine learning for comparative analysis of cell types in the brain across species from the lens of single-cell transcriptomics and epigenetics.

BING REN: Hello. Good morning, everyone. My name is Bing Ren. I'm delighted to join Joe and Nelson in this presentation today. We are going to discuss how the brain cell atlases help improve the decoding of gene regulatory sequences in the genome and how this knowledge can be used to improve our understanding of neurological disorders. There have been a lot of genomic studies, genome-wide association analyses, that uncovered thousands-- or hundreds of thousands of risk variants associated with neurological disorders. But decoding the meaning of such risk variants has been really slow and a big challenge. And the reason is that a large fraction of them are non-protein-coding, and it's unclear how they contribute to human disease. A new paradigm has emerged in the last decade that suggests that a large fraction of these non-coding risk variants may act by perturbing transcription factor binding and disrupting cell type-specific gene regulatory programs. And the goal for the work going on in my lab and in Joe's and Nelson's is to take this brain cell atlas knowledge to really test this paradigm and to improve understanding of various neurological disorders.

The outline of today's presentation is as follows. We're going to have a brief discussion of strategies for gene regulatory code analysis. This is followed by Joe discussing the recent mouse brain cell atlas, focusing on gene regulatory programs. And I will follow Joe to discuss the human brain cell type-specific regulatory programs. And finally, Nelson will discuss how this knowledge is enabling an enhancer challenge, which is designed to test various machine learning programs to predict cell type-specific enhancers.

The major approach that we use to study regulatory programs is known as epigenome profiling. Epigenome refers to the covalent modifications to DNA, such as DNA methylation, or to histone proteins, such as histone methylation. It also refers to the chromatin structure, including open chromatin or three-dimensional chromatin contacts. All of these features of the epigenome can now be measured using next-gen sequencing technologies, such as ChIP-seq, ATAC-seq, DNase-seq, whole-genome bisulfite sequencing, and Hi-C. But traditional epigenome profiling methods suffer in that they require bulk input and do not reveal cell type-specific epigenomes. This has changed due to the advancement of single-cell epigenomics. By single-cell epigenomics, we are talking about high-throughput methods to interrogate the epigenome in individual cells in a highly parallel fashion using either microplate format or droplet format.

For example, we established single-cell ATAC-seq using a combinatorial indexing strategy in-house. And 10x Genomics is now selling single-cell ATAC-seq or multiome in droplet-based form. And Joe's lab and also our lab have now established a tool to enable single-cell analysis of methylation and chromatin organization in the same cell. My lab recently advanced a technology known as Paired-Tag that enables simultaneous profiling of histone modification and gene expression from the same cell in tens of thousands of cells in parallel. So Joe will discuss how this information is now advancing our knowledge in the mouse genome. Joe, take it over from here.

JOSEPH ECKER: Yeah, thanks, Bing. So yesterday, Hongkui Zeng described the beautiful cell atlas that was produced at Allen Institute and how that has been useful in identifying cell types. What I wanted to point out here is that, beyond cell types or brain regions, we're focusing on understanding the genome within those cells. So we want to understand gene regulation at various levels. Bing, you can advance the slide. These levels include understanding how the chromatin is folded into compartments, which are large-scale areas that can be either membrane-associated, nuclear-membrane-associated, or more near to the nucleolus. And those regions are associated with large domains that are different in gene expression. And that can be correlated with DNA methylation, for example, in these blue and red segments here that are boundaries.

The TADs, the topologically associated domains, are sort of the next layer of gene regulation, where it's been shown very nicely by a number of groups that dissociating these topologically associated domains, which can be gene spanning, results in phenotypes in mouse, for example, and in human disease. Then, brought down to even another level, we can associate promoters and enhancers using the Hi-C assay in single cells. Coupling the DNA methylation and open chromatin assays, we can identify candidate regulatory elements, and then go right down to motifs that can be predicted from the sequence as well as the methylation. Next slide, Bing.

And so I'm just summarizing five years of work from three laboratories in two minutes. This is work from Bing's lab that was recently published describing the global brain-wide environment of cell types, as identified by open chromatin signatures. And these can be associated very nicely, in the upper right panel, with DNA methylation. Hypomethylation and open chromatin are very highly associated. You can then carry out correlation analysis to look at open chromatin and gene expression to predict gene regulatory networks. And Bing's lab has used deep learning approaches to be able to use the information from these high-resolution maps to predict chromatin accessibility in cell types which haven't been profiled. Next slide.

Advance it one more time, Bing. So we've done the same thing with DNA methylation, and we have very good alignment, as shown with the overlap of clusters between DNA methylation and-- we identified about 4,600 clusters that align very well with the RNA data. This can be used to integrate these modalities. So we have DNA methylation, open chromatin, RNA-seq, and chromatin contacts that all can be joined - next slide, Bing - to identify gene-level interactions. So the open chromatin signatures can be associated with hypomethylation of the gene. The TFs have hypomethylation of motifs. And all of those correlations can be used - next slide, Bing - to-- oh, this just shows the number of edges for these several hundred clusters where we've analyzed the data. They can be used to identify gene regulatory networks and predict, using PageRank score, the most important motifs in, for example, hindbrain subclasses, as shown on the bottom there. Next slide. Just advance one more, Bing.

Also, I wanted to point out, and Yong sent a link to the paper yesterday, that this assay of DNA methylation in single cells can be used beyond clustering and gene regulation. It can be used to look at projection mapping. So in this study, which is a collaboration with Ed Callaway's lab, we can inject various regions of the brain and use AAV-retro to activate a marker gene in the source area and then profile DNA methylation. And what we can identify are projection-specific regulatory elements from these kinds of data. And they can be mapped using MERFISH into a spatial context. So you can map the projections into the context of where they originated. Next slide.

And then finally, work from Bing's lab - and advance it again - and this is something that we use - one more time, Bing - to-- Nelson will talk about using this data where we've got multimodal data for four species that can be combined with RNA data. So we have multimodal RNA, ATAC-seq, methylation, Hi-C that can be used to predict regulatory elements across the species. So I'll pass it on to the next speaker.

BING REN: Yeah. Thank you, Joe. I'll just briefly discuss how the same methodologies now have been applied to the human brain. This is the first survey of the human brain structure. So in a separate package of papers published in Science in October, we described a single-cell analysis of chromatin accessibility in the human brain, focusing on 1 million cells from 42 brain-- sorry. From 42 brain regions, and that gave us an annotation of 107 brain cell types, which led to the discovery of more than half a million candidate regulatory elements that operate in one of these 107 cell types. We then used computational means to link these elements to their target genes, putative target genes, and then performed analysis to identify cell-- to interpret non-coding risk variants of human neurological disease and link them to the cell types that are relevant. And in a parallel paper from Joe's group, the same set of brain regions from the same donors were interrogated using whole-genome bisulfite sequencing on individual cells as well as single-cell Hi-C so that you can interrogate both dynamic methylation pattern changes as well as chromatin organization changes in the genome.

I just want to highlight how these maps have enabled us to gain a better understanding of a variety of neurological diseases. This is basically an analysis known as LD score regression analysis, which pinpoints the disease-relevant cell types for two dozen neurological diseases and traits. For example, we were able to link schizophrenia to disruption of enhancers of a variety of neuronal cell types, including both excitatory neuron types and inhibitory neurons in the cortical areas as well as subcortical neurons. Alzheimer's disease in the same analysis is linked exclusively to a disruption of regulatory elements discovered in the microglia population. And major depressive disorder, for example, is linked to a small number of excitatory neurons. And bipolar is linked to a different subset of excitatory neurons and inhibitory neurons.

So I'm sure this type of advance is now going to shed new light on the disease mechanisms. We also, as Joe mentioned, developed and incorporated the latest machine learning algorithms. In this case, using a transformer model, we are able to predict regulatory sequences given any DNA sequence, and in particular their cell type-specific usage in the brain cell types. So we can see this is a snapshot. The blue is the observed open chromatin while the brown is the predicted one. You can see this is a region of about 130 kilobases. The prediction matches the observed quite well. In fact, the Pearson correlation coefficient is approaching 0.8 on average. While we are making this prediction--

ERIN GRAY: About three minutes remaining. Thank you.

BING REN: Thank you. We are making this prediction so that we can better interpret other disease risk variants. And now Nelson will use the remaining time to discuss the enhancer challenge. Nelson, you're muted.

NELSON JOHANSEN: Sorry. So yeah, I'll keep it brief. The great work that Bing and Joe just talked about describes these atlases of the epigenome that allow us to characterize cell types through the lens of enhancer elements. And one problem in the field is that the tools that exist to predict enhancers for cell types are not perfect. And so utilizing these atlases of epigenomics, transcriptomics, and cell typing, we have an opportunity to figure out, or understand, what models work best to predict cell type enhancers. And alongside that, these enhancer elements are being really carefully validated in mice to see how well they work for targeting cell populations in the whole brain. So marrying all this work together, we put together a challenge to ask teams to come forth and predict enhancers for cell types, which we can then validate using these experimental validations from the Allen Institute. Next slide.

Really briefly, this validation is great work from a lot of my colleagues at the Institute, and it involves taking enhancer elements, injecting them into mice, and looking to see what cell types light up with SYFP fluorescence to understand kind of the targeting pattern of enhancers in the whole brain. Next slide. So we asked teams to take all this beautiful data, the multiomics, the snm3C, external data from four species in motor cortex, train their machine learning models, rank enhancer lists per cell type, and then validate those results using these validation experiments I just talked about. We had six great teams come together and really apply a diversity of methods - next slide - ranging from deep learning models, as you've heard about before, using DNA sequence to predict what open chromatin is specific to a cell type, to also including priors on what describes functional enhancers: how do you use the Hi-C and the RNA together? There were also approaches that didn't use too much machine learning. They just looked at the kind of characteristics that make an enhancer specific to a cell type. And really, we've found that there are a lot of priors and tools that work, and how we can utilize that, I think, is really important to the field in the future. Next slide, Bing.

BING REN: Yeah. Just to summarize, we discussed strategies for gene regulatory code analysis focused on the diverse brain cell types. We discussed the cell-type-specific regulatory program in the mouse brain and in the human brain and how these maps are now enabling scientists to try to test various enhancer prediction programs, matching that with the experimentally determined enhancer activities. So we would like to conclude with acknowledgement. Nelson, do you want to say a few words here?

NELSON JOHANSEN: Yeah, of course. The work that I talked about was a really great team effort at the Allen Institute. It took a lot of people coming together to produce these validation results. So big thanks to all the teams and everyone for that effort.

BING REN: And for the work that Joe and I discussed, it also comes together with a large team from multiple labs, a broad set of institutions. I won't have time to read their names, but I'll leave the slides here, and we thank everyone for your attention.

ERIN GRAY: Wonderful. Thank you very much for that great presentation. These multi-omics atlases you've generated are really quite impressive, and it's really great to see them being applied to neurological and neurodegenerative diseases. I don't see any questions in the Q&A, and we are a bit over on time for this session. So I'm going to go ahead and move us into the next session. But thank you again. That was a really great presentation. So our panel discussion coming up is on brain cell atlases for neuroscience research. So we have about 30 minutes, so we'll end around 1:20 Eastern. This panel will be led by Dr. Fenna Krienen at the Princeton Neuroscience Institute and Bosiljka Tasic at the Allen Institute. So I will turn it over to Fenna and Bosiljka. Thank you.

FENNA KRIENEN: We're really excited to be moderating this exciting session. I'll just pull up another slide to show you where we are in the program. So what we're going to do now is invite about seven short, 5-minute-ish presentations to kind of frame the major questions that we think would be really important to cover in this session. And so we will go forward with introducing the following speakers, and then we will convene with a larger group of folks to have a more extended panel discussion. And many thanks to the note-takers in this session. So without further ado, can I invite Mike Skinnider, who is going to kick off our short presentation series?

MICHAEL SKINNIDER: For sure. Thanks, Fenna. You can see that, I'm assuming? Great. So thank you for the opportunity to talk a bit. I'm going to just briefly talk about some of my work on developing computational approaches to comparative brain cell atlases. And the context, of course, is that with increasing throughput and decreasing price of single-cell and spatial genomics, we're seeing an increase in the number of comparative brain cell atlases that span multiple experimental conditions. And I think this is a very exciting development because these comparative atlases are allowing us to decipher the shared and cell type-specific responses to biological perturbations such as disease, drug treatment, genetic perturbation, or even organism-level behaviors or experiences. So my own work in this area has focused primarily on spinal cord injury. For example, my colleagues and I developed a comparative single nucleus atlas of the lumbar spinal cords of paralyzed mice that were subjected to six different neurostimulation protocols. And more recently, we developed a comparative atlas of the spinal cord injury lesion site itself that encompasses 18 experimental conditions, including a range of injury models, severities, time points, and treatments.

And one lesson from this line of work has been that comparative brain cell atlases present very different biological questions compared to single-condition atlases. And I think the three important questions to ask in any comparative atlas are, are there changes in cell type proportions, what cell types are responding most strongly to a given perturbation, and what genes are differentially expressed in response to this perturbation, both within and across cell types? Answering these questions requires conceptually appropriate computational methods, and so I wanted to talk about two tools that I've developed to help address the latter two of these questions, i.e., which cell types are undergoing the most profound transcriptional response to a perturbation and what genes are differentially expressed in those cell types. So when I say identifying cell types that are undergoing the most profound transcriptional response to a perturbation, what exactly does this mean? Well, a concrete example might be that we've collected comparative brain cell atlases from mice that have been exposed to a stimulus versus unexposed mice, and we want to identify subtypes of neurons that are undergoing some transcriptional response to that stimulus on the basis that those neurons might be involved in a neural circuit that's relevant to that response.

And so to answer this kind of question, my colleagues and I developed a machine learning method called Augur, which takes a supervised classification approach, and it essentially tries to predict the experimental condition that any given neuron came from based on the gene expression. And so the intuition is that this classification task will be easy when neurons of that particular subtype are transcriptionally distinct between two conditions. But conversely, it will be very difficult if those neurons are transcriptionally similar, i.e., they're not actually responding transcriptionally to the stimulus. So we applied Augur to this comparative single-nucleus atlas of the lumbar spinal cord that I mentioned earlier. And we used Augur to identify a subtype of neurons that enables the recovery of walking after paralysis, which we then validated experimentally. So a logical next question is, what are the neuron subtype-specific gene expression changes that enable these neurons to participate in that functional recovery? So the standard approach to answer this kind of question is differential expression or DE analysis. And there are many different statistical methods to perform DE analysis of bulk or single-cell RNA-seq data. But not all of these methods are equally appropriate. And in fact, my colleagues and I found that some of the most widely used methods to perform DE analysis of single-cell data can actually produce hundreds of false discoveries in null comparisons without any biological differences. And we confirmed this finding through a range of orthogonal analyses, including simulation studies, re-analysis of published data sets, as well as a prospective RNAscope experiment, which established that these putatively DE genes were indeed false discoveries. And we created an R package called Libra that implements statistically appropriate methods for single-cell DE analysis.
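The classification intuition described above can be illustrated with a minimal sketch. This is not Augur's implementation: it swaps in a simple nearest-centroid classifier, scores separability as a cross-validated AUC, and runs on simulated toy data. The function names (`separability`, `simulate`) and all parameters are assumptions made for illustration only.

```python
import random
from statistics import mean

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def centroid(rows):
    """Per-gene mean expression across a list of cells."""
    return [mean(col) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def separability(cells, labels, folds=3, seed=0):
    """Cross-validated AUC for predicting condition (0/1) from expression.

    Stand-in classifier: nearest centroid (an assumption, chosen for brevity).
    """
    idx = list(range(len(cells)))
    random.Random(seed).shuffle(idx)
    scores, truth = [], []
    for k in range(folds):
        test = idx[k::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        c0 = centroid([cells[i] for i in train if labels[i] == 0])
        c1 = centroid([cells[i] for i in train if labels[i] == 1])
        for i in test:
            # Higher score means the cell sits closer to the condition-1 centroid.
            scores.append(sq_dist(cells[i], c0) - sq_dist(cells[i], c1))
            truth.append(labels[i])
    return auc(scores, truth)

def simulate(n, shift, genes=5, seed=0):
    """Toy cell type: n cells per condition; condition 1 is shifted by `shift`."""
    rng = random.Random(seed)
    cells, labels = [], []
    for label in (0, 1):
        for _ in range(n):
            cells.append([rng.gauss(shift * label, 1.0) for _ in range(genes)])
            labels.append(label)
    return cells, labels

# A responding cell type is easy to classify; an unresponsive one is near chance.
responding = separability(*simulate(40, shift=3.0, seed=1))
unresponsive = separability(*simulate(40, shift=0.0, seed=2))
print(f"responding AUC: {responding:.2f}, unresponsive AUC: {unresponsive:.2f}")
```

Ranking cell types by this kind of score is the idea behind prioritizing which populations respond most strongly to a perturbation; the real tool should be consulted for its actual classifier and subsampling scheme.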

So in summary, I think that the emergence of comparative brain cell atlases is very exciting. It's going to help us address new biological questions. So for example, what neurons are participating in the responses to a given stimulus? And I've told you about two computational tools that can help decipher the shared and cell type-specific responses to a perturbation within comparative brain cell atlases. And these tools are both open source and they're available via GitHub. So if they sound potentially useful to your work, please do check them out.

BOSILJKA TASIC: Thank you, Michael. She will, I think, proceed directly. Maybe we don't need to announce every single one. I don't know, Fenna. We didn't discuss how we're going to do it. But Ronna, please proceed.

RONNA HERTZANO: Thank you. I'm going to talk with you today about NeMO Analytics and how we can use it to visualize health and disease states, as well as really reutilize published data. So our goal in the NeMO team has been to increase the reuse and accessibility of multiomic data and maximize the utility of hard-to-acquire data sets. This work started from the ear field with a portal named the gEAR that was published a few years ago and then extended to the BRAIN Initiative. The goal is to make the data findable, accessible, interpretable, and reusable. In the NeMO Analytics portal, we currently have over 1,000 data sets. The data sets come from either individual users that upload their data or from public sources where we pick up the data. And there are various data modalities, essentially multiomic data, that are then curated and presented in what we call profiles. So one of the unique things about NeMO Analytics is that it allows you to see multiple data sets in the same page. So you can see here every data set is presented in a box. And so you can have multiple data sets from different species or different modalities presented in one page. So here you see Patch-seq, spatial transcriptomics, ATAC-seq, and single-cell RNA-seq. These are all presented in one page side-by-side in the NeMO Analytics tool. We collected data sets from the BICCN. Specifically, we supported the BICCN motor cortex package. For example, there are pages that support Alzheimer's disease research.

More recently, we started supporting the SCORCH consortium, which is for single-cell opioid responses in the context of HIV. And what we have is not only the data sets presented side by side, but every data set has link-outs to the manuscripts that generated them. You can download the data, so it's a very easy way to not only identify data in a centralized place, but also get access to the raw data or the processed data, and also change your display. So any data set can be presented in multiple displays, anywhere from SVGs, which are colorized cartoons based on gene expression, to t-SNE, UMAPs, dot plots, heat maps, violin plots, volcano plots, and so on and so forth. And the user can choose their display and customize how they look at the data sets and how the data sets present in their screen. Our analysis tools consist of a comparison tool, a single-cell workbench that allows users, just within a point-and-click interface, to analyze and navigate the single-cell data sets, and multi-gene displays.

And most recently, we added transfer learning. So the transfer learning package that we implemented was developed by Carlo Colantuoni, and this is the projectR package. And essentially, what we've implemented, using a point-and-click interface, allows users to take any gene cart, whether it's weighted or unweighted, and project it on any data set that exists in NeMO Analytics. So you can imagine a gene cart that is markers for specific cell types, and let's say you are looking at a new data set where you're not sure about specific cell types. It could be a gene cart that is associated with a certain process that is happening in the brain: differentiation, cell death, change in cell phenotype, and so on and so forth. So with this, I want to give a shout-out to the team. This is a large team of people that have been working together now for almost a decade. The lead engineer is Josh Orvis. And I hope that anybody who's interested will access the portal, and we're also happy to provide tutorials and also help in uploading data sets. Thank you.
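As a rough illustration of projecting a gene cart onto a data set, the sketch below scores each cell as a weighted sum of its expression over the cart's genes. This is a simplified stand-in and not the projectR algorithm or the NeMO Analytics implementation; the function names, weights, and expression values are hypothetical toy inputs.

```python
def project_gene_cart(cart, expression):
    """Score each cell as the weighted sum of its expression over the cart's genes.

    cart: dict mapping gene -> weight (hypothetical weights).
    expression: dict mapping cell id -> dict of gene -> expression value.
    Genes missing from a cell's profile contribute zero.
    """
    scores = {}
    for cell, profile in expression.items():
        scores[cell] = sum(w * profile.get(gene, 0.0) for gene, w in cart.items())
    return scores

def unweighted(genes):
    """An unweighted cart is treated here as a cart with all weights equal to 1."""
    return {gene: 1.0 for gene in genes}

# Toy weighted cart: positive weights for inhibitory markers, negative for an
# excitatory marker (weights are made up for illustration).
cart = {"Gad1": 1.0, "Gad2": 0.5, "Slc17a7": -1.0}
expression = {
    "cell_a": {"Gad1": 4.0, "Gad2": 2.0, "Slc17a7": 0.0},
    "cell_b": {"Gad1": 0.0, "Gad2": 0.0, "Slc17a7": 5.0},
}
scores = project_gene_cart(cart, expression)
print(scores)
```

In this toy case, `cell_a` scores high and `cell_b` scores low, which is the kind of signal that could flag candidate cell types in an unannotated data set.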

BOSILJKA TASIC: Thank you, Ronna. I'm next, I think. Fenna, is that correct?

FENNA KRIENEN: Yes. Yeah. You're next. And I'll just say, if anyone has questions, please put them in the Q&A. We won't stop in between presentations to address them, but we could address them during the panel, or speakers can type in answers as others are going. Go ahead, Bosiljka.

BOSILJKA TASIC: I'll share my screen. Can you see my screen properly?

FENNA KRIENEN: Looks great.

BOSILJKA TASIC: Okay. So I will present sort of a challenge that we are facing with a sister consortium to BICAN, which is called the Brain Armamentarium consortium. And this is really a baby consortium. It's a new thing, and I think it's quite relevant that Doug Kim, who is a program officer of this consortium, encouraged me to show how BICAN and BICCN data will be used for defining and characterizing tools that this armamentarium consortium tries to make. So the BICCN and BICAN are trying to define cell types, but in order to define function of cell types and in order to study them in different modalities, we would like to provide experimental cell type access. And this is what the armamentarium consortium is doing. This is an initiative that you can read more about at the NIH website. It's called Armamentarium for Precision Brain Cell Access, and the program officer is Doug Kim. So inspired by yesterday's session, I just wanted to sort of give an overview of what we are facing as a new consortium. Of course, as a researcher, we will be collecting data and metadata, but we would like to synchronize them with maybe the help of a curator. Then we would like to extract features and arrive to knowledge.

But you can imagine, as in any new consortium, adoption of the common metadata and data can be complicated-- or common metadata and data formats. And then we are faced with this, and I'm just giving an example from the Allen Institute, but of course, there are a number of atlases that BICCN and BICAN have generated which provide us with these daunting taxonomies of like 5,000 clusters, supertypes. Maybe 34 classes, that seems manageable. But how are we going to ask everybody who is characterizing genetic tools to use this? And this is a challenge. This is definitely a challenge. And I will provide baby steps. I will give you just an insight into baby steps we're trying to do to synchronize our data collection for tools for cell type access. So what I will just present briefly are the common metadata and data, and really focus on the minimum to facilitate adoption. So what we have created so far are descriptions for the types of players in these experiments: molecules, which is, let's say, DNA, for example, that will be the genome of the virus; delivery vehicle, which can be a virus; subject, which can be, let's say, a mouse or a primate; and then procedures. And procedures vary really quite dramatically here because the way you deliver a virus to an animal and the way you analyze are not always a single-cell genomics procedure, but can be maybe just imaging where you're imaging for two markers.
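One way to picture the four kinds of players described above is as a small record structure. The sketch below is purely illustrative: the class and field names are assumptions and do not reflect the consortium's actual metadata schema, and the example values are made up.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class Molecule:
    name: str
    kind: str  # e.g. "DNA" (the genome carried by the virus)

@dataclass
class DeliveryVehicle:
    name: str
    kind: str  # e.g. "virus"
    cargo: Molecule

@dataclass
class Subject:
    species: str  # e.g. "mouse" or a primate species

@dataclass
class Procedure:
    kind: str  # e.g. "delivery", "imaging", "single-cell genomics"
    description: str

@dataclass
class ToolExperiment:
    """Hypothetical minimal record tying the four players together."""
    vehicle: DeliveryVehicle
    subject: Subject
    procedures: list = field(default_factory=list)

    def missing_fields(self):
        """List required text fields left empty, as a minimal completeness check."""
        required = {
            "vehicle.name": self.vehicle.name,
            "vehicle.cargo.name": self.vehicle.cargo.name,
            "subject.species": self.subject.species,
        }
        return [key for key, value in required.items() if not value]

record = ToolExperiment(
    vehicle=DeliveryVehicle(
        name="example-AAV",  # placeholder name
        kind="virus",
        cargo=Molecule(name="example-enhancer-reporter", kind="DNA"),
    ),
    subject=Subject(species="mouse"),
    procedures=[
        Procedure(kind="delivery", description="systemic injection"),
        Procedure(kind="imaging", description="imaging for two markers"),
    ],
)
print(asdict(record)["subject"], record.missing_fields())
```

Keeping the required fields to a minimum, as in `missing_fields`, mirrors the stated aim of focusing on the minimum needed to facilitate adoption.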

So I have sort of tried to facilitate adoption by making things relatively simple, focusing initially on mouse and 43 coarse regions, and major cell divisions and classes, with four divisions and 34 cell classes. For each one of these, we're trying to adopt common names: for example, for constructs, plasmids, etc.; for delivery vehicles, for example, viruses; for subjects, of course, mouse, rat, etc.; and then procedures. And then when it comes to adopting names of brain regions and cell types, again, we are doing baby steps with 43 coarse regions and four divisions and 34 classes. Of course, the hope is that we will ultimately be able to adopt common coordinate frameworks, for example, the Allen Common Coordinate Framework with all these cortical and subcortical areas and fiber tracts and ventricular structures. And then the full taxonomy, perhaps through the Allen Institute's MapMyCells or other tools, if people have single-cell data. So with this, I will stop here. I don't want to overwhelm you. I just want to sort of show the challenge we are facing and how we are trying to approach it for our sister consortium. I can stop sharing, or?

FENNA KRIENEN: Great. Yes. Thank you. That was great.

BOSILJKA TASIC: Or can somebody just start sharing? I'm just not sure.

FENNA KRIENEN: I think I can take over.

BOSILJKA TASIC: Yeah. Stop share. Yeah.

FENNA KRIENEN: Yeah. Okay. Great. So just on the heels of that really lovely presentation by Bosiljka presenting the overall goals of the armamentarium, I want to share some of our efforts on leveraging these cell type atlases for a similar goal developing new tools, but specifically in the really challenging case of animals, for which we don't have a robust genetic toolbox already. And so of course, we need cell type tools for any species, even mouse, but it's really especially acute for primate models. And so we and others have been really excited about the ability to engineer viruses. And Bosiljka mentioned a little bit about this. Also, Nelson and his Enhancer Challenge presentation kind of framed this approach. But the idea is that you can package in AAVs a transgene that is controlled by a regulatory element such as an enhancer, and in that way, restrict expression of your transgene or your payload into the cell type of interest. And again, the armamentarium is kind of front line in developing these tools for many species. The first step for us, as for so many now, is to leverage the single-cell multiomic atlases that are, again, made possible by the BICCN and by the BICAN. In our case, we're producing an initial census of the marmoset brain. So we nominate candidate enhancers for particular cell types based on these atlases. And in this example, for instance, we were excited to find candidate enhancers for this novel TAC3 positive interneuron type that we had discovered in striatum a few years ago in primates. And by generating an AAV under the control of such enhancers, we could systemically or locally inject the virus into marmoset striatum and then start to label and get an initial glimpse of this morphology of this intriguing cell type that we hadn't seen before.

But with the new and beautiful multiomic atlases that Bing and Joe and others have now kind of shown us and Mike earlier in another system, nominating enhancers is no longer really so much the challenge as is screening and validation of the candidates, particularly in, again, this primate context. So in our workflow, we create this library of candidates, but we create constructs where each enhancer AAV also carries a DNA barcode. We package these all in AAVs, and then they're injected as pools, again, either systemically or locally into another marmoset so that we can use single-cell sequencing to read out those barcodes and understand or measure the predicted cell type specificity and whether that was achieved in vivo. So this enables us to test many enhancers in a single animal, which is really important in the context where the animal resource is so limited and precious. So here's just an example of that approach. So after the nomination, we pool, make all the viruses, inject them here in a systemic preparation made available by this BI 103 new capsid by Ben Deverman's group at Broad so this can be systemically delivered to marmosets, and using single molecule FISH to characterize, in this particular case, an enhancer that we think is selective to deep layer glutamatergic neurons. Of course, doing that one at a time, again, is not the scale that we're looking for. In parallel, though, we package the same elements, so the marmoset enhancer, if you will, and deliver that systemically to mouse. And you can see this beautiful specificity in layer five of the neocortex, which highlights another feature that was mentioned a little bit earlier, that many are evolutionarily conserved. Some, but not all. And you can also appreciate how there's often this really exquisite regional specificity of these elements. So it's really restricted just to cortex, even though it's a systemic delivery.
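Illustrative sketch (not from the talk): the pooled barcode readout described above can be thought of as a table of barcode counts per enhancer and per cell type, from which an on-target fraction is computed. The enhancer names, cell type labels, and counts below are hypothetical, and the on-target fraction is only one possible specificity measure, not necessarily the one the speaker's group uses.

```python
def enhancer_specificity(barcode_counts, target_types):
    """Fraction of each enhancer's barcode counts found in its intended cell type."""
    result = {}
    for enhancer, counts_by_type in barcode_counts.items():
        total = sum(counts_by_type.values())
        on_target = counts_by_type.get(target_types[enhancer], 0)
        # Guard against enhancers whose barcode was never detected.
        result[enhancer] = on_target / total if total else 0.0
    return result


# Hypothetical barcode counts recovered by single-cell sequencing,
# grouped by the cell type each barcode-carrying cell was assigned to.
barcode_counts = {
    "enh_A": {"L5_glut": 90, "interneuron": 5, "astrocyte": 5},
    "enh_B": {"L5_glut": 30, "interneuron": 40, "astrocyte": 30},
}
# Cell type each candidate was nominated for (also hypothetical).
target_types = {"enh_A": "L5_glut", "enh_B": "interneuron"}

specificity = enhancer_specificity(barcode_counts, target_types)
for name, value in specificity.items():
    print(name, round(value, 2))
```

In real data the counts would likely also need to be normalized by how many cells of each type were sampled, since abundant cell types otherwise dominate the totals.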

We don't think that screening in mouse can replace screening, especially in vivo, in a primate, but it's an important part of this toolkit. So our goal and several others' is to develop functional tools, eventually, for primates, but of course, they can be useful in their own right in the mouse as well because it bypasses the need for complex breeding schemes and Cre and so forth. So we've used some of our enhancer candidates to control behavior, so for instance, optogenetically inducing this ipsilateral rotation behavior by stimulating the indirect pathway of the striatum. We can also use it to drive the expression of sensors, and of course, others are doing this as well. But in this example from Guoping Feng's lab, using a ChAT enhancer to restrict expression of GCaMP to cholinergic neurons in the striatum and recording during behavior. So I'll just close by recognizing we heard from Josh Huang on Tuesday how essential connectivity is for understanding what a cell type is and does, and that cell types are really defined by their relationships with other cells. I think we'll hear more about this later on. But input mapping, for instance, using rabies has been a really critical tool for systems neuroscience. And that value would extend to the primate context as well. So another major goal would be to use enhancer AAVs in the context of synaptic tracing, for instance, using the enhancer to restrict the population of available starter cells that the rabies can then trace the inputs to. So these efforts are early, but they're underway.

So just to summarize, and again, I think we'll hear some of these themes again and discuss them in the panel discussion, but we think this is a very powerful set of tools for neuroscience in any species, but particularly necessary for NHPs with many applications. But there are ongoing challenges. And Bosiljka highlighted the last one here, just the criteria by which we decide that something is specific, efficient, where it is, how do we label the metadata, but also more technical challenges like ensuring robust expression while maintaining specificity and the challenge that each individual species may have different delivery needs, different AAV tropism, and efficiency of capsid design and engineering needs to be critiqued in some senses. So let me just close by acknowledging the broader team, and then we'll move on to the next short speaker. Thanks.

BOSILJKA TASIC: Thank you, Fenna.

FENNA KRIENEN: I think Jeremiah Cohen is next.

JEREMIAH COHEN: Yes, let me share a screen. Okay, let me know if it's not working. Great. So thanks for the opportunity. And thanks to Fenna and Bosiljka for organizing this part of the workshop. So I want to discuss kind of a case study of how we can start to use the sorts of tools and techniques that we've been discussing in the last couple of days to really make deep structure function links in the brain. So here the structure part has really led the way, as we've seen, for example, in the work that Hongkui highlighted a couple of days ago. But really to understand the function of these systems and whether, as John pointed out, we can start testing these cell-type hypotheses, we need links to function. And that includes physiology and behavior. So the sort of vignette that I'll show briefly is about some special populations of cells that we care about, which are sometimes called neuromodulators. And these are small groups of cells largely in the brainstem that release special neurotransmitters, norepinephrine or serotonin or dopamine. And they're made up of very small numbers. In your brains, you have only about 50,000, for example, norepinephrine cells that supply most of the nervous system with that transmitter. And their diversity is something that we're starting to appreciate again here based on the wonderful atlases that have been developed starting really in the mouse.

Now, classically, in this case, the norepinephrine cells in this small pontine structure called the locus coeruleus have been considered kind of a monolith. The mouse only has about 1,200 of these on each side, and it turns out that they innervate as a group essentially the whole brain, with the exception of the basal ganglia, whereas dopamine neurons, by contrast, in the ventral midbrain, massively innervate the basal ganglia but avoid a lot of the structures that these locus coeruleus cells supply, including the isocortex. So in our experiments, we train mice to do tasks in which they have to learn on the fly from recent experience. And so here's an example of what one of these experiments looks like where the animal has to make a series of choices where the outcomes of the actions that it makes are rewarded with changing probabilities over time. So it has to forage in this environment for reward as the environment changes. And so we make extracellular electrophysiological recordings from these identified norepinephrine cells using techniques where we can ask each cell whether it's a norepinephrine cell or a cell of some other type, and here we found two completely non-overlapping shapes of action potentials of these norepinephrine cells in the locus coeruleus. So there's one wide type and one narrow type, and it turns out that these two types do essentially different things. They have different functions in the behavior. So the wide type carries a key learning signal that is called reward prediction error that we think drives learning from the outcomes of recent experience, and then the narrow spiking cells have a separate response: they're excited when the mouse does not get a reward, and that predicts changing future actions based on these responses.

So this is kind of what the dynamics of these two populations of physiologically defined subclasses look like, these kind of non-overlapping responses as a function of key learning variables. And so now we want to map these really onto the structure and the cell types and know whether this sort of tests the transcriptomic hypothesis that John proposed a couple of days ago. So one key question, of course, is whether these are distinct projection systems. So it turns out the calcium dynamics of the axons in the frontal cortex, which is a key structure for learning in this task, look like this first subclass, these wide spiking norepinephrine cells, and we can compare that also using a sensor for norepinephrine dynamics in that same structure, and it also matches this first type. And so now we can start really deeply linking to structure. And so here are the first 10 examples, as far as I know, of complete morphologies of norepinephrine cells in a mammal, and it turns out that they are probably modular. There are a few examples that project to the isocortex almost exclusively, others that project to the brainstem and spinal cord, and then others that supply norepinephrine to the cerebellum.

And so together with other colleagues at the Allen Institute, in this case for Brain Science, we're working on patch-seq experiments in slice to map transcripts to electrophysiological properties of these norepinephrine cells, and as you heard from Xiaoyin a couple of days ago, spatial transcriptomics to map the locations of the cells to the transcripts and their projection targets. And then together with several colleagues within the Allen Institute for Neurodynamics and Brain Science, including using some of the tools that Bosiljka just described, we want to ask these key questions about these neuromodulatory systems and really deeply link structure and function. And so this is, I think, a sort of a case study in how we can start testing the cell type hypotheses that have come largely from the transcriptomic data. But I think we need to, in the case of the nervous system, extend to function. So thanks very much.

BOSILJKA TASIC: Thank you, Jeremiah. I hope we can give all of these functional types transcriptomic names, corresponding transcriptomic names. Thank you. Scott is next.

SCOTT STERNSON: All right, great. Well, thanks a lot, Bosiljka and Fenna. It's great to be included in this really interesting panel. My lab is interested in the relationship between cell types, their functional activity, and the roles in controlling behavioral states. RNA-seq has revealed that there's really enough gene expression similarity between individual neurons that these can be described by a limited number of molecularly defined cell types, and analogously, electrical activity recordings in awake behaving animals show that many neurons in a brain area have similar activity patterns and can be clustered into functionally defined groups. And it really raises the question of whether there is a relationship between these two modalities. Do the molecular gene expression groupings have any correspondence to the neurons with correlated dynamics? And related to this, to what extent do groups of molecularly defined neurons contribute in a predictable way to behavioral states? And I think it's worth stepping back and laying out some general possibilities for how molecularly defined neuron types might be related to behavioral states. And maybe one of the most simplistic but least flexible coding schemes is a labeled line configuration where members of a cell type have similar dynamics and a single cell type encodes a particular behavioral state. Now others have taken the opposite view that although molecularly defined types exist, each cell's activity and its role is independent of its cell type in this framework. And then an intermediate view is that behavioral states are encoded by an ensemble of groups of molecularly defined cell types. So like the labeled line, a cell type has similar tuning as other cells of its molecular type. However, behavioral states are encoded not by a single type of neuron, but by specific combinations of these cell type groups.

And so how do you really test this? And there's probably not-- the answer is probably going to be variable depending on brain region and behavioral conditions. So to address this challenge several years ago, we developed a method called CaRMA imaging, which stands for calcium and RNA multiplexed activity imaging. It's a method where we image through a GRIN lens deep in the brain calcium dynamics without regard to cell type. We test the same neurons, look at their dynamics across many behavioral states, and we take out the brain, slice it up, and in the brain sections, find the same neurons ex vivo that we imaged in vivo. And then we can take the marker genes that come from cell type classification associated with single-cell RNA sequencing and perform multiple rounds of fluorescence in situ hybridization to retrospectively assign cell type identities and molecular profiles to the cells we had imaged in vivo. And so this is a good way to rapidly, and even in a single animal, establish the functional roles of all the different cell types from a particular brain region that are identified by RNA sequencing. And so we use this approach to image the calcium activity of ensembles of the same hypothalamic neurons across 11 different behavioral states. And then we did 12-plex FISH on the same neurons afterwards. This is work done by Shengjin Xu in my lab, who now has his own lab at Institute of Neuroscience in Shanghai, and Hui Yang, who's a med student now at Einstein. And so we then have this vector of gene expression as well as the functional activity across all these different behavioral states. We can cluster the neurons solely based on their gene expression profiles, and then we can look across at the corresponding functional responses of individual cell types across these 11 different behavioral states where blue is inhibited, red is activated, and white isn't changed.

And what's interesting is that in some cases, we see molecularly defined groupings that show similar activity patterns within a group. And in other cases, it's more different. So we can use this type of data to figure out what cell types contribute the most to different behavioral states. And in the study I'm talking about right now, we used multinomial logistic regression to determine the contribution of individual cell types to predicting and distinguishing these 11 different behavioral states. Essentially, the output of this is represented here, where we can measure the importance of a cell type for distinguishing behavioral states by this arrowhead size. We can also show the response sign and the magnitude of the response for neurons represented by line color and thickness to generate what we call molecularly defined response to coding diagrams for each behavioral state. And I'm just showing this for 4 out of the 11 behavioral states in this study. You can see that many of the cell types show similar activity patterns as one another, and they have different contributions to distinguishing behavioral states. But the fact that we can even have these types of diagrams and make predictions based solely on gene expression profiles is really a simplification for understanding this brain region that arises from the fact that many of the cells within a particular type have similar activity patterns and so contribute in a consistent way to the neural ensemble. It really reduces dimensionality of the neural ensemble, which, for instance, makes it valid to use methods like optogenetic manipulations at the cell type level, which is widely done in neuroscience, but there really hasn't been a concrete neurobiological basis for justifying this without these types of experiments.
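Illustrative sketch (not from the talk): multinomial logistic regression, the method named above, can be written in a few lines of plain Python. Here each feature is assumed to be the mean response of one molecularly defined cell type and each label is a behavioral state. The toy responses, the three states, and the choice of learning rate and epochs are all assumptions made for illustration, not values from the study.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(W, b, x):
    """Linear score of each class for one feature vector."""
    return [sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(len(W))]


def fit_multinomial_logistic(X, y, n_classes, lr=0.5, epochs=500):
    """Fit softmax regression by batch gradient descent on the cross-entropy loss."""
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    n = len(X)
    for _ in range(epochs):
        grad_W = [[0.0] * n_features for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, label in zip(X, y):
            probs = softmax(class_scores(W, b, x))
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                grad_b[k] += err
                for j in range(n_features):
                    grad_W[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * grad_b[k] / n
            for j in range(n_features):
                W[k][j] -= lr * grad_W[k][j] / n
    return W, b


def predict(W, b, x):
    """Return the index of the most probable behavioral state."""
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical data: columns are mean responses of three cell types,
# rows are trials, labels are behavioral states 0, 1, 2.
X = [
    [1.0, 0.0, -0.5], [0.9, 0.1, -0.4],
    [0.0, 1.0, 0.0], [0.1, 0.8, 0.1],
    [-0.5, 0.0, 1.0], [-0.4, 0.1, 0.9],
]
y = [0, 0, 1, 1, 2, 2]

W, b = fit_multinomial_logistic(X, y, n_classes=3)
predictions = [predict(W, b, x) for x in X]
print(predictions)
```

One rough way to read the fitted model is that a large absolute weight `W[state][cell_type]` suggests that cell type matters for recognizing that state. The importance measure actually used in the study may be defined differently.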

One other thing before I close is that we can also analyze these marker genes for their ability to predict functional activity. For example, after clustering the functional responses to this ensemble of neurons - and so now we're clustering neural activity by calcium dynamics - we see an inhibited and an activated population, and we can predict the functional response type just from the level of expression of different genes. And we can then go on and quantitatively assess the relative importance of individual genes or combinations to this predictive accuracy. For instance, here we see the NPY1 receptor, a neuromodulatory receptor, typically predicts the neurons that are inhibited by this hunger-inducing hormone that we'd injected into these mice. And I won't tell you really any more data, but just to mention that if we look across the other behavioral states that we observed, in 10 out of the 11, the NPY1 receptor is the most predictive gene, indicating that CaRMA imaging can help us find genes that are useful for accessing the ensemble activity groupings in a brain region.

So for us, this is really exciting. CaRMA imaging combines systems neuroscience with molecular information in deep brain structures and allows us to use marker combinations from single-cell RNA sequencing seamlessly in the context of in vivo dynamics. Obviously, this is early days. More markers will improve predictive power. We think it's especially important to look at neuromodulatory receptors, but also, for example, projections using axon projection barcoding and reading it out with endogenous gene expression. This method gives really quantitative metrics to relate gene expression to neuron activity. We defined a whole series of metrics in this paper to describe the degree to which cell types are a sort of coherent functional grouping. It's important as well to identify optimal gene sets and how their expression relates to predicting neuron responses. I told you about the NPY1 receptor. But more generally, this approach generates these kinds of high-dimensional, unbiased data sets that integrate gene, cell type, and behavior. And I think this ultimately should increase the opportunities for new discoveries in neuroscience. And with that, I'll just end here.

BOSILJKA TASIC: Thank you, Scott. And we are running a little bit out of time. But Marta, please go ahead. I think it's okay.

MARTA SODEN: Great. I'll be quick. Can you see my screen okay?

BOSILJKA TASIC: Yes, thank you.

MARTA SODEN: Okay. Great. So thank you very much. So in just the few minutes I have, what I'd like to tell you all about is a project that we've been working on where we're really trying to harness transcriptomic data to answer questions about circuit connectivity and neuronal excitability. This is a much smaller scale project than kind of a lot of the big atlases we've heard about over the past few days. But what I hope to communicate in the next few minutes is just kind of the ways that we're thinking about how we can use this really powerful technology to really probe these circuit level of questions. So I'll set the stage with the kind of question that we set out to answer. So we're very interested in a brain region called the ventral tegmental area, or VTA. The VTA is home to many dopamine neurons as well as other cell populations. And we know that the VTA receives input from many, many different brain regions, dozens of different brain regions. And there's a lot of heterogeneity amongst these different inputs. So there's heterogeneity in terms of the axonal innervation patterns, how they innervate different subregions of the structure. There's heterogeneity in terms of the patterns of synaptic connectivity from these different inputs. And we know that if we optogenetically stimulate different inputs to the VTA, we get out different patterns of activated cells that we can measure just by looking at fos protein expression across the structure. So we know that when we stimulate these different inputs, we are activating different subsets of VTA neurons. But what we didn't really have a very good sense of is what is the genetic identity of these neurons, or what is their kind of molecular signature? And we thought that if we could use transcriptomics to get a better handle on that, we would then be able to kind of isolate specifically activated subpopulations and study how those subpopulations regulate downstream behaviors.

So what we set out to do was to combine optogenetic stimulation with single-nucleus RNA sequencing. So what we did was put channelrhodopsin into a number of different brain regions, and then we implanted fiber optics above the VTA so that we could stimulate those incoming axon fibers with light. We also have a YFP control group that we can use for comparison, okay? So after we do this light stimulation to activate the specific inputs, then we can dissect out the VTA tissue and do single-nucleus RNA sequencing to get that data. And then what we can do with this data set is we can kind of do a standard cluster analysis where we can kind of identify different VTA cell types and find markers for those cell types. But what we really focused on was looking at analysis of expression of immediate early genes. So these are genes like Fos, but also many, many other genes whose expression is regulated in response to changes in neuronal activity. So we can look for increases in immediate early gene expression as a signature of which cells have been activated when we stimulate each one of our inputs. So I'm just going to share a few of the types of analyses that we can perform on this data set, just a really quick survey, so we can look kind of at the bulk population as a whole. And for each one of our stimulus input groups, we found that there were many, many of these immediate early genes that showed increased expression after stimulation compared to control, okay? But we can also drill down and look at a cluster-specific level. So we can ask which specific neuronal subclusters show significant activation or increases in immediate early gene expression after stimulation. And when we do that, we see that among our different inputs that we stimulated, for each one, we saw a different pattern of clusters that were activated following stimulation, okay? So now, we can take this data and try to follow up and better understand what is the behavioral function of these different clusters that were activated.
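Illustrative sketch (not from the talk): the cluster-level comparison described above can be pictured as a per-cell immediate early gene (IEG) score, averaged within each cluster and compared between stimulated and control animals. The cluster names, expression values, and the simple summed score are hypothetical. Fos and Arc are used only as familiar examples of IEGs.

```python
from statistics import mean


def ieg_induction_by_cluster(cells, ieg_genes):
    """Mean per-cell IEG score in stimulated cells minus control, per cluster."""
    scores = {}  # (cluster, condition) -> list of per-cell IEG scores
    for cell in cells:
        score = sum(cell["expr"].get(gene, 0.0) for gene in ieg_genes)
        scores.setdefault((cell["cluster"], cell["condition"]), []).append(score)
    clusters = sorted({cluster for cluster, _ in scores})
    # Only report clusters observed in both conditions.
    return {
        cluster: mean(scores[(cluster, "stim")]) - mean(scores[(cluster, "control")])
        for cluster in clusters
        if (cluster, "stim") in scores and (cluster, "control") in scores
    }


# Hypothetical nuclei from one stimulated input group and the YFP control.
cells = [
    {"cluster": "DA_1", "condition": "stim", "expr": {"Fos": 4.0, "Arc": 2.0}},
    {"cluster": "DA_1", "condition": "control", "expr": {"Fos": 1.0, "Arc": 1.0}},
    {"cluster": "GABA_1", "condition": "stim", "expr": {"Fos": 1.0, "Arc": 0.0}},
    {"cluster": "GABA_1", "condition": "control", "expr": {"Fos": 1.0, "Arc": 0.0}},
]

induction = ieg_induction_by_cluster(cells, ieg_genes=["Fos", "Arc"])
print(induction)
```

A real analysis would work on normalized counts from many nuclei and would test each cluster's difference statistically rather than reading off a raw difference in means.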

We can also ask questions about the pattern of which specific immediate early genes were induced in each cell type and by each different stimulus. And we can also use this data to probe for ion channel genes that are associated with immediate early gene activity. So in our dopamine neuron populations, for example, we were able to pull out a set of ion channel genes whose expression is anti-correlated with immediate early gene activity and a set of ion channel genes whose expression was positively correlated with immediate early gene activity. And these different ion channels show gradients of expression across our dopamine neuron populations. So this is a way to identify candidate genes, but now we can go in and try to investigate to understand their function in regulating excitability and responsivity to these different inputs. Okay. And then finally, this is something that would be great in the future to do with more sophisticated spatial transcriptomics techniques, but even with kind of a basic, nine-gene multiplex in situ, we were able to start correlating these variable axon fiber innervation patterns that you can see here with the spatial patterns of immediate early gene expression. So we can start to do correlations and see how those two things interact with each other and how kind of the subconnectivity of neurons within the VTA is affected by these different axonal innervation patterns, okay?
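Illustrative sketch (not from the talk): one simple way to look for ion channel genes whose expression tracks immediate early gene activity is a Pearson correlation across cells, sorted by sign. The gene names `chan_A` and `chan_B` and all values are hypothetical placeholders, and the speaker's actual statistical approach is not stated in the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical per-cell values for five dopamine neurons.
ieg_score = [0.0, 1.0, 2.0, 3.0, 4.0]
channel_expr = {
    "chan_A": [1.0, 2.0, 3.0, 4.0, 6.0],  # rises with IEG activity
    "chan_B": [5.0, 4.0, 3.0, 2.0, 0.0],  # falls with IEG activity
}

correlations = {gene: pearson(values, ieg_score) for gene, values in channel_expr.items()}
positively_correlated = [g for g, r in correlations.items() if r > 0]
anti_correlated = [g for g, r in correlations.items() if r < 0]
print(positively_correlated, anti_correlated)
```

Genes in either list are only candidates. As the speaker notes, their role in excitability would still need to be tested directly.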

So just to quickly summarize, what we've been able to do is to assess what is the net effect of stimulating specific inputs into this brain region, and we can use this information to kind of better understand and tease apart the circuit connectivity and circuit function. We can identify the activation of cell type-specific transcriptional programs, which we found are variable from cell type to cell type. And we can detect correlations between these markers of activity, these immediate early genes, and genes that determine excitability, notably ion channel genes, to help us identify new research targets. So I'll just stop there and thank the people who contributed to this work. I want to especially thank Rhiana Simon, who was a huge driver of this research, and my collaborators, particularly the Stuber Lab. So thank you very much.

BOSILJKA TASIC: Thank you. Great talk. So many exciting things in this session. Maybe I think we can conclude and now open it for questions. There are already some questions in at least the chat that is accessible to all the panelists. And I don't know-- okay, I see Josh. Maybe why don't you start us off and then I can also try to read the questions in the chat.

JOHN HUANG: It's not a question. It's really a comment or a suggestion. I just feel that this session is so exciting from the work of Jeremiah Cohen and Scott. It really seems to me there's a huge opportunity here to leverage on the large-scale transcriptomics to go across the level to really discover the principles. So my appeal is that this kind of research program, I think, is very exciting, but it probably doesn't fit conventional R01s. And I think again, it requires a mechanism for funding. I think the BRAIN Initiative at the NIMH has been great in recognizing the significance and opportunities for producing these very large-scale data that lay out the landscape. But now it seems this is, to me, another very important problem and opportunity. That's all I wanted to say.

BOSILJKA TASIC: Thank you. I agree. Maybe appeal to NIH. We need large-scale functional examination connected to probably.

JOHN HUANG: Mechanisms, some organizations that's different from the very scalable approach.

BOSILJKA TASIC: And I don't know if any of the NIH folks want to chime in or respond to that. I mean, I agree with you. One thing that I think could facilitate potentially - and this is my pet thing with metadata and data synchronization - is if everybody tried to use the same nomenclature. We could actually distribute these functional characterizations across the labs and then unite them somewhere. So this is one appeal I have for all of you who have presented these amazing functional characterizations. What would it take? And just thinking even further, let's imagine you have published your papers or your papers are now being maybe reviewed. But what would it take to try, with the methods you have - you don't always have a whole genome or a whole transcriptome sequencing - to use common nomenclature? What would it facilitate for you to use common nomenclature? And I'm sorry. I don't mean to put anybody on the spot, but one of the functional talks would be lovely if anybody can chime in.

SCOTT STERNSON: Well, I mean, I think that to use functional nomenclature or some sort of consistent nomenclature is always going to be advantageous. What's basically emerging now is, for the most part, for many brain regions at least, the cell types are now pretty well mapped. And so one doesn't necessarily have to resort to homebrew solutions. You can essentially download marker genes off of the internet now. And to the extent that there's sort of agreement on cell type classifications, that can be essentially the nomenclature. I think that, as John has pointed out, all these sort of transcriptomic cell types are hypotheses. There are all these questions about, for instance, how do you cluster your cell types and so on. From my perspective, a lot of the utility of this is with respect to how much it reduces the dimensionality of how we study the brain, especially functionally. And so I think that it's important to use that common nomenclature and then potentially be able to go back and forth between the functional experiments and the classifications in order to look at whether or not they're split too far or not split far enough and that type of thing. So there's really a potential for interaction as long as we're using the same terms and the same marker genes.

BOSILJKA TASIC: I agree. I just don't know how do we get people to use exactly the same names. So that's where I'm just thinking what would it take to-- let's say if NIH required, "Hey, you got to use these names," maybe. I don't know. John, maybe I will let you chime in. I don't know if it's along the same path or a different one.

JOHN NGAI: Yes, but also going back to Josh's comment. Yeah, so I mean, Scott, I agree with you. Nomenclature is tricky. And I mean, you know the old adage that scientists would rather use each other's toothbrushes than each other's nomenclature. But it'll be a social engineering problem that might be more complex than understanding the mammalian brain. But it's something we do need to work on as a field, at least have a way that-- if there are different names, at least a way of cross-referencing them. That's been done in other cases. I mean, we just have to keep on working on it.

Getting back to Josh's point, ultimately, the tools, resources, and knowledge that are being generated by these cell-atlasing projects, for me, there's inherent beauty in it, but it's also a means to an end. It's really to provide information that we can use, as we saw in these short talks, as a preliminary way, and eventually, as a deeper way of probing neural circuits and probing mechanisms that are driving behavior. And so to Josh's point, BRAIN Initiative, actually, has a very robust brain circuits portfolio. Actually, it's the largest part of the BRAIN portfolio. And my hope is that more and more of these projects will incorporate the tools that are coming out of these big projects, the cell-atlasing project, the CONNECTS project, armamentarium project. I mean we're standing up these large projects, not just to do it all for themselves, but to actually have it enable these other types of studies. So some of these studies are getting pushed out into the ICs. I mean, there are a lot of opportunities there, but some of this work is ripe to be pursued within the BRAIN Initiative itself. So just let us know if you have good ideas. We can put you in contact with the appropriate program officers. And there are mechanisms for this. But Josh, you're right. Some of these studies, not all of them, will take a larger team effort, but at least at BRAIN, we do have mechanisms for that. So don't rule that out of hand.

BOSILJKA TASIC: Thank you, John, for that encouragement. Jeremiah?

JEREMIAH COHEN: Well, so the pithy answer to Bosiljka's question, maybe, is just self-discipline, right? We need to agree on the right terms and push them. The problem, though, is that, how do you know that you're actually talking about the same things across multimodal experiments? So the stuff that Scott presented, I think, is the closest we can get to knowing that. But even across species, how do we know that a cell in the brainstem in a mouse is the same as in one of Fenna's critters or in our brains? These are not trivial at all. So actually, I think in some ways, we have to be very careful about using nomenclature in a way sort of with confidence allocated to the right parts of an experiment. But the multimodal problem is huge and open, right? I think that this is, for the next decade or two, for us to figure out.

BOSILJKA TASIC: Yeah. No. I agree. I'm always thinking, can we at least ask people to-- and again, how would we institute that to at least give us the coarsest definition of cell type that corresponds to some taxonomy or some nomenclature? If they don't know the finest, maybe give us at least the coarsest so that we know roughly where you are. And also that all the data needs to be shared in some cell-by-gene, even if the genes are very few, some cell-by-gene tables. I think that could actually really facilitate. Anyway. Thank you, Jeremiah. Wenjin.

WENJIN JIM ZHENG: Hi. Yeah, actually, I go by my middle name, Jim.

BOSILJKA TASIC: Oh, sorry.

WENJIN JIM ZHENG: That's okay. Yeah. I know my name is like my full name. Anyway. So I think the nomenclature is just a label. I think what's really important is that, underneath, all these molecular level characteristics that can differentiate different cell types, right, like using clustering to identify all these clusters. But I think what's really important is-- like now, an important thing is really to incorporate all those different important features to represent these cell types precisely. I think about 15 years ago, my group developed what's called ontology fingerprints. So basically for each gene, instead of we just have one or a few ontology terms, you can build a vector of tens or even hundreds of ontology terms with the enrichment p-value to characterize the genes, built from all the publications about that gene. So by this kind of approach, basically you look into every possible aspect that can differentiate one cell from the other cell in a more precise way and quantitatively so that you can not only define your cell type but also compute how different they are based on these distributed representations quantitatively. I think that might be something that could be useful.

BOSILJKA TASIC: Thank you.

FENNA KRIENEN: I wonder if we could also briefly turn the question on its head because now Scott has taught us-- Jeremiah and Marta have taught us all of these unique features about some transcriptionally defined cell types. How does that information feed back into the atlases that are becoming more and more widely used and they're standardized and they're kind of molecularly based? And maybe Jim can also still weigh in on this concept, but what is needed to facilitate-- Marta now telling us something more about the inputs or more about the behavior that we didn't know already in capturing that information in a systematic way.

SCOTT STERNSON: Well, I mean, I just would go back to this question, but it's okay. Do you need more of to direct people to do it? Or is there an incentive? And I think the incentive is that we usually are evaluating things with FISH, with a limited number of marker genes. And there's strong incentive to be able to sort of go back, compute the cell type, the full transcriptome, and make predictions based on that. Some of the methods we talked about, like Marta's, that's inherent because she's doing seq, or in other cases, we're really relying on the shorthand of the marker genes. So we're going to have to-- if we want to really take advantage of all the data, and that's really the incentive, these large-scale mapping programs that integrate more and more neurons and give a sort of systematic nomenclature, I think there's a pretty strong incentive to use those and then to relate the data back to them. It also helps know where you are because there are a lot of spatial molecular details that will ultimately get added to the atlases as sort of anatomical features that are sort of brought out by these molecular gene expression properties become better appreciated.

BOSILJKA TASIC: Yeah, I just want to second kind of what Scott is saying and maybe just provide, yeah, an additional comment. Even if somebody did M-FISH with 10 genes or even 3 genes that were smartly selected or carefully selected, I think providing actually within their study a cell-by-gene matrix, and it may sound trivial, cell one, gene one, two, three, cell two, gene one, two, three, actually with a very carefully selected set of markers, that could be mapped to a complex taxonomy, believe it or not. Like in cortex, a single marker can define, let's say, CHODL interneurons. If you told me that every cell that you measured has CHODL expressed, we could map it and then we could connect it to the other functional properties. So I would just encourage maybe everybody to, for anything they can, however simple their molecular code is, to provide cell-by-gene tables. Just my short plug. Jim, and thank you for changing the way your name appears so that I don't miss.
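The cell-by-gene table described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the counts are invented, SST and PVALB are added only to make the example non-trivial, and the labels, rule order, and `min_count` threshold are hypothetical rather than taken from any published taxonomy.

```python
# Hypothetical cell-by-gene table: one row per cell, one column per measured gene.
cell_by_gene = {
    "cell_1": {"CHODL": 12, "SST": 30, "PVALB": 0},
    "cell_2": {"CHODL": 0, "SST": 25, "PVALB": 1},
    "cell_3": {"CHODL": 0, "SST": 0, "PVALB": 40},
}

# Hypothetical marker rules, ordered from most to least specific.
MARKER_RULES = [
    ("Chodl-like", ["CHODL"]),
    ("Sst-like", ["SST"]),
    ("Pvalb-like", ["PVALB"]),
]

def map_cell(counts, min_count=5):
    """Return the first label whose markers all reach min_count, else 'unassigned'."""
    for label, markers in MARKER_RULES:
        if all(counts.get(gene, 0) >= min_count for gene in markers):
            return label
    return "unassigned"

# Best-guess annotation for every cell in the table.
assignments = {cell: map_cell(counts) for cell, counts in cell_by_gene.items()}
```

Even a table this small carries enough information to attach a coarse label to each cell. A real mapping would use a reference taxonomy and a proper classifier rather than hand-written rules.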

WENJIN JIM ZHENG: Thanks. Yeah. So I think in the first day panel, I brought this up that when we talk about the integrated multimodality, the data, right, I brought up that we actually should also consider literature as an additional modality that can be incorporated into this. What I mean by literature is right now, given advanced large language models and things like that, the way we can extract the information and knowledge from literature is way different from what we used to have. And by incorporating literature-- think about if you have a few genes, you just have a gene name or maybe a description of that gene or what you know about the function of that gene. But in the literature, there is way more information about those genes than just the information we're using right now. And leveraging the large language models and the literature mining, text mining, I think we can have more information incorporated into these genes that then can be used for better defined cell types and the inferred relationships and things like that.

BOSILJKA TASIC: I agree. I don't know, Fenna, if you were able to-- I'm sorry, I'm paying too much attention to what people are saying and didn't have time to read the questions that are listed. I think Marga had a question a while ago about splicing. Is it sufficient that we measure just genes without the splicing information? I think more would be better, but we can't sometimes even get people to use it.

MARGARITA BEHRENS: Yeah, now my question is, to identify function, whether we need the splice variant information to truly attach a transcriptomic profile to a specific function. That was my question.

BOSILJKA TASIC: What do people think about it? Maybe I'll open it to-- I can still say what I think, but it would be great if somebody else chimes in. Okay, maybe I'll answer. I would say, of course, more information would be better, and especially if you're trying to perturb a particular protein and ask how that particular protein or gene contributes. You need to know which isoform is expressed. So I'm absolutely on board with that. I think we are still at the stage, how do we define cell types based on multiple genes? And frequently, the code is redundant enough that we can just use gene as a gene. But I completely agree with you. I've had people reach out to me saying, "I see that my receptor is expressed in your cell type that you defined. I want to know which of the isoforms is expressed." And I was like, "Well, I can give you 10x data," which is heavily 3 prime enriched so I don't have the full splice isoform information. Maybe I have SMART-seq for that data. Maybe I don't. But ultimately, if you want to perturb that gene, you would need to know exactly what is the splice isoform that's expressed. You maybe want to over-express it or something like that. You need to know which one. So I agree with you, Marga.

MARGARITA BEHRENS: Yeah. Most of all data showing the function of different receptors, for example, depends on the splice variant expressed.

BOSILJKA TASIC: Absolutely.

MARGARITA BEHRENS: So that's why the relationship to function is not direct. Yes, you disrupt the gene and you will knock it out everywhere and you will affect the function, but the details between two very similar neurons spiking in a different way may be related to that splice variant expressed and it affects both.

BOSILJKA TASIC: Absolutely.

FENNA KRIENEN: Maybe we can go back to Chunyu and then Scott.

CHUNYU LIU: Thank you. So I just want to play the devil here. So the marker gene is not crystal clear, black and white situation either. I want to point that out. If you consider the quantification expression difference, you rarely see a gene really exclusively expressing single cell type. It's more a quantitative difference. It may be higher than the other cell type, in more proportion. So when you really want to use a marker gene to define cell types, we have to be very careful about the criteria, the threshold, and everything.

BOSILJKA TASIC: I completely agree. Everything is continuous quantitative, and. Yeah. So Scott, do you have a-- do you want to chime in?

SCOTT STERNSON: Maybe just on both of the previous comments. With respect to splice variants, the advantage of performing functional experiments and relating them back to gene expression is that it's really the ultimate test of the relationship between, really, any characteristic that you can measure in the transcriptome and functional responses. And then with respect to this business about the exclusivity of marker genes - this is a very important point - it is absolutely the case that the predictive accuracy for gene expression to functional dynamics winds up being dependent on expression level and being able to measure in a continuous linear way the actual gene expression. That's what's nice about single molecule FISH. As long as you're not doing really nonlinear amplifications, you can get that data and identify, essentially, cut points for the level of gene expression that seems to be most predictive of a particular functional response. And so it is the case that treating-- and this also tends to be a problem with CRE lines, which is what I really like about, for instance, Fenna's approach. I'm very interested in this idea of using AAV promoters to gain access to these neurons instead of CRE lines because promoters that are based on the marker genes that we use to find the cell types are more likely to be sensitive to expression levels and the strength of expression I need in that gene for that promoter. And that might be a way to better access, let's say, the cells that are relatively high expressors of that gene, but you can't really rely sort of on the binary on/off that CRE is associated with. I think that's going to be really important in the future to make full use of this marker gene to functional dynamics connection and trying to bring that into the causal manipulation realm.
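The idea of a cut point on a continuous expression measurement can be illustrated with a short Python sketch. It assumes per-cell transcript counts for one marker gene and a yes/no functional response per cell, and picks the threshold that best separates responders from non-responders. The data are invented, and maximizing plain accuracy is an assumption made for illustration, not the analysis the speaker used.

```python
def best_cut_point(expression, responded):
    """Return (cut, accuracy) for the rule 'cell responds if expression >= cut'."""
    best_cut, best_acc = None, -1.0
    for cut in sorted(set(expression)):
        predicted = [value >= cut for value in expression]
        correct = sum(p == r for p, r in zip(predicted, responded))
        acc = correct / len(expression)
        if acc > best_acc:  # keep the lowest cut among ties
            best_cut, best_acc = cut, acc
    return best_cut, best_acc

# Hypothetical per-cell counts for one marker gene and the observed functional response.
counts = [0, 2, 3, 8, 10, 15]
responded = [False, False, False, True, True, True]

cut, accuracy = best_cut_point(counts, responded)
```

A binary on/off readout would collapse all non-zero cells into one group, which is why the continuous measurement matters for this kind of analysis.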

BOSILJKA TASIC: Yeah. Thank you. More analog than sort of digital, so having a different-- that the tool corresponds to expression levels. Though people love CRE lines, I have to say. I'm somewhere in between. I'm thinking if you can put a border well, maybe CRE lines and FLP lines are still useful, but I mean, I subscribe to both, I would say, to both analog and digital. Chunyu, do you want to?

CHUNYU LIU: Yes. Thank you. So I'm glad you recognize the quantification issue. I want to bring up another complexity. Actually, in my lab, we recognize there's also some discrepancy or inconsistency between protein level and the transcriptome level. We do see genes seem to be, protein level, pretty high specificity. But if we look at transcriptome, they're really not that specific. So this is another thing. I don't know how many of you noticed that problem.

BOSILJKA TASIC: Yeah, it would be wonderful to measure for every tool or for every perturbation or whatever we are using. Yeah, also the perturbation maybe and changes at the protein level. We are really doing most of it at the transcriptomic level just because it's easiest. Absolutely. So some suggestions from the comments are, we need to choose a taxonomy. We need to map all cell-by-gene data sets, but I would say even very simple cell-by-gene data sets, to an atlas, to a common nomenclature. And I agree with that. I agree with that. I would love if that were a requirement also for publication maybe. Any other comments?

SCOTT STERNSON: Well, I'm just going to say with proteins, I mean, it is actually possible now. There are methods that have been described to read out protein, as long as you have decent antibodies, protein in conjunction with RNA just using a FISH pipeline. We do this for our thick tissue FISH methods, which are important for registering in vivo and ex vivo. And so it is possible, at least with relatively high expressed proteins, to measure that relationship if that's something that one wants to do.

BOSILJKA TASIC: So I agree. I think one major challenge there is protein distribution and cell segmentation. Where do you actually measure the levels? And so I don't know how you have-- maybe, Scott, if you can comment on how you have dealt with that. I mean, if you have a transcription factor, absolutely. That's nuclear. But if you have a receptor that's present somewhere at the tip of a synapse, I mean, what protein level are we measuring? Do you know?

SCOTT STERNSON: It's definitely lots of room for improvement. The whole business of proteins is always the quality of your antibodies and then also your expression levels. So that problem is-- I don't think it's really solved. To the extent that you have favorable protein distribution, expression levels, antibody reagents, then one can put that into a gene expression pipeline in situ after in vivo imaging. But always with proteins, a lot of things have to go your way to really get good data out of it. That's why we do so much FISH.

BOSILJKA TASIC: Exactly, yeah.

FENNA KRIENEN: Okay, I'm going off script here, but we have some amazing panelists who have specific expertise that we haven't really had a chance to touch on, so. I think we've been really curious about the extension and the availability of tools and what we might like to see in other contexts that are really relevant for neuroscience research and cell typing, such as developmental context. So not to call anyone out, but Tom Nowakowski is on the panel and Arnold. And so I'm wondering if anyone could speak to the challenges or opportunities of developing tools or even atlases that could be used for tools in a developmental context where cell types are not stable, they're maturing, and so forth. Anyone who wants to chime in on that topic, it'd be great to kind of build out the complexity here.

ARNOLD KRIEGSTEIN: Perhaps I could just add one bit of a perspective, and that is, of course, functional studies in the developing human brain are extraordinarily challenging. You can do slice cultures under certain circumstances, but that's quite a difficult thing to do. Organoids, of course, have become a popular platform for looking at functional assays in human cells during development. That's very restricted to stages that really are fetal or at most perinatal. But they have been starting to become effective in looking at functional assays of specific cell types. Structural organization is missing. There's a problem with complementary cell types. The proportions of cells are different. Niches are quite disturbed. So there are a lot of limitations currently. But that looks to be the future direction of functional studies for developing human brain. And also, I think that there need to be some more cross-comparisons between the cell types that are being discovered in the developing primary tissue that is a normal developing fetal brain and what's seen in the organoids. I think there's still quite a lot that needs to be done there to determine the real fidelity of the cell types.

BOSILJKA TASIC: Yeah. So many challenges. First, mouse versus human, then human in vitro versus in vivo, and then the time dimension. Completely, yeah. Understood. Any other people who work on development, maybe development in model organisms where maybe at least one of the challenges is not that-- it can be bypassed. I don't know. Anybody working on mouse development or non-human primate development? But I agree, everything becomes more complicated when you add the time dimension.

HONGKUI ZENG: Hi, Bosiljka. So I'm not on this panel, this particular panel, but I'd be happy to chime in.

BOSILJKA TASIC: Please chime in. Yes.

HONGKUI ZENG: Yeah. I think developmental studies will be complex but also extremely important. I really hope that there will be good, a variety of computational approaches developed in just studying the trajectories of the different cell types, not only prenatally or an early stage of neurogenesis, progenitor differentiation, but also a lot of things happening post mitotically during the maturation process of neurons and non-neuronal cells. And at those stages, you can't really use lineage tracing type of work to do it anymore because there's no mitosis anymore. But there are still a lot of changes. And you see the neurons are really gradually maturing, and they're affected by a lot of, let's say, circuit factors, activities, axon projections, synapse formation interactions, things like that. And I feel like a lot of the current computational methods are really developed based on very early stages of development, mostly involving lineages, progenitors, things like that. But I think later-stage studies would be really good, and much more refined computational approaches would be really important.

BOSILJKA TASIC: So arguing for maybe additional focus on post-mitotic maturation and circuit formation stages, and how do we connect those trajectories where cells are not dividing anymore?

HONGKUI ZENG: Right.

BOSILJKA TASIC: Yeah.

HONGKUI ZENG: And I agree with a previous comment that profiling of proteins is really, really important. I really hope that proteomics, especially single-cell or cell type proteomics, will be incorporated into the cell atlas and cell type study in conjunction with transcriptomics and epigenomics and spatial transcriptomics and things like that.

BOSILJKA TASIC: Let's measure everything, yeah, if we can. That will be wonderful. Any other questions or comments? Let's see.

FENNA KRIENEN: Yeah. Anton?

ANTON ARKHIPOV: Yeah. Hi. Thanks. Well, since we are talking about challenges, I thought I'd mention another challenging area. We've heard a lot today about relating in vivo function of cells with the cell types, which is very, very important and exciting. But also, if we want to understand how it actually comes about, right, then we need to think about the circuits. And the important part about circuits is connectivity or properties of connections. And so I think-- just I want to mention some challenges here, which is a lot of the methods for studying connectivity do not lend themselves easily to mapping on transcriptomic cell types, and one needs to think about some proxies for that. So for example, if we are thinking about electron microscopy methods for reconstruction connectivity, one needs to think how these cells in the electron microscopy volume can be mapped to the cell types. There, for example, patch-seq can be used and provides really an amazing sort of Rosetta Stone for that. But obviously, it's not straightforward.

And for me personally, so I'm very passionate about brain modeling. I strongly believe that to really understand how the brain works, we need to do computational biorealistic modeling. And that brings with itself some further challenges. So just to make an example, when we are modeling a single cortical area in the mouse, the transcriptomic studies, for example, from the Allen Institute, but of course from other places and other people here, are telling us that there are about 100 different cell types in the mouse visual cortex alone. Now, if we want to simulate that or to understand really how all the cells are working together, we are talking about a matrix of 100 by 100 interactions, right? That's 10,000 elements that we need to know: probabilities of connections, synaptic properties, the weights of those connections, distributions of weights, kinetics of these weights, plasticity of these weights. So I think there are some emerging methods, for example, using optogenetics and slice that can help address that. But I just want to bring it up because it's a very challenging but very, very exciting and important area. One possibility is that we may end up sort of deciding that maybe, okay, 100 cell types is just way too much. And we have to settle for something simpler with some kind of supertypes other than 100 cell types. But maybe there is some progress that will be made and this 100 by 100 matrix can be characterized. So I would be really looking forward to that.
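The arithmetic behind the 100 by 100 matrix can be written out as a short Python sketch. The figure of 100 cell types and the list of properties come from the remarks above; the coarser level of 20 supertypes is a hypothetical number chosen only to show how quickly the table shrinks.

```python
def pairwise_entries(n_types):
    """Number of ordered (source type, target type) pairs, including same-type pairs."""
    return n_types * n_types

# Properties named in the discussion that would each need a value per pair.
PROPERTIES = [
    "connection probability",
    "synaptic properties",
    "connection weight",
    "weight distribution",
    "weight kinetics",
    "weight plasticity",
]

n_fine = 100    # cell types in mouse visual cortex, as stated above
n_coarse = 20   # hypothetical number of supertypes, for comparison only

fine_pairs = pairwise_entries(n_fine)        # 10,000 pairs
coarse_pairs = pairwise_entries(n_coarse)    # 400 pairs
fine_values = fine_pairs * len(PROPERTIES)   # one value per property per pair
```

Because the count grows with the square of the number of types, moving to a coarser level of a hierarchical taxonomy reduces the measurement burden much faster than it reduces the number of types.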

BOSILJKA TASIC: Yeah, thank you. I mean, in general, I think taxonomies need to be hierarchical and it will be great to have-- I mean, that's what we always try. I think most of the people in the field try to do that. Again, sometimes, yeah, still agreement on the hierarchies and what will be the best level of resolution for this study versus that is not always straightforward. Jeremiah?

JEREMIAH COHEN: Well, maybe this is actually a really useful place for established, clear taxonomies. For example, these norepinephrine cells that I showed, the ones that project to the cerebral cortex, they do this weird snake-like pattern with their axons that we haven't really seen in light microscopy from any other cells in the cortex, and so Anton could go into an unlabeled EM volume and search for these weird structures based on known experiments from a different modality and then say, "Here are the vesicles. Here's the release. This is how we think it maps onto binding to receptors and post-synaptic cells," and. So maybe linking across modalities actually really requires a heavy hand with the taxonomy and kind of helping people get on the same page.

BOSILJKA TASIC: Absolutely. And a Rosetta Stone or multiple Rosetta Stones that will translate modalities and at different resolutions. So this is another thing that we keep talking about, maybe a community annotation effort that could bring together-- first, I mean, if you could have some straightforward Rosetta Stones which are cell-by-gene matrices and some way to map, even though the codes or the expression patterns may not be full. Maybe you're measuring just three genes, and maybe you're measuring three genes and very coarse levels of those genes, but at least you have some mapping, some prediction of how that maps. So I would love to encourage everybody to do that, even if they presented-- even if it's binary data. Because that is a very simple Rosetta Stone. But then the second thing is really, how do we bring all the data sets together, maybe in a community annotation effort? And I think that's maybe a bigger question. I don't know. I would love to hear if anybody else has any comments on that.

I don't know if we have any other questions we haven't answered, but I think we have been emphasizing multimodal integration and ability to, yeah, to relate cell types across scales and modalities, and I think that is absolutely something where either a Rosetta Stone that is gene expression or maybe tools. If everybody uses the same tool and does something with the same tool, you know at least that they're looking at the same set of cell types. So that's yet another way to relate across modalities. Again, once they change species, once they change maybe the delivery mechanism for viruses. I would like to hear any other comments from any other panelists. I feel I'm talking too much. How would people feel if NIH said, "Hey, when you publish a paper and you deposit to," I don't know, "PubMed Central, you got to give the best guess"? And I don't know how the best guess would be named for the cell types you examined if you have single cell data. How would people feel about that? That may be a bit of a heavy hand, but I'm suggesting it because I'm not in NIH, so. Ed?

ED LEIN: Yeah, I just want to say that-- I mean, some of these efforts for standardizing nomenclatures and coming to a taxonomy that has builds like the genome is a goal of BICAN, and to do this across species so that we use the same nomenclature for homologous cell types to the extent that they can be identified across species. And I think rather than being heavy-handed, it is encouraging people to do it and having a mechanism for feedback where this works for people and where it doesn't work for people, and to contribute their information to the knowledge base over time. The more and more and more knowledge that's associated with this, the more value it will have, and the more it will get used. You can't be heavy-handed on something that ends up not being useful for people. It has to be something that brings value. And so I think this is a big goal of BICAN to get to that sort of more formal release of something with the system that people can use and have tools to map against and people can begin to actually use these things. What I think we haven't solved is this community annotation element or how we would get information from the community where it is and isn't working or add that information. That's just a challenge for the future, but we're just beginning this. So I think this is great feedback of how we might make it useful to a broader community.

BOSILJKA TASIC: Thank you. John?

JOHN NGAI: Yeah, I mean, I think the big picture here is not the nomenclature in and of itself but as a means of making the information to be broadly usable in a consistent way, right? And of all the various things that NIH may or may not impose on the field, this would kind of not be very high on my list because it's hard to impose something on a field that hasn't yet figured it out. I mean, there's probably no one right way to do it. There's probably a ton of wrong ways to do it, and we don't want to push down those routes. In the cases of various gene families where there were issues about nomenclature because somebody started working on a gene and it had this function, so they named the gene or cell after that function but it turned out that was just kind of one little thing that that gene or cell did, and then people would discover these different ways, and then finally, the whole fields have to get together and come to some agreement for how you name the members of that gene family; that's relatively simple here. Now we're talking about potentially hundreds, if not thousands, of cells that have not really been characterized for any one of their-- what could be a myriad functions.

So I think as Ed was saying, I mean, in BICAN, it's not the entire community, but if what BICAN can do is develop kind of a structure, a procedure for going about this, that would be incredibly useful, right? So I don't think we're going to have coming out of BICAN a strict, hard nomenclature for all the cell types, but hopefully, there'll be a process for doing that. And one of the things that's really important is that the information, the resource here is not just accessible to others, but that there's kind of a two-way communication, right? There are going to be others outside of BICAN who are going to be not only using this information but contributing to the knowledge base, and that's another big piece of it. So hopefully, the folks here in collaboration with others that are not here and not funded by BRAIN necessarily can figure out a way to do it, and then hopefully that'll kind of coalesce onto something. It may not be perfect, and I think perfect is probably not a reasonable goal at this point, but it's something that will be usable and at least have a greater level of consistency and therefore usability for the entire field.

BOSILJKA TASIC: Yeah. Thank you so much, John, for that. I agree that, yeah, welcoming people and making it easy for them to submit their data, and actually, there is a question saying, "Are you saying for every single cell experiment, we should also submit the annotation of the cell type along with gene expression data?" I think a best guess annotation, whatever you could. I mean, usually not like manually map, but let's say if you had a way to say, "I measured these three genes, and based on the classifier, I can tell you that this is the cell type it is." I think actually it would really facilitate inclusion of the data. And I'm really talking about people who don't do genomics as much, maybe people who do multiplexed in situs on 10 genes. That can be very, very valuable information that could be integrated, but also maybe the lab is not versed in genomics data and genomics data analysis and presentation. But even if we had, as you said, for everything, something that you did a mapping with a small number of genes, I think it would still be useful. Hongkui?

HONGKUI ZENG: Yeah, I very much agree with what John said. I think we provide the resources, the options to the community, but we can't really enforce anything. Forcing something is never going to work. Create atmosphere, "Hey, this is the one. You have to use it." But I think what we can-- to get the community to accept it, scientists-- there are so many brilliant scientists out there. They have their own ideas, their own ways of doing things, and a lot of good thoughts and expertise as well. So instead of enforcing people, I think there are a lot of things-- we can do some of the investigations by ourselves and give that as example cases, publish in papers; doing some of the good studies by ourselves to see how we can actually integrate previously published historical data with our taxonomy and show that the results make sense. Basically, using examples to show this is the good way or the right way to do it. And I think we can think about maybe what kind of studies we should do in order to demonstrate the value of the resources that we create. Not just say, "This is our resource. Use it," but really use good examples to demonstrate how it can be helpful, informative, facilitating, and things like that.

BOSILJKA TASIC: Thank you, Hongkui. Any other comments, questions? Marga, "Gene cards gives aliases for each standard nomenclature gene." No, I agree. I agree. It's just, again, imagine everybody submits with their own name.

MARGARITA BEHRENS: Give time because that resource was not originally there. And as you say, people from different fields had named their genes in their preferred way. And then finally, over time, it condensates, and you can find the aliases, and you can find your preferred gene that is now called whatever, but it is incorporated there. So those kind of things, I believe, over time, they are going to be created. Somebody will do it.

BOSILJKA TASIC: Yeah. We're talking about cell type cards. For example, Mike Hawrylycz, Ray Sanchez, and I think it was also Tyler - I'm not sure how many people from Allen - created a prototype for cell type cards. Imagine if we had cell type cards with all the aliases. But sometimes actually, the relationships may be more complicated, but. You have actually mapped this at this level of taxonomy versus at this level of taxonomy. So we need cell type cards at various levels of taxonomy. Absolutely. "BRAIN has a funding opportunity to support standardization efforts. We particularly encourage such efforts by this group of scientists," Ming, a program officer at NIH, says. Thank you. I think we have six more minutes. Any last moment comments, any questions? I'm trying to read.

FENNA KRIENEN: I think we're actually at time.

BOSILJKA TASIC: Are we at time?

FENNA KRIENEN: Yeah. I don't know if NIH is going to wrest control back from us.

BOSILJKA TASIC: Okay.

ERIN GRAY: Yeah. So we were scheduled to end around 1:20.

BOSILJKA TASIC: Oh, 1:20. Okay.

ERIN GRAY: Yeah. I would like to give people their 20-minute or so break. It may be about 15 minutes, I think, at this point. But I do want to thank you, both Fenna and Bosiljka, for moderating this panel and organizing it. And a thank you to the panelists as well. I think this was a really great discussion and truly illuminated the challenges and opportunities that we have in bridging these cell atlases to reach the broader neuroscience community. And it gives us a lot in NIH program to think about moving forward. So thank you very much.

DANIEL MILLER: I'm Dan Miller. I'm from NINDS. So the second part has more of a disease focus, but we're moving into a panel called Cell Analysis and Brain Disorder Research. And we'll begin with a keynote talk by Ed Lein, and then follow with a panel discussion, as we have been doing. So not sure Ed needs much of an introduction to this crowd, but I'll give a brief one. So Ed is a senior investigator at the Allen Institute for Brain Science. He leads the effort there to create large-scale anatomical and cellular and gene expression analysis of the adult and developing mammalian brain. He was a BICCN awardee. Now, he heads one of the large human-focused BICAN UM1 centers, and his talk today is called Putting the Human Brain Cell Atlas to Use: Insights into Cellular Vulnerabilities and Disease Trajectories from the Seattle Alzheimer's Disease Brain Cell Atlas. Ed, I'll give you five minutes warning if that's OK, and take it away.

ED LEIN: Terrific. Let me share my screen here. Okay, does that look all right?

DANIEL MILLER: Perfect.

ED LEIN: Excellent. Thank you very much for the opportunity to speak to you all today, both the BICAN crowd and also the many other communities that are participating here today. I'd really like to spend the session to try to illustrate how we can put these cell atlases to use to study human disease and to get a much higher-resolution understanding of the pathology of disease. And in particular, I'll use a parallel project that I helped to lead called the Seattle Alzheimer's Disease Brain Cell Atlas, or SEA-AD, as a way to show how we can directly take advantage of these cell atlases to better understand disease. So I just want to step back for a minute - and this is particularly relevant for the prior discussion - to say that these cell atlases are much more than a transcriptomic set of clusters. They are a description of the cellular makeup of different parts of the brain that becomes increasingly valuable as you add more information to it. And I think that the original Nature package that we had on primary motor cortex remains one of the best examples of this, where we were able to both define a reference classification to map across species and then to begin to layer on information.

So of course, by using single cell genomics, we have gene expression, we have epigenetic profiles, but we also get the proportions of cells. We get their spatial organization. We begin to get their cellular properties. And this aggregated set of information becomes really powerful as a reference that we can then use where we can apply single cell genomics in other contexts and interpret it in that light. I think, in human brain, this has been particularly transformative because we simply haven't had the tools to be able to look at this level of resolution in the human brain until very recently. And this was sort of really epitomized in this recent Science package, which is kind of shocking to think this is the pilot phase of BICCN. But really, it was. It was establishing that these technologies could be applied to human brain and non-human primate. You could do all sorts of different things with these studies that were illustrated by the many terrific studies from the BICCN community.

But this is really just the beginning, a first draft, if you will. And in the spirit of sort of broadcasting where things are going, I did want to highlight our particular project in BICAN, which is called the Human and Mammalian Brain Atlas, or HMBA, where we're trying to take this to a whole different level. The bar was set really high, but now we need to get to whole brain atlases in human and non-human primate for the main, biomedically relevant, non-human primate organisms, multimodal characterization, and try to link this into the fMRI world by being able to map these data into coordinate spaces that are used by the MRI community so that you can begin to make links across these scales and modalities. And so this is really kind of a tall order. And I just really wanted to highlight that these are major consortium efforts. And in order to pull this off, we have to bring together a wide variety of different expertise in all the different types of analyses that we can bring to bear on this, on the tissue side of things, really trying to transform how we even prepare brains and map them into coordinate spaces, as well as working across species. And so this is just beginning but is really aiming to try to extend this atlas to be something that covers the whole brain and can be a resource for the whole community.

I want to hammer this point a little bit more, that where these sort of first draft atlases have gotten to are, in fact, transcriptomic classifications where we don't understand that much about the types. However, where we have looked, there's good correspondence to many other features. And adding this information onto this classification really adds value in that people can now interpret their data in that light. So the things that will clearly be coming soon are spatial tissue organization, what Patch-seq can bring to understand properties of cells, and then whatever the community can begin to aggregate. And also, by doing this across species as a design principle and trying to identify homologous cell types, we can both infer properties that can't be measured in humans and also understand human-specific types of properties. And with systematic efforts, we can slowly begin to annotate this whole taxonomy. And a taxonomy of 5,000 types maybe isn't actually a crazy thing to begin to annotate, especially as a community. So it's really a plug that these atlases are going to gain value and importance over time as they become annotated.

Now, I'd like to spend the rest of the talk showing how we're trying to use these atlases to gain a better understanding of Alzheimer's disease. This is a project called the Seattle Alzheimer's Disease Brain Cell Atlas, or SEA-AD, as I mentioned before, which is really a collaboration between the Allen Institute for Brain Science, the University of Washington, and Kaiser Permanente. I'd particularly like to call out my colleague, Dirk Keene, over here on the right-hand side of this, who's been a key collaborator, both in SEA-AD and in BICAN, for helping to really transform how we do brain preparations and neuropathology. There's a lot of crossover between these two consortia, which I think is a real benefit here that we're both trying to make the reference and utilize the reference as a paradigm for how the rest of the community can do this as well.

So the basic idea, the aspiration behind this, is that these new technologies will let us ask, in a very, very detailed way, what is happening, where it's happening, and when it's happening over the course of Alzheimer's progression. And to try to bring together a sort of more classical view of Alzheimer's as pathological peptides that accumulate over time and lead to neurodegeneration and eventually cognitive decline with these new tools that we have to look at very high resolution. And not only to understand sort of the types, but to map trajectories, to begin to understand the series of events that are associated with Alzheimer's that may give new kinds of targets for thinking about the disease or even where we might intervene. So the idea really is to build on what we know about pathological progression in the brain. For example, A-beta progression with Thal phasing, phospho-tau with Braak staging, and to analyze those regions, where, on the one hand, we can look within regions across disease severity, and then, on the other, we can look across brain regions within individuals to try to see how they progress over time. And I'm going to focus largely on the first half of this today, but I think both of these will be really revealing about what the real pathology is at the level of cell types and the circuits that they make up.

So in a snapshot, SEA-AD is really an integrated effort to try to marry together quantitative neuropathological analysis of a series of brain specimens spanning the range of pathologies in Alzheimer's with these modern molecular techniques, by which I mean single-cell RNA-seq, ATAC-seq, or now multiome, across a donor pool that's sufficiently large to really have the power to look for cellular changes, and then also that we can map in external data sets as validation data sets to this. And so we've taken advantage of two Seattle-based cohorts, the Adult Changes in Thought (ACT) study from Kaiser Permanente and the UW Alzheimer's Disease Research Center, ADRC. And the goal is really to get this detailed understanding with these technologies, but then to create a community resource where everyone can visualize, access, take advantage of these data sets. Many collaborators here in the audience have helped with that.

So in brief, Dirk's team assembled a cohort that looked across the spectrum of AD pathology. This is not a case-control study. We're trying to treat pathology as a continuous variable. And we've selected a set, in this case, of 84 donors that we began in the middle temporal gyrus with, that represent, from no pathology all the way to high pathology, both from Braak and Thal, CERAD, that also include those with comorbid pathology. So it has to have Alzheimer's pathology, but then also, comorbidities are allowed. This is very common in the patient population. A key element here is trying to use pathology as a way to define a trajectory of disease, or a pseudo-trajectory. So typically, Braak staging, for example, or ADNC may be used to do this, which is a brain-wide thing that is largely based on binary calls. So we wanted to capture the burden of pathology in the region that we'll be looking at and to have sort of an aggregate burden of pathology, not just a single marker, but a whole series of quantitative metrics that could be obtained. So we use machine learning approaches on pathology images, and I just want to point out that this really captures the burden of pathology much, much better than a staging criterion.

On the bottom here, you can see, within a particular Braak stage, for example, there could be huge variance in the quantitative burden of disease in that region, and we're able to capture it with this. This leads to many different sets of metrics, and we're able to use those then to build a pseudo-trajectory, the continuous pseudo-progression score, we call it, or CPS. This score corresponds very well with conventional staging, with cognitive scores, with labeling for A-beta and phospho-tau, GFAP, perhaps an inflammatory element as well. And so really, the pseudo-progression seems to capture this pathology and allows us to order the donors along this pseudo-progression.

Now, we want to be able to look at the finest level of cell-type resolution that the BRAIN Initiative has been defining, and so a design principle on this was really to do pretty deep sampling in the number of cells per donor and the amount we sequenced per cell. And we took the SEA-AD data set and integrated it with the BICAN or BICCN data set in MTG. This allowed us then to be able to map these data to, first, subclasses and then, iteratively, down within subclasses to a supertype resolution, a much, much finer resolution. And what this allows us to do is to predict the supertype for all of these data and have a very high-resolution mapping. And in some cases, we found that there were new types of cells that we didn't have in the reference. This was true for non-neuronal populations. It was true for at least one disease state. And so we added those now to the references. So this actually addresses a question that had come up before, "What do you do when you have new cell types?" When you integrate in and you see there's not good mapping, you can embellish the reference and add those types.

So at the end of the day, we have about 139 supertypes that we're able to map against. And one of the first things that you can do now is to look for evidence of cells that are particularly involved in disease, that change over the course of disease, either down that might represent selective loss or vulnerability or up for some state, for example. And what we can see at a glance is that specific types of cells are indeed affected over the course of disease. We can see this with cognitive status in the first row or the more conventional ADNC staging in the second. But when we now use this more quantitative local score with the continuous progression, we see the same general types of cells, but the effect sizes go up substantially. And what we can really begin to see is that there are groups of cells that are particularly affected. These include a subset of the SST cells and parvalbumin cells. It includes excitatory neurons in the upper layers of the cortex and some of the CGE-derived interneurons. And particular states actually go up for astrocytes and microglia, as expected. One important thing I want to point out is that, if we look across regions, now looking-- adding the dorsolateral prefrontal cortex, we see a remarkably similar phenotype, but it's a bit delayed because that's at a later stage in disease progression in these individuals. But I think this is a really good validation that we can see the same types of cells are affected in different parts of the cortex, likely with the timing difference.
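To illustrate the kind of question being asked here (how a supertype's relative abundance changes along a continuous progression score), the following is a minimal sketch only. It assumes a plain least-squares slope of per-donor cell fractions against the score; it is not the SEA-AD analysis, which the talk does not specify and which would in practice use a compositional model with covariates. The function name and the toy numbers are hypothetical.

```python
from statistics import mean


def abundance_slope(cps, counts, totals):
    """Least-squares slope of one supertype's relative abundance
    against a continuous progression score (one value per donor)."""
    # Convert raw counts to the fraction of each donor's profiled cells.
    fractions = [c / t for c, t in zip(counts, totals)]
    x_bar, y_bar = mean(cps), mean(fractions)
    sxx = sum((x - x_bar) ** 2 for x in cps)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(cps, fractions))
    return sxy / sxx


# Hypothetical toy data: four donors ordered by progression score.
cps = [0.0, 0.25, 0.5, 1.0]
vulnerable = abundance_slope(cps, counts=[50, 40, 30, 10], totals=[1000] * 4)
stable = abundance_slope(cps, counts=[20, 20, 20, 20], totals=[1000] * 4)
```

A negative slope, as for the first toy supertype, would correspond to a population that declines along the pseudo-progression; a slope near zero would correspond to one that is unaffected.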

Now, the pseudo-trajectory analysis also lets us look for temporal changes. And we can see now things that are happening early in disease and things that are happening late in disease progression. And so for example, the excitatory neurons, which we expected to be lost, are actually lost quite late in the pseudo-progression, whereas in contrast, some of the inhibitory types, the SST types in particular, show a steep decline early, really before there's a lot of pathology there. And this is a consistent phenotype across the middle temporal gyrus and the dorsolateral prefrontal cortex. We also see different trajectories for the non-neuronal cell types. There are early upregulations of both the astrocytic and microglial states associated with Alzheimer's here.

DANIEL MILLER: Five minutes.

ED LEIN: Okay, so MERFISH now comes in to add something additional. We can get the spatial organization of these cell types. And so it turns out that, actually, these interneurons derived from the MGE form a bit of a continuum. The affected cell types are on one particular part of this continuum. And with MERFISH, we can now show that the vulnerable cell populations really are all localized in the upper layers of the cortex. And this makes them co-localized with several other types that are also localized in the upper layers of the cortex. Most of the CGE-derived interneurons are in the upper layers and the affected types in the excitatory pool are the layer 2/3 intratelencephalic-projecting neurons. So it really adds this other piece of information that this is not SST neurons as a whole. It's a subset of them, and they're localized in the upper layers of the cortex right alongside others that are subsequently affected.

And then finally, we're able to take advantage of the knowledge base in BICCN to start to interpret what these subtypes are. And here, we have a lot of Patch-seq data derived from human neurosurgical resections that have started to identify the morphologies and the physiological properties of these specific subtypes. And I think you can see at a glance here that these SST neurons are very diverse. This is not a monolithic class. Within these are the famous double bouquet cells, for example. And these are the vulnerable subtypes, whereas there are a number of other SST types that are not affected or don't seem to be lost in the same way. And so the point I really want to make here is that we can take our interpretation of these data to a whole different level by taking advantage of this knowledge base and get to the sort of finest level of granularity that you would see in cellular circuit neuroscience as interpreting what types of cells are affected and, therefore, what the consequences of that may be.

Now, I just took one example here. We get to look at everything. And so really, we start to put together a synthetic picture of the events that are happening over the course of disease. We have increases in the neuropathology. We have changes in the cell type abundances, some of which happen early and some of which happen later. There are molecular changes that are very specific, and some of these happen in very specific types of cells. And so I think this is a really powerful way of being able to get a brand new look at the real consequences of disease and how this unfolds over time, in particular, so we can begin to understand the early events in that process.

Now, I just want to end on one final point here, which is, first of all, I want to acknowledge that there's a lot of work in this field, in particular in AD, using single-cell techniques. Some of the researchers doing this are part of the panel coming up next. And so a lot of really terrific work, most of which has actually been focused on the dorsolateral prefrontal cortex. And the existence of all of these data sets provides the opportunity to try to integrate them and see if we can cross-validate results across studies. And so this is what we've been trying to do. Despite many differences in experimental design and cohort makeup, we're able to take these data, map them against the SEA-AD reference, and integrate this. And it integrates really quite well. We can get very high confidence in predicting supertype identity across these. And now, we can start to ask the same questions. Now, only a couple of these studies had enough to actually do this proportional analysis, enough cellular coverage. But what we can see here in the red supertypes labeled on the bottom is that many of the core things that I described in this MTG data set initially can be replicated across studies, including the somatostatin subtype effect, including the layer 2/3 IT neurons, and the microglia.

So we are actually able to see that there are consistent phenotypes coming from this and that we can integrate all these data sets by using the same reference. And one little point I wanted to add because of the comments that were made in an earlier session, the effect sizes do seem to be significantly lower in the other studies. And one of the reasons for this seems to be that there is just not enough cellular coverage. So there are many of these supertypes that have much less cellular coverage, if any at all, and so you simply can't do the analysis. So I think there are also probably lessons to be learned about what the best sampling strategy will be if you want to get results at a certain level of cellular granularity.

So with that, I want to end and just say this has really been a very fruitful collaboration. I hope it's going to be sort of illustrative for how the BRAIN Initiative work can be accelerating for the rest of the field. And we're really trying to make all of these data and resources available to the community at various levels of sophistication to be able to access the data, visualize the data, look at disease trajectories, and map your own data against this new SEA-AD reference. And so with that, I'll end. And I just really want to acknowledge-- there are so many people involved with these projects. It's very difficult to do acknowledgements. But thank my colleagues at the Allen Institute, my colleagues in the HMBA Consortium, SEA-AD, and BICAN at large. So thank you very much.

DANIEL MILLER: Thank you, Ed. That was terrific. I don't think we have a whole lot of time for questions, so I encourage you to hold on to them and maybe bring it into the panel. This panel is titled Brain Cell Atlases and Brain Disorder Research. It's going to be moderated by Keri Martinowich and Tomasz Nowakowski. Keri is a lead investigator at the Lieber Institute. And Tom is an associate professor at UCSF. And I'll just hand it off to them.

KERI MARTINOWICH: Hi. Okay. So Tom and I wanted to introduce the panel to everyone today. So we've had a lot of fun working together on this. And basically, this panel is going to try and move into how we're using the cell atlases in brain disorder research, specifically. So this was a great intro from Ed's talk. So I'm just going to kind of go through three main goals that we have for the panel today, sort of encompassing goals for the field. And so what we really think now, as we try and move from having these cell atlases available, is to define a road map for using these existing data and then proposed or kind of in-progress data to advance understanding of cellular vulnerabilities for risk and resilience and also to identify cellular targets or molecular signatures in cellular targets for therapeutics. And then to do that, I think there are a lot of inherent challenges to analyzing these data types across disease cohorts, which we're starting to find out as we try and do these studies more at scale. And so a couple of the things that we're going to try and focus the panel on today is also talking about how we can actually address those challenges. So what can be done practically to conduct joint analyses and also encourage data and resource sharing to mitigate challenges and attain those goals? So we have four short presentations from Panos, Philip, Chunyu Liu, and then myself. Our panelists are listed here. The panelists include a lot of people who are involved in a number of different disease consortia and genomics consortia, and then thanks to our notetakers. And with that, I will let Tom take over, and he is going to introduce the short presentations.

TOMASZ NOWAKOWSKI: Great. So yes, thank you again, Keri, and thank you, Ed, for an inspiring presentation. So our first presenter will be Panos Roussos from Mount Sinai. And thank you for finding the time, Panos. I know you're very busy. If you would like to share your screen. So the presentations will take five minutes. They're brief presentations. I will just do a quick reminder when 30 seconds are up. And then we'll take about two and a half minutes for questions. So please type in any questions that you might have in the Q&A. So Panos, I cannot hear you, but I can see your slides.

PANOS ROUSSOS: How about now?

TOMASZ NOWAKOWSKI: Yes, I can hear you. I can see your slides. Take it away.

PANOS ROUSSOS: All right. Thanks so much. It's a great pleasure to participate in this session today and discuss some of the ongoing work. And I would just like to take a moment and thank Keri and Tomasz for the invitation and for organizing such a terrific session. So PsychENCODE was established in 2015 by NIMH, and the goal was to bring together multiple institutes, investigators in human brain tissue and omics resources, so we are able to better capture disease processes affecting the human brain across multiple serious mental illnesses. In 2018, PsychENCODE published a series of manuscripts in Science, Nature Neuroscience, and other journals, describing changes that occur primarily in gene expression across these mental illnesses. The human brain data that were used for those manuscripts were almost exclusively generated using bulk tissue data or cell-type specific studies, but by using FACS-sorted nuclei to isolate very broad cell types. Now ongoing work from the consortium has utilized single-cell and spatial omics profiling to be able to expand the cell type resolution, so that we can actually capture disease signatures in affected brains. This work will be presented in a series of manuscripts that are expected to be published in spring of 2024.

For example, in one of the manuscripts that you can also see here, we use single-nucleus profiling in the dorsolateral prefrontal cortex of cases with schizophrenia and controls across two independent cohorts. We were able to find reproducible changes in gene expression that primarily affect subtypes of excitatory neurons and involve multiple transcripts that are highly enriched for common and rare disease genetic variants. Now besides PsychENCODE, there are also many other disease-oriented consortia. And you also had the chance to hear from Ed Lein about the SEA-AD. And there are also many others that are actually generating single cell and many other omics in diseased brains that include not only mental illness, but also many other neurological and neurodegenerative disorders. So for example, Psych-AD is one of those consortia. It's supported by NIA. And the goal is to uncover the molecular mechanisms that underlie the manifestation of neuropsychiatric symptoms, specifically, in patients with Alzheimer's disease. Now as part of this project, we have done extensive single-nucleus gene expression profiling in the dorsolateral prefrontal cortex that includes a very large cohort of 1,500 unique brain donors. And that has resulted in the generation of a resource that includes more than six million single-nucleus profiles.

Now because of this diversity and how heterogeneous this cohort is in terms of sex, age, ancestry, and disease status, this can really allow us to not only perform a cross-disorder analysis, but also explore the genetic architecture of gene expression within cell types as well as also start looking at gene expression trajectories across the full lifespan. So ongoing cross-disorder analysis suggests that many of the transcriptional changes are not unique for a given trait, but there's substantial sharing across traits, which is more obvious for specific traits and cell types. Therefore, understanding the unique and shared patterns of molecular changes that are present across multiple disorders is a critical step to define a cell type-specific disease atlas. Now in addition, the degree of transcriptional similarity across traits is similar to the genetic co-heritability among those traits, which actually indicates a substantial causal genetic component that needs to be further explored so that we can actually define factors with a higher likelihood for a causal role. Now the availability of population-scale single-nucleus profiles can really allow us to explore the heritability underlying each of those genes and each of those specific cell types. And for example, here, by applying machine learning approaches, we can capture heritability for more than 23,000 genes that include more than 22,050,000 gene-cell combinations.

Now having those molecular signatures, you can integrate with the heritability for neuropsychiatric traits, and then you can actually start linking specific risk loci with very specific cell type molecular mechanisms. And one example here, for example, in schizophrenia, we can actually do that and identify and prioritize 370 unique genes. So overall, BICAN is creating the human brain cell atlas. The disease consortia can contribute to these efforts by annotating each cell type for implication with human brain diseases. This effort can utilize disease signatures that we can actually capture using two different approaches. The first one is by examining disease association using omics data from brain tissue of cases and controls based on a cross-disorder design. The second approach is to annotate cell types by capturing the shared heritability underlying gene expression and other molecular markers with disease heritability based on common and rare risk variation. And just to acknowledge, there are so many people and individuals involved in this effort. So I would like to acknowledge the coordinated effort across all these consortia and of course, the continuous support from NIH and many other foundations. Thank you.

TOMASZ NOWAKOWSKI: Thank you so much, Panos. We'll open up for a couple of minutes of questions. Maybe to get us going, I will just ask a question. I'm very curious, how much coordination has there been in disease brain research to ensure that when you're sampling a particular brain region, that this brain region was similarly identified across the different brain cohorts?

PANOS ROUSSOS: Yeah. So this is an excellent question. And I will say that it's an area where we have been doing some work to make sure that we can actually capture and profile similar brain dissections in a brain region across different disorders and across different consortia. However, this is not done in a systematic way. And to that end, I think there are always some small differences, but still differences, when we actually pick up the same kind of brain region, but coming from different sources of brain tissue. And over there, we can see we have some variability in terms of the cell types. I will say the variability is mostly capturing a different ratio of grey to white matter rather than capturing very different cell types. So I think for harmonizing and being able to combine individuals with the same diagnosis across different brain collections or to be able to do a more systematic cross-disorder analysis, there is still room for improvement in terms of standardizing the way that we actually do the tissue dissections.

TOMASZ NOWAKOWSKI: Great. Thank you so much. So our next speaker is Philip De Jager, who is Zooming in from Columbia University. Thank you, Philip, for finding the time to present.

PHILIP DE JAGER: And thank you to the panel for inviting me to participate in really this excellent conversation with all of you today. So let me just project here. Okay.

TOMASZ NOWAKOWSKI: Yeah. You're good to go. Yes. You're good to go.

PHILIP DE JAGER: Okay. Thank you. So I mean, again, I think we've had two excellent introductions already to this topic. I'll emphasize a few more points, given the shortness of time. Also, everything I'm presenting today is actually on a bioRxiv preprint. So I'm going to go pretty fast, but please, the reference is there or reach out to me if you have any questions. So these are some disclosures. They're not relevant to this talk. I think one thing I would like to emphasize that maybe I think Ed already touched on is that we're limited also in terms of the analyses in terms of what other phenotypic data is available. Part of this can be done postmortem, as Ed had alluded to, but even more important is antemortem data. And so, unfortunately, this takes many decades of hard work by the people who are recruiting subjects. And I don't have time to get into this. We work a lot with my colleagues, David Bennett and Julie Schneider, who are at Rush University. And David created, in the '90s, two terrific projects that he still leads today, the Religious Orders Study and the Memory and Aging Project. They're quite similar to the ACT that Ed introduced in that they follow older individuals and then collect their brains prospectively. So this is a very different design from most brain banks, which are, in most cases, brought in from different clinics. Again, there's a lot of antemortem data, especially cognitive measurements prior to death, which are very important, longitudinal also.

Our particular project here-- again, we have two papers, one in press and then another that's on bioRxiv, currently. This is work that I did in collaboration with Naomi Habib and Vilas Menon. We generated data from the ROSMAP cohort on 424 participants and 1.6 million transcriptomes. Again, I think some of the key things here that Ed brought up is having a nomenclature that we can easily go across and to be able to add these different studies together. Now what we, I would say, want to emphasize, and I think this is from my background in statistical genetics, is the importance of statistical rigor in these analyses. And so what we did is to-- and that's why we decided to focus on one region in one cohort and just to maximize the number of brains that we could profile from that particular collection. And what we see here on panel B in the middle is the results for different subtypes of cells. So microglia 13, for example, is the first row. And then you have three different traits. If you look at the bottom, you see amyloid-beta, tangles, and cognitive decline. So there are two proteinopathies that relate to Alzheimer's disease and the trajectory of cognitive decline prior to death based on up to 20 years of observation.

Again, each of the subtypes is associated. You can see here that, for example, microglia 13 is associated with all three traits. Very important, I think, and critical, I think, for our paper is that we were able to replicate these results and then to combine the results of the discovery and the replication analysis into a meta-analysis, which has the more definitive results. And also, of course, because it's a meta-analysis, there are more people. And so more populations become significant. I think critical to the next step is integrating this with other collections like the ACT that Ed talked about. But again, I think statistical rigor is going to be really critical because these are very expensive experiments. And we can't skimp on the sample size. Sample size and the depth of phenotyping are also critical to do more advanced modeling. So beyond simple associations, we want to try to infer trajectories and relationships of different traits that are correlated with one another. They're strongly correlated but not perfectly correlated. And so we can use that to disentangle where a particular cell type may have a role. So we used here structural equation modeling to try to resolve where the two subtypes of microglia that we found associated with the proteinopathies of Alzheimer's and cognitive decline as well as an astrocyte subtype, how are they related? And so what we're showing here is a diagram where, for example, we propose that microglia 12 is contributing to the accumulation of amyloid on the left. Then we have microglia 13, which mediates part of the effect of amyloid on the accumulation of tau. And then microglia 13 mediate-- sorry. And then astrocyte 10 mediates part of the effect of tau on cognitive decline. So this type of modeling is really critical because it allows us to prioritize a particular cell type to a particular phase of the disease. And again, we can only do this with the antemortem data.
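As a minimal sketch of why combining discovery and replication results can make more associations significant, the following assumes a standard fixed-effect, inverse-variance-weighted meta-analysis. The talk does not say which meta-analytic method was used, and the function name and effect sizes below are hypothetical.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance-weighted (fixed-effect) pooling of per-study
    effect estimates. Returns (pooled estimate, pooled SE, z-score)."""
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se, pooled / pooled_se


# Hypothetical association of one cell subtype with one trait:
# discovery (z = 3.0) and replication (z = 2.0), equal standard errors.
beta, se, z = fixed_effect_meta([0.30, 0.20], [0.10, 0.10])
```

With these toy inputs the pooled estimate sits between the two study estimates, while the pooled standard error is smaller than either study's, so the combined z-score exceeds both individual z-scores.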

And importantly, here, what you see is that we replicate all these interactions except for the first one, microglia 12 on amyloid proteinopathy. That's the only one that's not replicated. But all the other modeling is actually replicated in the replication data set. Moving on to, again, another form of trajectory analysis is-- and this is all cross-sectional data, right? So these are models. They're not necessarily causal because we don't have longitudinal data, of course. But we can also use different approaches to project here instead of individual cell types. We have individual people. And then we have-- so we embed a cellular landscape at the person level. And what we propose in this paper is that there are two main trajectories that we see and that here, each dot is one individual. And you can see that there seem to be two trajectories, one going towards the bottom of the screen, one going to the left. And what we see, actually, is that the trajectory going down is strongly enriched in a sort of monotonic way for tau with accumulating pathology and cognitive decline. So we call this the progression to Alzheimer's disease. The other trajectory, it's not clear what it is. And so for now, we kept it a relatively benign name of alternate brain aging. Again, some people are demented, but you don't have this monotonic worsening over time.

And finally, I would say one idea that we proposed in our previous paper and we've further developed here is that the individual cell subtypes like microglia 13 and astrocyte 10 that I talked about earlier, they don't function in isolation, of course. And so we've built these cellular communities which seem to be driving the association with disease. What's shown here is the different subtypes of cells are organized through five communities, one of which highlighted in the purple, community number four is strongly associated with the pathology. And you can see on the trajectory on the bottom right, in the progression to AD trajectory, the red line is the C4, and the increase in frequency as you go further down that trajectory. If you look at alternate brain aging, there's no such enrichment over time. So again, I just wanted to highlight a few points. Replication, statistical rigor, I think, is going to be essential as we move forward with large studies. Again, trying to understand the causal chain of events and the heterogeneity of the older brain, this is where, again, we need those large numbers to overcome this heterogeneity and to be able to do more modeling. And finally, that we should think about cellular communities. And to that point, maybe just to let people know, we had a new grant funded recently. We call it the Cascade Project, which is to do large-scale spatial multiomic profiling of the same collection, actually. And this is to try to understand these other communities in a topological fashion, which is one of the next things we have to do.

TOMASZ NOWAKOWSKI: Thank you. If you could wrap up.

PHILIP DE JAGER: Yep. That's it. Thank you very much.

TOMASZ NOWAKOWSKI: Perfect. Thank you so much. So I'm afraid we're out of time. We're going to take more questions and are looking forward to the discussion later. So our next presenter in the interest of moving us forward is Chunyu Liu from Upstate Medical University. Chunyu, take it away. Thank you.

CHUNYU LIU: We're back again.

TOMASZ NOWAKOWSKI: Currently, we can see your presenter point of view.

CHUNYU LIU: It's coming back again. Okay. A moment.

TOMASZ NOWAKOWSKI: Display settings, I believe.

CHUNYU LIU: Oh, yeah. I know. I have this problem last time. I don't know why. Okay. It's work this time?

TOMASZ NOWAKOWSKI: Yeah. That's perfect. Thank you.

CHUNYU LIU: Okay. Good. So thank you, Keri and Tom, for allowing me to continue the story I started two days ago. So the key topic today is, how can we improve reproducibility through quality control? This is primarily related to the precision and accuracy issues of single-cell sequencing data. Hope you still remember, we hoped single-cell RNA-seq can produce accurate and precise quantification. Unfortunately, it really falls into this category. So the implication is that a study using a relatively small number of cells is likely to yield unreproducible results with some false positives and false negatives. Particularly, I want to emphasize minor cell types are likely to suffer the most. So we just want to use one example of differential expression to show the impact of those quality issues. So in this case, we used the data we talked about last time, the LPS-treated phagocytes in different species. We used the pseudobulk data from single-cell to evaluate that data compared to pooled cells as a ground truth comparison. Here, the differential expression is comparing the treated with the untreated cells.

So the thing we're evaluating is how many differentially expressed genes, DEGs, we can capture in the two data sets and how consistent the detected DEGs are, whether the DEGs can be replicated. And just to remind you, the data we're dealing with here, clearly, as I showed you last time, the data are actually pooled in this whole four data sets. As you can see, even though the sample size and the cells sequenced are very similar, these data really give you far fewer DEGs, and the reproducibility of the DEGs is much lower than in the other datasets. So we also have data to show the effect size evaluated from this data. It's also related to the number of cells you lump together. So when you use more cells, the reproducibility improves in terms of the accuracy of the effect size. Also, the fraction of DEGs that can be replicated definitely increases when you have more cells lumped into the pseudobulk. So why are precision and accuracy important for DEG analysis? So in principle, we know precision actually is a critical factor related to power. When you use imprecise or noisy data, typically, it requires a much larger sample to achieve the same power. And certainly, inaccuracy or bias will really mislead your difference, give you wrong results. So both of them really drive the issue of reproducibility.
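
Illustration (not part of the spoken remarks): the point that lumping more cells into a pseudobulk improves precision can be sketched with a small simulation. It assumes independent Gaussian per-cell measurement noise, and every number below is hypothetical. It addresses only random noise (precision), since systematic bias (accuracy) does not average away with more cells.

```python
import random
import statistics


def pseudobulk_sd(n_cells, n_replicates=200, true_mean=1.0,
                  cell_noise_sd=2.0, seed=0):
    """Spread of a pseudobulk mean across repeated simulated samples.

    Each replicate averages n_cells noisy per-cell measurements of the same
    true expression level; the returned value is the standard deviation of
    those averages across replicates.
    """
    rng = random.Random(seed)
    pseudobulk_means = []
    for _ in range(n_replicates):
        cells = [rng.gauss(true_mean, cell_noise_sd) for _ in range(n_cells)]
        pseudobulk_means.append(statistics.fmean(cells))
    return statistics.stdev(pseudobulk_means)


sd_small = pseudobulk_sd(n_cells=10)    # e.g. a rare cell type
sd_large = pseudobulk_sd(n_cells=1000)  # e.g. an abundant cell type
print(f"10 cells:   sd of pseudobulk mean = {sd_small:.3f}")
print(f"1000 cells: sd of pseudobulk mean = {sd_large:.3f}")
```

Under these assumptions the spread shrinks roughly with the square root of the number of cells, which is one way to see why minor cell types with few cells per sample are expected to give the least reproducible estimates.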

So just to use a cartoon to show how technical noise can really kill your signal, so I want to emphasize, technical noise won't matter much if you don't really compare that to the biological variation. I want to show you, if the effect size is huge in this case, then adding the technical noise will not really kill your signal. You still have a significant finding. If the two groups are very similar, then it can really kill your signal. That's critical information. And unfortunately, if you look at the case control for most of the brain disorders, psychiatric disorders, and neurological disorders, you will see the change is typically small. So this is the paper in 2018. You can see schizophrenia, autism, bipolar, and depression. The fold change is typically less than two. We have another database archived. Most of the published brain disorder transcriptomes in case-control comparison, they have, really, a very similar observation. So besides case-control comparison, I want to emphasize, this problem really goes much beyond that. If you do eQTL mapping, the genotype-correlated expression is involved. When you do cell classification and when you emphasize the different cell types you're comparing, quantification of the marker genes will also be involved, as I mentioned earlier in the first panel.
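
Illustration (not part of the spoken remarks): the relationship between effect size, biological variation, technical noise, and required sample size can be sketched with the standard normal-approximation formula for comparing two group means, n per group = 2 (z_{1-alpha/2} + z_{power})^2 (sd_bio^2 + sd_tech^2) / effect^2. Treating technical noise as an independent variance component added to biological variance is an assumption of this sketch, and the numbers are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(effect, bio_sd, tech_sd, alpha=0.05, power=0.8):
    """Approximate samples per group for a two-sided two-group comparison.

    Uses a normal approximation and assumes technical noise adds
    independently to biological variation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    total_variance = bio_sd ** 2 + tech_sd ** 2
    return math.ceil(2 * (z_alpha + z_power) ** 2 * total_variance / effect ** 2)


# Small effect (typical of case-control brain studies, per the talk):
n_clean = samples_per_group(effect=0.5, bio_sd=1.0, tech_sd=0.0)
n_noisy = samples_per_group(effect=0.5, bio_sd=1.0, tech_sd=1.0)
# Large effect: the same technical noise costs very little.
n_large_effect = samples_per_group(effect=2.0, bio_sd=1.0, tech_sd=1.0)
print(n_clean, n_noisy, n_large_effect)
```

In this sketch, technical noise equal in size to the biological variation doubles the required sample size for a small effect, while a large effect remains detectable with only a handful of samples per group.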

So the warning I want to give is when you use expression quantification based on individual genes or pools of smaller numbers of cells per cell type, it's likely to yield some false findings. Those will not be reproducible. So I just want to pick up one paper I noticed recently by Alan Murphy. They evaluate one of the Nature papers on Alzheimer's, which detects a lot of differential expression. And they add in additional QC and show the number of DEGs significantly reduced, but they claim they can detect stronger changes. And their QC is just related to removing some doublets and low-risk genes, and it did not really consider noise level. So even for their finding, we have some concern too. So take-home message, we want to offer some recommendations for either analyzing existing data or future single-cell studies related to disease. So I think it's better to consider the technical noise in the QC procedure. And the biological variation and the relative technical noise are the two major factors that will determine the confidence of findings. And sequencing more cells for the targeted cell types will be maybe ideal, particularly when the sample size is restricted or limited. When we're talking about the postmortem brain, it's hard to get that many samples. Certainly, better quality of RNA is always preferred. So this is the acknowledgment. Our study is primarily supported by PsychENCODE. Thank you.

TOMASZ NOWAKOWSKI: Thank you, Chunyu. This was great. Again, we're out of time. So we're going to move on and then perhaps have more time for discussion. So our next speaker is Keri Martinowich, who is at the Lieber Institute for Brain Development. Keri, we can see your slides. You can take it away.

KERI MARTINOWICH: Okay. Thanks, Tom. So I'm going to switch gears a little bit and talk a little bit about the challenges of moving spatial transcriptomic studies into cross-disorder cohorts in neuropsychiatric diseases, specifically. And so I just want to give a shout out quickly to the team leaders that I work with, Kristen Maynard, Leo Collado-Torres, and Stephanie Page, who are at the Lieber Institute, and then Stephanie Hicks, who's our data science collaborator at JHU Biostats, who we work with really closely. Okay. So I don't think I need to give a lot of kind of background about why we want spatial information. This has been gone over for the past two days. But basically, I mean, just to say that spatial positioning is influencing morphology, the connectivity, and the physiology. And on this slide, there's just some examples of that for the dorsolateral prefrontal cortex, which I'm going to talk about mainly today. And so, really, it would be preferred. Single-nucleus RNA-seq provides a lot of information, and you can infer spatial kind of location. But of course, it would be preferred to have the actual gene expression with the X and Y coordinates in the intact tissue. And so a few years ago, we set out to do this. We wanted to develop some of the first spatial transcriptomic studies in the human brain. And so we chose to do this in the dorsolateral prefrontal cortex or DLPFC. So there's a lot of data that's been generated in this region, especially in disease cohorts. And so we use the Visium platform from 10x Genomics, which I think a lot of people are familiar with. So I'll just go through quickly.

Again, this is a transcriptome-wide spatial transcriptomics platform. And basically, there's a capture area. It's 6.5 millimeter squared. There's 5,000 expression spots that are on that capture area that have, basically, spatial barcodes on them. The tissue is cut and laid down. You do a histological stain, in our case, H&E, and then you basically do on-slide cDNA synthesis. The spatial barcodes are captured, and then you can basically map the gene expression back to the image that you've taken. And so when we did this in the dorsolateral prefrontal cortex, our kind of initial goal was just to benchmark this and really see if we could capture sort of known features in the cortex, in our case, histological layers. And so this is just sort of an example of what the data looks like. These are spot plots basically pulling out individual genes, so SNAP25, a marker of neuronal expression mapping to the grey matter makes sense, MOBP, oligodendrocytes mapping to the white matter, and PCP4, a putative layer 5 marker, being where it should be. I'll just point out that the spots are 50 microns, so this is not at cellular resolution in the DLPFC. So we segment and then actually are counting the number of cells on the images. So there's an average of about three cells per spot.

And so for this first study, our goal was really to kind of just generate molecular profiles for human brain lamina in the cortex. And so at this time, there were not a lot of spatially aware, unsupervised clustering methods available. And so we took a semi-supervised approach here to manually annotate spots using known marker genes from the rodent literature and also the histological image, so the H&E image. And so we basically manually annotated all the spots on these. So this study was three neurotypical donors, and we had basically four spatial replicates for each donor. And so what we did is then sample those spots according to layer and white matter. And basically, what this allowed us to do was to generate molecular profiles for the human brain lamina and the white matter. And doing this basically showed that there were hundreds to thousands of differentially expressed genes. We did a number of studies to compare this to known rodent marker genes. This is all published, if people are interested in that. At this point, what we really-- that's kind of a first benchmarking study. We really wanted to move to do case-control cohort studies. And when we started to think about this, we became sort of overwhelmed at thinking about how to do this and all the problems that we might have to kind of scale infrastructure and then the problems that we would have with analysis across disorders. And so some of those key things in the sampling confounds were that we kind of realized we were really going to need to standardize the dissections to be able to do neuroanatomical matching and make sure that we have laminar inclusion. We're trying to match the amount of grey versus white matter because this is a common concern in single-nucleus RNA sequencing studies that throws in a lot of confounds.

We were also really nervous about spatial gradients across the anterior-posterior axis of the DLPFC. It's a pretty large structure. And although people have not noted that this is a place where there's a lot of spatial variation, being someone who grew up in rodent studies doing unbiased stereology, the idea of taking one 10-micron section from a huge human brain and then kind of calling it a day seemed really kind of not okay. And so we really were nervous about that. We knew from the original study that adjacent sections were-- that they basically were really well correlated, but we really wanted to look across the anterior-posterior axis to see how much variation there was. The other thing is that right around the time when we were sort of doing our first studies, there was an explosion of kind of computational methods that were coming out for spatially aware, unsupervised clustering. And so we wanted to start being able to have a larger data set to be able to assess those methods. And then as I had mentioned, this is not at cellular resolution. So we wanted to benchmark some of the methods that were coming out for doing deconvolution at spot level.

And so this larger study, we designed basically to address these questions. And in this study, we took 10 really well-characterized neurotypical donors. And again, took the DLPFC, but this time did really careful dissections and took three blocks from each donor. So we took a block from that donor within the DLPFC from the anterior, from the middle, and the posterior. And then from those N-equals-30 samples, each of those blocks, we basically did Visium, the regular Visium with the H&E. We cut thicker sections and then with matched cellular composition, did single-nucleus RNA sequencing. And then for a subset of them, we did the Visium spatial proteogenomics platform, which trades out the H&E for immunofluorescence. And I'll talk about that a little bit more, why we did that. And so with this study, the goals were really to kind of standardize anatomical validations for the dissections, compare the unsupervised clustering algorithms and compare to the manual annotations we had developed to be able to optimize spatial registration because we had the paired single-nucleus RNA sequencing data and then to benchmark the spot-level deconvolution algorithms that were coming out since we had proteomic data for cell type markers and then to do this within donor comparison over three anterior-posterior locations.

So just really quickly - I'm not going to say too much about this - we spent a lot, a lot of time optimizing the dissections for the DLPFC for these types of spatial transcriptomic studies and coming up with a pipeline for how to kind of optimize and make sure that we have all the lamina that we want and really get these cut in the way that is optimal for spatial transcriptomics. With this data that we generated from these N-equals-30 samples, this was really enough data to really try out some of these spatially aware clustering algorithms that had come out. And so at first, really, what we wanted to see was, could we actually reproduce and just get the histological layers that we had seen before with these algorithms? And the answer is yes, so that we were able to compare these to our manual layer annotations that we had developed earlier.

TOMASZ NOWAKOWSKI: You have 30 seconds or so.

KERI MARTINOWICH: Oh, okay. And then clustering at higher resolutions, we were able to basically come out with kind of different sort of novel spatial domains. We also were able to register this single-nucleus RNA sequencing data. So we did this for the new PsychENCODE data that's coming out, that Panos mentioned, in 2024. So we were able to really kind of register this across eight large-scale studies and also provide some spatial kind of context to those studies. The benchmarking for the spot-level deconvolution, as I said, basically, we did this with four broad cell types. And this allowed us to basically give a ground truth for number of cells per spot. What we were able to do with this, both the regular Visium H&E and then that Visium proteogenomics, was really to look over the anterior-posterior axis of the DLPFC. There are some differentially expressed genes across the AP axis, so we noted 512 of them. But we were able to see that the differences across spatial domain and donor far outweigh the variance from the gradient over the AP axis. And then with that proteomic ground truth, we were able to show that there are no changes in cell counts per spot or cellular composition per spot over the AP axis.

So the studies that we have underway, so now with that kind of data in hand, that preparation, we have a study with 240 DLPFC blocks that we have dissected and anatomically validated. And we're running spatial transcriptomic studies on those. And the data acquisition is about 50% complete. And then we have additional benchmarking studies from those same N-equals-10, well-characterized neurotypical donors from the DLPFC in four additional regions, the hippocampus, the nucleus accumbens, amygdala, and the dorsal ACC in preparation for case-control studies. And just these are the acknowledgments, people I mentioned before on the first slide, and also shout out to Tom Hyde, who runs the LIBD Brain Repository and really has helped us with a lot of the dissections.

TOMASZ NOWAKOWSKI: Great. Wonderful. Thank you. So yeah. Now we're going to open up the panel. And if I can just-- as housekeeping, can I ask all the panelists to turn on their cameras, if they may. And I will hand back to Keri, who will get us started on topic one.

KERI MARTINOWICH: So for the first topic - I think I had mentioned this initially - we had talked about wanting to define a road map for advancing understanding of cellular vulnerabilities and then talk about some of the ways that we can start doing this. So the first topic is really kind of focusing on what can be done to maximize the utility of existing single-cell and spatial data that we have from human tissue, both reference data, incorporating genomics data, and then also the case-control data. And maybe just to kind of start off the conversation, we had some questions listed, so maybe just to get everyone started, the first question we had had on here-- I don't know if Seth is there, if he wants to field this one. We basically wanted to start off with talking about maybe someone giving a little bit of background about some data that's been generated by different disease and genomics, consortia, and talk about kind of centralized data repositories for those data.

SETH AMENT: Sure. I can try to do my best. I think that the Alzheimer's disease cohorts have been very well represented in the last few talks, and the schizophrenia and other PsychENCODE-related cohorts as well. One that my group has been involved in that hasn't been mentioned so far is the Single-Cell Opioid Responses in the Context of HIV (SCORCH) consortium, which is a NIDA-funded consortium on substance use disorders and HIV, which will have a few hundred samples from cortical regions, striatum, amygdala, midbrain, and then a lot of work in animal models to try to cross-register the changes. So there's a lot of data that have already been generated and many more that are underway. It's clear that this is just going to continue to accelerate. The question about repositories, I think, is an important one. So I can represent the NeMO archive, which is the genomics data repository for BICAN and also for SCORCH, and we've thought a lot about how to bring together some of these additional resources. It may not be possible initially to get all of the raw data into one place because of the ways that different cohorts are built, whether that's university requirements or consents around these cohorts, making it difficult to get raw data into one place. So that means that we're probably moving towards something that's more a federated model, and some support from that program has been helpful. I think that there's a need to do that at an even larger scale to try to enable the kinds of very comprehensive studies that, I think, everyone would like to get to.

TOMASZ NOWAKOWSKI: Sorry. So I'm very naive about this, and maybe you can explain a little bit more. When you say federated model, do you mean a situation where individual investigators would have to process their data using the same pipeline, essentially, in which case, we really need to think about how to discuss such that the pipelines are aligned? Or do you mean data should go into one place and be re-analyzed using the same pipeline and then distribute? Could you just elaborate a little bit on how you envision sort of what can be done now or in the next six months or so?

SETH AMENT: So the immediate-term answer to that is that one product of BICAN is pipelines that make it possible to do uniform processing, at least, for the initial stages of the analysis, so single-cell RNA seq and single-cell multiome pipelines, to take raw sequencing data and bring that to the level of counts and ATAC-seq fragments. So because we are serving as the data center for both SCORCH and BICAN, one thing that we are doing already is taking SCORCH data and pushing them through the BICAN pipeline, and that seems to work fine. That doesn't necessarily mean that the data need to all live in one place - and often, that won't be possible, right? - but bring the data up to the cloud for that uniform processing using publicly available pipelines, bring it back to a secure location so that people within a particular consortium can work with it and then proceed from there. Within our consortium, SCORCH, there's been a lot of talk about the timing of when people are comfortable making data fully public. And that just appears to be a lot later than the point where the data are generated. So I mean, I think that there are pipelines. We can start to do some things in uniform way. I think that the conversation about uniform cell type standards is really exciting and that we're reaching a point, at least, for some regions of the brain where that can be also standardized. But for now, the data heterogeneity means that I think it's very difficult to build a single pipeline that can be used for everything. Others may have other thoughts on that.

TOMASZ NOWAKOWSKI: Philip, go ahead.

PHILIP DE JAGER: Thanks. Yeah. And I think this is sort of a perennial problem, of course. And essentially, it's a big time sink also depending on all the different efforts going on. And there are some efforts within the AMP-AD program, for example, to do this periodically, but it could be more structured, probably. And I think perhaps an alternative, certainly, short-term approach might be to come together and decide what may be a shared reference. And then everybody can project their data onto the shared reference, at least, to relate the-- and you can do your own analysis. But this may be one way to have a common framework to try to see what may be shared or distinct from each of the data sets. So that's for one problem. And maybe sort of related to this is, one, unfortunately in some ways, but I think maybe actually quite helpful for some of these issues is that in the ROSMAP project, the same individuals were profiled by single-nucleus RNA seq by different groups. This was sort of not done consciously. We received samples. But unfortunately, there's a fair amount of overlap. So about 150 individuals have two sets of single-nucleus data from the same region. And so we've been looking at this. And as you might expect, some subpopulations translate well, others don't. But that's sort of one data set, at least, that could be helpful to explore some of these challenges.
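
Illustration (not part of the spoken remarks): the idea of projecting each group's data onto a shared reference can be sketched as assigning every query profile to its most similar reference cell type. This minimal version uses cosine similarity against reference centroids. The three-gene profiles and cell type labels are hypothetical, and real projects would typically use dedicated label-transfer or mapping tools rather than this toy rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project_to_reference(profile, reference_centroids):
    """Return the best-matching reference label and its similarity score."""
    scores = {label: cosine(profile, centroid)
              for label, centroid in reference_centroids.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores[best_label]


# Hypothetical shared reference: mean expression of three marker genes
# per cell type, agreed on by all participating groups.
reference = {
    "microglia": [5.0, 0.1, 0.2],
    "astrocyte": [0.2, 4.0, 0.3],
    "oligodendrocyte": [0.1, 0.3, 6.0],
}

# Each group keeps its own analysis but reports labels in reference terms.
label, score = project_to_reference([3.0, 0.5, 0.1], reference)
print(label, round(score, 3))
```

Keeping the similarity score alongside the label gives a simple way to flag subpopulations that do not translate well onto the reference, which is the situation described for the overlapping ROSMAP data sets.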

TOMASZ NOWAKOWSKI: Gustavo?

GUSTAVO TURECKI: Yeah. On a related point about the SCORCH data is that the SCORCH data, by the very nature of the phenotype that are included, it's going to be very comorbid. Yeah? So individuals will be affected with a number of different pathologies in addition to opioid addiction and HIV. So that would bring as well a number of other issues to discussion as to how the data should be analyzed and how the different comorbidities should be looked into down the analysis. So I think that that's another interesting point to take into account.

SETH AMENT: Yeah. So I think that establishes a real use case for why we need to have resources that we can compare across diseases because a lot of the-- a lot of things will require-- or it'll be enabling to have things that-- multiple data sets from different diseases annotated in such a way that we can make the comparisons.

TOMASZ NOWAKOWSKI: Ed?

ED LEIN: Yeah. Just to add, first of all, Phil, I really love the idea of everybody maps to a common reference. But in addition, I think some common metadata across the various consortia could go a long way. And these things are happening now. They're happening at the level of BICAN. So all the donors have to have a certain set of metadata. They're happening through places that want to archive and present the data, like Human Cell Atlas and CZI CELLxGENE, where you're sort of forced into having a certain set of metadata. And these conversations keep happening again and again. This is a place where perhaps it could really be standardized so that data can be reused in many kinds of analyses if we have that metadata.

TOMASZ NOWAKOWSKI: There's, I think, a really interesting question from the Q&A, which I was wondering if I could read out loud and maybe the panelists can comment on. How do we address the issues of data privacy, PHI, when trying to put the data together and analyze them together? Do families of these donors know when their data is being ported over or added to or combined, for example, with public data commons and especially when it comes to potentially sensitive metadata like psychiatric disorder status? So would anybody like to comment on that? I think it's really worthwhile discussing. Gustavo, you have your hand up first.

GUSTAVO TURECKI: Yeah. So I think that certainly would depend a lot on the jurisdiction from where the data comes from. Yes? So for instance, we work in Canada. A lot of the data in Canada, it's collected through coroner's reports, and the coroner's reports are public by the nature of public documents. So that's one point. The other point is that you can obviously as well de-identify a lot of the data and make it public in a way that you don't identify individuals. Yes?

TOMASZ NOWAKOWSKI: Thank you. Philip?

PHILIP DE JAGER: Yeah. I mean, this is a complex question that has a number of different dimensions. I mean, for example, the advanced age itself is an identifiable feature, actually, if you're over a certain age. I forget if it's 90 or 95, for example. But also, technically, a brain is not a human material because once the person is deceased, the person is no longer a person. It's not a human subject anymore. And so it's just a piece of tissue, technically. Of course, there are a lot of other issues surrounding this. And I think one way that the ROSMAP studies have dealt with this, which I think is quite nice, is as we began to generate more data from these cohorts over the last 10 years, David Bennett, my colleague who runs the cohorts, actually started to consent people for widespread sharing of high-dimensional data. And so because it's a prospective study, you can actually consent people properly. The participants are consented properly up front, and it's explained to them what is involved in sharing. So that's sort of the best-case scenario, where you actually have the participants being involved in the discussion and agreeing to this type of sharing or not, so they can choose. And if not, then you can't share the data to the same extent.

TOMASZ NOWAKOWSKI: Great. Thank you for commenting on this. I think these are very important discussions that we ought to have. And relatedly, can anybody comment on the issue of genetic ancestry and how much has been done to ensure that in these various disease and disorder cohorts, how well have they been controlled, for example, for genetic ancestries such that when you embark on some of these common analyses, ancestry doesn't become the major driver of a signal. I wonder if anybody would like to comment on that point, Chunyu or Seth, anybody who's embarked on common data analysis?

CHUNYU LIU: Yeah. So I can say a few words, right, given my lab has been devoted to study the population diversity in terms of brain transcriptome. So certainly, the population difference is quite obvious when you look at genetic composition. And so if you study things like eQTL, it will be a major factor you have to address properly. And at the same time, I want to say, really, there's not lots of non-European brain data available for us to study. Just have limited data. And I really hope we can see more brain tissue collected from non-European populations so we can really do more thorough evaluation on that interesting aspect.

SETH AMENT: I can add one point, which is I think this is an opportunity actually for BICAN to produce a reference of diversity with respect to genetics. I don't currently know of a major effort to systematically do that across all of the brain regions and cell types. I think that there's some work that's getting done in that space. But I think it would be wonderful to be able to make that sort of mapping and QTL analysis across the BICAN resource as something that will help all of the consortia.

TOMASZ NOWAKOWSKI: Thank you. Oh, sorry, Keri.

KERI MARTINOWICH: I can just jump in to say more from a collection aspect. I would agree with Chunyu. There's really not a lot of data that's been collected across diverse ancestries. There's been a couple of initiatives that have happened, really, in the past three to four years that have really kind of tried this. But these initiatives to collect brains, then you're talking about probably a decade out for this to really kind of ramp that up, so this is a long-term process. But there are several of them in place. So Lieber has one, for instance. Again, we've been doing this really strategically to fill this gap. But again, it's not something that happens overnight. So it's a really long-kind-of-term process to actually acquire the brains and then to get the data.

SETH AMENT: Yeah. Absolutely.

TOMASZ NOWAKOWSKI: Philip and then Ed.

PHILIP DE JAGER: Yeah. I mean, to follow up on Keri's point, I completely agree. I think this brings up-- well, I think first of all, in terms of getting a more diverse population, there are some projects leveraging existing samples. So one from AMP-AD, which has a large number of Latinos and African-Americans, is coming out this year for three regions. So it's a start, but it's not where we need to be. But I would say an important component of this is education. And so we started a multiple sclerosis brain bank at Columbia three years ago. And we're actually developing educational material for potential participants and trying to engage with a variety of communities, particularly minority communities, to try to encourage brain donation, to educate the population about the importance of this donation. And it gets back to one of the things I brought up in my presentation, which is the richness of the phenotypic data available on these participants. And I think while we certainly need to leverage the existing collections today, I think perhaps even more important, we need to be-- we need to start collecting living participants and doing the proper long-term studies to generate the next generation of samples. So of course, it just takes time. But I think we need to think about this now.

TOMASZ NOWAKOWSKI: That's really wonderful. One thing just to comment before I hand back to Ed, really, the focus of this particular panel is to try to identify items that can be-- action items that can be done with available data while highlighting, obviously, these longer-term prospects. So thank you for commenting. But as you chime in, it would be really great to try and focus your answers on what can we do now with the data that we have and that are already available because a lot of data, as we heard from all your talks, are already available. So Ed.

ED LEIN: I'm not sure what I can say about that, Tom. But I did just want to plug that BICAN has invested in brain collections through the NeuroBioBank that get around problems with any of our local banks that we don't have. For example, Seattle bank doesn't have much diversity. But by assembling specimens from different parts of the country, we can get that. And I don't know if Steve McCarroll is still on the line, but one of the main UM1 projects in BICAN is to do that, is to try to look across very diverse populations and try to get a handle on that, that could be useful for the community. So just wanted to say some of that is happening in BICAN now, but it does not do living individual consenting. And postmortem specimens, you may or may not have certain kinds of information. So it has some limitations.

TOMASZ NOWAKOWSKI: Thanks. Hyejung.

HYEJUNG WON: Yeah. I would like to point out that especially when it comes to seeking the molecular profile difference in the ethnicities, we also have to think about the genetic difference versus environmental difference. Because for example, if we profile African-Americans versus people of African descent from the UK, their eating patterns and everything could be quite different. And I think I saw this talk somewhere, but when they profiled African-Americans versus European-ancestry Americans, the biggest difference was observed at the level of immune cells in the brain rather than the neuronal subtypes. But then the speaker himself wasn't really sure whether that was caused by the genetic difference or whether the environmental exposure actually differs between people living in different regions, and people might have totally different environmental exposure. So when it comes to the concerns, when it comes to ethnicities, I think we also have to parse out what really comes from the genetic origin versus just the population difference in their environmental exposure.

TOMASZ NOWAKOWSKI: Great. Thank you. So Gustavo, your point is next, and then we're going to move on to the second topic.

GUSTAVO TURECKI: Yeah. No. I was just following up on a comment by Philip that I think it's a really excellent suggestion to collect premortem data and do follow-ups. But that only works for neurodegenerative diseases or for people where the death is predictable. The problem is that for illnesses where you don't really have any predictable death, that's impossible to do. So I think the answer for that is try to collect as much information as you can, yes, and be as precise in both lifestyle as well as medical history and the history of the individual. But then there are challenges of how you incorporate all these different levels of data, particularly for illnesses where the pathology is not very well known. So it's a very different story when-- what Ed presented is wonderful. Yes? But you have to know well the pathology behind the illness. When you don't know very well exactly what's going on, then it's way harder to actually incorporate all these other data.

TOMASZ NOWAKOWSKI: Great. Thank you. This was a really lively discussion, but I think it highlighted something really important, which is that there is a clear need to begin to embark on a common analysis and putting the data together in one place using some of the approaches and strategies that Seth, for example, has outlined. It seems like a really immediate thing for us to be thinking about and perhaps using data to analyze and even understand what sort of ancestry composition we have in existing cohorts. So we were hoping with Keri that the next half an hour that we have for this panel would be dedicated to sort of defining some of the more concrete steps about what should we be thinking, how do we go about it, how do we standardize issues. For example, analysis of data from dorsolateral prefrontal cortex, even though as Keri mentioned in her talk, there is some variability. It's not as extensive, but is a focus on dorsolateral prefrontal cortex an opportunity in a way that many of these cohorts have been profiled in that region to have a focused analysis across consortia with available data using some of the BRAIN Initiative data as a reference? What do you think the prospects are? What are the challenges? I know, Ed, you've thought about this problem for quite some time. Do you want to get us started? What have you experienced?

ED LEIN: So I guess, I mean, I think this was sort of an aspiration until pretty recently. I think that the techniques have gotten good enough to be able to integrate data sets using these references. And that ability to map the common reference kind of overcomes a lot of other issues that you might have in trying to integrate just based on gene expression and things like this. So I think it's totally possible to do this now. And in fact, I think we helped to provide an earlier version of a prefrontal cortex taxonomy to PsychENCODE that got used across a set of studies there. So it certainly can be done. It takes effort. I think this is one of the biggest things. I mean, it's fine to say, "Let's do this now," but it actually takes some serious computational effort and dedication and time to do it. And so I kind of view it as a supplementary type of activity, albeit extremely valuable. So if there was a mechanism and we can free up some time for people or find new students or something that could do it, it's doable now. And it could be really very valuable for a whole variety of reasons. I think one of the things that comes out of this is a head-to-head comparison of people's different assumptions and their experimental design criteria. It's pretty arbitrary how much people sequence and how much they multiplex and how much coverage they get per donor. And I think that you can now kind of systematically look at that in one place and say, "If you want to get these kinds of results, you need to apply your methods in such a way." And that could be really helpful because it's very, very hard to understand right now what is the best strategy when you're sampling in the same brain region.

TOMASZ NOWAKOWSKI: Yeah. So I know that BICAN has had a lot of discussion about how do we standardize those anatomical dissections, for example, and the extent to which the photo documentation is really important. If you're now sort of embarking or thinking about embarking on this common-analysis efforts, what sort of sample inclusion/exclusion criteria would we have to consider or would your recommendations be to include all the data that are available but use available information? In other words - and I think several of you have touched upon this before - to what extent can we really believe and understand the metadata that are coming from various patients? What are people's thoughts on sort of focusing on the samples for which we have-- or data sets for which we have very deep phenotypic characterization versus the ones that have less metadata information, but perhaps really high quality and are relevant? What should we prioritize? Philip.

PHILIP DE JAGER: Yeah. Honestly, we need to do both, but I would say there are different tasks, right? So I think having the deep-- well, first of all, going back to Ed's point, I think the metadata is really key. And especially when it comes to pathological phenotypes, it still amazes me that all the different collections do slightly different things in terms of measuring amyloid and tau and other proteinopathies. And that's actually something that's very doable and would enhance the utility of the existing data, right? If you just do the same stain in the same way in the brains that are already profiled, then you can sort of accumulate a much larger sample size, potentially. So I would say that's sort of probably a low-hanging fruit. In terms of the other question, I think, again, integrating the sort of deeply sequenced samples with the more sparsely sequenced ones is going to be fruitful. We certainly hit this problem where we maximize the number of individuals, but at the cost of having about 3,800 nuclei per person. And what we found is that we got the different subtypes of cells, but often, the number of cells per person was relatively small. And then you can't really-- you sort of get stuck with not being able to do good statistics at the cell subtype level. But perhaps, if we have a good reference, we could probably infer some of this. So we can probably impute some things to some extent. But it gets also to the next phase of studies and sort of guiding them to probably sequence deeper than we did. That would be my recommendation.
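
[Editor's sketch, not part of the spoken remarks.] The trade-off described above, many donors at about 3,800 nuclei each versus enough nuclei per cell subtype for statistics, can be illustrated with simple binomial arithmetic. The following is a minimal sketch under stated assumptions: the 3,800 figure comes from the discussion, while the subtype fraction of 0.5% and the thresholds are hypothetical values chosen only for illustration, and the model assumes nuclei are sampled independently. With these numbers the expected count is about 19 nuclei per donor for that subtype.

```python
from math import exp, lgamma, log


def expected_nuclei(total_nuclei: int, subtype_fraction: float) -> float:
    """Expected number of nuclei from one donor that fall in a given subtype."""
    return total_nuclei * subtype_fraction


def prob_at_least(total_nuclei: int, subtype_fraction: float, min_nuclei: int) -> float:
    """P(X >= min_nuclei) for X ~ Binomial(total_nuclei, subtype_fraction).

    Each term is computed in log space so large binomial coefficients
    do not overflow. Requires 0 < subtype_fraction < 1.
    """
    n, p = total_nuclei, subtype_fraction
    p_below = 0.0
    for k in range(min(min_nuclei, n + 1)):
        log_pmf = (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1.0 - p)
        )
        p_below += exp(log_pmf)
    return max(0.0, 1.0 - p_below)


NUCLEI_PER_DONOR = 3800   # figure quoted in the discussion
SUBTYPE_FRACTION = 0.005  # hypothetical: a subtype making up 0.5% of nuclei

if __name__ == "__main__":
    print("expected nuclei:", round(expected_nuclei(NUCLEI_PER_DONOR, SUBTYPE_FRACTION), 1))
    for threshold in (10, 20, 30):  # hypothetical minimum counts per donor
        prob = prob_at_least(NUCLEI_PER_DONOR, SUBTYPE_FRACTION, threshold)
        print(f"P(at least {threshold} nuclei) = {prob:.3f}")
```

Under these assumptions, a donor will rarely yield 30 or more nuclei of such a subtype, which is one way to see why sequencing deeper per donor, or imputing against a reference, comes up as a recommendation.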

TOMASZ NOWAKOWSKI: Thank you. Dayne?

DAYNE MAYFIELD: Just to follow up a little bit on the idea of consistency, a lot of this will be on the investigator to really communicate with the brain banks and be very clear if you want dorsolateral prefrontal, where exactly is it that you want? And as Ed was saying, if there's a common reference where there is nice photography and it's very clear what the common reference anatomy is, then you can communicate that with the banks and get a lot better quality tissue. Very often, it's easy to just request tissue. And when you request dorsolateral, I mean, that's a lot of tissue. And there's a lot of investigators wanting it. So I would just recommend good communication with the brain banks to really have a clear idea of how your sample is going to match up with samples that you'd like to integrate in your data.

TOMASZ NOWAKOWSKI: Ed?

ED LEIN: I apologize. I'm speaking too much. I just wanted to say, I will mention this in some comments in the next session of how BICAN is trying to do this with kind of going back to the preparation of brains in the first place, but the documentation of those, it went through a portal system here. And a switch in the approach with the brain bank to drawing on a slab image what you would like so that the investigator picks it rather than receiving something that some third party said was DLPFC and that's what you work with.

KERI MARTINOWICH: Yeah. I would just second that, what Ed said, that having the pictures is really important because we hear that a lot. I mean, for single-nucleus RNA sequencing, people are just getting pulverized tissue in a tube. And you, really, at that point, can't know what that came from. So I think that having those pictures, if you have slab photos of the slab before and then the slab after and you really know where that block came from, this really helps you to be able to put that back into space as a kind of basic thing. And this is not hard for brain banks to do now with iPhones and things like that. It's really not hard. One other thing that I wanted to bring up is even the most basic metadata, I think, across brain banks and across sites is not very well standardized right now. Things that we collect and almost take for granted that we're using PMI and RIN, these things actually mean different things to different people across sites, which is something that I think is important, just the sites coordinate, because how people calculate PMI is not always the same. And so those maybe have to be adjusted or could be adjusted, but it's important to know. And then, also, RIN is something that most people-- a lot of brain banks just RIN the brain in a standard space, somewhere in the cortex. But as we move to other regions - we've seen this a lot - that just because the brain has a RIN number of 7.5 - and that's what the brain got - if you actually RIN sections in a different area of the brain, they can be radically different. And so I think some of these even standard metadata terms are something that having a dictionary of what those mean and what they mean to different people would be really important.
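
[Editor's sketch, not part of the spoken remarks.] The "dictionary" of metadata terms suggested above could be as simple as a record, per term, of how each site actually measures it, so that differences are visible before data sets are pooled. This is a minimal sketch only: the class name, the site names, and the example definitions are all hypothetical and do not describe any real brain bank's protocol.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataTerm:
    """One metadata term and how each contributing site defines it."""
    name: str
    unit: str
    # Maps a site name to that site's working definition of the term.
    site_definitions: dict = field(default_factory=dict)

    def is_harmonized(self) -> bool:
        """True when every site reports the same definition (or none is recorded)."""
        return len(set(self.site_definitions.values())) <= 1


def terms_needing_review(terms):
    """Names of terms whose definitions differ between sites."""
    return [term.name for term in terms if not term.is_harmonized()]


# Hypothetical entries for illustration only.
EXAMPLE_TERMS = [
    MetadataTerm("PMI", "hours", {
        "site_a": "time from death to tissue freezing",
        "site_b": "time from death to refrigeration of the body",
    }),
    MetadataTerm("RIN", "score", {
        "site_a": "measured once in a standard cortical location",
        "site_b": "measured in the dissected region of interest",
    }),
    MetadataTerm("age_at_death", "years", {
        "site_a": "age in completed years",
        "site_b": "age in completed years",
    }),
]

if __name__ == "__main__":
    print(terms_needing_review(EXAMPLE_TERMS))
```

Keeping the per-site wording next to the value makes it possible to decide later whether a term can be adjusted to a common definition or should simply be flagged.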

TOMASZ NOWAKOWSKI: Seth?

SETH AMENT: Just one additional point about the sample metadata, I find the digital pathology that's being done for Alzheimer's really inspiring. And I would love to start to establish what the equivalence may be for the psychiatric and substance use cohorts that we're working with, whether that's something about synaptic morphology or other aspects. At a sample-by-sample level, being able to quantify that, I think, would be extremely valuable for making sense of what we see with the single-nucleus profiles. And keeping the samples in such a way that we can do that in matched ways, I think, is also really important.

TOMASZ NOWAKOWSKI: Gustavo?

GUSTAVO TURECKI: Yeah. Just to follow up on what Keri said, so that's really important. I think we need to standardize a lot of the variables that we typically report. But we also need to better understand which are the ones that make the most impact on the work that we're doing. So for instance, PMI, it's a measure that's often requested and we have to report everywhere, but PMI really means many different things. Yes? So if a person, let's say, died in the middle of winter outside and then was found 10 hours after and then put in the fridge at the morgue, the quality of the brain tissue is going to be very different from another person who died in the middle of summer and stayed 10 hours outside and was not put in the fridge at the same time. And so there are some of these measures that we often are asked to report that are quite variable in terms of their impact on the biological measures that we do. So I think that in addition to standardizing, we need to come to terms with what is really important in terms of the effect on what we are looking at.

TOMASZ NOWAKOWSKI: Right. So what your recommendation, essentially, would be is for cohorts or cases where we have this deep information, a meta analysis or a common analysis could help us, at least, disentangle the major drivers of that variance. And that probably also speaks to the PMI issue where in our experience and other people's experiences, even some of the samples that don't have the most amazing PMI or the most amazing RIN scores, sometimes, actually, get pretty decent data. And so sometimes, that correlation is not ideal.

GUSTAVO TURECKI: Exactly.

TOMASZ NOWAKOWSKI: Great. Great. Chunyu?

CHUNYU LIU: Yes. I agree. Certainly, the more data, the better data, the most accurate data will be always preferred. But I just want to remind everyone, there's also some nice statistical procedures have been developed that can capture those hidden variables we can pretty well control. And it has been proven to be fruitful to produce some reproducible results, at least, for some of the analysis, differential expression or eQTL mapping. So for that reason, I really feel we should maximize the use of all the data that have been generated today.

TOMASZ NOWAKOWSKI: Absolutely. I think that's really a historic opportunity for us because we're coming to the point where we already have data as opposed to having to sort of plan for new data, which is obviously exciting. One thing that I want to also touch upon and maybe elaborate on are opportunities that exist for leveraging not just single cell or even spatial transcriptomics data, but also some historical data. And I think Keri brought this up a little bit earlier. There is a wealth of bulk-tissue RNA sequencing data sets that can be deconvoluted, for example. And I know Hyejung has been thinking very deeply about the activity of cis-regulatory elements and some of the more functional insights, especially as you think about predicting the role or the significance of the GWAS signal that Ed and others have been bringing up. So what are the opportunities that we actually are facing? So okay, well, we're going to put the data, we should think about putting the data in one place and performing meta-analysis. What are some of the things that we could do to take advantage of what already exists? I wonder if people can comment.

HYEJUNG WON: Because Tom pinged me, I will start. So this kind of comes back to what kind of other consortia can benefit from BICAN. And I know that a lot of BICAN has already established collaboration with a lot of NIMH-based consortia like PsychENCODE and SCORCH. But there are also a lot of NHGRI-based consortia like IGVF. And in the IGVF consortium, which stands for Impact of Genomic Variation on Function consortium, we're really interested in identifying the genetic variants. What are their functions in the context of cell types and also in the context of gene expression? And I feel like there is a huge level of single-cell RNA-seq effort there. And one of the focus areas is the brain. But I don't think there's much crosstalk between BICAN and IGVF when it comes to the reference data sets that are available. And another thing that these GWAS field people are super interested in is we now would like to really functionally test the variant effects. But then we all know that-- I think Bing has given a very nice presentation today that cis-regulatory elements are very much cell-type specific. So then the question would be, in which cell type should we really test these variants? And I think BICAN kind of can provide a good proxy for such things.

HYEJUNG WON: One question in that case will be, okay, for schizophrenia, Bing has shown that all excitatory neurons and all inhibitory neurons seem to be associated with schizophrenia, does that mean that we don't really have to go into specific cell types or shall we just go to the very broad glutamatergic neurons? Will that be sufficient or should we be more granular in terms of that assay? And also, I would like to point out that PGC or Psychiatric Genomics Consortium has started a functional genomics working group. And I think they're also trying to bring a lot of QTL resources, bring a lot of single-cell resources, bring a lot of functional genomics resources or validation resources all over the place. And I think BICAN can provide really good reference data and also suggest, not only just provide the data, but really give very hands-on advice in terms of what cell types will be the best cell type to go for these assays or how much granularity we should really think about when it comes to understanding the disease.

HYEJUNG WON: And finally, what I really would like to talk about is that I was very impressed with Philip's talk where he tried to parse out the causality from just the postmortem expression signature. So I'm very interested in what is the genetic etiology of disease and how much of that propagates to the gene expression difference that may not be the full profiles that we are seeing from the postmortem brain expression data set because postmortem is not only about coming from genetics, but it can be also in response to many different environmental and life histories of individuals. So it would be really nice if we can make some sort of direction from genetics to the very direct mechanism and then how that propagates into what we see from the postmortem brain data sets. That's all.

TOMASZ NOWAKOWSKI: Well, Philip, if you'd like to comment on that or-- yeah.

PHILIP DE JAGER: Thank you, Hyejung, for your comment. Yeah. I mean, I think this is definitely an important area. And we get, I mean, a lot of fair comments with this type of modeling because, of course, it's cross-sectional data. And so we have to be very careful when we present that this is just a model, and we have to somehow be clever about how we're going to demonstrate this longitudinally. So those are definitely some big issues to think about. I would say one point sort of maybe to bring up for the conversation, since we're talking about QTLs, I mean, I agree with several of the other panelists that we saw, actually, a lot of subtype-specific QTLs. Even though we had less power to discover them, we saw that that's the level where a lot of stuff is hiding. And so, again, that speaks to sort of going deeper. And perhaps, I think going back to Ed's presentation, also, we could try to guide people to have a target subtype, right, because then you can really sort of develop the proper power calculation to know how deep you have to go to really go to town and really completely characterize a particular neuronal subtype or microglial subtype, whichever one you really care about.

PHILIP DE JAGER: And maybe a separate point is that the data sets from iPS-derived cells are getting larger. And so one thing that we were able to do is actually to validate-- or to see whether our QTL is actually translated to induced astrocytes and induced neurons. And some did, even with relatively modest sample sizes of 40, 50 lines had been profiled systematically. And most of them were in the same direction as in the brain, but some of them were actually in the opposite direction, significant but in the opposite direction. And what does that mean exactly? Of course, it's a little early. But I think that's where these context-specific effects, which I think we're all interested in, are going to be hard to-- and we have to be really careful when we go to the model system. Even if the gene is expressed, the effect of the variant may not be there because the model is not in the right state.

KERI MARTINOWICH: Can I just bring up one more kind of point for discussion that we didn't touch on but I was reminded of when Hyejung talked about the PGC is that one thing we had to kind of talk about or wanted to talk about was about genetic enrichment and kind of bringing in those consortia that are developing the GWAS. Because I think there's a lot of people in the field who are confused about, I think, the - I don't know - protocols and how to approach this are sort of all over the place, right? There's not a lot of standardization of what's the right thing to do right now, how to do genetic enrichment correctly. And maybe there's not a right answer to that, but it'd be good to hear from people about maybe setting some field standards for that for some of the larger consortia. And then, also, I think for the genomics consortia, a lot of people don't know where to find the data. And so you see a lot of papers that are using older, outdated GWAS. And if there was something where people were all using the same sets, if there was a way to access just the summary stats from-- and it was all up to date-- because sometimes you see things that are not up to date. And that is something that seems like sort of a low-hanging fruit, easy thing to do to help standardize things. So if people have comments on that, that would be great.

HYEJUNG WON: I can actually add to that. So when it comes to the PGC data, I think there is a UNC server that hosts all the PGC GWAS for each psychiatric disorder. And they also point out which year and what paper this data is coming from. So that could be a really good data set to just first look at when it comes to psychiatric disorders. But I realize that when it comes to neurodegenerative disorders or addiction kind of related GWAS, there's not really one resource that we can really go for.

TOMASZ NOWAKOWSKI: Chunyu, do you want to follow up on that?

CHUNYU LIU: Yeah. So I really like Keri's point. So I guess in many of our labs, actually, we are curating our own collection of all kinds of genome lists so we can do the enrichment test. But I don't think that's a best practice. It might be better if our whole community can work together if they have a central repository of the most updated genome list of summary statistics. It will be the best for the community to do good things together.

TOMASZ NOWAKOWSKI: Yes. Are there any good models? Sorry. Before I give the microphone to Dayne, are there any good models that people can think about or you can think about while Dayne is raising his point? But if there are good models for how things are done in a similar way or have been done in the past that we could follow that have been particularly successful, I think that would also be quite informative. So while people are thinking about it, Dayne?

DAYNE MAYFIELD: Yeah. I just wanted to follow up on the previous discussion about the integration of the genomics data and the idea of central repositories, which is a great one. One of the limitations that we've had over the years is getting material transfers in order because the more you want to share data, the larger that problem becomes. And every university has their own material transfers. So some kind of generalized material transfer would really aid that process of being able to utilize the central repository. So keep that in mind when you want to share data. Yeah. At a place like University of Texas, it's a very slow process to get material transfer. And so you have to really plan ahead. So it'd be nice to be able to bypass some of that time component.

TOMASZ NOWAKOWSKI: Thank you. So we have five minutes left. So maybe thinking about, are there any sort of specific recommendations or good models to follow that things that have been done in the past, maybe, not always in the context of transcriptomics or even in the context of brain data? But I'm obviously looking at Hyejung. In your experience in genomics, are there examples that we could get inspiration from?

HYEJUNG WON: Well, I don't really perform differential expression analysis very often, but I can give some examples coming from the IGVF consortium where there are a lot of-- for example, for a lot of functional validation studies that either use massively parallel reporter assays or CRISPR perturbation at the single-cell level, people have realized that there is not really a unified pipeline that everyone can just rely on. So people use all the way from a linear regression model or just even the t-test or Wilcoxon test without really considering any confounding variables all the way to something that is more refined. So one of the aims for the focus groups of the consortium is to bring all these people who are really interested in one assay and then try to benchmark multiple existing tools out there and see which one has the best capacity. And then out of that, we are trying to also work with the initial developers to kind of change their pipeline so they can be more easily used by other people. So I feel like the single-cell RNA sequencing platform also has the same thing. Some people really hate UMAP versus some people use Python versus R. And from a user perspective, I'm very confused. So if there is one uniform pipeline that you guys can kind of present as BICAN, this is how all the data sets have been processed and this is the data format that everybody can chime in, then I think that will be very beneficial.

TOMASZ NOWAKOWSKI: Seth?

SETH AMENT: Yeah. I can speak to just that last point. I think that one level of data-sharing standard that there's been some progress on is not so much you have to use R or Python, but formats for describing datasets. So Seurat objects have been around and all those things, but having specifications for how those annotations are occurring and formats has been one topic that's been a working group that we've been involved in over the last year, leading to some recommendations and something called the FOAM standards. I don't know that that's been widely adopted so far, but I think it would be wonderful if we could start to share data in a uniform way that way or by some standard because it just makes it so much more straightforward than to do a meta-analysis and integration.

TOMASZ NOWAKOWSKI: And perhaps as part of this common format, you can also integrate some of the annotations, cell type annotations and taxonomies that BICAN has created. Great. Keri, do you want to make any final remarks before we break?

KERI MARTINOWICH: I don't have anything. I think Ed has his hand up, though.

ED LEIN: Can I make one little final plug? I mean, we've talked about sort of how to integrate with other parts of the community. But the use of transcriptomics allows us to correlate across organ systems. And I think there's a huge opportunity to integrate with HCA and the HuBMAP sort of efforts that are trying to coalesce around the other organs. We're producing the brain here, but by doing it in the same way, you would be able to do analyses across the whole body. And I think that that's going to be really powerful in the future if we can make sure to accomplish that.

TOMASZ NOWAKOWSKI: Great. Well, thank you to all the panelists. This has been a very productive discussion. And thank you for sharing all your thoughts. And I will hand over to Daniel.

DANIEL MILLER: Thanks, Tom and Keri, and thanks to the panelists. It was a really rich discussion.

JOHN NGAI: It's been an amazing three days. I think we've learned a lot. A lot of really great science has been presented. A lot of great questions have been asked all along the way. In a lot of ways, I feel like that last discussion was great because it kind of crystallized a lot about what BICAN and this whole effort is all about, which is to really use these resources-- to generate these resources so that we can get functional information about how the brain works, but as well as to push this as a resource for better understanding disease processes. And we heard a lot of really exciting talks today about that.

So again, we are looking-- we're standing up this whole BICAN project to develop a resource for the community. And going back to what I mentioned, I think - I can remember - on the first day, this is one of three transformative projects that the BRAIN Initiative has launched. And we really, truly do intend these projects to transform the field. And as I think we can see already, just from the very-- just from the first five or so years, it really is having an impact. But I think the impact can really grow if people really put their minds to it.

So I'm just going to take a break for a second. I'm going to hand it over to Ed. Ed has a brief slide deck that I think really encapsulates a way of approaching this. And we really appreciate folks' input about how we can better engage not just the people who are working on this directly in BICAN but really, at the field at large. And again, as I mentioned earlier, this isn't just for people to use. We're really hoping this will be an interactive resource where people who are benefiting from it can also contribute as well. So, Ed, can I hand it over to you for a bit to maybe frame the discussion in some concrete terms?

ED LEIN: Yeah. Absolutely. Thanks, John. So we could have done this at the beginning, but actually, I think it's quite good to do this at the end. Yong asked if I could give a little look under the hood of how BICAN is being organized to try to have this community try to produce these foundational references for everyone to use and to build off of. And so I'm going to try to represent some of the efforts in trying to coordinate this group of researchers into a common ecosystem to be able to produce these data, analyze these data, and sort of make them more formal resources for the community. So I'm going to do this very quickly. Those of you that are in BICAN, sorry, I'm trying to represent all of your work, but it might be helpful, I think, for those of you that are not part of it to see how this is really being orchestrated.

So let me just start by saying, really, a big goal - this is actually a slide that gets used a lot for the knowledge base as part of this - is to produce this reference or these foundational set of references for the entire community. And this is not intended to be a single cell genomics project only. This is something that can hopefully feed and have meaning for the entire community, everything from the cellular circuit, developmental systems neuroscience, to neuroimaging, to disease consortia we talked a lot about here, to the other big NIH projects like the Armamentarium and CONNECTS that can build on top of these data that are being generated here and then connect with the rest of the community so it amplifies the impact of what we're doing here.

This is a much more orchestrated and coordinated effort than the BICCN phase was. Really, what NIH had in mind is that we would create this centralized ecosystem where there would be more control over the specimens that go into this work, particularly the human specimens, but also non-human primate and even mouse, that there would be centralization of the sequencing so that this all runs through the same sequencing centers, a lot more standardization, and also allows the data to flow directly into the data archives from the get-go so that these data become available early. And to do this, also, we needed not just the experimental groups that are generating the data, but also a big informatics component to create this ecosystem, both sort of at the level of the specimens and handling that side - I'll mention that briefly - at the level of the sequencing data, and at the level of the archives. And so these are the so-called CUBIEs that are part of this that complement the other experimental efforts.

The initial set of projects is actually quite wide-ranging, largely focused on human and non-human primate. So you've heard about a couple of these during the day, but they're efforts to map across the entire brain of the human/non-human primate and the developing human and non-human primate. And then also looking across different modalities, so getting really into multiomics, for example, with the Ecker project, and individual variation with the McCarroll Project you heard about as well. There is a big emphasis now on development as well, and that is one of the big projects in the mouse. In fact, the only of the very large projects in the mouse is to very comprehensively sample development so you can map across species with sparser data as well. And so we had to encapsulate all of this into this ecosystem and have a flexible system that could incorporate new projects that will undoubtedly be funded in subsequent phases.

I just want to highlight a couple of the components that I think are important and relevant for the discussions. One of these is, to study human brain, you have to have access to specimens. And you want those specimens to be carefully prepared in semi-standardized ways, at least. And so to meet this need, the NeuroBioBank is part of this consortium and several other banks: the University of Washington and UC Irvine. And for a period even preceding BICAN, there was a lot of effort to standardize methods and try to bring the best brain preparation methods to this, including thinner slabbing so you can improve your accuracy and precision of sampling, better freezing methods that will be good for both single cell and spatial methods, etc. And in fact, there are more ambitious methods as well in the works, but essentially, to be able to prepare these in a standardized way and capture information about the donors in a standardized way as well.

This includes standardization of anatomical frameworks. And this is one of the big components here, is to have common coordinate frameworks across the species. So people are used to this in the mouse. We need this in non-human primate and human. And standardizing on how we refer to structures across the species. So we've selected Song-Lin Ding's adult human reference atlas as the basis of this. Everybody's mapping against the same reference framework across human, and then those are being extended across the non-human primates. We've established - this is the work of GQ Zhang and others - a specimen portal, as we talked about in the last session, to be able to have people look at the specimens, have access to information about the specimens, identify the closest reference atlas plane of section to their slab, and then actually annotate what they're selecting, what they're going to be analyzing. And this works both as a documentation process; it also works as a request process for the brain banks so that you draw a region and you request that region. And right from the get-go, you assign what anatomical region it is, and it's photo-documented and captured in the system. And that's a necessary step then for things to move downstream. And so this begins a sort of universal identifier process.

So we were really encouraged to come up with a series of joint milestones to help drive the overall consortium focus. And Yong showed this at the beginning. I just want to reiterate here. We've really tried to hone these joint milestones around the products and deliverables that will come from BICAN. These include the data, raw data, more annotated data, includes standards, SOPs, the references themselves. It's the annotated references that we've talked about a lot. This also includes the CCFs, software tools, and publications which, of course, incorporate the scientific progress and knowledge that's coming from this as well, which is a very important part of this. All these sort of fit into a standardized end-to-end pipeline for data. And so this is a generic version of this, where we start with our sampling plans. You run experiments. It goes through this more integrated analysis. Along the way, we're establishing standards, and these end up in the data archives and being presented to the world or published on. This involves a lot of within-pipeline tracking of information. GQ Zhang and Kim Smith, in particular, I want to call out for developing this whole system of assigning identifiers and tracking both information or actual materials all the way through the system so that the whole process works. And then this is a generic version that can be applied to a variety of different data modalities.
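
[Editor's illustration, not part of the spoken remarks.] The identifier tracking described above, where every specimen and derived material carries an identifier that links back to its source, can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the `Registry` class, the entity kinds, and the identifier format are hypothetical and do not describe the actual BICAN system.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    """One tracked item (hypothetical model, not the BICAN schema)."""
    entity_id: str
    kind: str                      # e.g. "donor", "slab", "tissue_sample", "library"
    parent_id: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)


class Registry:
    """Assigns identifiers and records which item each one was derived from."""

    def __init__(self) -> None:
        self._entities: Dict[str, Entity] = {}
        self._counter = itertools.count(1)

    def register(self, kind: str, parent_id: Optional[str] = None,
                 **attributes: str) -> str:
        # Refuse to register material whose source is unknown to the registry.
        if parent_id is not None and parent_id not in self._entities:
            raise KeyError(f"unknown parent identifier: {parent_id}")
        entity_id = f"{kind}-{next(self._counter):04d}"  # illustrative format
        self._entities[entity_id] = Entity(entity_id, kind, parent_id,
                                           dict(attributes))
        return entity_id

    def lineage(self, entity_id: str) -> List[Entity]:
        """Walk parent links back to the root; return the chain root-first."""
        chain: List[Entity] = []
        current: Optional[str] = entity_id
        while current is not None:
            entity = self._entities[current]
            chain.append(entity)
            current = entity.parent_id
        return list(reversed(chain))


# Example: a library traced back to its donor (all values are made up).
registry = Registry()
donor = registry.register("donor")
slab = registry.register("slab", parent_id=donor)
sample = registry.register("tissue_sample", parent_id=slab, region="putamen")
library = registry.register("library", parent_id=sample)
lineage_kinds = [entity.kind for entity in registry.lineage(library)]
```

In this sketch a downstream data set only needs to carry the library identifier; the registry can recover the sample, slab, and donor from it.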

Just very, very briefly, on the specification side and the standard side, we want a metadata model. We need protocols that are open and documented standards for library preparation, for annotating the process of classifying that data, and FAIR processes for everything. On the reference side, a big goal is to produce these cross-species taxonomic classifications, as well as the CCF standards, and to begin to get towards the standard nomenclatures and ontologies that we were discussing earlier in the day today that can all be sort of incorporated into the knowledge base where those entities then allow you to aggregate all kinds of information in that and then deploy it to the world. On the publication side, we really want to implement FAIR principles all across the board here and clear references to the actual data that underlies those and ways to have outreach and to make sure that this impacts the community as a whole, including enhancing diverse perspectives, for example, to underserved communities. And then, finally, we have many tools that will come for this. So we're inventorying these, really trying to use this whole federation of tools to disseminate information to the community as a whole. And so this was really just meant to be a very high-level overview, but these are a set of milestones that we jointly came up with in the consortium as a way to sort of formalize some of these processes of how we will standardize, how we'll produce, and how we'll disseminate information.

I want to briefly touch on the second topic I was asked to do, which is the joint sampling plan. So, as you probably know, most of these consortium efforts begin with an open call where people propose an experimental design. And then, after they're funded, there may be some redundancy. There may be some different perspectives once you're seeing who else you're working with. And so we came up with a joint sampling plan to really try to harmonize across the data sets and maximize the overall package that will come from all of these projects. And so this is trying to get as much anatomical coverage as possible of different cell types to try to coordinate across brain regions so we're looking at the same things, both in the adult and in development, and particularly to have a focus on disease-relevant regions for a human brain.

This led to the selection of the basal ganglia circuitry as an initial area of joint focus, as I'll talk about just very briefly again. But it provides a consortium-wide focus to try to really hammer down a description of the regions that make up those circuits that then can be built upon to actually understand the circuits through CONNECTS or to develop tools to target components of it through the Armamentarium, for example. This prioritization, I should add, is-- over the course of the entire project, it is possible that some of that prioritization could be affected by the needs of the community. So just in very broad strokes, we have coverage of prenatal development, of postnatal development, and the adult across many different brain regions with many different modalities. As I mentioned, we are putting this initial focus on the basal ganglia circuitry. And the reasons for this, of course, are their involvement in many different kinds of disorders, both movement disorders as well as addiction, OCD, and other disorders. If you think of all of these different efforts focusing on this one region, we get to answer all kinds of questions about this and hopefully make a resource that becomes meaningful for a very wide part of the community. It is also a manageable part of the system, and that's an important component because it's hard to tackle the whole brain all at once. And so we have an initial focus. We can deal with everything from having to come up with annotations and nomenclatures as well as deal with tissue availability issues at the outset of this project. So everybody in the consortium has standardized to the same anatomical nomenclature and put their sampling plans in this context so that we can then begin to manage this and go for integrative analysis across the groups.

So let me just finish by saying we want the community to use these resources. This is the goal of this, is to come up with something that's meaningful and useful for the community. And a pitch from me to this community is that we would much rather have you contributing to making this something useful to you rather than having redundant efforts where everybody redefines things again and again and every paper has a different nomenclature system. There's just no need for that with this kind of investment anymore. And so this is going to really kind of follow some of the elements of the genome in the past, formal releases of the references in terms of the taxonomies within and between species, trying to get to these more formal nomenclature systems with aliases as appropriate, having many tools to visualize this, tools to map against it, like BLAST was for the genome, this MapMyCells, as well as other tools that you've heard about, provide those kind of capabilities for being able to map against the reference to label transfer and talk the same language. And then finally, this sort of this aggregation of information through the knowledge explorer and knowledge base. So this is sort of the trajectory that we're on. We're trying to create references akin to the genome that the community can help to kick the tires on and add information to and see what works for you and really accelerate everyone's research in the field. So let me just end by saying this really is a great community, very collaborative and interactive, that wants to try to do this, is inspired by this mission. And I want to thank everybody for that and hope that we can convince all of you to help to participate and use this reference. I'll stop there.
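
[Editor's illustration, not part of the spoken remarks.] Mapping against a reference to transfer labels, as described above, can be pictured with a deliberately simple Python sketch: each query cell takes the label of the reference cell type whose average expression profile it correlates with best. This nearest-centroid approach is a stand-in chosen for brevity and is not a description of how MapMyCells works. The cell type names, expression values, and the `min_correlation` threshold are all hypothetical.

```python
from math import sqrt
from typing import Dict, List, Optional, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def map_cell(profile: List[float],
             reference_centroids: Dict[str, List[float]],
             min_correlation: float = 0.5) -> Tuple[str, float]:
    """Return (label, score); label is 'unassigned' below the threshold."""
    best_label: Optional[str] = None
    best_score = -2.0  # lower than any possible correlation
    for label, centroid in reference_centroids.items():
        score = pearson(profile, centroid)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is None or best_score < min_correlation:
        return "unassigned", best_score
    return best_label, best_score


# Hypothetical reference: mean expression of four marker genes per cell type.
reference = {
    "D1_MSN": [9.0, 1.0, 0.5, 0.2],
    "D2_MSN": [1.0, 9.0, 0.5, 0.2],
    "Astrocyte": [0.2, 0.3, 8.0, 7.0],
}

label, score = map_cell([8.0, 2.0, 1.0, 0.0], reference)   # resembles D1_MSN
flat_label, _ = map_cell([1.0, 1.0, 1.0, 1.0], reference)  # no informative signal
```

The threshold matters for the shared-language goal: a cell that matches nothing well is reported as unassigned rather than forced into the nearest reference label.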

JOHN NGAI: Terrific. Thanks, Ed. You're right. It could have been the start, but it's actually just as well that it's at the end so we can actually think about these things with all the discussions we've had along the way. So it'd be great to open things up for general discussion. Again, I think many of the important and pressing questions have been answered along the way, but especially in that last session. Maybe if I just get things started, there was quite a bit of discussion about the importance of surveying samples from diverse populations, not just to be looking at brains from Western European males, as it were. And I think this is a very important point to look at tissue across lifespan from both sexes, different ancestries. At the risk of stating the obvious, I think there's at least two interlaced reasons for doing that. First of all, it's just better-- it's good biology, right? I mean, we can learn a lot through study variation. I mean, people do this in many, many different organisms. When you look at all the different strains of mice, flies, what have you, people looking at natural genetic variation, the effects of environment, the effects of environment on top of genetic background. All these things can be done, and I think we really have a great opportunity to do this in the human population using not just transcriptomics but these different modalities, different molecular modalities as kind of a foundation for that.

So I think, just in terms of doing better biology, it's critical that we do keep in mind looking at samples from diverse populations. But the other point here is in terms of looking at disease. I mean, to the extent that genetic variation or other variations can influence biology, it also gives rise to disease vulnerability and resilience, right? So I think, just in terms of just doing better science, it's really, really critical. So this is something that we have taken very seriously. We thought long and hard about it when we kind of launched this project. And we're really pleased at what's come in from the group that now does form BICAN.

MIKE HAWRYLYCZ: Could I make a comment, John? Is that okay?

JOHN NGAI: Yeah, of course. Of course.

MIKE HAWRYLYCZ: Yeah. I think, I mean, this is-- we never before, certainly, have we addressed in this big community context the challenges and the desiderata for interpreting and understanding these data. But I think, to really profit from this, we really need to write down a roadmap. We need to write down a program of analysis, of study, of characterization that all of us can-- that it provides kind of a manifesto that people are really willing to drive toward. I think, otherwise, without doing that, we just-- things go by the wayside. There's no formal documentation for what the goals really are and what we've learned and discussed. So I would really like to advocate for such a really coordinated report to come out of this with-- that everyone can kind of contribute to who participated.

JOHN NGAI: Yeah, and I hear you volunteering to lead the charge.

MIKE HAWRYLYCZ: I'm volunteering to sort of get it going anyway. If people want to do it, I think it's really, really-- I think it's necessary to profit from what we've done.

JOHN NGAI: Yeah, I appreciate you raising that, Mike. It would be great if some of you folks can get together and think about putting together, at a minimum, some kind of a white paper. I believe that we'll have somebody kind of give a summary of what's happened here, an accounting. But that's kind of just an accounting of that. But if there can be a distillation by some of the folks here into how that might inform future efforts-- what are the opportunities? What are the gaps? What are the opportunities? What are the challenges? And what's it going to take to address those challenges and seek out those opportunities? That would be quite useful for, certainly for us at NIH, but for the field in general. Hongkui?

HONGKUI ZENG: Yeah, I think a roadmap would be a really great idea towards the ultimate goal. I feel that, of course, we just got started, but we are also already in year two of the BICAN, right? And I think at least the ultimate goal of BICAN is to create a comprehensive whole brain atlas for the entire human brain as well as non-human primate brains, and along with variability studies, along with developmental studies, especially in mouse and some human/non-human primate stages as well. So five years will go by really, really fast. So I feel like we should have some kind of a roadmap of milestones, maybe for different projects, just so that we know how we can actually achieve that goal. It's great to have the basal ganglia as a first study, but I do have the worry that, if we focus too much on just basal ganglia and maybe trying to push a publication package or something like that out, that will consume a lot of our time. And the package may be out in year three at the fastest. And it could be even dragged into year four. But then, what's going to happen with the whole brain, right? So I think we always should have the whole brain to get this work done in our mind. I guess that should be the expectation from NIH as well, right? You don't want this drag. No, that's the expectation, right? Maybe one more year, but we really need to have a clear roadmap in terms of how to get this done in five years.

JOHN NGAI: Yeah, so I appreciate that, Hongkui. So I have a couple of thoughts on this, the so-called basal ganglia mini-atlas. And when I saw the term mini-atlas on Ed's slide, I kind of both shuddered and chuckled. So some of you may-- some of you here know from firsthand experience that, when we started the BICCN back in - what was it - 2017, it was kind of decided that the group should just focus on one little part of the brain just to make sure we could all get our stories straight. And that was the so-called Primary Motor Cortex Mini-Atlas Project. And that took, what, four years, and it was hardly mini. So I appreciate that comment.

At the same time, I think that, as the project progresses, what we really do need to see is to see some actionable data along the way, not just to wait until the very end to see it. And I think that was part of the rationale behind, I think, a very good choice to look at the basal ganglia because it is defined. It has a lot of great biology going on, but it also has a lot of important disease relevance. And in the companion transformative project, the connectomics project, they're also focusing on basal ganglia. So we're hoping this will go together. But it's a very good point about, "We don't want to get too distracted by the sub-project," but we have faith in you, folks, that you're going to continue to break Moore's Law and somehow be able to pull this off along the way. I mean, the acceleration is kind of-- it's easy for me to say the acceleration is almost a given or an expectation. But we do have to be mindful that, yes, the goal here is to provide a complete atlas of the entire human brain and NHP brains, but we do want to obtain actionable information along the way.

This was actually a concern that was raised when we were standing up or deciding how to stand up a connectomics project, the connectivity project, which was we don't want to wait 5 or 10 years to get one brain, and we don't want to just be developing methods for the sake of developing methods, but we wanted to be able to get some actionable biology along the way. So hopefully, we can find a balance. We have a lot of faith in the folks involved in the project. So we'll just have to keep a close eye on that. Arnold?

ARNOLD KRIEGSTEIN: Yes, just a quick question. I know we've discussed the topic of roadmap, and I was just curious about the NIH's vision for the future. I know there are additional RFAs, part of the BRAIN Initiative that are being funded even now. And they're overlapping, but not entirely overlapping with the current cohort of grants and individuals. So what do you see about this community evolving down the road? And how is it going to change, and how long? And what are the goals the NIH has in mind?

JOHN NGAI: That's a good question. I mean, we're, right now, taking stock of how far we've come. 2024 marks the 10th year of BRAIN funding projects. And I hope we'll all agree that there's been great progress in a variety of different efforts. And we're taking stock of that right now to see what opportunities can we really pursue in the future. Literally, what's the next new thing? What's the next big thing? Building on the progress here. So that's all kind of a work in progress, but something we are very, very, very intent on sorting out. But in the meantime, there's just a ton of work to do here. Right? I mean, the investment in BICAN is topping $100 million a year right now as we get through the next year. And so we just want to make sure that we can manage this huge project. There's a lot riding on this, but we have every expectation we're going to achieve the goals. But then it's a fair question on what's next. And this is something that we will be engaging on in the next year or two. So I don't know if I came close to answering your question.

ARNOLD KRIEGSTEIN: Well, sorry, that's the beginning. And we'll have to see how it evolves. Yeah.

JOHN NGAI: Yeah. I mean, it's not so much an iterative process, but it's a generative process. Right? I mean, we were able to stand up BICAN because of the success of BICCN. I mean, as many of you know, I was a BICCN investigator back when I was a, quote-unquote, "civilian." And that was based on, I think, a great gamble by Andrea and others at NIMH and BRAIN that, hm, maybe there's the technology here that's going to make this thing happen. So it's going to be generative. I think, not to sound corny, but it's really up to the folks working the projects to really steer where the future may be. Our role is to kind of read the field and try to maybe anticipate a bit and kind of help people get there.

ED LEIN: Maybe I can pose a question. I think we've heard a lot about the different disease consortia and how they could use, fruitfully, the outputs of this. But what's a good mechanism for that? So we could put the data out and make some tools for people to access it. That's a little of a hands-off version of it. But whenever we bring up one consortium, another one gets mentioned that we should be talking to, and it gets a little challenging to try to coordinate such things. What do people think are good mechanisms for trying to have broad outreach to various other consortia that are relevant or would benefit from these data? Keri, I see you have your hand up. You're muted.

KERI MARTINOWICH: Sorry. I was actually going to ask a question about that before, is whether BICAN was actually proposing, as a next step, to do disease-oriented on diseased brains. And I guess maybe, I mean, that could be an option, that that's a next step, and they do that. But I guess joint consortia, some kind of initiative that's between some of the existing consortia where people have already worked with some of those large cohorts, that maybe that would be an option, is to have some type of next step that is BICCN/PsychENCODE or BICCN/whatever that are joint consortia to do those things.

Because there are a lot of people who already have worked with those cohorts in other types of-- first with bulk data and then with the single cell data. But I guess that had been my question, is, "Is that the idea of the next step of BICAN, is to actually do the diseased brains themselves?" And I guess the other question that I had there is, "If so, is there a priority list for a disease?" Because I think some of these neuropsychiatric and neurodegenerative diseases lend better or worse to-- whether you're looking at cause versus consequence, which is always kind of the elephant in the room with these case-control studies.

JOHN NGAI: Yeah, so the short answer is no - and I hate using the word no - for the reason that-- I'm going to qualify this a bit, don't worry. The short answer is no because BRAIN is really trying to work in this disease-agnostic space where we develop tools, resources that will benefit the missions of all the NIH ICs studying the nervous system and beyond, clearly. But the more nuanced answer is that we-- I mean, the purpose is to help support this disease-focused research. And there are other consortia around the world, as you know, including those that are funded by NIH. We heard about some of them here. And in fact, as we heard again here that some of the-- what's in common are the PIs, the people actually working on the projects.

So we are really working hard behind the scenes to better coordinate these consortia around NIH. And it's a bit of a lift. BICAN, it's going on two years old, and this is going to be one of our challenges. And there's a lot of work going on behind that. The sister program to BRAIN at NIH is Blueprint, NIH Blueprint for neuroscience. BRAIN encompasses 10 institutes and centers, and I think Blueprint's about 14. And here, there's a really great opportunity for us to interact with these other ICs, their missions, their consortia to really leverage the resources. And again, it's not going to be just a one-way street. I mean, the huge benefit to BRAIN is going to be the data that comes in from these other disease-focused consortia, so basically, a two-way interaction. So yeah, we're thinking about that, and there could be an opportunity for joint funding, but BRAIN itself is not going to take on disease-focused research, but certainly, we're more than interested in making sure we can enable that. And there are different ways of doing that. Lydia?

LYDIA NG: Just like how Ed presented all the joint milestone and that focus on basal ganglia, there's nothing like a focused project to bring it together. So is there a one- or two-disease consortium that's highly related to the basal ganglia that we can make a concrete project from, so really test that two-way street? Because if we talk about disease consortia in general, we're not going to serve anybody. It will be too generic - either too much done and nobody uses it - or too shallow. Well, if we target one or two of these, then we can actually establish those two-way street. And basal ganglia seems like the focus area. So I think getting that target and picking those consortia will be very helpful if that's the step that we want to take. And then you can build a very concrete roadmap against that.

JOHN NGAI: Yeah, thanks, Lydia. Great points. Bing Xing?

BING XING HUO: Hi. I just want to call out this collaboration between the NeMO team and Broad. This team has been a part of CUBIE for BICAN, and also happened to be the Data Coordination Center for the opioid response in BRAIN in the context of HIV, the SCORCH consortium. It probably was not intended. The NIH didn't intend to fund the same group for two data coordination centers, but it just happens that way. I would say having these type of data coordination centers talking to each other in a more-- between BICAN and other BRAIN Initiative projects with these disease-focused BRAIN projects probably would be a path for the different standards to be transferred, the pipelines to be shared, and outreach, and even more resource sharing. I understand we are-- for example, we are preparing a molecular resource workshop for BICAN, which part of it would also be advertised to the SCORCH consortium where they can also leverage something similar.

JOHN NGAI: Yeah, certainly, I don't think the intent should be for BICAN to dictate but really, to function as part of the larger ecosystem. And I'm glad you brought up SCORCH because SCORCH was launched by NIDA, inspired by what was going on in the BICCN and now BICAN. This is kind of a direct-- it was kind of a direct result of what we're doing here in the BRAIN Initiative, the Cell Census Project. So there is stuff going on, but it's also good to realize we're-- even given all the progress and the amazing studies that have been done in a short amount of time-- it has been a bit of a short amount of time, and it's going to take time for the other ICs to spool up these projects, especially the larger projects, right? They don't happen on a dime. And frankly, it took us over two years of planning to launch BICAN, right? And that was on a project that was already rolling at a pretty good clip through BICCN. So Andrea had her hand up too. I don't know what happened to Andrea, but Yong?

YONG YAO: Yeah, I just want to say, although BRAIN Initiative will not fund directly the disease research, but actually, all the program officers—

HONGKUI ZENG: Yong, I can't hear you well. I don't know about others.

JOHN NGAI: A little faint.

YONG YAO: Okay. Is it better now?

JOHN NGAI: Yeah, yeah, good, good, good.

YONG YAO: Okay. Yeah, I just want to mention, so the BRAIN Initiative staff are actually located in ICs, in institutes, so working closely with many colleagues, different disease-related programs. So yeah, SCORCH is one example, and also for NIMH PsychENCODE and SSPsyGene recently. And actually, one slide was shown, in yesterday's discussion, on how to use NeMO and the Terra common data processing pipeline. So all these are happening in parallel. So I think there is a good chance for NIH program officers to collaborate and coordinate and for further collaboration between BICAN and other disease research programs and consortia.

ANDREA BECKEL-MITCHENER: John, I put my hand down, but I will go ahead just really quickly.

JOHN NGAI: Go ahead. Please. Yeah, go ahead.

ANDREA BECKEL-MITCHENER: So both you and Yong have covered some of the items that I was also going to mention. I think really emphasizing that a lot of NIH staff are on this call and participate in BICAN in a variety of ways. So we have really great input from the ICs and trying to understand their priorities in this space as well, as well as them learning from what BICAN is doing and the data sets that are coming out of this particular effort. The other thing I'll mention, because I'll be bad cop, John, to your good cop, is that it looks like we're entering a much more challenging funding period. Many of you have seen the fights on Capitol Hill here in the United States, and this could impact our ability to support larger projects. That said, I think John would agree that this is a priority area for the BRAIN Initiative. And I think that the 10 institutes that participate in BRAIN have also seen the value of this kind of an effort. So we have to weigh the output of these projects, the impact of these projects against what looks like it's going to be a much more difficult time coming up in the next few years.

JOHN NGAI: Thanks, Andrea. GQ?

GQ ZHANG: Yeah, I want to chime in with regard to disease-related efforts. Actually, BICAN is my first entry into BRAIN Initiative Disease Neutral, but my prior experience is with the National Sleep Research Consortium, as well as the Epilepsy Research Consortium. And my group were serving as a data coordinating center for those projects. And BICAN implicitly actually benefited from those prior experiences. That's why I was confident we can contribute. But this workshop is a great start, but maybe there's opportunities and questions. How do we bring back the message? This is soon-to-be ready resources that can be leveraged by different disease communities. How do we communicate the progress and achievement and available resources? Because talking to other investigators, they always would like to have a healthy control as reference in any of the data settings. So if we have that, that's a great gift to the entire research community, especially disease-focused one because they don't have normal healthy controls. Usually, they don't.

JOHN NGAI: Yeah, it's a great point. And how do we get people on board, so to speak, whether it's the investigators or NIH leadership or the public or members of Congress that, as Andrea said, are responsible for keeping the engine running? And as I said, it's generative. I mean, you better believe I use some of the examples of what was presented here to advocate on behalf of the initiative to keep this thing running, right? So it's not going to happen overnight. It's not going to happen just because a group of us here say it should happen. And this is really helpful because we certainly have the will, and now, with the science being done, we will do our darnedest to make sure there's a way, right? But I mean, if you just look at how far the field has come in the last couple of years, I mean, high-throughput DNA sequencing and following on single-cell sequencing and then droplet-based sequencing has truly revolutionized the way we can do biology today. So my kind of mantra is look for the generative opportunities to get there. I mean, these are big lifts. I mean, the thought of sequencing-- not sequencing. The thought of characterizing every cell type in the human brain, I think, five years ago was aspirational, but it's no longer aspirational. It's just totally doable. So I saw a hand pop up and pop down. Sorry, I'm kind of blabbing on here. Yeah, Jeremy. Jeremy Miller, yeah.

JEREMY MILLER: Yeah, I want to just note that we're coming towards the end, so I won't take too long. But I mean, I think a lot of the stuff that we've already talked about this week will help. We want for people to use-- whatever comes out of BICAN, it has to be something that we deliver in a good, standard format using really easy-to-use tools. We need to have outreach to all these communities. We need to directly collaborate with them so that they know how to use them correctly and to make sure that we're providing things that they want. And I think one challenge that has kind of indirectly been touched upon is that we'll need to be able to have flexible taxonomies so that when there are these novel types that show up in diseases-- or they'll be able to either recognize that it's missing from our healthy human brain taxonomies or have a way of kind of ingesting that into ours, depending upon-- so I think it's a multi-pronged approach for what we'll need to actually get people in different disease consortia to use our work. And I think we've touched on, at various points, kind of all the pieces of it.

JOHN NGAI: Great, thanks, Jeremy. Ed, and I'd like-- Ed, I don't know if you have a comment about all this stuff, but I'd like to toss it to you to kind of pose kind of a very forward-looking [question?] for the group.

ED LEIN: That was where I was going to go, John.

JOHN NGAI: Okay, thanks.

ED LEIN: Raise my hand to get the floor. So you mentioned the word generative. So that triggers maybe our last topic here. We've heard a lot from people about how transformative gen AI is going to be on the field, how it's changed the game for us. But it's not quite clear, I think, to many of us where exactly or how that could happen or how we should try to be at the front of that in BICAN. So I wonder if some people are willing to bite on this topic. How can that be applied fruitfully to us to help to take advantage of all the things that we're doing?

JEREMY MILLER: Yeah, let me just start off with that, Ed. I mean, I think one question becomes, do we have the depth of data to really apply these? I mean, at some level, yes, you can build language models that answer questions, potentially see, uncover relationships that you may or may not have described that are there. But I think that it's unknown whether or not, I think, we have the depth of data still for something truly profound there. I think that that's something we need to look at. And of course, you have to try, but it's a question still.

JOHN NGAI: So if I can just address that real quickly, I mean, so there's certainly the application or the invention of these approaches to understand the data we have, right? And we saw that a little bit. I would like to think that-- Hongkui, how many cells were analyzed in the mouse project? What, 35 million, 32 million? Yeah, dude, if you can't do it with 32 million cells, do something with 32 million cells, I don't know what you can do. That's my flippant answer. But then the other question is - we talked a lot about these ontologies, these taxonomies, this whole messiness of multiple names, multiple nomenclatures - how do you align all that stuff? How do you bring in new phenotypes onto potentially new or not new cells? And maybe, so my question for the group is, maybe, can we use these generative models to kind of get our heads around it? I mean, it's going to be really hard for a human or even a set of humans to annotate or to keep track of that amount of data. We're really going to need to come up with a different way of going about it. I mean, there's biological principles that might be unearthed. Just from an organizational point of view, is there a way to kind of leverage that? Shoaib?

SHOAIB MUFTI: Yeah, thanks. Just a few thoughts on the question you posed around how we use AI. So I think there are broad categories in which we can think about AI. One category that comes to mind is enhancing the productivity of our scientists, right? And in that category, you can think about things like natural language search and all that. I don't have to write Python code or whatever the code. I can just write in English and pose natural language queries to this. The other area which we are also looking at very closely is improving the navigation of our tools. Right now, if you go to your atlases and all that, you are working through drop-down menus. You don't know what it is. You can think about a world where you can just have chat boxes and you just write your questions. And the thing John touched on is this whole annotation side of things, right? These are cumbersome things, these are big manual things, and we can offload it. So that's the kind of, I think, short-term low-hanging fruit. We can definitely improve productivity. Then the more interesting areas come into play, right? Can we really discover new things out of the data? You can do multi-modality. So you can think about building a human-brain large language model, right? So you can train it on the current mouse data or synthetic data without having all the data there. So those are more a research area, but really promising and can possibly revolutionize the field. But I think there's a lot of low-hanging fruit right now around productivity, and we can really bring those tools there and solve problems, and we can apply them.

JOHN NGAI: Yong?

YONG YAO: I actually want to yield to Joe because he was invited for some CZI initiative.

JOSEPH ECKER: Yeah, just to briefly summarize - I put a link in the chat - one of the big initiatives that they have been pushing at CZI is to build a sort of world-leading GPU cluster that will be available to investigators outside CZI. And they're organizing a meeting of real AI experts from industry, biologists like myself, and bioinformaticists to meet, and I presume the first thing that they're going to want to talk about analyzing is all of HCA. How much data in HCA can you use to drive new generative models? And so that meeting will happen in March. And I think there'll be something that comes out of it that will be able to help BRAIN leadership. But we're not alone, I guess, is the message. It doesn't have to come from, necessarily, within our initiative, but we are producing-- it's not all of the internet, John. So if that's what's trained on the GPT4, and so it's less. But I think one of the questions I have for the AI experts, is to gather up how much data that we've produced as a whole and ask, is that enough to drive new models? And who would be willing to spend time on that?

JEREMY MILLER: That's part of my question, Joe. I mean, I'm not trying to be negative. I'm just trying to-- it's an interesting thought though. I mean, just one thing-- sorry, Yong, one thing. I mean, John's point was a little bit like, "So we've clustered. We've applied somewhat basic methods to the identification of these cell types by fairly-- basic kind of clustering type of things." But as John is alluding to, if one went back to actually genome sequences with really deep kind of generative models, one might see something different. And we should do that, right? These are exciting. I'm really excited by that resource that you pointed to, Joe. That's interesting. So, yeah.

YONG YAO: Yeah. I think a huge challenge actually is to engage community experts to curate. ChatGPT or a large language model can probably accelerate a lot of annotation efforts for the 5,000 cell clusters. But still, I think we probably need professional expert curation to make sure this knowledge is correct. So I think there's huge opportunities there. And I saw Ming just posted his RFA for data integration. I think this could be something for you to think about for the future. And probably, we should also think about some other new opportunity to stimulate this kind of effort.

JOHN NGAI: Yeah, yeah, I appreciate all these comments. So thanks, Joe, for pointing out the CZI efforts. I mean, again, we're part of a larger ecosystem. We don't want to just build a hardened silo here. Certainly, what we're looking at in just the BICAN group is-- we have a lot of pretty weighty problems to deal with that we need to figure out how to approach, and in a future-facing way. But this is just one piece of the kinds of projects that BRAIN is supporting. We have this whole connectomics project, which is going to be, I think, orders of magnitude more complex, not to mention now as we start getting into functional data, physiology, and what have you. So one of the big challenges for BRAIN, I think, looking forward for the next 5 or 10 years is how actually to integrate these different data modalities so we can have a more truly comprehensive picture of not just the cell types but the circuits that are driving behavior. And of course, this reminds me we just stepped up this Brain Behavior Quantification and Synchronization project, which is aiming to collect better high-precision data on behavioral outputs together with circuit activity analysis. So this is kind of almost-- BICAN, as big as it is, is almost the tip of the iceberg in terms of what we need to deal with in terms of data. So this is part of what stimulated the question about, what are going to be the opportunities for using these generative models, and how can we best apply them? I think we're getting on time here. Are there other comments or thoughts? Again, this has all been very, very helpful. Oh, sorry, Shoaib, I see your hand up still again. Last comment?

SHOAIB MUFTI: Yeah, just a very quick comment, right, along this line of generative AI. So one of the things which is with the thought, right, there's a lot of interesting work going on in the industry right now in the space. So the question is, what are the opportunity to build some bridges to the work that's going on in the industry, whether it's the Microsofts and Googles or Amazons of the world, right? So there's a lot of areas they're working on. Is there a way that we can tap into that ecosystem for what we're trying to do for BICAN and what kind of opportunity, what kind of model that would be if we do something like that? So I don't have a suggestion, but I think the way I'm seeing it, there's an opportunity there.

JOHN NGAI: No, absolutely. And in fact, we do have some industry partners in some of these big projects as well.

MIKE HAWRYLYCZ: Just one other thing. So to John and Yong, I mean, I think that we probably will have to ask you to help to indicate the next direction here with where you want to go with this, I think. Do we want to do a report or a paper or whatever? And how do we want to wrap this up, I guess?

JOHN NGAI: Yeah, I'm going to defer to Yong, how he would like to proceed. I'm certainly enthusiastic about having a thoughtful perspective piece written about this. I mean beyond just a meeting summary. We do try to-- we do use these workshops to help us understand where the gaps and opportunities and challenges are. Our approach isn't to dictate what the field does. It's really just to enable you folks to kind of reach these goals, to identify goals and help you to reach them. So Yong will be in touch.

YONG YAO: Yes, I'm glad if you want to volunteer.

MIKE HAWRYLYCZ: Yeah, I mean, I'd be happy to survey the participants to see who would like to participate in putting something like this together. I think it would be kind of an exciting thing. Given all the conversations and all the discussions, I think it would be-- I was sort of envisioning a white paper and then potentially a publishable perspective paper that could have a subset of the key ideas. I mean, the advantage is that people would get behind it then. I think, without that, it becomes just a, "Oh, that workshop."

JOHN NGAI: Yeah, I mean, so a white paper would be very, very useful for all of us. A perspective paper I think is great because it gets it out there for the broader scientific community to see, as well as to engage and get buy-in. So let's talk about that. I mean, to the extent that people can generate and sustain enthusiasm for doing this, this is great. We've had other workshops where there was initial enthusiasm and people kind of petered out as time went on. It was kind of like an exponential decay. But I mean, I'm all for that. And again, Yong will be in touch with you folks to see about what we can do to help you pull that together. Okay.

MIKE HAWRYLYCZ: Thanks, Yong. Thanks, Yong, for all your-- I'd like to thank Yong for all his work in suggesting this and motivating this workshop and putting it together, which has been terrific, so.

YONG YAO: Yeah. Thank you for staying to the last minute of the workshop, three-day workshop. It's really long, but I think we got a lot of food for thought. And so we'll work after this workshop. Yeah, stay in touch.

JOHN NGAI: Yeah, thank you, everybody, for participating, for contributing. A lot of great ideas, a lot of great discussion. I know a lot of work went into this. Thanks, of course, to Yong for putting this all together. Laura Reyes did a-- Laura Reyes did a huge amount of work together with Yong, not just putting this together but actually making sure everything worked. And I think that really worked quite well. So [crosstalk].

MIKE HAWRYLYCZ: There was a big organizing committee too.

JOHN NGAI: Yeah, and the large organizing committee. So thank you all. We will be having workshops in this general area coming up. So just stay tuned. This is all under development.