
The BRAIN Initiative® Cell Atlas Workshop Day 1: From Single-Cell Genomics to Brain Function and Disorders—Data Integration and Annotation


YONG YAO: So I'd like to give a short introduction about the BRAIN Initiative Cell Atlas Network, or BICAN. BICAN was founded by the NIH BRAIN Initiative with a total of about one hundred million dollars per year over five years, from 2022 through 2027, succeeding the NIH BRAIN Initiative Cell Census Network, or BICCN, which ran from 2017 through 2022. So BICAN is made up of multiple projects and institutions across the world with complementary capabilities. Four UM1 centers have the primary responsibility of generating comprehensive, high-resolution brain cell atlas data in humans and nonhuman primates. A group of U01 laboratories complements and enhances the UM1 centers by creating comprehensive, high-resolution brain cell atlases in the mouse and across developmental stages. Four U24 projects coordinate BICAN operations and establish common data analysis pipelines and the knowledge base of brain cell types. The R24 BRAIN Initiative data archives serve as BRAIN data repositories and provide public access to all data and analyses generated by BICAN. BICAN is also collaborating with six NIH NeuroBioBank sites to establish a diverse postmortem human brain collection, and with two sequencing centers for centralized sequencing contract services. A team of NIH staff participates in BICAN activities, providing oversight and program guidance.

In 2021, BICCN, in an international collaborative effort by more than 300 scientists at more than 45 institutions across three continents, published the first comprehensive cell census of the motor cortex based on analysis of the cells' molecular, anatomical, and physiological properties. The effort was team science on a scale unprecedented in the neurosciences, and the work has set the stage for comprehensive human brain cell analysis. The overarching goal of the BRAIN Initiative Cell Census and Atlas Network is to build reference brain cell atlases that will be widely used throughout the research community, providing a molecular and anatomical foundational framework for the study of brain function and disorders.

In October last year, a group of BICCN papers was published in Science and Science Advances on human and non-human primate cell atlases. Last December, a group of BICCN papers was published in Nature on the mouse brain cell atlas. This collaborative effort has provided unprecedented detail in the cell census and atlases of a whole mammalian brain, with about 7 million single-cell RNA-seq profiles, 6 million single-nucleus RNA-seq profiles, about 15 million MERFISH single-cell-resolution spatial transcriptomic profiles, and several million single-nucleus DNA methylation profiles. As a result, more than 5,000 cell types can be defined molecularly, based on gene expression, in a hierarchical manner. The diverse transcriptomically defined brain cell types can now be mapped to individual anatomical regions in a common coordinate framework, providing a comprehensive, anatomically annotated cell type map of the brain. Furthermore, the epigenomic profiling of single brain cells sheds new light on how the expression of individual genes is regulated in different cell types. Building on these exciting results, BICAN will now expand the horizon to the whole human brain across the lifespan, with more than 100 million single-cell genomic profiles and about one billion cells of single-cell genomic and spatial genomic profiling, moving toward linking single-cell genomic typing to brain function and diseases.

In addition, the BRAIN Initiative has founded a new center to study single-cell protein expression signatures in situ, the chemoarchitectural variation between individuals, and cell morphology, synaptic structures, the extracellular matrix, the cell microenvironment, etc., using high-resolution light sheet imaging. Given the scale of data generated by the brain cell census and cell atlas efforts, about 100 petabytes of single-cell genomics and imaging data, the BICAN program and investigators have put effort into defining the expected products over the whole five-year period, covering five categories: data published via the BRAIN data archives as raw and increasingly curated and annotated datasets; standards and SOPs; foundational references, including common coordinate frameworks; publications; and software tools. So we'll have more discussion in this workshop on how to enhance the potential BICAN products. The workshop purposes, as laid out here, include fostering the development of data standards for the integration and annotation of single-cell genomic data; systematizing and automating the process from data to information to knowledge, and developing pipelines where feasible; sharing new insights on brain cell functional studies; developing strategies with the brain disease research community to maximally leverage BICCN and BICAN data; and finally, developing a community map for the analysis and annotation of single-cell data. So this is a three-day workshop with a highly compact schedule, as laid out here, and I hope it will be successful. In particular, I want to thank the workshop planning group for working hard to make this workshop possible.

ANDREA BECKEL-MITCHENER: Thank you, that is a great way to start us off. We just want to have a few remarks from John Ngai, the director of the BRAIN Initiative. I'll just say that John joined NIH in 2020 to run the BRAIN Initiative, including this set of programs around cell census and cell atlasing. He was an original grantee in the cell census pilot work that started in 2014, and he liked it so much he couldn't stay away. He actually joined us here on the dark side at NIH, so we're of course thrilled to have John. I'm going to pass it over to you, John, for some remarks.

JOHN NGAI: Thanks, Andrea. I want to join Andrea in welcoming everybody here. We have, wow, 365 and counting folks on the call right now, which is just terrific. Great. Thanks to Yong and Laura and the entire planning committee for putting together what I'm sure is going to be a really exciting workshop over the next three days. Just to give you a little background, the vision for the NIH BRAIN Initiative as a whole was initially laid out in 2014, when the enterprise started. This is laid out in the first strategic plan, the BRAIN 2025 report, which was authored by a group commissioned by then-NIH director Francis Collins as a working group of the advisory council to the director. That plan laid out our entire vision for what we're doing even today, literally 10 years later, and it was updated in 2019 by what we call the BRAIN 2.0 working group.

Now, the BRAIN 2.0 working group report acknowledged the pace of discovery both within neuroscience and in adjacent fields, and said this was really a great time to think about investing significant resources to take advantage of these opportunities, and to put those resources into large projects that really do have the potential to transform the field of neuroscience. We use words like transform, innovation, and revolution a lot, but I think these aren't just words of conceit. What you'll see from this week's workshop is that really great progress has been made, really disruptive progress, inspired by the BRAIN 2.0 report, which again came out in 2019, just before I officially joined.

In 2021 we launched what we called the BRAIN 2.0 transformative projects. Again, big, aspirational goals, but I think, as we'll see, we really are on pace to achieving them. So the first one today that I'll mention is the BRAIN Initiative Cell Atlas Network. This is a centerpiece of this week's workshop and focuses on developing comprehensive human and non-human brain cell atlases using many different approaches in an integrated fashion. Again, this is the main topic of this week's workshop. Now, as Yong summarized for us just now, BICAN builds on the remarkable success of the BRAIN Initiative Cell Census Network, or BICCN, which recently published draft cell atlases of non-human primate and human brains as well as a comprehensive cell atlas of the adult mouse brain. We're going to hear a lot about that a little bit later, right after I finish with my remarks. The second big project is the brain connectivity across scales, or BRAIN CONNECTS, project. This recently launched, and the goal here is to develop the tools and technologies required to generate connectivity maps of whole mouse brains at synapse resolution and of larger brains at projection-level resolution, and that's just getting started. And then the third project is what we call the Armamentarium, which is a big fancy word for a big toolkit for precision brain cell access, and the goal is to develop, validate, and disseminate tools for accessing brain cell types across multiple species. Now, the first two projects will provide the parts list and the wiring diagrams, or the ground-truth information, for neural circuits, and these and the resources from the Armamentarium project will allow researchers to test hypotheses about the roles of specific cells and circuits in generating behavior in models of both health and disease.

So ultimately, we expect these three projects together to inform new precision circuit therapies, including precision gene therapies, for human brain disorders. So big aspirations, big goals, big expectations, but I think, based on the experience of the cell census project, this is actually all doable if we put our minds together. And the key word is together. So for these projects to truly transform the field, it's really critical that the information and resources generated are coordinated and disseminated for access by others in the field. As Yong described, the BRAIN cell census project represents a triumph of team science at a scale that's really unprecedented in the neurosciences. It enabled discoveries of amazing proportions and the development of truly revolutionary tools to make those discoveries, and frankly it would not have happened, certainly not on this timescale, if it were not for teams figuring out how to work together. So in many ways, BICCN and BICAN serve as a model for organizing and conducting these large resource-generating projects that ultimately will accelerate discovery in the hypothesis-driven projects that the BRAIN Initiative, other institutes at NIH, and other foundations across the world are now actively supporting.

So over the next three days we'll be hearing about some great science that will lay the foundation for critically important discussions about how best to shepherd and steward the ensuing resources so that they will indeed transform the field, impact disease-focused research, and eventually revolutionize the search for new cures. So I would urge us all to keep top of mind that our work is not done until these resources are broadly disseminated and accessible to all. So again, welcome to all, and thanks to Yong and Laura and the program committee for putting together this week's workshop. And I will hand it over to the next person to continue and dive into the science.

ANDREA BECKEL-MITCHENER: I do want to introduce Hongkui Zeng as the Executive Vice President and Director of the Allen Institute for Brain Science in Seattle, Washington. She was one of the first awardees also in the BRAIN pilot program and has been a central leader on this effort since 2014. So Hongkui, I’m going to set a timer for 15 minutes and just give you a heads up, that will mean 5 minutes left. We do want to have a couple of minutes for questions if we can and I’ll be moderating the Q&A and selecting questions for you at the end of your talk.

HONGKUI ZENG: Ok, thank you, Andrea. I’ll share my screen now. Thank you very much for giving me the 
opportunity to present today, and thanks, Andrea, for the introduction. I've updated my talk title to Cell Type Diversity and Organization Across the Whole Mouse Brain. The ultimate goal of the BRAIN Initiative cell census effort, from BICCN to BICAN now, is to bridge cell types and brain function. An effective way of doing this is by building a brain-wide cell type map across four dimensions, as shown here: across species, across modalities, across space, covering the whole brain and the entire circuit network of the brain, and across time, from development to aging. Our starting point in BICCN was the mouse, using a set of molecular approaches, from single-cell transcriptomics to spatial transcriptomics, to create a molecularly defined initial framework of cell types across the brain. And as Yong mentioned earlier, we have now been able to do that for the entire mouse brain. From here, we can extend our investigation into the other dimensions: across species, across modalities, and across time. I would say that there are still great challenges ahead of us, in particular, for example, in understanding cell type diversity and definitions across the different modalities. With this kind of effort, we hope that we'll be able to gain a deep understanding of cell types: what they are, what their cellular properties are, how they change under different conditions, and what their conservation and divergence look like across the different species.

So again, as Yong also mentioned earlier, the first major milestone we achieved in BICCN was the creation of a multimodal cell type atlas of one single brain region, the primary motor cortex, across several different mammalian species, from human to non-human primate to mouse. This work was published in Nature in 2021 and has allowed us to obtain several organizing principles of cell types: hierarchical organization of the cell types, multimodal correspondence or integration of cellular properties, discrete and continuous variations, and cross-species conservation and divergence. And now, just two years from that initial work, we have achieved another milestone goal, which was to obtain a high-resolution cell type atlas across the entire mouse brain, as shown in this collection of publications in Nature just about a month ago. In this collective work, a group of researchers and labs within BICCN achieved comprehensive, whole-brain coverage using single-cell transcriptomic, single-cell epigenomic, and spatial transcriptomic approaches.

In my talk today, I will mainly talk about the transcriptomically defined cell type atlases, focusing on the work that we have done at the Allen Institute, but I will also mention the work from Xiaowei Zhuang's lab and Evan Macosko's lab, because our findings are very consistent with each other, and I believe that Joe Ecker and Bing Ren will talk about their work on the epigenomics front on Wednesday. So through the combination of two large-scale whole-brain datasets, single-cell RNA-seq with 7 million cells profiled and a spatial transcriptomic dataset across the whole brain using the MERFISH platform with 4 million cells profiled, we have obtained a cell type taxonomy atlas of over 5,300 clusters. The taxonomy is organized in a hierarchical manner, with the clusters grouped into supertypes, subclasses, and classes. Remarkably, in this whole-brain taxonomy we found that neuronal diversity dominates in the brain, with more than 5,200 of the clusters being neuronal, while the non-neuronal clusters remain a minority even though we profiled a large number of non-neuronal cells. The whole-brain taxonomy can also be visualized in this UMAP form, colored by different metadata information. What we can see from this presentation is that the cell types are defined not just by their molecular identity, such as the neurotransmitters they express, but also, and very much driven, by their regional specificity. In fact, the 34 classes that we have defined, especially the neuronal classes, are really a combination of their molecular identities and their regional specificity.
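To make the idea of a hierarchical taxonomy concrete, here is a minimal sketch of how cluster labels roll up through supertype and subclass to class. The cluster and label names below are hypothetical placeholders, not the actual BICCN taxonomy:

```python
# Illustrative sketch of a hierarchical cell type taxonomy: each fine-grained
# cluster rolls up through supertype and subclass to a top-level class.
# All identifiers below are hypothetical placeholders.

taxonomy = {
    # cluster_id: (supertype, subclass, class)
    "0001_L23_IT_A": ("L2/3 IT supertype 1", "L2/3 IT", "Glutamatergic"),
    "0002_L23_IT_B": ("L2/3 IT supertype 1", "L2/3 IT", "Glutamatergic"),
    "0950_Pvalb_X":  ("Pvalb supertype 3",   "Pvalb",   "GABAergic"),
}

def roll_up(cluster_id, level):
    """Map a cluster to its label at a coarser level of the hierarchy."""
    supertype, subclass, cls = taxonomy[cluster_id]
    return {"supertype": supertype, "subclass": subclass, "class": cls}[level]

print(roll_up("0001_L23_IT_A", "class"))  # Glutamatergic
```

In the real atlas there are over 5,300 clusters and 34 classes, but the roll-up logic is the same shape.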

Furthermore, by registering our spatial transcriptomic MERFISH dataset to the CCF and performing quantitative analysis of the spatial distribution of all the cell types in the brain, we find a high degree of correspondence between the molecular, transcriptomic identities and the spatial specificity, as shown in this graph, here only at the class and subclass levels. The other two studies, from Xiaowei's and Evan's labs, also showed a similarly high correspondence between transcriptomic identities and spatial specificity.

Furthermore, we also identified an interesting dichotomy in cell type characteristics in different brain regions. In the ventral part of the brain, mainly the hypothalamus, midbrain, and hindbrain, we find a much larger number of clusters, or cell types. Those clusters tend to be a lot smaller, containing fewer cells, and are more similar to each other based on the number of differentially expressed genes between each pair of clusters. The clusters in those regions are also spatially more confined to restricted locations. In the dorsal part of the brain, on the other hand, especially in the cerebral cortex or pallium structures, the thalamus, and the cerebellum, we find fewer clusters or cell types. Those clusters tend to be larger, containing more cells, they are highly divergent based on the number of DE genes between each pair of clusters, and they are also more widely distributed. We hypothesize that this dichotomy between the dorsal and ventral parts of the brain probably originated in evolution because of the differential functions of those two parts of the brain. The ventral part of the brain, the hypothalamus, midbrain, and hindbrain, mainly carries out survival functions like feeding, reproduction, metabolism, and homeostatic regulation, things that are essential for the survival of the animal, so it may be subject to more evolutionary constraint. So the cell types within those regions have not changed much, even though they are older, more ancient structures. The dorsal part of the brain, especially the cortex and thalamus, for example, mainly carries out adaptive functions that allow the animal to adapt to new environments and gain evolutionary advantages, and maybe that's why the underlying cell types have evolved and diversified much faster. So this is an interesting hypothesis that can be derived from this kind of unbiased analysis of large-scale, comprehensive datasets.

We also wanted to understand the distance relationships between different cell types, and we did this by examining gene expression correlation matrices using different kinds of genes. This is a correlation matrix generated from the 500 most significant transcription factor marker genes, showing the gene expression correlations. And here are additional matrices showing similar correlations using about 500 functional genes, about 800 adhesion molecules, and all 8,500 marker genes identified in our whole-brain transcriptomic taxonomy. The difference among those matrices is actually quite remarkable. It shows that the transcription factors, with just 500 genes, are able to distinguish the different cell types at different levels much better than the other families of genes, and even better than the entire set of 8,500 marker genes. So transcription factors are best able to distinguish the cell types at different hierarchical levels, not only at the cluster level but especially at higher levels, the class and subclass levels. In fact, because of this, we have used mainly transcription factors to define the hierarchical organization of the cell types across the entire brain.
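The analysis described above, comparing how well different gene families separate clusters, can be sketched roughly as follows. This is a toy illustration with random data, not the actual Allen Institute pipeline; the gene counts and panel selection are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a cluster-by-gene mean expression matrix:
# rows = cell type clusters, columns = genes. In the real analysis the rows
# would be the ~5,300 clusters and the columns a chosen gene family
# (e.g. ~500 transcription factors, ~800 adhesion molecules, etc.).
n_clusters, n_genes = 50, 500
expr = rng.poisson(5.0, size=(n_clusters, n_genes)).astype(float)

def cluster_correlation(expr_subset):
    """Pearson correlation between clusters over a subset of genes."""
    return np.corrcoef(expr_subset)  # (n_clusters x n_clusters)

tf_cols = rng.choice(n_genes, size=100, replace=False)  # hypothetical TF panel
corr_tf = cluster_correlation(expr[:, tf_cols])
corr_all = cluster_correlation(expr)

# Lower typical off-diagonal correlation means the gene family separates
# clusters better; comparing these summaries across families is the idea
# behind the matrices shown in the talk.
off_diag = ~np.eye(n_clusters, dtype=bool)
print(corr_tf[off_diag].mean(), corr_all[off_diag].mean())
```

With real data, the transcription factor panel would produce the sharpest (most block-diagonal) matrix, which is the observation motivating the TF-based hierarchy.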

Furthermore, we were also able to identify transcription factor gene modules, 52 of them shown here, that comprise a transcription factor code defining and specifying the 5,000-plus clusters. They have a hierarchical organization at the class, subclass, and supertype levels, suggesting that transcription factors are really a major gene family defining cell type identity.

Now, in the next couple of slides, I'll just quickly go over the different families, or neighborhoods, of cell types that we have subdivided. In order to further understand the extraordinary diversity of cell types in the brain, we defined seven major neighborhoods based on major brain structures, generated sub-embedded UMAPs, and examined the spatial distribution of the cell types within each neighborhood. Even as we dive deeper into specific structures of the brain, we continue to see this extraordinary diversity and specificity in all the different regions: pallium glutamatergic, subpallium GABAergic, and hypothalamus and extended amygdala areas shown in this slide; thalamus, midbrain, and hindbrain glutamatergic and GABAergic neuron types shown in this slide.

In addition to the neuronal types with their extraordinary diversity, we also observed very interesting heterogeneity and diversity in non-neuronal cell types. Even though they are much fewer than the neurons, the diversity is particularly pronounced in the astrocyte and ependymal cell classes. Here we are just using two slides to show the beautiful transcriptomic diversity, and the corresponding spatial specificity, of the astrocytes down to the cluster level, and of the ependymal cells down the hierarchy to the cluster level, showing the specific location and specific expression of individual ependymal cell types in different parts of the ventricles of the brain.

Without getting into much detail, we also revealed extraordinary diversity in the neurotransmitter content of the different neuron types across the brain. This slide shows all the cell types containing modulatory neurotransmitters: dopamine, acetylcholine, norepinephrine, serotonin, and histamine, showing not only the diversity among the modulatory neurotransmitters but also their co-expression or co-release patterns with glutamate and GABA. We can see all kinds of combinations in different cell types or clusters. This is also illustrated in Xiaowei's spatial transcriptomic MERFISH dataset, with very specific clusters expressing a specific set of neurotransmitters found in very specific locations of the brain, for all kinds of neurotransmitter types: dopaminergic, serotonergic, cholinergic, etc. Furthermore, Xiaowei's data, our data, and Evan's data also showed extensive and extraordinary heterogeneity and diversity of neuropeptide gene expression across the different cell types and across different parts of the brain. Some neuropeptides are highly specific to only one or two clusters in the brain, whereas others are widely expressed in many different kinds of clusters across the brain. Xiaowei's group and we at Allen have also been able to use computational approaches to impute the entire transcriptome from the single-cell RNA sequencing data into the MERFISH space, so that we can now spatially visualize the expression pattern of any gene in the transcriptome, even genes that were not profiled directly in a MERFISH experiment with the 500-gene panel or the 1,100-gene panel Xiaowei's group used.
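The imputation idea mentioned above, transferring whole-transcriptome information from scRNA-seq cells onto MERFISH cells via shared panel genes, can be sketched with a simple k-nearest-neighbor average. This is a toy illustration with random data and tiny gene counts, not the actual method used by either group:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: scRNA-seq cells carry a full transcriptome; MERFISH cells are
# measured only on a small shared gene panel. Sizes here are tiny stand-ins
# for the real ~30,000 genes vs. a few-hundred-gene MERFISH panel.
n_sc, n_sp = 200, 20          # scRNA-seq cells, spatial (MERFISH) cells
n_all, n_panel = 100, 10      # all genes, shared panel genes
sc_full = rng.poisson(4.0, size=(n_sc, n_all)).astype(float)
panel = np.arange(n_panel)    # indices of the shared panel genes
sp_panel = rng.poisson(4.0, size=(n_sp, n_panel)).astype(float)

def impute(sp_panel, sc_full, panel, k=15):
    """For each spatial cell, average the full transcriptomes of its k
    nearest scRNA-seq neighbors in the shared-panel gene space."""
    sc_panel = sc_full[:, panel]
    # pairwise Euclidean distances, shape (n_sp, n_sc)
    d = np.linalg.norm(sp_panel[:, None, :] - sc_panel[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest scRNA-seq cells
    return sc_full[nn].mean(axis=1)        # (n_sp, n_all) imputed profiles

imputed = impute(sp_panel, sc_full, panel)
print(imputed.shape)  # every spatial cell now has a value for every gene
```

Real pipelines work in an integrated latent space with normalization and batch correction, but the neighbor-averaging intuition is the same.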

ANDREA BECKEL-MITCHENER: Sorry, Hongkui, about four more minutes. Apologies.

HONGKUI ZENG: Ok, almost done. Here are some beautiful examples of imputed gene specificity in comparison with the original in situ hybridization expression patterns of the genes themselves, showing high degrees of correspondence. So here's my last slide, making the point of the high degree of consistency across the different data modalities that we have generated in this consortium. This t-SNE plot shows the more than 5,000 clusters identified in the single-nucleus RNA-seq-based cell type atlas generated in Evan Macosko's lab, and using a hierarchical mapping method we showed that the 5,000 clusters from Evan's atlas correspond quite well, as shown in this matrix, with the 5,300 clusters derived from the single-cell RNA-seq data at the Allen Institute.
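The cross-atlas correspondence matrix described above can be sketched as a simple joint count of paired cluster labels. This is a toy example with simulated labels, not the actual hierarchical mapping method; the cluster counts are placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sketch of cross-dataset cluster correspondence: each cell carries a
# cluster label from taxonomy A (e.g. one scRNA-seq atlas) and, after
# mapping, a label from taxonomy B (e.g. an snRNA-seq atlas). Counting
# label pairs gives the correspondence matrix shown as a heatmap.
n_cells, n_a, n_b = 10_000, 8, 6
labels_a = rng.integers(0, n_a, size=n_cells)
# simulate mostly consistent mappings so the matrix has a strong band
labels_b = np.minimum(labels_a, n_b - 1)
flip = rng.random(n_cells) < 0.1           # 10% noisy assignments
labels_b[flip] = rng.integers(0, n_b, size=flip.sum())

corr = np.zeros((n_a, n_b), dtype=int)
np.add.at(corr, (labels_a, labels_b), 1)   # joint count of (A, B) labels

# Row-normalize: fraction of each A cluster landing in each B cluster.
frac = corr / corr.sum(axis=1, keepdims=True)
print(frac.shape)
```

A clean near-diagonal structure in `frac` is what "the clusters correspond quite well" means quantitatively.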

So with that, I just want to conclude by summarizing what I have told you about today. From our earlier single cortical region studies we have found several principles of cell type organization including hierarchical organization, coexistence of discrete and continuous variations, and similarity and difference across species. Now with the transcriptomic cell type atlas for the entire mouse brain, we identified extreme diversity and complexity and distinct features of cell types in different brain regions. Transcriptomic types often have regional and more fine-grained spatial specificity, revealing an integrated molecular and anatomical organization of cell types across the brain. The relatedness between cell transcriptomic types likely reflects the developmental origins and/or the evolutionary homologies. And in this regard, transcription factors really carry that historical, developmental, and evolutionary information and they are the major determinants of cell type classification even in the adult mouse brain. We identified extraordinary diversity and heterogeneity in neurotransmitter and neuropeptide expression patterns, and we expect similar things regarding other molecules, thus illustrating myriad modes of intercellular communication. And finally, I showed very quickly that we observed a high degree of consistency across different data modalities in this particular collection of work between single cell RNA-seq, single nucleus RNA-seq, spatial transcriptomics, and also methylation-seq and ATAC-seq, giving us high hope that we will be able to generate an integrated cell type atlas across modalities for the whole mouse brain.

With that, I just want to thank the large number of scientists, technologists, and managers at the Allen Institute who made this work happen, and also thank our collaborators in Xiaowei Zhuang's lab, as well as the funding from the BRAIN Initiative. Thanks for your attention, and I hope I have a few minutes to take questions.

ANDREA BECKEL-MITCHENER: Thanks, Hongkui. That was a great overview of the mouse atlasing project, which was amazing. We do have a few questions in the Q&A, so I encourage others to ask. It looks like Dr. Yong Yao is typing an answer to one of them, which is more of a technical clarification, so I'm going to move on to another question: How many mice were used to generate the single-cell datasets, and were there cross-institution validations for the RNA-seq clusters? So what sort of validation did you do?

HONGKUI ZENG: Yeah, that's a great technical question, and it's important for us. Because we use single-cell RNA-seq, we have to use a fresh mouse every day for dissection and isolating live cells for sequencing. So we used more than 300, about 320, mice for this study, and 780 10x scRNA-seq libraries. So it's not coming from a small number of animals; it's really a large number of animals, all on a C57BL/6 background, and we used stringent criteria for QC.

In terms of cross-institutional validation, I want to use the MERFISH data as a great example. Our lab and Xiaowei Zhuang's lab together generated seven whole-brain-level datasets, using different gene panels, completely independently, and we obtained almost exactly the same spatial location for almost every cluster, certainly at the supertype and subclass levels, with a high degree of correspondence. For a large number of clusters, more than 90% of them, we also obtained the same specific spatial location, so we think that's really good cross-institutional validation. Also, in the last slide I showed, the correspondence between our clusters and Evan Macosko's clusters is another form of validation, and we're continuing to compare our datasets with other public datasets as well.

ANDREA BECKEL-MITCHENER: Yeah, very important, thanks. We don't really have time to move on to more questions, but I just want to note that in the Q&A there's a lot of interest in transcription factors, and this is a really interesting piece that has been revealed through this resource. I want to remind all of our attendees that the datasets are public. They're available, and we really encourage anyone who's interested in any aspect of this to get your hands dirty: get into those datasets and start doing your own analyses. That's exactly why this resource was created, to be broadly available and to answer some really interesting questions around biology. So I'm going to leave it at that. But we are collecting these questions, and hopefully those of you online can also help answer them live. I think you should be able to see the Q&A, so you might want to get in there. I can try to answer some questions as well.

ANDREA BECKEL-MITCHENER: Fantastic. So I think we will now move on to our next speakers. I'm thrilled to introduce two folks who grew up in the atlasing program, as far as we're concerned: welcome, Aparna Bhaduri and Tom Nowakowski. Both Tom and Aparna were trainees in some of the original projects, and they have gone on to do fantastic independent work as well. Tom is an Associate Professor in the School of Medicine at UCSF, and Aparna joins us from UCLA as an Assistant Professor. I won't spend any more time except to turn it over to the two of you. Again, I will give you about 15 minutes; I'll give you a five-minutes-left call, and then you can wrap up and hopefully answer some questions as well. So on to the two of you.

TOM NOWAKOWSKI: Great, thank you so much Andrea. I’m going to share my screen, can you see my slides and can you hear me ok?

ANDREA BECKEL-MITCHENER: Yes, all looks good.

TOM NOWAKOWSKI: Great. So it's a real pleasure to join this webinar together with Aparna and provide an update on the progress in understanding and mapping developing brain atlases. We wanted to provide both inspiration and justification for why we need to better understand the developing brain, and what that potentially means for our ability to understand the etiology of developmental disorders; to provide an update on where the BRAIN Initiative Cell Atlas Network program currently stands; and to highlight some of the unmet needs. Thanks to Hongkui's beautiful introduction, we really have come to appreciate, and are mesmerized by, the tremendous diversity of cell types that exists in the adult human and non-human brain. This inspiration obviously dates back to the days of the early anatomists, who classified cell types purely based on what they look like and recognized the sheer complexity of morphological types. Thanks to the work that Hongkui described, we now know that the adult mammalian brain can contain as many as 5,000 or more distinct transcriptomic types, and the transcriptomic definitions have really provided an anchor for our understanding of how diverse cell types in the adult brain may be. But this is of course only a starting point toward understanding where these cell types come from during development and what developmental trajectories set up this tremendous complexity. One of the reasons we want to understand how these cell types are created during development is that there is a whole spectrum of neurodevelopmental, behavioral, and intellectual disorders that affects quite a large number of individuals, including young individuals, in industrialized countries, and we're only just beginning to scratch the surface in understanding the diversity of cell types in the developing brain.

Starting from the early anatomists, a common concept has been that the diversity of cell types in the developing brain may be significantly reduced compared to the adult brain, but a lot of those assessments were based on descriptions of cellular diversity grounded purely in morphology. Thanks to initial work supported by the BRAIN Initiative, some of the early single-cell RNA sequencing transcriptomic studies almost immediately revealed that, despite the apparently reduced morphological complexity of cell types in the developing brain, there still exists quite striking molecular diversity. So using single-cell RNA sequencing, we can discover new cellular types or cellular states in the developing brain that wouldn't be apparent from looking purely at morphology. This matters because it gives us insight into the molecular cell types that may be present during development.

One of the reasons this is important and will advance our understanding of the etiology of neurodevelopmental disorders is that, thanks to human genetics studies supported largely by the NIH but also by the Simons Foundation and other foundations, we have now identified genes with a genome-wide significant excess of damaging mutations in patients diagnosed with neurodevelopmental and psychiatric disorders, including ASD and related neurodevelopmental disorders. One of the goals in understanding the possible etiologies of these disorders is to identify where and when during brain development these genes are most important, i.e., to identify the selective vulnerabilities of cell types and developmental periods. Even some of the early studies have highlighted that prenatal periods of brain development may be particularly relevant and vulnerable in these developmental disorders, and the periods highlighted here correspond to the processes by which neurons and glia are first generated. A lot of these insights could be gained even prior to single-cell analysis and the advent of single-cell datasets, but by integrating gene expression enrichment analysis with single-cell atlases, we can now begin to pinpoint specific cell types that are enriched for expression of high-confidence neurodevelopmental risk genes, providing insight into the possible vulnerabilities of the developing brain to the damaging mutations found in patients. So this is one very concrete application of how cellular-resolution atlases can provide new insights and new hypotheses for how neurodevelopmental disorders might occur.
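The kind of gene expression enrichment analysis described here, asking whether a cell type's marker genes are enriched for risk genes, is often framed as a hypergeometric test. Below is a minimal illustrative sketch; it is not the actual pipeline used in these studies, and all function and gene names are hypothetical.

```python
from scipy.stats import hypergeom

def risk_gene_enrichment(markers, risk_genes, background):
    """One-sided hypergeometric test: are a cell type's marker genes
    enriched for risk genes, relative to all assayed genes?
    (Illustrative sketch; names are hypothetical.)"""
    markers = set(markers) & set(background)
    risk = set(risk_genes) & set(background)
    overlap = len(markers & risk)
    M = len(set(background))   # total genes assayed
    n = len(risk)              # risk genes among them
    N = len(markers)           # marker genes drawn
    # sf(k-1) gives P(X >= k) under random draws without replacement
    return hypergeom.sf(overlap - 1, M, n, N)

# toy usage: 10-gene background, 3 risk genes, 3 markers, 2 overlapping
p = risk_gene_enrichment(["g1", "g2", "g4"], ["g1", "g2", "g3"],
                         [f"g{i}" for i in range(1, 11)])
```

In practice one would run this per cell type over its marker list and correct the resulting p-values for multiple testing.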

Another practical application of these datasets that I wanted to highlight has to do with evaluating the fidelity and robustness of in vitro models of human brain development. As some of you may have heard, stem cells can be derived at scale from skin fibroblasts of adult individuals, including healthy individuals as well as patients with neurodevelopmental or psychiatric disorders, using advances in stem cell technologies, and these stem cells can subsequently be differentiated in vitro into what are known as brain organoids. These are in vitro models that we can use to study the etiology of neurodevelopmental disorders thanks to gene engineering technologies as well as technologies that allow us to perturb specific gene expression, which you will hear about from Aparna. One of the benefits of the datasets the BRAIN Initiative has created is that we can now quantitatively compare the fidelity and robustness of gene co-expression networks that emerge in these in vitro models against the developing human brain in vivo. These comparisons can, in a quantitative and rigorous way, evaluate and highlight the remarkable similarities but also some differences between those in vitro models and the developing brain. So these are some of the very concrete and practical utilities of these datasets that we wanted to highlight.

Just to summarize where we are with the work supported by the BRAIN Initiative Cell Census Network: there have been a number of pilot studies, highlighted here, supported by the BRAIN Initiative that were overwhelmingly focused on the developing mouse brain, but we also have a limited number of studies that embarked on mapping the developing non-human primate and human brain. A lot of these studies have now concluded and have resulted in a number of publications, some of which are listed here, many in high-impact journals, which are very exciting and which I recommend you follow up on and read.

We are now embarking on the next phase of this effort, which you will hear more about from Aparna in a second. It has been pioneered by the investment from the BRAIN Initiative Cell Atlas Network, which seeks to create more comprehensive atlases of the developing human and non-human primate brain as well as comprehensive datasets from the developing mouse brain, and some of these milestones and goals are listed on this slide. We hope in a few years to provide you with additional updates on how these datasets can become more useful and advance our understanding of the developing brain. With that I will hand over to Aparna, who will speak a bit about her work and her lab's efforts to integrate and make sense of these data and put them to practical use. Aparna, over to you.

APARNA BHADURI: Thanks, Tom. Thanks for that amazing introduction. I think that one of the goals of this section of the talk is really to provide an example of what Andrea was talking about, where we can really take data that's out there and try to integrate it to understand key biological processes. So this is one of the things that we've been trying to do and is also something that we're trying to extend as part of a joint analysis to additional datasets, so this is hopefully just the first step of these types of explorations that we are planning to do through our UM1 funded work as part of the BRAIN Initiative.

Our goal for this project, which is led by Dr. Patricia Nano, an incredibly talented postdoc in my lab, was really to create a meta-atlas of the developing human cortex. The idea was to create an integrated analysis of seven transcriptomic profiles, combining datasets, many funded by the BRAIN Initiative and some independently conducted by other labs, focusing primarily on developmental periods spanning the peak stages of neurogenesis during the second trimester but also, depending on the dataset, flanking this time period. Based on this conventional integration, you can see that the major cell types we would expect to exist during human brain development, and cortical development specifically, are represented. What Patricia sought to do was to use an orthogonal strategy to identify novel gene co-expression networks that would allow us to identify biological processes that correlate with and parallel some of these cell types, in order to really address some of the issues that Tom raised: we know that there are transient cell types and important biological processes during development, but we're still learning how they connect to the cell types that we see in the adult.

So briefly, the method was that we individually analyzed each of the 96 individuals and performed unsupervised clustering for each one. This allowed us to capture gene signatures that were variable within an individual and to avoid batch effects. It also allowed us to integrate between individuals to find the strongest signal, so that the resulting meta-modules carried biologically meaningful information rather than technical variation. Patricia went on to annotate these modules, and many corresponded to interesting biological processes, some of which we believe are related to processes such as cell division or cell fate; for others the role is unknown, but we are hoping that, using the patterns of activity of these modules, we can better understand their role in development. We used some positive controls: for example, modules related to vascular function were expressed in vascular cells, and similarly for microglia and immune function, as well as some of the glial subtypes, which gave us confidence in the specificity of our modules as well as our ability to extract novel information from the other modules in the dataset.
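The per-individual strategy described here can be sketched in a few lines: compute gene-gene co-expression separately within each individual, then average across individuals, so that shared biology rather than between-individual batch variation drives the gene modules. This is an illustrative simplification of the idea, not the published method; names and the clustering choices are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def meta_modules(expr_by_individual, n_modules=5):
    """expr_by_individual: list of (cells x genes) arrays, one per individual.
    Correlations are computed within each individual and then averaged, so
    technical variation between individuals never enters a single
    correlation matrix. (Illustrative sketch only.)"""
    corrs = [np.corrcoef(X.T) for X in expr_by_individual]  # gene-gene, per individual
    consensus = np.mean(corrs, axis=0)
    # cluster genes on the consensus correlation (distance = 1 - r)
    dist = np.clip(1.0 - consensus, 0.0, None)
    condensed = dist[np.triu_indices_from(dist, k=1)]
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=n_modules, criterion="maxclust")
```

Genes assigned the same label form one meta-module, which would then be annotated against known biological processes.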

Our goal was then to identify modules that could give us insight into potential biological processes such as the initiation of cell type specification and the refinement of these identities, as well as a little bit of mechanism, which I'll talk about at the end. Beginning with the initiation of cell type specification, we focused on module 156, which, based on our annotations, was related to radial glia biology, neuronal activity, and response to stimuli, and, as you can see from the module activity, was broadly expressed in radial glia. Interestingly, its activity increased in radial glia across development. What was particularly exciting to us, and this is one of the ways we thought to leverage BRAIN Initiative data, is that when we looked at the activity of this module in adult data generated by the Allen Institute, we saw specific expression within glial populations. We then wanted to see whether there was any evidence that this particular module had spatial and temporal specificity, and indeed, looking at the co-expression of two of its genes, PDLIM5 and QKI, we saw that they were specifically co-expressed at later stages in gestational week 20 ventricular zone samples, suggesting that this module is active in progenitors later in development. While we haven't yet performed experiments to mechanistically test what happens when this module is eliminated from the cells, we think this is consistent with the idea that there are certain biological processes that drive specification, such as towards a glial fate, that we can find in these types of datasets, and that this type of analysis can also be hypothesis-generating.

When thinking about the refinement of cell type identities, we focused on neuronal modules that were somewhat broadly expressed across neuronal populations in our developmental dataset. One thing that has been a challenge in the field is identifying which of the cells within the broadly upper or deep layers will become individual subtypes. Although there was some subtype specificity for these three modules, 134, 94, and 189, they were broadly expressed across a number of excitatory neurons, and their module activity increased across newborn neurons and excitatory neurons over development. However, again, when we leveraged the adult data, we saw that some of the modules, most strikingly 134, showed layer specificity to layer four. We again used immunostaining to look for co-expression of some of the genes in module 134 and saw that they colocalize in upper layers during development. We can zoom in on these two panels and see that there is more colocalization in the upper layers compared to the deep layers, and that the colocalization sits right above CTIP2, which is typically a marker for deep layer neurons.

We also wanted to pursue whether any of these modules could give us mechanistic insight into cell type specification. One module that was really exciting to us was module 20, which we saw expressed broadly across our neuronal populations and increasing specifically in deep layer identity cells but not in upper layer cells. More specifically, when we looked at module 20 within the adult data, which again has been immensely helpful in contextualizing our analyses of development, we saw specificity to these deeper layers, and that across layers five and six there was specificity to the subtypes annotated by the Allen Institute as FEZF2 transcription-factor-marked subtypes, connecting to some of the ideas that Hongkui was presenting in her talk. What was interesting, though, is that this module doesn't contain FEZF2. However, in development we saw that FEZF2 expression preceded module 20 expression and activation, which suggested to us that module 20 might be regulated by FEZF2. Indeed, when we leveraged existing FEZF2 ChIP-seq data, we saw that more than half of module 20 is regulated by FEZF2, and this included the transcription factor TSHZ3, the only transcription factor found in module 20. And as you can see, there was some concordance of TSHZ3 expression in the subtypes we were interested in, as well as correspondence of its activation timing with module 20.

ANDREA BECKEL-MITCHENER: Just giving you a three-minute warning or so.

APARNA BHADURI: Perfect, we're just about wrapping up. So we really wanted to test whether TSHZ3 and FEZF2 could be involved in this specification, and again, as Tom mentioned, we've carefully compared organoid data to primary data from the developing brain, so we understand some of the strengths and weaknesses of the model. We sought to use this system by knocking down TSHZ3 and FEZF2 and performing single-cell RNA sequencing on the knockdown cells compared to the non-knockdown cells, as determined by mCherry-positive expression, where the hairpins carried mCherry, so we could really perform a controlled experiment in the same environment with single-cell RNA sequencing. We were able to identify the cell types that we typically see within organoids and to contextualize them against primary cell types, and broadly the cell types corresponded to what we would expect. We saw a modest depletion in module 20 activity with the FEZF2 and TSHZ3 knockdowns, but this corresponded to a substantial depletion in the proportion of deep layer neurons, suggesting a cascading effect where each of these transcription factors has a role in controlling module 20 activity, which in turn impacts deep layer specification. We're really excited about this because it was an example of how we were able to use data right up until the last experiment to develop a model, and then test it using the models that we have for human cortical development. The model we have arrived at is that FEZF2 activation precedes TSHZ3 activation, which then gives rise to module 20, which is important for this intermediate specification event that happens between the earliest stages we were looking at during neurogenesis and the adult deep layer FEZF2 neuronal specification.

So this strategy has allowed us to start to bridge the gap between the developing and adult human cortex, and we have not only mechanistic insight but also a resource, if other people want to look into this. This is also on bioRxiv if anyone is interested. We think this is also a strategy for integrating other datasets, which is one of the things we are seeking to do as part of our UM1 moving forward. When we think about these integrations, especially across development, a number of challenges continue to persist. In the image I've shown here, and I'm sorry, I think the attribution isn't showing, but it is from Rong Fan and Nenad Sestan, you can see that we are seeking to go from the molecular level to actual physical structures, using spatial transcriptomics to make the link, and then to link that all the way back to MRI data and coordinate framework integration. So this is one level of integration we're seeking to do as part of our project, but we're also thinking about how we can do this between species and, importantly, between developmental stages and modalities. These are different axes of integration that we're hoping to get a handle on in the coming years. With that I'd like to thank the members of my group, the members of our BRAIN Initiative project, and all of our funding sources, especially BICAN funding, and thank you for your engagement. I'm happy to take questions.

ANDREA BECKEL-MITCHENER: Great, thank you, Aparna, and thank you, Tom. We're being super efficient in the Q&A, so it looks like a lot of these questions are being answered along the way by colleagues, and people are just interested in the biology. So I don't have a question; I just want to say what a great example this is of how you take different datasets that were really generated as a resource, understand the data, and look for biological insights that may not have been possible before at the level of resolution that you all are able to achieve. So that's great. Oh, we do have a question, so I'm going to go ahead and read it. They say: thank you, this is exciting. What was the temporal window for the FEZF2-mediated regulation of module 20? Does the intermediate transcription factor TSHZ3 take over at some point? Is that obvious to you?

APARNA BHADURI: Yes, so I don't know if it's obvious. I think it's a good question, and we're seeking to understand this timing. But what we were seeing is that during the earlier stages of neurogenesis, especially in the human, and this is actually somewhere the temporal expression was a bit different from what we were seeing in mouse data, FEZF2 was active early in those neurogenesis stages, and then its expression declined, and then TSHZ3 and module 20 came up. We are not fully sure whether TSHZ3 takes over entirely. Its expression also decreases after a wave of expression, but the other module 20 genes then consistently increase. So we think it might be a kind of handoff, where it needs to get the ball rolling, but then the transcription factors aren't as essential. Tom has his hand up.

TOM NOWAKOWSKI: Oh yes. So this was fantastic. Thank you Aparna. I wanted to ask another question which is how do you think about integrating the datasets across species? Obviously, when we think about the human brain and the mouse brain, there's quite a difference in size and hopefully function as well. So what do you think are the developmental differences that have driven the dramatic expansion of the human brain, and how can the datasets that you're creating lend themselves to our understanding of what made us human?

APARNA BHADURI: I think that's a great question, Tom. I think we could toss it back to you as well. I know you've thought about these questions extensively. But briefly, I'll say that some of the data approaches that we're using allow us to get some crosstalk between datasets of different developmental stages and different species. I do think that there's a lot of questions to be answered in terms of are we really thinking about different species and different genomes, are they consistent, and are we introducing any biases. But those technical questions aside, I think that there's a lot of ways that we can start to use the data to compare timing and try to understand how at a process specific level, there are differences, whereas not all of the timings are going to be moving at the same pace. And then I think that there's a lot of work that folks in the BRAIN Initiative are doing to understand really what this means for how evolution proceeded. I'd also ask you to chime in on that same question if you have any additional thoughts.

TOM NOWAKOWSKI: No, I think this is really important and has the potential to highlight vulnerabilities of the human brain to various types of neurological disorders. Andrea, do we have time for one more question?

ANDREA BECKEL-MITCHENER: We are bleeding into our break a bit, but if it's quick we can do it.

TOM NOWAKOWSKI: Yes, there is a question in the Q&A. When would you recommend doing meta-analysis? How do you control for batch effect for example data combined from different researchers?

APARNA BHADURI: Great question, and this was actually one of the motivations for the strategy we used, where we look at each individual, take the variation observed within that individual, and apply it to the broader patterns that we were seeing. One reason for this is that many of the batch correction approaches out there right now are excellent and do a good job of integrating, especially at the cell type level, but we were concerned that some of the gene co-expression patterns would get diluted out, and that's why we took the strategy we did here. I think batch effects are always a concern and something to be thought about, but what we're learning is that by being creative in the way we address and acknowledge them, and then looking for things that could be technical artifacts, we can still make use of data generated by a lot of people. At some level there's power in numbers, and having more datasets from more individuals can give us more insight into really understanding the biology.

AMANDA PRICE: It is now 12:55. So my name is Amanda Price. I'm a Program Officer at NIMH, and it is my pleasure to introduce the first panel on Brain Cell Atlas Data Integration and Annotation. This panel is moderated by Drs. Guo-Cheng Yuan and Jesse Gillis. Dr. Yuan is a professor of computational biology at Icahn School of Medicine at Mount Sinai and a faculty member of the Icahn Genomics Institute. Dr. Gillis is Associate Professor in the Department of Molecular Genetics at the University of Toronto. So I'll now hand the meeting over to Drs. Yuan and Gillis.

JESSE GILLIS: Thanks, Amanda. GC and I are splitting moderating duties, so I'm taking the first session and GC will follow with the second. The way we've organized our topic, Brain Cell Atlas Data Integration and Annotation, is into three mini sessions: the first principally focused on where we are, from the past to the present; the middle on specific data problems currently arising in development and new modalities; and the third on new technologies and approaches for the future, and what directions we are taking. For each session, we have two very short presentations that we hope will provoke some questions and thoughts, not specifically about the talks, but about the topics. And so we're beginning with Daifeng Wang. I won't give long introductions; Daifeng is from the University of Wisconsin, and I encourage you to look up his very interesting work over the years. Daifeng, if you want to share your screen.

DAIFENG WANG: All right. Thank you, Jesse. Can you see my screen?


DAIFENG WANG: All right. Thank you, everybody. So first, I want to thank NIMH and the BRAIN Initiative for organizing such a wonderful workshop about data, and I'm happy to give the first short presentation on our first topic, cell types. We know that the BRAIN Initiative has generated a lot of multimodal data to characterize single brain cells from different aspects: for instance, single-cell RNA-seq and ATAC-seq measure gene expression and chromatin accessibility, and Patch-seq can give us electrophysiological and morphological features. Using those multimodal data, we can integrate the modalities, map them onto a latent space, and cluster cells in that latent space to get cell clusters. Cells that cluster together share very similar multimodal features, defining a potential cell state. However, what we don't know is whether a cell cluster corresponds to a particular cell type or a specific cellular function. Of course, we can map our data onto some reference data and, for example, transfer labels to annotate the cell types, but those cell types are defined by prior knowledge or particular features such as marker genes, and those marker genes may not fully represent the underlying function of the cell clusters.

So for instance, we know that gene regulatory networks fundamentally control gene expression, and those networks link transcription factors through regulatory elements, like enhancers and promoters, to control target genes. But regulatory networks can change across conditions, like disease types or developmental stages, such as switching the transcription factors or the enhancers for different cell clusters even if they express very similar marker genes. That's why we need to leverage additional techniques to better understand cell types or cell functions. So here, I just give an idea: for example, we can leverage machine learning to integrate and analyze single-cell multimodal data to see if we can better understand cell types or cellular functions. The community has published a lot of great machine learning approaches to do this; here, I'll use recent work from my lab to summarize several ways we can leverage machine learning. For instance, we can use manifold learning to align the multimodal features and map cells from different modalities onto the same latent space, and form cell clusters on that shared latent space. Those clusters then correspond to cross-modal cell types with similar multimodal features, instead of a single modality like gene expression.
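The idea of projecting two modalities measured on the same cells into one shared latent space can be illustrated with classical canonical correlation analysis, a simple stand-in for the manifold-alignment methods discussed in the talk; the variable names (RNA vs. electrophysiology) are hypothetical.

```python
import numpy as np

def whiten(X):
    """Center a (cells x features) matrix and return an orthonormal basis
    of its column space (left singular vectors of the centered data)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U

def align_modalities(X_rna, X_ephys, n_components=2):
    """Classical CCA via SVD of the cross-product of whitened modalities.
    Returns shared-space coordinates for each modality plus the canonical
    correlations. (Illustrative sketch, not a specific published method.)"""
    Ux, Uy = whiten(X_rna), whiten(X_ephys)
    U, s, Vt = np.linalg.svd(Ux.T @ Uy)
    Z_rna = Ux @ U[:, :n_components]
    Z_ephys = Uy @ Vt[:n_components].T
    return Z_rna, Z_ephys, s[:n_components]
```

Cells could then be clustered on the concatenated or averaged shared-space coordinates, giving cross-modal clusters instead of clusters from one modality alone.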

Also, beyond individual features, we can infer relationships among multimodal features, such as a gene regulatory network, which may better define the cell type; for example, we can see how the network defines the cell types. And we know that multimodal data are not always available, so we can also use machine learning to impute missing modalities. The basic idea is to use the available multimodal data to train a machine learning model to infer one modality from another, and then apply the pre-trained model to single-modality data to impute the missing modalities. And eventually, we can input population-level multimodal data into a machine learning model, like a deep learning model, to prioritize genes, gene regulatory networks, and cell types for disease types and clinical phenotypes: not just learning a black-box model, but also making the model biologically interpretable. So, due to the time limit, I'll stop here, and I look forward to the panel discussion. Thank you.
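The imputation idea, train on cells where both modalities were measured, then predict the missing modality for single-modality cells, can be sketched with a ridge regression as a minimal stand-in for the deep models used in practice; the modality names here are purely illustrative.

```python
import numpy as np

def fit_imputer(X_obs, Y_obs, alpha=1.0):
    """Fit a ridge regression mapping one modality (X_obs) to another
    (Y_obs) on cells where both were measured. Returns the weight matrix
    including an intercept row. (Illustrative sketch only.)"""
    X = np.hstack([X_obs, np.ones((X_obs.shape[0], 1))])  # add intercept
    d = X.shape[1]
    # closed-form ridge solution: (X'X + alpha*I)^-1 X'Y
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y_obs)

def impute(X_new, W):
    """Apply the pre-trained mapping to single-modality cells."""
    X = np.hstack([X_new, np.ones((X_new.shape[0], 1))])
    return X @ W
```

In the multimodal setting, the same train-on-paired, apply-to-unpaired pattern holds; only the model class (e.g., a neural network) changes.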

JESSE GILLIS: Thanks, Daifeng. That's great. And moving on to Josh Huang at Duke University for a talk I'm looking forward to as well. You're muted, Josh. Yep, there you go.

JOSH HUANG: Oh, okay. Good afternoon. I hope everybody can hear me now and see the slides. So, as outlined at the very beginning, one of the goals here is to go from data to knowledge, and ultimately to insight and understanding of principles. One of the challenges comes back to understanding and defining cell types: to achieve an overarching definition that will integrate multiomics datasets, which are now arriving at massive scale and very quickly, but also to integrate this with multimodal phenotypes - we have heard anatomy, morphology, physiology - and ultimately the goal is circuit function. And all of this should be based on biological mechanisms, because we believe there is a biological mechanism that goes from gene expression programs to cell phenotypes and circuit function, and principles that involve developmental lineage, trajectory, and even evolutionary history. And all of this, hopefully, goes beyond the sort of operational, management-style definition by statistical parameters and often arbitrary criteria.

So just to facilitate the discussion, I'll propose one definition that we have been thinking about; it's a working hypothesis. This is what I call a communication-element definition of neuron types. In this view, neuron types are specialized communication devices, which can be defined by their unique connectivity patterns and input-output transformations, which in turn may be shaped by specific transcription signatures of gene batteries that are the output of gene regulatory programs, ultimately rooted in the epigenetic landscape. So let me unpack this a little bit. A communication element is defined by its anatomical connectivity - what are the presynaptic and postsynaptic elements - and also, beyond that, by how it transforms physiological input to output in the context of circuit computation. Connectivity is an overarching anatomical feature, above location and morphology. There has been a lot of emphasis on morphology, including large-scale single-cell morphology reconstruction and striking shapes. But the goal of neurons is not to enter a beauty contest - I think we all agree - but to connect to the correct presynaptic cells by extending dendrites and to reach the proper targets via their very long axons. So this is a feature that is, I think, essential, but difficult to characterize.

The other point is that we use intrinsic physiology and synaptic properties measured in slices to characterize physiological properties. But we know this is highly limited. Ultimately, the role of physiology is to perform input-output transformation in the context of circuit computation and behavior - again, a very difficult feature to measure. However, there has been some interesting progress suggesting that these features do not emerge randomly, because these features - for example, in chandelier cells or layer 5 corticofugal cells - are reliably generated in each individual of the species. There has to be a genetic program that generates a set of functional genes, such as cell adhesion molecules, ion channels, and synaptic proteins. And these are most likely orchestrated by transcription programs; as we have heard from Hongkui already, transcription factors that we found to be very important before are now also proving very important in the whole-brain transcriptomic datasets. But these are unlikely to be random; they are likely a gene regulatory program that is ultimately shaped by epigenetic architecture through developmental programming. And ultimately, these cell relationships and this communication are what underlie circuit computation. So we would like to think that this communication definition has the potential to integrate key features, from molecular genetics to circuit function.

So the implication is that neuron types may be defined by their relationships in the circuit - how they communicate - not just by cell-intrinsic, cell-autonomous properties such as their shape or intrinsic physiology. More importantly, these communication features have to be encoded in transcription signatures guided by a gene regulatory program. The problem is that, despite our incredible progress in single-cell genomics and morphological analysis, we don't actually have a way to measure connectivity and communication. Connectivity requires EM connectomics, which is hopefully on the horizon. And if we borrow the lesson from Drosophila, where the whole-brain connectome has emerged, I think cell types at that level will pop out when we recognize the connectivity motifs. But even that may not be enough, because ultimately it is functional connection - the transformation of input to output during behavior - that defines the unique communication style. And I don't see how we can do that yet; it will probably require next-generation technology. But there may be a shortcut. The hope is that these are, again, not magic: these cell types and the way they communicate, at least for the cardinal types, the supertypes, are reliably generated during development. They are essentially identical among individuals and probably conserved across species. So there is a genetic program. And if we can use ground-truth experimental systems to discover the relationship between transcription signatures, including transcription factors and the underlying gene regulatory programs, and how they encode and predict these phenotypes, then I think these are actually written in the transcriptome and epigenome. If we can learn the rules, then we can leverage the very high-resolution multiomics data to advance cell type definitions. So that's my take for today.

JESSE GILLIS: Thanks, Josh. So we're going to open it up to the panel. You can raise your hand to discuss issues. I guess the first question I'll put to the panel and hope to get some response to is: if we think of these talks and the previous ones we saw during the day, one of the themes is moving past pure, quote-unquote, "unbiased transcriptomic clustering." So on the one hand, we have sets of cells we're calling cell types that come from unbiased clustering of cells. On the other hand, there's a desire to integrate epigenetic information, biological information, knowledge of function, cross-species information, and then there's developmental variability. And so I would put to the panel: how do people feel about the ad hoc definition of cell type that we currently use - clusters, clustered by gene expression, with metadata attached? Or do we almost need new language, so that when we have new modalities, they can be incorporated and we can call those things cell types? So is the ad hoc definition of cell types satisfactory or not? Are there any thoughts around that? We just heard a series of talks where the pivotal achievement of many of us contributing, or other groups, is these 5,000 cell types in the brain. Are you comfortable with that claim? Or is that a transient claim, a work in progress? It's based principally on gene expression data. Are you comfortable with the claim that cell type can be defined principally from gene expression at this time, in the brain, let's say?

HONGKUI ZENG: Can I start?


HONGKUI ZENG: Jesse, yeah, that's a great question. I'd say the claim of 5,000 cell types in the brain is really approximate, right? It's lay language, so that people know what we're talking about. But strictly scientifically, in the papers, we always say very clearly that those are clusters, and particularly that they are transcriptomic [inaudible] cell types. And in the paper, when we say cell types, we define all levels of cell types: classes, subclasses, supertypes, and types. They are all definitions of cell types, right? It's just that the granularity is different. That's why we differentiate them as clusters. Yeah. It is a very good point; I think it's hard to convey that to the general community when you talk about cell types. So I think it's a question-- I just wanted to start with a comment.

JESSE GILLIS: No, for sure. I'm trying to be a bit provocative to get people talking, but I will say that if we're not totally happy with that as the perfect platonic definition, what is the right definition? So anyway, if you have some comments.

HONGKUI ZENG: Yeah, at least we say it's transcriptomic cell types, right? We're not saying this is the ultimate cell type. Yeah.

TRYGVE BAKKEN: Yeah, I just wanted to throw out there something that's probably in the minds of others is that evolutionary conservation over millions of years, I think, points to the functional significance of these clusters of cell populations. So that's, I think, one piece to think about. It's the technology we can apply at scale now. And so it provides a framework to then move towards maybe a fuller definition. So I think we have to start somewhere. And then the last piece is, even if you don't think these are-- every transcriptomic type is a functional unit, there's a genetic or epigenomic hook to then target that type and study its function and potentially treat disease. So—

JESSE GILLIS: Right. Agreed. I think that harkens back to some of the debates in ENCODE even, where the fact that something is molecularly useful doesn't mean it's functional in the sense of conserved, but it still can be useful. I'm a big fan of the conserved as functional definitions. So I'm partial to that. Sten?

STEN LINNARSSON: Yeah, if I can jump in, I would say that it's important to keep in mind that clustering is a way for us to organize this high-dimensional space. And whether it's 5,000 or 2,000 or 10,000, it's always going to be a bit arbitrary. But one of the, I think, important observations from the large-scale atlasing of the brain that's been going on is that, to a very large extent, the clusters at many levels reflect developmental origins. And so I think that should be one of the primary organizing principles as you build the taxonomy of adult cell types: to align them with their developmental origin as much as possible. And that will be possible up to a point, up to a certain level. Beyond that, you will have things that are affected by hormones or local gradients or activity and so on that go beyond developmental origin. I think it's important to at least anchor the framework for a taxonomy of brain cell types in development. And also, that gives you a nice transition between actual developmental cell states and the adult cell types. That's a common language between the two if you organize it in that way.

JESSE GILLIS: I'm curious, just to follow up on that a little bit, if there was a developmental feature that was-- mechanistically, it's real, but you don't think it has functional significance. Meaning it's a spandrel in development, and it produces a cell type that's distinct molecularly, but there's no conservation. So I'm just bouncing them against Trygve's suggestion, would you want to call that a distinct cell type or not? Or is that just bad use of language at that point?

STEN LINNARSSON: If it's not even evolutionarily conserved, then I don't know.


STEN LINNARSSON: Yeah, I mean, it's still there. It's an observation. It doesn't mean that-- I mean, then you can be-- then you should be very skeptical that it has a function. It could be a human-specific adaptation, obviously, but I think those are rare. And I think if we get everything else right, we can worry about those towards the end of the project.

JESSE GILLIS: Sounds good. Thank you. Lior?

LIOR PACHTER: Thanks, Jesse. I appreciate your question. And I'd just like to focus in on something you mentioned, which is the integration of different modalities and how that should play into the definition. What I think is worth pointing out is that there is no real or clear notion even of "transcriptomic," because transcriptomics already involves multiple modalities. What I mean by that is that, for example, when we perform RNA velocity, we actually generate two count matrices, one from reads that are aligning to introns and the other to junctions or exons. So we can really build two count matrices, one for nascent RNA (or a proxy for nascent) and one for mature RNA. And if you cluster the one matrix, the nascent, you get different clusters than if you cluster the mature. Similarly, when you think about single-nucleus RNA-seq - and this came up earlier - you're already also looking at really a different modality. It's just a different measurement. So I think how we integrate modalities starts with how we integrate just the transcriptomic modalities in a coherent way, even just to cluster them. And I think that has to happen via some kind of biophysical model for how they relate to each other. So that's my view: absolutely, integration of all the different modalities is essential, ultimately, for a molecular view of a cell type or even a cluster. And that starts with thinking about transcriptomic data in terms of its multiple modalities.
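To make Lior's point concrete - that intron-derived ("nascent") and exon-derived ("mature") count matrices from the same cells can cluster differently - here is a minimal toy sketch. The data are simulated; the Poisson rates, cell counts, and the use of scikit-learn's KMeans are all illustrative assumptions, not anything from the talk.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy illustration (not real data): the same cells yield two count
# matrices -- "nascent" (intron-aligned reads) and "mature"
# (exon/junction-aligned reads) -- and clustering each separately
# can partition the cells differently.
rng = np.random.default_rng(0)
n_cells, n_genes = 200, 50

# Simulate two cell groups whose nascent profiles differ ...
nascent = np.vstack([rng.poisson(2.0, (100, n_genes)),
                     rng.poisson(6.0, (100, n_genes))])
# ... but whose mature profiles are shuffled relative to that split,
# a crude stand-in for transcription vs. degradation dynamics.
mature = nascent[rng.permutation(n_cells)] * 3

labels_nascent = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(nascent)
labels_mature = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(mature)

# Fraction of cells on which the two clusterings agree (up to label swap):
agree = (labels_nascent == labels_mature).mean()
agreement = max(agree, 1 - agree)
print(f"nascent/mature cluster agreement: {agreement:.2f}")
```

The point of the sketch is only that two equally valid "transcriptomic" matrices from the same cells need not induce the same partition, which is why Lior argues for a biophysical model relating them.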

JESSE GILLIS: I see. So just to follow up on that a little bit. So typically, I guess people when they look for cross-modality integration in a sense, they're just testing robustness often, right? They're just saying what's kind of the dominant signal across a lot of data. But you're proposing something, it sounds like, much more sophisticated than that, where the differences between nuclear and whole cell RNA-seq would actually be accounted for in the model itself. Do you think we're in range of achieving that or is that more the long-term thing we should aspire to?

LIOR PACHTER: No, that's what I'm proposing. And I think we absolutely have the kinds of measurements we need to achieve that. Yes, I do.

JESSE GILLIS: Great. Fenna?

FENNA KRIENEN: Yeah, thanks. What's interesting about your question to me is that, yeah, we have an assertion of 5,000 clusters, which might be revised with more data. But we've also heard in this session - and I think it's reflected in several of the BICCN papers - about the specificity of a particular gene category, specifically transcription factors. And there was a lot of interest in the Q&A about that early on. It seems like that could be a strong and testable hypothesis. And it has a long and rich history in specifying the definitions of cell types in other systems: if you look at Oliver Hobert's work at Columbia in C. elegans, he would say it's not just transcription factors, it's specifically homeobox genes, and there have to be two or three of them in a co-expression set in order to define a cell type. So I'm just curious whether, with the interest in transcription factors as a class, even to define adult cell types, do we say, "Okay, well, 5,000 clusters, we don't know if those are types, but we have this strong idea about transcription factors"? Are we willing to say we can now predict whether cell types are conserved, or whether they're real or not, based on a strong assumption that the transcription factors have to be shared and that they would then give rise to these other functions and other data modalities that we might measure to look at the consequence of that? I'm just curious whether we are there, at least in that code, which is not just data-driven. It's a strong assumption.
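Fenna's idea of a combinatorial transcription-factor code can be phrased as a simple computational check: binarize TF expression per cluster and ask whether each cluster carries a unique on/off combination. The sketch below uses simulated values; the cluster count, TF count, and the 0.5 threshold are hypothetical choices for illustration only, not a claim about any real TF set.

```python
import numpy as np

# Toy sketch of a "transcription-factor code": binarize mean TF
# expression per cluster and ask whether each cluster carries a
# unique combination of expressed TFs. All numbers are simulated.
rng = np.random.default_rng(4)
n_clusters, n_tfs = 8, 6

# Simulated mean TF expression per cluster (rows = clusters).
mean_expr = rng.random((n_clusters, n_tfs))
code = (mean_expr > 0.5).astype(int)   # on/off TF code per cluster

# If the code hypothesis held, every cluster would have a distinct row.
unique_codes = len({tuple(row) for row in code})
print(f"{unique_codes} distinct TF codes across {n_clusters} clusters")
```

On real data, the interesting question is whether the observed codes are unique and conserved across species, which is exactly the testable prediction Fenna describes.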

JESSE GILLIS: Right. I mean, so I'll certainly let other panelists chime in. And I won't put to you the same question I put to [Sten?], but the way I think about it a bit is, if that were in conflict with some other ad hoc definition, is this simply a useful approximation of the way we think about things, which could be ridiculously useful, so useful, no one will ever think about it another way. But if it was in conflict with some other definition, like conservation or development, which would we ultimately say, if we just knew from on high, someone told us, "Nope, super useful. Mechanistically, it happens, but it's not conserved or it has no developmental origin," or something like that, which would we adhere to? So I guess my own thinking again, I'll let the others chime in, is TFs are so mechanistically important and central that we'd be foolish to ignore them, but it's probably not my own-- if my platonic definition of what a cell type is somehow wrong.

RONG FAN: Thanks, Jesse. This is a very good discussion; I'm enjoying it so much. I want to bring your attention to one of the papers I read - I put it in the chat box - from Gray Camp and Barbara Treutlein. It's a very intriguing, sort of philosophical perspective on what a cell type is. And I very much agree with the standpoint on evolutionary conservation. But on the other hand, Gray Camp's paper also mentions that to decide whether something should be defined as a cell type, you really need to look at perturbation: if you perturb that cell and it stays in that particular type or state, then it can probably be considered a stable cell type. On the other hand, I really think cells are very dynamic and plastic - in immunology, we call that plasticity. And that's not a problem, actually. I'm a physical chemist by training; we often call these thermodynamic states. I think that's totally fine. I don't think that's a problem. Maybe in our communities, we should welcome more of this kind of cell state rather than always trying to find the cell type, right? So maybe supplement our atlas with a lot more cell states that can actually help better delineate the map or atlas. Yeah.

JESSE GILLIS: Right. I guess would you be happy with those being predefined? So one of the things we'd like to have is a deliverable, even out of this discussion maybe. And so if there were an ontology of cell types and cell states, even ad hoc ones that could be attached to various types in combination, is that one useful way of having at least some information to attach to the defined clusters that we see? And I'll put that to Rong specifically since you're just chatting about it, but.

RONG FAN: Yeah. Very quick response. So I think the state means the cells can really change their functional activity or something like that. I don't think we can perturb-- okay, if we are looking at the human brain, maybe organoids, that works. But combining data from different layers of multiomics, you do get some level of dynamic or kinetic information that can, to some degree, infer the states. I see Fabian is next. I think Fabian should have a lot more insights about that. Okay, I will hand it over to you guys.

JESSE GILLIS: Sounds good. Fabian?

FABIAN THEIS: Thanks. If I can go ahead, Rong. I was exactly sort of jumping on that dynamic interpretation. One point - a really naive one - that we have also been discussing for a long time in the Human Cell Atlas, so it's not a brain-specific thing, is that the notion of a cell type obviously is a human-made metaphor, right? It's an abstraction that we need in order to think about this complex system, which does not actually care about what we call it. These cells are dynamic, stochastic things. They're stable for some time; they transition into other ones. And I think it's extremely useful to give them names, and so to build atlases to be able to integrate and then compare things. But it's also obviously dependent on what we choose, right? We talked about evolutionary conservation - which modality to choose. If, in the end, what we call a cluster is what we accept, then it's dependent on the data, the distance, but then maybe also the resolution. Remember, if we do clustering, it matters which resolution we give the algorithm, right? If you keep clustering, it keeps getting more fine-grained. So if you take a very coarse-grained resolution, then we get what Hongkui called, I think, level-one annotation - very high level, maybe immune cells versus neurons or something. But then you zoom in and you keep clustering. I think it's okay to accept that. In the end, if you want to build a map, maybe people call things a bit differently, but you need to know how to map between these. So I think that's maybe crucial in the end. If you want to bring together all the different brain data sets that have been generated in this consortium, maybe it's good to agree on either the same notation or a way to map between them. And then maybe [Sten?] has these, I don't know, 1,500 or whatever types, someone else has something a bit more coarse-grained, and then you can map on a certain level.
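Fabian's point about resolution and cross-resolution mapping can be illustrated with a toy hierarchy: two cuts of the same dendrogram give a coarse and a fine labeling, and because both are cuts of one tree, every fine cluster maps cleanly to exactly one coarse cluster. Everything below - the simulated data, the cluster numbers, the choice of Ward linkage - is an illustrative assumption, not how any actual BICAN taxonomy was built.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy sketch: the same cells clustered at a coarse and a fine
# "resolution" (two cut levels of one dendrogram), plus a mapping
# table between the two levels -- the kind of cross-resolution
# lookup needed to compare differently grained atlases.
rng = np.random.default_rng(1)

# 3 broad groups, each split into 2 subgroups -> up to 6 fine clusters.
centers = np.repeat(np.eye(3) * 10, 2, axis=0) + rng.normal(0, 1.5, (6, 3))
cells = np.vstack([c + rng.normal(0, 0.3, (40, 3)) for c in centers])

tree = linkage(cells, method="ward")
coarse = fcluster(tree, t=3, criterion="maxclust")   # ~classes
fine = fcluster(tree, t=6, criterion="maxclust")     # ~subclasses

# Every fine cluster maps to exactly one coarse cluster, because the
# two labelings are cuts of the same tree.
mapping = {f: set(coarse[fine == f]) for f in np.unique(fine)}
print(mapping)
```

The design point is the one Fabian makes: once labelings are nested (or at least mappable), different groups can annotate at different granularities and still translate between their vocabularies.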

JESSE GILLIS: Right, I think that's an excellent point. I'll give time to our last two hands raised. But just to amplify that, I think the key idea is that we need to agree on definitions, but they are in some sense, or maybe in every sense, arbitrary. So it's more about agreement than anything else, and they're almost not to be argued over, except for their degree of consistency with previous use. If we could just agree on what we all mean when we use certain words, I think that would go a long way towards allowing people to conduct analyses that are meaningful to everyone. Daifeng.

DAIFENG WANG: Yeah, just a quick comment. If we say we're typing something, that means we have a static view of cells. But in reality, cells need to interact with each other, so even the same cell may function differently when it interacts with different cells. So I quite liked Josh's presentation about cell-to-cell communication - to see if we can have additional modality data, or dynamic data, to really match the multimodal or gene expression data to those communications, the cell-to-cell interactions. That might give us a better chance to understand cell functions, or cell types.

JESSE GILLIS: Absolutely. And the last word to Bosiljka?

BOSILJKA TASIC: So I'll try to be quick. First, I'm a big proponent of conventions and unified language, regardless of what that may actually mean in terms of the biophysical world and biophysical definitions. Meaning, we need to use a common language to be able to talk about the same groups of cells. That also means we need to use similar methods to define them, and things like that. So even there, there could be disagreements. But I feel these operational definitions are essential, because we need to be able to talk about things. I think type versus state is one of the things that frequently gets people talking. When we measure transcriptomics, we measure both, and we can decide - again, arbitrarily - what we will call one or the other, because maybe we'll use activity-dependent genes to define states versus types, etc. I think these discussions are important, but I don't think they are initially essential. Initially, a common language is really essential, so that we can refer to groups of cells in the same way. To really distinguish type versus state, we will need detailed biophysical models; we will probably need to test reversibility for states versus types. And to me, that's a long distance away. I'm more for utility - common nomenclature, common names - and trying to get to that. These philosophical things are useful to discuss, but I don't think we will resolve them that quickly.

JESSE GILLIS: I guess the actual last word is going to go to John.

JOHN NGAI: Sorry. Didn't mean to do that to everybody. Yeah, I agree. Look, this is a great conversation. I agree 100% with what Bosiljka just said, so I can cut my comments in half. There's a practical aspect where we do need to be talking the same language so people can work together in a way that doesn't conflict too much. But I put a comment in the chat - I couldn't access the Q&A. I view these clusters as just hypotheses to be tested. I mean, let's be real. Let's not take our data too seriously until we can actually validate it through some independent means. Now, you could seek validation with other omic modalities - that's certainly fair game - or functional validation. One can also look at this whole question of cell type versus state this way: what we might agree upon as a type might reflect some kind of self-stabilizing state, which could be robust to perturbation. So the beauty of all this is, if we can get everybody talking a similar language, maybe different dialects, then people can start asking these questions and designing experiments to test: "Is this thing that I'm calling a cluster - does it actually reflect something that's stable and has a specific function?" So as much as I love all this stuff, I always have to remind myself not to take our data too seriously - obviously, take it seriously, but don't read too much into it until one actually has validation of what the data are predicting. These clusters really are just predictions until proven otherwise, in my view. Very powerful predictions, I might add.

JESSE GILLIS: Right, I think that's a good place to end for this mini session. And certainly, I think something we probably mostly agree on, which is the power of convention, right? That we just need to use the same language at least before we get into the philosophical significance or ground truth aspect. And so I'm turning it over to GC for the second mini session.

GUO-CHENG YUAN: Yeah, so it's been a great discussion. In the next part, we're going to basically extend what we discussed in the first part and focus a little bit more on the caveats and intricacies of data integration, harmonization, and annotation. We have two speakers again, and the first is Professor Nenad Sestan from Yale University. So Nenad, could you take over?

NENAD SESTAN: Yes. Can you hear me?

GUO-CHENG YUAN: Yep, we can.

NENAD SESTAN: Can you see my slides?

NENAD SESTAN: Okay, thank you so much. First of all, thank you, Jesse and Guo-Cheng, for organizing this panel, and all colleagues for stimulating presentations. I will make this a little bit more complicated, because I would like to talk about two topics that I think need to be addressed better in our atlasing efforts. The first is to remind us that the development of the human brain is a very prolonged process, characterized by transient features specific to cell types and species. And in the second part, I'll talk about what is so special about our human brain, and the effort that we and others have made. Basically, we share all cell types, at least compared with our closest relatives, chimpanzees. But the point is that all cell types have changed - even those that are highly evolutionarily conserved. As you probably all know, human brain development is prolonged, and many works over many decades have shown, both histologically and molecularly, that the maturation of some parts of the brain extends all the way to the second half of the third decade. There is a lot of evidence from imaging and histology of the prefrontal cortex that also correlates with the emergence of higher-order cognitive functions, while other parts of the brain, such as the cerebellum, seem to mature faster, including those rooted in motor function, such as parts of the cerebral cortex. So the very important thing is that not all brain regions, and for that reason not all cell types, mature at the same rate; some of them really take decades. So when we talk about development, it is not just prenatal versus adult; it is not just one period of development. And we have to be very careful when we generalize.

The second part of this is that when you look at the transcriptomic differences, work from our lab and from many other colleagues here has shown over the last two decades that if you look at the differences at the level of gene expression, whether in bulk tissue or in cell types, these differences - across regions and across development - are neither constant nor static, nor do they simply increase or decrease with aging. Work from our lab and from many others has shown that the transcriptomic differences are largely transient. They are most prominent during early and mid-fetal development, become less prominent around birth, perinatally, and then go up again in infancy and childhood. So if you find something that is developmentally regulated in infancy, that does not mean it generalizes to prenatal development. And the same is true for macaques, actually, so this is not an artifact or a uniquely human feature. Also, if you look at the evolutionary differences - and there is more and more effort here, including ours - these are also neither constant nor static, nor do they simply increase or decrease with age. They show a very similar cup shape, with this late-fetal transition. And of course, as many of you have mentioned, with single-cell data we can now follow trajectories, the lineages of cell types; we can match what we see during development. But most of those efforts, including ours, are focused either on a set of regions or really on fetal versus adult. Yes, we can learn a lot, and that is already obvious from papers as well as the presentations.

But I want to show one example that we published many years ago - actually over a decade ago - that really exemplifies how hard the problem is. A postdoc in the lab, who now has his own lab, [inaudible], found a unique set of cells in layer 5, present just during mid-fetal development, from 13 weeks to around 23 weeks post-conception. These pyramidal cells expressed nitric oxide synthase 1 (NOS1), a protein known to be expressed in interneurons. To make the story short, it is only expressed in the region that will become the future Broca's speech area, and only during these 13 to 23 weeks post-conception. And it is very small: there are only about 30,000 of these little columns, maybe 300,000 of these cells. We could deduce that even within layer 5 there is a distinct subtype, and this regulation is controlled by FMRP, the gene mutated in fragile X syndrome, a form of intellectual disability. And he was able to show that even this is regulated in a species-specific manner - just to exemplify how hard it is. So in essence, what I would like to say is that when it comes to cell types, regional and evolutionary differences are not constant; they are temporally regulated. We do not have a complete enough picture of this complexity when it comes to development or to cell types, and most of the data now is mid-fetal versus adult, neither of which is well representative. Even in adulthood there are changes across the decades.

And the second point I want to make - something that has already come across and will probably be repeated - is that even shared homologous cell types display human-specific, and for that reason species-specific, changes. Work from many groups has shown that many of these changes may alter transcription as well as physiology and neurotransmission. So, to summarize our work: it was done on the prefrontal cortex in four species - human, chimpanzee, macaque, and marmoset - our closest relative, the chimpanzee, plus Old World and New World monkeys. We defined 113 cell types transcriptomically - hypotheses, as John would say, and I do agree with him, by the way. Only one was unique to humans, a type of microglia. All other cell types have relatively similar proportions and ratios, which I don't think is what is essential for our uniqueness, at least between human and chimpanzee. But all of them show both gain and loss of gene expression on the human lineage. So even cell types that are extremely old and shared by many, many other mammals have changed, not just between rodents and primates but also over the six million years of our evolution. And that is the main difference. Trying to understand how these changes affect even the diversity of cell types and their properties is a separate topic.

And I'll finish with one other example showing why we need multimodal datasets. We observed that a subtype of interneurons in the chimpanzee does not express TH, which is critical for making dopamine, even though the homologous cell type exists in both species. And we further found that in the human, where these cells express both TH and somatostatin, they actually cycle between the ability to make TH - and thus dopamine - and the ability to make somatostatin. That was actually discovered decades ago in Xenopus, and it seems to be a conserved mechanism, but for some reason it only operates in the human lineage when it comes to this subtype of cortical neurons. This cannot be found at the transcriptomic level; basically, we need to look at the proteomic level. So thank you for your attention. I look forward to the questions.

GUO-CHENG YUAN: All right. Thank you for a great talk. So next is Xiaoyin Chen from Allen Institute. Xiaoyin, please take it away.

XIAOYIN CHEN: Thank you. So I'd like to talk about integrating multimodal data from a spatial perspective. I think many of us are already quite familiar with spatial transcriptomic data, which basically allows us to look at gene expression, assign it to space, and see how gene expression is distributed over space. But these spatial techniques actually allow us to do a lot more than that, because space is incredibly convenient as an anchor for assigning multimodal data all in the same cells. For example, my lab does a lot of in situ sequencing to look at gene expression and transcriptomic types of neurons in space, but we can also look at genetic labeling in the same cells, either by directly looking at fluorescence or by using in situ sequencing, and we can also map projections and connectivity of neurons, again in the same cells, by sequencing RNA barcodes. So all these spatial techniques now give us the opportunity to obtain multimodal data at an unprecedented scale, but these data sets also highlight the challenge in understanding such data. So today I'm going to talk about some of the challenges and opportunities that we're thinking about, using two example projects that we're working on.

So in this paper, which was just accepted in principle at Nature, we used in situ sequencing to interrogate gene expression over one hemisphere of the mouse cortex. This allows us to look at how gene expression is distributed over the whole cortex, and we can use this to define brain parcellations by these molecular features. For example, in the paper, we define these cell-type-based modules, which are groups of cortical areas that are more similar in cell type composition. Ideally, you would want to do this across hundreds to thousands of animals to account for individual variation and look for what is conserved within the population. And I think with the techniques we and others are developing, we are now actually approaching that capability. The other thing you can do when you can interrogate many animals is combine this with developmental perturbation. So as part of the same paper, we did the same thing across eight additional animals, four of which had undergone binocular enucleation, because we wanted to look at how peripheral inputs shape cell type development. As expected, we saw a lot of cell type changes in the visual cortex. For example, this yellow cell type, which is quite dominant in the control animals, is largely replaced after enucleation by this green cell type. And if we zoom out to the whole-cortex level, you can see, surprisingly, that there are changes broadly across many cortical areas. Each dot here is a cortical area, and the circle size indicates how many cell types changed in abundance in that area; we can then ask broadly what this pattern means beyond the changes in the visual cortex. So basically, we can establish causal relationships with brain-wide structure by combining this brain-wide approach with perturbation.

We can go beyond gene expression and also link gene expression to projections. This is a collaboration with Paulina and Jeremiah at the Allen Institute for Neural Dynamics, and here we're interested in understanding the organization of projections from the locus coeruleus. We inject a [symbol?] barcode library into the locus coeruleus and sequence these barcodes, which allows us to map the projections at single-cell resolution. At the same time, we do in situ sequencing to read out gene expression, so we can link projections to gene expression and to their spatial organization. Now, this is still very preliminary data, but with some simple clustering, we found two groups of projection patterns: one group of neurons that project anteriorly to the cortex, and a second group that project posteriorly to the medulla and the brainstem. Now, the grouping is sort of arbitrary, so how do we know if it is real? Well, one thing we can do is validate the grouping by looking at spatial organization. You can see here - if we look at where these neuronal somas are - that the anteriorly projecting neurons are mostly in the dorsal part of the LC, whereas the posteriorly projecting neurons are mostly in the ventral part. So you can validate neuronal subpopulations, defined by projections or by gene expression, using their spatial organization.

Now, the ultimate goal here is to link these to gene expression, and here we saw a more complex relationship. We found four transcriptomic types of neurons in the LC, and you can see that all four of them contain both types of projection neurons. This is similar to what we and others have found in the cortex before: transcriptomic types don't really correspond to projection differences at a very fine-grained level. And this really highlights the challenge of identifying the gene regulatory programs that specify projections. Perhaps, going back to what Nenad mentioned about the developmental axis, this makes sense, and if we look at what happens in development, we can probably get more insight into this. So as a summary, I've shown you that spatial data provide all these opportunities for generating multimodal measurements at scale, and this brings a lot of challenges as well. Here are some of the challenges that I talked about, and hopefully we can have some nice discussion from here.

GUO-CHENG YUAN: Great. Thank you very much for sharing your fantastic work. This is really brilliant. Spatial biology is a very exciting new development in this field, and something that I'm personally very interested in. As far as I understand, most of the clustering, going back to cell type definition, mainly uses the gene expression patterns identified from the spatial platforms. But as Xiaoyin beautifully showed, these platforms typically carry a lot of additional information: connectivity, cell morphology, and things like that. So I was wondering, for the people who have been thinking about cell types for a long time, how would you think about incorporating this additional information to help you better understand cell types? Josh?

JOSH HUANG: Yeah. Right. So, Xiaoyin, I think the last piece of data you showed was very interesting. I was thinking about so-called ground-truth biological systems, and my definition of ground truth, ultimately, is connectivity. As we know, connectivity data is not available in most cases until we have the EM, but projection is a good approximation. So for your locus coeruleus neurons, the anterior projection versus the posterior projection, there's almost no doubt that they connect to different targets, right? So by definition, they really should be different types. So I guess my suggestion would be to use that information to guide the clustering: rather than clustering where every gene has an equal vote, emphasize transcription factors. And I would add, from our previous studies based on this kind of information, cell adhesion molecules clearly play a role in connectivity, synaptic transmission, and signaling. So the question is whether one should really use that information to play with, or really shape, the algorithm until the clusters you get actually begin to make sense. That's my comment or suggestion.

GUO-CHENG YUAN: Yeah, that's great. This is brilliant. Can I ask Xiaoyin to briefly respond before moving on?

XIAOYIN CHEN: Sure. Thank you. Yeah, I agree that there's a lot of information in projections, and you can definitely use that as one of the validations for transcriptomics. I think one challenge, especially in looking at projections of neurons, is that people in the field have had around 10 years of looking at transcriptomic data, but we never had access to projection data at this scale until the last few years. So there is a lot of challenge in understanding the diversity in projections: what should we consider the same type? What reflects biological types? And projections are also shaped a lot by development and by activity, as we showed in our paper. So I think these are some of the challenges that we need to solve in understanding how we can use projection data.

JOSH HUANG: My response just very quickly—


JOSH HUANG: I think the problem is that the projection by itself doesn't settle it. Ultimately, you need connectivity. And within projections, there is also a basic scaffold versus activity-dependent plasticity, so we need to recognize what the core feature is. But ultimately, that requires connectivity. That's my opinion.

GUO-CHENG YUAN: Great, excellent. Hongkui, some feedback?
HONGKUI ZENG: Yeah, I want to comment. This is actually a very, very important point, and it's often very confusing to people: which is more important for defining cell types, projection and connectivity, or gene expression, transcriptomics, genomics? There's the genomics view and there's the connectomics view. Will they agree with each other or not? It is a major question. That's why sometimes I don't want to get into the detailed debate, because you need to step back to see the big picture. There will always be disagreement, and there are reasons for that. So I just wanted to say that in our 2021 morphology paper by Peng et al., we looked at 1,700 single-neuron projections across different cell types, really trying to understand the big picture. Again, I think we need to look at this question at a hierarchical level. At the major cell type level, subclasses, there's a high degree of agreement about what a cell type is: major cell types defined transcriptomically have very specific overall projection patterns, and the two are very consistent with each other. However, if you go into a particular type, there's tremendous heterogeneity of projection. We're not even talking about connectivity yet, just axon projection. If cells have different projections, of course they have different connectivity; if they have the same projection, they may not have the same connectivity. So that's yet another level of complexity, as Josh was saying. But there is tremendous diversity and heterogeneity of single-neuron projections, even within the same transcriptomic type.

And we summarized those variations under three different categories. One is region specificity: a cell type can be shared between different regions, say somatosensory cortex and visual cortex, and have completely different projection patterns, even though transcriptomically the cells are similar, at least so far as we can tell. The second is topography, as Josh said. Topography rules the nervous system; everything is based on topography, and position and spatial organization are extremely important. So the same transcriptomic type located in different places will have different projection patterns. The third is randomness, stochasticity. It could be activity dependent or experience dependent: the same population of cells, for whatever reason, will form a subset of projection patterns, possibly due to stochastic processes. So all these things are in play together, and the question is how you sort it out. But this is all at a refined level; at a higher level, projection and transcriptomics agree with each other very well. At the refined level, how do you sort it out? Josh's proposal is a good one: you do supervised clustering to uncover molecules, genes, that correspond to a particular projection pattern, for example topography or regional specificity. That's one approach. Another point is that transcriptomic profiles are inherently multi-dimensional, and that's why we debate how many clusters there are: there are different sets of genes corresponding to different cellular properties. If you use all genes, you can get more and more clusters, because you identify groups of genes that correspond to different properties. You can get PC1, 2, 3, 4, 5.

We actually have a very interesting recent paper from our Patch-seq group, Stacy, Nathan, and others. They showed that different kinds of morpho-electric properties can correspond to different PCs in the transcriptomic data. Of course, those may not overlap with the PCs that you use to separate cells into clusters; sometimes they agree, sometimes they don't. Even if you use PC1 and PC2 to separate cells into clusters, there are still PC3 and PC4 in there that could correspond to a different cellular property. So do you want to keep splitting cells? Whether to use all 5 or 10 PCs is a choice you have to make. But I just wanted to say that at a fine-grained level there is heterogeneity; the modalities may or may not match, but you can always find specific genes that match a particular cellular property.
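The point that different PCs can track different cellular properties can be shown with a toy example. Everything below is synthetic: two independent latent "factors" (a discrete cluster identity and a continuous electrophysiological property) are planted on disjoint gene sets, and PCA recovers them on separate components:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 300, 50

# Factor A (cluster identity) loads on genes 0-9; factor B (a continuous
# property, e.g. an electrophysiological feature) loads on genes 10-19.
factor_a = rng.choice([-1.0, 1.0], n_cells)   # discrete, drives clustering
factor_b = rng.normal(size=n_cells)           # continuous property
X = rng.normal(scale=0.3, size=(n_cells, n_genes))
X[:, :10] += 2.0 * factor_a[:, None]
X[:, 10:20] += 1.0 * factor_b[:, None]

# PCA via SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                                # cells x PCs

# PC1 captures the cluster split; a later PC tracks the continuous property.
r_pc1_cluster = np.corrcoef(scores[:, 0], factor_a)[0, 1]
r_pc2_prop = np.corrcoef(scores[:, 1], factor_b)[0, 1]
print(abs(r_pc1_cluster), abs(r_pc2_prop))
```

If you cluster on PC1 alone, factor B's structure is still sitting in PC2, which is exactly the "do you keep splitting?" choice described above.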

GUO-CHENG YUAN: Wonderful. This is a great discussion; that's exactly what we want. Having this kind of data available really makes this a very exciting time for everybody to explore. Bosiljka, sorry if I mispronounced your name.

BOSILJKA TASIC: No, you're pronouncing it correctly. I have a question for Nenad. This change between phenotypes, this potential reversibility between the somatostatin and TH phenotypes, do you know the timescale? Because that's sort of an example of cell type versus state, or whatever we call it. And again, I think all of these need to be defined with respect to reversibility and timescale.

NENAD SESTAN: Thank you, Bosiljka. Very good question. I was thinking of mentioning that; that's actually why I showed that example, because in five minutes you don't want to raise too many questions, but that was exactly what I was hoping to do. So TH starts to appear postnatally. Basically, in infancy, the first neurons we could see were in the newborn brain, and there are only very few, and it probably takes through adolescence. So this is something that appears in mid to late development, and it was missed in all previous studies because they were focused on prenatal development. The second thing is this: we know that these neurons can make dopamine in vitro, in cell culture; we don't yet know that in vivo. So that's a separate issue. Now, is this a new cell type or not? Because if you use clustering, they do cluster with the chimpanzee cells. There are two subtypes; chimpanzees have both subtypes, and that has been validated by Ed and Trygve and others as well. But in my opinion, and I have no horse in this race, to me it's a new cell type, because even though 99.99% of the genes are the same and it's clearly homologous, these cells can now make dopamine. That's a fundamental shift in a neuron's ability to function, and it would not be picked up by clustering. So, going back to what Josh mentioned, I think that even a minute difference that cannot be picked up at the moment can lead to a new cell type. On the other hand, you can have hundreds and hundreds of genes differentially expressed between, let's say, two species or two cell types, but those changes may not be what drives the functional differences in neurons or glial cells.

BOSILJKA TASIC: Thank you, Nenad. So one thing I maybe misunderstood: this is not in the same species? In one species it's SST and in the other it's TH, or did I miss that?

NENAD SESTAN: So basically it's this: chimpanzees do not express TH, so they don't have TH in the cortex, and nobody knows why. And this population of TH-expressing neurons exists in macaque and humans. So basically it was lost in the common ancestor of humans and chimpanzees and reappeared in humans, within six million years. For some reason it seems to be advantageous, or at least it was [purified?]; we could not find a human that does not have it. But even in humans, it cycles between making TH and somatostatin protein. And in humans it also has DDC, so human cells can make dopamine and monkey cells cannot. These two genes now make these cells, which have clear homologs in chimpanzee and macaque and even marmoset, different, because they can make dopamine. That's a fundamental difference.


GUO-CHENG YUAN: Are there any more comments or follow-ups?

JOHN NGAI: Yeah. I have one if there's time. Is there time?

GUO-CHENG YUAN: Go ahead, John.
JOHN NGAI: Yeah. Thanks. This is a great conversation. Just to quote Yogi Berra, I'm getting déjà vu all over again, on two levels. And forgive me on the first example; it's been a while since I taught this. But, Nenad, your study reminds me of the change in phenotype in sympathetic neurons from adrenergic to cholinergic depending on the target. And we actually know the signal: it's LIF, right? Beautiful work, I think done by Lynn Landmesser and Story Landis way back in the day. Really beautiful stuff. I'm not sure that back then there would have been an argument about whether this is a cell type or a state, but maybe now we have more information with which to argue. I think Josh brings up a great point about connectivity. And Hongkui, I agree with your answer, but I think we need to keep in mind that these are great biological questions, and not all development is deterministic. Not all cues are instructive; there is permissiveness in development, and that's probably the only way you're going to be able to construct a brain with a limited number of transcription factors and other factors. So the déjà vu is that we had a lot of these great discussions in putting together that first MOp paper back in 2021. Given the amount of work that's being done to integrate information from all these different modalities, I'd like to see if we can stay away from claiming that one modality, one feature, defines a cell. I think it's going to be more complex than that, as Hongkui articulated way better than I can. But just to point out: not all cues are instructive. There are permissive cues; there are permissive processes.

JOHN NGAI: And when I looked at the data from those early studies showing that cells that were, quote-unquote, defined transcriptomically had multiple connectivity patterns: as [inaudible] said, that could be stochastic, it could reflect a permissive process, or it could reflect processes that occurred in development and had later disappeared, or that we couldn't detect well in the adult state. So I think we just need to keep an open mind about what the biology is in order to really sift through these data and define meaning in them.

GUO-CHENG YUAN: Fantastic. Giorgio?

FABIAN THEIS: Yeah, thank you. This is a really terrific discussion. I just would like to add one comment, which is that we don't really know the mechanistic details of how function emerges from the brain. We know very high-level functional specifications of neurons, such as excitatory, inhibitory, modulatory, maybe projection versus local, and, at some high level, drivers versus modulators in Sherman's sense, but not very much beyond that about where function emerges from. So I really like what was said by several people. John said that maybe the current classification is just a hypothesis that needs to be tested, and transcriptomics in particular gives us a very powerful tool for experimental testing. But I would add an additional layer of testing, which is computational: testing by construction. If we really understand how a circuit works, we should be able to reproduce it in a complete, detailed model, and the cell types would really be the building blocks of those models. So in many ways, the proof is in the pudding: we'll know that we have the right cell types when, plugging those details into models, we are able to reproduce the same functions, or at least approximate them. And remember that the correspondence might not be one to one with a detailed mechanism, because nature must be redundant. We know from multiple examples that there could be many combinations of genes, of circuitry, even of connectivity, that produce the same function, and certainly many combinations of the input-output functions of neurons and of synaptic signaling.

And I think Josh actually spelled out what the details are for building those models. I think he meant it experimentally, but the very same applies to data-driven computational models. We need at least the potential connectivity (and maybe some of the detail of the connectivity is experience dependent, or works stochastically, as Hongkui mentioned). We need the input-output functions of the neurons, at least the ranges of those functions, and we need the synaptic signaling. We are quite far from having all of those details even for a single circuit. But when you put all of them into a computational model, you can see what kind of function emerges. Almost unavoidably, at the beginning it will not work, and then we'll go back to the drawing board and maybe alter the definitions of cell types in order to get a closer match. So it will be an iterative process, but I would like to add at least the possibility of computational modeling as a very essential tool in the tool set for finally understanding the building blocks of how the brain works.

GUO-CHENG YUAN: That's great. We do have a mini session on computational tools coming up, led by Jesse. But before we wrap up this mini session, I want to throw out a question that to me is very interesting. From the spatial perspective, we see more than just the one cell, right? We also see its environment, and a lot of that presumes that they work together. So what's the next step? How do we move beyond individual cells and see the cell niche, or neighborhood, however people call it? Is there a unified framework that people would like to see for moving forward on that?

JOSH HUANG: I can respond to that, if I may. I think of it as a cell community, but ultimately it's a cell society, right? So to me, again, I'm glad that the EM project is funded. I really think it is very difficult and expensive. But imagine if one had EM-resolution connectivity: a lot of the seeming variability in projection patterns actually may be reduced, because even if axons have somewhat different branches or different angles, they may actually connect to the same cells. Ultimately, it is the synaptic partnership that matters. So if we can define that, although it's very difficult, I really feel it would be huge progress. It is very difficult to say now, but my prediction is that when we have the connectome, just like the Drosophila crowd, we will not debate a lot of these questions endlessly the way we do now, because we don't know the connectivity.

GUO-CHENG YUAN: Thank you. Jim, and then Rong?

WENJIN JIM ZHENG: Yeah, so through the last discussion and this one, I feel like one modality that we have kind of ignored is the literature. We're all talking about different modalities of data, but there is a lot of information in the literature about gene relationships, functions, and so on. If you take the gene expression data, project that onto the literature, and extract all the information related to the genes, we can probably get much more information that might help to better define cell types as well as to identify neuronal connections and things like that.

GUO-CHENG YUAN: Thank you. Rong?

RONG FAN: Yeah. Thanks, GC. I just want to bring attention not just to the neuronal cells that connect through synapses, but also to the non-neuronal cells, right? The glia, in particular microglia, and how they interact with other cell types. That cell-cell interaction is so important. We have the developing brain initiative within this consortium, but we maybe haven't done that much on the aging brain. In that case, how do glial cells age, and how do they produce factors that impact other cell types through paracrine signaling? That's so important, and I don't know how to figure it out, but GC, on the computational side, you are the expert. One thing I feel is a little bit behind in the neuroscience field compared to, for example, immunobiology: when you look at paracrine signaling or cell-cell contact, the ligand-receptor pairs are quite well characterized in immunology, but we don't have that very well articulated or characterized in neuroscience. In parallel, I feel we also don't have good antibodies or a good antibody panel established, because eventually that interaction is not carried out by messenger RNAs, right? It's done by proteins. Can we develop the protein panels and the antibodies to really map out cell-cell interactions in that context?

GUO-CHENG YUAN: Great, thank you. Because of time, we have to move on. So, Jesse, hand it to you.

JESSE GILLIS: Thanks, GC, and thanks to everyone. So our last topic is tools and computational innovation. And our first mini speaker is Bo Wang from here at U of T. Oh, well, I see Josh for sure. Maybe Josh can go first and I'll touch base with Bo by email.

JOSHUA WELCH: He was just on here. I saw a minute ago. Maybe he jumped off. But I can go. Sure.

JESSE GILLIS: Thank you.

JOSHUA WELCH: All right. Hi, everybody. Thanks, Jesse and GC, for organizing this session and inviting me. I want to talk about several ways we can think about innovating on the computational analysis front. These are just some examples of things I've been thinking about, but I'm sure there are lots of other exciting opportunities, and I would love to hear your thoughts as well. The common theme in the three projects I'm going to tell you about is the new opportunities created by having multiple kinds of data at the same time. The first example is thinking about how we can use single-cell co-assays to understand gene regulation. One way we've been approaching this is with a tool we developed called MultiVelo, which extends the RNA velocity framework to incorporate 10x Multiome or related data types where you have gene expression and chromatin accessibility measured in the same cells. One of the basic questions we wanted to answer with this type of data is: where are the cases in which gene expression and chromatin accessibility are actually out of sync, or decoupled? So we developed a differential equation model of the transcription cycle of a gene, with phases where chromatin can be closing while transcription is still occurring, or transcription can shut down while the chromatin for a gene is still open. This model provides a mathematical way of quantifying and identifying where gene expression and chromatin accessibility are out of sync. This is especially relevant in a developmental context, but I think it is also going to be super interesting to look at in adult cells.
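The flavor of such a switch-ODE system can be sketched as follows. These are not MultiVelo's actual equations, parameters, or API; it is a minimal toy in which chromatin accessibility c feeds transcription of unspliced RNA u, which is spliced into s, and the chromatin switch and the transcription switch can flip at different times:

```python
import numpy as np

def simulate(t_close_chromatin, t_stop_txn, T=20.0, dt=0.01):
    """Euler-integrate a toy chromatin/transcription/splicing ODE system:
       dc/dt = alpha_c * k_c - omega_c * c   (chromatin opening/closing)
       du/dt = alpha * c * k_t - beta * u    (transcription scales with c)
       ds/dt = beta * u - gamma * s          (splicing and degradation)
    k_c and k_t are on/off switches that can flip at different times."""
    alpha_c, omega_c = 1.0, 0.5
    alpha, beta, gamma = 2.0, 1.0, 0.5
    c = u = s = 0.0
    traj = []
    for step in range(int(T / dt)):
        t = step * dt
        k_c = 1.0 if t < t_close_chromatin else 0.0   # chromatin switch
        k_t = 1.0 if t < t_stop_txn else 0.0          # transcription switch
        dc = alpha_c * k_c - omega_c * c
        du = alpha * c * k_t - beta * u
        ds = beta * u - gamma * s
        c, u, s = c + dc * dt, u + du * dt, s + ds * dt
        traj.append((t, c, u, s))
    return np.array(traj)

# Decoupled case: chromatin starts closing at t=5 while transcription stays
# "on" until t=10, so unspliced RNA falls between t=5 and t=10 because
# accessibility is dropping even though the transcription switch is still on.
traj = simulate(t_close_chromatin=5.0, t_stop_txn=10.0)
```

Fitting a model like this to per-gene (accessibility, unspliced, spliced) trajectories is what lets one label phases where the two modalities are out of sync.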

Another area where I think there's a lot of room for innovation is emerging data types where you have spatial coordinates and molecular measurements, but also imaging or morphology features; one of the earlier speakers presented some data of this type. Something we've done in this area is develop an approach we call MorphNet, which takes the gene expression profile of a cell and predicts the cell's morphology. We trained it on several different types of data, including MERFISH, where the morphological readout was nuclear morphology as quantified by DAPI, and Patch-seq data, where the morphology is a much richer dendritic and axonal morphology. In both cases, the idea is that we can train a deep generative model using paired data where we have gene expression and morphology for the same cell. Then, at test time, we can take a new gene expression profile, pass it through the encoder into the gene expression latent space, and use a generative adversarial network to generate an example of the morphology of a cell with that gene expression. One thing that's challenging about this relationship is that it's not one-to-one in either direction: in an adult neuron, the gene expression profile at any given point in time doesn't completely specify the morphology, and vice versa. So rather than treating this as an exact prediction problem, we think of it as conditional distribution sampling: the more you know about the molecular profile of a cell, the more you can narrow down its morphology, up to the limits of how far that modality causally specifies morphology. So I think there's a lot of room for innovation in relating these types of data as well.
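The "conditional distribution sampling" idea can be illustrated without any deep learning. The linear-Gaussian model below is a stand-in for a trained generative network; the matrix `W`, the `noise_scale` values, and the two morphology features are all invented for illustration:

```python
import numpy as np

# Toy p(morphology | expression embedding z): mean is a linear function of z,
# plus feature-specific noise standing in for everything z does not determine.
rng = np.random.default_rng(0)
W = np.array([[1.0, 0.0],       # morphology feature 1 is well specified by z
              [0.5, 0.8]])      # feature 2 depends on z only partly
noise_scale = np.array([0.1, 1.0])  # feature 2 is weakly specified

def sample_morphology(z, n_samples=1000):
    """Draw n_samples morphology vectors consistent with one embedding z."""
    mean = z @ W.T
    return mean + rng.normal(scale=noise_scale, size=(n_samples, 2))

z = np.array([1.0, -0.5])       # one expression embedding
samples = sample_morphology(z)

# The same expression profile pins down feature 1 tightly but leaves feature 2
# broadly distributed: the expression-to-morphology map is one-to-many.
spread = samples.std(axis=0)
print(spread)
```

In the real setting the sampler is a GAN conditioned on the expression latent code, but the payoff is the same: the per-feature spread of the samples quantifies how far expression narrows down morphology.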

A last example of innovating with new combinations of data types is using the actual anatomical locations of cell somas to infer something about their anatomical properties. This is a project that was funded as part of the BICCN and that we're now wrapping up and about to publish. In this project, we used our LIGER algorithm to integrate dissociated single-cell profiles, and then used spatial transcriptomics data to map the spatial distribution of each cell type; other groups have pursued parallel initiatives in a similar direction. The key idea we're pursuing is that once we know the spatial distribution of a molecular cell type across the brain, we can use its position within a common coordinate framework to learn something from other anatomical data sets registered in the same coordinate system. For example, we took a data set of brain-wide, cellular-resolution vascular imaging measured across multiple mouse brains and co-registered it in the same coordinate system as our spatially mapped cell types. We then calculated a vascular density score for each cell type and used it to identify cell types with either high or low local vascular density. We were able to create a sort of vascular density Manhattan plot, where we can identify the cell types that rise to the top in terms of local vascular density. And if you make a spatial plot of where those cell types are located relative to the major vessels in the brain, you can see that we're getting something pretty sensible: for example, one cell type in the hypothalamus sits right near the junction of several arteries.
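The vascular density score is conceptually simple: with cells and vessels registered in one coordinate frame, count vessel signal near each cell and average by type. A toy sketch, with all coordinates, type labels, and the radius made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Registered point clouds in a shared coordinate frame (arbitrary units).
vessel_xyz = rng.uniform(0, 100, size=(5000, 3))      # vessel skeleton points
cell_xyz = rng.uniform(0, 100, size=(300, 3))         # spatially mapped cells
cell_type = rng.choice(["A", "B", "C"], size=300)

# Plant a signal: put type "A" cells artificially near vessel points.
near = cell_type == "A"
cell_xyz[near] = (vessel_xyz[rng.choice(len(vessel_xyz), near.sum())]
                  + rng.normal(scale=0.5, size=(near.sum(), 3)))

def local_density(p, points, radius=5.0):
    """Count registered vessel points within `radius` of position p."""
    return np.sum(np.linalg.norm(points - p, axis=1) < radius)

score = np.array([local_density(p, vessel_xyz) for p in cell_xyz])

# Per-type mean density: the values that would go into a "vascular density
# Manhattan plot", with high-scoring types rising to the top.
by_type = {t: score[cell_type == t].mean() for t in ["A", "B", "C"]}
print(by_type)
```

In practice the density would be computed on the registered vascular volume (and checked across individual brains, as described below), but the aggregation by cell type is the same.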

One thing you have to be careful about here, which we thought about extensively, is that you are relying on the accuracy of registration, as well as on the conservation of whatever anatomical data set you're looking at across individuals. In this case, we were careful to show that what we're seeing is recapitulated across the three individual mice for which whole-brain vascular maps are available. So I think there's a huge opportunity for this type of integration, where you use the actual spatial positions of cells as a bridge to many other data types.

JESSE GILLIS: Thanks, Josh. I still don't see Bo here, so I think we'll just move on to the discussion. Bo, you're not here, are you? I'm not missing anything. Okay, moving on to the discussion. There's been a lot of methodological elaboration, and many of us could have guessed what Bo might have talked about, which is an interesting and potentially controversial topic: the sophistication of models. But even before that, there's the question of validating models and validating new tools, which has become really key as methods get more sophisticated. If you think of the wonder of CASP finally being a solved problem, some people might say that really depended on people having ways of measuring that it was solved. So what is the general feeling right now on validation of new methods? Is there a consensus in the field? To put it more casually: if you read a paper and see a method, do you know this is a method you desperately want to try, or is it hard to tell, because everyone thinks their own method looks good and it's hard to tell the difference between methods? So are there any thoughts around validation and comparison? Fabian?

FABIAN THEIS: Thanks for raising that. Initially there was not a problem, because around 10 years ago there was all this beautiful single-cell [inaudible] data and not enough tools to go around. But then it became clear that the problems, from early tools such as pseudo-temporal ordering up to all the current mapping ideas and multimodal integration, became so much fun that a lot of computational biology and machine learning people got interested in them. A year or two ago we counted the single-cell [inaudible] papers, and there were more than 1,000 different algorithms for particular questions. So this is getting to be a bit of a mess. I think that even for doing an up-to-date analysis nowadays, some kind of initial benchmarking effort, or, for an established topic, some way of making an argument for why you chose a particular approach, is necessary. Sometimes one thing fits all; sometimes you need to be data-set specific. But comparing and evaluating your tools is something the machine learning community has been doing for ages, particularly when it's about small increases in some AUC for classification. So I think we need to get more used to that.

In a bigger consortium effort, we've been setting up this Open Problems website, which is like the Kaggle competitions: you can upload your questions and then have an open discussion about how to even measure how good potential solutions are, because, of course, depending on the metric you choose, some things perform well and some don't. That's part of the development, I guess. And then, of course, there are also the typical data sets. We've been running a bunch of competitions at recent NeurIPS conferences on various things, from data integration to temporal trajectories to perturbations, and there's a lot of interest. Those data sets and the questions established there can be a real asset for the community. So I think some of the spatial data sets being generated here, such as this beautiful MERFISH one from Hongkui, could also become an asset for the more machine-learning-minded people to evaluate their tools on and to ask rather standardized questions, which non-experts in that particular biology could then use downstream with those methods. So that's maybe also a trajectory the consortium could take.

JESSE GILLIS: I absolutely agree. I actually think that's a really key point: the existence of this reference data potentially solves this problem, at least in part. This is the data that can be used to evaluate methods, and it's just a question of coming up with consensus standards and consensus metrics, as takes place in any field. Fenna?

FENNA KRIENEN: Yeah, thank you. I totally agree, and I think the effort to make a standardized framework for thinking about different types of problems is really important, as is drawing on the strength of the community to propose evaluations that will give us a benchmarking framework that makes sense. One thing I really want to emphasize is making sure that the evaluations we do, and that we decide on as a community, are based on biologically meaningful tasks: trying to understand what higher-order information or predictions we can make with these models, going beyond basic preprocessing to new biological insights, and what we can really achieve with the huge amounts of data now becoming available to train the more advanced models. One thing we should be cautious about is getting really, really good at training models that perform a task that is not what we're actually interested in biologically.

And then the other thing to keep in mind when establishing these benchmarking frameworks is to ensure that the problems are sufficiently challenging. There are a lot of problems where you might not necessarily need very complex or advanced approaches, and in some cases everything then looks like it's doing the same thing. But if you actually increase the difficulty of the problem, or you get to some of these characteristics of genes or cell states that we would like to explore further as a community, then it will become more clear whether certain models perform well in certain areas versus others. And we can learn from that as well, so that we can take the best of all these approaches and put them together to continue to evolve the methods that we're developing.

JESSE GILLIS: Right. I'd just highlight papers that are now getting old, but when Soneson and Robinson benchmarked differential expression methods, it's pretty striking that the top couple of methods were the Mann–Whitney test, stuff like that, or the t-test was second or something like that. So that highlights both aspects of what you're saying. Is that because the problems just aren't hard enough at that point, or differential expression is such a relatively easy thing that the 20 or 30 more sophisticated methods aren't essential? Or it could be that when the developers validated the methods themselves, they were overfitting to specific data, and it's just not robust. Yeah, so I think those are key points.
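As a toy version of the kind of test that topped that benchmark - not the benchmark's actual code, and with made-up count data - a per-gene Mann–Whitney (Wilcoxon rank-sum) comparison between two groups of cells might look like:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Toy counts: 50 cells per group, 3 genes; gene 0 is up-shifted in group B.
group_a = rng.poisson(lam=[5.0, 5.0, 5.0], size=(50, 3))
group_b = rng.poisson(lam=[15.0, 5.0, 5.0], size=(50, 3))

# One two-sided rank-sum test per gene.
p_values = np.array([
    mannwhitneyu(group_a[:, g], group_b[:, g], alternative="two-sided").pvalue
    for g in range(3)
])
```

With data this clean, the shifted gene comes out highly significant while the null genes do not, which is part of why simple rank tests are hard to beat on easy problems.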

DAIFENG WANG: Yeah, so I'm totally seconding that. We really need validation data to validate our computational machine learning methods. Because, to be honest, right now, much of the data we've got to train our machine learning on is snapshot data. So of course, we can always try to learn some latent representations, but those may be driven by just a particular snapshot. So that's why we really need to get some sort of dynamic data. It's like video versus individual images in computer vision, right? So if possible, if we can get some high-resolution dynamic data, we can use it to validate our machine learning models. And in addition to dynamic data, maybe we can also get some additional mechanistic data, like cell-to-cell communications or interactions. Then we can try to explain some of the latent representations we learned from the snapshot, and give people more mechanistic insight, at least.

JESSE GILLIS: It feels like just defining the problems really is something that is key at this point, but then also establishing a framework for validation. And again, not to harp on CASP, but it's something that I feel has been extremely successful. And the key feature of CASP is that the data actually is true validation data, right? It's not cross validation. It's not testing the reproducibility of our method on this data and seeing whether it looks robust. It really is held-back data that experimentalists generate, and then you have to predict. And I guess, can we aspire to that at all? There are ways of doing it, like in CAFA, which is inspired by CASP and is a functional annotation exercise. The data isn't generated for the teams. It's data that they know will be generated in the next year, regardless. And so I think given the timeline of data generation among this consortium and BICAN, is it possible that we could set out tasks that would serve as validations for methods based on incoming data that we know is going to be multiome data or something like that? Is that something that would be interesting to people, or is that possible? No enthusiasm for that? So we have spatial data, presumably, that's planned to be generated. It just seems like that might be a mechanism for people to validate, or integration data. For example, if we know there's going to be multiome generated later. I guess there's two sides to it. Would methods people be enthusiastic about having their methods validated that way, and would experimentalists--

FABIAN THEIS: How would that be validation? Maybe I'm not getting the point here, Jesse. So you're saying you predict another modality on this unknown region, something like that. And then someone else measures it.

JESSE GILLIS: If you had something like multiome, for example, where you did an integration and you have some claim pre-multiome. I'm hearkening back to something that now has occurred. But let's say, in the pre-multiome data we were integrating ATAC-seq and RNA-seq and we were making claims about that. And nowadays you just do a multiome experiment, and you might consider that a validation of the integrations that people were doing. We saw some work, people were discussing imputation. Again, I think the tasks would have to be defined in a way that creates space for validation. But as an example, imputation, which is kind of key, particularly for development, where presumably you can't sample everything. And then someone is going to sample this area or this time point you haven't previously sampled, and you have inferred some sort of change. So I don't have a clear idea of exactly how, but I think there would have to be some marriage between the tasks that people feel are natural ones and the data that's being generated. And if that doesn't exist, maybe this isn't a good suggestion.

FABIAN THEIS: So I agree with some of your points. So let's say if it's a time series and you sort of drop out a few things, or you go to future time points, that's a very clear question that a generative model should be able to extrapolate to predict. One key problem, of course, of many of our generative models at the moment is that there's no unique true answer. There's always variation stemming from all kinds of sources, which we haven't fully delineated. That could be in many cases technical, but in many cases also sort of location-dependent changes across-- we're not working in a satellite, right? So individual variation, I don't think we have fully mapped out. So even though you then have a prediction for a particular time point, are you within that sort of error boundary that you would want to have or not? I don't think we can fully answer that. I think it's a similar problem for many of the perturbation predictions that are happening, where you CRISPR out a particular thing, and then you see some effect, but it's a complex one. And there's tech [configurations?] superimposing that. So you approximate it. But the scores that we have for that often are R-squareds or mean gene expression differences, something like that, and maybe not the most fine-grained measure for that yet.


GUO-CHENG YUAN: Yeah, so I agree. Some of the tasks are harder than others, and some of them are less well-defined, which makes it even harder. So I guess some of the things that conceptually would be really easy to check, at least, are, A, deconvolution, for example, and, B, cell type mapping, right? These are things that are commonly used in this community. And to my knowledge, there's not really a good sense of which methods work, and how well they really work, in a real setting.
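As a minimal sketch of the deconvolution task just mentioned - assuming a known gene-by-cell-type signature matrix, with all numbers invented for illustration - bulk mixing proportions can be estimated with non-negative least squares:

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical signature matrix: rows are genes, columns are cell types
# (mean expression of each gene in each type). Values are made up.
signatures = np.array([
    [10.0, 0.5],
    [0.5, 8.0],
    [5.0, 5.0],
])

# Simulate a "bulk" sample as an exact 70/30 mixture of the two types.
true_props = np.array([0.7, 0.3])
bulk = signatures @ true_props

# Estimate mixing weights by non-negative least squares, then normalize.
weights, residual = nnls(signatures, bulk)
proportions = weights / weights.sum()
```

On noiseless simulated data like this, any reasonable method recovers the mixture exactly; the validation problem the panel is describing is precisely that real bulk data is far messier than this sketch.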

JESSE GILLIS: Yeah, I think deconvolution, particularly-- if you've been around long enough, you've seen 1,000 deconvolution methods and maybe a quarter of one validation on any of them. So deconvolution feels very popular when you talk to-- I mean, when I talk to wet-lab biologists - I'm not one - it feels like they actually want deconvolution a lot. Single-cell data seems like something that offers a lot of opportunities for new methods. There is more validation in papers now than there used to be, but it feels like people's methods work on the data they generated. There might be some degree of overfitting. Maybe that's a space for actually solving a major problem, which is what methods are robust and generalize, and things like that. So I guess, on a final topic, since we have a minute or two left, and in honor of Bo Wang, who didn't make his talk: how do people feel about more sophisticated methods, like foundation models and things like this? Is this the aspiration for methods development at this time? Or with these methods, in some sense-- with any such method, there's a whole other aspect of the research that's then on explaining the method itself and how it works, which may limit its utility for problems where we really do want concrete features that are associated with different aspects of neurobiology. So I'd be interested in people's thoughts on enthusiasm for highly sophisticated, let's say, AI methods at this time. Fabian, I assume you have something to say on this too.

FABIAN THEIS: It's okay if I can briefly mention that.


FABIAN THEIS: So I think, as in many fields, it's going to be hard to beat deep learning models for complex integrations. It's just the flexibility and the power with the large-scale data that we get. You just won't beat it. We've been running, I think, three times now in NeurIPS competitions. And most of the winning integration methods, if you really go just in terms of scores, were nonlinear, typically deep learning-based ones. So for the more traditional kNN graph-based ones, that's just hard to beat. Having said that, it's going to be harder to interpret. But if you really care about accuracy of prediction, I think that's the way to go, and interpretation methods are catching up. But that's, of course, one side of methods. I'm really looking forward to seeing what maybe you think, Josh-- also, NMF-based ones often perform very well. It's hard to beat the linear model in many cases, right?

JOSHUA WELCH: Yeah, no, I think I agree. Ultimately, the deep learning approaches are going to win. I think there's some real challenges in biological data as opposed to text and language, though, especially in terms of the noise and our ability to understand intuitively when the models are getting it right. So those would be my comments. So I think ultimately it will be very exciting, but there's still a lot of challenges before we get there.

JESSE GILLIS: Great. And I think if you could just-- and then Christina very quickly since we just have one minute.

DAIFENG WANG: Yeah, so before we go to those complicated models, like deep learning or language models, I think at least we can use some prior knowledge, biological knowledge. We know, for example, that some TFs may regulate some genes. We can use that knowledge to regularize our current machine learning models. That might be useful as well. Just a quick comment.

JESSE GILLIS: Thanks. And Fenna?

FENNA KRIENEN: Yeah, I totally agree with everything said. I just want to mention also that ablation studies are a way that you can also improve interpretability. And then, at the end of the day, I think we're a ways away from going straight from one of these models to a clinical trial, right? There's a step in between where we would be testing some of these predictions. And if you think about the astronomical number of possible things that you could test in the lab, this is a way to really hone down and prioritize the highest-yield experiments to do, which is more cost-effective than testing everything in the wet lab, and then hopefully gets us to a more accelerated therapeutic output from these modeling approaches that are integrating large amounts of data.

JESSE GILLIS: Thank you. And with that, this session ends. We won't sum up now. There'll be a sum up tomorrow morning. And I'll just thank everyone for participating and everyone else for listening.

GUO-CHENG YUAN: Yeah, thank you very much.

AMANDA PRICE: Thank you. Thank you so much for such a lively discussion. So now I have the pleasure to introduce our next keynote speaker, Dr. Steve McCarroll. Steve McCarroll is the Flier Professor of Genetics at Harvard Medical School and Director of Genomic Neurobiology at the Broad Institute's Stanley Center for Psychiatric Research. Steve and the scientists in his lab use human genetics, biology, and single cell genomics to understand natural variation in the human brain and the ways in which genes and genetic variations sculpt the brain's functions and vulnerabilities. His lab developed Drop-seq, an early technology enabling high-throughput single cell genomics by combining droplets with molecular barcoding, and is working to uncover the biological nature of genetic influences on schizophrenia, Huntington's disease, and other brain disorders. So like Andrea before me, I will also be setting a timer for 15 minutes and will be monitoring the Q&A box. So take it away, Steve.

STEVE MCCARROLL: Thanks for inviting me to share this work. Can you guys see the slide and hear me speak? Just confirming.

AMANDA PRICE: Yes, I can hear and see.

STEVE MCCARROLL: Okay. One of the more urgent scientific challenges that we hope brain cell atlases will advance is to understand how disparate genetic and environmental effects converge upon the things we care about with respect to brain function and health, including both clinical diagnoses or illnesses and the far wider range of biological variation that's not clinical but that affects our brain's level of function, resilience, moods, and other biological states. The really big thing that we don't know but that is key in connecting genetics to brain health is this: how do hundreds or thousands of genes in one or many cell types come together to accomplish important things for our biology? One step in this project, of course, is to be able to measure as much as possible in as many cellular contexts as possible; all the genes, all the cell types. And technology for doing that has made possible these brain cell atlases that are making taxonomies of all the cell types, molecular parts lists from the genes expressed in each cell type, and more recently, information about the morphologies and locations and connections of these cells. But the parts list is going to be just part of what we need. We also need to understand how these parts are used, including how they are used together. That is, an atlas needs to be a list of parts, yes, but also a repertoire of programs. When we were developing Drop-seq, the analogy we often used was that we wanted to make it possible to analyze the brain less like a smoothie and more like a fruit salad, in which you could learn from and intellectually savor every individual piece of fruit. It was a good metaphor for Drop-seq itself, but it's not at all a good metaphor for how the brain actually works. The brain is, in fact, built upon non-cell autonomous biology. 
Almost everything important that the brain does involves collaboration among multiple kinds of neurons and often glial cells, and our goal in this work was to try to understand this aspect of gene expression. How do cells of diverse types collaborate with one another?

But how do you make such collaborations visible when you're analyzing individual cells? Our approach to this was to use the natural genetic and environmental variation that exists across individual humans in order to observe this system in many different biological states. We wanted to ask, what are the features that always change together? And can such constellations of features involve gene expression changes in more than one cell type? And this is where what's often seen as an obstacle to doing controlled experiments in humans - the fact that we all have different genetics, environments, and life histories - can actually be made into a source of strength. In the human setting, almost all aspects of our biology exhibit quantitative variation, and natural variation makes it possible to measure a system in many contexts and to learn underlying principles about how it works.

The work I'll present today is also described in this pre-print. It involved a whole team, but I especially want to mention Emi Ling, the postdoc who led the project, Jim Nemesh, who contributed many of the computational ideas, and our close collaborator and partner, Sabina Berretta. In this work, we analyzed the dorsolateral prefrontal cortex from 191 people. The approach that we developed was based on latent factor analysis, in which one infers the presence of underlying factors that cause large constellations of very many gene expression measurements to change together. Basically, we ask: if we start with a matrix of about 200 donors and 100,000 or more cell type-specific gene expression measurements in each donor, can we identify a small number of underlying factors that collectively explain most of the variance in all the measurements? The answer is that we can. As much as 30% of all the inter-individual variation in gene expression measurements could be explained by just 10 latent factors. You're likely familiar with the idea that genes in the same cell type, or genes in the same cell, will often be consistently co-regulated. For example, there are modules of genes that a cell recruits together, often because they contribute to a cellular function. But at a tissue level, we also see that many of the strongest latent factors recruit coordinated gene expression changes in multiple cell types. Two cell types that coordinate gene expression particularly strongly are neurons and astrocytes. In fact, we find that gene expression in a person's astrocytes is powerfully predictive of gene expression in their neurons, and vice versa. You can see this in several of these latent factors. I'm going to focus now on just one of these latent factors as a kind of flagship latent factor. It's derived most strongly from gene expression changes in cortical astrocytes and multiple types of cortical neurons.
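The donors-by-measurements factorization described here can be sketched with a generic linear decomposition. This is an illustration on simulated data with invented sizes and noise levels, not the study's actual pipeline (which used linear matrix factorization on real donor data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Simulated donor-by-measurement matrix: 200 donors, 1,000 pseudo
# gene-expression features driven by 10 hidden factors plus noise.
n_donors, n_features, n_factors = 200, 1000, 10
loadings = rng.normal(size=(n_factors, n_features))
scores = rng.normal(size=(n_donors, n_factors))
X = scores @ loadings + rng.normal(scale=5.0, size=(n_donors, n_features))

# Fit 10 components and ask how much inter-donor variance they explain.
pca = PCA(n_components=n_factors).fit(X)
explained = float(pca.explained_variance_ratio_.sum())
```

The point of the sketch is the shape of the question: a handful of latent factors can account for a substantial fraction of the variance across thousands of measurements when those measurements are genuinely co-regulated.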

So what kinds of activities do neurons and astrocytes perform in concert? First here in this plot, I want you to notice from the variation on the y-axis, in which each point is one of these 190 brain donors, that individuals exhibit substantial quantitative variation in the fraction of their neuronal gene expression that they're investing in synaptic components. And that's what's shown here. Here, that's normalized to the median donor, who is set to one, and is the y-axis of all the plots on this slide. So we find that many kinds of astrocyte activities appear to be calibrated to this neuronal investment in synapses. These include the expression by astrocytes of genes that encode neurotransmitter transporters, shown here, as well as astrocytes' expression of their own synaptic adhesion genes, with which their processes adhere to synapses, and astrocytes' expression of genes with roles in cholesterol biosynthesis. Of course, synaptic membranes are some of the most cholesterol-rich structures in the body, and they obtain that cholesterol in trans from astrocytes. So we call this relationship between neuronal gene expression and astrocyte gene expression the synaptic neuron and astrocyte program, or SNAP. And you can see SNAP in this heat map here. This is just a small number of the genes that SNAP recruits in neurons and astrocytes. But basically, you can see that there's one subset that we call SNAP-a that's strongly co-regulated in astrocytes. These genes tend to be more highly expressed in the same donors as a group and less highly expressed in the same donors as a group. And then there's a distinct set of genes with synaptic functions that's co-regulated in neurons that we call SNAP-n.

So latent factors like SNAP are apparent even in variation among biologically normal brains, but we think perhaps they can also give us new kinds of intellectual scaffolds for understanding brain disorders and other changes in our biology. And here, what we've done is we measured SNAP in each of these 200 brain donors. Actually, we measured all these latent factors in these 200 brain donors. And we asked, do any of them associate with schizophrenia? And one of them does. It's one we call-- it's latent factor 4 (LF4), which in fact is SNAP. But SNAP doesn't just associate with schizophrenia. It also associates with age. So SNAP declines as we age. And you see this decline both among controls shown in green and among persons with schizophrenia, shown in purple. So in fact, in schizophrenia, just this whole curve or line shifts downwards. And when you look at the distribution of SNAP expression measurements in an age-adjusted manner, you can see this quite striking difference between levels of SNAP expression in persons with schizophrenia and in controls. So aging and schizophrenia have long been known to have cognitive and neuro-anatomical features in common. These include reductions in executive function and cognitive flexibility and processing speed, as well as reductions in cortical thickness and neuropil and numbers of dendritic spines. And so we think that this SNAP may be pointing toward a way in which these arise from shared molecular and cellular changes.
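An age-adjusted case-control comparison of the kind described here can be sketched as an ordinary least-squares regression. All the data and effect sizes below are simulated for illustration only; they are not the study's measurements:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated SNAP-like scores for 200 donors: a decline with age plus a
# case-control offset. Effect sizes are invented for illustration.
n = 200
age = rng.uniform(20.0, 90.0, size=n)
is_case = rng.integers(0, 2, size=n).astype(float)
snap = 2.0 - 0.02 * age - 0.8 * is_case + rng.normal(scale=0.3, size=n)

# Ordinary least squares: snap ~ intercept + age + diagnosis.
X = np.column_stack([np.ones(n), age, is_case])
coef, *_ = np.linalg.lstsq(X, snap, rcond=None)
age_slope, case_effect = coef[1], coef[2]
```

Fitting age and diagnosis jointly is what lets the "whole curve shifts downwards" pattern be read off as a negative diagnosis coefficient on top of a negative age slope.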

And this is just a little more detail on that. And this is, again, why-- when you make a human brain cell atlas, it's a big question. What human do you use? And you'd actually want to use many humans because any one human can give you a non-representative picture. But here you can see this really quite strong relationship in which our neurons reduce their investment in synaptic components as we age. And in fact, this is true in all types of cortical neurons. This is using the Allen Institute's nomenclature as a scaffold. But you can see all these subtypes of both excitatory and inhibitory neurons exhibit the same property of decline in expression of these genes with advancing age and further decline in schizophrenia patients relative to controls. So we see this in schizophrenia. Also, this is not confined to any one subtype. We see this kind of systematic reduction in all types of cortical neurons. And the corresponding astrocyte activities seem to decline in concert in schizophrenia. You can see that in the way that the purple points are all below and to the left of the green points, or at least are shifted as a group. So just to summarize what I've told you so far, cortical neurons and cortical astrocytes seem to have coordinated gene expression related to synaptic biology in what we call the synaptic neuron-astrocyte program, or SNAP, which is made apparent just by normal variation in the general population. And SNAP declines in both aging and schizophrenia.

So we're very interested in the question of whether SNAP, or also latent factors like this in general, could be biological programs on which diverse genetic effects converge to shape our vulnerability. And in particular, is SNAP a place of convergence for genetic effects in schizophrenia? And specifically, do astrocytes, and not just neurons, shape our vulnerability to schizophrenia? And one reason this is interesting-- it would be easy to think that this was a solved question, because there's a kind of genomics conventional wisdom that neurons, but not glia, are settings for genetic effects in schizophrenia. And this conclusion has been kind of repeated in paper after paper based on evidence of the following: if you take the implicated genes and look for enrichments or concentrations of them, they're concentrated among genes that are strongly expressed by neurons relative to glia, that is to say, kind of neuron identity genes, but not among genes that are strongly expressed in glia relative to neurons, such as astrocyte identity genes. And when we do these conventional analyses with our own single-cell data, we see this exact same thing. There's this concentration of schizophrenia risk in neuronal but not astrocyte identity genes. But SNAP isn't really about a cell's identity. It's not about what a cell is. It's about what a cell does. It's about quantitative dynamic changes in gene expression, constellations of genes that change together in that cell type.

And so what if we were to actually pivot and ask about these human genetic results, not in terms of what cells are all the time, but in terms of their dynamic activities or cellular programs? When we do that, what we see is that the neuronal component of SNAP is also enriched in schizophrenia genes, above and beyond just the enrichment of neuronal genes, but the very strong enrichment that we see is for the astrocyte component of SNAP. And this has really led us to think that astrocytes too are a significant setting for genetic effects upon schizophrenia risk, which was not the assumption with which we had approached this work at the beginning. Two examples of this, two genes that are strongly regulated by SNAP in astrocytes, are Neurexin 1 (NRXN1) and complement component 4 (C4). You can see the incredibly strong relationship between SNAP expression in a donor's astrocytes and expression of NRXN1 in their astrocytes, and the way that NRXN1 expression is reduced in the astrocytes of persons with schizophrenia relative to controls, which is actually not something, interestingly, that we see in their neurons. And similarly, the C4 gene is expressed much more highly in donors who are expressing SNAP at a low level. This is a very strong effect. In fact, it's much stronger than the genetic result that our lab described at the C4 gene several years ago.

So just to summarize about SNAP, what we know and what we think. So what we believe we know is that neurons and astrocytes in SNAP are mutually investing gene expression in synaptic components. What we think is that SNAP may support some aspects of learning and plasticity. What we think we know is that SNAP declines in schizophrenia and aging, and what we think is that SNAP might be a convergence point for multiple kinds of brain pathophysiology. And finally, what we know now is that schizophrenia risk genes are enriched in SNAP's astrocyte activities as well as its neuronal activities. And what we think is that SNAP's efficacy and resilience may contribute to genetic protection and vulnerability in schizophrenia. And of course, it'll take additional kinds of experiments and in other kinds of systems to firmly establish the things on the right.

So these results have inspired many goals for our BICAN project, which involves looking at inter-individual variation across 200 or so people in 50 human brain areas. We want to contribute to the creation of a great parts list, but we really want to go beyond the parts list to work to recognize how cells and their components are working together in each of these brain areas, to the extent that natural variation can help us ascertain that. And we hope that with that, we can provide new kinds of intellectual scaffolds for analyses of human genetics and brain disorders. So there's much more detail about this work in a pre-print that we've posted on bioRxiv. Here's a short snappy link to the pre-print to make it easier to find. And finally, I'd like to thank the interdisciplinary team that contributed to this work, as well as the brain donors and their families who made all of this possible. I should say that this was entirely a study of postmortem brain tissue, archival postmortem brain tissue from the NIH NeuroBioBank. So thank you so much for listening, and I'll be enthusiastic to answer any questions and hear your thoughts.

AMANDA PRICE: Thanks. That was a great talk. So we have a couple of questions coming into the Q&A box. The first one was regarding the latent factors. There's a question about how they are calculated, maybe by PCA, ICA, NMF, or other methods?

STEVE MCCARROLL: Yep. So these were calculated by linear matrix factorization methods. I think there's a really interesting frontier, that I think Josh Welch also talked about in a different context, of going toward more complex forms of machine learning that identify and encode latent factors in other kinds of ways. But everything we did here was with approaches based on matrix factorization.

AMANDA PRICE: And I see another question coming in about distinguishing between the effective genetic association of schizophrenia from the effect of genetic and environment interactions in the disease. Could it be that enrichment in SNAP astrocytes is environmentally regulated?

STEVE MCCARROLL: It almost has to be the case that environment also moves our biology in these ways. And so I think certainly our hope-- although these things are much harder to show, but certainly our hope might be that SNAP is not only a point of convergence for very many genetic effects, but is also a point of convergence for some kinds of environmental effects also. For example, you could imagine that stress or corticosteroids or things like that are SNAP-depressing. All of these kinds of things would take other kinds of evidence to really know, but that's certainly our working hypothesis is that it's also a point of convergence for environmental effects.

AMANDA PRICE: So there are a couple more questions. Maybe we'll take Gustavo Turecki's question, and then perhaps you can answer them in the chat box so we can, just in the interest of time, move on to panel two.

GUSTAVO TURECKI: Yeah. Thank you. That was quite interesting. I was wondering if you had a chance to look at how SNAP changes over age in schizophrenia, because schizophrenia is primarily an early-onset illness that begins early. And whether or not you've seen a more rapid decline starting right away with age or not.

STEVE MCCARROLL: Yeah. It's hard for us to tell, I think, based on this sample whether the slope is greater or less in schizophrenia. It's certainly downward sloping in schizophrenia as well. And also with much more variance. So it definitely seems to be the case that in some patients, there's a much deeper SNAP setback than in others. And interestingly, we find that that's actually correlated with the patient's polygenic risk for schizophrenia. So persons with higher polygenic risk just tend to have a deeper setback of SNAP relative to what you would have expected for their age. This is potentially connected to a very interesting thing that's very well established but which the reasons aren't known, which is that having had schizophrenia during one's lifetime is a huge risk factor for developing dementia later in life. And so it may be that the setback in SNAP that the illness causes sort of puts people kind of much closer to the edge for what then kind of could be caused by follow-on pathology later in life.

AMANDA PRICE: Welcome back. I know that was a short break, but I hope everyone was able to do a bit of stretching. So now we'll just jump right into panel two, which is focusing on challenges in human brain cell data analysis, integration, and annotation. This panel will be led by doctors Aparna Bhaduri and Nelson Johansen. Aparna has been introduced previously, and Nelson is a scientist at the Allen Institute. So with that, take it away, Nelson and Aparna.

NELSON JOHANSEN: Can you hear me now?


NELSON JOHANSEN: Perfect. Okay. So really briefly, this session is focused on understanding the challenges in human brain data analysis, integration, and annotation. And all these are confounded by individual variation in humans and non-human primates, due to disease, environment, as Steve was talking about, and also other factors that we may not fully understand. So a lot of challenges on the computational and also the sampling side of the world. So a few questions we're really interested in addressing are, "How do we quantify sample variation across human individuals due to cases, differences, and conditions? What variables are biological versus technical sources of variation? And how many donors do we need to really build a human brain cell atlas that covers all this variability?" So we have three topics during the session. The first one is going to dive right into understanding the molecular diversity in human and non-human primates, and how we sample that. And Noah Snyder-Mackler is going to give us a short talk on the variation inherent in macaques and his work there.

NOAH SNYDER-MACKLER: Awesome. Thank you so much. Let me share my screen first. Cool. Can everyone see that?


NOAH SNYDER-MACKLER: Awesome. Great. Cool. So I'm Noah Snyder-Mackler. I'm an associate professor at Arizona State University in the School of Life Sciences and the Center for Evolution and Medicine. As you can see from this slide, the majority of my work, particularly with respect to brain atlasing and analyses of the brain and the body, focuses on non-human primates: here, rhesus macaques living in a semi-natural setting. So most of what we know (and this riffs off of a lot of what Steve was just talking about at the molecular level, particularly for these atlases) is drawn from really one or just a handful of individuals. So what can we glean from these N-equals-one or N-equals-a-handful studies? What if we picked the wrong individual? What makes someone or some particular sample a reference, right? And there are some really key questions that remain about the consistency and the homology of genomically identified cell types across populations and across species, which is really essential to understanding the breadth of natural and healthy variation, which itself is going to help inform human disease and pathology, right?

And so what this means is that we really lack an understanding of the population differences, or inter-individual differences, in cell type distribution and gene expression in primate brains, including humans, despite the well-established inter-individual variability in regional activity, morphology, connectivity, and disease. One word that I'm going to mention, and I think it will stick throughout the whole presentation, is heterogeneity: the heterogeneity that we see across individuals, and how we can see it as both a challenge and an opportunity. So we shouldn't be thinking that we need to minimize this variation, because this variation is extremely important for understanding disease etiology, aging-related changes, and so forth. Some examples of heterogeneity, things we know are prevalent in our society across many diseases: there is large inter-individual variability in how we age, and individuals can age at different rates. There's heterogeneity at the level of organs, and even within organs across different cell types. What are the things that pattern those differences? Temporally and demographically, across sex, ethnic background, and the lifespan. Steve mentioned the work that he's been doing within populations, across individuals, during the aging process.

And then, crucially, also environmentally. I think these are all challenges and opportunities, but this last one is particularly challenging, and also a particularly rich opportunity if we're able to capture and understand the variation in individuals' lived experience in their environments, and how that patterns cell composition and cell function and might impact aging and disease in the brain. Now, what I want to focus on here is that when we look at what explains variation in gene regulation, on the left at the level of bulk tissue gene expression across different regions of the primate brain, and on the right, in Nelson's work at the single-cell level across populations of individuals, there is a lot of residual variation that is so far unexplained. So we are trying to understand what this residual variation is, how we can explain it, and what it can tell us about disease etiology and normal brain development. I think a lot of this is tied into what we can think of as the exposome: our lived experiences, ecosystems, lifestyle, social experiences, and physical and chemical exposures. There is a really cool preprint on medRxiv that came out last year showing that the exposome does explain a lot of variation in many diseases, above and beyond genetics, biological sex, and age itself.

Now, these challenges and opportunities come from having to translate some of these really constrained lab environmental models into more ecologically, environmentally, and genetically complex study systems, including non-human primates living both in captivity and in more natural populations, and into humans. We have to make this inherent trade-off, giving up the limited genetic, environmental, and demographic variation that gives us control in our analyses and studies. And I think we're starting to scratch the surface here. One of the questions Nelson put up at the beginning was, how do we identify the appropriate sample size? I think it's really going to depend on the question that we're asking: what are the variables that we're really interested in tackling and understanding, and how do they pattern variation in the brain and in specific cell types? And so, in the service of this, we're starting to do some of these analyses. Steve's UM1 project is really trying to understand inter-individual variation in humans. Ed and Hong Wei's UM1 project, which my team is a part of, is trying to examine inter-individual variation in humans, macaques, and marmosets, and in particular trying to focus on the particular factors that are variable and might pattern gene regulation and cell composition in each of these species, where we can control other components.

So in humans, we're looking particularly at demographics like age and sex. In macaques, where we can more objectively measure some of these environmental factors, like the social environment, we're examining how social environmental factors pattern gene regulation and brain cell composition. And in marmosets, we're able to do controlled cognitive tasks to see how performance is linked to variation in the brain. And then on the right here is a project that Jason Berry, Michael Platt, and I led, which just finished as part of the BICCN, where we generated data to look at aging and sex differences across the lifespan in 55 individuals and 10 different brain regions. So we're starting to uncover this, and I think there are a lot of challenges that we need to discuss, but we should also think of this heterogeneity not as something that we want to remove or reduce, but as an opportunity that we should understand. So I'm just going to end with the questions that Nelson sent to me ahead of time, with some further sub-questions.

How many donors do we need to appropriately capture the variability? Can we, and do we need to, ensure diversity of genotypes? How much do genetics matter in determining some of these traits in human populations, relative to the environment? What environmental and life history factors matter the most? So focus on those that have really clear associations with brain and/or health: what are some of the environmental factors we already know pattern health and well-being, and can we focus on isolating those? And then lastly, how do we accurately annotate inter-individual cell-type landscapes? Do we need some sort of pangenome-like approach to how we think about references, so that a reference is not a single individual or a few individuals, but something broader? And with that, I'm going to just hopefully kick off some of the discussion going forward.

NELSON JOHANSEN: Thank you, Noah. And your last question, about how we actually annotate these inter-individual cell-type landscapes, I think is a really important one, both computationally and in thinking about what drives these differences: disease, genotype, age, all of that. So I guess to kick it off, like Noah said, a discussion on those challenges. As we move from this mouse whole brain atlas, these 5,000 clusters, how do we do the same in the human and non-human primate context, given the factors that we've just been discussing? Happy to open it up to the panelists.

NELSON JOHANSEN: Do you have some thoughts on some of the challenges we'll face in building these human brain atlases using the current approaches that were used for the mouse whole brain? Oh, maybe she isn't here.

APARNA BHADURI: Aviv, please chime in. Thank you.

AVIV REGEV: I'm not assigned to this topic, but since there was such quietness, I would highlight two. The first one is actually the metadata. This is not an animal in an animal facility. To address these issues of exposomes that Noah so nicely articulated, you actually need a lot of information about the humans. Historically, some of the samples that we accrued for these kinds of atlasing studies weren't rich with this kind of information. They were disassociated, for example, from a medical record. They were collected for this purpose or for that. They didn't necessarily come from a cohort that had all of the information. But it is fortunate that in this field there's also a lot of rich medical information, metadata, collected in clinical data over the years for individuals who signed on to these studies. So I would put this as one. The second one, which is much more computational, and I'll stop there, is that in the mouse atlas we again had the benefit of sticking to one particular age, one particular strain, one particular everything. And so the common coordinate framework was a lot more forgiving than it is in the human, where, of course, we need much more sophisticated algorithmic approaches, and not just lab approaches, to let us map between individuals so that we know we're comparing apples to apples. I'll stop here. I just wanted to make sure people get over the initial shyness.

NELSON JOHANSEN: I just want to follow up there. Like Noah showed, there's a huge amount of residual variation that's unaccounted for, in both his work and my own. And that could perhaps be addressed by incorporating more metadata and more information about the donors, to understand what drives that variation. I don't think we want clusters driven by a single donor in an annotation without really great support. Something to think about. Jimmie?
JIMMIE YE: Yeah, great. On that topic about variation, there are two points I want to make. The first is, I think it's important to think about what the actual measurement of variation is. If it's over single-cell data, we know that that's not quite computed correctly yet. And if it's over pseudobulk, then we need to be careful, right? Is it going to be sensitive to how you're defining cell clusters and other upstream operations? And the second, related point is, let's say you have a robust metric in terms of the definition of statistical variance. What is the upper bound that you're trying to hit? Is it actually all variance, or is it the repeatability of the actual experiment? So I would encourage this group to also think about, given the technical issues, how reproducible the results are in terms of cell composition or gene expression profiles. And those are data that you can collect, or results you can compute using data that you already have.
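The pseudobulk option Jimmie mentions, and its sensitivity to upstream cluster definitions, can be made concrete with a small sketch. The function name and toy data here are illustrative, not drawn from any BICAN pipeline:

```python
from collections import defaultdict

def pseudobulk(counts, donors, cell_types):
    """Sum per-cell counts into one profile per (donor, cell type).

    counts: list of per-cell gene-count vectors (lists of ints)
    donors, cell_types: per-cell labels, same length as counts
    Any change to the upstream cluster labels changes these sums,
    which is exactly the sensitivity discussed above.
    """
    n_genes = len(counts[0])
    agg = defaultdict(lambda: [0] * n_genes)
    for vec, donor, ct in zip(counts, donors, cell_types):
        profile = agg[(donor, ct)]
        for g, c in enumerate(vec):
            profile[g] += c
    return dict(agg)

# Toy example: 4 cells, 3 genes, 2 donors, 2 hypothetical cell types.
counts = [[1, 0, 2], [0, 1, 1], [3, 0, 0], [1, 1, 1]]
donors = ["d1", "d1", "d2", "d2"]
cell_types = ["exc", "exc", "exc", "inh"]
pb = pseudobulk(counts, donors, cell_types)
# pb[("d1", "exc")] == [1, 1, 3]
```

Relabeling even one cell (say, the third cell from "exc" to "inh") changes two pseudobulk profiles at once, which is why downstream variance estimates inherit the clustering choices.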

APARNA BHADURI: Nelson, I think you—

NELSON JOHANSEN: So I guess--Hongkui, go ahead.

HONGKUI ZENG: Yeah, so I'm not on the panel, but I'd like to ask a question, especially on the human variability. I know that there's always technical variability as a confound, right? Tissue quality, postmortem interval (PMI), and things like that. So I actually am very interested in Noah's work, your work on macaque or marmoset. But I think to understand the biological variability, it's very important to control the technical variability, right? Maybe that's more achievable in your kind of experiments; you can get much more consistent-quality samples, so you can really assess it. And then a question to Aviv and others, and I think Steve also, beautiful work: how do you normalize against technical variability in order to reveal true biological variability? So, kind of experimental and computational approaches. Yeah.

AVIV REGEV: I'll make two comments. I think experimentally, pooling is quite useful. One of the nice things about working with frozen tissues is that they allow you to pool samples from multiple individuals. We have shown that in the past, and I believe that others have as well. You can pool the samples from multiple individuals. You get rid not of all variability, but you get rid of something, which is nice to get rid of. A lot of variability you handle like this experimentally. You don't fix all problems, but you at least improve on some of them. Computationally, I actually think multiple frameworks, both from the deep learning side and from the more classical explicit modeling side (I think Steve showed the factor models just very recently), and models that combine both of these worlds, are doing an increasingly good job at parsing out and disentangling the different sources of variation. For example, contrastive frameworks are much better at separating the sources of variation from each other, so that it doesn't all become one big bucket called batch, where inside that batch is biology and technical effects and different kinds of biology. You can actually disentangle them into different dimensions. So contrastive frameworks in this particular domain are really, really appealing, and there are others as well, and we see more and more of them. I'm actually very optimistic that we have the right computational tools in hand for that. And again, any metadata, any explicit things you can help your model with, is wonderful, but even when you can't, you can still do a fair bit. As we learn what technical variation tends to look like (it doesn't always look the same way other sources of variation look), you can also use this information in your model to say, after the fact, that's likely technical, it has these characteristics versus those.

NELSON JOHANSEN: Yeah, thinking carefully about how we can incorporate deep learning approaches into the computational clustering algorithms used by the mouse whole brain atlas is an interesting line of thought. How do we take that idea and bring in these tools that allow us to control for the technical and biological variation present in humans? I think there are ways to adapt the same ideas moving forward in these new atlasing efforts. Steve, you had a hand up.

STEVE MCCARROLL: Yeah, so just to amplify something Aviv said: multiplexing and doing experiments with large numbers of samples in each prep makes it very straightforward to recognize all of the technical effects that arose in the lab. It just becomes clear, because those effects are shared across the same 20 people, or whatever level of multiplexing you use. What's harder to recognize is all the perimortem, not just postmortem but perimortem, biological changes that everyone undergoes near the end of life, but not in the same way. And I would say two things about that. One, you have to be prepared to just overwhelm the problem with numbers. If you do this the way you might do a mouse study, five mutant, five wild type, you're going to be lost. But if you overwhelm the problem with numbers, you can start to see very clear relationships to meta-variables like age that are clearly not arising by chance. And then I think you also need to look at it more and more as an analysis problem. You can recognize a lot of technical effects and late-life effects as principal components in the data. You never would if you did 20 samples, but if you do 200, you really start to recognize factors that arise again and again and that can be correlated to things you do know about. So again, with inbred mice, you look at it as an experimental design problem. With humans, I think you lean more into the analysis side and treat it as a data science problem.

JOHN NGAI: Hi. I had a little mini seizure when I thought about the idea of the technical variation being greater than the biological variation, just saying. So Steve, I appreciate your comment about numbers and multiplexing. Aviv, when you talked about pooling, were you talking about pooling samples, or post-hoc pooling afterwards? Because as I'm sure you know, when you pool the samples, you lose any information about the source of the variability between them.

AVIV REGEV: So the idea was to pool actual samples in humans. Because humans are genetically distinct from each other, you can easily identify later on which cell came from which sample.

JOHN NGAI: So yeah, okay. So basically, you're multiplexing.

AVIV REGEV: In animals, you can add barcoding, but in humans, you don't need to.

JOHN NGAI: Got it. Yeah, yeah, great.

AVIV REGEV: The point is that you actually pool the samples when there's still a piece of tissue. So you pool them before you've done anything with them. And that reduces, as I said, just one source of variation.
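The pooling idea works because each donor's genotype acts as a natural barcode. A deliberately simplified sketch of genotype-based assignment is below; real demultiplexing tools (demuxlet-style methods) work from read-level likelihoods over many SNPs, and the error model and names here are assumptions for illustration:

```python
import math

def assign_cell_to_donor(cell_alleles, donor_genotypes, error_rate=0.01):
    """Toy demultiplexing: pick the donor whose genotypes best
    explain the alleles observed in one cell's reads.

    cell_alleles: dict snp_id -> observed allele (0 = ref, 1 = alt)
    donor_genotypes: dict donor -> dict snp_id -> genotype in {0, 1, 2}
    (0/2 homozygous ref/alt, 1 heterozygous).
    """
    best, best_ll = None, float("-inf")
    for donor, geno in donor_genotypes.items():
        ll = 0.0
        for snp, allele in cell_alleles.items():
            # P(observing the alt allele | genotype), with a small error rate.
            p_alt = {0: error_rate, 1: 0.5, 2: 1 - error_rate}[geno[snp]]
            ll += math.log(p_alt if allele == 1 else 1 - p_alt)
        if ll > best_ll:
            best, best_ll = donor, ll
    return best

donors = {
    "d1": {"snp1": 0, "snp2": 2, "snp3": 0},
    "d2": {"snp1": 2, "snp2": 0, "snp3": 2},
}
cell = {"snp1": 1, "snp2": 0, "snp3": 1}
# Alt alleles at snp1/snp3 and ref at snp2 match d2's genotypes.
print(assign_cell_to_donor(cell, donors))  # d2
```

This is also why pooling inbred C57BL/6 mice fails, as John notes below the surface of this exchange: identical genotypes give every donor the same likelihood, so no assignment is possible without added barcodes.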

JOHN NGAI: No, no, that's great. Yeah. Because if you were pooling C57BL/6 mice, that would just be bad. That's great. Got it. Got it.

AVIV REGEV: In mice, you can introduce nuclei multiplexing, and it has the same outcome. And if you work in genetically diverse mice, by the way, you can get another sense of—


AVIV REGEV: My other point was actually computational, the one I made at the very end. In some cases you have a favorable scenario, and in some cases you don't. Our friends who have done sequencing for many, many years will tell you that after a while, you learn to understand the quantitative, statistical nature of your systematic errors. Different sequencing facilities actually have different systematic errors, but you can learn them. And as a result, you can apply them inside your models, even when you don't have a way of explicitly tracking them. That's another benefit that comes from large-scale endeavors like this one. So I thought it was useful to mention.

JOHN NGAI: Right. So as usual, you're way ahead of me. So indeed, the genetic variability in the human population is a feature, not a bug, here; it allows one to pool and not lose information.

AVIV REGEV: Yeah. And there's also little annoying things. When you pool a lot together, you have ambient RNA, you have other things. I don't want to go there. It's just too much.

JOHN NGAI: Right. But I'd like to point out, it's remarkable that when we started out the BICCN way back when, there was this fear that if five groups went and did analysis of just, for example, primary motor cortex, we'd get 10 different answers. And it's remarkable how consistent the data really wound up being. So there's something to be said here. I think we're definitely on the right track, but it will get more complex as we start getting into the drivers of human variation. Thanks.


KIMBERLY SILETTI: Yeah, sort of building on this idea of multiplexing, I was also thinking that it's worth considering that the variability will itself vary across cell types. Different cell types will have different amounts of variability across individuals, which also gives you an opportunity to compare within the same sample: certain cell types will look highly variable compared to others. So that might help provide an angle for looking at this.

NELSON JOHANSEN: Rebecca?

REBECCA HODGE: Yeah, I think one thing that hasn't come up yet in terms of minimizing technical variation is really having a strong understanding of where you're sampling from. Initially, I think that was a challenge that we encountered, Kim, with the study that we did, and also Nelson, with the study that we did as well: not understanding super well, anatomically, where we were sampling, and not necessarily having good matching across donors. And I think that's something we're hoping to improve upon in this era of study: really having a strong understanding of anatomically where we're sampling across different donors. I think that will be critical to minimizing that variation and, hopefully, to understanding more about biological variation.

NELSON JOHANSEN: Yeah, for teasing out technical sampling bias and variation from other factors. Mike?

MIKE HAWRYLYCZ: Yeah, I'd just like to add to what John was saying: with this new package, the whole mouse brain papers, there is indeed a lot of consistency, at least on a superficial reading. I think there will be an immense amount of analysis that we can do on these data sets, across data correspondences, which should provide a very rich set for trying to nail down some of these issues.

APARNA BHADURI: I think that this was a very productive discussion and kind of dovetails nicely into the next topic that we had outlined. So maybe we can continue the discussion on this, but maybe, Nelson, now is a good time for us to have Chunyu Liu present? Great.

CHUNYU LIU: Let me share my screen. So thank you for inviting me to participate in this panel. And actually, I feel that the discussion really set a good platform for me to discuss our work, specifically on the technical noise issue. You can see my screen, right?

NELSON JOHANSEN: We see the presenter view.

CHUNYU LIU: How do I switch to that one?

APARNA BHADURI: So it says show taskbar, display settings, and end slideshow on your screen. Right where your mouse is.

CHUNYU LIU: Oh, okay. That's very weird. Sorry. Let me stop this and try again. I never had this problem before. Okay.


APARNA BHADURI: Yes. Now this looks right.

CHUNYU LIU: Okay. Sorry for the delay. So this is the work of my PhD student Rujia Dai. We focus on evaluating the precision and accuracy of single-cell data. I don't need to say more about how useful and important single-cell RNA-seq data is. The driving questions we are trying to address are: first, does single-cell RNA-seq really deliver good quality data for quantitative analysis of the biological questions of interest, such as differential expression, QTL mapping, or co-expression? Second, should we use expression measured from individual cells or from pseudobulk, meaning pooling all the cells with the same tag for quantification? I think that's a very practical question we deal with all the time, because we see publications using both methods. And in the end, we're interested in whether we can obtain reproducible results from single-cell RNA-seq data. We want to say that quality control is essential for downstream analysis of single-cell RNA-seq data, particularly for two major issues. One directly relates to the high missing rate, a long-standing issue. We calculated the missing rate in 14 studies. Here it shows that if you look at the cell level, the missing rate is very high: on average, 80% of genes are missing in a cell.

And if you do pseudobulk, the missing rate is dramatically reduced, but still around 50%, even for the best data. Then if you look at the number of cells sequenced per cell type per individual, the number is very small. Using one of the best data sets, the BICAN HPS study, you can see that if you zoom into an individual cell type, the highest number of cells sequenced is about 1,000-something; many cell types really have just dozens of cells sequenced, even though a few studies sequenced millions of cells in total for the project. There are already some common quality control parameters in use; I don't need to go into detail, but the important thing is that we did not see any evaluation of precision or accuracy with respect to technical variation in current practice. So we went straight to evaluating precision and accuracy for gene expression measured by single-cell sequencing data. I don't think I need to spend much time explaining what precision and accuracy are. Certainly, some people prefer to call them uncertainty instead of precision, or bias instead of accuracy; whatever the name, you can substitute that word in your head when I talk about precision and accuracy.
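The cell-level versus pseudobulk missing-rate contrast described here can be reproduced on toy data; the counts below are made up purely for illustration:

```python
def missing_rates(counts, donors):
    """Fraction of zero entries per cell vs. in donor-level pseudobulk.

    counts: list of per-cell gene-count vectors (lists of ints)
    donors: per-cell donor labels, same length as counts
    """
    n_genes = len(counts[0])
    per_cell = [vec.count(0) / n_genes for vec in counts]
    # Pseudobulk: sum counts over all cells from the same donor.
    pb = {}
    for vec, d in zip(counts, donors):
        pb.setdefault(d, [0] * n_genes)
        pb[d] = [a + b for a, b in zip(pb[d], vec)]
    per_pb = [v.count(0) / n_genes for v in pb.values()]
    return per_cell, per_pb

# Toy data: 3 cells from one donor, 4 genes, heavy dropout.
counts = [[0, 0, 1, 0], [2, 0, 0, 0], [0, 1, 0, 0]]
donors = ["d1", "d1", "d1"]
cell_rates, pb_rates = missing_rates(counts, donors)
# Each cell misses 3/4 of genes, but the pooled profile only 1/4.
```

Pooling recovers genes that any one cell happened to drop, which is why the pseudobulk missing rate falls (here from 75% to 25%) but never below the set of genes missing from every cell.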

So certainly we are interested in producing data with both high precision and high accuracy. But what the real precision of our single-cell data is, is the question we are trying to answer. We assess precision using the coefficient of variation (CV) by constructing technical replicates from random samples of individual cells: we divide the total set of cells into three groups, construct artificial technical replicates, and calculate the CV. Across the 14 data sets, you can see that almost every data set consistently has high CV values, particularly for the minor cell types. If you zoom into one of the biggest data sets, the BICAN data, you can see the minor cell types really have huge CV values, but when you go to the major cell types, the CV is reasonably good. We use a CV of 0.1 as a threshold; this is the preferred CV value based on classical microarray and bulk RNA sequencing technology, and we hope we can deliver that level of precision in our data. We also noticed that precision increases with the number of cells sequenced. We subsampled cells from all the cells sequenced, and you can see this regardless of which cell type you're looking at; this is a neuron.

As an example, you can see that when you increase the number of cells sequenced, the CV value goes down and flattens out at the end. So that's good news: if you have enough cells sequenced, you can really achieve good precision in measuring gene expression. But unfortunately, for the minor cell types, you will not be able to achieve that precision. We see a good correlation between the number of cells sequenced and the CV value, and certainly the number of cells is one major driving factor. We also noticed that RNA integrity and quality are good predictors of precision: when you have a relatively poor RIN number, you have a bigger CV value, as you would expect. We then further evaluated how many individual samples across the 14 studies have enough genes measured at good precision. In that evaluation, we classified samples based on whether the median CV value was greater than 0.1 or not: if greater than 0.1, we colored them yellow; if less than 0.1, green. You can see that only a few of the major BICAN studies have more than half of their individuals with good quality; many other studies have few samples that can produce enough genes with good CV values.
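The precision procedure described here, splitting one donor's cells into artificial technical replicates and computing a per-gene CV, might be sketched like this. The Poisson simulation is an assumption used only to reproduce the trend of CV falling with cell number, not the actual study data:

```python
import numpy as np

def replicate_cv(counts, n_splits=3, seed=0):
    """Estimate per-gene precision by splitting one donor's cells into
    artificial technical replicates, pseudobulking each split, and
    computing the coefficient of variation (CV = sd / mean) per gene.

    counts: cells-by-genes array for one donor and cell type.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(counts))
    splits = np.array_split(idx, n_splits)
    pb = np.array([counts[s].sum(axis=0) for s in splits], dtype=float)
    mean = pb.mean(axis=0)
    with np.errstate(invalid="ignore"):
        cv = pb.std(axis=0, ddof=1) / mean
    return cv

# More cells per split -> lower CV, matching the trend in the talk.
rng = np.random.default_rng(1)
few = rng.poisson(2.0, size=(30, 100))     # 30 cells, 100 genes
many = rng.poisson(2.0, size=(3000, 100))  # 3,000 cells, 100 genes
cv_few = np.nanmedian(replicate_cv(few))
cv_many = np.nanmedian(replicate_cv(many))
# cv_many falls well below cv_few, and below the 0.1 threshold.
```

For Poisson-like counts the CV of a pseudobulk of n cells scales roughly as 1/sqrt(n), which is why the curves in the talk flatten out: once the sum is large, the remaining CV is dominated by other sources.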

We also looked at accuracy. In this case, we used data from Hagai et al.'s 2019 paper; they generated both single-cell RNA-seq data and pooled RNA-seq data, so the two can be compared directly. We used the pooled-cell sequencing data as ground truth to evaluate accuracy, this time using correlation. The result is that when the number of cells is small, you have lower accuracy across all four data sets they generated; if you have enough cells, you achieve much better accuracy. So that's the main data we have. In summary, we see low precision and poor accuracy when a very small number of cells is sequenced, and it's preferable to use pseudobulk with a large number of cells to ensure precision and accuracy in the measurement of individual gene expression. Going back to the four-quadrant picture, we would say single-cell data sits more in this quadrant: low precision and reasonably good accuracy. So it requires a really large number of cells to be sequenced to reach good quantification of gene expression. But what are the real consequences when you deal with such data? I will talk about the application to differential gene expression analysis on Thursday; that's session three, panel two. Stay tuned. So maybe I should stop here.
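The accuracy evaluation, correlating a pseudobulk built from a given number of cells against a pooled ground-truth profile, can be sketched in the same spirit. The simulated ground truth below is hypothetical and is not the Hagai et al. data:

```python
import numpy as np

def accuracy_vs_bulk(sc_counts, bulk_profile, n_cells):
    """Accuracy proxy from the talk: correlate a pseudobulk built from
    n_cells randomly drawn cells against a matched pooled ('bulk')
    profile treated as ground truth."""
    rng = np.random.default_rng(0)
    chosen = rng.choice(len(sc_counts), size=n_cells, replace=False)
    pb = sc_counts[chosen].sum(axis=0).astype(float)
    return np.corrcoef(pb, bulk_profile)[0, 1]

# Simulated ground truth: 200 genes with a wide range of true means.
rng = np.random.default_rng(2)
gene_means = rng.lognormal(mean=0.0, sigma=1.0, size=200)
sc = rng.poisson(gene_means, size=(5000, 200))
bulk = gene_means  # proportional to the true expression levels
r_small = accuracy_vs_bulk(sc, bulk, 5)
r_large = accuracy_vs_bulk(sc, bulk, 2000)
# r_large exceeds r_small, echoing the cells-vs-accuracy result.
```

The same caveat from the talk applies: this comparison needs a trustworthy pooled profile, which is exactly what is missing for postmortem brain.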

APARNA BHADURI: Thank you so much. That was very interesting and very, I think, insightful. Any additional questions or thoughts from the panel while people are thinking about discussions to build on that? There is a question in the Q&A which asks, "Have you considered examining the precision and accuracy between single-cell RNA sequencing and single-nuclei RNA sequencing data sets?"

CHUNYU LIU: That's a very good question. Actually, the majority of the data we've analyzed are single-nucleus data, because my lab primarily deals with brain disorders, and for brain, single-nucleus is the main approach we use. But the second data set, the one used for the accuracy assessment, is single-cell data, and the conclusion is the same: the CV value is highly related to the number of cells sequenced. But certainly, we cannot do that assessment in brain, because we don't have a good ground truth to compare to.

APARNA BHADURI: Great. Thank you. I'd like to open it up for discussion on anything building off of Chunyu's talk, as well as some of the questions that we had outlined for the second topic, which is disentangling technical and biological variation; I think we started talking about this already and would love additional insight. One of the questions we had outlined, and maybe we can take these one by one, is how to measure cell type variation carefully in the presence of multiple sources of technical variation and donor variability. We've talked about this a little bit in terms of adding metadata and other ways of controlling for it, but are there lessons we can learn from previous large-scale profiling efforts, such as GTEx and ENCODE, as we think about the right QC metrics here? So I will open it up to the group right now. If we could get some of the panelists on screen as well, that would be great.

NELSON JOHANSEN: One consideration is the type of tools that will be developed or used to assess data quality per cell type, as was just shown, and how that can guide sampling. So really, everyone's thoughts on how to approach large-scale human atlasing and the tools that are required.


LIOR PACHTER: Thanks. I suppose I have a question more than a comment for others. We're completing an analysis in my lab where we've done a very simple, boring thing, really, which is just to compare results on the same data as processed by Seurat versus Scanpy. We've looked very carefully at each and every one of the functions, starting with the most basic stuff: you pre-process the data, the boring stuff, all the way to clustering and getting marker genes, the standard vanilla stuff you just do to a data set. And the bad news is that the results are very, very different using these two packages, even when they are ostensibly implementing the same procedure, say Leiden clustering or building a k-nearest-neighbor graph. And I'm trying to think if there's good news, but there wasn't really any good news. So my question is: in BICAN right now, has there been a standardization of the software package used, whether it's Seurat or Scanpy, for doing things like the clustering? I know that the Allen Institute's hierarchical clustering method is separate from those packages, and I like that, but for the other groups, are they using one or the other of the packages? And if so, which one, and why? Thank you.

NELSON JOHANSEN: To comment on that from the Allen Institute side, we definitely agree. We have to be on the same page on how to approach these analyses, because different methods have their own biases. Even internally at the Institute: how do we cluster this human data as we start with each region, operate individually there, and then bring it all together? We've been trying to formalize standards in that space, and they basically extend the work from the mouse whole brain in ways that are both well set up and computationally efficient, which is a whole other can of worms. You talked about Seurat and Scanpy; we're thinking about GPU-accelerated Scanpy, which is yet another can of worms. But I think we just need some way to communicate what we're thinking on these standards, and hopefully all the groups can align if we show that one pipeline seems to work well. And we're trying to carefully compare the pipelines we put together against the mouse whole brain pipelines, to see if we get similar answers in the number of clusters and DE genes, really trying to establish a baseline for what a good analysis framework is to move forward with, to figure out data quality, variability in PMI, and how to handle that computationally.

APARNA BHADURI: Kind of building on that discussion before we get to John, there's a question in the Q&A that asks from an anonymous attendee, "Are there differences in underlying assumptions between methods that need to be examined in the context of these comparisons?" And I don't know if, Lior, you have thoughts on that.

APARNA BHADURI: Or Nelson, also, if you have thoughts on that.

NELSON JOHANSEN: I'll defer to Lior if he wants to comment there first. But yeah, the question of whether Seurat and Scanpy are finding the same things.

LIOR PACHTER: Sorry, what was the question? I didn't quite follow.

NELSON JOHANSEN: Underlying differences in the implementation of methods between Seurat, Scanpy, whatever toolbox you use for RNA-seq analysis.

APARNA BHADURI: What are the underlying assumptions behind those that might be similar?

LIOR PACHTER: It's a long answer. I won't go through it now, but for some of the functions they implement, they have different defaults. That's pretty easy to fix because you can just make them match, although I think most people don't. But there are also functions, for example building the k-nearest neighbor graph, where they rely on different underlying implementations, and the programs they call to do those things produce completely different results. We have a complete table for each and every one of their functions, an enormous table of when they can be made to match and when they can't. So it depends, but it's really not a good situation. I mean, yeah.
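Lior's point that k-nearest-neighbor graphs depend on the underlying implementation can be illustrated with a toy sketch. This is illustrative NumPy only, not Seurat or Scanpy code; the data and the two distance conventions are assumptions chosen to show how neighbor sets can differ even for the "same" kNN step:

```python
import numpy as np

# Toy expression matrix: 8 "cells" x 5 "genes" (pseudocount avoids zero rows).
rng = np.random.default_rng(0)
X = rng.poisson(5, size=(8, 5)).astype(float) + 1.0

def knn_euclidean(X, k):
    # Pairwise Euclidean distances, then the k nearest cells (excluding self).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def knn_cosine(X, k):
    # Same cells, but neighbors ranked by cosine distance instead.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    d = 1.0 - Xn @ Xn.T
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

nn_e = knn_euclidean(X, k=3)
nn_c = knn_cosine(X, k=3)
# Fraction of cells whose 3-neighbor *sets* disagree between the two builds.
disagree = np.mean([set(a) != set(b) for a, b in zip(nn_e, nn_c)])
print(f"cells with differing 3-NN sets: {disagree:.0%}")
```

Since downstream Leiden clustering runs on this graph, any disagreement here propagates to the cluster assignments.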

JOHN NGAI: So, Lior, I have a question for you. When you say completely different results, at what level? We know from these various analyses that there's a hierarchical structure. In the original MOP studies that BICCN did, we spent a lot of time on replicability analysis, and we found that, basically, things were replicable at the very high level. And, not too unexpectedly, it got a little messy as you got further down the trees. And I see Jesse Gillis-- Jesse, if you're around, maybe you can comment, because I know you were one of the main drivers behind the replicability analysis. So, Lior, the question I have for you is, what does falling apart mean?

LIOR PACHTER: That's a great question, John. And I think the answer is that, yes, of course, if you're trying to tell apart GABAergic and glutamatergic neurons, you can do that, and it won't matter whether you ran your cells through Seurat or Scanpy; they'll largely agree. As you go further down, things start to disagree; the predicted marker genes can be very different. And I think the way to think about your question is this. We've done an analysis that asks, okay, let's say we are not really sensitive to the differences; we're looking for results where it doesn't really matter which program you used. Say our question was just to tell apart glutamatergic and GABAergic neurons. Then the amount of data being collected, both in terms of cells and depth of sequencing, is just massive overkill. For almost all of the functions we've looked at, when you're just running Seurat versus Scanpy, if you don't care about those differences, you could sequence 2% of what you're sequencing. So you could have saved 98% of your money relative to what is currently typically sequenced. Same with the number of cells. So it's not that they completely disagree; it's not like Seurat and Scanpy give you entirely different answers. But they disagree to the extent that the amount of data being collected is overkill for the resolution that you actually have.
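A toy numerical version of the depth argument: if the only question is a broad, well-separated split, a small fraction of the cells already answers it. The simulated data and the 2% figure below are illustrative assumptions, not the actual analysis from Lior's lab:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
labels = rng.integers(0, 2, n)                 # two broad "classes"
X = rng.normal(0.0, 1.0, (n, 5))
X[:, 0] += labels * 10.0                       # well separated along one axis

accs = {}
for frac in (1.0, 0.02):                       # all cells vs a 2% subsample
    idx = rng.choice(n, int(n * frac), replace=False)
    Xs, ys = X[idx], labels[idx]
    pred = (Xs[:, 0] > Xs[:, 0].mean()).astype(int)   # crude one-axis split
    accs[frac] = max((pred == ys).mean(), ((1 - pred) == ys).mean())
    print(f"{frac:.0%} of cells: broad-class accuracy {accs[frac]:.3f}")
```

Both runs separate the two broad classes almost perfectly; the extra 98% of cells only matters for resolution the coarse question never uses.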

JOHN NGAI: Sure. Sure. But we did a lot better than just separating glutamatergic from GABAergic, right? So I mean, it's interesting, important question, but also practically, I mean, where does it-- given a certain amount of sequencing, do the two methods agree, say, down to the subclass level? Because what we found through the MOP project is you got down about 25 groups and then very reliably, no matter how you looked at it, you could just walk in, do an experiment, you find the same 25. And then as you get down further, not too unexpectedly, it does get different. I mean, it is a concern that these ostensibly similar approaches are giving different answers. But on the other hand, maybe it's not too surprising at all. There was a question in the chat about, why the heck do you care about having 5,000 cell types if you can't describe function to them? And I think part of the answer that I typed in there was, "Well, you're not going to discover those rare cell types by doing a cursory shallow sequencing." But then your observations raise a question about, well, do you even know what you're looking at when you get down there. So I just want to kind of put a practical lens on the question here.

LIOR PACHTER: Yeah. Great point. I think it's exactly like you said. Methods agree at the high level, that's not really where the interest is. It's in the rare cell types, small numbers of cells. When you're getting down to those levels, yeah, I do think these tools are producing very different results. Yeah.

NELSON JOHANSEN: Do you think this is indicative of having to develop new tools, or is it just being very careful in how you implement the current tools, maybe trying to incorporate biological priors or information like differential expression to guide clustering?

LIOR PACHTER: Yeah. I just think it requires looking at the tools that are being used and just testing them against each other, making sure they produce the same answers. It turns out that Seurat and Scanpy, when they run PCA-- PCA is a mathematical procedure, but they're doing it in a different way that yields quite different answers. You can get them to agree, but they don't in their default. And so they're going to produce different answers. You're going to get different cell types. So it's just about being careful and paying attention to detail.
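One concrete example of "doing PCA in a different way" is whether the matrix is mean-centered before the SVD. The sketch below is plain NumPy on toy counts, an assumption for illustration rather than the packages' actual code paths, but it shows how much that single default can change the leading component:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.poisson(3.0, size=(50, 10)).astype(float)   # toy count matrix

def pca_scores(X, n_pc, center):
    # SVD-based PCA; the ONLY difference between the two calls below
    # is whether the matrix is mean-centered first.
    Xc = X - X.mean(axis=0) if center else X
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_pc].T

Z_centered = pca_scores(X, 2, center=True)
Z_uncentered = pca_scores(X, 2, center=False)
# Compare the leading components up to sign (PCA is sign-ambiguous).
c = abs(np.corrcoef(Z_centered[:, 0], Z_uncentered[:, 0])[0, 1])
print(f"|corr| between the two leading PCs: {c:.2f}")
```

The uncentered leading component mostly tracks overall magnitude rather than structure, so everything built on the embedding, neighbors and clusters included, shifts with it.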

JOHN NGAI: And just to comment, even if you can make them agree, it doesn't mean that it's closer to any ground truth that might be out there. Right?

LIOR PACHTER: I love that you said that, because the question of whether the procedures are even the right ones is a great point. So I agree with you completely, John.

CHUNYU LIU: Yeah, we observe differences when you run two different software packages, possibly because the noise is managed differently; some methods are more sensitive to noise. That's another thing you may want to consider. As I showed in my data, for the minor cell types, when you have only a few cells sequenced, the noise is much, much higher than for the major cell types. So maybe that is one of the causes of the discrepancy in performance. If you control for those, if you do better quality control, maybe the results will converge more.
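Chunyu's point about minor cell types being noisier follows directly from sampling: the standard error of a mean expression estimate shrinks roughly with the square root of the number of cells. A minimal simulation with toy Poisson counts (illustrative only, not his actual data):

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean = 4.0
sds = {}
# Re-estimate one gene's mean expression 500 times for an abundant
# cell type (2000 cells) vs a rare one (20 cells); the spread of the
# estimates is the sampling noise that differs between the two.
for n_cells in (2000, 20):
    ests = [rng.poisson(true_mean, n_cells).mean() for _ in range(500)]
    sds[n_cells] = float(np.std(ests))
    print(f"n={n_cells:5d} cells: sd of estimated mean = {sds[n_cells]:.3f}")
```

The 100-fold difference in cell count translates into roughly a 10-fold difference in noise, which is why pipelines diverge most on the rare types.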


HONGKUI ZENG: Yeah. I just want to speak a little bit on behalf of Zizhen Yao, our main bioinformatician, who did the clustering of the 5,000 clusters of the whole mouse brain. She very quickly developed her own pipeline to do that. Unfortunately, I just checked with her on Teams; she's stuck at the Las Vegas airport right now, so she's not able to attend today. From my conversations with Zizhen, the natural question is, why don't you use the regular Seurat or Scanpy, things like that? Her main concern is really scalability. Lior, as you were saying, we want to do hierarchical clustering; we want to get as much granularity out of it as possible. And we have a humongous data set, right? So she developed this automatic hierarchical process. I think the principle is the same, very similar to Seurat, but she really parallelized it and also made it automatically hierarchical. You can run Seurat on our data as well; you just have to do the hierarchy manually, each iteration by hand, whereas she automated the whole process. I just wanted to explain that a little bit.

But I also would love for people to try your computational methods on our data, because our whole mouse brain data is now very well QCed: very high, very consistent quality. And we now have, let's just say, not really ground truth, but at least a first iteration of the clusters that we have generated. I would love to see people do the clustering independently and see if you come up with the same clusters or not. There are periodic questions about whether we have validated our clusters, right? So I'd say we validated them using MERFISH spatial transcriptomic data. But there could be subtle differences, so we very much welcome people to try their methods on our data, because it's really good; it doesn't have all the variability and technical concerns that people may have. It's a good test ground. Another thing I wanted to say, as Nelson mentioned, is that we are also trying comparisons like this ourselves. For example, in a mouse aging study, we have a particular data set, and we map the new data set into our taxonomy to assign cell type identities based on our 5,000 clusters. Once we narrow things down to a particular region, we take the subset of mapped cells and do an independent Seurat clustering for the manuscript, for analysis purposes. Then we can directly compare, within that smaller data set, the mapping result versus the independent clustering result and see if they agree with each other or not. So far, it looks pretty good. But it's a good point; we'll pay more attention to this and see if there are subtle differences between methods.
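The automated hierarchical approach Hongkui describes can be caricatured as recursive splitting with a stopping rule. The sketch below is a minimal stand-in, plain NumPy 2-means with made-up separation and size thresholds, not Zizhen Yao's actual pipeline:

```python
import numpy as np

def two_means(X, rng, iters=20):
    # Minimal Lloyd's algorithm with k=2 (stand-in for a real clustering step).
    c = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((X[:, None] - c[None]) ** 2).sum(-1), axis=1)
        for k in (0, 1):
            if (lab == k).any():
                c[k] = X[lab == k].mean(0)
    return lab, c

def split_recursively(X, idx, rng, min_size=10, min_sep=3.0, out=None):
    # Try to split one cluster in two; recurse only while the split is
    # well separated and both halves are big enough, else keep a leaf.
    if out is None:
        out = []
    lab, c = two_means(X[idx], rng)
    left, right = idx[lab == 0], idx[lab == 1]
    sep = np.linalg.norm(c[0] - c[1])
    if sep < min_sep or len(left) < min_size or len(right) < min_size:
        out.append(idx)
        return out
    split_recursively(X, left, rng, min_size, min_sep, out)
    split_recursively(X, right, rng, min_size, min_sep, out)
    return out

rng = np.random.default_rng(4)
# Three well-separated toy blobs; the recursion should stop near blob level.
X = np.concatenate([rng.normal(m, 0.3, (50, 2)) for m in (0.0, 5.0, 10.0)])
leaves = split_recursively(X, np.arange(len(X)), rng)
print(f"leaf clusters found: {len(leaves)}")
```

Because each split is independent, the recursion parallelizes naturally across branches, which is the scalability property Hongkui highlights.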

NELSON JOHANSEN: I think we also need to be clear in communicating these comparisons to the whole consortium, because we are doing a lot of that internally now, trying to understand: if you change one part of Zizhen's pipeline for hierarchical clustering, do you still get a similar number of clusters on the Siletti human data set? We swapped out PCA for scVI, so now you have a deep learning latent model deriving your lower dimensions instead of PCA. So we think we can do a better job of communicating those comparisons as we move forward. Aviv?

AVIV REGEV: I'll share a couple of thoughts. The first is that in this robustness to computational techniques, people often pay attention to the methodology but not to the parameters. But if you shift the parameters, you've shifted everything. And also, these are all heuristics; there's no algorithmic solution proven to have this or that property for the vast majority of the things that we do. So none of us should actually be shocked that they come back as something less than the same. And so robustness is one of the criteria that has to be really paid attention to. If something is robust to different ways of doing things, it gives us comfort. It's not a proof of anything, but it does give us comfort. And that reminded me that in the early days of the Human Cell Atlas, in the white paper, there is actually a section about how we would know that a thing was, say, a cell type or a category that we would trust, regardless of what you call a cell type, which is a discussion in its own right. And we actually spent time on this. There was a project like that - Ed might remember it - where we would all get together on calls and try different approaches and assess whether they were giving us the same thing.

It's good to go back to that playbook. It's actually still pretty valid, and it proved itself reasonably well over time, rather than just being locked into the one approach people are using now and feeling that that's simply what we do. So robustness is just one criterion. Another is reproducibility across methods, meaning the ability to recover the same classification under a different style of measurement: for example, not single-cell profiling but MERFISH, as Hongkui pointed out, as well as, in places where you can do it, prospective isolation. Meaning, I've now identified molecular markers, I re-isolate something based on those markers, and I actually get a thing that looks the same way. These are all criteria that are still good practices to use. The second comment I would make is that when you move into disease, all sorts of stuff starts happening. Things don't always map to the same categorization we had before, because new things might be there that didn't exist before, or substantial shifts in the identities of cells can happen under disease or under different kinds of perturbations.

Now, I think in this there is a lot for the brain community to learn from some of the other organ communities that have gone further downstream on the disease side than has been the case for the human brain, mostly because the human brain is so humongous and so varied across its physical scale. There has been some drilling in with disease, in Alzheimer's, for example, but not in a way that encompasses the whole thing. But in lung and in gut, people have been iterating on this for several years now. They found higher reproducibility across diseases than people had expected. There was a lot of worry, like the one John mentioned earlier, that we would go to another cohort of patients and it would all look different. It turns out you go to another cohort of patients, and it does not all look different across different cohorts with the same disease. But there are cells that are there that really were not there before, or that are radically shifted from where they were before. And there is a fair amount of computational work that has been done here. I don't know if Fabian is actually on the call; I believe he was supposed to be. His group, for example, has done a fair amount of work on taking an existing atlas, mapping something new into it, and identifying which parts map well and which parts are actually truly new.
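Aviv's earlier point that parameters matter as much as methodology is easy to demonstrate: the same data run through the same procedure can return different numbers of clusters as a single threshold moves. A toy sketch (the data and thresholds are illustrative assumptions, not any pipeline's defaults):

```python
import numpy as np

rng = np.random.default_rng(6)
# A 1-D "embedding" with a few loosely separated groupings.
x = np.sort(np.concatenate([rng.normal(m, 0.4, 30) for m in (0, 2, 4, 6)]))

def cluster_count(x, gap_threshold):
    # Split the sorted points wherever a gap exceeds the threshold:
    # a crude single-linkage clustering with one tunable parameter.
    return int((np.diff(x) > gap_threshold).sum()) + 1

counts = {t: cluster_count(x, t) for t in (0.5, 1.0, 1.5)}
for t, c in counts.items():
    print(f"gap threshold {t}: {c} clusters")
```

Nothing about the data changed between the three runs; only the parameter did, which is why robustness checks should sweep parameters, not just swap packages.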

APARNA BHADURI: Great points. Jimmie?

JIMMIE YE: Yeah, so just to bring the conversation back a little bit, I think one of the goals here is to compare data across potentially multiple individuals, case-control comparisons. So you could take a slightly different approach: especially now that there's so much reference data, given some new data set, you can assign each cell back to some reference with a statistical measure. That way, you at least have some way of anchoring all the data to the same reference and then doing the statistical comparisons on top of that, rather than rerunning a pre-processing pipeline on all the data and then having to draw boundaries on what's a cell type versus a cell state. I'm not saying this should replace rerunning a Scanpy or Seurat pipeline; I think it complements that approach. But it's another way to potentially get to that quantitative comparison faster, without being sensitive to how the upstream processing is done.
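The reference-anchoring idea Jimmie describes, assigning each new cell to an existing taxonomy with a statistical measure, can be sketched as nearest-centroid mapping with a softmax confidence. The centroids, type names, and confidence formula below are illustrative assumptions, not how MapMyCells or any real mapper actually works:

```python
import numpy as np

# Hypothetical reference: one centroid expression profile per annotated
# cell type (names and values are made up for illustration).
ref_centroids = {"L2/3 IT": np.array([8.0, 1.0, 0.5]),
                 "Sst":     np.array([1.0, 9.0, 0.5]),
                 "Astro":   np.array([0.5, 0.5, 7.0])}
names = list(ref_centroids)
C = np.stack([ref_centroids[n] for n in names])

def map_to_reference(x, C, names, tau=1.0):
    # Nearest-centroid assignment; a softmax over negative distances
    # serves as a crude mapping confidence (an assumed formula).
    d = np.linalg.norm(C - x, axis=1)
    w = np.exp(-d / tau)
    p = w / w.sum()
    best = int(np.argmax(p))
    return names[best], float(p[best])

query = np.array([7.5, 1.2, 0.4])   # a new cell to annotate
label, conf = map_to_reference(query, C, names)
print(f"mapped to {label!r} with confidence {conf:.2f}")
```

Because every study maps against the same fixed reference, downstream case-control statistics compare like with like, regardless of each lab's upstream pipeline choices.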

NELSON JOHANSEN: I think leveraging all the great work on human atlasing from Kim Siletti and on the mouse whole brain, and mapping new data to those references, is going to be really useful. It's also a nice segue into our last topic for this session, which is how we actually integrate multimodal information and come to standards on annotations, so that we can do these kinds of reference comparisons faithfully. And Jeremy Miller is going to give a short talk on a few of those points.

JEREMY MILLER: Can you hear me?


JEREMY MILLER: Okay, great. My actual camera and microphone decided to die in the middle of the day, but I seem to have found another one. Okay, I'll share my screen. And let's see if this works. Can you see a presentation?


JEREMY MILLER: Okay. Yeah, so I'll talk for a few minutes about setting standards for multimodal integration and annotation. I'm putting multimodal in parentheses here because while all of this can apply to any modality, most of what I've worked with so far is single-nucleus RNA-seq; but it's not limited to that. As we've heard throughout the day, there are many different whole-brain analyses in multiple species that have been published, across Science and Nature and all of these publication packages, and in other places. These are all huge and amazing efforts in themselves. But I think a driving question many of us have here is, how do we get from these amazing papers to some sort of integrated, versioned, cross-species, multimodal atlas of brain cell types in the developing and adult mammal? And I think standards are an important way to get there. We also need to get the community to actually use and improve this atlas. That's another topic on its own, but it also requires having these standards to begin with. Some important considerations for this that we have underway, and some of the things I'll talk about, are a set of standards for versioned taxonomies, tools for creating and annotating these taxonomies, a backend for the data, the taxonomies, and the ontologies, and scaling to thousands of cell types. There's a lot of manual curation you can do if you're looking at one brain region, but if you have 5,000 cell types, manual curation becomes a lot more problematic. So having automated ways of doing things is really important.

Some things that are more interesting, maybe as discussion topics, but that will require additional planning are how you integrate anatomy into these taxonomies, how you extend the analysis, which we've already talked about quite a bit, and the standards and tools for cross-project and cross-species integration. Something that will become even more important as we go on is how we integrate the developmental and adult brain analyses. And another topic that is always at the forefront is how we deal with cell type nomenclature. So I'll talk a little bit about some of the things we are doing now. I'll put names at the bottom for the people who are primarily doing the work, and I'm sure I've missed some of you, so sorry in advance if I have. We've been using the Cell Annotation Schema, developed originally in collaboration with HCA and now being extended for BICAN, which is basically a way of keeping track of different kinds of information about cell types at all the different levels of the taxonomy. A lot of it is the kinds of things we would want to keep track of anyway, but the schema sets out, when we say the name of a cluster, this is what we mean; when we put out aliases, this is what we mean; along with ways of tagging the methods we're using, who entered the annotation, and the supporting evidence.

I'm not going to go through all of this, but I think the main point is that there exists a standard that we should all work from, so that we all know we're talking about the same thing. It's also worth pointing out that this is compatible with CZ CELLxGENE, which I know a lot of people like to use. At the Allen Institute, we've also developed the scrattch.taxonomy h5ad format, which stores all the data, metadata, and standards together. Whether it will be able to store data for bigger data sets is another topic, but it allows us to build tools off of a single format. And we're working on making sure it's also consistent with the h5ad formats that CELLxGENE is using, so that there's only one format everyone has to work with. For annotation, I think this is a key component: David Osumi-Sutherland and his group are developing this CCN2 tool, which basically allows you, instead of working from a Google Sheet where people manually type in, "Okay, for cell type 157, this is what we know about it," to type into an editor that automatically edits these kinds of tables while keeping the backend standards like the Cell Annotation Schema in place. That way you don't have some people typing somatostatin as capital SST and other people doing lowercase SST. You can also tie into existing ontologies, so that when we're talking about somatostatin or some other cell type, it ties into the other information from the field.

And that's what this slide is about: people who are studying, for example, chandelier cells, which is the example I have here, whether from electrophysiology or from RNA-seq, in mouse or in human, can tie into the same definition of chandelier cells. These kinds of things also enable the autocomplete and the search that I mentioned. And I think this annotation step, which I'm not going into nearly enough detail about, is the key step where taxonomies can be linked across modalities and species. The important part is that you have to have a way of actually linking them, and much of the math that was just discussed about how you actually do this is an important part; the next part is how you track that information once you've matched things together, and I think this is one way of going about it. A final point is that to make really cool tools, like the Allen Institute's MapMyCells tool, which lets you map your own data into an existing reference taxonomy, and the ABC Atlas for visualizing these reference taxonomies, and any other tool you'd like to make, it's important to have standards. Otherwise, every single time someone makes a new taxonomy, someone has to track people down, figure out what everything means, and try to jam it into the data set.

But if we work from the start with these same standard formats, then it's much easier to make these cool tools so we can actually use the multimodal data sets we're creating. Okay. And I think at the end of the day, the goal is a brain atlas where you can search the atlas and get back whatever information you want, across multiple modalities and species. This can be something that ties into other consortia and other scientific users, and people can contribute back into it in a way that is feasible for something as large as a whole-brain taxonomy. And so I think I'll stop there. I've taken the discussion topics for this session and rearranged them slightly. Nelson and Aparna, you can bring them up when you're ready. To me, a lot of them fall under annotations and multimodal integration, and things about knowledge graphs for linking them together. But I'll let you bring them up in the order that you would like.
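A minimal sketch of what a machine-checkable annotation record in the spirit of the Cell Annotation Schema Jeremy describes might look like. The field names, values, and validation rule here are illustrative assumptions, not the real schema:

```python
import json

# Hypothetical annotation record; every field name below is illustrative.
record = {
    "cell_label": "SST_3",
    "cell_fullname": "Somatostatin-expressing interneuron, cluster 3",
    "aliases": ["Sst chandelier-like"],
    "marker_genes": ["SST", "GAD1"],
    "rationale": "High SST/GAD1; matches a prior cortical taxonomy.",
    "annotator": "jane.doe@example.org",
}

# Fields an annotation editor could require before accepting a record,
# so downstream tools never see half-filled entries.
REQUIRED = {"cell_label", "marker_genes", "rationale"}

def validate(rec):
    # Minimal structural check; a real schema would also check types,
    # controlled vocabularies, and ontology term references.
    missing = REQUIRED - rec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return True

assert validate(record)
print(json.dumps(record, indent=2)[:60] + "...")
```

The point of even this toy version is that enforcing fields like a rationale at entry time captures the annotator's reasoning, the thing Steve and Aviv note is usually lost, in a form tools can later consume.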

NELSON JOHANSEN: Awesome. Thank you, Jeremy. As you pointed out, having common annotations that we can use across multiple groups and efforts allows us to really understand biology in disease, aging, and developmental studies. We need that common nomenclature to work from so we all speak the same language. And a question to the group here is, how do we get to those standards? Is it using the tools that Jeremy suggested, which are all great? And then how do we engage the community and ensure adoption of the annotations and of the tools to utilize them? Steve?

STEVE MCCARROLL: So we really love having these taxonomies available as scaffolds for disease analyses. In fact, almost always one of the first things we do is, in a top-down way, assign our cells to the cells in an existing nomenclature rather than try to reinvent it with some kind of ab initio clustering. So I think they're really useful scaffolds. One thing that would help a lot in getting them used more generally, and in getting non-genomic biologists interested in them, is if we made more effort to connect them to everything else that's out there in the literature about cells. These taxonomies often have very cryptic names or numbers, or they're named after one or two markers. And then when you really dig in with someone who knows that part of the brain, they say, "Aha, chandelier cell," or something like that. And there are whole literatures on many of these cell types, right? But these taxonomies get created sort of in the ether, not connected to that literature, not connected to the street names or club names that these cell types have. And I think these cryptic names are just not going to catch on; there really needs to be some effort to connect what this community is creating to what decades of research have created. And I realize it's not always going to be one-to-one, and it's an effortful thing to do, but I think it will make this much more useful and interesting to the general biological community.

NELSON JOHANSEN: Yeah, to your point: this is a community effort. We need to get the community involved in a way where everyone feels they have a stake in what a cluster is. The Patch-seq teams, the physiology teams, can define how neurons fire for each cluster. The connectomics people can bring in really great information to describe these clusters. But how do we get the community to all engage? What are the tools to allow that? That's the kicker. Aviv?

AVIV REGEV: So first, to follow on a point that Steve just made, the connection to the communities is at three levels. The first is that the people who actually gave things their names often had a very particular reason to call them by that name, but there is no place to capture that. So we need annotation platforms that actually let people capture the reasoning as they give something a name. That record is both the beginning of a mutual community understanding, of the ability to adjudicate across the different names people give to things, and a fantastic resource for algorithms later on. The second is the communities, all those people who preceded us, and partly ourselves, who have written that literature. I'm actually amazed: I've been in this meeting for an hour and a half today, and no one has said "large language model" the whole time. This is where we have a fantastic opportunity, and you can see the methods starting to come out. And there was a mention of knowledge graphs, which is a different type of modeling approach, to actually bake that into the process, into how we think about integration. So rather than thinking, "Well, first we integrate the data, then we give things names, then we look at the literature," it's actually all baked in together. That's where the methodology is now. We shouldn't work with an old methodology; the world has substantially moved.

And then the third point I will make is that the conversation has been very, very focused on, "We have the cells. We want their categories. We have their names." This annotation is useful for a lot of things we do on a category-by-category level, but it's also tremendously helpful for doing a lot of things with the integrated atlas. For example, once you have a partial annotation, not a complete annotation, actually a very partial annotation, you can learn distances that then allow you to query, so that you can come with a new cell, a cell you have, and ask where the others are that look like it. Or you can use this information as you train generative models that take, for example, one data modality and generate a second from it; in the semi-supervised or supervised step used in training, these labels become very essential, very useful. And you can do things like look for multicellular programs. I believe Steve also mentioned those in his talk: when you try not just to categorize the cells relative to one another, but, once you have a categorization, to ask how processes happening in one category of cells relate to processes in another category of cells. I believe Jimmie has worked on problems like that, as has my lab. All of those things depend on having these categorizations and these tools, and the better the tools capture the reasoning process behind the names people gave things, the better off we are when we try to reap these additional benefits.

NELSON JOHANSEN: And just to echo your point there, the large language models is an exciting place to explore taxonomies.

AVIV REGEV: It's already happening.

NELSON JOHANSEN: We may be able to talk to taxonomies in the near future, I would imagine. If you have a language model for each taxonomy, you could actually speak to it and get a coherent response back about a cell type and all the information that describes it. So it's an exciting direction to go. Chunyu?

CHUNYU LIU: Yeah, I think having a tools package to facilitate the labeling of cell classifications is very important. But at the same time, another critical issue is the evidence relating your cells to the cells that define the reference, because there could be some misalignment somewhere that you did not realize, which actually gives you a wrong classification. So how to really ensure that your cell is classified into the correct category, and how to measure that distance, is a very important thing to keep in mind.

NELSON JOHANSEN: Extending your point to the whole group: for the mouse whole brain, there's now the MapMyCells tool, a web portal you can submit your cells to and get a mapping result back along with, I believe, some confidence in that annotation. What are the group's thoughts on utilizing that as we go forward? I think it's an amazing way to get community involvement, but is there anything we could do better? Let's get the panelists' thoughts there.

ED LEIN: First of all, thank you so much for your various comments on that. And I think that we don't really have an ability to say all the things that are going on.

BOSILJKA TASIC: It's very hard to hear you.

APARNA BHADURI: Same here. I can't hear.

ED LEIN: I'll try to come closer. Is that any better?


ED LEIN: Strange. I don't know what's going on. Sorry, this is the best I can do as well. I just wanted to say there are a lot of these things going on. One of the challenges is that we went from one part of the brain we all understood very well, where we could bring together our collective understanding, to the whole brain all of a sudden. So it's going to be a process to annotate this thing. But an important point is that once you've defined these cellular communities or entities, you can start to layer on information, build more and more, and connect to the community. There is a bit of a danger in connecting to the primary literature, because we've tried to do that a few times and found that the literature is wildly inconsistent, so the mapping has to be really accurate. But there are ancillary efforts, like sequencing with productivity and things like that, that begin to build the knowledge base that will also make it easier to connect to the literature. One thing that is a bit missing is a really highly functional community annotation tool. CAP is a step in that direction, but I think not quite there yet. And we have similar challenges here: even within the consortium, we're trying to use our own community to annotate these classifications, and that's a challenge. So if those tools can be developed, I think this will become much more feasible. Having the community map against the current state of the reference is great, and it will get people using it. We need mechanisms for feedback on how well that works, and where it does or doesn't work, so that the community can help update it over time and make it work.

JOHN NGAI: Both literally and figuratively echoing Ed's comments. And also getting back to what Aviv said, I mean, we definitely do want to connect back and forth with the community so that the resources will be used, right? We can't be in some kind of foreign nomenclature space. But on the other hand, I think, Ed, you were alluding to-- I mean, the issue there is that people look at their favorite cell type through a very specific lens, and they're looking at one function that they use to define the cell, because they're only looking at one kind of thing. And there could be other biologically relevant aspects of that cell to begin with. But when you mentioned the words community and annotation in the same breath, it reminded me of the early whole-genome sequencing days of these new model organisms. And I would get calls like, "Hey, we're having an annotation party." Basically, it's a hackathon, bringing in people that have knowledge of different aspects of biological function and applying it to a new organism. Perhaps there's a way of doing that, especially since so much can be done virtually. If we look at the way that proofreading is being done on some of these projects, maybe that's a way to get a less biased, or less lampposty, kind of way of annotating the cell types, if we can somehow manage to bring in a larger number of people to sit down and help us annotate some of these taxonomies. Just a thought off the cuff.

AVIV REGEV: I'll just say that as Ed mentioned, this is something that the platform team would be very happy, I'm sure, to engage with because it's within their mission. And I know they've been talking to all of you.

NELSON JOHANSEN: And we're almost at the end of this session, but I'll open the floor to any of the other panelists who may have comments or questions on multimodal or other annotation standards.

KERI MARTINOWICH: This is actually more of a question for everybody else. A lot of the cell annotations right now have been focused on cortical regions, and a comment was made just now about how this is going to be an iterative process to move to different brain regions. And I'm wondering, as a field, is there a priority list for how to do this? Because it is a big job, and I think the cortex is relatively easy compared to some other brain regions, where you have more complex cell types, where the spatial gradients are harder, and where you're going to have to integrate that spatial information to actually call cells and have that in the annotations. So I just wonder, as a field, is there some kind of priority list for how we map around the brain?

HONGKUI ZENG: Yeah, that's a very good point. For the whole mouse brain, we've already done an initial round of annotation of all our clusters based on the MERFISH data. So it goes much beyond cortex. But of course, the annotation doesn't include a lot of the literature information yet, and the accuracy may still be questionable, because we just didn't have time to do it, and we also don't have the expertise in many areas. Of the seven large neighborhoods that I mentioned this morning, six are neuronal neighborhoods. They are based on the anatomical structures. So I think it's a great idea to prioritize. For example, hypothalamus would be one, midbrain would be another, because they have very, very complex cell types in there. And get experts involved. Another thing we're also trying to do is to integrate-- not just mapping, but also integrate our whole brain data with previously published RNA-seq data, which already has expert annotations, right?

When different labs published their work, they already described where the cell types are. So if we can integrate those datasets together, then we can do label transfer and directly import information from the community as well. That doesn't negate the fact that it may be good to engage the community for annotation; I think John was mentioning that. And maybe this is something that we can discuss in BICAN: how to do this. Do you just put up an online annotation platform and let people input freely? Or do we actually organize some efforts, right? An annotation or hackathon kind of effort. A jamboree. We've talked about a jamboree before. But we need to get an interested community, a particular community that's very interested in working on cell types, to get together. We have to have the community, the experts, interested in doing this.
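[Editor's note: the label transfer mentioned here can be sketched in miniature as follows. This is a toy illustration with synthetic data, not BICAN's actual pipeline; a plain nearest-neighbor classifier stands in for the real mapping methods, and the cell-type names and expression values are invented.]

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Toy "reference": two cell types with distinct mean expression
# over 50 genes, standing in for expert-annotated published data.
n_ref, n_genes = 200, 50
ref_labels = np.repeat(["excitatory", "inhibitory"], n_ref // 2)
ref_expr = rng.normal(0.0, 1.0, (n_ref, n_genes))
ref_expr[ref_labels == "inhibitory"] += 2.0  # shift one type's profile

# Toy "query": newly profiled cells of the same two types, unlabeled.
query_expr = rng.normal(0.0, 1.0, (40, n_genes))
query_expr[20:] += 2.0

# Label transfer: classify each query cell by the majority label of
# its nearest reference cells in the shared expression space.
knn = KNeighborsClassifier(n_neighbors=15).fit(ref_expr, ref_labels)
transferred = knn.predict(query_expr)
```

In practice the two datasets would first be restricted to shared genes and batch-corrected before the neighbor search, which is the hard part that real integration methods address.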

KERI MARTINOWICH: I mean, I think it's clear for the mouse brain that it's there, but I guess the comment was also made that the sheer size of the human brain is overwhelming. You can't do the same thing that's been done with the mouse brain, where you can actually just take a whole brain and get good enough coverage that you're going to capture most of these cell types. And so given the sheer size of the human brain and of individual regions, I guess that was more my question: is there some kind of prioritization of where to go first?

JOHN NGAI: So I can jump in; that's a great point. And in fact, there are efforts to prioritize-- well, it's not a small region, but for example, the basal ganglia, because this is also going to be kind of a demonstration case for the connectivity project, which is the other big project. So we can't take on the entire human brain at once, but the idea, as you're astutely pointing out, is to take on a smaller region of great biological as well as medical relevance, and presumably we're going to get better, or you folks are going to get better, at it as we go along.

ED LEIN: John, you pretty much said it. A lot of work has gone into BICAN in coordinating across the different funded groups to have a joint sampling plan that covers which anatomical structures will actually be analyzed by different projects. And then some of that is being prioritized, sort of front-loaded. With this kind of investment, it would be great if the rest of the community is not replicating what's going on here, but benefiting from it. And so there may be a possibility to get some input on prioritizing other regions, but at the least what could be done is to communicate what the plan is so the community can see what's coming.

NELSON JOHANSEN: All right. Mike, in the last minute for the session. You're muted, Mike.

MIKE HAWRYLYCZ: I think we have to be a little bit more precise about what we mean by annotation. Part of the problem is that it's not clear what information we have about many of these clusters, right? There will be genes associated with them, with potentially greater or lesser degrees of knowledge about them. We don't have a tremendous amount of functional or knockout studies at this depth. So I think we need some more creative approaches toward this annotation program and toward adding information, right? There are certainly things we can do, I think, perhaps using more data-mining techniques, but it's still not obvious that we have the information yet to complete it. That's all.

NELSON JOHANSEN: Well, with community engagement, we can get there. Thanks, everyone, for the great discussion during this session. I'm going to wrap it up here and hand it back off to Amanda.

APARNA BHADURI: Thanks, everyone.

AMANDA PRICE: Thank you, both Nelson and Aparna, for your leadership of this panel and to everyone for your engagement through the end of this day one.