The BRAIN Initiative® Cell Atlas Workshop Day 2: From Single-Cell Genomics to Brain Function and Disorders—Data Integration and Annotation
Transcript
YONG YAO: This is the day two workshop, so I will start the meeting. Before we start session two, on brain cell atlas data infrastructure, the data ecosystem, and data-and-information-to-knowledge pipelines, there are a couple of housekeeping items. The same Zoom webinar link will be used for the beginning of this session too. For the panelists, please stay in the webinar session, even if you have to step away, and stay muted. For the attendees, please use the Q&A function if you have questions for the keynote speakers or panel. The chat will not be monitored, but it's open for attendees and panelists to use for discussion. For any technical issues, please contact Laura.
YONG YAO: And for panel two, there is a breakout showcase of the web resources for brain cell atlases developed by different institutions. We will use a different Zoom meeting link for the breakout rooms, and the showcase demonstrations will start at about 3:15 PM Eastern Time. We will have a break of several minutes to switch from the webinar format to the Zoom meeting, so there's a switch from the webinar link to the meeting link around 3:15 this afternoon. So let's move on to the first item of the day, which is the summary and highlights of day one, to wrap up yesterday's discussion of session one. Let me introduce the first panel presentation, from Jesse Gillis and Dr. Guo-cheng Yuan. Dr. Yuan is a professor in the Department of Genetics and Genomic Sciences at the Icahn School of Medicine at Mount Sinai. Please take it away.
GUO-CHENG YUAN: Thank you for the nice introduction. Let me just share my screen. So this session was moderated by me and Jesse Gillis, who unfortunately won't be here today, so I will represent our whole team. The session had 41 panelists altogether, so there was really a lot of discussion and a lot of people participating. It was really a pleasure to help moderate this session. And I'd really like to thank the note-takers, Alex Pollen and Daifeng Wang, who helped us put together this summary and highlights today. Our session followed the excellent keynote talks in the morning by Hongkui, Thomas, and Aparna.
In Hongkui's talk, she presented the Allen Institute's beautiful work on the whole mouse brain, whereas Thomas and Aparna talked about their work on the developing brain, in humans and primates. Aparna also talked about her recent work on a meta-analysis of human brains. There was really a lot of enthusiasm for those talks. The main theme is that there is a lot of diversity among brain cells, in terms of cell types, where they're located, across species, across time, and so on. These are really important issues that we followed up on in the subsequent discussions.
So in this panel, we broke it up into three subtopics, and for each topic we had two short talks. The first topic focused on cell types. The first short talk in this session was given by Daifeng Wang, who presented his work on machine learning approaches for multimodal data alignment and cross-modality imputation. One of the key focuses of this is to use gene regulatory networks in defining cell types and the need to leverage digital tactic. He then applied these tools to study brain development, disease prediction, and gene prioritization. The second talk in this topic was given by Josh Huang, who presented a working hypothesis he calls the communication-element definition of neuronal types. The communication elements contain basically two main parts: one is the relational properties, which in my understanding are mainly focused on connectivity, and the other is the cell-intrinsic, autonomous properties.
Together, they can be learned through transcriptional signatures and gene regulatory programs as well. Following those talks, we had a lot of discussion, in general very active discussion about how to define cell types. There appear to be many different viewpoints on how to define cell types, including transcriptomic or other omics features, developmental origin, evolutionary conservation, connectivity, and functional activities such as electrophysiology and perturbations. All of these seem to be important, but so far it seems difficult to put everything together in a very coherent manner. At the same time, there is a need to use a common language and an operational definition to refer to the same group of cells.
So in general, I think there is overall agreement on using transcriptomic-based cell types as a consistent, if not perfect, definition of cell types, because the feeling is that if you wait for a perfect definition, then you might wait forever. And I think this is nicely summarized by a quote by Zhuang Nai. Basically, his view is that transcriptionally defined cell types should be considered as a hypothesis to be tested in the future. So these are the general comments. In addition, we also talked a lot about technical difficulties. One of the questions that repeatedly came up is, what should the granularity of the cell type definition be? How many cell types are there? How do you distinguish the different assays? And I think later, Patrick alerted us that there is a difference between the modalities and the assays themselves. For example, a single-cell experiment does not have just one modality, but multiple modalities, such as gene expression or splicing, nucleus versus whole cell, and so on.
And this goes a little bit beyond what people typically talk about in cross-modality integration. And also, what genes do we use to define cell types? Should we use all genes? Should we focus on a subset, like transcription factors, which are known to be very important? How do we do this in a way that's consistent across species and developmental stages? It's very, very complicated. And how do we integrate information across the data modalities? I think a lot of suggestions came up, but it's not clear that we have a really final answer to any of these very difficult questions. So then we move on to the second topic, which is more focused on data integration and annotation.
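The granularity question above maps directly onto concrete analysis choices. As a minimal sketch (assuming Scanpy and its bundled PBMC example data, not BICAN data, with a hypothetical transcription factor list), varying the Leiden resolution and the gene set used for clustering changes how many putative cell types come out:

```python
import scanpy as sc

# Illustrative only: Scanpy's bundled PBMC data stands in for a brain dataset.
adata = sc.datasets.pbmc3k()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
sc.pp.pca(adata)
sc.pp.neighbors(adata)

# Same data, different granularity: the resolution parameter directly controls
# how many clusters (candidate "cell types") are reported.
for res in (0.2, 1.0, 3.0):
    key = f"leiden_res_{res}"
    sc.tl.leiden(adata, resolution=res, key_added=key)
    print(key, adata.obs[key].nunique(), "clusters")

# Restricting to a curated gene set (e.g., transcription factors) is one of the
# alternatives discussed; `tf_genes` is a hypothetical placeholder list.
tf_genes = ["PAX6", "SOX2", "DLX2", "FOXP2"]
adata_tf = adata[:, [g for g in tf_genes if g in adata.var_names]].copy()
```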
And again, we had two short talks. The first one was given by Nenad Sestan, who talked about his own work and highlighted the developing brain in non-human primates. The take-home message that I got from his talk is that it's complicated in two major ways. One is along the developmental axis: in these brains there are a lot of features that are transient, restricted to very specific times, and some of them emerge within a very small window of time. And not all brains mature at the same rate, which cautions against generalizations. He also talked about how cross-species integration is difficult, because sometimes the evolutionary conservation assumption is not entirely correct.
YONG YAO: I see. Could you wrap up in three minutes?
GUO-CHENG YUAN: Oh, okay. I'll speed up. Sorry. So the second talk was from Xiaoyin Chen. He basically talked about his work integrating RNA and connectivity data to build a really big atlas and to study perturbations as well. Okay, so I'll speed up. There was a lot of discussion here as well. First, space is a way to integrate modalities, and there are challenges to integrating development and cross-species studies. One of the things that is quite interesting is that there are multicellular structural units; they're interesting and useful for exploring cell-cell interactions, but they are not directly considered here. And there are some lessons to be learned from other communities, such as cancer and immunology. So for the last topic, we had scheduled two talks but only had one live, from Josh Welch.
So basically, he talked about his work developing machine learning tools to integrate modalities, such as MultiVelo and MorphNet, both published on bioRxiv. And then the second one was Bo Wang, who's not present here, but in short, the highlight is that he developed a tool called scGPT, which is a foundation model for single-cell multi-omics. There was also a lot of discussion on this part. One of the things we really highlighted is the difficulty of benchmarking, and how to evaluate properly to avoid overfitting and over-interpretation. And one of the things brought up by a number of the panelists at the end is how we set up a kind of test that serves as the foundation for validating models on these data, and what the metrics should be. And that's the end of the summary. Thank you.
YONG YAO: So any questions? There's time for one question. If not, then let's move on to the second panel, moderated by Nelson Johansen and Aparna Bhaduri. Nelson is a scientist from Allen Institute and Aparna is from UCLA. So please take it away.
NELSON JOHANSEN: Thanks, Yong. Can you see my PowerPoint?
YONG YAO: Yes.
NELSON JOHANSEN: And full screen now?
APARNA BHADURI: Yes.
NELSON JOHANSEN: Awesome. Thank you. Yeah, so Aparna and I are going to go through the summary for this session, which was mostly focused on challenges in the analysis of brain cell data, integration, and annotation. A whole bunch of great note-takers helped with this summary. This session followed Steve McCarroll's great talk discussing how human brains actually vary as a function of disease and other biological sources, which really set up this topic well: understanding how we can tackle human brain atlasing with so much variation at the human and non-human primate level. For this session, we had three topics. The first one was aimed at understanding how to sample diversity in humans and non-human primates. We had a really great short presentation by Noah Snyder-Mackler, talking about his work in macaques and how that relates both to the variation seen in those non-human primates and to human biological variation at the single-cell level.
Topic two was focused on the computational efforts and disentangling technical and biological variation. And Chunyu gave a really nice talk on discussing how to actually measure accuracy and precision of RNA-Seq technologies and a really great discussion around how to kind of do proper quality control for atlasing efforts. And finally, topic three was really focused on how do we actually get the community involved with multimodal integration and annotation standards. And Jeremy Miller gave a really nice talk on all the efforts going on to help bring the community together to build these whole human brain, non-human primate brain atlases. So jumping right into topic one really quick. So a summary of this topic really is just trying to figure out how to address variability in brain studies between biological and technical sources.
So there was some kind of bleed through in the topics, but they're all really great discussions. Some of the big points to come out of this was, really, data management analysis. So importance of keeping track of careful metadata for donors. Noah's talk for this topic discussed that there's a lot of residual variation or unexplained variation in current studies of human and non-human primates. So being really careful with metadata is important. And also developing computational tools that allow us to perform analysis with these biological and technical sources of variation. That was an important topic for discussion. On that note, standardizing what tools we use, Scanpy versus Seurat, and how that results in differences. And of course, biological relevance. What's really important? What can we glean from all the great diversity from these cohorts? And what is actually an artifact that's maybe less interesting? A point brought up by a few folks is actually trying to handle the genetic diversity and making sure that we sample enough donors to achieve really strong signal from our cell types. And finally, a nice discussion on making sure that we have enough donors to properly sample and build these atlases in the face of technical and biological variation. Aparna is going to summarize topic two.
APARNA BHADURI: Thanks for that, Nelson. And I apologize. I think I switched one thing on your slide. But the conversations were intermingling. So we've got it all covered here. The central challenge of topic two was establishing a paradigm to disentangle some of the technical and biological variation in diverse sampling of non-human primates and human populations. Essentially, we have diverse populations that these additional atlas efforts are going to be sampling. And how do we really understand how we can capture that variation without introducing technical variation? And I think that Chunyu's talk was an excellent intro into this in terms of thinking about how we really can measure both precision and accuracy. And one of the places where this was really important is thinking about the essential nature of quality control and that we need to have robust QC for downstream analyses because these technical challenges will otherwise overwhelm some of the biological variation that we're looking to find. And specifically highlighting that low precision and poor accuracy are what we observe when there's limited cells from a population. So we do need some scale in terms of the sampling.
This was on the experimental approaches, with a conversation that kind of bled between topics one and two and was something that Aviv and others were really commenting on, which is that sample pooling and implementing pooling of samples, especially in human tissues, can control for some of this technical variability and can enhance the clarity of some of the biological variability. Additionally, large-scale data analysis can be used to distinguish between different sources of variability and identify consistent patterns in human subjects. This was also something that Steve was chiming in on regarding how, especially when you use some of these pooling strategies, you can find batch effects: if everything's the same between those 20 samples, then you know that it might be an artifact, whereas if it applies to everything else, then you know that this is actually something worth looking at and could be a source of population-level variation. Additionally, there was a conversation about, given the challenges with variable experimental approaches and various analytical approaches, where you get differences depending on which analytical method is used, how we can really understand what it means to be a ground truth. One of the ways suggested was using tools on existing data sets and atlases, including from other resources such as the Human Cell Atlas, to use some of their principles of cell type quantification, as well as really contextualizing this with other experimental validations and understanding that just because tools agree doesn't mean that they're necessarily closer to the ground truth.
And the opposite of that is, just because they disagree doesn't mean that something isn't real. So really understanding that-- I think it wasn't in our session, but in the session before, someone said something quite relevant, which is that a cell type is a hypothesis. And so how do we really go about using our data analysis to put this in the broader context of what we can use to understand the brain? Then, kind of concluding this, in terms of the QC, the analysis, the different experimental approaches, and the references that exist, a question that was raised and I don't think was answered is, do we need to develop more tools, or can we apply what we have with the appropriate perspective? I think that's really a challenge and a call to action for BICAN moving forward. And sorry about that switching of number three, Nelson.
NELSON JOHANSEN: No worries. Thank you, Aparna. And to summarize the last topic three here, really focusing on standards of annotation, getting the community involved through multimodal integration. Everyone agreed that community involvement and collaboration is important for this process. But we need to be really clear on the reasoning behind the cell type names. How can we get the community to engage effectively without cryptic definitions of cell types? So again, having just a common language that we talk about for cell typing, for atlasing that the community can get behind, and trying to enable the community's involvement through use of modern tools. Aviv brought up a great point, using large language models as a kind of moderator or a middleman between the community and the taxonomy developer or the atlas developer. It's a really great idea and should be thought about carefully.
One point brought up during the session was, how do we prioritize the whole human brain atlas? Ed and others talked about how we are focusing on the basal ganglia as a place to start and align as a consortium, and going from there. To do all this, of course, many tools and platforms are going to be needed to allow us all to work on the same atlas. We already have great starts with knowledge graphs, but tool development is still a crucial part of getting this joint annotation going. Another really great point brought up was making sure that we connect taxonomies to literature. Making sure that the cell types we define are rooted in any previous definitions of those cell types from previous studies is really important and also ties in with the broader biological community. This is definitely going to be a challenging task. The panelists brought up that we need creative approaches, data mining techniques, and really just, again, these annotation tools to allow us to come together as a community and build these multimodal annotations for atlasing. And that is the summary for this session.
MING ZHAN: All right. Thanks, Yong. It was a great first day of the workshop. Today, we're going to move on to session two of the workshop. The focus of session two is data, tools, knowledge, and the community for the brain cell atlas. We start this session with two keynote presentations. Each presentation will have 20 minutes, which includes probably 17 to 18 minutes for the presentation and two to three minutes for Q&A. I may give you a heads up if you run over time a little bit. The first presentation is by Dr. Bob Grossman. Bob is a professor of medicine and computer science at the University of Chicago. He is the PI for the NCI Genomic Data Commons and the project lead for the Gen3 data platform. The topic of Bob's presentation today is, as you see on the screen, data platforms for managing, analyzing, annotating, and sharing genomic data. Bob, take it away.
ROBERT GROSSMAN: Thank you. Can people see my screen?
MING ZHAN: Yeah. Very well.
ROBERT GROSSMAN: Thank you. It's a pleasure to be here. I'm going to, as the title suggests, talk about data platforms in general for biomedical data. I am not an expert in single cell nor in the brain, so I'm going to speak more generally about platforms that can handle genomic, imaging, and clinical data, the data types that are the subject of this workshop. I'm going to start with a little background on data platforms, data commons, and data meshes. And then the second part of the talk will be on five trends that I think are going to be relevant for data platforms that can be used for single-cell, clinical, genomic, and imaging data for the brain and for other areas. So I'm going to take the perspective of a platform. Oftentimes in biomedical data science, broader computing trends like cloud computing and platform computing are the foundations that the particular groups building the different repositories, portals, and other platforms being discussed over this three-day workshop leverage.
So a platform, whether it's a social media platform or a platform like Uber or a data commons, has a number of different roles. And one of the things that is changing is that these roles are expanding to make it easier for research groups studying single cell or studying particular aspects of the brain to bring up specialized platforms that may be relevant for their particular research community. So I'll use the term sponsor for the organization, whether it's internal or an NIH institute or center, that funds the platform, and operator for the person who builds it, and then there's data coming in and users accessing the data. And there are a lot of changes happening to these types of platforms. I'll begin with a platform I've been involved with. The GDC project started 10 years ago; we built our first prototype several years before that. Over time, this is now used by over 80,000 users a month. In an average month, two petabytes of data are accessed in the cloud or downloaded; in some months, five petabytes of data are pulled out. This is based on an architecture I'll describe in a minute that is quite useful for working with genomic data and is quite commonly used now.
But I want to emphasize that these types of architectures are over 10 years old, and that's a long time in this space of platforms. And that's one of my themes. I'm going to remind everyone quickly of the architecture used in the GDC. There is a data model and a database of curated data. We use a graph database, but other platforms I've worked with have used a number of other databases, from OMOP to others. There's a lake structure where we have roughly a million data objects that can be accessed by persistent identifiers, and they have metadata associated with them. For the GDC, the bulk of the data is BAM files. So we have roughly 10 petabytes of BAM files publicly accessible, and a terabyte of curated data that's accessible from a database. I think of this as a data hybrid: you get scalability for data objects like BAM files or imaging files or some of the other modalities out there, and you have FAIR data, but it's split between a structured database and data objects.
I want to emphasize that, for over a decade, by putting APIs in place to make the data findable, accessible, interoperable, and reusable, you can very easily build an ecosystem so that people can write libraries and their own applications that access the data. In the GDC, we made the decision that we would use no APIs that were not publicly accessible. So all our portals and all our analysis run over publicly accessible APIs, so that there could be a rich ecosystem. The scalability comes from using both on-prem and public clouds. And the data is harmonized by running a common set of scalable, cloud-based bioinformatics pipelines, which is the core of any analysis of the data, as we heard about yesterday. And then there's a scalable interactive front end. So this architecture is a standard architecture; it's widely used these days. It took us, for various reasons, partly because of the size of the data (it started with about two to four petabytes), about two years to build the first GDC. And that's a big lift. A lot of that was because of the interactive tools.
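To make the "publicly accessible APIs" point concrete, here is a minimal sketch of the kind of query the open GDC REST API supports; the endpoint and field names follow the public GDC documentation, but the specific filter values are illustrative and should be checked against the current release.

```python
import json
import requests

# Query the public GDC API for files, using the same API the GDC portal uses.
files_endpoint = "https://api.gdc.cancer.gov/files"

filters = {
    "op": "and",
    "content": [
        {"op": "=", "content": {"field": "data_format", "value": "BAM"}},
    ],
}
params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name,file_size,access",
    "format": "JSON",
    "size": "5",
}

response = requests.get(files_endpoint, params=params)
for hit in response.json()["data"]["hits"]:
    print(hit["file_id"], hit["file_name"], hit["access"], hit["file_size"])
```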
So we asked the question: if you had a data model, could you also generate a data commons? That is, could a data submission portal, a data exploration portal, and some of the other specialized portals, with FAIR APIs for all the data, be auto-generated? So not in two years, but in three months, we built a commons for liquid biopsy data. And single-cell data is coming into this now. It's built with a technology called Gen3 that I'm not going to talk much about. But the idea that you can rapidly build a commons for a particular project to support a particular community, I think, is a very powerful concept. And both the GDC and the Gen3 commons used a very old idea, the idea called end-to-end system design. This is how the internet was built: anytime we got a new modality, whether when we started with web pages and then added images, then continuous media, we did not change the basic routers and switches or HTTP and TCP; we simply changed the applications that brought the data in and the applications that were used to analyze the data. So this is how the data commons are built. We build them over cloud computing, but you could think of it as: data comes in, it's imported, cleaned, and curated, and science comes out.
And so in this architecture, you try to use absolutely as few services as possible so that, over time, you can be as flexible as possible. So when we built the first data commons, say the Gen3 commons, there were no Jupyter Notebooks, but it was very easy to add them because of this end-to-end design, where you have a handful of services for data access, authorization, authentication, linking data across platforms, metadata, etc. This gives flexibility. So at a high level, you could think of a data commons as a governed software platform that co-locates well-curated data and cloud-based computing infrastructure with software tools and services for managing, analyzing, annotating, integrating, and sharing data with a particular research community. And then a concept that has emerged over the last several years is a data mesh, which brings together, over a handful of services following the same principle I just described, multiple data commons, multiple data platforms, multiple computational platforms, multiple knowledge bases, and multiple knowledge graphs, all over a common set of services. And they're both enabled by cloud computing.
There's a slightly related but a little different concept that is coming out: the data fabric. It, again, brings together multiple commons, multiple data platforms, multiple knowledge bases, and multiple computational platforms. But it brings one more thing in, which is, I think, going to be important. It asks: if I have an instrument generating data, whether that's single-cell data or data about tissue or any other type of data from a lab, how do I very easily bring that data into the mesh? And so both data meshes and data fabrics have multiple data models and hybrid governance, but they both run over a core set of services, which I'll call mesh services. So I want to take the last five minutes, after that brief introduction, to very quickly talk about five trends that have been emerging over the last five years and I think will continue to emerge over the next five years.
So the first trend I alluded to is that the early platforms took a while to bring up. They were extremely impactful, but they just took time and money. And so one of the things I think we'll see - and it's related to my first three trends - is more platforms as a service, where the PIs, the scientists analyzing the data, and the scientists creating individual analysis tools can do what they do best and not have to spend time on the core data management, authorization, authentication, the lower levels. And partly, this is to split data platforms such as a data commons so that the developers, the people who operate it, and the people who design it can more easily act independently. So the PI for a particular project or consortium can work with a much smaller team to bring up a fully functional platform. And I don't have time to go into details, but I think we'll see lots of smaller-scale data platforms and data commons that can be used for projects for as long as they are needed, with the data then moving to other platforms in a seamless fashion when that particular platform is no longer needed, so that we can accelerate science.
The second trend is related: with these core mesh services that we've been using for a while and that are now standard for building data platforms and data meshes, I think you'll begin to see life cycle support. Right now, it's a heavy lift to get data in, to move it out, and to update it. But with fabrics making it easier to bring data in from lab instruments, including single-cell data, this will happen over the next few years. I think, also, when a project is over, since there's a data lake with FAIR data, you could easily transition all the data to another platform or an NIH repository, say, in the brain community. And if you build this in when you start a new platform, it's much easier to do later. This brings up something I can't really go into about when two platforms can trust each other, but one of the things you see in a well-operated data mesh is an agreement, through the hybrid governance, on minimal metadata so that you can transition data between platforms.
The other thing I think you'll see is, and I put this at the bottom, in January, February, in Gen3 we're adding each of these capabilities, and other groups are as well, and we and others are working with GA4GH to make these standard. At the bottom, we're also splitting things so that the hybrid services, the mesh services, are going to be kind of standard, and they should just be brought up, and the commons as a service should be brought up. And then you can have an app framework. We're bringing this up in Gen3 and in the Genomic Data Commons over the next few months, so that a particular individual using the right framework can bring up an analysis tool and plug it in, in the same way you can plug something into your iPhone from an app store, and so that the experts with deep biological knowledge and deep understanding of curation can focus on developing the tools and plugging them into the platforms, designing great front ends and great computational frameworks, and plugging them in.
And my last two trends. I mean, for at least five to eight years we have been moving containers around for PanCancer and other large-scale computations, but it was sometimes a little heroic. We're still converging on standard APIs. So for FAIR data, we're used to standardizing metadata, authn, authz, and data access. I see three types of new APIs emerging as standards so that we can interoperate. One is for computation, where we can ship containers around. One is for platform-to-platform interoperability, which some people are calling SAFE, for a secure and authorized environment for FAIR data. And the final one is, I think, we are all used to large-scale models, but people are building small generative AI and LLM models. And if we expose standard interfaces in commons and other platforms, including vector databases, then it's much easier to glue these together with emerging technology, to glue together multiple small LLMs or generative AI models to create more powerful ones. And in general, we're limited in large language models and generative AI by large-scale clean data. That's the real contribution of the data platforms for single cell and brain: they have the great data. So by exposing appropriate APIs, including vector databases, they can make it easier for the community to scale these things. That's the end of my slides, and I should have a few minutes for questions.
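As one concrete example of the kind of standard data-access API referred to here, the GA4GH Data Repository Service (DRS) defines a uniform way to resolve a persistent identifier into the bytes behind it. A minimal sketch, assuming a hypothetical DRS-compliant server and object ID:

```python
import requests

# GA4GH DRS v1: resolve a data object by ID. `drs_base` and `object_id` are
# hypothetical placeholders; any DRS-compliant server exposes
# GET {base}/ga4gh/drs/v1/objects/{object_id}
drs_base = "https://example.org"
object_id = "abc123"

obj = requests.get(f"{drs_base}/ga4gh/drs/v1/objects/{object_id}").json()
print(obj["id"], obj.get("size"), obj.get("checksums"))

# Each access method describes one way to fetch the bytes (https, s3, gs, ...).
for method in obj.get("access_methods", []):
    print(method["type"], method.get("access_url", {}).get("url"))
```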
MING ZHAN: Thank you so much, Bob. Very nice presentation. So I see GQ, you raised your hand. You have a question?
GQ ZHANG: Yeah. Bob, fantastic work and a very nice vision. I think this will be very valuable in guiding BICAN into the next phase and future phases. But the BICAN community is concerned with, and spends a lot of effort on, de novo data generation. So the FAIRness consideration is not at the deposition-to-data-commons phase, but rather at the point of data generation. We're trying to enforce and actually practice FAIRness principles at the point of data generation and annotation and all of those activities. Do you have suggestions, from the data commons point of view, on what aspects we should pay particular attention to so that this transition is much easier later on?
ROBERT GROSSMAN: That's a great question. With the decade-old model of the data commons, FAIRness was put at the platform layer. And it became hard to get data in, because it had to be sufficiently well curated; it had to adhere to a model. The GDC model has hundreds of attributes, as do most models. So it was hard to get the data in, but once the data was in, it was well curated and made FAIR. I think the transition - and I didn't have time to go into it - the issue with meshes is, how do you glue things together? The new idea with the fabric is, when I have instruments putting data in, such as single cell, how does the mesh-- well, let me step back. I didn't emphasize it, but meshes and fabrics are governed resources, with hybrid governance. That means the individual components have governance, but the mesh or fabric also has governance.
So basically, one of the most important decisions is, what is the minimum metadata, and what is the standard for something to connect to the fabric and bring data in? That's one of the most important things that the mesh or fabric does. And then it sets up a practice across all those components bringing the data in. So we have conservation of difficulty: this is a governance and curation aspect. The role of the platforms and meshes and fabrics I was talking about is, once that governance is decided, to provide the supporting infrastructure so that you don't have to constantly redo this. So you asked a good question. Your community will be one of the first ones doing this, but that's precisely the role of the fabric, to reduce the amount of redundant work around that.
GQ ZHANG: Thank you for your comment. I feel like we're aligned with what you suggest and are at least on the right track in terms of the focus of our effort and attention.
MING ZHAN: Thanks, GQ, and thanks so much, Bob, for a very nice presentation. The next presentation is by Dr. Mike Hawrylycz. Mike is an investigator at the Allen Institute for Brain Science, working in data analysis, annotation, and research development. Mike is also a veteran BICCN/BICAN investigator. He, along with others, launched the BICCN Data Center about seven years ago, and he continues to be the PI of the newly organized data center. Today, Mike is going to talk about opportunities and challenges in organizing cell type data and knowledge. Mike, take it away.
MIKE HAWRYLYCZ: Okay, yeah, thank you. I'd like to thank Dr. Grossman for that very interesting talk. It shows how many of these same issues apply to the organization of neuroscience data, and in fact how far we have to go with some of our work. I'm going to talk about organizing cell type knowledge in the brain, the opportunities and challenges, and aspects of things we have been considering. BICAN presents an unprecedented opportunity in data availability and organization across these major UM1 projects, which are extensions of the BICCN into human and non-human primate, across time, population studies, development, all kinds of axes. Our vision, ultimately, for the organization of this material is to produce annotated transcriptomic-based taxonomies in each species, across species, and across development; to organize this data into cell type knowledge; to provide molecular maps common to all brain regions, consistent with our ontologies and taxonomies; and to understand the graphical and spatial architecture of this data.
The concept of the knowledge base is a common topic in biomedicine, with many, many people identifying it as an important thing. We wrote about this in a proposal with Rafa Yuste, Ed Lein, Hongkui Zeng, and myself, about the interest in a community-based transcriptomic classification and an associated knowledge base. Dr. Zeng also wrote a remarkable article about cell types and their architecture and how to define them, again calling out for a knowledge base and an organization of information in this way. To get there, there are essential first steps: we need formal releases of taxonomies within species, with data mapped through the same kinds of pipelines and organization; formal nomenclatures, as Dr. Jeremy Miller has pointed out in a previous talk; the ability to visualize the reference and tools and map to these kinds of data sets; and finally, a knowledge base and information organization to bring it all through. This is the pre-work toward neural knowledge organization and its architecture.
To do this, a group of us applied for and received an award through the NIH to create a BICAN CUBIE knowledge base. I'd like to acknowledge my co-investigators here, Contact PI Shoaib Mufti, Lydia Ng, Satrajit Ghosh from MIT - whoops, I've got Lydia Ng twice - and myself. And also Yasmin Hussein, our program manager. Our goals here were to create an adaptive knowledge graph to harmonize brain cell type data, build an ecosystem of tools, enable the app store type of application that Dr. Grossman pointed out, develop the infrastructure for the knowledge base and associated information, and hold community workshops for training and for collecting information, knowledge, and feedback. And that's something I'd like to tell you a little bit more about here. We had, in fact, such a workshop.
The first workshop was held September 26th to 28th, a three-day event at the Allen Institute, covering all kinds of topics and discussing the information and knowledge needed to achieve this goal of building a brain cell type knowledge base. Our preliminary vision, getting the basic substrate in place, is to consider evolutionary and developmental ontologies and anatomic and coordinate frameworks, to have cell type maps and taxonomies with spatially organized data, to map between the two to allow querying, and finally to embed this information into a knowledge graph-type structure, allowing claims and evidence to be made and investigated with respect to this triad of groups. There's a great vision about how this might evolve. Ultimately, leveraging all the BICAN/BICCN data, with these cellular properties, ontologies, coordinate frameworks, etc., in place, we can now intersect this with cellular and systems level neuroscience, the neuroimaging community, the brain armamentarium, the upcoming BRAIN CONNECTS project, other external consortia, and translational research. So the foundational data sets need to go into this knowledge base.
We've heard about a bunch of these in the last day. A remarkable set of papers was published just recently, in December of 2023, delivering the full whole mouse brain and extending our original primary motor cortex work. These provide, as has been discussed already, an unprecedented opportunity for analysis, for examining these data sets, their cross-relationships, and what they imply. They can be brought into a knowledge base, and they provide essentially everything you might need to know about the cellular, epigenetic, and genetic structure of a brain. They come from the major labs that were part of the BICCN - the Ren Lab, Ecker Lab, Callaway, Zeng, etc. - and these are now available for us to bring in. Interestingly, alongside these papers there was an editorial in Nature calling for certain extensions that they thought were immediately needed to bring this sort of work to fruition: data sharing standards will become increasingly important; data models and code need to be open; there are challenges to reproducibility; and we need standard frameworks for data collection and analysis. This work has begun already, of course. We had a very nice analysis in the primary motor cortex data of the concordance of transcriptomic data and regulatory data, built by the Mukamel group. And Jesse Gillis from Toronto has investigated the correspondence of these atlases and shown remarkable consistency through many of the clusters that have been identified. But there's clearly a lot more to be done to bring this to fruition.
So to bring this into a cell type knowledge base, what do we need to do? Well, we have to ask: what are the data sets? What is the knowledge, and what do we mean by knowledge to bring in? What new information will be brought in? And what is the infrastructure, and the APIs, to produce them? A general framework for this can be envisioned as infrastructure, standards, computing, ecosystem, and finally training. This is consistent with our model and our workflow in our grant for our knowledge base, and at the Allen Institute as well. It's important to identify use cases. This is an exercise in understanding how the knowledge base would be used. For example, a user might wish to query a list of neurotransmitters and ask, "How can this predict neuronal circuit function through identification of common co-expression patterns?" And there are many other kinds of possibilities, but this kind of inventory and study really should be done to tailor the database to the way it needs to be used.
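To make that example use case a bit more concrete, here is a minimal sketch of what such a query might reduce to computationally: given mean expression of a few neurotransmitter-related genes per cell type (the numbers and cell type labels below are entirely illustrative, not knowledge base output), common co-expression patterns are just gene-gene correlations across cell types.

```python
import pandas as pd

# Hypothetical input: mean expression per cell type (rows) for a handful of
# neurotransmitter-related genes (columns). In practice this table would be
# retrieved from the knowledge base through its query API.
mean_expr = pd.DataFrame(
    {
        "SLC17A7": [9.1, 0.2, 0.1, 8.4],  # vesicular glutamate transporter
        "GAD1":    [0.3, 7.8, 8.1, 0.2],  # GABA synthesis
        "SLC32A1": [0.1, 7.2, 7.9, 0.1],  # vesicular GABA transporter
        "CHAT":    [0.0, 0.1, 0.0, 0.2],  # acetylcholine synthesis
    },
    index=["L2/3 IT", "Pvalb", "Sst", "L5 ET"],  # illustrative cell types
)

# Gene-gene correlation across cell types: a simple proxy for the
# "common co-expression patterns" mentioned in the use case.
coexpression = mean_expr.corr()
print(coexpression.round(2))
```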
There are lots of central questions that came up in our workshop. For example, where does the knowledge base lie in terms of a functional organization of data? How do we keep it a living resource? Should literature be part of the knowledge base? How do we keep consistent with advances in cell ontologies? And how do we determine a significance level for adding new data or integrating ever-increasing data? We had breakout discussions on all of these, and more questions were raised than were answered, but it was a starting point toward understanding where we want to go with this: identifying appropriate clustering pipelines, working with our partners at the Broad and NeMO and others, aligning data with the CCF, advanced data visualization, issues of provenance, the relationship with disease, and other things.
At the Allen Institute, we have a knowledge framework, a brain knowledge platform under development, that manages our own pipelines, services, registry, etc., and consists of three primary components that are right in the direction of where we want to go here. One is the Allen Brain Cell Atlas, which you heard mentioned and will hear more about later today; another is a series of APIs and notebooks we have developed; and the third is the Cell Type Knowledge Explorer, a preliminary database in the direction of knowledge organization. The Allen Brain Cell Atlas is, essentially, the next-generation version of the Allen Mouse Brain Atlas first released in 2006, an in situ hybridization atlas which has now been extended to spatial and RNA-seq data at cellular resolution. A set of APIs and notebooks, developed by Lydia Ng at the Allen Institute, is available for deep access to this data and provides a way to understand it and take a deep dive into it. And our Cell Type Knowledge Explorer, released in 2021, is a coordinated look across three species in the primary motor cortex, showing the integration of data and its modalities there. You can look at transcriptomic, morphological, and electrophysiological information, and cross-link and map your data to these resources as well. We can think of this as a first down payment in the direction of a true knowledge base of cell types of the brain.
The knowledge base forms a core part of our design for the HMBA, the Human and Mammalian Brain Atlas, of Ed Lein and Hongkui Zeng. Here, the goal is to develop a whole mouse brain taxonomy linked with functional imaging data and spatial data, and a knowledge base is really at the center of integrating this across archives, applications, data catalogs, and pipelines. Finally, I'd like to mention a few tools and resources that will be important in achieving these goals. Content curation is essential: we need quality information, its context has to be provided, along with consistency with other frameworks, visualization, and feedback, and the ability to update, measure performance, and engage users. We also need taxonomy organization and management services, as Jeremy and Nathan have talked about. The idea is that we need coherent ways of representing taxonomies, comparing them across datasets, managing them, and enabling controlled access and exploration of their content.
In the BICCN, we wrote about this architecture in a community paper published in PLOS Biology, which talked about the levels of data and how providing structure and organization to data can enhance it and allow users to better understand its use and application - how it's linked, how it's featured, and so on. All data sets in the BICCN have been classified in this way, and we hope to do similar things for BICAN. A little bit more detail on the ontologies: through a BRAIN data standards grant through Ming Zhan's office, we worked on and built a detailed, data-driven ontology of brain cell types. This is embedded in the Brain Cell Type Knowledge Explorer, and we are now reconciling it with our human BICAN data. We want to be able to cross-map species and achieve taxonomic alignment through the work of Trygve Bakken here, basically aligning these taxonomies, mapping datasets to datasets, and understanding what is in common across cell types, how they change, and how they differ across species. As has been mentioned, there are the issues of data mapping. Here there are tools originally set forth by Rahul Satija through the New York Genome Center, and now we have developed our own methods at the Allen Institute for mapping these cells. You can use various different methods for doing this, from deep machine learning methods to more conventional hierarchical mapping, but these form important ways of comparing new data to existing data in the knowledge base. And furthermore, there are advanced visualization tools such as CELLxGENE from CZI, Cytosplore, which our colleagues in the Netherlands developed for spatial data investigation, analysis tools, genome browsers, etc. All of these are essential tools in the organization of our information.
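As a toy illustration of the flat end of that mapping spectrum (not the Allen or Satija lab implementation), here is a minimal correlation-based sketch with entirely synthetic data: each query cell is assigned to the reference cluster centroid it correlates with best, and a hierarchical mapper would simply repeat this level by level with level-specific marker genes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical reference: mean log-expression per cluster over a shared gene set.
genes = [f"gene_{i}" for i in range(50)]
reference_centroids = pd.DataFrame(
    rng.normal(size=(5, 50)),
    index=["cluster_A", "cluster_B", "cluster_C", "cluster_D", "cluster_E"],
    columns=genes,
)

# Hypothetical query cells, e.g., new data to be annotated against the atlas.
query_cells = pd.DataFrame(
    rng.normal(size=(3, 50)),
    index=["cell_1", "cell_2", "cell_3"],
    columns=genes,
)

# Flat correlation-based mapping: assign each query cell to the centroid with
# the highest Pearson correlation across the shared gene set.
corr = np.corrcoef(query_cells.values, reference_centroids.values)[
    : len(query_cells), len(query_cells):
]
assignments = pd.Series(
    reference_centroids.index[corr.argmax(axis=1)], index=query_cells.index
)
print(assignments)
```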
So basically, in summary, several ideas came out of the three-day discussions at the workshop, and these are now under investigation. We wrote a report that summarizes all of our investigations; if you're interested in the report, write to me and I'll send it to you. The ideas include connecting our common coordinate frameworks and taxonomies; unifying data across species and development; developing the right APIs; highlighting the interrelationship of spatial and transcriptomic data and information; the importance of common coordinate frameworks and how data is mapped; issues of scale; building software bridges; publishing our visualization tools and large data sets; and several other important things regarding data organization, data management, and ways to connect data with the community. So that really is our first version and instantiation of our BICAN CUBIE knowledge base, and we're working on it and trying to keep pace with the remarkable data generation and its interpretation. So I'd like to thank you, and thank the organizers for allowing me to speak. And I guess we'll take questions.
MING ZHAN: Great presentation, Mike. Thank you so much. Wonderful to see all this happen over the years. So you've got a couple of questions in the Q&A. First question: can the community contribute Jupyter Notebooks to feed knowledge into the CTKE?
MIKE HAWRYLYCZ: Well, we don't have a mechanism just yet to do that, but these notebooks provide the kind of mechanism by which that could be implemented. At present, our way of comparing new data with existing data is through our mapping tools - you're mapping data. But an important part of this will be how to update taxonomies and how to modify them. And this is a big, tricky subject that, of course, has been addressed in many ways in the genomics community for many years: how do versions of taxonomies evolve? How do, for example, definitions of cells or genes evolve across genome builds and the like? And so this is a direction that we're hoping to go in. Yeah.
MING ZHAN: All right. Second question. Could we have access to this amazing presentation?
MIKE HAWRYLYCZ: Access to the presentation?
MING ZHAN: Yeah.
MIKE HAWRYLYCZ: Of course. Yeah, we can make the presentation and the report available to any who are interested. Yeah.
MING ZHAN: All right, one more question. Is a beta or development version of this platform, or parts of this platform, available on a website? I think the answer should be yes. Right?
MIKE HAWRYLYCZ: Part of our platform is-- I mean, part of our work and mandate at the institute is putting all our resources and data forward. But the code base itself is not yet really open, because we're still in evolution. But we will definitely be aspiring to abide by all FAIR data practices as this becomes better and more refined and developed. Yeah.
MING ZHAN: All right, we will take one last question, then we will take a break. Where can we access those notebooks/APIs?
MIKE HAWRYLYCZ: Okay, go to the Allen Institute and our brain cell type database. You'll see the cell types, the whole mouse brain cell types, and from there, there's a direct link to accessing the notebooks for data access. You can go right there; I encourage you to go play with them. Lydia did a wonderful job on these. You can access the data, you can compare with spatial clusters, you can do visualization, and it really provides an important capability. I would just like to also stress that there's a remarkable opportunity now in the co-analysis of all these whole mouse brain papers that I think our analysis community will no doubt really be jumping on, so.
MING ZHAN: All right, that concludes the keynote presentations. I'd like to thank Bob and Mike for the wonderful presentations, and thank you, everyone, for your attention. We'll now take a break, and then I'll hand it over to my colleague, Dr. John Satterlee, to host the rest of the meeting.
JOHN SATTERLEE: I'll just mention my name's John Satterlee. I'm from the National Institute on Drug Abuse, and I've been really enjoying listening to the presentations today and the discussions as well. So our next session is Brain Cell Atlas Data to Knowledge Pipelines. And the moderators for this are going to be Lydia Ng and Tim Tickle. And of course, the panelists include a cast of thousands, or at least 30, and their note-takers as well. So what I'm going to do is pass it off to Lydia and Tim. And I don't know what exactly they have planned for us, but I will plan to let you know when about half your time is up. So without further ado.
LYDIA NG: Thank you, John. And I would like to welcome everybody to our panel session, Brain Cell Atlas Data to Knowledge Pipelines. It's great coming off the heels of the two great keynote talks: the first one told us about data commons, meshes, and fabrics, looking forward to new trends in those areas, and the talk from Mike was about the challenges and opportunities in organizing brain cell atlases and a knowledge base. In our session today, I'm co-chairing with Tim Tickle. Tim is the Head of Scientific Partnership and Data Science Platforms at the Broad. We have two wonderful volunteers for note-taking: Jeremiah Cohen, who's a principal scientist at the Allen Institute for Neural Dynamics, and Cindy van Velthoven, who's an associate investigator at the Allen Institute for Brain Science. What we are doing in our session is that we took the theme of Data to Knowledge Pipelines and broke it up into four sub-themes, and for each sub-theme, we've invited one of our panelists to give us quick motivational slides. Our panelists have seen a bunch of questions that we are going to talk about. So without further ado, I am going to hand it over to Tim to talk about our first theme, which is data access and sharing.
TIM TICKLE: Yep, yep. As far as our first theme goes, we're going to spend a little bit more than about 15 or so minutes on this one. And it is about data access and sharing. We're going to start off with Daofeng Li, Assistant Professor at Wash U. Daofeng will give us some motivational slides. And after that, we'll jump in with some questions. Daofeng, as you're bringing your slides online, thank you, I'd just like to ask all of our panelists if you could turn on your video if you're comfortable with that. That'll help make sure that we're all talking together. We will love it if you use the hand-raise mechanism. We will call you in order unless there's someone who has not been able to say anything. And so in that case, we might bring someone out of order, but we will try as much as possible to use the order. All right.
LYDIA NG: And as the slides are coming up, we'd like to also thank our 34 different panelists coming from various different institutes across different countries. And we collected all their sort of interests, and their interest spans multimodality from data analysis to sort of making data platforms. So a big thank you from us.
DAOFENG LI: Hello. Can you hear me okay or see my slides?
TIM TICKLE: We can hear you okay. I can't see your slides quite yet.
DAOFENG LI: No?
TIM TICKLE: Not quite yet. Let's see here. Maybe let's try one more time, and if that doesn't work, I'm very happy to host your slides for you if you want to tell me when to move on.
DAOFENG LI: Okay. I don't know why this is happening. It's not working.
TIM TICKLE: No problem.
DAOFENG LI: Okay, great. Yeah. It should be in the Google Drive.
TIM TICKLE: Okay. I'm happy to share them for you. Does that work for you, Daofeng?
DAOFENG LI: Yes. Yes. It's perfect.
TIM TICKLE: Great. Just let me know when to move it forward.
DAOFENG LI: Okay. Hello, everyone. My name is Daofeng Li from Washington University. We are part of the UM1 grant, so I'm very happy to motivate the theme of this session. There are a lot of items that need to be discussed later, so I will try to go through the slides pretty quickly and then hand the discussion over to Tim. Move on one slide, please. So my entire career has been about data, particularly focused on data visualization. Data is the foundation of biological analysis. But besides the data itself, we need to figure out a way to organize the data; you also saw in the talks this morning many good tools for organizing data. Most of the time, this means the development of a tool called a data portal. Besides the data being hosted by the data portal, most importantly, we think the metadata also needs to be hosted, to provide search indexing and filtering functions for the data itself. Beyond that, versioning can be another important aspect: data updates, data processing pipeline updates for the processed data, more metadata coming out, and error fixing on the metadata. Once we set up a data portal, how to access the data becomes the next challenge. Typically, we divide the data into two types, either public or protected. Publicly accessible data is the data released to the public; typically, we use a cloud bucket to store the data, attached to some data usage policy. And data that is being uploaded or created should be protected: we can either use some password protection or put this data in a private bucket. Please move the slides.
So fortunately, there are many existing good examples we can learn from. Personally, I have been using the ENCODE and the 4DN data portals a lot. Both data portals provide querying, filtering, and download of data and metadata. The data can be downloaded one by one from the web page, or you can download them in bulk from the command-line interface. So basically, both portals provide very convenient ways to access the data and the related metadata. Please move the slides. This is a screenshot I took from one of the BICAN calls about the data portal under development. I'm pretty confident BICAN will develop the best experience for users to access the data produced by BICAN. That's pretty much all I have. I'll hand things over to Tim for the discussion. Thank you.
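To make the bulk-access pattern Daofeng describes concrete, here is a minimal Python sketch of querying a portal's metadata index and downloading matching files, with separate handling for public versus protected records. The endpoint, field names, and token handling are invented for illustration and are not a real BICAN, ENCODE, or 4DN API.

import json
import pathlib
import urllib.parse
import urllib.request

PORTAL = "https://portal.example.org/api"   # hypothetical base URL
OUT = pathlib.Path("downloads")

def search(filters: dict) -> list[dict]:
    # Query the portal's metadata index and return matching file records.
    query = urllib.parse.urlencode(filters)
    with urllib.request.urlopen(f"{PORTAL}/search?{query}") as resp:
        return json.load(resp)["results"]

def download(records: list[dict], token: str | None = None) -> None:
    # Fetch each file; protected records need an access token, public ones do not.
    OUT.mkdir(exist_ok=True)
    for rec in records:
        req = urllib.request.Request(rec["download_url"])
        if rec.get("access") == "protected":
            if token is None:
                print(f"skipping protected file {rec['file_id']} (no token)")
                continue
            req.add_header("Authorization", f"Bearer {token}")
        with urllib.request.urlopen(req) as resp, open(OUT / rec["file_name"], "wb") as fh:
            fh.write(resp.read())

if __name__ == "__main__":
    hits = search({"assay": "10x_multiome", "species": "human", "release": "2024-01"})
    download(hits)   # public files only; pass token=... for protected data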
TIM TICKLE: Thank you, Daofeng. All right. So today we have panelists from inside and outside the BICAN. And as we get started and warm everyone up for our great conversations, I just want to emphasize that we're really looking forward to your input to shape what we're doing in the BICAN. We're excited to talk about what we're doing, but we're also excited to talk about what you're doing. And if you have other projects outside the BICAN, what works, what doesn't work, or what can be improved on, and those kinds of things. Much in the spirit of what we saw with Bob's presentation, it was really wonderful to hear about things that worked and things that he aspires to see everyone move towards. Really enjoyed that. So with that in mind, let's go ahead and get started. Our first question gets everyone focused on how single-cell genomics data will be collected, stored, quality controlled, and disseminated by data archives. Would anyone like to go ahead and start on that? Or I can get a little more specific if you want to break it down into different pieces.
So maybe we can break this up a little bit and first talk about the upfront part. In the BICAN, we have a very long data lifecycle. As GQ mentioned earlier, from the point the data are created, we have really wonderful infrastructure for collecting the metadata, indicating where the data come from, and collecting metadata about the assays themselves and the libraries. That goes to the sequencing centers. After that, it goes to different repositories, some of which do common processing. Eventually, all of this will move towards the knowledge base and will be integrated for people to work with, as well as being accessible from the archives or discoverable from the portal that holds the metadata about the assays and such. So maybe we can talk about the upfront part: what single-cell genomics metadata will be collected and quality controlled, and how is that defined? Kim, go ahead. Thank you.
KIMBERLEY SMITH: Thanks, Tim. Yeah, so it's been very interesting to hear the panelists over the last day or so. And as I've been thinking about this question, and have been working for the past year on defining the metadata and working with GQ's team to store it properly, it all really comes back to what metadata we need in order to tease apart and understand the biological versus technical variation. I think that's key to help us understand which pieces of metadata need to be tracked, moved along, and made part of the final analysis. And that's where I've been spending my time, defining what minimal metadata is. We really need to understand minimal for what. That helps us think about the question so that we don't throw everything in, but come to a common language and a common understanding of what's needed for provenance: to understand the donor through to the single-cell data, how that cell came from the donor, and the provenance across all the partitioning of the tissue, as well as the analysis and the technical variation, so that we can again tease apart the biological versus technical variation of the process, whether that's sequencing or spatial. It's important to have the analysis components come along as pieces of metadata so that we can really have confidence that what we see is biological variation.
And then there's instructional metadata that helps us process the data along the way and that needs to be carried along with it. So understanding the function of each piece of metadata helps us whittle down what we are putting into the database, so that it stays focused on what's essential for understanding the sources of the variation we see, and so that we can focus on what we all want, the biological variation, and tease it apart from the technical. All that is a lot. And what we have found in this process is that starting with a process diagram really helps us ground and talk through the provenance and how the tissue moves. That grounding then allows us to attach pieces of metadata to the different partitioning steps and provenance events along the way. And it provides a good framework for talking to different teams so that we can all share that same language. We might have different labels for the metadata in our different internal vocabularies, but grounding those different labels into a common and stable identifier in the database, which GQ has worked very hard on with the NHash IDs, is where it really becomes stable, and we can all start to work together to pour our data into this common repository.
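As a concrete illustration of the provenance chain Kim describes, here is a minimal Python sketch of donor, tissue sample, and library records linked by stable identifiers; the field names and ID formats are illustrative assumptions, not the actual BICAN schema.

from dataclasses import dataclass

@dataclass
class Donor:
    donor_id: str             # stable, centrally issued identifier
    species: str
    age_years: float
    sex: str

@dataclass
class TissueSample:
    sample_id: str
    donor_id: str             # provenance link back to the donor
    anatomical_region: str
    preservation: str         # e.g. fresh-frozen; a technical covariate

@dataclass
class Library:
    library_id: str
    sample_id: str            # provenance link back to the tissue sample
    assay: str                # e.g. snRNA-seq
    operator: str             # "maximal" metadata useful for tracing technical variance
    reagent_lot: str

def provenance_chain(lib: Library, samples: dict, donors: dict) -> list[str]:
    # Walk the ID links from a library back to its donor.
    sample = samples[lib.sample_id]
    donor = donors[sample.donor_id]
    return [lib.library_id, sample.sample_id, donor.donor_id]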
TIM TICKLE: Thank you, Kim. Owen?
OWEN WHITE: Yeah, I'd just like to make an observation. And I actually wonder if this contrasts with Bob's really terrific presentation about what's happening over at NCI. We have a tremendous opportunity in the BICAN in that we've essentially created a registry for all tissues before they really enter into the whole ecosystem. This is a really nice opportunity. We're all very grateful to GQ for the development that they're doing. And this allows us to manage the metadata in a completely different way, at least in comparison to other projects that I'm aware of. So mostly, I am just making the observation that this is a really great opportunity for us in that we're essentially gatewaying any information that's coming into our harbor here. And the opportunities that creates for making robust queries of all the derivative data types that come out of all these tissue samples are really very nice. It's just an observation I wanted to make.
TIM TICKLE: Thank you, Owen. Jim?
WENJIN JIM ZHENG: Yeah. So I also want to share some experience from developing the library minimum metadata. As you know, we started by working with something like 14 teams to collect all the metadata elements. And the observation is that, while working with a diverse range of technologies, you really have to put in effort to find the common ground and identify commonalities while also being inclusive. The other thing is that, as informaticians working with experimentalists, you want to make things very easy. We developed a template that makes it easy for people to input the library minimum metadata elements that they think are important, and that really helped us collect all these data elements. Then we sorted through them and worked with domain experts like Kim so that, in the end, we could really put all of this together. I think a lot of times people don't pay enough attention to how to work with people outside their own domain. Making things easy is really important for making good progress.
TIM TICKLE: Thank you. Dave?
DAVID HAUSSLER: I wanted to mention some lessons learned over the many years that my group has done data coordination and development. Certainly, it took a long time to develop the standards for genomics data. And in terms of sharing data, we definitely need to motivate the PIs. That is a carrot-and-stick thing, right? So the program has to require a certain amount of data sharing, but it also has to induce the PIs to want to share data by making it valuable to them in their research program. That means that data that is too big to download needs to be computable on the cloud. Terra, the project that you are working on, Tim, and Gen3 are examples of that. We also need the same code to work at the scientist's home institution as on the cloud, so things absolutely have to be containerized. There is going to be some big data at the home institution and big data on the cloud, and you want to be able to compute on it in the same way in either place. You'll have a lot of your own big data at home, but you'll have other people's big data as well on the cloud. And when you put those two together, that's really fundamental.
Access restrictions have to be simple and standardized. That is absolutely a lesson we have learned. You cannot have different organizations with different ways of doing restricted access. That means coordinating how we define the levels of restricted access, and that should be done across projects. And finally, I would say the data have to be AI-friendly, and that is becoming more and more essential as AI tools get more powerful. A big aspect of this is to have many levels of metadata. Obviously, we want the required metadata for more traditional knowledge representation. That metadata, though, needs to be minimal, because PIs will simply not fill out 100 or 200 metadata fields. But the PIs also have to have the motivation to put in a large amount of extra metadata that is less structured. Even descriptions and links to papers and so forth can now be processed by current large language models and other types of AI. So both the structured metadata at various levels (required, suggested, and so on) and the completely unstructured metadata are all very valuable. We need something like a card that goes with it, like a model card, that says how you produced the data, what its restrictions are, and what its biases might be, in order to be AI-friendly.
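A minimal sketch, in Python, of the layered "data card" David suggests: a small required core, optional structured fields, and free-text context that language models can still use. All field names and values here are assumptions for illustration.

data_card = {
    "required": {                      # minimal, machine-validated fields
        "dataset_id": "DS-000123",
        "assay": "snRNA-seq",
        "species": "Homo sapiens",
        "access_level": "controlled",  # standardized across projects
    },
    "suggested": {                     # optional but still structured
        "brain_region": "primary motor cortex",
        "n_donors": 12,
        "processing_pipeline": "alignment + QC, containerized",
    },
    "unstructured": {                  # free text and links, AI-friendly context
        "description": "Nuclei isolated from fresh-frozen tissue ...",
        "known_biases": "Donors skew older; one sequencing batch dominates.",
        "links": ["https://doi.org/10.xxxx/example-paper"],
    },
}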
TIM TICKLE: Thanks, David. Daofeng?
DAOFENG LI: Oh, yeah. So I just want to give some comments from a user's point of view, especially for downstream computational analysis. For the metadata, if we want to make it useful for downstream computational or statistical analysis, we should have something like a codebook, so people can look up the meaning of each field, like sex or ancestry, or anything else people have to use for downstream analysis. Such a codebook, or any central location where people can look up information about the metadata, would be very helpful. That's one thing. Another thing is that I know people may not be able to share all the metadata, but if they can still provide some information derived from it, like covariates, or even the first few PCs from, for example, genotype data for ancestry, that would also be useful for downstream analysis. Yeah.
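Here is a minimal Python sketch of the codebook idea: one lookup that defines each metadata field, its allowed values, and how derived values were computed. The entries are invented examples rather than an actual BICAN codebook.

codebook = {
    "sex": {
        "description": "Donor sex as recorded at intake",
        "values": {"M": "male", "F": "female", "U": "unknown"},
    },
    "ancestry_pc1": {
        "description": "First principal component of genotype-derived ancestry",
        "type": "float",
        "derived_from": "genotype array, PCA after LD pruning",
    },
    "rin": {
        "description": "RNA integrity number of the tissue sample",
        "type": "float",
        "range": [1.0, 10.0],
    },
}

def describe(field_name: str) -> str:
    # Return a human-readable definition for a metadata field.
    entry = codebook.get(field_name)
    return entry["description"] if entry else f"{field_name}: not in codebook"

print(describe("ancestry_pc1"))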
TIM TICKLE: Thank you so much. Brian?
BRIAN AEVERMANN: Yes, I just wanted to second a number of things that have been said already, one of which is the importance of sample tracking, especially with the possible issuance of donor IDs at some sort of centralized location. I think it sounds like BICAN already has that in place. And I think this is like absolutely critical, especially when it comes to model building and uses in AI, because we do have nice resources around genomics data. We heard earlier about BAM files, sequence data, and so forth. But there are a lot of other associated data that come along with an experiment. And those tend to be siloed in other places. They'll end up on Figshare or random places that you're like, "What is this?" And so if they don't have a donor ID linking them back to all this other genomic data, there's no way the models can ever use them. And so I think having some sort of standardized ID that follows all the data from an experiment is critical.
TIM TICKLE: Thank you. Yeah. And even having some agreement between projects on IDs as well, right, so that one donor is not named differently in another project potentially. All right, we're going to have Jason, Anita, Jim, and Chunyu, and then we're going to slightly pivot on the conversation. Jason, go ahead.
JASON STEIN: Okay, thanks. So I think there's been a reasonable focus on minimal metadata standards, but I would just encourage letting an investigator report maximal metadata if they want to. Things that are often very important for these omics-type methods, like the person doing the library preparation or the lot number of the reagents, are often not recorded in these large metadata standards but can be really important for finding technical variables that are associated with differences in gene expression. So minimal metadata so that everybody fills out the same thing, but also optional, maximal metadata so that bioinformaticians are able to find technical variables in the future. I think that would be a good thing to have.
TIM TICKLE: That's a great point. And also, I think that as we build our infrastructure, we're going to be extending the life cycle of what we do. We'll have releases. We'll have the knowledge base. And because of that, I think the metadata we collect will be iterative. As we add in layers, we'll know that we need something else for the knowledge base or we need something for release and such. And so metadata is likely to be living and breathing, but maybe we'll hear from Jim on that. Thank you, Jason. Anita?
ANITA BANDROWSKI: So actually picking up on that point and the point made earlier by David, yes, we do. We are definitely working towards this minimal metadata standard, and the data that we're expecting from the investigators should truly be minimal. But I wanted to bring in this concept of curators, because this is not something we currently have, although BICCN certainly had a lot of curation coming from Carol Thompson and others. I think this is something that we need to continue to iterate on once we have the minimal standards in place. I mean, we love our PIs. We've been extremely fortunate to have very, very good and involved PIs in the project. But unless someone is looking over the whole of the data, it's never going to be consistent: if multiple people are filling in the same fields, they might each have a slightly different understanding of what goes into a field.
And so from my experience in other big projects, and of course in BICCN before, there is really a need to have someone at a higher level take a look and smooth over the slight variations in the metadata that we get with each individual data set. That person should be able to work with the investigator to fill out those minimal metadata fields in a consistent way. But also, the investigator should absolutely put in as much maximal metadata as they like, because I fully agree that it often helps us pull out even the minimal metadata fields. So that's just me shouting out for professional curation. I think we need to start working in this direction as we get the standards in place. Thank you.
TIM TICKLE: Thank you. Thank you so much. All right. We'll take the last hands that are raised, and then we're going to have to move to our third topic. So we've got about three minutes. So please speak your mind, of course, but please let's stay within that. Go ahead, Jim.
WENJIN JIM ZHENG: Yeah. So with regard to the descriptions and other important information for the metadata, that's something we actually included. I posted the link to the latest development version of the library minimum metadata, which contains all of that information: descriptions of the fields and the properties of the data elements contributed by all the participating teams. All of that information should be there.
TIM TICKLE: Thank you, Jim. Chunyu?
CHUNYU LIU: Okay. We were talking about tracking data and querying data. But I just want to point out one important thing people tend to forget. It's not uncommon to make a mistake in managing the samples, so sample swapping is not uncommon. Fortunately, when we have lots of sequencing data, it's relatively easy to check and, by matching samples to genotypes, to fix many of the problems. But that's not commonly done for much of the data we have in hand, and it has created lots of downstream problems. So I really want to encourage the consortium to think about aligning the samples with their genotypes. A small number of genotyped markers would be sufficient to make that correction.
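A minimal Python sketch of the genotype-concordance check Chunyu suggests: compare a small SNP panel called from the sequencing data against each donor's reference genotypes and flag likely swaps. The 0/1/2 genotype encoding and the 0.9 threshold are assumptions for illustration.

def concordance(called: dict[str, int], reference: dict[str, int]) -> float:
    # Fraction of shared SNPs with identical genotype calls (0/1/2 alt-allele counts).
    shared = [snp for snp in called if snp in reference]
    if not shared:
        return 0.0
    matches = sum(called[snp] == reference[snp] for snp in shared)
    return matches / len(shared)

def best_match(called: dict[str, int], donors: dict[str, dict[str, int]],
               min_concordance: float = 0.9) -> tuple[str | None, float]:
    # Return the donor whose genotypes best match the sequenced sample,
    # or None if nothing clears the threshold (possible contamination or missing donor).
    scored = sorted(((concordance(called, geno), donor) for donor, geno in donors.items()),
                    reverse=True)
    score, donor = scored[0]
    return (donor, score) if score >= min_concordance else (None, score)

# If best_match(...) returns a donor different from the labeled one,
# the sample is a candidate swap and can be relabeled after review.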
TIM TICKLE: Thank you, Chunyu. Jimmie?
JIMMIE YE: Totally agree with all that. And my point is that we should really try to learn from other consortia. Bulk RNA sequencing has faced a lot of the same issues; sample swaps, I think, were upwards of almost 20% in some cases. I think there's also enough single-cell RNA sequencing data out there now that maybe we can take a data-centric approach to figure out which technical features are the most important, features that may not be in every single data set right now but that are driving sources of variation. And maybe the CZI can help there. That should then be part of what makes up the minimal set, because I also agree with Dave that we're not going to get people to fill out 100 entries in a field. But maybe take a data-centric approach to what the most important factors are.
TIM TICKLE: Thank you, Jimmie. Dave, maybe 30 seconds to round us out.
DAVID HAUSSLER: Yeah. Yeah. So I've been involved in lots of these transcriptomics and genomics organizations, and you really should use the Global Alliance for Genomics and Health. Standards have to be international, and GA4GH will work not only with the companies that are building the instruments but also with the PIs who are trying to interpret the data. The standards change as the technology advances, as our understanding of the problem evolves, and as the data generated by the instruments change. But if you don't have an organization like GA4GH that gets the companies to work with the investigators, the companies can start pushing proprietary standards to try to sell more of their platform. So you really, really need that.
TIM TICKLE: Yeah, 100%. Thanks so much, David. And then, just to round things out, one thing that gives me a lot of hope for the backend, and I think it's a wonderful project to be in, is, number one, the close relationship between the infrastructure professionals and the scientists working together, like you saw with Kim and GQ, for instance. And also that the system itself is a living system; it's part of the process. It's not an afterthought where, after everyone's done, we put the data here. It's being processed for people to use to do the analysis with. And so we'll catch a lot of issues because it's got to be good enough for us too. So I'm really excited about that as well. With that, however, I do want to move us over. I think Lydia will be taking our next theme.
LYDIA NG: Thank you, Tim. So our next theme is Features to Knowledge. And we have invited Satra Ghosh, who is the Director of the Open Data in Neuroscience Initiative and a principal research scientist at MIT. Satra, take it away.
SATRAJIT GHOSH: Thank you, Lydia. I hope people can hear me.
LYDIA NG: Yes.
SATRAJIT GHOSH: Wonderful. So I want to get into this middle space of what we are talking about in this theme. We've just heard about data and data access, and we'll hear about communities and use cases in the upcoming parts of this panel. This middle space of features to knowledge is what I wanted to focus on. As part of this, some of the panel members asked us to define this space: how do we define the terms data, information, and knowledge? So these are some very simple definitions just to keep us grounded in the pieces we are talking about. Data are raw, unprocessed facts and figures without context. Now, we tend to often talk about data and information in the same space. Information is what I think we generally refer to as data: data with context and meaning. We heard about minimal metadata in the last discussion and information about provenance from Kim and others. Another way to think about it is that information is often processed, organized, and structured in a way that's meaningful to the user, and one can ask basic questions about it, like who, what, when, where, and how it was generated and processed. Knowledge is derived from information through insights, understanding patterns, and contextualizing information, and it encompasses the understanding, awareness, or familiarity gained through experience or learning. And in some ways, this is a cycle, right? To generate data, you are building on knowledge. You're just thinking about these pieces slightly differently.
As an example of where knowledge could be helpful, I want to bring in an external project. This is Biolink, a knowledge graph schema that is meant to encompass all of biology, from molecules through to clinical entities, and hence is relevant to the needs of those of us who are focusing on the brain. The charge that the NCATS folks had for this group was to get 20 different sites using the same data model, which I think we are also trying to do within the BICAN, and hopefully with the wider collaborations across it. And one of the goals, if this is possible, is to create an integrated knowledge graph over which one could apply reasoning. And this was written in 2022. We've seen some incredible changes in technology over the last year or so that allow us to interact with information much more readily, and hence, getting information organized into knowledge would be great. So what is it that we want to do? Features to knowledge. There are lots of features in this BICAN single-cell space that we are looking at. They might come from different types of assays, with additional information about anatomical location, morphology, projections, physiology, quality information, as well as a lot of phenotypic information, especially as we are getting into a larger space of human data coming into BICAN. On the other hand, we have knowledge. And this is just a subset of the things that BICAN and others are doing: building reference atlases, increasing our knowledge of brain architecture through parts, cell types, connectivity, and location, as well as thinking about pathomechanisms, function, dysfunction, precision targets for intervention, and creating ontologies.
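As a toy illustration of the knowledge-graph idea Satra raises, here is a minimal Python sketch of typed nodes and predicate edges, loosely in the spirit of Biolink, with a simple query over them. The entities and edges are invented examples, not curated facts.

edges = [
    ("cell_type:L5_ET_neuron", "located_in", "region:primary_motor_cortex"),
    ("cell_type:L5_ET_neuron", "expresses", "gene:FEZF2"),
    ("cell_type:microglia", "located_in", "region:primary_motor_cortex"),
    ("cell_type:microglia", "associated_with", "disease:alzheimers"),
    ("gene:FEZF2", "member_of", "pathway:corticofugal_development"),
]

def neighbors(node: str, predicate: str | None = None) -> list[tuple[str, str]]:
    # Return (predicate, object) pairs for a subject node, optionally filtered by predicate.
    return [(p, o) for s, p, o in edges
            if s == node and (predicate is None or p == predicate)]

# Example query: which cell types in this graph are associated with a disease?
disease_linked = [s for s, p, o in edges
                  if p == "associated_with" and o.startswith("disease:")]
print(disease_linked)   # ['cell_type:microglia']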
The challenge is that there are many ways to get from A to B, and there are many ways to even get to A. And part of this session is about discussing how we do some of these things in terms of the pipelines that are available. I want to bring up the BICCN multimodal cell census and atlas of the mammalian primary motor cortex as an example of this. It covers information across species and across various modalities of techniques, combined into what we have defined as knowledge, and it is part of the ecosystem of disseminating these through data archives and other resources like the BICCN data center. But this was just one piece of a lot of collaborative effort across various collections of publications. And one of the questions for us is: how do we encapsulate all the tools, technologies, and data that have been used in all of these papers and make them readily reusable, such that we know what features can be computed, what knowledge can be extracted, and how we build this up together? So I'm going to stop on this slide and pass it back to Lydia on the how, what, which, when, and who of all of these things.
LYDIA NG: Thank you. Thank you, Satra, for the motivational slides. So let's start with the what: what goes into the knowledge base? The first thing is, we know that we have a large amount of cell annotation data, single-cell data. So what do we want to extract out of that single-cell genomics data? And then what kinds of queries and searches do we want the knowledge base, or any of the portals, to provide? Can we see hands? Other hands? All right. Mark?
SATRAJIT GHOSH: Mark, you're muted.
MARK GERSTEIN: Is Mike before me, I think, in the list?
LYDIA NG: Mike is answering a different question. So I put the questions out of order.
MARK GERSTEIN: Oh, okay. I'll just say something then. That's fine. One of the things that people probably would want to get from a knowledge base is answers about differentially expressed genes. People are interested in disease and things like this; they're going to want to know how particular genes in particular cell types change. And one thing that I think is worthwhile pointing out is that these are such complicated calculations. There are many levels, obviously: there are the cell types that we've heard about, and the pipelines that process them, and sometimes you can get pretty different results if you change things. People all know this. But I wonder if there's some way we can preserve some of these different types of results that people would get, and the variance in the knowledge that we get out, in some standardized form, so people can get a sense of how robust a lot of these calculations are. Particularly, just to amplify this, the differential expression of genes is so tied to the definition of cell types. A lot of times, a gene that appears to change a lot will not change so much if the cell type definitions change. And I think that's an important thing to try to somehow preserve. That's it.
LYDIA NG: And I think that brings up a very good point from Mark: as Satra said, there's more than one way of skinning the cat. The data can be put through different processes and outputs. So how do we then keep track of all the different ways things are processed, and the versioning? Tyler, do you want to take that on? You stuck your name next to it.
TYLER MOLLENKOPF: Yeah. Thanks, Lydia. I'd be glad to. So we've already talked some about more fundamental things like donor IDs. I think one of the core aspects that we've got in mind at the Allen Institute with the knowledge base is making sure there are very tight linkages for everything between the donor and the cell type at the end. There are lots of things that happen to tissue, and lots of things that happen to data. Those things are going to have IDs from lots of different systems in BICAN, and we just need to be able to keep track of those. That sounds simple to do, but we've got multiple systems in play, and there are different points in time where things get QCed, so there's coordination between systems that needs to happen. One of the really interesting areas in this discussion around versioning, I think, is the distinction between clusters and cell types; that came up a little bit yesterday. So the clusters that come out of algorithms, and the taxonomies or graphs that may come out of the algorithms and may have some additional curation, need their own identification as really important data-driven analysis artifacts. And then the names for those, the linkages to literature, and the manually identified relationships that are known between them are also really important to capture, to assign IDs to, and to preserve. But there, we'll want much more information about human provenance and maybe room for discussion. So versioning those two categories of things separately and linking them will be really important, I think. We don't really have the notion of version 1, version 1.1.1 yet. Maybe that will happen, but we're at an earlier stage of maturity, so at least identifying and providing information about who did what and when is a really important place to start.
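To make the cluster-versus-cell-type distinction concrete, here is a minimal Python sketch in which algorithm-derived clusters and curated cell types each carry their own identifiers and provenance, linked by a separate assignment table so either side can be re-versioned independently. The ID formats, algorithm name, and evidence strings are illustrative assumptions.

from dataclasses import dataclass

@dataclass(frozen=True)
class Cluster:
    cluster_id: str        # e.g. "clust:2024-03-run7:c042"; algorithmic artifact
    algorithm: str         # computational provenance
    parameters: str
    input_release: str     # which data release it was computed on

@dataclass(frozen=True)
class CellType:
    celltype_id: str       # e.g. "ct:L23_IT_glut"; curated entity
    name: str
    curated_by: str        # human provenance
    curated_on: str
    references: tuple      # literature links

cluster = Cluster("clust:2024-03-run7:c042", "leiden", "resolution=1.0", "release:2024-03")
celltype = CellType("ct:L23_IT_glut", "L2/3 IT glutamatergic neuron",
                    "curator:jdoe", "2024-04-02", ("doi:10.xxxx/example",))

# Assignment table linking the two, with its own provenance and evidence.
assignments = [
    {"cluster_id": cluster.cluster_id,
     "celltype_id": celltype.celltype_id,
     "assigned_by": "curator:jdoe",
     "evidence": "marker genes CUX2, RORB"},
]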
LYDIA NG: And Fabian, you have your hand up?
FABIAN THEIS: Yeah. Hi. I wanted to bring in this question of how we will be accessing data nowadays. We've been discussing this, not just in the brain setting, but in the Human Cell Atlas, for I think more than five years, right? Initially, how do you keep track? There are all these distributed data sets. At some point, you start integrating, and you make sure that there's annotation. And at some point, for example, CELLxGENE turned out to be something very useful that a lot of people use, because it's a bit on the standardized side. These things will get sorted out. There will always be different people providing things, and I think it's important that, as long as it's FAIR, you can map between them. For example, in the Cell Annotation Platform, CAP, there may be different ways you would annotate than you might annotate here, but if you know how to go back and forth, that's useful. I think the key question that we haven't really fully understood, and that's why I really liked Satra's talk (I hope I pronounced that correctly), was about the knowledge graph. This is how we've traditionally been thinking, right? We put a lot of things in, we have a bunch of associations, we have the cell types, we have the transitions, we annotate everything in detail, and then we pull it out again. And the flexibility of recent natural language models, which you obviously also mentioned already, really makes it attractive to re-ask that question. If we have something that understands language and the type of questions we would ask of papers, is that maybe really the way we access things? I'm sure many of you have been playing around with this, and you can feed your local GPT with that. And given that these bigger embedding models are really starting to become multimodal, where we throw a lot of these data sets into one thing, they are becoming popular. It's not entirely clear whether they really outperform other types of representations, but for some settings, they are at least able to absorb a lot of variation. I think this could be at least an orthogonal approach. I don't think it will be the solution for all of these, but I think it would be very exciting to put some of these use cases together, with some really good questions that you could ask, maybe ones that are not super obvious, and then compare.
LYDIA NG: Shoaib?
SHOAIB MUFTI: Yeah. I just want to make a quick comment on how we keep track of things, like the discussion we were having earlier. I think one of the things that is really important is to have some sort of reference in the knowledge base. That is the current consensus around whatever aspect you are looking at, right? And when the data comes in, there's always a way to compare it against that reference. When you do the comparison, what's going to happen is that, over time, the understanding will evolve, whether it's the cell type definitions or whatever else we are tracking right now, and there's some mechanism to update the references. So our approach is that the knowledge base should at least have some sort of reference. Over time, there could be multiple references, but there's always something that says: here's the consensus. And that moves the ball forward, because that's how you can capture the current understanding of, for example, a cell type. I think that's one of the key aspects. And then obviously you can overlay some versioning; maybe there's a reference version one, two, three over time, but at least you need to know what the current consensus is.
LYDIA NG: And given that our conversation is so heavily genomics-oriented, I'm going to invite Giorgio, because you volunteered, to talk about how someone who is more interested in pathology and connectivity would make use of this. How would you want to make use of this knowledge base? And what other things would you want to be connected to these types of knowledge bases?
GIORGIO ASCOLI: Yeah. Thank you, Lydia. This is really fascinating. There is obviously a tremendous amount of data that is going to be powerful for the community. At the same time, and I think this has been touched on both in this session and in previous sessions, the trick is really going to be the ability to link dimensions. My experience with curating both databases and knowledge bases is that pretty much any and all metadata that the community is able to put in to enrich the database is going to be used by the users for searches and queries. And the more metadata that can be annotated, the more powerful those searches are going to be and the more usable the resource is going to be. My own experience, both as curator and as user, is that things tend to evolve over time. As more metadata gets added, there is a point where one reaches a critical transition, and the range of queries that can be done exceeds what could have been thought of from the get-go. So it really comes down to the creativity of the users. And a lot of it has to do with the combination of metadata annotation and ontologies, which really requires the use of structured vocabularies as opposed to free text. There are now some pretty good ontologies that can be leveraged at many levels.
For this specific group, I think that one of the key aspects is to be able to expand the anatomical and spatial annotations. So right now, we are all enthusiastic about having the common coordinate framework, for example, for the mouse. And that's been a game changer. But of course, that's just one way to slice the pie. There are many others that have been discussed, and it would be nice to be able to link them together. Within the molecular domain, one can imagine, for example, usage for disease relationships. And there are many resources that can be used to link the genes to the diseases, both in terms of animal models, but also genetic mutations, and so forth. So imagine a user that is interested in finding cell types that are linked to specific diseases. And so one links some sort of metadata annotation of the genes themselves to those diseases.
So I'll say just one more thing, on a positive note, which is that I used to emphasize annotation. I think that with the advent of large language models, we can shift the emphasis from annotation to curation. I don't think we can trust ChatGPT, or even specialized large language models, to do the annotation for us, but they can be suggestion systems. There's been a long history in annotation and other fields where the bulk of the searches, extractions, and proposals for terms and annotations are done automatically, and the curator basically has to say whether a suggestion is right or wrong, or choose among a few possible terms for annotation. In our case, for our resources, Hippocampome and NeuroMorpho, that has sped up our operation tremendously, by an entire order of magnitude. So the bottom line is: keep annotating, using any and all resources that are available.
LYDIA NG: All right. I have six hands in like two minutes. So I'm going to take people slightly out of order so I can hear new voices. We're going to do Mike, Anita, Jim, David, Shoaib, and Tyler. Oh, and Bo, you just popped up. Go, Mike. Let's go.
MIKE HAWRYLYCZ: Thank you for the queue-jumping pointer there, Lydia. So I think one of the main issues is this idea of maintenance and updating: what is new, when to add it, how to add it, and the mechanism for doing so. I'm kind of channeling Giorgio here, who has maintained this spectacular database over so many years. But I think we're going to need more automated ways, rather than simple manual curation, of trusting data as it comes in and making adjustments to what is known. I think it's entirely feasible given the state of knowledge engineering and AI, but it's a major challenge that we're going to have to address. That's all I want to say. Yeah.
LYDIA NG: All right, we're going to be in elevator pitch mode. Anita?
ANITA BANDROWSKI: All right. So my elevator pitch for this particular section is that there's a paper that's been put together on the atlas ontology model (AtOM) to standardize the use of brain atlases and tools. I think maybe Tyler was going to mention this, or maybe he had a different point. But there is power in ontologies: power to make sure that we're using the same language, the same terminology, to get to the same set of concepts. And if that set of concepts is a brain parcel, we can still name it, we can still use appropriate terminologies, and then we can compute upon those. So I put the link to this paper into the chat; I hope it makes it into the notes. And I love the discussion that we've just had on ontologies. I think they're incredibly powerful ways to query this data. I'm also in full agreement with Giorgio on using any annotation tools, anything necessary to get us close, and then having professional curation, again, to actually nail some of those things down. That's it from me, very quickly. Thank you.
LYDIA NG: All right. Tim, I think I'm out of time. So I think we need to call this. I'm so sorry. But everybody, there was still the chat. Still put things in the chat and the questions. Tim, or should I take one last question?
TIM TICKLE: Yeah. Anyone who hasn't got to speak yet. There were a couple of people that have not been able to speak yet.
LYDIA NG: I think Bo.
TIM TICKLE: Yeah. So Bo, yeah. Let's get Bo on there.
LYDIA NG: We haven't heard from you in any of the sessions. You have been elevated, Bo. Let's go.
BO WANG: Yeah, thank you. So we have an ongoing project, together with some of the folks in Korea, in which we developed a semi-automatic pipeline that has human annotators in the loop. We use ChatGPT to look at all the publications that share data with some metadata annotations. Then we ask ChatGPT to map some of the metadata to the CZI metadata reference. We then look at the confidence of the mapping, and if the confidence is low, we involve human annotators, and the human annotators make the corrections. That feedback is used to fine-tune some of the underlying models. This kind of human-annotator-in-the-loop feedback really works very well, and it increases the efficiency of the human annotators by, I don't know how many fold, but a lot. We have already curated almost 10 million cells from various small publications that are not collected in the Human Cell Atlas. So this project has given me a lot of insight into how we can incorporate powerful models such as ChatGPT together with human annotators, instead of either just letting ChatGPT do whatever it wants or, at the other extreme, being purely manual. I think we can reach a middle ground in which ChatGPT can be an assistant to human annotators to improve efficiency. Yeah, that's all. Thank you.
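Here is a minimal Python sketch of the confidence-gated, human-in-the-loop mapping Bo describes: a model proposes a reference term with a confidence score, low-confidence proposals are routed to a curator, and the corrections are retained as fine-tuning feedback. The propose_mapping and ask_human functions are placeholders standing in for the LLM call and the curation interface, not real APIs.

def propose_mapping(raw_value: str) -> tuple[str, float]:
    # Placeholder for an LLM call; returns (reference_term, confidence).
    lookup = {"astro": ("astrocyte", 0.95), "mg?": ("microglia", 0.40)}
    return lookup.get(raw_value.lower(), ("unknown", 0.0))

def ask_human(raw_value: str, suggestion: str) -> str:
    # Placeholder for a curation UI; here we simply accept the suggestion.
    return suggestion

def map_metadata(raw_values: list[str], threshold: float = 0.8):
    accepted, corrections = {}, []
    for value in raw_values:
        term, conf = propose_mapping(value)
        if conf >= threshold:
            accepted[value] = term                    # auto-accepted
        else:
            fixed = ask_human(value, term)            # routed to a curator
            accepted[value] = fixed
            corrections.append((value, term, fixed))  # feedback for later fine-tuning
    return accepted, corrections

print(map_metadata(["astro", "mg?"]))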
LYDIA NG: Thank you. Okay. Let's switch over, Tim, to the next theme.
TIM TICKLE: Awesome, yeah. And then thank you Bo. If you have a link to that or any kind of thing that we can add to our notes to let people know about your project, that sounds really exciting. All right. So thank you so much for the awesome engagement, everybody. Doing great. So we're moving on to our next theme. This theme is focused on within BICAN Group and project synergies. And so I want to first invite GQ Zhang, Distinguished Chair in Digital Innovation, Vice President and Chief Data Scientist at the University of Texas Health Science Center at Houston. Please go ahead and take it away. I can see your presentation, and it looks like it's ready to go now. I'm not sure I hear you though.
GQ ZHANG: Mic on.
TIM TICKLE: There you go.
GQ ZHANG: Mic on. All right.
TIM TICKLE: Mic's on.
GQ ZHANG: Thanks, Tim. Yeah, great discussion. And also, stepping back from yesterday's workshop and the earlier parts of today, it looks like we are talking about a lot of different opportunities, challenges, and interesting topics. When you already have the data, what do you do with it? How do you leverage it? And how do you create the data-to-knowledge pipeline? But I'm going to step back and look at how we can engage the broader community across BICAN and across bridges to other brain communities of interest. So I'll cover a couple of slides by—
TIM TICKLE: You went mute, GQ, unfortunately. If you could unmute. There you go. All right.
GQ ZHANG: Yeah. And what are the BICAN components and ongoing efforts, so that people can understand them? And then we pose the question of how best to realign, reorganize, or mobilize the community to work on things of common interest. So there are at least five, six, seven different components. At the institutional level, we have different UM1s at different universities and institutes. BICAN itself is organized with working groups, task forces, and steering committees, and that's actually a great initial design that has helped us get this far. These working groups and task forces are driving different components of the operational infrastructure, including the brain bank, the IMS system, NeMO, BL, the Seq library portal, and the specimen portal. We've discussed how the tissue metadata is incorporated at the point of data collection and data generation at the brain banks, library labs, and sequencing centers. And then we're looking at the data flow and data pipelines, from APIs to molecular data processing and alignment. Those kinds of tasks are necessary for moving data to information. And then there are the resources required to do this. Of course, when we're dealing with human tissue, it's precious, and it's necessarily limited in many different ways. Nobody who donates their brain is perfectly healthy, right? So we need to coordinate the activities of the different resource components so that we can crosstalk and the data makes sense as it flows from one place to another.
And then there's the importance of standardization, which we've already touched upon: not just the metadata standardization and ontologies, but also the standard operational processes for annotation at the different lifecycle points, the common coordinate framework, and QA/QC. All of those important activities need to be tightly integrated into the entire ecosystem. We also need to consider FAIRness, not after the data are already generated. Sometimes that's too late, especially for such a big project; if we consider FAIRness afterwards, it's much more challenging. So it's very cost-effective for us to spend more time focusing on how to make sure FAIRness is enforced at the point of data generation, de novo generation, I would say, because such data never existed before, and we're actually generating new resolution and new data modalities through this project.
And then there's the tracking of all of those activities together, using dashboards. We've talked about identifiers all the way from the donor to their cells in the downstream analysis, different releases of data, a variety of tools, compatibility, interoperability, and versioning of almost everything we do. So these are important aspects we incorporate in the BICAN. To highlight a couple of examples: one is the drivers for BICAN. We not only ask what and how, but we asked why as well, early on. What's our scientific mandate? What are the technical requirements and operational processes that allow us to achieve those objectives? Who are the stakeholders, and what kinds of roles do they play? And when we create portals and databases, what are the requirements, what are the artifacts, and who is going to use them? So it's being tested, prototyped, and iterated as we go. It's not fixed and frozen in time, but continuously enhanced, updated, and expanded with each round and each release.
TIM TICKLE: GQ, you're at the five-minute mark. It'd be helpful if you summarize at this point.
GQ ZHANG: Yeah. So we have an example of collaboration through NeMO, which pulls together different components and generates the sequencing resource. That's the first data commons, if you wish, for BICAN. And then, in general, we have this dimension of upstream and downstream. The more upstream we are, the more standardization and centralization we need. The more downstream the activities are situated, maybe the more variety there is. But we still have opportunities to go back and mobilize the community, to reorganize potentially, so that we can be better synergized to tackle all of those interesting issues and challenges.
TIM TICKLE: Awesome. Thank you so much, GQ. Thanks for setting us up there. So for the next about 10 minutes, we're going to talk a little bit about the kinds of things, for instance, goals that can be used for synergizing organizational frameworks, challenges that come into play with coordination, what types of new projects can be created from our coordination. And really, for me, I think it's important that as we do this, we're all here to support science and accelerate science, right? And so really having goals and being driven by goals help make sure that we're focused on value in science. And so maybe for our first question, what goals can be set to synergize and mutually benefit projects across the consortia? And Mark, thank you.
MARK GERSTEIN: Sure. So what I thought I would share is a little bit of experience I've had in a variety of other groups and consortia and how that might help inform what people might want to do in BICAN. I just want to compare two situations. I was in this project called DOE KBase, which was a very integrated project to construct an integrated software system, very, very highly integrated. And at maybe the other extreme I'd point to the 1000 Genomes Project, which also had an integration component but was a bit flatter, where there was more of a sense of people just coming up with things that use the data. I think both of these types of projects have advantages. Of course, standardization and integration is always a good thing, but sometimes it gets very rigid. And if you have a very top-down setup and you make a mistake at the top, then the whole thing suffers from that, which is especially true for a project that involves a lot of people and goes on over a long time. Sometimes that's not the greatest thing to have. Sometimes it's good to let people do their thing and not have very strong goals, but just have a framework where people can come up with ideas and express them. So I just wanted to put that out there. I mean, there's a whole range of projects the government has funded at different levels of integration.
TIM TICKLE: Thank you, Mark. Owen?
OWEN WHITE: Yeah. In a sense, this might be leveraging off of what Mark said and what many people have said. When I go back to, for example, where Satra started, with the idea being to derive knowledge from this, I don't know what the answer is. But in a sense, what I think we'd all benefit from is looking at our consortium and thinking of it in terms of what we would do if we had unlimited resources for the integration, which of course isn't true. Even then, we're going to be boiling the ocean unless we have a really clear idea of what we want to do with it. So what I long for, most of all, is a clearer understanding of the use cases that we want to achieve; then it would be easier for us as a consortium to build towards them. And I know that's extremely hard to do. But when I think about something like the human reference genome, they were really fortunate in that they didn't actually have to think about the use cases in the same way that we do. It's almost as though we suffer from an abundance of opportunities here. Being able to come at it by thinking a little bit about what we want to achieve with these data sets we're integrating would certainly, in my mind, just as a sort of blue-collar engineer, make it easier to understand what we're trying to achieve and where we want to go with this. So I'm just going to put that out there and see if other people agree with it.
TIM TICKLE: Thank you, Owen, for focusing the point. I really appreciate that. Jimmie?
JIMMIE YE: I think we've had several discussions about coming up with some standards on inputs, like metadata, but I think it might also be useful, as a sort of middle ground between a fully integrated and a fully distributed study, to think about standardizing the outputs for each group that gets funded. Why would you want to do that? Well, I think a great middle ground is meta-analysis, right? You can gain a lot of insight from many, many small studies, and certainly that's something the human genetics community has done really well: powering these studies where you still have lots of different groups, each of which can do their own thing, but which agree on a set of outputs that can then be meta-analyzed to extract additional information. It's just food for thought and lessons learned from that community that could be useful here as well.
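To illustrate what standardized outputs buy you, here is a minimal Python sketch of a fixed-effect, inverse-variance meta-analysis: if every group reports an effect size and standard error for the same contrast, the estimates can be combined without sharing raw data. The study names and numbers below are made up.

import math

# (study, effect size beta, standard error) for one gene in one cell type
studies = [("study_A", 0.42, 0.15), ("study_B", 0.31, 0.20), ("study_C", 0.55, 0.25)]

def inverse_variance_meta(results):
    # Fixed-effect meta-analysis: weight each beta by 1 / SE^2.
    weights = [1.0 / (se ** 2) for _, _, se in results]
    beta = sum(w * b for w, (_, b, _) in zip(weights, results)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se

beta, se = inverse_variance_meta(studies)
print(f"combined effect = {beta:.3f} +/- {se:.3f}")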
TIM TICKLE: Yeah, I think that's really clever, Jimmie. It's like making sure that your interfaces are standardized so that you can be working off of whatever each other is producing and creating. That's super smart. Satra?
SATRAJIT GHOSH: I wanted to piggyback a little bit on Owen's point about use cases and bring in the word nails. I mean, we all have various nails, and we're building all kinds of hammers that we are doing. And in many cases, we're kind of recreating hammers that could have served that common nail. In that cross-consortial effort, is there a way - and I don't know how yet - to bring what nails and hammers are there as a broader exposure across the community?
TIM TICKLE: Awesome. Thank you. Jim?
WENJIN JIM ZHENG: So I need to take the chance to talk about LLMs. I think we need to think about what's going on now, right? Before, there was a lot of ontology development and all those kinds of things. But look at what large language models have achieved. There are a lot of things that people supposed you could achieve only if you had ontologies, inference, and all of those kinds of things, right? But in the end, now, data wins. If you have a lot of data and a lot of tags and you train a large language model, the model can actually accomplish a lot of things. So I really would like to bring this up and ask people to think about what we can do with that. I mean, certainly ontologies, standards, metadata, those kinds of things are still very, very important. But I think there is a change in the game. I would like everybody to look into that and think about it.
TIM TICKLE: Yeah, I think that echoes Bo and Fabian's points as well. That's really great. Thanks, Jim. GQ?
GQ ZHANG: Yeah, I want to just pull back a little bit to one of the synergy topics. Of course, there are opportunities for additional synergy. But one of the questions, given all the discussions and interest in a variety of directions, is: should BICAN reposition or regroup in some ways? Is our existing organizational structure best suited to tackle all of those interesting topics? We're still looking at the topics. But then how do we mobilize the entire scientific community to tackle them? Is it by working groups, task forces? From my own experience in the development of the specimen portal and the Seq library portal, I find that that serves one part of the purpose. But to create a tool designed to be future-proof and adapt to changes, adapting to dynamic requirements, to data versioning, to test data versus production data, and so on, I needed to operate on an office-hours kind of basis, with group meetings open three times a week, to interface with the large variety of different interesting topics. So are there any suggestions for additional modes of interaction and ways of organizing the community to tackle those challenges?
TIM TICKLE: That's a great question. And please, Bruce will end up this session. He'll be our last voice. So if anyone has any answers to GQ's questions, feel free to use the chat or the answers—
BRUCE FISCHL: I don't have an answer to GQ's question.
TIM TICKLE: No, don't worry.
BRUCE FISCHL: I want to follow up on Bo's point, which is that I think one of the great things about BICCN, and BICAN after it, is the development of these amazing imaging technologies, where the resolution, the field of view, and the contrast have just increased. We're able to label things that we couldn't see before in large sections of the human brain or other mammalian brains. And that increases the importance of these algorithm-assisted annotation tools. Satra and I saw a talk by a group at MIT that's developing one of these things, and they're great. As Bo said, they increase your efficiency by at least an order of magnitude, probably more. And in 3D data, they increase your accuracy as well, because human beings are terrible at reconstructing 3D structure from serial sections. We have no evolutionary reason to be good at it; in fact, we're terrible at it. But I think we have to be cautious about bias. It's kind of a classic bias-variance trade-off: if we're embedding some knowledge into these models that are helping us with the annotation, and if that knowledge is biased in some way, then we're going to be paying a price for the algorithmic assistance. And maybe that price is one that we'll be happy to pay because the variance has decreased so much. But I think it's incumbent upon us to try to quantify what that bias is.
TIM TICKLE: Awesome. Thank you so much, Bruce. And so now we're going to move back to Lydia for our last theme. Lydia, I don't think I'll need all five minutes for the last closing. So I just need a couple of minutes if you need a couple of extra minutes for yours.
LYDIA NG: Thank you. So now we're going to switch from synergy within BICAN to synergy between BICAN and the rest of the community. So Jeremy Miller, who's a senior scientist at the Allen Institute for Brain Science, is going to give us five minutes of quick intro to this topic. Jeremy, take it away.
JEREMY MILLER: Thanks, Lydia. I don't actually need five minutes, but it looks like someone else added a couple of slides at the end of my motivation. So it might be a team effort with a surprise helper. We'll see. Let me share my screen. Can you see the slide?
LYDIA NG: Yes.
JEREMY MILLER: Okay, great. Yeah, so I'm going to give one quick example and then kind of an overview, and then I'll be done. The one example I want to give about why we need to talk outside of just BICAN is the example of microglia changing in Alzheimer's disease. I picked three UMAPs from three different papers that talked about different kinds of microglia changing in Alzheimer's disease. At this point, there are probably dozens of papers that have shown this. DAM stands for disease-associated microglia. So in this middle one, there's a good chance it's this red population here that is the population of cells changing with Alzheimer's disease. But if you wanted to then compare between papers, say, to look for the same kinds of microglial cells here, it would be hard. You'd be hard pressed, looking at these UMAPs, to tell, first, which populations identified in these different papers are even the same. And second, even if I had listed on here which ones are changing with disease, whether they are actually the same populations that changed. This is just one example, but this kind of thing comes up all the time. And it's really important that we're all speaking the same language, not just within BICAN, but with everyone in the field. So standard cell type names would make for an easier at-a-glance comparison between studies. And if we aligned on a community standard, that would remove the problem entirely. But the question is, how do we do this within BICAN and externally?
And so we're BICAN; we're creating these brain atlases. I have one adult brain here, but of course, we're also doing development. But we're also not the only project that is creating these whole-brain atlases. There are ones from other countries that I know I probably don't talk to as much as I should, and I don't know about others on this call. There are also all these disease consortia. So how does the brain look when there are things that aren't found in healthy brains? I put a few examples down here, but this is not even close to all of the different disease consortia, not to mention disease-focused labs. And the brain's not the only organ. So there's Human Cell Atlas, HuBMAP, SPARC, and other groups that are studying the whole body. And since the brain is obviously an organ that's found in the body, we want to make sure that when we're talking about brain cell types, these other groups are ideally talking about the same brain cell types. And then so I think the challenge that we've already started really talking about is do we need a Google Translate when we're talking with these different groups? Or can we use things like large language models or other things to kind of make sure that we're speaking the same language? So that's really all I had to motivate this. Did someone put these slides--?
DAVID HAUSSLER: Sorry, yeah. Sorry, I put these in a few days ago, but I didn't touch base with you. I just wanted to mention there's the CIGENE project that has a lot of similarities. And we're working very hard to automate the process of getting the data in and using the same archives. So we're very interested in a semantic translation between the projects. And if you advance to the next slide, I wanted to make the point that we need to work very hard on getting more automated data transfer into all of our databases. I will make the prediction that AI will completely dominate the area of knowledge representation in these projects in the next few years. And that requires very large data sets that are available for indexing. And we want to work together. That's all.
JEREMY MILLER: Great. All right, I'll stop sharing.
LYDIA NG: I guess I've got to unmute myself. Okay. Let's pick one of the questions. So Jeremy listed some of the consortia that exist out there. What are some of the synergy points that either people are already working on or would like to see between some of the consortia already listed there? Jeremy, how about you talk about specifically one of the other consortia that you're involved with and what you hope to achieve with BICAN? And hopefully that will motivate other folks with different ideas.
JEREMY MILLER: Yeah. That sounds good. Yeah. So I mean, one thing, we've been trying to get the brain cell types that we are generating as part of BICAN to be the brain cell types that are available in the HCA and HuBMAP. And so some of the papers that have come out of, I guess, BICCN at this point, and that will ideally continue into BICAN, are a part of the nervous system network on Human Cell Atlas. And so there's some overlap there in terms of the same kinds of cells that are being used there. For HuBMAP, we're putting the cell types into their anatomical structure, cell type, and biomarkers table, I think it's called, as well as trying to get the anatomical structures that we're going to be using in BICAN into their HRA alignment system. I can't think of the exact name of it at the moment. And we're also working with a couple of community consortia like AMP-AD and PsychENCODE to try and get at least the critical areas that are typically used in disease to kind of work off at least some versions of the same brain cell types. And so I think that's the biggest thing that we're doing so far is just trying to make sure that we're all talking the same language - I know this has come up a few times - and working with these groups so that things like data formats are the same and that we're using the same data sets and all of that good stuff. So I think I'll stop there.
LYDIA NG: Evan?
EVAN BIEDERSTEDT: Am I muted? Hello. No, I have to agree with Jeremy. I mean, bringing up the HCA, I'll put on my HCA hat for a moment, and my tool-building hat. It'd be useful if we adopt some of the community standards that have evolved and arisen over the past couple of years. I mean, we've been trying for several years within the HCA to coordinate the data, and it's been a bit of a challenge. There are lots of lessons we can learn from this that we can discuss. So simply using the same file format and the same definitions for certain metadata fields, and encoding them in the same way in the file formats, would be really quite useful. You'd be able to use the file formats in different data portals quite easily. Clear standards are something that we would need as well. Clear schemas that are published for the consortia, with metadata fields that are motivated by the use cases and clearly defined, would probably be the first place to start. At the moment, I was a bit worried with some of the metadata discussions earlier. It feels like we were sort of reinventing the wheel that we had discussed back in, I don't know, 2017, 2018. Brian Aevermann is here. I mean, the CELLxGENE standard is really quite widely used by computational biologists in the field. They encode organism in a specific way, tissue, disease. If BICAN were to disagree with this, it would create sort of massive schisms within the field. So somehow motivating how you're defining these metadata fields would help me, as someone building software, to actually work toward these synergies. That was my point. Thanks.
LYDIA NG: David?
DAVID OSUMI-SUTHERLAND: Hi. Yeah. I guess a couple of things. One is I'll just second Evan there and also say that we're developing common standards and trying to at least align with the standard developed by CELLxGENE that's so widely used. We also developed a standard, in collaboration between BICAN and the annotation platform that Evan works on, for expressing annotations themselves. Okay. But the other thing I wanted to-- I was prompted by Jeremy's talk. So talking about standardizing cell types, do you think that a standard reference for a cell type will consist of some kind of standard reference for a cluster and some kind of standardization of how that might be used with annotation transfer tools? Because if a lot of this depends on standards for what it actually means to define cell types by data, I think that's a critical thing to at least start discussing and come to some agreement on. And those are my two points. I don't know if, Jeremy, you have anything to say about data-defined cell type standards?
JEREMY MILLER: I mean, I think I don't have anything big to add. I just agree we need the standards. I don't think it really matters what they are as long as we're all using the same one.
DAVID OSUMI-SUTHERLAND: Yeah. I suppose on the final point, though, it's more a matter of moving out from text-based definitions, or maybe image-based definitions for anatomy terms and things like that, from ontologists to-- the whole point of something like BICAN is to define cell types with reference to data. How do we make those links between ontology terms and data in clear ways that people can reuse in their analysis, and actually give at least some recommendation about what it means to define a cell type by using that data? So as someone who does analysis, I wonder whether you might want to comment on that. If you were coming in cold, and you wanted to define your cell types in terms of the cell types defined by BICAN, how would you like to see those defined and maybe linked into a broader sense of what cell types are?
LYDIA NG: I think that might be a little bit of an open-ended question. We have three minutes and three names.
DAVID OSUMI-SUTHERLAND: Okay. All right.
LYDIA NG: Yeah. And I know you guys are doing a lot of work in that area. And then as BICAN moves along, I think that we will start to answer the question that you posed. So the hands I have are Liz, then Brian, then Tyler and Giorgio. Elevator pitch mode, please. Thank you.
ELIZABETH KIERNAN: Maybe this will partly address that large question, but I just want to go back to something that Brian was emphasizing earlier: when we're working across consortia and all these different portals, it's really important to have a way to track the actual projects. If you ask the question, "How do I know that I'm using the same cell types?"-- well, being able to identify where data came from and what project it came from means you'd be able to go back and maybe find that answer. Yeah, so I just wanted to emphasize that that would make things a lot easier. And it's also very important for data updates. We know that data is always being updated and reanalyzed. And what happens now is we get data in multiple portals that may not track with where the original data is from and how the original data has been updated. So having clear identification across all of our consortia and all of our different portals and tools would be really helpful. That's all.
LYDIA NG: Brian?
BRIAN AEVERMANN: Yes. Thanks, Liz. Saved me some talking. So yeah, I think one of the main reasons why the CELLxGENE schema has been so successful is our adherence to a lot of validation, right? So the minimal schema that we have in place is only 14, 15 fields. It's mostly sample metadata. So organism, disease is in there. Each of the fields has either an ontology behind it, or it is a controlled vocabulary. Almost all of them are ontology-based. And so basically, we have a validator. I can put the-- it's open source on GitHub. It's Python-based. And what it does is it validates the h5ad obs fields, looks for the relevant URIs to the ontologies, and then populates those with text labels. And so this sort of framework really works well in computational environments because you don't have free-text fields where you end up with a whole bunch of competing responses. This way, we have a lot more control, and you avoid a lot of downstream wrangling for the computational biologists. So yeah, thank you.
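As a rough illustration of the kind of check being described here -- this is not the actual cellxgene-schema validator, just a minimal sketch assuming an AnnData file whose obs columns carry ontology term IDs; the field names and accepted CURIE prefixes are assumptions for illustration:

```python
# Minimal sketch of ontology-ID validation on an AnnData obs table.
# NOT the cellxgene-schema validator; it only illustrates the idea of
# enforcing controlled, ontology-backed metadata fields.
import re
import anndata as ad

# Fields assumed to carry ontology term IDs, and the CURIE prefixes accepted for each.
EXPECTED_FIELDS = {
    "organism_ontology_term_id": ("NCBITaxon:",),
    "tissue_ontology_term_id": ("UBERON:",),
    "cell_type_ontology_term_id": ("CL:",),
    "disease_ontology_term_id": ("MONDO:", "PATO:"),  # PATO:0000461 is "normal"
}
CURIE_PATTERN = re.compile(r"^[A-Za-z]+:\d+$")  # e.g. "CL:0000129"

def validate_obs(h5ad_path: str) -> list[str]:
    """Return a list of human-readable problems found in the obs metadata."""
    adata = ad.read_h5ad(h5ad_path)
    problems = []
    for field, prefixes in EXPECTED_FIELDS.items():
        if field not in adata.obs.columns:
            problems.append(f"missing required obs column: {field}")
            continue
        for value in adata.obs[field].astype(str).unique():
            if not CURIE_PATTERN.match(value) or not value.startswith(prefixes):
                problems.append(f"{field}: unexpected value {value!r}")
    return problems

if __name__ == "__main__":
    for issue in validate_obs("example.h5ad"):  # hypothetical file name
        print(issue)
```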
LYDIA NG: Thank you. Tyler?
TYLER MOLLENKOPF: A quick comment on two ontologies in particular, really an observation. I love your comments, Brian. And I think especially with HCA, those ontologies have been really powerful. Uberon and, I think, the Cell Ontology in particular seem really great sort of across the whole body. But my observation is that when I look at those ontologies, the language used there and the structures used there just don't really seem familiar to the way I hear neuroscientists talk all the time. They just don't seem to relate. And I'm not sure what the gaps are, particularly in the brain, for why those ontologies are so unfamiliar, but it feels like a place where maybe the technology side is leading the science side rather than the other way around. And I think if we're going to succeed in leveraging and improving those ontologies, which do seem critical, we're going to need strong bridges to the ontology builders there and maybe work out some agreement about how to get the brain to be as useful as the rest of the body in those two cases.
LYDIA NG: All right. Giorgio.
GIORGIO ASCOLI: Yeah, I'll just say something that I think is very important to mention, which is that usability is a huge aspect if we want BICAN to eventually interact as much as possible with resources in the rest of the community and the community at large. So that means having interfaces that are intuitive and usable, bringing users and specialists into the fray, and having use cases where users can actually try things and give feedback, both in terms of web accessibility and eventually an API for machines as well. But I think having intuitive accessibility in the end might be one of the most important criteria for how much this will be used and how impactful it will be in the community.
LYDIA NG: All right. I think we've hit time. We've got one minute to wrap up. Tim, over to you.
TIM TICKLE: All right, everybody. Panelists, you've been great. I mean, it's just been amazing to be one of your moderators. And just thank you so much for your time and your thoughts today. So a big thank you for our panelists. Thank you for being so engaged. Thank you also for Cindy, Anita, and Jeremiah for the note-taking. It's super important. It's going to help us capture this and act on things. So thank you so much for doing that. Lydia, you've been awesome. Thanks for co-moderating with me.
JOHN SATTERLEE: We're going to move on to panel two, Brain Cell Atlas Showcase Demonstrations. And this is going to be led by Dr. Mufti and Dr. Mollenkopf, both from the Allen Institute. And again, they have panelists and note takers as well. And this is going to go until about 3:10, I believe.
SHOAIB MUFTI: So mainly, the goals for us in this session: we want to give everyone an overview of some BICAN tools. I'm going to show a slide where you can see there are a lot more tools in BICAN than we are able to show today, but we want to give you a little bit of a flavor of some of the really interesting tools available to BICAN and collaborators. And then, as I mentioned, there'll be hands-on breakout sessions to get more familiar with the tools. And really, the goal here is to start a collaboration discussion within the BICAN community, but also with the other consortium communities who are attending today. What we want to achieve is to start a dialogue around what the ways are to collect user experience, user-based annotation, and proofreading to enhance data resources and product designs. As I mentioned, there are multiple tools. So we want to start with asking what tools are applicable and what tools can be enhanced so that they can be used widely within the community. How will BICAN data archives and the Knowledgebase link and integrate with other data and knowledge resources generated by the research community? Dr. Hawrylycz did a nice presentation on the BICAN Knowledgebase. We heard about the data archives. But there's an opportunity to bring more data in, and at least the infrastructure we are all collectively developing can be leveraged across the wider community. And also just start thinking about where the areas to collaborate are in terms of tools and technologies, and what actions we should take. So hopefully, listening to these presentations and also in the breakout sessions, we can start that dialogue. And then as we move forward, we can continue that discussion.
In this session, I'm Shoaib Mufti. I'm with the Allen Institute. And I'm co-moderating this session with my colleague from the Allen Institute, Tyler Mollenkopf. And then we have panelists who are going to demonstrate their tools: Jonah Cool from CZI; Hanqing Liu, who is at Harvard and is going to show a tool from the Ecker Lab, the whole mouse brain atlas; and Mukund Raj from the Broad, who is also going to demonstrate their tool. So just a quick overview of the agenda here. I'm doing the introduction, followed by a CELLxGENE demo by Jonah, and then the mouse brain atlas from Hanqing. And then Mukund will show BrainCellData. And then Tyler will demonstrate our Brain Knowledge Platform, which is a collection of tools including the ABC Atlas, Map My Cells, and other things we are thinking about in that area. Then we have a quick break. And then after the break, we are going to move people to the breakout sessions. So you can pick the session you would like to attend, and there'll be instructions on how to join these breakout sessions.
So if you look at the variety of tools available to BICAN and other consortia as well, they are helping in many ways. This is definitely not a complete list, right? So they're helping us to figure out the counts, like how many cell types are in the whole human brain and how they're related to each other. They're helping us understand the distribution: how cell types are distributed across the whole human brain, and which types or subclasses best characterize different brain regions. They're giving information about multimodality: how do electrophysiological and morphological properties further characterize transcriptomic cell types across the human brain? They're giving us information about anatomy and evolution and development. So these tools are very, very helpful for scientific research as we use them. There are use cases researchers use them for: they try to find data and knowledge, view and explore data, and analyze and annotate data by gene, brain region, species, disease, and modality. And also these tools are helping us to download the data. And there are the tools which, I believe, Tyler is going to talk about, like Map My Cells, where you can sort of bring your data and compare it against the reference. And we heard in the last couple of days there's a lot of interest in natural language processing and machine learning and LLMs, right? And these are up and coming, which is going to increase the use of these tools, right?
Now, not only are you visualizing your data or interacting with it, you can really start asking natural language questions in a chat box. And these machine learning approaches can really increase the effectiveness of these tools as we move forward. So I just want to give an overview of the variety of tools. It's an amazing amount of work the community has done in terms of developing different aspects of technology that help scientific research. There's a variety of external applications to search and access; data catalogs and archives are there where you can go and find what data exists. There are a lot of tools for analyzing and integrating, and this is not a complete list; if you look at it, there are more tools for analyzing and integrating the data. So today, as I mentioned, we are going to show you CZI CELLxGENE, the BrainCellData Viewer, the Ecker Lab platform, and the ABC Atlas. But there are many, many more tools out there that can help researchers.
And then there's platform development, like the Brain Knowledge Platform, which we talked about this morning, but there are other platforms under development as well, which are helping to bring all these things together and create an end-to-end workflow. In terms of data operations, there are also many more capabilities which we are developing in the community, from data generation on the back end -- for example, the specimen portal for sequencing data. There are pipelines like Terra, which Tim is developing, which help us process this data. There are modeling capabilities, and people are using a lot of them, like LinkML and OWL, and there are CCFs, right, where we can look at anatomy. And then a large infrastructure has been built in terms of storage and computing, whether it's the NeMO archive or BIL or DANDI. So that's really helpful. And then there are a lot of initiatives on how to make this data FAIR -- findable, accessible, and so on. And there's a lot of work going on in terms of that. So if you look at it, there's a huge amount of effort going on. And I think there's an opportunity to bring a lot of these things together and enhance it as a wider community.
I just want to maybe touch a little bit on the Brain Knowledge Platform, which Dr. Hawrylycz talked about this morning. And the heart of the Brain Knowledge Platform is the Knowledgebase, which is here at the bottom and which is funded under BICAN. So the whole notion of the Brain Knowledge Platform is an end-to-end workflow where you have a common database and you have tools for interacting with it, right, from exploration, analysis, and mapping, and it's also seamlessly connected with the pipeline as well. And really the goal is the thing which Dr. Zeng showed the first day: to create a brain-wide cell type map across species, modality, space, and time. So the question is, can we achieve that goal by creating this toolchain or platform, which can help us process data end-to-end from generation to publishing? And I think there's a great opportunity, as we work in the community here, to connect this Knowledgebase at least to some of the other tools that are under development. It's not a tight integration, but at least there's a common point where you can come in and access all these wonderful tools out there. Because when I was trying to put together these slides along with my colleagues, I was trying to find all the tools that are there, and it would be wonderful if you could go to a common place and find all the tools out there -- thinking of it more like an app store connected with a centralized database.
So that's something we are thinking and talking about. And I think the opportunity here, as we have this dialogue, is how we can sort of bring these things together so they're accessible and findable and the community can use them. So there are some useful links I just wanted to add to my slides, so if people go back and refer to them, they can find a lot of these tools. Again, this is not a complete list, but I just wanted to put some links out there in case people want to follow up and look at these tools after the session today. So I'm going to stop sharing here. And now I'm going to hand it over to Jonah to talk about CELLxGENE from CZI.
JONAH COOL: Great. Thank you very much. Can everyone see my slides? Does that look okay? Yeah, okay, great. Thank you very, very much for the invitation to present. It's really nice to follow the last panel, in which I think there were a number of mentions and acknowledgments of different efforts within CELLxGENE, and I'm going to try to very quickly summarize those with a particular emphasis on the Census. I will apologize in advance: there's a lot of work here that gets summarized pretty quickly. And so really what I'm hoping to do is also provide some pointers and information and maybe intrigue people to join us in the breakout session to discuss a little bit more or follow up. Okay, so two specific challenges, just to state these and share what we're really focused on and have been grateful for the collaboration and partnership with you all within the BRAIN Initiative, previously the BICCN and now BICAN going forward, to work on, as well as many other consortia. The first is just the sheer scale of the assays and data sets that are being generated currently, as well as anticipated in the future, and needing to create mechanisms by which those data sets can be shared and dynamically and easily explored by a plurality of scientists, whether they want to do so at the command line or simply in the browser. And so already within CELLxGENE, what I'll show you in a few minutes is that the data corpus, now actually over about 60 million cells worth of data, is really too large to use -- or certainly too large to use in any sort of convenient and rapid way.
And in particular, what I'll talk about with Census is trying to accelerate computationally based applications and discoveries in cell biology. And this includes efficient access, trying to break down language silos between different computational tool chains, a really big emphasis on the harmonization and interoperability of all of those 60-plus million cells, and in particular, trying to drive forward and support the work from many on this call -- Bo and Fabian and others mentioned this on the last panel -- around the possibilities in modeling and language models and others. And in particular, really thinking about the diversity of different scientists. So our real goal here is to help you all, as well as the wider community, drive forward on reuse and interoperability of multimodal single-cell data. So let me first talk about CELLxGENE Discover. CELLxGENE Discover is a hosted platform. It is free to contribute to. It includes -- and I'm very proud of this, and it's been a great collaboration -- many of the BRAIN Initiative data sets. And we're looking forward to extending this, as well as data from both CZI-funded projects and many other funded consortia and groups around the world. So it is genuinely intended to be a community-based platform that helps promote standardization and sharing and exploration of single-cell data. Within these data sets, there is what we refer to as CELLxGENE Explorer, and it allows you or anyone else to go onto the website here, pull up a data set, and explore that data set via what looks like a pretty familiar, but hopefully highly performant, UMAP explorer, as well as look at things like gene expression across multiple data sets that are present. I'm not going to talk more about this, but again, happy to do so in the breakout or in other venues.
So what the current data corpus looks like is approximately this. The slide is maybe a month or two out of date, but it is a corpus of about 60 million cells, largely human with some mouse data. It includes a lot of brain tissue, brain data sets. Again, acknowledgement, and thank you for the collaboration. Many of these are from the BRAIN Initiative project. It is multimodal in nature. So while the majority of it does indeed come from dissociated single-cell transcriptomics data, it does include other data modalities, including single-cellular data and spatial data, and spatial data is really an important focus for us going forward. And then both the rate of data ingestion, as you can see on the right here, as well as the rate of data access and visitors coming to the resource is really growing rapidly. So I don't know about exponentially, but it's really a rapid growth, and we're seeing kind of regular upticks. So I think it's a great opportunity to distribute and disseminate data to the wider scientific community.
Now, in the last panel, there was quite a bit of discussion about metadata and metadata schema. So based on that, here is, in fact, the minimal metadata schema for CELLxGENE. And there are two points I'll make here. Again, just to reemphasize something that Brian and, I believe, Evan and others mentioned in the last session: many of these fields, wherever possible, are ontologized, and they conform to controlled vocabularies. These are not ontologies that we have developed; rather, we are looking to adopt them and create interoperability among data sets and other consortia. And the other thing that I will just touch upon is that this is the minimum schema, meaning that it is additive. Any additional fields or information for a given study can be included, and we welcome and encourage that. But these are the fields that are validated and enforced. Again, there's a QR code if you want to dig into the schema.
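To make the "ontologized minimal schema plus additive study fields" idea concrete, here is a small sketch of what one row of such sample metadata could look like on an AnnData object. The field names reflect one reading of the CELLxGENE schema and should be checked against the published schema; the extra study-specific column is hypothetical.

```python
# Sketch of minimal-schema-style obs metadata on an AnnData object: controlled,
# ontology-term-ID fields plus an additive study-specific column. Field names
# and the example assay term are assumptions to be checked against the schema.
import numpy as np
import pandas as pd
import anndata as ad

obs = pd.DataFrame(
    {
        "organism_ontology_term_id": ["NCBITaxon:9606"],   # human
        "tissue_ontology_term_id": ["UBERON:0000955"],      # brain
        "cell_type_ontology_term_id": ["CL:0000129"],       # microglial cell
        "assay_ontology_term_id": ["EFO:0009922"],          # illustrative assay term
        "disease_ontology_term_id": ["PATO:0000461"],       # normal
        "sex_ontology_term_id": ["PATO:0000383"],           # female
        "donor_id": ["donor_001"],
        # Additive, study-specific fields can sit on top of the minimal schema:
        "dissection_roi": ["putamen"],
    },
    index=["cell_0001"],
)

adata = ad.AnnData(X=np.zeros((1, 3)), obs=obs)
adata.var_names = ["GENE_A", "GENE_B", "GENE_C"]
print(adata.obs.T)
```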
Okay, so now let me transition again fairly quickly from the corpus of data to a new feature within CELLxGENE that has been developed over the last year that we refer to as Census. And so this is a transition from those individual-level data sets into tools and data objects that make it possible to start to think about cross-cutting questions, in which you might want a specific subset of cells from many data sets, or perhaps want to look at all the datasets or some subset thereof. And so what the Census is, again, is a data object plus an API by which you can access it. And it is a performant data object. It was built in collaboration with a company that we have worked with very closely, called TileDB, to provide this dynamic support. And it is a concatenation of all the dissociated single-cell data sets within CELLxGENE. So currently, that number gets into about 35-ish million cells. And from that concatenation, there is a single object that, again, can then be queried and sliced and then quickly streamed and exported in a language-agnostic way. So if you were looking at a specific question and would like to create one of these slices and then subsequently work on it in Seurat or Scanpy or whatever tool chain you like, the Census will allow you and enable you to do that. If you want to look at all of the data, similarly, it allows doing that and streaming it. And we also have some data loaders for tools like PyTorch and others if you're looking at modeling as a specific use case.
Okay, so let me tell you what the subset of the data looks like. So currently, what is included in the Census in particular. So now this is different than what I just told you about kind of the larger corpus that is available as individual data sets. So it currently includes data from both human and mouse. It is dissociated cell data at the moment, so largely 10X, but also includes other modalities such as Smart-Seq and other transcriptomic measures. I will make a note here that while that is currently where it is, it's a priority for us. And we're thinking a lot about additional modalities and specifically spatial modalities. And so we'd love thoughts, feedback, collaborations, data sets if you're interested in kind of piloting or thinking about this more. It includes raw counts. And then again, to go back to the aforementioned metadata discussion, within the Census object, it includes all of the standardized metadata, but those additive metadata fields may not be included or aren't included. And so there is a little bit of a distinction. But all of that standardized metadata is included in that data object that I mentioned.
Okay, and another couple kind of key-- or key things to mention about how the Census has been done. So one is it is currently hosted on Amazon S3. I know there's been some interest in floating or replicating this across other clouds. It is free to use. So CZI is supporting the storage and bandwidth usage of this resource via that API. And we are also doing regular releases. And those releases come in two flavors. So there are what we refer to as long-term supported releases. These are happening about every six months, and these are stable. And so if you are using one of those, they will be set aside and earmarked for five years. And then we are also in between those long-term releases doing more regular releases that are held for a shorter period of time. So those releases have been about every week. But suffice it to say there's both long-term stable releases and then shorter-term releases for you to work on and also hopefully to, again, create reproducibility as more groups begin to use the resource.
Okay, so let me just give you a very, very, very brief example. So certainly well short of a demo, but hopefully going from some glossy slides into a specific example of what can be done here. So if you were interested in going to the CELLxGENE data object and/or the API and were interested specifically in microglial biology, you could run a query such as the one shared here in this notebook, in which, from all of the data sets that are present, you can look for female microglial cells as well as neurons across all tissues. And in the span of about 30 seconds, get access to the metadata and start to make progress towards these -- sorry, about 300,000 cells worth of data. And so you can imagine other variants that draw on cross-sections based on the metadata in those standardized fields -- sex, tissue type, cell type, or the assay type -- and very quickly get access to those data sets.
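A sketch of what that kind of Census metadata slice can look like in code, assuming the public cellxgene_census Python package and its value-filter syntax. The filter values below are illustrative; the actual notebook query shown on the slide may differ.

```python
# Sketch of a Census cell-metadata query, assuming the public cellxgene_census
# Python package. The filter below is illustrative, not the exact notebook query.
import cellxgene_census

with cellxgene_census.open_soma(census_version="stable") as census:
    cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
        # Female microglial cells as well as neurons, across all tissues.
        value_filter="sex == 'female' and cell_type in ['microglial cell', 'neuron']",
        column_names=["assay", "cell_type", "tissue", "tissue_general", "disease", "sex"],
    )
    df = cell_metadata.concat().to_pandas()

print(len(df), "cells matched")
print(df["cell_type"].value_counts())
print(df["tissue_general"].value_counts().head())
```

From a slice like this, the corresponding expression data can then be exported as an AnnData object and handed to Scanpy, Seurat (after conversion), or a PyTorch data loader, which is the workflow described above.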
Finally, I will end with a couple of pointers. One is that the team is always very, very eager to connect with you if you're interested in using the API or the data object, or are currently using it. Please file issues or reach out to us via email. We also have an active Slack channel. Again, the QR code is here. And then there's a lot of documentation, including example notebooks and a number of other opportunities to understand both the original CELLxGENE work that I mentioned at the beginning, so Discover and Explorer, as well as the fairly rapidly expanding work on Census. And so we really encourage folks, either via the breakout or if you're about to go to another breakout, to reach out and learn more here about the Census project. With that, I just want to quickly acknowledge, in particular, four people on the CZI team that have done a lot of this work: Emanuele, Bruce, Andrew, and Pablo, who has led a lot of the product work. So with that, thank you very much again for the opportunity to present, and looking forward to the breakout and the other presentations.
SHOAIB MUFTI: Thank you. Appreciate it. So next is Hanqing. You're on mute.
HANQING LIU: Thank you, Shoaib. Sorry. Yeah. Hi. This is Hanqing. I previously worked in the Ecker Lab at the Salk Institute. And I'm very glad to have the chance to present today. Since I only have five minutes, I probably want to start very briefly by pointing to our paper: everything I present today, including the data in the browser, is described in more detail in this publication from December of last year. So you're more than welcome to check that out. And just briefly, before I introduce our data visualization browser, I want to explain that in this study, we generated a whole-brain single-cell methylation and 3D genome epigenomic data set, including about half a million cells that cover the entire mouse brain. And in addition to that, we also performed a multimodality data integration. So in one of our figures, we demonstrated the power of this whole-brain multimodality data set at this particular gene, which is a marker gene of a deep-layer cortical cell type that's circled here. And you can see that when combining multiple modalities and also brain-wide cell atlas data, we really need multiple kinds of visualization, including the scatter plot, a 2D genomic heat map for the 3D genome topological data, and a one-dimensional genome browser for the DNA methylation at base resolution or the accessibility data from single-cell ATAC.
And that's why we made this browser that we call the Whole Mouse Brain Atlas, at this particular URL: mousebrain.salk.edu. This browser is capable of visualizing the scatter plots at multiple embeddings using a JavaScript package called Plotly. And it's also possible to visualize the data using the MERFISH coordinates that were very kindly shared by the Allen Institute. At the same time, we can also click the example here to load epigenetic modalities, to create panels that are able to visualize the 3D genome data together with the one-dimensional tracks of methylation and ATAC in corresponding cell types. And in the control panel here, you can easily add a few cell types for comparison. And this particular visualization is achieved through a genome browser called HiGlass.
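For readers who want a feel for the embedding scatter plots described here, below is a minimal Plotly sketch of that kind of view, written with the Python API rather than the JavaScript package the site itself uses. The coordinates, cell type labels, and methylation values are synthetic placeholders, not data from the atlas.

```python
# Minimal sketch of an embedding scatter plot colored by cell type, in the
# spirit of the browser's Plotly views. All data here is synthetic.
import numpy as np
import pandas as pd
import plotly.express as px

rng = np.random.default_rng(0)
n = 3000
cells = pd.DataFrame(
    {
        "umap_x": rng.normal(size=n),
        "umap_y": rng.normal(size=n),
        "cell_type": rng.choice(["Type A", "Type B", "Type C"], size=n),  # placeholder labels
        "mCG_fraction": rng.uniform(0.6, 0.9, size=n),  # per-cell global CG methylation (made up)
    }
)

fig = px.scatter(
    cells,
    x="umap_x",
    y="umap_y",
    color="cell_type",
    hover_data=["mCG_fraction"],
    title="Embedding scatter colored by cell type (synthetic data)",
)
fig.show()
```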
And we provide a few different visualization panels, demonstrated here. For instance, you can load two cell types and compare their differences in each track. And you can also load one cell type and have a zoomed-in view, which is important for getting a global understanding of a gene locus and then zooming in to check specific regulatory elements and their potential chromatin contact loops between enhancers and the gene promoter. Yeah, and that's my brief introduction to the tool. And if you're interested in more details, or more technically in how we use the data structures and tools to achieve this, you're welcome to discuss more in the breakout session. Thank you.
SHOAIB MUFTI: Thanks. Mukund, thank you.
MUKUND RAJ: Hi, everyone. I'm Mukund from the Macosko Lab at the Broad Institute. And I'll be talking about braincelldata.org, which is an interactive web portal to explore the mouse atlas data that was published as part of this paper, part of the BICAN package in Nature last month. There are three types of data available in this portal: the spatial gene expression of one mouse brain hemisphere, the transcriptionally defined cell types, and finally, cell types with spatial localization. In the portal back end, all this data is organized into three types of matrices. There's the number-of-beads by number-of-genes matrix, containing gene expression data from Slide-seq. Then there's the number-of-cell-types by number-of-genes matrix, which holds the transcriptionally defined cell types. And then there's the number-of-cells by number-of-beads matrix, containing cell types in spatial context. All three matrices are available for download via the download link. Additionally, they can also be interactively explored via the three corresponding tabs that you see on this portal: the gene expression tab for gene expression in space, the single-cell tab for the cell types, and the cell spatial tab, which is for the cell types in spatial context.
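As a rough sketch of how those three downloaded matrices could be handled in memory -- the file names below are hypothetical, not the portal's actual download names, and the final product is just one example of how the matrices can be combined:

```python
# Rough sketch of holding the portal's three matrices as sparse arrays:
# beads x genes (Slide-seq expression), cell types x genes (snRNA-seq profiles),
# and cells x beads (spatial cell mapping). File names are hypothetical.
import scipy.io as sio

beads_by_genes = sio.mmread("slideseq_beads_by_genes.mtx").tocsr()      # spatial expression
celltypes_by_genes = sio.mmread("celltypes_by_genes.mtx").tocsr()       # cell type profiles
cells_by_beads = sio.mmread("cells_by_beads.mtx").tocsr()               # cell-to-bead mapping

# Example combination (an assumption, not the portal's own computation):
# pushing bead-level expression through the cell-to-bead mapping gives an
# estimated cells x genes matrix in spatial context.
cells_by_genes_spatial = cells_by_beads @ beads_by_genes
print(beads_by_genes.shape, celltypes_by_genes.shape, cells_by_genes_spatial.shape)
```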
So there are a few different things that you can do with the portal, and I'll be demoing some of them. The most important thing is that you can search for cell types by filtering across multiple metadata dimensions: by region, by defining gene sets, and several other dimensions. You can also search for genes and cells that are enriched in a particular region. And finally, you can spatially plot genes, cells, metaclusters, as well as cell classes. And before moving on to the demo, I just wanted to show that we've been getting a tidy number of visitors so far, with a little over 6,000 users in the past year. Well, let's move to the demo.
Okay, so here's braincelldata.org. As you can see, there are three tabs at the top, and these are the three tabs that we use for interactive data exploration. The first tab is gene expression, for gene expression data in spatial context. The second tab is single cell, to explore the cell data. And then finally, there's the cell spatial tab, which is cellular data in a spatial context. I'll be focusing only on the single cell tab now, and then in the breakout session, I'll talk more about the gene expression and cell spatial tabs. So the single cell tab provides expression data and annotation metadata for over 5,000 cell type clusters in the snRNA-seq dataset. To start, you have to type in a gene name that you're interested in. Let me type one, Rbfox3. And then you'll see the expression of that gene show up as a dot plot of average and percentage values, along with a lot of other metadata. You can also view multiple genes simultaneously; now two genes are loaded. And you can also sort all these different cell types in several different ways. You can sort by the other gene, or you can sort by percent expression or average expression; those options are there. One way that you can find interesting cell types is by region. You can filter by region if there's a region that you're particularly interested in. So let me demonstrate that.
So here, I've picked a region, and what you see is that only cells in that region are showing up. And if I click on this cell type, which is the top cell type based on the average expression for this gene, Rbfox3, then it'll take you to the cell spatial tab and land on the slice that has the highest abundance of that cell type. And on the top row, these are all the slices ordered in the anterior-to-posterior direction. And we see here with this rectangle that you are on the 35th slice from the anterior side of the brain. Going back again, let me show one more region. So I'm picking another region, and you have a different set of cell types. If you click here, now you have the top cell type in that midbrain raphe region showing up. Notice that it has pre-selected the region here. If I uncheck it, we can verify that the cell type is very local to the region that we have selected. So in addition to searching by regions, there are several other ways of searching for cell types. You can filter by all these different metadata options, including brain area, defining gene sets, and metacluster, as well as a few other options that become available. For example, if I want to search for a cell type that is defined by any particular gene, I can just type in the first few letters of that gene. And I get only those cell types that are defined by these two genes that I've picked. I click here.
SHOAIB MUFTI: Mukund, sorry. Just one minute more to go. Thank you.
MUKUND RAJ: Yeah, sure. I'll be done soon. And then here's the cell type that we have based on the genes that we selected. And it looks like it's very local to this area as well. And the only last thing that I want to show is there's a get link option that if you want to share it with someone, just copy that link and then you should see exactly the same view that you were seeing. Okay. I'll stop now.
SHOAIB MUFTI: Okay. Great. Thank you very much, Mukund. So next is Tyler Mollenkopf from Allen Institute.
TYLER MOLLENKOPF: Hi, everyone. Let me get my screen up here. Okay, hi, everyone. My name is Tyler Mollenkopf from the Allen Institute. I'm really excited for this panel, really for the whole workshop. It's been very invigorating. I'm going to talk for a few minutes about some tools that are part of the Brain Knowledge Platform we're working on at the Allen Institute, with a key part of that being the Knowledgebase supporting BICAN. I'll try to cut it a little short, to about 10 minutes, so that we can get right into the breakouts and get folks to jump into some smaller groups and ask questions about all of these great tools. So there are a lot of use cases we're thinking about with the Brain Knowledge Platform, but four I want to highlight here. First, visualizing and exploring whole-brain spatial transcriptomics data. Second, mapping cells from an individual lab against a BICAN reference taxonomy to understand what the cell types are in the user's own data. Third, annotating cell types with things like marker genes, common names, and literature. And fourth, not maybe for the scientists in the group, but for developers here and elsewhere, developing software tools that integrate with BICAN standardized and versioned data.
So for the first one, folks who want to visualize and explore this: what are the kinds of things we're thinking about in particular? We talked a lot about this over the last couple of days, but seeing cells in an anatomical context really helps people orient. A lot of the RNA sequencing and epigenomics are really abstract when you just think about the data, but anatomy grounds everybody. People want to see how single-cell expression data compares with traditional anatomical boundaries or other modalities, and to understand what cell types are in the brain regions they work with and in what proportions. I loved, in Mukund's presentation just now, jumping from one cell type to the slice with the highest proportion. I think that's a great example of this type of use case. A short video here shows how you can do this with the live Allen Brain Cell Atlas that we've got. We're looking at spatial transcriptomics data and can color by genes and inspect gene expression. We are really excited about the ability to look at multiple views of the same data or multiple datasets side by side. So here we're looking at the RNA sequencing data on the left and the spatial transcriptomics on the right, coloring by different elements and filtering down -- in this case, to just glutamatergic cells, or IT and ET cells within the glutamatergic class -- and being able to look at not just the cell types, as we've been talking about, but maybe other information; in this case, some neurotransmitters on the right and some gene expression on the left. So this idea of parallel views, to really understand how dimensions of data relate to each other at the scales BICAN's working with in particular, is something we've seen be really useful for these kinds of cases and are excited about doing more of.
So a quick summary of some of the things there: anybody worldwide can use these open web tools, as with all the other ones we've heard about so far. But what's really exciting here is looking at multiple whole-brain data sets simultaneously, as with the mouse brain tool that Hanqing showed as well. We've got about eight million cells now with spatial transcriptomics data -- or sorry, that should say just with transcriptomics data available. Oh, no, I misspoke, sorry. Eight million cells with spatial transcriptomics for the mouse and about 300,000 for human, with coloring, filtering, and comparing across all sorts of standardized cell types and gene expression. Standardized here meaning, in this case, within the BICCN whole mouse brain. Of course, BICAN standardization is ahead of us, and we know we've got work to do to standardize with tools like Jonah showed with the Census as far as what cell types mean, particularly as we get down to more granular levels. But that's the current state for the mouse and the intent moving forward, and then easily being able to share that with colleagues. There's a quick QR code there if anybody wants to jump in, but we'll talk about this more in the breakout.
Another use case: I've got a large amount of omics data, and doing my own clustering is difficult. It's slow and it's costly to do that. I know BICCN and BICAN have worked hard on useful cell type taxonomies, and I know those are being linked to multimodal features. That would be really useful. How do I connect my data to those cell types? Over the last six months or so, we've released, and now updated and improved in a couple of ways, this tool, Map My Cells. Upload an AnnData file, pick the reference taxonomy, pick the algorithm that's available, and in a few minutes, even with hundreds of thousands of cells, get back cell type labels and probabilities at multiple levels of the cell type hierarchy for your data. This is not the first of these tools to be available, but we're excited to be able to provide maybe greater scale than some of the other tools that are openly web-based, getting up to hundreds of thousands of cells. We're going to push that as far as we can go, browser limits being what they are. There are alternative versions or alternative interfaces, not shown here in the video, for people who want to work in computational environments to do these kinds of mappings at even greater scales, where computing resources can be provisioned separately. The taxonomy that is available now, the whole mouse brain, of course, comes from the joint analysis in the BICCN, but we're in a position to grow that list of taxonomies as BICAN does its work in the basal ganglia and beyond. For mapping algorithms, we are not intending to be an algorithm-as-a-service platform -- there are lots of great ones out there in the world, like Hugging Face -- but we do know that BICAN computational biologists and others will develop more algorithms, and we are set up to take in new algorithms as time goes on.
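Below is a hypothetical sketch of preparing an AnnData file for upload to a mapping tool like Map My Cells: a compact .h5ad with raw counts and gene identifiers as variable names. The exact input requirements (gene ID type, layers, size limits) are not specified here and should be checked against the tool's documentation; the matrix is synthetic.

```python
# Hypothetical sketch of packaging a count matrix as an .h5ad for upload to a
# cell type mapping service. Requirements of the real tool may differ.
import numpy as np
import scipy.sparse as sp
import anndata as ad

n_cells, n_genes = 1000, 2000
counts = sp.random(n_cells, n_genes, density=0.05, format="csr", dtype=np.float32)
counts.data = np.rint(counts.data * 10)  # make the nonzero values look like raw counts

adata = ad.AnnData(X=counts)
adata.obs_names = [f"cell_{i}" for i in range(n_cells)]
adata.var_names = [f"Gene_{j}" for j in range(n_genes)]  # replace with real gene symbols/IDs

adata.write_h5ad("my_cells_for_mapping.h5ad", compression="gzip")
```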
So a couple of quick stats about this: you can get labels for about 330 million cell-by-gene pairs here within just a couple of minutes. Why do we talk about it in that sort of unusual way? Well, depending on how many genes you've got, that affects how many cells you can do, and vice versa. With some of the spatial transcriptomics experiments, where there are fewer genes than in single-cell or single-nucleus RNA sequencing, you can do more cells, and vice versa. Of course, we want to provide probabilities; researchers want to assess how confident different predictions are. And as I mentioned, BICAN's human and non-human primate taxonomies will become available as they're developed. I know it's a little bit of a whirlwind. I've got two more quick topics.
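A quick worked example of that trade-off, taking the 330 million figure from the talk at face value; the gene-panel sizes are just illustrative:

```python
# Illustrative arithmetic for a fixed cell-by-gene budget: fewer genes per cell
# allows more cells per submission, and vice versa.
PAIR_BUDGET = 330_000_000

for n_genes in (500, 5_000, 30_000):  # e.g. spatial panel vs. whole transcriptome
    max_cells = PAIR_BUDGET // n_genes
    print(f"{n_genes:>6} genes -> up to ~{max_cells:,} cells per submission")
# 500 genes -> ~660,000 cells; 30,000 genes -> ~11,000 cells
```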
So as we've talked a lot about over the last couple of days and will, I'm sure, do more of tomorrow, it's not just about getting access to all of the information from BICAN, but contributing back to it. This is going to be a really broad community effort, not just a BICAN one. So how does that happen for people who want to contribute their knowledge about cell types? Maybe scientists in my lab or research area call this cell type by a different name from what is known in the Knowledgebase so far. Maybe I know about marker genes or other cell features that aren't included yet. I want to describe this cell's importance in my research area. Maybe I know some insightful literature that deserves to be linked and elevated here.
We're partnering with David Osumi-Sutherland, who was on the call and had some great comments in the previous panel, as well as Hussein and his team, part of the cellular genetics group at the Wellcome Sanger Institute. They have a tool called the taxonomy-- oh no, I'm going to forget what the D stands for. Data tool, maybe. Apologies, David.
JOHN SATTERLEE: Two more minutes, and then we're going to wrap it up, okay?
TYLER MOLLENKOPF: Yep. Sounds good. This is a great tool for those semantic annotations and literature annotations, and for doing that collaboratively. It's got links into the Cell Ontology, and so as that improves for the brain, this will be even more and more useful. Importantly, it allows space for capturing the rationale behind these annotations, because that type of provenance will be really critical for a useful community-driven knowledge base. Over time, we'll be setting up to compare taxonomies and create linkages. That'll be important within species, but also across species within BICAN. We're in a pilot mode right now with the Human and Mammalian Brain Atlas project, focused on the basal ganglia work, but assuming that goes well, we'd love to expand to the rest of the group here.
Last thing, and I'll do this really quickly. For people who want to develop tools and empower their users to do lots of these use cases -- search and load lists of data and specimen features for data visualization and analysis, get the latest cell type taxonomies or, of course, CCF structures for the human and non-human primate -- those are going to be huge areas of innovation by BICAN, and we want to make those available. We have Brain Knowledge Platform APIs to read and write to the BICAN Knowledgebase. They're in GraphQL for flexibility, but we have Python kits available for more familiar usage. Right now, they're in early stages, so if this is something interesting, you can request access and we'll work with you, but we're working toward those being much more broadly available and documented.
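For developers unfamiliar with GraphQL, here is a hypothetical sketch of what calling such an API from Python can look like. The endpoint URL, query fields, and auth header are placeholders for illustration only, not the actual Brain Knowledge Platform API, which requires requesting access and consulting its documentation.

```python
# Hypothetical GraphQL call using the standard requests library. The endpoint,
# schema fields, and token below are placeholders, not a real API contract.
import requests

ENDPOINT = "https://example.org/knowledgebase/graphql"  # placeholder URL
QUERY = """
query CellTypesByTaxonomy($taxonomyId: String!) {
  cellTypes(taxonomyId: $taxonomyId) {
    label
    accessionId
    markerGenes
  }
}
"""

response = requests.post(
    ENDPOINT,
    json={"query": QUERY, "variables": {"taxonomyId": "EXAMPLE-TAXONOMY-ID"}},
    headers={"Authorization": "Bearer <token>"},  # placeholder credential
    timeout=30,
)
response.raise_for_status()
print(response.json())
```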
JOHN SATTERLEE: Wonderful. Thank you so much. I want to thank all the speakers for this session. All right, wonderful. I hope you enjoyed your demo session breakouts. I did; I learned a lot. I hit two of them. And now what we're going to do is start our last session for day two. It's going to be the summary and highlights of day two, and so what I'm going to do is pass things off to Tim and Lydia, who are in charge of that, and they're going to close out today. So thank you very much, Tim and Lydia.
TIM TICKLE: You should see a summary document. All right, so we're going to start with a summary document, and we'll cover both keynote speakers and also the first session of the day. And we appreciate Tyler and Shoaib doing a summary for the second session as well, shortly after us. Thank you.
Let's see here. So to get started, we had two great keynote speakers, and I will be covering the first one, Bob Grossman's keynote. And this was a really cool keynote presentation. It focused on the high-level landscape of data commons, centering on things that generalize between assays and clinical data, which you don't get to see too often, so that was really cool. There were a lot of concepts being explored in that presentation, starting with platforms, which are very flexible and can be used in a lot of different ways, but are software systems being used to support science and scientific communities, often associated with data -- for instance, workbenches. It was interesting that the commons concept came out of that, where we're thinking about software platforms that specifically focus on co-locating well-curated data, often cloud-based or scaled in such a way that lots of people can use them, along with software applications, tools, and services, which we hear about a lot. It was really cool to move into concepts of data meshes, where you're bringing a lot of these commons and repositories and knowledge bases and resources together, but it could still be a little bit of a silo. And finally to a data fabric concept, where you're thinking about how you get data in or integrate with these systems, which is really important, especially if you're not trying to create silos everywhere. I really appreciated that.
So nonetheless, as we went through the presentation, there were some really cool themes being discussed about what might be in our future, including things like data commons as a service. I definitely agree, and see that coming a lot, but the cool application of this is that if you do have data commons as a service, that can, number one, help reuse infrastructure and reuse the investment that we put into these things for other projects, whether we're spinning them up faster or just using them as a service. And it also allows our science-focused PIs to be focused on their science and their methodology, whatever that might be, and not necessarily have to think about the infrastructure per se.
The next theme was thinking about the full life cycle of these systems that we put in place, and I think this is a super important point that was made, that we should be thinking about the life cycle. We should ask what the end of life and transitions of a project are at the very beginning of the project, to understand how much effort to put into whatever we're making, but also to make sure that the data either has a long-term plan where it is, or we know where it's going for a long-term plan, or we accept and are okay with the idea that there is none -- and that is okay, because we're going to get something out of it before the end of that term. Nonetheless, these are really important questions that I don't think we always ask.
Another theme was that the lower levels of data commons will be commoditized, and an interesting result of that will be, once again, what I was mentioning before: this allows us to focus on what we actually want to think about if we are designing tools or interactive experiences. The hope, the dream, is to be able to plug those into a platform, as opposed to making a platform to host them, right? And so that's really exciting.
Another theme was standard APIs for container-based computing services. The idea here is that we really want to be able to leverage APIs for a lot of the new things and new concepts that are coming around, including things like portable computation APIs that enable AI, as we've been excited about. But how do we get those APIs or fundamental services standardized to make that happen, with platform-to-platform interoperability, or what were called safe APIs? So those were really exciting as far as that goes.
To round this out, there were questions about our specific BICAN ecosystem, because we capture data into the data ecosystem at the very beginning -- not after the research is done and the data is deposited, but at the very beginning, as samples are identified to be used. And so it was mentioned that that might be a data fabric approach, and so something to be aware of is to make sure to keep a good focus on good standards for ingesting or integrating that data into the ecosystem at that earlier point.
Lydia, our next keynote speaker.
LYDIA NG: So I'm summarizing the next keynote, Organizing brain cell types data and knowledge: the challenges and opportunities, from Mike Hawrylycz. It's a great follow-up to the infrastructure discussion in the first keynote: how can we now use that type of infrastructure to help us understand brain cell types? In Mike's talk, he gave a great overview of the plethora of data that BICAN will generate -- many millions of cells from mice, humans, and non-human primates, but also across development and using various technologies -- and with all of those different technologies, we have to have a platform and framework that can house all of that. What do we hope to get out of it? The output would be an annotated transcriptomic-based cell type taxonomy in each species and across species. Because these single cells are dissociated from their position inside tissue, the taxonomies are going to be paired with a lot of local tissue spatial transcriptomic data, so that you can see locally how each of these cell types interact and colocalize with each other. And all of these are going to be put into 3D atlases or common coordinate frameworks so they can be integrated with other data sources such as connectivity, functional imaging, that sort of thing. Then the talk went towards some of the essential ingredients that go into knowledge bases. In particular, we need formal releases of references and nomenclatures so that we know which reference we're talking about, tools to visualize these references, tools to map to these references, and a knowledge base to aggregate and also retrieve the information that we know. Many of the slides provided an overview and updates of BICAN knowledge base efforts. Highlights include the Allen Brain Cell Atlas, Python notebook access, the Cell Type Knowledge Explorer, and Map My Cells. Many of these were shown in the workshop that we just had. Also highlighted was the importance of interoperability with other consortia: BICAN is going to generate the normal data, and we want other disease or dysfunction experiments to be anchored to these normal references.
And then this slide shows that the knowledge base sits within the ecosystem with all the different groups, and also with the specimen portal and with our data archives as well, so it plays a central hub role. But it really is an ecosystem that relies on all of these other components, and that is a lovely segue to the four different themes.
TIM TICKLE: All right so we had four different themes in our session and we had amazing panelists that
were very engaged, so we thank you so much for that. The first theme was focused on data access and sharing, and we were lucky to have Daofeng Li give the motivational slides. Those slides identified some of the kinds of things we should be thinking about as we talk through that theme, metadata, versioning, and access controls for instance, and then also a call out to the spirit of the session, which was getting feedback and learning from existing solutions and portals, for instance ENCODE and 4DN, as you can see here. So that was a really great way to get us started and set the mood for what we were excited to do in that session. We started off thinking about metadata. We talked a little bit about how we are approaching it: defining minimal metadata, focused right now on the upstream side, distinguishing biological from technical variation, and analyses, with that work being a result of close collaboration between experimentalists, computational biologists, and our infrastructure teams. But there was also a point raised that this is a good start; we want to make sure we can also report additional metadata and think about other phases of the data life cycle, or analyses, that might need to be supported.
There was a call out to thinking about metadata and provenance as things that should continue to drive design and standards use. The idea here is to have a goal in mind when we're designing or using standards: what's the goal of the metadata, what's the goal of the provenance, what's the goal of the standard itself? The standard itself is not the goal of the data infrastructure; we want to leverage these things to make the science happen, so that was an important clarifier. There was also a call out that the data keeps getting bigger, and as it gets bigger, usability becomes extremely important. We all heard about AI and LLMs wanting bigger data, and we know we're there, or headed there. Data will live in large cloud resources, and people want to be able to analyze it together with data in other repositories, whatever it might be, so we want to motivate ourselves to create resources that are accurate and curated, to make them genuinely useful, and, as much as possible, to make them standard to use, for example by following international standards like GA4GH, which was called out. Having codebooks describing our metadata terms is also critical for usability. There's an opportunity to learn from other consortia and data repositories what the most common axes of variation are, and to make sure those get collected. That's a really great point, because this has been done before.
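As one illustration of the codebook idea, here is a toy, machine-readable codebook in Python; the field names and ontology choices are hypothetical examples for the sketch, not a BICAN standard.

```python
# Toy codebook: every metadata column is described with its type, its meaning,
# and, where possible, the ontology that constrains its values.
import json

codebook = {
    "donor_id": {
        "type": "string",
        "description": "Stable identifier for the donor the sample came from.",
    },
    "assay": {
        "type": "categorical",
        "description": "Library construction method.",
        "ontology": "EFO",                      # Experimental Factor Ontology
        "examples": ["10x 3' v3", "snATAC-seq"],
    },
    "anatomical_region": {
        "type": "categorical",
        "description": "Dissection region mapped to a reference parcellation.",
        "ontology": "UBERON",
    },
    "qc_total_counts": {
        "type": "integer",
        "units": "UMIs per cell",
        "description": "Technical covariate used to separate technical from biological variation.",
    },
}

# A codebook like this can be shipped alongside each data release.
print(json.dumps(codebook, indent=2))
```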
So what can we learn from that, and what are we seeing? There was also a call out to AI and LLMs; a lot of people were excited about that space. There were call outs to projects leveraging semi-automated pipelines to aid annotation and increase the efficiency of the humans in the loop, the human curators and annotators. There's interest in LLMs to enable annotation mapping, or to engage with the annotations themselves, asking questions about what the annotations mean, like what is this term, or whatever it might be, and I'm sure additional things I'm not even thinking about. So it's just a cool place to be. With that in mind, a question was raised: what are the requirements around data that will make it AI and LLM ready? We should be thinking about this. Lydia, theme two.
LYDIA NG: Okay, so we are now on theme two, from features to knowledge. We started off with Satra Ghosh giving us the motivation slides. The main take-home was that we got working definitions of data, information, and knowledge, but also that there are many different ways to get from data to knowledge, and that it's a cycle: we expect continuous updates as we learn more and as new technology comes along. The other main point is that, given this is a cycle and people want to reproduce it, how do we encapsulate all the tools, data, and knowledge that have been published so that people can access that whole trajectory? So it's a lot more than just the knowledge at the end; we want people to be able to follow the journey and reproduce components of it as well. After the motivation slides, we started a conversation about what we want to get from the knowledge base. Differential gene expression, cell types, and disease states were the key things people were looking for from the transcriptomic cell types, of course, but it was also pointed out that different methods will give you different results. So we actually want to have all the different methods, but we want to understand the provenance of each of them, preserve the different results, and keep the ability to reuse those tools. And while we're talking about transcriptomics, we know other dimensions of cell type come into play, such as connectivity and morphology, so we also need to consider the ability to link across dimensions. In any case, technology is going to evolve over time and we'll get more data and information as we go along, so it's important that existing data can always be updated and refreshed so it can be put in the context of any new information that comes in, and you need to be able to understand and communicate well what that information means. Ontologies are needed to aid this sort of understanding and reuse; the ATOM atlas ontology and the Cell Ontology were brought up as examples.
Then a lot of the conversation turned to best practices and efficiencies. The number one thing we've heard over and over is that we need identifiers for everything, from donors to cell types to methods, because our donors often get reused for multiple purposes, cell types get revised, and cell data gets reused, so it's important to give identifiers to everything so we can mix and match. And once we have all this matching and reuse, it's important to keep track of versioning and reanalysis. Also in this theme, LLMs came into play: how can they be leveraged to make databases, and all the metadata we need, more efficient, using LLMs as tools to get information into the database? And again, a recurring theme we need to think about is continuous maintenance and update, because what we know today might be different from what we know six months or a year from now. So we need more automated ways, such as LLMs and other tools, to help us integrate data more quickly and efficiently.
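To illustrate the identifiers-plus-versioning point, here is a minimal sketch, with illustrative names rather than any BICAN schema, of how every entity could carry a stable identifier, a version, and provenance links back to what it was derived from.

```python
# Minimal sketch: stable identifiers plus versions so reuse and reanalysis are traceable.
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class VersionedRecord:
    record_id: str                                   # stable id for a donor, cell set, method, ...
    version: str                                     # bumped when the entity is revised or reanalyzed
    derived_from: List[str] = field(default_factory=list)  # provenance links as "id@version"

donor = VersionedRecord("donor:H23.0001", "1")
taxonomy = VersionedRecord("taxonomy:human-cortex", "2024-03",
                           derived_from=[f"{donor.record_id}@{donor.version}"])
reanalysis = VersionedRecord("taxonomy:human-cortex", "2024-09",
                             derived_from=[f"{taxonomy.record_id}@{taxonomy.version}"])

# Because every reference carries id@version, a reanalysis months later can cite
# exactly which donors, cell types, and methods it was built from.
print(reanalysis)
```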
TIM TICKLE: So theme three was focused on synergies among groups and projects within BICAN, while the next one focused on external synergies. We started off with motivational slides from GQ Zhang; we appreciate GQ bringing those, like this slide here, for instance, which gives a sense of all the different parts and pieces of the catalog we're working with and the complexity that arises with it. As far as the ideas that came up in this session, standards continued to be important, and in fact comments were made about leveraging FAIR principles from the very start, which makes a lot of sense. There were comments about making sure data formats and standards enable others to use the data, and about designing them for usability, because that in itself is going to allow the general scientific community to engage with the data even if we don't have formal partnerships. If they can just use the data, they can start doing those integration steps and doing that work. And although standards and standardization are important, it was pointed out not to go overboard and to be careful: standards should be implemented in a way that doesn't hamper researchers. Researchers should still have the flexibility to do the work they need to do, to ask questions and learn about things. Having a goal to drive knowledge generation is critical. Even if we had unlimited resources, there would still be a need to focus our efforts on specific use cases and questions, and so it came up again that it's really important for us as a scientific community to establish what our big goals are, our big drivers and big questions that we want to answer. We are very lucky to have a knowledge base in our data ecosystem, and that helps focus the data ecosystem a little bit, but fundamentally, what are the use cases as well? So our community needs to keep informing ourselves about the big goals that we have. We've done that a little bit for sure, but keep those in sight.
LYDIA NG: And our last theme. Tim just talked about how we get synergy within BICAN; our last theme was how to get synergy between BICAN and the community and external consortia. Jeremy Miller started us off with why we need to talk outside of BICAN, through an example of looking at microglia in Alzheimer's disease. He showed UMAPs from three different papers, all speaking to the role that microglia play, but it is difficult to compare across these papers' analyses, which calls out the need for standard names and community standards. But how do we do that? There are many different related consortia; do we need some sort of Google Translate when we're talking about brain cell types? We started talking about which consortia we want to target for integration. The HCA, the Human Cell Atlas, is a great example, where BICAN is now essentially providing the brain data for the HCA project. Then the discussion turned to how we make this sort of interoperability efficient. The important thing, and it's a recurring theme across all of our themes, is the need to adopt community standards, using the same formats, schemas, and naming conventions. The cellxgene minimal schema was discussed and shared. We talked about how it is actually very important to have data-defined cell types for the community, because these cell type definitions are tightly tied to the data. If you have a definition that's based on data and you're sharing the data, then the people using it have everything they need to understand the data and metadata and build confidence in its usage, because they can reproduce it. And then, for adoption, we really need tools and a portal so that our taxonomy, which is our key product, is useful to people. It needs to be interactive and intuitive, and to make data easily accessible across consortia and the community. One last point at the very end: we talked about how ontologies and standardization are very important for accurate information. However, it was brought up that there is a slight misalignment between how ontologists talk about things and how neuroscientists talk about things, so we could use more synergy between them, such that the language is well understood by both groups.
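As a concrete illustration of shared schemas and data-defined cell types, here is a hedged sketch of an AnnData object whose per-cell table uses ontology-backed columns in the spirit of the cellxgene minimal schema; the column names reflect my reading of that schema and the ontology IDs are examples, so both should be checked against the current schema release.

```python
# Sketch of schema-style, ontology-backed per-cell metadata attached to the data itself.
import anndata as ad
import numpy as np
import pandas as pd

obs = pd.DataFrame({
    "donor_id": ["H23.0001", "H23.0001"],
    "assay_ontology_term_id": ["EFO:0009922", "EFO:0009922"],         # 10x 3' v3
    "cell_type_ontology_term_id": ["CL:0000127", "CL:0000129"],       # astrocyte, microglial cell
    "tissue_ontology_term_id": ["UBERON:0000956", "UBERON:0000956"],  # cerebral cortex
    "disease_ontology_term_id": ["PATO:0000461", "PATO:0000461"],     # normal
}, index=["cell_1", "cell_2"])

# Two toy cells by three placeholder genes; real objects carry the full matrix.
adata = ad.AnnData(X=np.zeros((2, 3)), obs=obs)
print(adata.obs)
```

Because the labels travel with the data in standard ontology terms, someone outside the consortium can interpret, compare, and reproduce the cell type calls without a private mapping table, which is the interoperability point made above.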
TIM TICKLE: Thank you, Lydia. We've done our summaries for the first two pieces here, and for the last session we're asking Shoaib and Tyler to give us a bit of a summary. Now, they literally just got out of their session, so maybe this is just a conversation about how it went, but you have about seven minutes to talk a little bit about the session itself. Thank you.
SHOAIB MUFTI: Thank you, Lydia and Tim. As Tim mentioned, we just came out of our session, so we apologize that we didn't get enough time to capture it formally, but Tyler started a document, so we still have some notes and Tyler will go through them. Maybe I'll just say a couple of quick words first. The presentations and demos were great, and the breakouts were awesome; we had a lot of participation there. I counted over 100 people across the four breakout sessions, which is pretty amazing. Normally you get a lot of people at the start of a breakout, but those 100 people stayed throughout, so it's very encouraging to see people engage, and very encouraging for those of us who work on the tools. So we appreciate all the participation. Tyler, pull up the document you've been working on and we can hear some of the thoughts we captured in real time.
TYLER MOLLENKOPF: Maybe I'll just talk through it; it's more notes visually, but I can summarize verbally. So thanks again to the presenters, to Jonah, Hanqing, and Mukund, and to everybody who joined the breakouts. First we heard about the cellxgene census. Cellxgene, as I think many of you know, is a community resource taking in single-cell data and curating it into a standardized form. To be really useful, they've got minimal metadata standards, and I think we are, and will continue, in BICAN to try to coordinate there. All the data is freely available in multiple formats, and the new product Jonah showed, the census, is really a data object with 50 million cells and growing, plus an API to make it easy and quick to use those data. A couple of notes I heard come out of their breakout: versioning, of course, is critical, and I think the census is one example leading the way there in a space adjacent to ours, so it would be good for BICAN to look at that further.
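For readers who want to see what that census API looks like in practice, here is a small sketch based on the public cellxgene-census Python package; the filter values and column names are illustrative, and the package documentation should be consulted for the current census release.

```python
# Pull a brain-restricted slice of the census into an AnnData object.
import cellxgene_census

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter='tissue_general == "brain" and is_primary_data == True',
        var_value_filter='feature_name in ["GFAP", "AQP4"]',
        column_names={"obs": ["cell_type", "assay", "tissue", "donor_id"]},
    )

# Cells by two genes, with standardized per-cell metadata already attached.
print(adata)
```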
Multimodal data is here and growing, and we're going to need standard data formats for the multimodal elements, including to encourage reanalysis. And then model dissemination, a new feature of the census, is starting to share out models that were built using those data and to support the collaboration space around that. That's surely going to be really key for BICAN too, and something good to learn from. We also heard about the brain cell data portal and some really cool ways of linking between spatial and single-cell data and doing searches for cell types based on genes. It sounded like people really liked the filtering options and the speed; it's a really fast tool. I've enjoyed looking at the slides myself, and there is definitely some interest in knowing about upcoming features. I think that was something Mukund heard, but probably many of us are thinking about it, and I'm sure there are ways for the software development teams in BICAN to make our road maps really clear.
Some folks, like GQ's group, have made that easy for lots of people in BICAN and the public to know about, but making it known what features are coming up for any of these tools is probably a good area for all of us to learn from. And of course there was some feedback, which I tend to welcome, on improving the usability of the brain cell data portal; we've all got room to improve there. We saw the mouse brain portal, which hosts multiomic data sets including DNA methylation and ATAC data. Personally, I'm really fascinated by the way Hanqing has used technologies like HiGlass to make working with those kinds of massive epigenomics data possible. The mouse brain tool helps visualize these multimodal data through custom scatter plots and a multi-panel view, and all the data is freely available for download. It sounds like a lot of the interest in the breakout was around downloading and accessing the data, which makes sense; it's really quite novel and compelling. And then on the mouse brain platform side, I talked about a number of tools we've built on common APIs for visualizing whole-brain data, the spatial data in particular is really exciting for us, and then mapping the community's own data against them or contributing knowledge to the knowledge base. Two things stood out in our breakout discussion. One, registering spatial transcriptomics data is hard but really critical, and that's going to be an even more challenging space for human and non-human primate, where the CCFs are at different stages of maturity than in the mouse. And two, cross-species comparison would be really useful. There were some really good questions about that, especially as the human brain data scales toward the whole brain, as with the mouse. Using the mouse whole brain as a reference alongside, for annotation and other things, will be really valuable; we got some good ideas from the group there.
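As one conventional way to map a community dataset onto an annotated reference, and only as a sketch rather than the BICAN mapping pipeline, scanpy's ingest can project query cells onto a reference embedding and transfer labels by nearest neighbors; the demo datasets and the "louvain" label column below come from scanpy's bundled examples.

```python
# Sketch: transfer reference cluster labels onto a query dataset with scanpy's ingest.
import scanpy as sc

adata_ref = sc.datasets.pbmc3k_processed()     # stand-in for an annotated reference
adata_query = sc.datasets.pbmc68k_reduced()    # stand-in for a lab's own data

# Restrict both objects to the genes they share.
shared = adata_ref.var_names.intersection(adata_query.var_names)
adata_ref = adata_ref[:, shared].copy()
adata_query = adata_query[:, shared].copy()

# Build the reference embedding, then project the query onto it and transfer labels.
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref)
sc.tl.umap(adata_ref)
sc.tl.ingest(adata_query, adata_ref, obs="louvain")  # "louvain" holds the reference labels

print(adata_query.obs["louvain"].value_counts().head())
```

A production mapping service for whole-brain references would of course need registration to a CCF and much larger models, which is exactly the hard part called out above.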
TIM TICKLE: Thanks once again, Shoaib and Tyler, for doing that. It's a difficult task to do immediately after coming out of your session, but that was a great summary.
YONG YAO: Thank you. I just want to remind you of tomorrow's workshop, the final day. We will start with the panel one discussion on brain cell atlas use cases in neuroscience research, and the second panel will particularly address interests in brain disorder research. In the final plenary session, John Ngai and Ed Lein are going to be the moderators. The goal is really to try to develop and enhance the current road map and the joint milestones developed by BICAN. Hopefully you have time to join this final day. With that, I'd like to close the meeting for today. Thank you all.