2022 High Throughput Imaging Characterization of Brain Cell Types & Connectivity-Day 2, Part 5
YONG YAO: There was a whole discussion about what types of labels will be applied to the human brain. But in the interest of time, I think we need to move on to the next session, 3.2, about data modeling, labeling, visualization, and summary. Our moderators are Laura Brattain and Viren Jain. Please take it away.
LAURA BRATTAIN: Great. Thanks, John. Hello everyone. So I will be leading the discussion agenda while Viren will be monitoring the chat. So we don't have a lot of time, but we have a lot to discuss, so we're going to move along nicely, hopefully. And we put down some names under each topic, but obviously, we welcome everybody to chime in at any time. So before we start, let's talk about the details. We know that the previous session talked about imaging technologies and the immediate data processing right after that. So this session we're going to focus on developing, processing, and analytic tools that can support the building of the whole brain atlases. I thought it will be great to just spend a few minutes as a team to talk about what are the main purpose and objectives that we want to get out of this discussion. And at the end of the session, we hope to have some action items identified. So we have these bookend tasks. One is at top, define the big objectives, and at the end, we have some concrete action items. So for that - I don't know - Bruce and Elizabeth, you have any thoughts to share?
BRUCE FISCHL: Sure. I can start. Actually, I was going to answer Kwanghun’s question from the last session. There's this whole range of mesoscopic features in the human brain that we have very little understanding of. There are columnar structures in the cortex that occur at multiple scales, microcolumns, mini-columns, columns, and they're too small for us to see with MRI. Maybe we can catch glimpses of columns, but they're too big to see with microscopy. And they're three-dimensional objects. They don't sit. Mike Miller was talking about kind of not using the extrinsic lattice. And they don't sit on any extrinsic plane that we put on the brain. So you'll have a column that goes through plane. And the through-plane geometry is critical to understanding the columnar architecture. And because of that, we have very little understanding of it. And so I think there's this whole range of mesoscopic structures in the brain that we have the possibility with these types of technologies we've been talking about to make progress on understanding kind of what normal architecture is and how things like epilepsy or autism maybe affect this type of columnar structure. So I think that's one big target that I think is of interest to a broad community.
LAURA BRATTAIN: Great. Anyone else?
ELIZABETH HILLMAN: So maybe I will interpret the question differently. If we're talking about the main purpose of this session, I think the previous session really was dealing mostly with the issue that you get these numbers that come off the microscope and you have to work with them: you have to de-noise, you have to correct for artifacts and so on, right, and then maybe push it forward into segmentation or some sort of extraction of information. But down the line, the eventual thing that we want to do is actually analyze the quantitative information about the brain, right. And so, again, this relates to the purpose of why we are doing this and what kinds of things we expect to learn from this data. So for me, I think it's important to realize that browsing this data, like logging on to a web portal and scrolling around on this imaging data or spinning around a rendering of the brain, is not going to be fun for very long, right, and it's not going to permit the kind of brain-wide analysis or inter-individual analysis that we might hope to be able to do. So what would be great is to move towards a consensus on what kind of information should be extracted by data analysis pipelines that can then be shared, and how we can think about sharing that information with the right people based on what they might want to do with that quantitative information.
And I think we have heard a lot about using sort of AI models, and I think that's coming up in this panel as well. But, again, I think the BICCN group has a lot of experience on mouse brain, but human brain is 2,000 times bigger, and we don't have the luxury of those genetically-encoded labels that we had before. So it's a little different. Yeah. So that was vague because I wasn't quite sure what we were going to discuss. But, yeah, I think drilling down into not just analyzing the raw images is the first step to making quantitative data accessible to wide ranges of people who can use it to actually start to answer questions about the brain and figuring out what that step should be, and how much of that we leave in the hands of the people that are going to want to come and access this data or not.
LAURA BRATTAIN: Yeah. Go ahead, Bruce.
BRUCE FISCHL: No, I was just going to say that analysis can't be planar. The history of microscopy is planar analysis, and the human brain doesn't respect these arbitrary 2D planes that we put on it, so it's got to be a 3D analysis with the right 3D.
ELIZABETH HILLMAN: But again, initial image processing and feature extraction is not the ultimate analysis we want to do with this data. That is facilitating the endpoint analysis of this data.
LAURA BRATTAIN: Yeah. That's a great point. Okay. So maybe we can dive into the second topic. I think that's a great segue to what are the current gaps in data modeling, labeling, and visualization. And these people have some slides to share. I'll go ahead to project. And, Daniel, you can go first if you are ready. Let me do that.
DANIEL BERGER: Sure. Yes. I just have two quick slides. And I think Elizabeth just nicely summarized this one. So this just shows you kind of a general framework of what we are trying to achieve as I see it. So we start with mostly volumetric images from optical microscopy, which is a set of voxel image data sets. And then we talked about how to align and stitch and co-register these maybe to some kind of standard samples that we can compare across different people or different modalities of imaging. But then ultimately we want to get this to researchers. And there's an intermediate step that we can add here which is a pre-analysis, as Elizabeth just mentioned, where we extract some of the data and preprocess it. And some of these preprocessing steps can be very costly computationally. So I think we should do them and then provide the results also to people who want to browse or analyze the data. This can, for example, be segmentation of cell bodies. This can be kind of metadata collection of genetic labels and things like that. And then ultimately we want to give this to the researchers.
But I think we both want to have an easy access, and we also want to have, let's say, a very complicated way of accessing the data— or a very powerful way to access the data. So I think what we need is a visualization tool where people can easily browse the data and ask simple questions but then also some kind of API, maybe based on Python, where people can really go into depth and do analysis, maybe with computations that are done on a server rather than on the client. And we can also think about ways that the researchers can collaborate to analyze the data together. So that should also be some kind of client/server system to form a group to do certain analysis that several people are interested in. I'm thinking of the example of the FlyWire project, where whole groups of research labs work together to reconstruct certain parts of the fly brain.
And just to talk a little bit about current prospects or possibilities in visualization, could you go to the next slide please? So the field that is driving visualization a lot I think is computer games because they want to render really complex scenes really fast, and there are a lot of people who want to do that so there's a lot of money in it. And there's a recent development, maybe you've heard of one of the game engines called Unreal Engine 5, where they introduced a new system that allows for real-time visualization of billions of polygons with realistic lighting. And this is public, and we could use it for research, and I don't think it's being used for scientific visualizations at the moment, but it's a, I think, interesting system to look into if you especially want to visualize millions, billions of objects in a small three-dimensional volumetric data set.
LAURA BRATTAIN: Great. Yeah, thanks for sharing your thoughts. There are two takeaways for me. One is we want to design the tools— the user interface tools should be designed based on the end user population. For clinicians, for the neuroscientists, they can be different— or they may want different tools. And then another takeaway I get is that we want to leverage technologies from other sectors, from gaming. And that is a great example, like a GPU was developed initially for gaming industry, and now we use it for a lot of biomedical data processing. So that's a really good point. With that, I'm going to advance to the next slide. So the speaker for this slide, please go ahead.
DAVID VAN ESSEN: Okay. I'm David Van Essen, and my lab, now co-led by Matt Glasser, has been working for several decades on the brain visualization front. And the Connectome Workbench platform that's used for this demonstration has been developed over the past dozen years. It was initially focused mainly on volume and surface visualization of primate cerebral cortex. But in recent years, we've also extended this to histological image visualization that you can see in the upper two panels. So that's a NeuN low-power visualization in the upper middle. And that can be quickly zoomed to give cellular-level resolution in the upper right. And we're also envisioning that this will be highly useful for spatial transcriptomics in the BICAN project. So the basic point is that there are many different types of data, different modalities, that can be easily visualized and navigated, jumping from one region to another on this Workbench platform. And if you go to the next slide, please, I want to make a couple of other points.
Next slide, please. If you look at the Workbench screen, seeing the application, it has many capabilities in different domains: a toolbar domain for choosing which region and modality you want to view, and an overlay toolbox on the bottom to choose which layers you want to put on top of one another. And the last major point I want to focus on is what we call scene files, and the scene window is shown in the lower right. But basically, a scene file is a way to capture everything that is seen on the visualization screen: multiple tabs, text annotations, circles and arrows, all of that can be captured and saved as one scene in a scene file. And when one saves that and then revisits it the next day or sends the data to a collaborator, one can regenerate exactly what had been seen previously, and it makes it very easy to deal with these extremely complex data sets and associated metadata.
And so this concept of scenes and scene files is not only very useful in our particular visualization platform, but we think will be useful in a broader context. And, for example, one of our BICAN partners in the Netherlands has independently invented, developed a scene equivalent concept for his own genomics type software. So we would present this as a useful general concept for dealing with multiple complex data sets in our modern BICAN world. So that's all I wanted to say at that point.
LAURA BRATTAIN: Great. Thanks, David. So these are really nice features to have. What are your thoughts on scaling up? What needs to be done in order to be able to process whole brain data— well, not process, be able to apply these features to the whole brain data set?
DAVID VAN ESSEN: Well, this data includes a whole monkey brain, and we also use it routinely for whole human brain. I think the challenge in dealing with both microscopy and MRI-based data sets— it illustrates the multiresolution challenges. And while this example illustrates literally micron to millimeter-, centimeter-scale visualization, another hugely important part of it, that perhaps you're alluding to, is when one has these vast amounts of data, you can't shift the data itself from the central source to the user in the whole ball of wax. So we really need to be thinking of remote visualization, having applications such as Connectome Workbench that could be sitting remotely where the data are stored, and then the user can have an interface to navigate and tap into it and work efficiently with drilling down into large domains of multimodal data and then bringing the visualization to the users in an efficient, fast mode of operation.
LAURA BRATTAIN: Great. Yeah, something like the Google Earth we saw earlier. All right. With that, we have one more slide for this topic. So the speaker for this slide, please go ahead.
DANIEL TWARD: Hi. I wanted to speak a little bit about multiscale visualization, which is a topic that has come up many times. The standard approach here is to build image pyramids, just as was described in the Google Earth talk, where as we zoom out, we just average more and more pixels together and essentially throw away all the information from the high-resolution data when we're looking at the low-resolution data. To me, it seems that an important gap is to really connect these scales together by extracting relevant information at the fine scale and visualizing it at the coarse scale. So in the top row, I show an example here, where the left-hand side shows a zoomed-out histology image, but you really can't see anything interesting other than that it's pink. And the top right shows the same image where we've extracted texture features based on the scattering transform. Now you can really see layers of cells and differences in different areas. So this is an unsupervised approach. But as we get into modeling and labeling, we can make choices about what information we want to save and extract from high resolution and move down to low resolution. The bottom row shows a similar example, where on the left-hand side is, again, a pink histology image, but as we move from left to right, we've detected a specific structure of interest, in this case, a tau tangle in a patient with Alzheimer's disease. And as we zoom out, we can visualize the density and other statistics of these ensembles of tau tangles. So I think this is an open question about what is the best way to zoom out, and how we can choose the information we preserve using modeling or labeling techniques that are interesting for a given problem.
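To make the contrast above concrete, here is a minimal Python sketch, assuming NumPy is available. It compares a standard averaging pyramid with a coarse map built from fine-scale detections. It does not implement the scattering transform mentioned in the talk; a simple per-block object count stands in for "information extracted at the fine scale". The function names (`block_reduce`, `mean_pyramid`, `detection_density`) and the one-pixel-per-object detection mask are illustrative assumptions, not part of any tool discussed in the session.

```python
import numpy as np


def block_reduce(image, factor, reducer):
    """Apply `reducer` over non-overlapping factor x factor blocks of a 2D array."""
    h, w = image.shape
    # Trim edges so the shape is an exact multiple of the block size.
    h_trim, w_trim = h - h % factor, w - w % factor
    blocks = image[:h_trim, :w_trim].reshape(
        h_trim // factor, factor, w_trim // factor, factor
    )
    return reducer(blocks, axis=(1, 3))


def mean_pyramid(image, levels, factor=2):
    """Standard image pyramid: each level averages blocks of the previous one."""
    pyramid = [image.astype(float)]
    for _ in range(levels):
        pyramid.append(block_reduce(pyramid[-1], factor, np.mean))
    return pyramid


def detection_density(detections, factor):
    """Coarse map of object counts per block.

    `detections` is assumed to be a fine-scale boolean mask with one True
    pixel per detected object (for example, one pixel per detected tangle).
    """
    return block_reduce(detections.astype(int), factor, np.sum)
```

Averaging blurs a handful of small detections into a nearly uniform coarse pixel, while the count map keeps the number of objects per block, which is the kind of statistic that stays meaningful when zoomed out.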
LAURA BRATTAIN: Yeah. Thanks. Any comments? Any questions for the speakers?
RICHARD LEVENSON: Just another comment: one of the things that we've been doing is taking H&E-stained slides and imaging them in bright field and in fluorescence, and you get 50% more information, which could be nicely fed into what you're doing, Daniel.
MICHAEL MILLER: Daniel, I wanted to make a comment. Yesterday in our session, one of the things that people said is that, with AI and machine learning now, the ability to take a single measurement that has a lot of information and be able to predict many other things is really becoming viable. I have really thought about what Mallat did with scattering transforms. I mean, you just showed it: essentially you start from a single picture, but it has many resolutions in it, so from the scattering transform, you can generate a whole series of images that essentially represent many other things. I think that's really relevant. And what's really good, I think, about the scattering transform - I know it's why you used it - is that we really don't have to train. You put the brain into it, and out comes a representation of all the scale information. And he's shown that it's really related to an optimal neural net. But the fact that you don't need training makes it relevant. So it's nice to see that you put that out there. Thank you.
HANCHUAN PENG: I have a little piece of comment about this. I'm still thinking in this potential co-human brain application case, which we try to also contribute a little bit in this space by injecting a lot of kind of a dye into individual cell, and we try to do this at many different brain regions and from some sort of brain tissue— from surgical tissue, and then put together through the registration framework. And the problem actually, you guys actually just show that this multiscale, multimodality, multi-subject registration is definitely one of the key issue out there. But I, of course— okay. Because I also work on the machine learning technologies, but I want to be a little bit cautious about believing that machine learning definitely will be able to get what we want. Because I think at this particular moment that the field still kind of usually suffers from the lack of the human data. And a lot of times we do not want to actually extrapolate the limited data too much, okay, by try to yield for too much about what haven't been observed.
Especially because a lot of times we don't even have this very good way to define the branches yet, let alone generating data or other qualities, right, so gold standard data, something like that. So I will be very, very careful about the transformation between different spaces, a feature space, so different scales, and probably we will really focus on particular subarea at this particular moment, especially related to the sudden brain delays or something like that, okay, so that we'll have a good kind of a boundary. Okay. Basically, you define the boundary of the problem so that you will have a good handle to study the problem. Yeah, so that's my two cents.
LAURA BRATTAIN: Yeah. That's a very good point. And I think your comments are actually a great segue to the next topic, is how to generate ground truth, what are the key sources for ground truth, and how can we generate scalable truth data for AI development. So for that, I'm going to share some slides, and, Matt, are you ready to speak to that?
MATT MCCORMICK: Yes, yes.
LAURA BRATTAIN: Yes. Okay. Great.
MATT MCCORMICK: And this is a topic that I was discussing recently with David Feng and Lydia and Jayaram from the Allen Institute, what are the considerations we have, how do we get ground truth that we can use to verify, validate that the registration that we're doing is accurate and precise and also for using that within quality control as we try to generate these whole brain volumes, looking at is the data that we're getting, is it good, did the acquisition occur successfully, and is the content that we expect to see in a volume what we're actually getting? So next slide, please. And the main considerations that come up, kind of classes of the considerations, is what type of data do we use for ground truthing or mapping or alignment of the data? And in general, when doing registration, there are maybe two big classes of ground truth data. First, is the typical most widely used type of data that's collected are landmarks or fiducials across the two data sets you're trying to define alignment. And you say, "These two points within space, we expect them to align, and how off are they from each other?" For brain imaging, that is more challenging. Because the brain is this curved data structure, it's hard to locate these points and there's not a lot of— there's a lot of surfaces, but there's not kind of well-defined 3D points. And in general, the other type of correspondence and validation is using a segmentation that's not used as part of the registration process.
To validate the registration, you have these labeled regions, and the most typical metrics there are set-based metrics, like the Dice coefficient if you have binary labels, but you can also have other metrics. You can have multiple segmentations from experts, binary segmentations, and then there are other metrics. If you have a segmentation that maybe comes from an image and it's a deep learning segmentation, it gives you a probability density image. You can define a metric that way. But then there are also metrics that can be used for looking at the differences between surfaces, identifying points on the surfaces. Or there are information-theoretic metrics that can be used, which is relevant, like Michael Miller was discussing, when we look at the data as a set of relevant points that are extracted from the data, different regions defined by the cells that comprise those regions, and see how well aligned they are. There is also an intermediate case. Lydia suggested that you compare how well a region corresponds to a set of points, so maybe an expert goes through, and for efficiency they identify points that are near the surface of that region, and you can then measure the distance between that point cloud and the expected region. So these are the types of data we consider when looking at what we should be using for ground truthing. That's one consideration.
And next slide please, Laura. Other considerations that are interesting to look at are how we take these ground truth data and collect and store them in a way that keeps all relevant metadata in a consistent way and in standardized formats, so that we can analyze them objectively, reproducibly, and quantitatively but also use them downstream for further analysis, like Elizabeth was mentioning, the end goal of further analysis where we're getting insights from these data. Another consideration that's come up many times throughout the workshop is, where do we collect the ground truth? So we have this very, very large space of data, and we can't collect ground truth everywhere. Or if we want to be efficient, we should be selective. And so we need to look at areas considering what is biologically relevant and meaningful for the questions we're trying to ask. What are the regions, is one consideration. And another consideration is, what can be clearly identified between samples, so across modalities? Certain structures might not be present in both modalities to the same degree or, as was mentioned, I think, in the last session, the point spread function may vary through the volume and we might want to vary where we collect ground truth.
And finally, as has been mentioned in the first session, what can be sustainably and scalably acquired? These are very large data sets. If we're doing manual inputs, what can be done in a scalable, sustainable way, and also, how do we have tools that work with these data sets that are large? And we also need to have inputs for ground truthing at multiple scales, as has been discussed before. And also, how do we do these computations efficiently and effectively, which is always a concern: with the available computational resources, we have to be pragmatic. So those are some thoughts on that topic.
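Two of the validation measures mentioned above can be sketched in a few lines of Python, assuming NumPy is available: the Dice overlap between two binary label masks, and the distance from expert-placed points to an expected region. This is an illustrative sketch only. The function names, the `voxel_size` parameter, and the brute-force distance computation are assumptions made here for clarity; a real pipeline working on whole-brain volumes would need a more scalable approach.

```python
import numpy as np


def dice_coefficient(mask_a, mask_b):
    """Dice overlap 2|A and B| / (|A| + |B|) between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Convention chosen here: two empty masks agree perfectly.
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / total


def point_to_region_distances(points, region_mask, voxel_size=1.0):
    """Distance from each point to the nearest voxel of a non-empty region.

    `points` has shape (N, ndim) in the same physical units as
    `voxel_size`. Points that fall on a region voxel get distance 0.
    Brute force: memory grows as N times the number of region voxels.
    """
    region_voxels = np.argwhere(region_mask) * voxel_size  # (M, ndim)
    pts = np.asarray(points, dtype=float)                  # (N, ndim)
    diffs = pts[:, None, :] - region_voxels[None, :, :]    # (N, M, ndim)
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
```

Summaries such as the mean or maximum of the returned distances give a single registration-error figure for a region, while Dice needs a full segmentation on both sides, so the point-based measure is cheaper for an expert to supply.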
LAURA BRATTAIN: Yeah. Thank you so much for sharing these thoughts. I think putting them all just together in a few very succinct slides, that is very helpful. Any quick comments before we switch to unsupervised learning?
JOSHUA WELCH: Yeah, I had a quick comment. I think these are really helpful thoughts and I wanted to point out that multimodal data provides a really important source of ground truth that can be in the form of labels that you derive from one modality and try to reconstruct from the other or even just the relationship between the modalities. And so even if you don't have perfect labels in either modality, the relationship between the data types can be a really important grounding for these types of explorations, I think.
ELIZABETH HILLMAN: There's also opportunity for data-driven discovery, right, where, you can— what it thinks the groupings or regions or stratifications are, and then use that to do hypothesis-driven work, the controls to figure out what they are.
LAURA BRATTAIN: Yeah. That's really, really on point actually. I think, Reza, are you ready to share your slides?
REZA ABBASI-ASL: Elizabeth, it was just the best segue.
LAURA BRATTAIN: Best segue. I know. I know. We can—
REZA ABBASI-ASL: It's just on my first slide. Sorry. How we can use the potential in ML for scientific discovery. Yeah. I'm Reza. Thanks a lot for the opportunity. I'm a computational scientist at UCSF. My lab is fully computational. And I was at the Allen Institute before, so many of the topics that I'm going to cover in the next couple of minutes, hopefully few minutes, are mostly collaboration projects with the Allen Institute. So I'm going to talk a bit about exactly that: thinking about the potential of this ground truth data set, let's say, a huge amount of multimodal data, and how we can try to use machine learning, or ML, to learn a bit more about the data, and then some of the unsupervised techniques because, as you can imagine, labeling will be a huge issue as we collect more and more data. But before that, I want to quickly talk a bit in complement to what Matt was talking about. So just taking a step back and thinking about general ML and how we can use it for scientific discovery. We have been thinking about this problem in this PNAS paper I published a few years back. We outlined what we think is the philosophy behind a data science cycle or ML pipeline for scientific discovery, in particular in neuroscience. Many of you are often using this pipeline: starting from a scientific question, if you want to answer it computationally, finding the right algorithm or computational representation.
And then most importantly, interpretation of what you learn from that algorithmic representation. And then using that in a cycle to guide, again, your question, your data. And then kind of reiterating this process. What I really want to highlight is actually the considerations. And I guess Matt mentioned a couple of these, but many of these sometimes get neglected in this process, and I think some of them are actually really important. Stability of findings and reproducibility is always a huge consideration when it comes to using ML for scientific discovery in any actually domain. Designs usually could leverage human-in-the-loop design, where a domain expert, if it's a clinically relevant question, patients even, but mostly, neuroscientist and computational scientist to just make sure that the data is collected in a way that could be leveraged eventually with computational pipelines and also computational pipelines are relevant. So of course, models should be accurate. And then with this modern ML, deep learning era ML, we really have a lot of this— a really good accurate model. Should be relevant. Should be computable. We mentioned this. But I think it's really important to think about the interpretability of the models when it comes to scientific discovery. Many kind of tools, techniques are there to enable that or being able to build modular models, visualization as another technique to interpret these models. So a model that could be visualized has an advantage. Exploratory data analysis mostly just before building any fancy model, learning about the data and these considerations. So I think these are important.
And then specifically, I want to talk a bit about the opportunities that this sort of pipeline would essentially enable us to have with a spatial whole-brain data set. Just maybe circling back to Elizabeth's point, I'm going to list a few questions and opportunities or maybe domains that could be pursued. And then this is essentially open to discussion. I'm going to end by highlighting one of our efforts to do one of them. So as was discussed before, unsupervised pattern recognition, pattern discovery, segmentation, regionalization, this could be enabled. ML has a huge potential in this domain. We mentioned multimodal integration as we collect more data. When it comes to this sort of large data, there is identifying possible interactions, networks, candidate genes, gene regulatory pathways. Of course, I'm emphasizing candidate because these need experimental validation. But at least this sort of exploration of large amounts of data using ML would essentially be a system that could identify potential candidates. Cell type characterization, of course with multimodal integration: there will be a lot of opportunities on that front. We've discussed a lot of preprocessing, quality assurance, and other questions and opportunities that could be enabled here. So I want to quickly highlight one of our efforts on essentially getting larger-scale spatial gene expression data in whole brain. This is mouse data.
And then using an interpretable matrix factorization pipeline, and this could be any sort of unsupervised dictionary learning, and we have tried many other things too. So it could be a general framework. But then leveraging this sort of pipeline to learn about principal patterns or, in a way, parcellations in the gene expression data. And, again, I'm going to talk about multimodal integration, how we can go beyond gene expression and integrate morphology data as well. So this is in collaboration with the Allen Institute, Bosiljka Tasic, Hongkui Zeng, and also Bin Yu at UC Berkeley. So I just want to highlight what could be achieved, what the potentials are. These are some of the patterns that, with these unsupervised methods, we can extract from mouse brain just using gene expression data. These are essentially fully unsupervised and we can compare them with CCF. In this case, we have three patterns that nicely tile, in a way, different parts of isocortex. So it gives us the opportunity to identify subregions just from gene expression. We can compare this with annotations so that labeling could come into play here. I'm comparing some of these PPs, or principal patterns, in red with CCF regions in green. We see, interestingly, that these patterns have combined specific subregions in CCF. We are looking into the biological interpretability of that, how it goes beyond the established ontology in mouse brain. So that's just one example, but I'm pretty sure this sort of pipeline could be generalizable to any whole brain or even smaller data sets that many of us are dealing with.
I guess my main intention here was to just show how we can leverage it. I know we discussed multimodal integration a bit, so I want to also emphasize what the opportunities are on that front. There are many, many data sets now being collected, whole brain or regions in the brain: gene expression data sets, neural dynamics, so calcium imaging, ephys data, or neuromorphology from EM, or essentially other modalities. And we have also been working with the Allen Institute on this particular data set. This is calcium imaging from one cubic millimeter in mouse brain, of course, not whole brain yet. That's the ideal, kind of futuristic goal. But whatever we can get from this is integrated with gene expression from the same tissue, and the Allen Institute also finished collecting EM data from the exact same tissue. So we are essentially now working on a mapping of these data sets, mostly being done at the Allen Institute, and we'll be working a lot on modeling and integration of this. So that's essentially what I wanted to talk about a bit, but I want to maybe open this to discussion, and of course, I know others will be continuing to discuss this.
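As a rough illustration of the kind of unsupervised factorization described above, the sketch below runs plain non-negative matrix factorization (NMF) with the classic multiplicative updates on a voxels-by-genes expression matrix, assuming NumPy is available. It is a stand-in chosen for brevity, not the pipeline used in the work presented: the function name `nmf`, its parameters, and the layout of the input matrix are assumptions. Each column of `W` can be reshaped back into brain space and read as one spatial pattern, and each row of `H` gives the genes that load on that pattern.

```python
import numpy as np


def nmf(X, n_patterns, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (voxels x genes) as X ~ W @ H.

    W (voxels x n_patterns) holds spatial patterns and H
    (n_patterns x genes) holds gene loadings. Uses multiplicative
    updates for the squared-error objective; `eps` avoids division
    by zero.
    """
    rng = np.random.default_rng(seed)
    n_voxels, n_genes = X.shape
    W = rng.random((n_voxels, n_patterns)) + eps
    H = rng.random((n_patterns, n_genes)) + eps
    for _ in range(n_iter):
        # Update gene loadings with the spatial patterns held fixed.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        # Update spatial patterns with the gene loadings held fixed.
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def relative_error(X, W, H):
    """Frobenius-norm reconstruction error relative to the norm of X."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Because both factors are non-negative, each pattern is an additive part of the expression data, which is one reason this family of methods is often considered easier to interpret than factorizations with mixed signs. A thresholded column of `W` could then be compared against atlas regions with an overlap measure such as Dice.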
LAURA BRATTAIN: Excellent. Thanks for sharing that. I can envision that we'll be leveraging the unsupervised learning approach a lot in the integration of multiple modalities going forward. In the interest of time, unless someone has any quick comments specific to the slides just presented, I'm thinking that we can synthesize what we talked about and talk about how we can identify maybe even one or two use cases, and use those as a pathfinder to get down to some really concrete action items that we can work towards as a team. Anybody want to start with a use case example?
ELIZABETH HILLMAN: It may be not so much a use case as an action item, but I think that the discussion about ground truth was really interesting. And I think it's worth considering what the things are that we want to figure out and label and extract, and then coming to some consensus on what the ground truth for those things might look like, and maybe even developing aggregate training sets or standards that we could reach. I'm using all those words I don't like.
HANCHUAN PENG: Yeah, I agree. I agree. For the use case, what do you mean by use case? A use case for all of the topics discussed, or some sort of application case with the human brain? What does use case mean here?
LAURA BRATTAIN: Yeah. Okay. So in my mind, we already have some tools developed over the years and we have some data collected, as you can see here, so I think a very major thrust now is how to adapt the current tools to make them more generalizable across different data sets, more broadly usable, and scalable to much, much larger data sets. But we cannot just try to tackle a petabyte data set right away, right? What are the building blocks to take us there?
HANCHUAN PENG: Yeah. I think for that, I can probably offer one use case, because I'm the one in BICCN who works with this data the most. Okay. So yes, it is true that we have developed a lot of tools and platform software, all this kind of stuff. So a very immediate and very natural use case for us is to try to extend what we have been able to do for the whole mouse brain to more complicated systems, like the nonhuman primate and even the human brain. So that's one we have been doing. We already started to collaborate with a couple of teams on monkey, including the crab-eating monkey, to try to screen their external pattern as well as some sort of subsidiary pattern, like the external in a particular brain area, with the images generated by light-sheet imaging, which is pretty fast. And we collect the data and we try to put the distribution along the external trajectories, and then try to see how that distribution differs from what we have seen in the mouse brain. Yeah. So that's one case. And for the human brain part, we basically reuse the same technical platform, especially the neuron reconstruction and the very large-scale brain mapping, the registration, all these things, and try to basically put together a platform.
At this moment, we are still mostly screening the patterns of individual neurons, but we did achieve the milestone that we can accumulate a couple thousand images of individual neurons, match them to different brain areas, and then try to ask what their potential cell types are. It's a similar question to the one you would ask for the whole mouse brain, the BICCN part. But we have started to scratch the surface for the human brain. It's still pretty preliminary, but we did start to collect the data. And one interesting thing people can think about is to compare the human data with the monkey data, right? So that could also be very interesting. Yeah. There are a lot of unknowns out there, right? I want to be very, very cautious about that. I think it's still very, very preliminary, so there's a pretty big space for people to explore. And for the data part, yes, that's a tremendous amount of data. It's definitely more than one petabyte. Because our storage is very expensive, we definitely need a lot of infrastructure development. And it would be really nice to have an open-source consortium type of thing that could integrate multiple different tools and platform software to tackle the problem, instead of relying on one particular vendor to provide the technology. That's just our experience in making the system work.
BRUCE FISCHL: Another use case would be in the other direction, which is from ex vivo human to in vivo human. That is, how can we use the data that we've collected as part of BICAN to improve our analysis of in vivo human imaging data? And I think there's a broad set of techniques. I mean, we talk about unsupervised techniques, but there's also the broad set of self-supervised and semi-supervised techniques that can be helpful here, because we're generating lots of information that's not going to be generally available to us in vivo. And so how can we leverage that information? How can we use what is developed on these fairly small data sets to analyze the hundred thousand data sets in UK Biobank, to find genes or subtypes of Alzheimer's or this kind of thing?
LAURA BRATTAIN: Thanks. Claire, you raised your hand.
CLAIRE WALSH: Yeah. Hi. Thanks very much. I was just thinking about another potential use case. But, yeah, I think Bruce's one is great. This is just another one, as someone with a relatively newer modality in the whole-brain human imaging field. Adapting the existing tools to a newer modality, so a modality like HiP-CT, a synchrotron tomography technique, is quite challenging. The tools don't work out of the box, and there's a lot of adaptation needed. So that's a potential use case: making the tools more easily adaptable to new modalities that come along.
LAURA BRATTAIN: Yeah, that's another good point. Great. So how do we go from here? What do we envision we should be doing for the next 3 or 5 years or even 10 years? Any thoughts on that? What do we think the top action items we should put down—
HANCHUAN PENG: Five years?
LAURA BRATTAIN: —for the next 3 years, 5 years, and 10 years?
HANCHUAN PENG: I think technology actually evolves really, really quickly, and so it's very difficult to predict. I cannot speak for others, right, but at least for our own project, the progress sometimes surprises us. Of course, we also encounter a lot of frustration. So, okay, a lot of things we think should work, but they don't. But a lot of things that we didn't give much hope turn out to be surprisingly interesting. For example, the clearing that people talked about spending a lot of time on actually relates to this. I didn't really work on clearing myself before. But very recently, one particular team approached us and started to really test their clearing agent, some magical agent, on our tissue, and found, oh, it's just magic, okay. They don't have any kind of deformation. Non-linear deformation is completely removed. Okay. That's very surprising. And combined with that, I can see that maybe something you would need five years to do probably becomes possible in the next two years. All right. But I also want to be a little bit careful about that. When you have this type of new technology out there, a lot of times you also introduce previously unpredicted kinds of artifacts, and then you will need resources to deal with the new problems. So it's really difficult to predict, at least in our case. Yeah.
RICHARD LEVENSON: So I'd like to point to another area that's developing faster than the speed of light, and that's tools like ChatGPT, which is a sort of model for how you can take hundreds of gazillions of pieces of text data and distill them in such a way that you can ask questions and get very sensible answers. And, I mean, obviously, this is very different, but this group is going to be generating data of huge complexity, but also with, at a very high level, the same kind of statistical connectivity that drives ChatGPT, which is words next to other words. And wouldn't it be great if we had a tool where civilians could ask a question and get a reasonable answer?
LAURA BRATTAIN: That's an interesting idea, yeah. Actually, over the weekend, I had a lot of fun playing around with ChatGPT. And I realized maybe a lot of our abstracts or papers could be helped by ChatGPT.
RICHARD LEVENSON: Exactly.
LAURA BRATTAIN: Yeah. But yeah, your point is well taken. Again, it goes back to one of the speakers' talks: we should want to leverage technologies from other fields and apply them to this neuroscience area.
Yeah, go ahead.
HANCHUAN PENG: Related to ChatGPT, there is one thing that was not really mentioned or discussed here, although a lot of people have been working in this particular domain, which is to try to gamify the technology, okay. The reason you want to gamify it is to make the scientific processes, especially the big data analytics, more fun, so that they become more motivating. But of course, there are a lot of details in how to really make that work. But the thing is that if you can make this entire process more motivating and get more people involved, then even if each individual handles only a small piece of data, all together they start to have this greater power to overcome a lot of challenging situations. I think EyeWire, which someone mentioned yesterday, is an excellent example, but as far as I know, some teams have started to make really exciting new games. And we also tried to experiment on that a little bit, and recently tried to connect ChatGPT with our game engine, so that basically someone can just talk using natural language, and it can start to build an image processing pipeline automatically for you by recombining a lot of different modules that we developed on our platform. I'm not sure if that will really work or not, but it sounds like an interesting experiment to do, to try to make some fun out of this big data processing, which a lot of the time is very expensive and very, very boring.
LAURA BRATTAIN: Yeah. Interesting. Yeah. Imagine we can develop tools that excite high school students to play computer games while at the same time annotating for us. That would be cool. Another aspect is leveraging these deep learning networks that have been used for chatbots or gamification technology and seeing if we can apply them directly to process biometric data. That's another aspect we should definitely look into as a team. Right. I think we're supposed to stop at 3:10, so we have one more minute. Who wants to jump in and give us a nice summary of where we are and where we should be going?
YONG YAO: Well, there's a lot of interesting discussion in the chat space. Maybe, if you are interested, just continue there in parallel to the session.