2022 High Throughput Imaging Characterization of Brain Cell Types & Connectivity-Day 2, Part 4
YONG YAO: Okay, in the interest of time, let's move on to the session one, imaging data preprocessing. Moderators Adam Glaser and Yongsoo Kim, please take it away.
YONGSOO KIM: Thank you, Yong. Let me start screen share. Okay, so yeah. Thank you for the opportunity to lead session 3-1, image data processing. Adam Glaser and I will moderate this session, and here's a list of panelists. We have an excellent group of people. Here are the note-takers. And we have four subtopics. The first one is image acquisition, led by Kwanghun Chung; topic two is image processing, led by Dawen Cai; and topic three is image registration, led by Mike Miller. And we introduced a last-minute change, a use case: William Yang will demonstrate single-cell reconstruction using electron microscope data. From here, Kwanghun.
KWANGHUN CHUNG: Okay, so here are the three questions that Yongsoo and Adam suggested we discuss regarding image acquisition. The first question is where are we now, what are the current strategies for acquiring large-scale human brain data sets? Second question is, what requirements does the target scale impose on data acquisition parameters and quality control? And the third question is considerations for online pre-processing routines. Next slide. So we can start with the first question, where are we now, what are the current strategies for acquiring large-scale human brain data sets? Just to facilitate discussion, I just listed three modes of imaging: epifluorescence imaging of thin sections - it has been widely used for spatial transcriptomics techniques and traditional histology - and one-photon, two-photon point scanning imaging of thin sections or thick cleared slabs, and the third mode of imaging is light-sheet microscopy of thick cleared slabs, which is becoming increasingly popular. And we can discuss all these modes of imaging, or we can focus on the light-sheet imaging modality. Any comments?
YONGSOO KIM: So I just want to ask quick questions about particular image size, since this session is more about data processing. For the human light-sheet data, if you image at cellular resolution, like let's say micron resolution versus synaptic resolution, sub-micron resolution, what is kind of predicted data size?
KWANGHUN CHUNG: So if we image a whole human brain hemisphere at single-cell resolution, about one micron isotropic voxel size, the data size is about, I would say, 100 to 500 terabytes per brain hemisphere, and it takes about a week or two weeks. This is when we image about four- to five-millimeter-thick tissues. So one brain hemisphere, if we slice it into five-millimeter-thick slabs, we usually get like 50 slabs, and with one microscope we can image 1 to 5 slabs every day. If we increase the resolution— another advantage of using light microscopy and light-sheet imaging is we can switch the objective and get higher-resolution images quite easily, but the challenge there is that, as we increase the resolution, the data size becomes much larger, and also the data acquisition time increases because the higher NA, higher magnification objectives have a small field of view. So there is a trade-off between the resolution we can get and also the data size and our imaging time. Perhaps Elizabeth can add more.
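As a rough check on the figures quoted in this answer, the short Python sketch below works through the slab arithmetic. Only the 50 slabs, the 1 to 5 slabs per day, and the 100 to 500 TB range come from the discussion; the function name and the `n_microscopes` parameter are assumptions added for illustration.

```python
# Figures quoted in the discussion; everything else is illustrative.
SLABS_PER_HEMISPHERE = 50          # ~5 mm slabs per hemisphere
SLABS_PER_DAY_RANGE = (1, 5)       # per microscope
SIZE_RANGE_TB = (100.0, 500.0)     # per hemisphere at ~1 micron isotropic voxels


def imaging_days(n_slabs, slabs_per_day, n_microscopes=1):
    """Days needed to image n_slabs, assuming microscopes run in parallel."""
    return n_slabs / (slabs_per_day * n_microscopes)


slow = imaging_days(SLABS_PER_HEMISPHERE, SLABS_PER_DAY_RANGE[0])
fast = imaging_days(SLABS_PER_HEMISPHERE, SLABS_PER_DAY_RANGE[1])
per_slab_tb = tuple(s / SLABS_PER_HEMISPHERE for s in SIZE_RANGE_TB)

print(f"one microscope: {fast:.0f} to {slow:.0f} days per hemisphere")
print(f"data per slab: {per_slab_tb[0]:.0f} to {per_slab_tb[1]:.0f} TB")
```

Under these assumptions a single microscope needs 10 to 50 days, so the "week or two" estimate presumably corresponds to the fast end of the range or to more than one instrument.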
ELIZABETH HILLMAN: I'm not on the panel. I'm not allowed to speak.
KWANGHUN CHUNG: Okay. Any other questions, comments? So I'm checking chat.
PAVEL OSTEN: You're saying 100 to 500 terabytes per entire hemisphere under one micron. That seems probably two to three PB per brain for a channel as well.
KWANGHUN CHUNG: And also it depends what kind of data compression we use.
YONGSOO KIM: I think Pavel made some comment. Pavel, can you speak up a little bit more?
PAVEL OSTEN: Yeah. I was just wondering on this data size, it seemed— I don't remember that we are doing so many chunks of data. It seemed small, to do 100-500 TB, but then if it's compression-wise and a one-micron voxel per whole hemisphere, but I might be also wrong. Maybe it is not. So maybe that's correct. That seems reasonably managed.
KWANGHUN CHUNG: Yeah. One thing that we have found is that, unlike mouse brain, the cell density is actually lower in human brain. And the cell size is larger, so single-cell resolution in human brain imaging may require lower resolution than single-cell imaging in mouse brain.
PAVEL OSTEN: So do you have a feeling what would be sort of an— do you think two micron a voxel would be sufficient? Could that make a very big difference on—?
KWANGHUN CHUNG: Yeah. If we want to just detect all nuclei and classify the molecular expression, I think two micron might be sufficient. But if we want to characterize some basic morphology of cell body shape, then a higher resolution would be better.
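To make the effect of voxel size concrete, here is a minimal Python sketch. It assumes isotropic voxels and no compression, so raw size scales with the cube of the voxel-size ratio. The 100 to 500 TB reference range at one micron is the figure quoted earlier; the function name and the list of voxel sizes are illustrative.

```python
def scaled_size_tb(size_tb_at_ref, ref_voxel_um, new_voxel_um):
    """Raw size scales with voxel count, i.e. (ref/new)**3 for isotropic voxels."""
    return size_tb_at_ref * (ref_voxel_um / new_voxel_um) ** 3


REF_RANGE_TB = (100.0, 500.0)   # quoted for ~1 micron isotropic voxels

for voxel_um in (0.5, 1.0, 2.0, 3.0):
    lo, hi = (scaled_size_tb(s, 1.0, voxel_um) for s in REF_RANGE_TB)
    print(f"{voxel_um} um: {lo:.1f} to {hi:.1f} TB per hemisphere")
```

Going from one to two microns cuts the raw volume eightfold, which is why the choice matters so much. Compression, as noted above, would change the absolute numbers.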
PAVEL OSTEN: So it probably could be— how much data do you actually have that you've collected sort of and run analysis on them? I mean, can you just do a comparative between 0.5, 1, 2, 3 micron voxels and just sort of quantitative representation of the trade-offs?
KWANGHUN CHUNG: Pavel, could you repeat your question again?
PAVEL OSTEN: I'm wondering if you have enough data to do sort of quantitative representations of the trade-offs based on resolution. Because I think that would be really useful for everybody to know that if I want dendrite, I need this. If I want soma, I need this. If I want—
KWANGHUN CHUNG: Yeah. I mean I can certainly put together some numbers and share with the community. I don't have it right now. But what we do is, for single-cell resolution imaging, we use a 2X objective, 0.3 NA, and that's sufficient. And for imaging morphological details of neural and non-neural cells, right, to further understand their functional state based on cell morphology, for example in the case of microglia and astrocytes, we use an 8X or 15X objective. And for imaging individual fibers, we do tissue expansion. So we expand tissue 4- to 5-fold linearly, and then we use a 15X or 8X objective, and that preserves all the individual fibers so that we can trace them. If you want to image individual synapses, then we use a 1.3 NA, about 60X objective with like fourfold expansion. And that gives us sufficient resolution to detect all the synapses. But the data size is huge, so we cannot really image a large volume.
UNKNOWN: Can I ask, I guess people's opinions, when we saw Xiaowei and Hongkui's talk yesterday, it was very impressive, but then they hadn't measured every individual coronal slice as far as I understood. I think there was space between those sections. And we've been talking about human brains the last few sessions as we're going to sample every single voxel in every single micron. But how much do we get if we— can we compromise on z-sectioning? Can we miss every 9 out of 10 planes? Can we miss 99 out of 100? What are people's opinions about that?
PAVEL OSTEN: That's a really good question. And again, I think ideally it would be, take the data set that exists at the high resolution and full extent and just downsample it and see the trade-offs, quantitative trade-offs in terms of, all right, I know that I have this many cells in this volume. If I take half of that volume at intervals, if I do five millimeters and then skip five millimeters, what am I losing? Or if I do one millimeter and skip one millimeter and what's that? I don't know if— Kwanghun have you thought of sort of looking into down sampling or reslicing or figuring out what's the trade offs on— do we have to image everything? Do you have to image everything?
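The downsampling experiment proposed here can be sketched on synthetic data. The Python example below is a toy under strong assumptions: cells are placed uniformly at random along one axis, which real tissue is not, and the volume depth, slab thickness, and function name are all hypothetical. It images one slab in every `keep_every` and scales the count back up.

```python
import random


def estimate_count(z_positions, depth, slab, keep_every):
    """Count cells only in every keep_every-th slab, then scale up."""
    counted = sum(1 for z in z_positions if int(z // slab) % keep_every == 0)
    n_slabs = int(depth // slab)
    n_imaged = len(range(0, n_slabs, keep_every))
    return counted * n_slabs / n_imaged


random.seed(0)
DEPTH_MM, SLAB_MM = 100.0, 1.0      # hypothetical volume depth and slab thickness
cells = [random.uniform(0.0, DEPTH_MM) for _ in range(100_000)]  # synthetic, uniform

for keep_every in (1, 2, 10):
    est = estimate_count(cells, DEPTH_MM, SLAB_MM, keep_every)
    err = abs(est - len(cells)) / len(cells)
    print(f"image 1 slab in {keep_every}: estimate {est:.0f}, relative error {err:.3%}")
```

With uniform density the error stays small even at one slab in ten. The interesting case is a real data set, where layering and regional differences would make the error depend strongly on where the skipped slabs fall.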
KWANGHUN CHUNG: Yeah. That's a really good question, and we haven't really studied the trade-off, but yeah, for applications that require really high-resolution imaging, like MER-FISH, then in that case subsampling is kind of the only way to cover the entire mouse brain or a large volume of human brain. So light-sheet imaging of cleared tissues at relatively low resolution, like single-cell resolution, for example, because we can kind of scan five-millimeter, four-millimeter thick tissues with kind of one shot, and because we don't have to deal with many thin sections, right, if we cut a human brain hemisphere into four- to five-millimeter thick slabs, we only get 50 slabs, right? So handling 50 slabs and imaging multiple slabs with very limited manual labor— with very little manual labor is actually quite easy and straightforward. So for this kind of single-cell resolution, low-resolution imaging of large volumes with a high-speed microscope, I think covering the entire volume could be actually easier than trying to do subsampling. There's not much gain by doing subsampling.
PAVEL OSTEN: So but then you— I think you have one brain, right? So you have one brain, and that's great, but now you want to see how that changes across different people, which will be one of the most really exciting questions that can come out of that, is what is the variability between people and taking it into the disease. So if you can, based on your one brain or two brains that you image in a full XYZ say, "Well, if I now do 100 brains with the same marker, I just need to do one-quarter of that because that still gives me all the data that I have for statistically comparing that I'm getting the same numbers." Then you cut yourself a lot of— you reduce the amount of the work. We can reduce it probably very significantly for comparative studies, because this is setting up the framework for doing everything you want to do with respect to genetics disease, and it's like the first human genome was nice, but it told us Jim Watson has genes and Craig Venter has genes, but it doesn’t tell the variability, and—
KWANGHUN CHUNG: Yeah. I mean what you just suggested makes total sense, and that's what actually we and many others are doing. So certainly we want to do high-resolution imaging to categorize cell morphology, synapses, and also image many molecules, but we cannot get all the information from the entire human brain. So what we've been doing is to just facilitate multimodal data integration between MRI and— like imaging, we just image the whole brain with a small number of markers, right? We talked about this in the previous sessions. Imaging a small number of markers in a large number of whole brains, and imaging a large number of markers in a small number of whole human brains or small brain regions. So merging this focused, high-content data with lower-content but brain-wide data seems to be the way to go.
YONGSOO KIM: Sorry to cut it in. I think we don't want to be in this topic for too long. Would you mind moving to the next slide?
KWANGHUN CHUNG: Yeah. Let's move to the next slide. The second question is— next slide, please. Okay, the second question is what requirements does the target scale impose on data acquisition parameters and quality control? We discussed this briefly already. But here is basically the resolution range all the way from 10 centimeters, which covers the entire brain, down to nanometers to see synapses, and we can get all this information using different objectives and also using tissue processing techniques like expansion microscopy. But there is really a trade-off. For example, working distance. We can easily image four to five millimeters with low NA, long working distance objectives to resolve all single cells. But to image cell morphology at high resolution, or synapses, we have to further slice the tissue. We cannot image five millimeters, say four millimeters thick, because limited tissue transparency or residual differences in refractive index cause spherical aberration, and that really affects the image quality when we use high NA objectives to image fine features. So we have to think about all these things.
YONGSOO KIM: Want to look at the next and then—
KWANGHUN CHUNG: And yeah, the third question is considerations for online pre-processing routines. I think if we don't have much time we can skip this because it is obviously less important.
YONGSOO KIM: I think this one Dawen is going to cover. I think David Kleinfeld raised hand. David? David, I think you're muted.
DAVID KLEINFELD: Yeah, KC. You raise exactly a point that came up yesterday, but if people don't mind my mentioning it, there seems to be a little bit of discord or divergence of opinion between exactly how aberration free the samples are. So I mean I know this is sort of a boring topic, but it's the absolutely critical topic for making progress in imaging. It'd be great to establish standards, maybe what researchers need to show about the point spread function - and Elizabeth brought this up - of their sample as a function of depth, as a function of off-axis distance, standard things you would learn about in an Optics 101 course at the University of Arizona.
YONGSOO KIM: Thank you.
KWANGHUN CHUNG: Yeah. Good point, David.
YONGSOO KIM: Yeah. Should we move on to the next one? Thank you, Kwanghun and Dawen.
DAWEN CAI: Sure. Yeah. So the next progress— sorry, the next step after we acquire an image, obviously, we need to kind of start to deal with the image data itself. So to guide the discussion for this section, I divided this process into three different topics. I didn't do slide by slide, I just threw it all out so that everybody can see everything all at once. Let's just read through this. I think the first step is the pre-processing. You can also call it post-acquisition processing. So obviously, for all the fluorescence-based or light-field-based microscopy, we have these flatfield issues. Basically, the illumination field is not homogeneous, so the center normally is brighter, and then the edge of the image is normally dimmer, and then you'll see these vignetting effects if you put the tiles next to each other.
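The flatfield problem described here is commonly handled with a reference-image correction. The Python sketch below shows the textbook form, (raw - dark) / (flat - dark), rescaled by the mean gain. It is an illustration only and not necessarily what any lab on the panel uses; the 3x3 tile, the illumination profile, and the function name are hypothetical.

```python
def flatfield_correct(raw, flat, dark):
    """Classic correction: (raw - dark) / (flat - dark), rescaled by the mean gain."""
    gain = [[f - d for f, d in zip(frow, drow)] for frow, drow in zip(flat, dark)]
    mean_gain = sum(map(sum, gain)) / sum(len(row) for row in gain)
    return [
        [(r - d) * mean_gain / g for r, d, g in zip(rrow, drow, grow)]
        for rrow, drow, grow in zip(raw, dark, gain)
    ]


# Hypothetical 3x3 tile: a uniform sample seen through an illumination
# profile that is brighter in the centre, plus a constant dark offset of 10.
profile = [[0.5, 0.8, 0.5], [0.8, 1.0, 0.8], [0.5, 0.8, 0.5]]
dark = [[10.0] * 3 for _ in range(3)]
flat = [[10.0 + 200.0 * p for p in row] for row in profile]  # image of a uniform reference
raw = [[10.0 + 100.0 * p for p in row] for row in profile]   # image of the uniform sample

corrected = flatfield_correct(raw, flat, dark)
values = [v for row in corrected for v in row]
print(f"before: {raw[0][0]:.0f} (corner) vs {raw[1][1]:.0f} (centre)")
print(f"after:  spread {max(values) - min(values):.2e}")
```

After correction the uniform sample reads the same at the corners and the centre, so adjacent tiles no longer show a bright-dim seam.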
And then also, there's a 3D imaging quality degradation from the surface to deeper into the tissue, regardless of whether it's because the sample is not clear enough or the optical design is more prone to aberration. So first is the intensity normalization. And then after these corrections, I would think for individual tiles that we need to register these tiles, also between the spectral channels. I mean, nowadays, we can see more and more imaging modalities are using multi-spectrum to increase their throughput, so that for the same round of imaging, we normally get somewhere between two to four, even eight spectral channels now. Then if we want to reliably say which marker is colocalized or correlated with another marker, obviously, the spectral distortion between channels needs to be adjusted or corrected.
And then registration also comes between adjacent tiles when we make a huge atlas map. And then for these multi-round imaging modalities and the imaging alignment between different rounds or registration between different rounds. And then, finally, for each brain data set, we need to register that to CCF. So that's the first part. But the pre-processing I would consider it's kind of independent to the feature extraction, but it's critical also for getting the right feature extraction if we were calling a feature as multimodal and multiplexed analysis. Should we go through all of these all at once, or we just do it one by one?
YONGSOO KIM: Can we start with image pre-processing? Because I think online compression has been discussed before and these seem to be much-needed things to handle the human large-scale data. Can we start with that?
ADAM GLASER: Yeah. Oh, sorry. I was just going to mention Viren had a bunch of comments on compression. So maybe after Dawen answers, Viren, if you have any thoughts or insights sort of on how this has been applied to EM, that would probably be very useful.
VIREN JAIN: I mean I can just quickly talk about it now, if that makes sense. Sure. So, yeah, I mean we've been looking at this issue mostly in electron microscopy, although a little bit in light microscopy as well where we have these petascale EM data sets. And compressing them is actually non-trivial in terms of getting good compression rates because the data is extremely dense, right? It's not sparse at all and there's content basically everywhere. And you also have to be careful not to degrade the image content or quality in a way that then reduces downstream analysis like segmentation, synaptic identification, etc.
So one strategy that we have found that works pretty well is to first denoise the data using a machine learning approach. So we actually acquire two versions of just a small portion of the data, at a fast speed, let's say 50 nanoseconds in an EM, versus slow, 400 nanoseconds, and then we train a model to basically denoise based on those two different versions of the data. And then once you have such a model, you can apply that to any other data you have from similar acquisition. That basically just removes the shot noise, which is obviously difficult to compress since it's, by definition, random. And then you can use a standard compression approach like JPEG or AVIF or whatever and get almost 17X compression, which is much, much better than what you would get without doing that denoising stuff. So that's just one anecdote from the EM land. You can apply something similar to light microscopy, of course. But you should get much, much bigger gains there given the sparsity of the data and so on.
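The point that random noise defeats compression can be shown with a toy Python example. It does not reproduce the learned denoiser described above. It simply compares a known clean signal with the same signal plus uniform random noise (a stand-in for shot noise) under lossless `zlib` compression rather than JPEG or AVIF. The signal shape and noise amplitude are arbitrary assumptions.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Uncompressed size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, 6))


random.seed(0)
N = 200_000
# Hypothetical 8-bit signal: a slow staircase standing in for image structure.
clean = bytes((i // 50) % 200 + 20 for i in range(N))
# Same signal with additive random noise standing in for shot noise.
noisy = bytes(min(255, max(0, c + random.randint(-8, 8))) for c in clean)

print(f"clean: {compression_ratio(clean):.1f}x, noisy: {compression_ratio(noisy):.1f}x")
```

The noisy version compresses far worse even though the underlying structure is identical, which is the motivation for denoising before compressing.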
DAWEN CAI: Yeah. So maybe I should just add, in addition to what Viren just said, that we also tried this on light microscopy images, and that works pretty well if you have the denoising step done. And even for lossless compression, you can get a 10-fold compression rate because, like Viren already pointed out, the light microscopy data is much sparser. In this compression, if you use the proper file formats, and in our case we use HDF5-based real-time compression with a filter plugin, we can get near-real-time acquisition and compression performance. Of course, it depends on the actual compression. But I think that the denoising step itself, computation-wise, is not necessarily low cost, since that does take a lot of computation—
YONG YAO: Dawen, could you speak up or closer to the microphone?
DAWEN CAI: Can you hear me right now?
YONGSOO KIM: Yes. Your voice is kind of low.
DAWEN CAI: So what I was saying is that for compression, the denoising step, computation-wise, that actually leads to the third step, the scalability measure. But maybe also, put it into here is that— so scalability also, in terms of imaging efficiency. If this data can be compressed on the fly, denoised and compressed, that would definitely save a lot of storage, even for 2 weeks. And also, in general, it would increase the imaging throughput because you don't need to wait until the data have been processed before you start the next round. So we're drafting the manuscript right now to kind of try to solve this as a package, where the backend package just supports the microscope once you get the data stream coming out, and you can do this on the fly. But these are all very feasible ways to do so.
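On-the-fly compression of a data stream can be sketched with Python's standard library. This is only an illustration of the pattern: it uses `zlib.compressobj` as a stand-in for the HDF5 filter pipeline mentioned above, and `fake_camera` is a hypothetical frame source producing mostly dark frames.

```python
import zlib


def compress_stream(chunks, level=1):
    """Compress an iterable of byte chunks incrementally; yields compressed pieces."""
    comp = zlib.compressobj(level)
    for chunk in chunks:
        piece = comp.compress(chunk)
        if piece:
            yield piece
    yield comp.flush()


def fake_camera(n_frames, frame_bytes=4096):
    """Hypothetical frame source: dark frames with a single bright pixel each."""
    for i in range(n_frames):
        frame = bytearray(frame_bytes)
        frame[(i * 37) % frame_bytes] = 255
        yield bytes(frame)


compressed = b"".join(compress_stream(fake_camera(100)))
raw_size = 100 * 4096
print(f"{raw_size} bytes -> {len(compressed)} bytes")
```

Because each frame is compressed as it arrives, the full uncompressed stack never has to be held in memory or written to disk, and the result is still losslessly recoverable.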
YONGSOO KIM: Thank you. And shall we move on to feature extraction, and then scalability?
DAWEN CAI: For the feature extraction part, I mean, I think Viren probably you have more say about this, and especially for the computation cost, right? Scalability, actually, every single step can be tied to the third part. It's a scalability, basically, processing time and speed of computation resource requirement to do all of these. We do this in our individual labs or after the raw compression, send it to a bill, and then everything is processed there, what kind of model should we do? I think, Google at this point has a whole ecosystem, everything in house probably have a lot more experience. I would really love to hear what Viren said about, at least for registration.
VIREN JAIN: Okay, are we talking about computation related to just registration of the data? Or are we talking about segmentation, feature extraction, and so on or everything?
DAWEN CAI: Yeah. Registration. Let's go through registration.
VIREN JAIN: Yeah. Registration, that's an interesting challenge. I mean, for the human EM data set, which Daniel presented, that was 280 million individual 2D images, which have to be computationally combined into a single volume. And it's very difficult, because there's all kinds of artifacts to deal with, plus obviously the scale. That said, I do think because of the challenges there and in other related data sets, the tools that are now available to do stitching and registration at large scale are really pretty decent, actually. Just the off-the-shelf things. There's something called BigStitcher from Janelia. We have developed a tool called SOFIMA, Scalable Optical Flow-based Image Montaging and Alignment. And these tools are from the get-go designed to scale and use different forms of information in the images. In some cases, they're identifying classical features like SIFT or something like that. In other cases, they're computing optical flow from one slice to the next in order to find correspondences. And there's been a lot of work put in to make these things robust.
So I actually think we're in a decent place now, certainly much better than 5, 10 years ago. These things tend to work with light microscopy data sets as well, at least in our hands. So I think it's— I'm actually pretty optimistic at this point. That said, it does require some processing capabilities. And I think, again, the sparsity of the light microscopy data, that merits some additional engineering to make sure you're not over computing on places where there's no data for example. Simple things there, but it does need to sort of be addressed. But compared to the overall cost of the projects that we're talking about here in terms of microscopy, the potential cost of storage, all of that kind of stuff, having a few GPUs attached to any workflow doesn't seem really prohibitive to me. I mean it seems like that's well worth the expense of the hardware as well as figuring out whatever engineering to make that work. Because you're saving so much time and real money actually in storage and other things by doing that.
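The simplest version of the stitching problem is finding the translation between two overlapping tiles. The Python sketch below does that by brute-force search over integer shifts on a small synthetic scene. It is a deliberately naive stand-in and is not how BigStitcher or SOFIMA work; those tools use feature matching or optical flow as described above. All names and sizes are illustrative.

```python
import random


def best_shift(ref, mov, max_shift):
    """Exhaustive search for the integer (dy, dx) minimizing the mean squared
    difference between ref[y][x] and mov[y + dy][x + dx] over their overlap."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            total, n = 0.0, 0
            for y in range(max(0, -dy), min(h, h - dy)):
                for x in range(max(0, -dx), min(w, w - dx)):
                    diff = ref[y][x] - mov[y + dy][x + dx]
                    total += diff * diff
                    n += 1
            score = total / n
            if best is None or score < best[0]:
                best = (score, dy, dx)
    return best[1], best[2]


random.seed(1)
scene = [[random.random() for _ in range(24)] for _ in range(24)]
ref = [row[0:16] for row in scene[0:16]]   # tile whose origin is scene (0, 0)
mov = [row[3:19] for row in scene[2:18]]   # tile whose origin is scene (2, 3)

print(best_shift(ref, mov, 4))  # prints (-2, -3)
```

The cost of this search grows with tile size times the square of the search radius, which is one reason large-scale tools rely on sparse features or flow fields, and skip empty regions, instead of dense exhaustive comparison.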
DAWEN CAI: Great, thanks. I think we should move on, people.
YONGSOO KIM: Then nicely segue to the next segment, which is registration challenges. Mike?
MICHAEL MILLER: Thank you, yeah. Thank you. And I think I'm going to pick up on some of the things that Viren just mentioned, and also in the chat, Peter brings up stochastic sampling. So I think the challenges as we move to molecular scales, never mind electron microscopic— to the molecular scales is just that we really have to think about mapping simultaneously at multiple scales. So we've written papers now about how to build maps that are at every scale. So shown in this picture essentially is 100-micron tissue shown on the left, particles that represent tau pathology in the medial temporal lobe from histology in the middle. And we basically have to describe sort of a three-layer transformation that respects the global structure which— you see the global structure of layer 2 and layer 6 folding in the cortex, and of course these are the global structures that FreeSurfer has been manipulating for a long time. But that's really at the millimeter scale, surfaces at the millimeter scale. But simultaneously, we have to understand the particles themselves.
And so as we think about multiscale brain mapping, it really pushes us into not only having tissue-scale mapping, but we must have particle-scale maps, and so that's really the direction that we've gone. We're trying to build particle-scale mapping that is consistent with the tissue scales. And so that takes us from particle representations, which are extremely sparse because you're only at the place where the cells are, and there's lots of empty space, but there's lots of particles. And as you go up in scale to the coarse scale, you then move to regular lattices, essentially images.
And so the next slide shows an example of this. I just wanted to illustrate it. So let's think of the atlas on the left as the continuum, and maybe at 10 microns, that would have about 500 million. We can think of a voxel as a particle. We call them voxels, but let's think of them as 500 million particles. In the middle column, essentially is a sparsification where we can imagine if we're at 10 microns and we want to represent every cell, that would be about 5 to 10 million particles or cells. This really is the strategy put forward by Elizabeth. She calls them point clouds. But you'd already have 100 to 1 data reduction from manipulating the image. And so if we had particle maps that could work directly on the particles without going to the image lattice, we would be saving a huge amount. And that's the idea.
And so here in the middle column are three examples. Where if you ask the question, I want to do something that's consistent with tissue and I want to be at 100-micron resolution, which would be the high-field MRI that I was showing you, you might only have 14,000, 20,000, 30,000 particles to represent the 100 microns. Of course, that's a lot of dimensions in your map. Every point carries three dimensions. So 20,000 particles would be a 60,000-dimensional map, which really very few people are computing these days.
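The counts quoted in this part of the talk can be checked with a few lines of Python. The particle and voxel numbers are the ones stated above; the function names are illustrative.

```python
def map_dimensions(n_particles, dims_per_particle=3):
    """Each particle carries a 3D position, so the map has 3 * N degrees of freedom."""
    return n_particles * dims_per_particle


def reduction_factor(n_voxels, n_particles):
    """How many lattice voxels each particle replaces."""
    return n_voxels / n_particles


for n in (14_000, 20_000, 30_000):
    print(f"{n} particles -> {map_dimensions(n)}-dimensional map")

ATLAS_VOXELS = 500_000_000   # atlas at about 10 microns, as quoted
for n_cells in (5_000_000, 10_000_000):
    print(f"{n_cells} cells: {reduction_factor(ATLAS_VOXELS, n_cells):.0f} to 1")
```

With 5 to 10 million cells the reduction is between 50 to 1 and 100 to 1, consistent with the "100 to 1" figure as an upper bound.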
Now I want to link it to what Peter said because it really takes us to the geometry of the brain, which was really emphasized by Bruce and Michael and David in the previous session. You could imagine that these particles can be optimally placed. So now you can place the particles since they're not regular. You could place them at the place where surfaces are, where layers are. So if you're interested in the cortical fold, you could have a greater representation of the cortical fold. And you wouldn't have all these regular lattice tiles that are really not interesting if you're interested in something like layer two or layer four. So we think this is important. And I think the relation to stochastic sampling is that you could essentially generate or draw these particles. These are samples, and then they become the places where you represent the information, and they're, in some sense, optimal.
So let me finish with one more slide because— actually two more slides, I lied. So let's go back one. So I just wanted to say, what are the data structures that we see as we go to really molecular maps. And there's an additional data structure that we didn't have at the tissue scale. At the tissue scale, there's a trivial notion of density as foreground-background. Is the brain there or is it background? But now we see that there's really an important role played by cell density. And the cell density sits in parallel to another function. And that function is essentially a high-dimensional function that encodes the information at each point for each particle. So here I'm showing you MER-FISH, where we have maybe 200 or 300 cell types that might have genes, and we essentially have an empirical probability law at each point. So we have a field of probability laws. Each particle that's represented by cell density in the middle carries this high-dimensional feature. And you have to manipulate both of them. And then, of course, as we think about what we're manipulating, we're manipulating essentially geometric change when we build mappings to atlases. So shown here, essentially, is this particular section, put onto that particular section of the atlas. And that's the coordinate transformation that we need to go onto the atlas, as shown on the bottom left. So those are the three data structures that we have to carry. And there will be many more as we represent more geometry.
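One way to picture the three data structures listed here (density, a probability law over features at each particle, and a coordinate transformation to the atlas) is the Python sketch below. It is an illustration under simplifying assumptions and not the speaker's implementation: the `Particle` class and helper names are hypothetical, and a uniform scale plus offset stands in for the nonlinear mappings actually used.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    position: tuple   # (x, y, z) location
    weight: float     # density: how much tissue or how many cells it represents
    features: dict    # empirical probability law, e.g. over cell types


def normalize(counts):
    """Turn raw counts per cell type into an empirical probability law."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def to_atlas(p, scale, offset):
    """Move a particle into atlas coordinates; weight and features ride along.
    A uniform scale plus offset is an assumed stand-in for the real mapping."""
    pos = tuple(scale * c + o for c, o in zip(p.position, offset))
    return Particle(pos, p.weight, p.features)


p = Particle((1.0, 2.0, 3.0), weight=12.0, features=normalize({"typeA": 9, "typeB": 3}))
q = to_atlas(p, scale=2.0, offset=(10.0, 0.0, -1.0))
print(q.position, q.weight, q.features)
```

The design point is that geometry and features are separate: the transformation changes where a particle sits, while its density and probability law are carried with it.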
So let me finish. The other thing that we think is important as we move to these representations is we have to have mappings from atlas to atlas. And I think of atlases like cartoons. Atlases often don't carry very rich information. So we have to go from very rich information like MER-FISH and other things, onto atlases which are like cartoons or caricatures, and then we have to go between these caricatures. We have to be able to go from one partition, shown in the left is the Allen Atlas, shown in the middle top is another partition. They're not the same. But it turns out that there is really a minimum energy mapping that brings the labels, the ontology in the middle, onto the labels, the ontology on the left. And we have to be able to do that. We think there are going to be many atlases obviously because we're in this very high dimensional space of functional features at every point. It's not simple like a single contrast or a 3 by 3 DTI matrix. It's very high dimensional, so there are going to be lots of different atlases. Here are two. One's Allen and one's Kim. So with that I'll finish with those slides. Thank you, Yongsoo, for giving me the time.
YONGSOO KIM: Thank you. Anybody have a comment in this registration part?
HONGWEI DONG: Hi Michael. I have a question regarding your particles. So these particles, do they correspond one-to-one to real cells, or are they just simulated dots?
MICHAEL MILLER: Yeah, they can be. So if we go down to 10 microns in resolution and want to manipulate the 10-micron map, then you could imagine— for the mouse we have there's about 7 million particles. So they would be one cell, one particle. But more generally, they could be an abstraction. So they could be— it really has to do with the resolution you want to operate at. So you could even take those 10 million cells and say, "I want to sparsify that and not manipulate all of them." You could then calculate a subset of particles that would optimally represent those cells. And it might be 50,000 cells. You might have more cells in CA1 or CA2 for whatever reason, and you might have fewer in CSF or other locations. So even though you had lots of particles to start with, you could have a coarser representation. So we view it in both directions.
I definitely think, Hongwei, thinking about your work, when you get to the tissue scales, the particles are much more like a voxel. They carry a big island of tissue. And so they're a collection, they're an aggregation of the cells that go in to represent them. And you can almost think of it like the pyramid that we saw in the previous talk. Only in this setting, the mappings are not linear, so it's not really directly multigrid. It's more general than that. But it's like that. So a particle could be an aggregation of cells, or within a cell an aggregation of RNA, or we think of it as going higher to tissue. So thank you for that.
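The aggregation idea, where a coarse particle stands for a collection of cells, can be sketched as follows. This Python example bins particles on a regular cubic grid purely for simplicity; the talk stresses that particles can be placed non-uniformly and that the real mappings are not linear, so this illustrates only the bookkeeping. The tuple layout, function name, and grid size are assumptions.

```python
from collections import defaultdict


def aggregate(particles, cell_size):
    """Merge particles that fall in the same cube of side cell_size.
    Each particle is (position, weight, feature_probs)."""
    bins = defaultdict(list)
    for pos, w, feats in particles:
        key = tuple(int(c // cell_size) for c in pos)
        bins[key].append((pos, w, feats))
    coarse = []
    for members in bins.values():
        total = sum(w for _, w, _ in members)
        # Weighted centre of mass of the merged particles.
        centre = tuple(sum(w * pos[i] for pos, w, _ in members) / total for i in range(3))
        # Weighted mixture of the members' probability laws.
        mixed = defaultdict(float)
        for _, w, feats in members:
            for name, prob in feats.items():
                mixed[name] += w * prob / total
        coarse.append((centre, total, dict(mixed)))
    return coarse


cells = [
    ((1.0, 1.0, 1.0), 1.0, {"A": 1.0}),
    ((3.0, 1.0, 1.0), 1.0, {"B": 1.0}),
    ((12.0, 1.0, 1.0), 1.0, {"A": 1.0}),
]
coarse = aggregate(cells, cell_size=10.0)
for particle in coarse:
    print(particle)
```

Two cells in the same cube become one particle of weight 2 whose feature law is an even mixture of types A and B, while the isolated cell is kept as it is.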
HONGWEI DONG: All right, thank you. Hanchuan.
HANCHUAN PENG: So I actually kind of share the same question. I'm just a little bit curious about what the actual physical anatomy of the particle is. Are they just some sort of landmarks, very, very densely sampled landmarks corresponding to the salient features in the brain, or something else?
MICHAEL MILLER: Yeah, you can start there. So generally we start with particles at the places where the RNA ends if we have that. But then we often go to aggregation, so we know where the cells are. But think of it in MRI and DTI, a particle might be a voxel if you want to work on a regular lattice, but it might not be because, if you think about what Bruce has been doing with FreeSurfer, there the particles are sort of glued to the surface, so they're triangles. They're not a regular lattice. So you have this aggregation of— you could have regular lattice sites then you could have the triangles that represent the surface geometry.
HANCHUAN PENG: So the particles, they do have the relative location with respect to each other, so they are kind of more like a point on the mesh, right?
MICHAEL MILLER: That's right. That's right, and you can— I was saying to you that we've done this so that when you get to 100 microns, the map is completely consistent with having a regular mesh and mapping MRI, and so you can sort of see in the lower right, if you have lots of particles, it sort of looks like a continuous object, right? So yeah. So exactly as you were saying, Hanchuan.
HANCHUAN PENG: Yeah, I think that makes sense. It's actually kind of consistent with what I thought, yeah.
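The aggregation idea in this exchange, where a coarse-scale particle stands for a collection of cells, can be illustrated with a minimal sketch. It covers only the simplest regular-lattice case that Michael Miller mentions (a particle as a voxel); it does not attempt the surface-attached particles or the nonlinear mappings he describes. The function name `aggregate_cells`, the 100-micron bin size, and the example coordinates are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def aggregate_cells(cells: List[Point], bin_um: float) -> List[Dict]:
    """Group cell positions (in microns) into coarse particles on a regular lattice.

    Each particle carries a weight (the number of cells it represents)
    and a position (the centroid of those cells).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for x, y, z in cells:
        # Lattice site that this cell falls into.
        key = (int(x // bin_um), int(y // bin_um), int(z // bin_um))
        acc = sums[key]
        acc[0] += x
        acc[1] += y
        acc[2] += z
        acc[3] += 1
    particles = []
    for key in sorted(sums):
        sx, sy, sz, n = sums[key]
        particles.append({"position": (sx / n, sy / n, sz / n), "weight": n})
    return particles


# Hypothetical cell positions in microns: two fall in one 100-micron bin, one in another.
example_cells = [(10.0, 10.0, 10.0), (20.0, 30.0, 40.0), (150.0, 10.0, 10.0)]
example_particles = aggregate_cells(example_cells, bin_um=100.0)
for p in example_particles:
    print(p)
```

The total weight across particles equals the number of input cells, so the coarse representation keeps the cell count while discarding the individual positions.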
YONGSOO KIM: Thank you. I see that Dawen's hand is raised, right? I just want to give at least nine minutes to the last part. Dawen, do you mind talking with Mike offline? The last segment is a use case by William Yang. William.
WILLIAM YANG: Yeah, thanks, Yongsoo, for giving me the chance to present this. So I want to actually present some work that's done by this collaborative group with Hongwei and also Jason Cong, Daniel Tward, and Giorgio Ascoli. Next slide, please. So the goal here is really how to process these large-volume image data sets of very sparsely and brightly labeled single neurons in the brain. So this is using these MORF mice that we generated through the BRAIN Initiative, which label hundreds to thousands of single neurons with their complete morphology visualized. Next slide.
We have two different pipelines for imaging, but I just want to focus on the light-sheet pipeline. We use SHIELD to clear the brain and automatically stain the MORF reporter, and then image with the light sheet at 4X and 15X. Next slide, please. So imaging at 4X, we can already visualize the individual dendrites as well as the axon terminals. Next slide. But now we can actually do this isotropic 15X imaging with 0.42 micrometers per pixel in X, Y, and Z. It takes about 40 hours. The data size is really big, 6.5 terabytes, but after compression and post-processing, actually by Hongwei's lab, it's about 1.7 terabytes. So what we want to do is really to be able to access and analyze this data set using a local computer instead of uploading it onto some kind of cluster or cloud. That's needed because currently there is no easy pipeline to analyze individual neuron morphology and reconstruct them, so we need to access them, and also this will save a lot of time and money. Next slide, please.
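As a rough back-of-the-envelope check on the figures quoted above (6.5 terabytes raw, 1.7 terabytes after compression, 0.42 micrometers per voxel, about 40 hours), the sketch below works out what they imply. The bit depth, the single channel, and the use of decimal terabytes are assumptions that were not stated in the talk, so the derived voxel count and volume are indicative only.

```python
# All constants marked "assumption" are illustrative and not from the talk.
BYTES_PER_TB = 10**12      # assumption: decimal terabytes
BYTES_PER_VOXEL = 2        # assumption: 16-bit, single-channel voxels

RAW_BYTES = 6_500 * 10**9          # 6.5 TB raw acquisition (from the talk)
COMPRESSED_BYTES = 1_700 * 10**9   # 1.7 TB after compression (from the talk)
VOXEL_UM = 0.42                    # isotropic voxel edge in micrometers (from the talk)
ACQUISITION_HOURS = 40             # approximate imaging time (from the talk)


def voxel_count(n_bytes: int, bytes_per_voxel: int = BYTES_PER_VOXEL) -> int:
    """Number of voxels that fit in n_bytes at the assumed bit depth."""
    return n_bytes // bytes_per_voxel


def imaged_volume_mm3(n_voxels: int, voxel_um: float = VOXEL_UM) -> float:
    """Physical volume covered by n_voxels isotropic voxels, in cubic millimeters."""
    return n_voxels * voxel_um**3 / 1e9  # 1 mm^3 = 1e9 um^3


def mean_rate_mb_per_s(n_bytes: int, hours: float) -> float:
    """Average sustained data rate over the acquisition, in megabytes per second."""
    return n_bytes / (hours * 3600) / 1e6


n_voxels = voxel_count(RAW_BYTES)
volume = imaged_volume_mm3(n_voxels)
ratio = RAW_BYTES / COMPRESSED_BYTES
rate = mean_rate_mb_per_s(RAW_BYTES, ACQUISITION_HOURS)

print(f"voxels: {n_voxels:.3e}")
print(f"imaged volume: {volume:.1f} mm^3")
print(f"compression ratio: {ratio:.2f}x")
print(f"mean acquisition rate: {rate:.1f} MB/s")
```

Under these assumptions the first-stage compression is a bit under 4X, which is the gap that the sparse representation described next is meant to close.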
So to this end, our collaborator Karl Marrett in Jason Cong's lab developed this algorithm called Recut. They really take advantage of this sparse computing and also parallel processing sort of idea, because most of the image in this giant light-sheet volume does not have any data. But we need those really brightly labeled single neurons, from dendrites to axons to axon terminals, that are brain-wide. So here in this algorithm, we implemented a unit for morphological detection. The most important thing is the sparse footprint - I'll tell you which program - which readily allows compression of several hundredfold, and also lets us readily access and retrieve data on the local computer. So compression, proofreading, and analysis of the data are all done locally, as compared to uploading. So right now we can actually compress the terabyte-scale light-sheet data per brain hemisphere to a few gigabytes, and it can be analyzed locally. Next slide.
So one of the key ideas here is using this hierarchical volumetric compression, and also this VDB grid, which can achieve over 500X lossless compression, and then enable... yeah, next slide, please. Sorry. So the nice thing about VDB is that it was actually originally developed for animation by DreamWorks. So there are a lot of high-performance programs and tools available for visualization that you can use. And also the program is designed to be multi-threaded. So basically, the performance depends on the density of labeling: the higher the density, the more time it takes to reconstruct. Next one.
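To make the sparse-grid idea concrete, here is a toy sketch of why storing only the occupied blocks of a mostly empty volume compresses so well while remaining lossless for the stored voxels. It is not the VDB data structure and does not use the OpenVDB or Recut APIs; the two-level dictionary layout, the leaf size of 8, the byte accounting, and the synthetic straight "neurite" are all assumptions made for illustration.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int, int]
LEAF = 8  # assumption: leaf blocks of 8 x 8 x 8 voxels


def build_leaves(voxels: Dict[Coord, int], background: int = 0) -> Dict[Coord, Dict[Coord, int]]:
    """Group non-background voxels into leaf blocks keyed by block coordinate."""
    leaves: Dict[Coord, Dict[Coord, int]] = {}
    for (x, y, z), value in voxels.items():
        if value == background:
            continue  # background is implicit and never stored
        block = (x // LEAF, y // LEAF, z // LEAF)
        leaves.setdefault(block, {})[(x % LEAF, y % LEAF, z % LEAF)] = value
    return leaves


def read_voxel(leaves: Dict[Coord, Dict[Coord, int]], x: int, y: int, z: int, background: int = 0) -> int:
    """Look up one voxel; anything outside an allocated leaf is background."""
    leaf = leaves.get((x // LEAF, y // LEAF, z // LEAF))
    if leaf is None:
        return background
    return leaf.get((x % LEAF, y % LEAF, z % LEAF), background)


def dense_bytes(shape: Coord, bytes_per_voxel: int = 2) -> int:
    """Size of the full volume stored as a dense array."""
    return shape[0] * shape[1] * shape[2] * bytes_per_voxel


def sparse_bytes(leaves: Dict[Coord, Dict[Coord, int]], bytes_per_voxel: int = 2, key_bytes: int = 12) -> int:
    """Rough size estimate: each allocated leaf is stored densely, plus its block key."""
    return len(leaves) * (LEAF**3 * bytes_per_voxel + key_bytes)


# Synthetic example: one straight, one-voxel-wide "neurite" crossing a 512^3 volume.
shape = (512, 512, 512)
neurite = {(x, 100, 200): 1000 for x in range(512)}
leaves = build_leaves(neurite)
ratio = dense_bytes(shape) / sparse_bytes(leaves)
print(f"allocated leaves: {len(leaves)}, estimated compression: {ratio:.0f}x")
```

The ratio this toy prints depends entirely on the synthetic labeling density and says nothing about Recut's reported figures. It only shows the mechanism: the storage cost scales with the number of occupied blocks rather than with the size of the imaged volume, which is also why denser labeling costs more.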
This is my last slide. As I said, I also incorporate these models for soma detection, neuron detection, and also segmentation and skeletonization. And we can also do proofreading within Recut. You can also incorporate additional programs, such as topological footprinting, and it's open source. I have a link and also the preprint in the corner. And we think this solution could be helpful to others who are interested in sparsely labeled signals within these large brain-volume imaging data sets. That's it. Thanks for the opportunity also.
YONGSOO KIM: Thank you. Any questions or comments for William?
HANCHUAN PENG: I have a question. So this is actually used for mouse, right? So how do you envision this approach potentially being applicable to the human brain too?
WILLIAM YANG: Yeah, I think it's a really good question. So the program is actually designed to be computationally simple and efficient, so it's truly scalable with the computational resources. So imagine taking a human data set with some of the sparse data as we discussed earlier. For example, Kwanghun mentioned that if you look at the cell-body or morphological level, the data is quite sparse. So we think this could potentially be applied to look at the sparser but large-volume data sets from human as well.
HONGWEI DONG: Can I just add one point following William's comments? To apply this to human, what really matters here is just how to label. As long as we can label the human cells' morphologies, then this program can be applied, pretty much.
HANCHUAN PENG: Thanks.
YONGSOO KIM: I think something like this can be really helpful to reduce size, because what William demonstrated is six terabytes to a few gigabytes after feature detection registration. I think there might be a way to handle some of the large-scale human data. I'll open up the floor for any questions or any other additional comments in the remaining time.
HANCHUAN PENG: I actually do have a question for one of the previous presenters. So I wasn't really sure, when we actually talk about human brain imaging, what the actual objects are that people have been thinking about, that have been kind of implicitly implied out there, right? So, say, "Oh yeah, I have half a human brain and use the light sheet to do this in a couple of weeks or whatever," but what exactly do you want to image out there, right? If you don't actually have anything labeled, it's kind of pointless to even spend the time. But the problem is, just as Hongwei pointed out, one of the major issues is that you actually want to sparsely label something out there, and the labeling becomes a major challenge. In my view, the major issue out there is not the imaging, but how to label the meaningful objects you want to study. And for human tissue, that's not very well defined, because you cannot use a lot of this genetic technology out there. So I would love to hear a little bit more discussion along that particular line: what people want to label out there, and how to actually get the things you want, right? And yeah.
YONGSOO KIM: Anybody from the labeling? I think there were breakout sessions to discuss this labeling part. Anybody from that breakout session may comment.
KWANGHUN CHUNG: Maybe I can start. So the advantage of proteomic imaging is that if you have an antibody for any target protein, you can basically label any cell types, any morphological features, any sub-cellular organelles, including synapses, axonal projections and dendritic arbors, cell bodies. So at low resolution we can basically label the cell bodies using cell type-specific antibodies. At high resolution, we can visualize individual axons, dendritic arbors. We can do reconstruction of labeled neurons, their morphology. And at really high resolution, super-resolution, we can even image individual synapses. So without using genetic labeling, we can get a lot of information at many different scales.