2022 High Throughput Imaging Characterization of Brain Cell Types & Connectivity-Day 2, Part 6
YONG YAO: The next session. So session three, about the data infrastructure, data storage, scalable computing, and dissemination, is moderated by Satra and Harry. Please take it away.
SATRAJIT GHOSH: Hi, folks. It's been a long day, and we are perhaps the only session remaining between you and taking a break or doing other things. So we'll try to keep it as brief as possible and kind of cover some of the elements that we want. You've already gone through these last two sessions thinking about the kinds of data and the kinds of processing that you need to do. We are coming at it at the back end of this. To do all of these things you need infrastructure, and so one of the questions is, how do we achieve that? So I'm going to share a few slides, and then Harry will share a few questions, and then we have a set of questions for our panelists and anyone here in the crowd. We'll share the notetaking doc. Please add new questions there, and Harry and I will do our best to moderate this session.
Can people see the slides?
ALEX ROPELEWSKI: Yes.
SATRAJIT GHOSH: Fantastic. All right. So this session was around— the whole session was around data analysis pipelines and data handling, and we are going to be focusing on infrastructure as it pertains to these components. The overarching questions that I think Yong sent out originally were around how to make data useful and accessible, what kinds of image processing and analysis techniques are more appropriate, what can be done computationally with the scales of data we might be getting at, what are the requirements. And you've seen many of these elements that have been covered in the previous sessions.
I wanted to just put a little bit of numbers on data a little bit. We've talked about a lot of data over the last couple of days. So the top table shows kind of data sizes per sample, and I leave it to your imagination to extrapolate. If you like 1,000 brains, you multiply it by 1,000. If you'd like 100 markers, you multiply it by 100. And you will get certain numbers that will start reaching the exabyte scales of data that we are talking about. You've also seen Elizabeth's line, and I think I'm just repeating that specific line over here, which is about three petabytes of data for a whole human brain covering eight channels. And that's one of the numbers you might want to use, and again, multiply in various ways to get at the scales of data you might be looking at.
By comparison, Wikipedia has about 20 gigabytes of data in a snapshot of the text data. It gets to about 10 plus terabytes with its history and about 200 plus terabytes with media. That whole collection is dwarfed by the kind and amount of data we are expecting. The Earth Engine, as Simon talked about earlier, has about 70 petabytes of public data and CERN has about 420 petabytes of public data, with one of the new experiments slated to reach about 600 petabytes of data. So we are seeing an explosion, and we can learn from a lot of these elements and fields in how we might handle some of the elements of our data.
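The multiplication described above can be written out as a short back-of-the-envelope calculation. This is a minimal sketch that uses only the figures quoted in the session (about 3 PB per whole human brain at eight channels, plus the Wikipedia, Earth Engine, and CERN comparison points); the function name and the `marker_factor` parameter are hypothetical, and every sample is assumed to be the same size.

```python
# Back-of-the-envelope scaling, using decimal (SI) units.
TB = 10**12
PB = 10**15
EB = 10**18

# Figure quoted in the session: ~3 PB for one whole human brain, eight channels.
PER_HUMAN_BRAIN_BYTES = 3 * PB


def projected_bytes(per_sample_bytes, n_samples, marker_factor=1):
    """Total size, assuming identical samples and a simple multiplier for extra markers."""
    return per_sample_bytes * n_samples * marker_factor


total = projected_bytes(PER_HUMAN_BRAIN_BYTES, n_samples=1000)
print(f"1,000 human brains: {total / EB:.1f} EB")

# Comparison points quoted in the session.
references = {
    "Wikipedia with media (~200 TB)": 200 * TB,
    "Earth Engine public data (~70 PB)": 70 * PB,
    "CERN public data (~420 PB)": 420 * PB,
}
for name, size in references.items():
    print(f"{name}: the projection is {total / size:,.0f}x larger")
```

Changing `n_samples` or `marker_factor` shows how quickly the total moves from petabytes into exabytes.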
This is a fairly standard storage infrastructure. This is a slide I stole from Alex who got it from Hewlett-Packard at some point in time. It shows different kinds of cold and hot storage and compute capabilities. It doesn't go into details, and this is a very generic infrastructure that might exist for some of the things we want to do. I wanted to bring up a slightly different architecture from a slightly different enterprise, Pangeo, which is for geosciences. This is a community bringing together a bunch of different options around computing with very large-scale data in the cloud. And they're driving it from many different components: how to store data, what kinds of formats to use, how do I create interfaces to data so that I could compute on it as well as visualize it at scale? And how do I compute on it at scale in the cloud and other places? And could this be created in local infrastructures and cloud infrastructures and national infrastructures? So they are addressing many of these questions.
The motivations behind Pangeo are very similar, I think, to our motivations. We're going to be getting very big datasets. We need to deal with them. There's a technology gap between the technological sophistication of various industry solutions and scientific software. Reproducibility as we have a fragmented space of software, tools, and environments becomes a question. How do we think about rigor and reproducibility in this context? So Pangeo hopes that their kind of consortium community infrastructure will help address that. And this is kind of put in their mission statement, "To cultivate an ecosystem in which next-generation open-source analysis tools for ocean, atmospheric, and climate sciences can be developed, distributed, and sustained. These have to be scalable and these solutions should leverage the existing expertise outside of the geoscience community." I think if I could replace geoscience with neuroscience we would fit into this space for what we want out of this infrastructure.
The Open Microscopy Environment is another community that has been doing some of this, specifically in the microscopy space as it has transitioned between 2D images of things to nD images of objects over time. And many of these communities are dealing with these issues of cost, access, compute, derivatives, standards, and governance. So when we are talking about infrastructure, we have to talk about a lot of this. So with this I'm going to pass this on to Harry to cover some of the questions and considerations. And Harry, I'll move the slides.
HARRY HAROUTUNIAN: Great. Yeah. So these were among some of the questions that we felt were relevant to what we've been discussing in the last 48 hours or so. Given that we expect, from that first slide that Satra showed, near zettabytes and near exabytes worth of data, it certainly boggles my mind as to how we'll store and access the data. But it's not just storage of the data, right? It's the transport of that data to where we want it to eventually reside. And then we come to the processing of that data. How does it get harmonized? How does it get quality controlled? How does it get sanitized all within that space? And then what are the search tools that will be available to the community? I'm essentially thinking that labs are producing the data, but the user base is vastly greater than just the producing labs. How does that data get archived? And in archiving, can we afford to lose some of the data? In what format do we store the data? Do we store it in its original, unadulterated form, or do we store it after preprocessing? And that, of course, puts us into the realm of standardizing formats and cross-platform compatibility. For example, those of us in the digital pathology realm are dealing with different people producing similar data, but on different platforms that don't necessarily talk to each other. Can we move to the next slide?
SATRAJIT GHOSH: Yes. I just did. Can people see it? I'm sorry.
HARRY HAROUTUNIAN: Yep. So the data formats are going to be coming from anatomical studies, from connectomics studies that David was talking about. And hopefully, those will be reduced to actual cellular level connectomics. And then we have the entire universe of -omics, from proteomics, to transcriptomics, to genomics, to metabolomics. And then we add on top of that data coming from mice and rats and nonhuman primates and humans, and, as we've heard, from hundreds of brains in total, with different levels of fidelity. Given that all of this data is going to be converging on an infrastructure that we don't have yet, can there be common annotation standards? How will the user base access the data? And who's going to pay for it? The users, the producers, a little bit of both, or can there be public-private partnerships with some of the, hopefully, bigger players, the cloud-based entities like Google or AWS or Microsoft or whatever? So those are the questions that we'd throw open. Next slide, I think, yep. So where does the data reside? With individual labs or at a central location? Who can access the data, and how? Where will the metadata reside? What level of granularity will that metadata have? How many brains?
We're potentially within the whole BICAN ecosystem talking about hundreds. How do we scale up from one to hundreds? What's a realistic throughput and what's a realistic time scale? And what kind of infrastructure can support these approaches? I think we've already discussed who the audience is, and it seems like the audience is the entire neuroscience and clinical community. And something that's so close and dear to my heart is how do we archive? How much of the data is in hot storage? How much of it is in cold storage? Do we ever discard data and stick to distilled data, or do we plan on long-term and reanalysis of the raw data? So we'll just open it up for discussion by the panelists.
SATRAJIT GHOSH: And I'm going to stop sharing. I'll post each of these questions one at a time in the chat. And hopefully, there are many people among the panelists who have developed various kinds of infrastructures and have dealt with many of these questions. We would love to hear from you about your experience and the kinds of things that you bring to each of these questions. So with that, I'll stop sharing so that everybody can see each other. Jonathan, you have your hand up.
JONATHAN SILVERSTEIN: Yeah. I wanted to just show an example in the HuBMAP community just for a couple of minutes to sort of continue.
SATRAJIT GHOSH: Okay. Fantastic.
JONATHAN SILVERSTEIN: Yeah. So let me do that. I'm going to just share the model by which this is done because it is far from the solution for all these things, but it is a solution for many of these things. And this is a typical architectural structure with the resources, the APIs, and applications that HuBMAP has developed over time. That's a hybrid cloud approach. The blue things are at the Pittsburgh Supercomputing Center, very scalable and on prem to avoid things like egress charges, storage charges, these things, on top of some computing costs that are highly utilized. The pink are subscriptions to Globus so we can get high-performance data transfer, security, and some other things at service cost without rebuilding them all ourselves, using institutional IDs to manage all those things. We have no identities to manage. And then the orange ones are actually in AWS as high-availability types of assets. So you can see, for example, the file storage and compute resources are raw stores on the high-performance computing centers with APIs that allow data ingest and data assets that then get utilized by all of these other different tools, whether they're other APIs, such as assigning UUIDs to things, searching amongst the graph database of provenance, searching within those for particular metadata features, or other ontology features that feed into those things. So the blue is this mix between the high-performance local, hybrid assets. And the orange are the high-availability things like the portal, like the test visualization off of these assets.
Azimuth is a tool that essentially distills data, if you will. It recalculates cell types in certain tissue types, and it provides those additional data back to the assets. And then all of these calls of everything run through a gateway that can be public-oriented or private so that we can have all of the consortium members doing all of these things, visualizing data, moving data, doing all these things behind their own consortium space, and then a process to declare things are available for publication. So that way when it's approved for publication out to the community, it's just flipping a switch so that the entire thing, from data through processed data through visualization, is now in the publicly accessible space rather than in the private space, so it can be tuned up, optimized and so forth. So I just wanted to show this. This is operating now for HuBMAP, with a similar model for the Cellular Senescence Network. The data scales are not at the scales that VERN needs, but. VERN, oh my gosh, I just threw back 20 years for those of you who didn't know that. But that's about the last time I was involved in this particular community. Wow, that's an interesting neural connection that just happened there.
But in any case, for this BICAN community, this type of approach absolutely scales to these very large things, assuming that you do some of the things that were mentioned earlier, like blocks of data that can be accessed rather than entire files. So that's all it wants, maybe petabytes in size. Notably also, the computing and processing components sit in the same infrastructure as the data. And so this does keep those together, whereas the high-availability services that have to move data out, or the visualization things, are run in the cloud so we don't have egress charges. So things burst into the cloud and then they vaporize, as opposed to going the other direction. So for those of you that are familiar with these kinds of things, what I've said will give you a lot of insight. For those not, it's a big diagram with a bunch of lines and arrows and I apologize for that. But hopefully, this gives you a sense of the approach that we've taken to a number of these different substantial challenges that were mentioned.
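The idea of accessing blocks of data rather than entire files can be illustrated with a little index arithmetic. This is a minimal sketch, not a description of how HuBMAP stores images: the volume size, the chunk size, and the function name `chunks_for_region` are all hypothetical. It only shows that a small cutout touches a tiny fraction of the chunks in a very large chunked volume.

```python
from itertools import product
from math import prod


def chunks_for_region(start, stop, chunk_shape):
    """Return grid indices of every chunk overlapping the half-open box [start, stop)."""
    ranges = [
        range(lo // c, (hi - 1) // c + 1)
        for lo, hi, c in zip(start, stop, chunk_shape)
    ]
    return list(product(*ranges))


# Hypothetical volume of 100,000^3 voxels stored as 128^3 chunks.
volume_shape = (100_000, 100_000, 100_000)
chunk_shape = (128, 128, 128)

# A reader wants a 512^3 cutout that is not aligned to the chunk grid.
start = (40_000, 52_100, 7_000)
stop = tuple(s + 512 for s in start)

needed = chunks_for_region(start, stop, chunk_shape)
# Ceiling division per axis gives the number of chunks along that axis.
total_chunks = prod(-(-v // c) for v, c in zip(volume_shape, chunk_shape))

print(f"chunks to fetch: {len(needed)} of {total_chunks:,}")
```

Only the chunks listed in `needed` have to be read or transferred, which is what makes remote visualization and partial analysis feasible on volumes of this size.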
HARRY HAROUTUNIAN: So Jonathan, in the BICAN ecosystem, does this suggest that the blue will reside with each of the data generators?
JONATHAN SILVERSTEIN: So that's a great question. One of the questions that we did not have to address was having individual data sets so large that they wouldn't be able to be moved centrally. So we have none that are so large at the generation site that they can't be moved. And so this blue here, file store and compute resources, are at Pittsburgh Supercomputing Center and are centralized. From there, people are brought in to use the computing and the data. And then, of course, there are many applications to pull data out and utilize it in other formats or even just download data for local use for smaller sets of things. That's actually probably the common use case: people pulling data from it. But they don't have egress charges because it's not in the cloud. Right? And so this is fairly significant and allows people to go into whatever platform they choose. So this is a centralized model. The data sets are of modest size. They're of tiny size relative to some of the things that have been discussed today. But they are being brought in centrally.
ALEX ROPELEWSKI: Well, and this is one thing. This is Alex. I want to bring up too. I mean, the one thing—
JONATHAN SILVERSTEIN: Although, it wouldn't be necessary. It wouldn't be necessary. Go ahead, Alex. Go ahead.
ALEX ROPELEWSKI: Yeah, the one thing I want to point out is that both the BIL data and the HuBMAP data literally reside in the same machine room. So there's a great opportunity through what we're doing to make data available through sort of a similar system to the one we've developed for HuBMAP. But the BIL data system came before HuBMAP, and we took a lot of the experience we had with that, and with Jonathan, he kind of supercharged some of those ideas into a more robust and comprehensive infrastructure. The other point that I want to make that I like about the HuBMAP model is that the high-availability services are in the cloud. And if we think about what we've discussed a lot this week, it's things like search. That's something that people are going to want to do regardless of where they are. It's one of the first things they're going to do, to be able to find the data they want to deal with. So those types of resources are in the cloud. So there's high availability and people can access them rapidly. We've also talked about abstract representations of data. For example, taking image data, representing it as a point cloud, as Elizabeth likes to mention. That operation in and of itself reduces the amount of data that we would have dramatically. And that may make financial sense to put that data in the cloud rather than the full 300 petabytes of data. Storing data locally on tape, particularly if it's not used very often, is very cost effective as well. So if the data is not being used frequently, being able to move it off to some sort of cold storage or tape really does save a lot of money in the long run.
And there's also been some discussion going on in the chat about what we do with the raw data. Do we throw it away? Do we keep it? I think, perhaps, a consensus would be that the appropriate place for raw data, however we define it, is low-cost storage. It could be slightly processed. It could be the data off the microscope, which I think is a decision that the community needs to make. It can be stored on low-cost storage and brought back to analyze whenever people need to do that.
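The rule of thumb described above, moving rarely used data to cold storage or tape, can be sketched as a simple access-age policy. This is an illustration under stated assumptions: the tier names, the 30-day and 365-day thresholds, and the `choose_tier` function are hypothetical and do not describe how BIL or HuBMAP actually place data.

```python
from datetime import date

# Hypothetical thresholds; a real archive would tune these from its own access logs.
HOT_DAYS = 30
WARM_DAYS = 365


def choose_tier(last_access, today):
    """Pick a storage tier from the number of days since the data set was last read."""
    idle_days = (today - last_access).days
    if idle_days <= HOT_DAYS:
        return "hot"
    if idle_days <= WARM_DAYS:
        return "warm"
    return "tape"


today = date(2022, 6, 1)
# Hypothetical data sets mapped to their last access dates.
datasets = {
    "derived-point-cloud": date(2022, 5, 25),
    "stitched-volume": date(2021, 11, 2),
    "raw-microscope-tiles": date(2020, 3, 14),
}
for name, last_access in datasets.items():
    print(f"{name}: {choose_tier(last_access, today)}")
```

In practice the decision would also weigh retrieval latency and the cost of bringing data back from tape, which this sketch leaves out.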
SATRAJIT GHOSH: Thanks, Alex. Brock, you have your hand up.
BROCK WESTER: Yeah, yeah. I just wanted to, in a way, build off what Alex was saying with regard to what's usable for the community. So it looked like, I don't know. Jonathan, do you want to put that diagram back up? It's a good example to kind of facilitate some of the discussion. But it looks like you're materializing resources. You have a Neo4j instance. You have a SQL database. Those are the things that the users are going to want to hit for running their queries and searches. And we are exploring kind of a similar setup for the archive that we oversee.
SATRAJIT GHOSH: Go ahead.
BROCK WESTER: Yeah, yeah. I mean, again, we're looking at ways to make data more FAIR in a way. But I'm curious: since this has been a resource that's been up for a while, have you all been tracking activity at these various nodes? Because that would be instructive. It would give us insights into what would be necessary for availability. And obviously, you all have thoughtfully put certain things in the cloud for high availability and certain things in storage based on costs. But user patterns, we track things. Our archive tracks things, and we do proactively move things around if needed. There's some intelligence inherent in some of these cloud ecosystems. But I don't know. I think it would be very valuable to take a look at what you all have been tracking through the system.
JONATHAN SILVERSTEIN: No. I think that those are really terrific thoughts and demonstrate some understanding of what I was getting at. Let me respond and point out two small things. One is so the Neo4j instance manages the ontology database as well as the entities which is Provenance, following the W3C Prov standard. So entities have actions that generate other entities and so forth. And that's all managed in Neo4j because it's a graph of untold complexity rather than a relational system. In fact, all the user access, and in fact, all the portal access is through APIs. So we don't allow anything above here to get at these resources. So we've formalized that and insisted on that from the beginning. There may be some exceptions. I mean, it depends whether you— this is an API but it has some privileged access. There's some privileged things that the system administrators and things have to do, of course, but nearly everything is via APIs. All applications are through APIs, and that's why you can see there's nothing crossing. But that means that all of that counting that you're talking about has a great place to do all the counting which is in the APIs. Unfortunately, and this is drawn a little bit funny because things come in through the gateway. They go through the APIs and they go back outside. In building this diagram, we could never figure out how to address that exact issue in a diagrammatic way, so it's a little bit broken but the key to the— is a gateway that has tokens that are recognized by Globus Auth. So every single one of these is actually going through the exact same gateway.
Unfortunately, because we built the gateway early and we had to make some decisions to get a portal up of this complexity within a year and a half, we said we'll get to some of the counting stuff later. When we got to some of the counting stuff we decided we wanted to use much more robust features, the API gateways from AWS, and switched over to that. So we've lost some time and opportunity to do all that counting that you've described. But I think in months ahead we'll have months worth of it on a slightly more mature system with an updater that people are using it. So we will have all those numbers. We don't have them now.
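The provenance model described above, in which entities are produced by actions that in turn use other entities, can be sketched with two small mappings. This is a toy illustration only: it borrows the W3C PROV relation names `wasGeneratedBy` and `used` as variable names, keeps everything in Python dictionaries rather than Neo4j, and all identifiers (`sample-001`, `imaging-run-1`, and so on) are hypothetical.

```python
from collections import defaultdict

# entity -> the activity that generated it (PROV relation: wasGeneratedBy)
was_generated_by = {}
# activity -> the entities it consumed (PROV relation: used)
used = defaultdict(list)


def record(activity, inputs, outputs):
    """Register one activity together with its input and output entities."""
    used[activity].extend(inputs)
    for entity in outputs:
        was_generated_by[entity] = activity


def lineage(entity):
    """Walk back through generating activities to collect every upstream entity."""
    seen, stack = [], [entity]
    while stack:
        current = stack.pop()
        activity = was_generated_by.get(current)
        if activity is None:
            continue  # a root entity, such as the registered sample
        for parent in used[activity]:
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen


record("imaging-run-1", ["sample-001"], ["raw-image-001"])
record("segmentation-1", ["raw-image-001"], ["cell-mask-001"])
record("cell-typing-1", ["cell-mask-001", "raw-image-001"], ["cell-types-001"])

print(lineage("cell-types-001"))
```

A graph database makes the same traversal a single query and keeps it efficient as the number of entities and activities grows.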
SATRAJIT GHOSH: So maybe I can piggyback on that slide again, Jonathan. In terms of APIs, I think we heard earlier today from Simon about Earth Engine and how people can operate on these data through APIs. And that's a bit of a change in, I think, the neuroscientific space of where people want to get into a computer and do things on a computer with data as opposed to running things through an API. And part of the challenge there, and this is why I want to open this up for discussion over here, is what sits behind the API in terms of processing, is where the community is still evolving on all kinds of fronts as we heard in the last session. Can you bring, from your experience, how your communities are dealing with the changes of processing, let's say, in this vertical line that you show, how it identifies cells—
JONATHAN SILVERSTEIN: Yes.
SATRAJIT GHOSH: —and how those processes are brought together to the broader community accessing it through the API?
JONATHAN SILVERSTEIN: Yeah. That's an excellent question. So it gets to this whole pipeline management and Airflow that we're using. And there's some mention of Nextflow and some other things in some of these. So I think there's two ways to look at that, and I'd love to have Alex weigh in as well because he's much closer to some of these particular pieces that you're mentioning. But one short answer is, to the extent that things can be turned into batch operations, you have much greater advantages. So if you can put things in workflow languages, CWL, for example, and then into Airflow, you can create arbitrary complexity. One thing I didn't mention is that, in fact, every one of these things, with the exception of the file store itself and the Globus subscriptions, is in Docker containers. So we actually spin up an orchestration of Docker containers to run the entire infrastructure, local and in cloud. That gives you great capabilities. So you can build a workflow in a series of Docker containers, construct them in CWL, and the whole thing will run end to end for whatever you can. So if you can do things in batch mode, you sort of have no limits because you can put these things that are running batches against your APIs in front, instead of behind, as you mentioned, if you think about that in front and behind you were mentioning. So that's unbelievably powerful.
In interactive mode, you're doing different kinds of things in general, more visualization, more utilizing distilled data, these other kinds of things. Although, you can still do things like an Azimuth calculation on the fly, with tiny amounts of data, right? So I think there's this balance of the batch versus the immediate that is addressed in here, and that is workable. And it's nothing specific to the fact that it's a lot of genomic data. And there's a lot of imaging data here as well. They're all being processed through these CWL workflows using things like image pyramids to advantage them in the visualizations and so forth. And Alex, do you want to say more about that? He's getting into an area that gets slightly out of my...
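The batch idea described above, steps declared with their dependencies and then run end to end, can be sketched with the standard library alone. This is not CWL or Airflow and does not reflect HuBMAP's actual pipelines: the step names, the toy computations, and the `run_batch` function are hypothetical, and in a real system each step would be a containerized tool rather than a Python lambda.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each hypothetical step reads earlier results and returns its own output.
steps = {
    "ingest": lambda r: {"tiles": 4},
    "pyramid": lambda r: {"levels": r["ingest"]["tiles"] // 2},
    "segment": lambda r: {"cells": r["ingest"]["tiles"] * 10},
    "publish": lambda r: {"summary": (r["pyramid"]["levels"], r["segment"]["cells"])},
}

# step -> the set of steps that must finish before it can start
dependencies = {
    "ingest": set(),
    "pyramid": {"ingest"},
    "segment": {"ingest"},
    "publish": {"pyramid", "segment"},
}


def run_batch(steps, dependencies):
    """Run every step exactly once, in an order that respects the dependencies."""
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = steps[name](results)
    return results


results = run_batch(steps, dependencies)
print(results["publish"])
```

Because the order is derived from the declared dependencies, steps can be added or swapped without rewriting the runner, which is the property that workflow languages provide at much larger scale.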
ALEX ROPELEWSKI: Yeah. The only other thing I would add on to that is our experience using workflow language, CWL, and containerized workflows has really enabled us to democratize what the workflows are. Get the same workflows that run on local infrastructure, access the data directly, also run in AWS, and nothing needs to be changed in order to enable that. So the way that the tools are packaged is really, I think, the way to go forward with any of these systems. What's a little bit unclear is enabling third-party sort of plugins. There's been some people who've kind of asked us can we get to the point where they can basically just give us the container, and we can run it through our system? So I guess the answer there is more we're closer to getting that to work on AWS. There are some security issues that we still need to resolve to enable that locally.
JONATHAN SILVERSTEIN: Yeah. And I would add to that. Alex, you really stated very well the democratization. And one of the reasons we ended up with something like this was because the team that was awarded to run HuBMAP was five different teams and five different awards. And we had to figure out how to work together and work on Azimuth versus work on the portal versus work on infrastructure in totally different teams at different universities. And so we converged on this packaging that he's described. The packaging is a little bit of an additional effort. It's gotten much easier. Each of the teams has been able to do it. And it is a step that gets you one additional step from just picking up a tool and using it right off. We are building environments. There is an SDK for this to work in more of a Jupyter Notebook/Python R type of framework. All of those things are coming and being added in here in the appropriate places, but we're not completely there yet. I mean, we are there. They are produced. They do things, but the number of things they do doesn't really apply to the question of doing your full calculation computation at the level that we've been talking about the last couple days.
SATRAJIT GHOSH: Thank you, Jonathan. Matt?
MATT MCCORMICK: Just thinking through this very valuable data that's going to be acquired over many years, and also it's going to be hard to store, it reminds me of the Library of Alexandria and how valuable those books were, and they were stored in one location. And then, of course, they had the fire, and we lost a lot of human knowledge. So what can we do here to maybe prevent losing this valuable data? And I'm wondering if there's any thoughts on how feasible it would be to have redundant storage of this data?
JONATHAN SILVERSTEIN: This is all redundant. And actually, the tooling's all redundant, too. It can be run local or a cloud. It's a hybrid of microservices. You can choose to put them wherever you want. So you can put them— I mean, it's not perfectly easy to rebuild things exactly the same way in both, and so you make some changes. But redundancy and distribution's extremely high in this particular model. That's one of the reasons to use it.
HARRY HAROUTUNIAN: Well, is that also applicable to cold storage, the longevity of data on tape?
JONATHAN SILVERSTEIN: Well, the cold storage that we're using, in this case, is Deep Glacier.
ALEX ROPELEWSKI: Yeah. And I can also say that tape in general, you're talking a 20-year lifespan per tape, at least that tends to be what's quoted as a usable life period. And you can also, if you want to get redundancy and, say, one that's— a tape storage that's hot and a tape storage that's completely somewhere else, you can do that as well. It's just basically twice the cost.
SATRAJIT GHOSH: So maybe I can piggyback on both Matt's questions and some of the chat that's going on and kind of thinking about— this spans the space from samples to data in this particular context. And I know we're talking about the digital infrastructure over here. If somebody wants to chime in on the sample infrastructure for the tissue sample infrastructure, feel free to kind of talk about that side of storage and redundancy there. On the digital side, perhaps Matt was referring to other decentralized solutions that might come into play. And I wanted to kind of ask where some of those elements could play to provide both redundancy and scalability simultaneously as opposed to centralized storage solutions.
JONATHAN SILVERSTEIN: Yeah, I think the main point is that centralized is a decision to make everything work, but they can live anywhere. And so one of the questions was at the beginning, do we even want to move all the data centrally? This model can be done with the pointers of the files to anywhere. You just have to have their access through some kind of asset API that pulls them from there. So I think part of the whole hybrid microservices architecture that we went after, it's not our invention, we just were disciplined about implementing it in the way that many for-profit businesses do in terms of all of their humongous assets. We were just very disciplined to do it with those rules and basis. There's books written on microservices architectures. Not so many that are hybrid cloud, but that's just an additional distribution framework. So in the digital one, I think it's highly scalable, highly distributable, and you can make choices wherever you are to mirror things and do other stuff. So this kind of approach enables that.
I will just say for HuBMAP, and I'll try and get out of the way because there's lots of other panelists, we don't do anything in terms of specimens. So the HIVE in HuBMAP leaves all the specimens at the sites where they're working. We've done nothing for a LIMS system. We require everything that's going to come in to be a data set with the full provenance from the sample. And the very first thing we have them do is get a digital object identity for that sample in our central system. And they're fully responsible for how that corresponds to their local LIMS system and where the stuff actually is. And HuBMAP itself doesn't address any of it. There are plenty, plenty of full archives, of course, that do, it's just outside of scope for HuBMAP.
SATRAJIT GHOSH: Thank you. I'm going to move on a little bit to the realistic throughput question. What is the expectation in this community at how fast data will be generated over the next five years? Are we talking 20 petabytes a year? Are we talking 5 petabytes growing up to 100 petabytes a year?
HANCHUAN PENG: Yeah. That's a great question. I think for the next five years, definitely, we'll be exceeding 20 petabytes, yeah.
SATRAJIT GHOSH: A year, or over five years?
HANCHUAN PENG: Five years, probably in total. That could be a lower bar, okay? I think I'm a little bit conservative, okay? I mean, probably 20 petabytes of meaningful data. But if you count just the raw data, with a lot of redundancy, that will probably exceed 100. I say this because I'm aware of some of the very large-scale projects. I know they're very, very large amounts of data. So 20 petabytes, I think, for five years is relatively conservative actually. Yeah.
WILLIAM YANG: Yeah, actually, I also agree with Hanchuan. I think the technology that I just mentioned, I think a lot of labs are doing that. They are completely scalable and democratizable. So I think each brain is about 7 terabytes of raw data, and you can just readily scale to many different brains or different disease or developmental stage, so I think there will be a lot of image and related data, at least from the mouse side.
SATRAJIT GHOSH: And Elizabeth and others, do you want to talk about the human side in terms of potential throughput?
Oh, if you're speaking, we can't hear you.
ELIZABETH HILLMAN: I'm not. I'm having such fun in the chat. I wasn't sure if we were talking about computational stuff or physical samples, which I just threw in there just to be annoying, but since we're here and it's almost 4 o'clock.
SATRAJIT GHOSH: We were just talking about the expected throughput of data. I mean, yes, at the end of five years we might get to about 200 petabytes of raw data, but what is the realistic throughput because that might influence the way the infrastructure is created to serve that process and stuff?
ELIZABETH HILLMAN: I mean, if we were to do it, we would start with a relatively small number and ramp up, right, because you never buy a computer that you need next year, this year, right, because you'll get more RAM and more memory next year, right? So I think ramping up and learning as we go and being ready to take advantage of new technologies as they come along with the hope that five years from now it's possible to process five times more data in a day than it is to do it this year.
HANCHUAN PENG: Yeah. There's one thing I'd like to talk a little bit about, if people don't mind. We, I mean the community, have been really thinking about generating a lot of data, but I think it's equally important to also think about who the actual meaningful consumer of the data is, right? So we need to have kind of quality scientists who can actually consume the data and make sense of the data, not just, say, design some sort of machine and then think that the machine consumes the data. We actually need humans to consume the data. I'm kind of confused about the purpose of why we're actually doing all this science and the big science, okay? So it will be nice to have some discussion about that as well.
SATRAJIT GHOSH: So related to that part, maybe I can, again with the infrastructure bent of this panel, toss in the notion that we've already heard about different standards and formats of data storage across the sessions. We've also talked about different ways of processing the data. What can infrastructure do to help consolidate that space? Is there a way, whether infrastructure-wise or consortium-wise, where we can get a pathway or a principled way to agree on some of the advantages and disadvantages of standards and processing? Because I think, Hanchuan, to the point that you raised about making data useful to the scientists, we would also need some agreement on the processing or the outputs of it in certain ways to make it useful. Is there a pathway towards doing that through the lens of infrastructure? Could that help in some ways?
HANCHUAN PENG: Yeah.
WILLIAM YANG: Let me add a corollary to that. If we build it, will they come?
HANCHUAN PENG: Yeah. I think absolutely. So, yeah. Better infrastructure definitely makes the data more accessible, right? That makes it much easier for people, first, to see the data, and then to try to understand the data, understand the scope and implication of the data, and, yeah, make sense of the data. But that's still not about the actual consumer, okay? My point is that we should also have a meaningful training program, okay, for this particular initiative, something to train qualitative scientists instead of just jumping on the data. That is probably what I really want to say, yeah. But I agree with you that the infrastructure definitely is very important. I myself am an infrastructure builder, so I cannot overemphasize it, okay? I'm completely with you.
ALEX ROPELEWSKI: Well, I think one of the important things, if we think about making the data more accessible, is that it really goes back to the representation that we use for the data. For example, if we take this point cloud representation, we could have all that information for a brain in a spreadsheet format or strip it into JSON, and that would be fine. For other queries, one would need to have different types of abstract representations for the data to be able to search against it. So if we have a better understanding, for the particular data types, of how people would need the data organized to answer their questions, that would be helpful. And maybe I'm being a little too generic here.
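As a rough illustration of the point cloud idea raised here, the sketch below stores a few detected cells as flat records and writes them out both as a spreadsheet-style CSV and as JSON. The field names (`x_um`, `y_um`, `z_um`, `channel`), the coordinate values, and the `in_box` helper are all hypothetical and were not specified in the discussion; this only shows that a flat table handles a simple spatial query, while other kinds of query would need a different representation.

```python
import csv
import io
import json

# Hypothetical point cloud: one record per detected cell.
cells = [
    {"x_um": 120.5, "y_um": 88.0, "z_um": 40.0, "channel": "ch1"},
    {"x_um": 300.0, "y_um": 150.0, "z_um": 42.5, "channel": "ch2"},
    {"x_um": 125.0, "y_um": 90.0, "z_um": 41.0, "channel": "ch1"},
]


def to_csv(rows):
    """Serialize the records as CSV text (spreadsheet format)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Serialize the same records as a JSON array."""
    return json.dumps(rows)


def in_box(rows, lo, hi):
    """Return the records whose coordinates fall inside a bounding box."""
    return [r for r in rows if all(lo[k] <= r[k] <= hi[k] for k in lo)]


csv_text = to_csv(cells)
json_text = to_json(cells)
hits = in_box(
    cells,
    lo={"x_um": 100, "y_um": 80, "z_um": 30},
    hi={"x_um": 200, "y_um": 100, "z_um": 50},
)
print(len(hits), "cells in the box")
```

A linear scan like `in_box` is fine for a sketch; at whole-brain scale a spatial index or a database would presumably be needed, which is part of the representation question being raised.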
BROCK WESTER: Well, yeah. That's why I was asking about tracking of activity to see what would be most useful. And a lot of us could probably rattle off a lot of potential solutions that could address some cost concerns or performance concerns but might make certain scientists grumpy. If you're only using lossy compression for the image data that's available to users, and you put all the losslessly compressed image data in Glacier and don't really give it to anyone or provide access, is that going to serve a need for a number of users? And if you make both available and see who is accessing what based on performance (the highly available lossy compression versus the low-availability, high-latency lossless compression), some people might find ways to do their science based on what's made available. And we'd want the community to weigh in on these implementations and see what happens over the course of this program and over CONNECTS, which our archive is predominantly focused on. But the user activity tracking, I think, to your point, is going to be key. And we should pivot based on what happens and then, to the chat, maintain these working groups to continue weighing in as these things evolve.
SATRAJIT GHOSH: Elizabeth?
ELIZABETH HILLMAN: I'll just say, and I said it all the way at the beginning when we were younger than we are now, feature extraction I think is better than sharing compressed, not very good, or down-sampled data. This is why I really think it's important that we consider what those features are and what we may be able to do with the data, because it actually also dictates how we collect the data in the first place. And so, yeah, this sort of point cloud thing has come up here and there, but the idea is sharing vectorized, machine-readable, quantitative data as a first pass to as many people as want it. They can actually use that data to refine, "Okay, well, I'd like to take a closer look at this brain region in these channels," and then you can order that data up and work with it. As for the whole down-sampled version of the data, I said this earlier as well, right: the nucleus is a feature. It's like saying I want to find everybody's faces. But if I blur the screen to the point I can't tell that there are faces, I'm lost. You can have nuclear density, but that's it. You can no longer start to say cell type or anything. And so you might render data at a lower resolution, but the information is massively lost compared to, as I said, feature extraction, where you can really do things that are meaningful with that data. And it's small enough to download to your own server or even your own hard drive and work with it, which makes it much more widely accessible than something where you'd have to log on and work on a server, or own your own multi-petabyte repository, to download the data and work with it locally.
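The contrast drawn here between feature extraction and down-sampling can be shown with a toy example. This is a sketch under stated assumptions, not a description of any pipeline discussed at the workshop: the 1D intensity profile, the half-maximum threshold, and the block-averaging factor of 4 are all made up for illustration, and the bright samples stand in for nuclei.

```python
def find_peaks(signal):
    """Indices of local maxima brighter than half the signal's maximum."""
    threshold = max(signal) / 2
    peaks = []
    for i, v in enumerate(signal):
        left = signal[i - 1] if i > 0 else float("-inf")
        right = signal[i + 1] if i < len(signal) - 1 else float("-inf")
        if v > threshold and v > left and v >= right:
            peaks.append(i)
    return peaks


def downsample(signal, factor):
    """Block-average the signal, keeping one value per `factor` samples."""
    return [
        sum(signal[i:i + factor]) / factor
        for i in range(0, len(signal), factor)
    ]


# Toy intensity profile with four bright "nuclei".
profile = [0, 0, 9, 0, 0, 9, 0, 0, 0, 0, 0, 0, 9, 0, 9, 0]

# Feature extraction at full resolution: a short list of positions.
features = find_peaks(profile)

# Down-sampling first, then trying to find the same features.
coarse = downsample(profile, 4)
coarse_peaks = find_peaks(coarse)

print("full-resolution features:", features)
print("features found after down-sampling:", coarse_peaks)
```

In this toy case the extracted feature list keeps all four positions in a handful of numbers, while the down-sampled profile retains a rough density and only one detectable peak.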
SATRAJIT GHOSH: Thank you, Elizabeth. I know we're coming up on time. There is still a pending question in the chat around data discarding, and so I don't know if you'll answer that question today. But I think we first need data to know what data exists before we think about what data to discard, and at this stage of at least the current generation of human data, we don't have a lot of data to look at even right this minute to say what we would like to discard. I think some of those plans can be put into place. For at least human data, we are in a very data-starved space right this minute. So there I would say let's wait and get some data to figure out some of the plans. I know many of the people who run archives and other things have plans based on access to datasets. But I'm not sure we have a clear plan on how long to keep data at this point in time. I saw a hand up there.
HARRY HAROUTUNIAN: That goes to Brock’s points, right, that with usage data, those kinds of decisions will become simpler?
SATRAJIT GHOSH: Right. We're coming up on the hour. I know it's been a long two days of lots of bits and bytes, a high bitrate of information, not bits and bites necessarily. So I want to hand it over to Yong in terms of wrapping up the workshop.
YONG YAO: Yeah, so this is a great workshop and thank you a lot for participating in the discussion.